Our HDF5 Format Explained

HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.

Internally, we make use of HDF5 because its hierarchical structure allows for efficient access to data attributes. It also allows for storing data in a flexible way, which is paramount given the wide variety of file formats currently in use across the machine learning community. We store two different types of files in HDF5 format: the dataset itself and the split file. More details on this in the sections below.

Please also have a look at the example page to see how you can use the files you download from this site. An example file is available for download as well.
You can download our converter tool from mloss.org and convert data files locally to check whether everything is transformed appropriately, and perhaps even to help debug problems.

HDF5 Attributes and Datasets

The basic abstraction is that a data set is a large collection of objects having the same type. Each object is a fixed length array of features which may be of different types. For example, each object can be an array of mixed categorical and numerical data, but all objects in the dataset have the same mix.

HDF5 attributes on the root level

  • mldata: Internal mldata version; Integer, currently 0
  • name: Name to briefly describe the dataset; String
  • comment: Any extended comment or description of the dataset; String
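
For illustration, a minimal h5py sketch of writing these root-level attributes (the filename and values here are hypothetical):

    import h5py

    with h5py.File('example.h5', 'w') as f:
        f.attrs['mldata'] = 0                       # internal mldata version
        f.attrs['name'] = 'my-dataset'              # brief name of the dataset
        f.attrs['comment'] = 'An example dataset.'  # extended description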

HDF5 group 'data_descr'

  • names: Dataset with names of the variables; Array of Strings
  • ordering: Dataset with ordering of the variables in the original dataset, using short names for each variable; Array of Strings
  • types: Optional dataset (present if the source was an arff file) denoting the data types used. Each entry is one of 'numeric', 'nominal:<value1>,<value2>,...' or 'string'; Array of Strings
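
Continuing the sketch above, the 'data_descr' group might be written like this (the variable names and types are hypothetical):

    import h5py

    with h5py.File('example.h5', 'a') as f:
        descr = f.create_group('data_descr')
        str_dt = h5py.string_dtype()  # variable-length strings
        descr.create_dataset('names', data=['age', 'income', 'label'], dtype=str_dt)
        descr.create_dataset('ordering', data=['age', 'income', 'label'], dtype=str_dt)
        # Only present if the source was an arff file:
        descr.create_dataset('types', data=['numeric', 'numeric', 'nominal:yes,no'],
                             dtype=str_dt)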

HDF5 group 'data'

EITHER single sparse matrix when all attribute types are the same

  • data: The variables themselves; Array of Objects
  • indices: The indices of non-zero values; Array of Integers
  • indptr: The index pointers for the variables' columns; Array of Integers

This represents the Compressed Sparse Column format described at scipy.org and Wikipedia.
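
Under that assumption, the sparse matrix can be reconstructed with scipy (a sketch; 'example.h5' is a hypothetical filename):

    import h5py
    from scipy.sparse import csc_matrix

    with h5py.File('example.h5', 'r') as f:
        g = f['data']
        # data, indices and indptr map directly onto scipy's CSC constructor;
        # the shape is inferred from indptr and the maximum row index.
        X = csc_matrix((g['data'][:], g['indices'][:], g['indptr'][:]))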

OR single dense matrix when all attribute types are the same

  • data: The variables themselves; Array of Objects

OR multiple datasets when variable types are mixed

  • <shortname>: Variable of given name; Array of Objects
  • OR
  • <type><no>: Variable of given type; Array of Objects

Where possible, short names are used as dataset names for better recognition; otherwise the variable type (int, double, str) is used. In the latter case, the datasets are also numbered to ensure uniqueness.
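
A hypothetical mixed-type layout, sketched with h5py (names and values are invented for illustration):

    import h5py
    import numpy as np

    with h5py.File('example.h5', 'a') as f:
        data = f.create_group('data')
        # A short name is available for this variable:
        data.create_dataset('age', data=np.array([23, 41, 35]))
        # No usable short names: fall back to <type><no>, numbered for uniqueness.
        data.create_dataset('str0', data=['low', 'high', 'mid'],
                            dtype=h5py.string_dtype())
        data.create_dataset('double1', data=np.array([1.5, 0.3, 2.7]))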

OPTIONAL vector/matrix with natural labels

  • label: The natural labels split from the original data; Array of Integers/Floats

Natural labels are currently considered for files in LibSVM format.

Split files and Tasks

Note that the distinction between input/output and label/target depends on the TASK, not on the dataset itself! We do, however, have a mechanism in place to create split files automatically while slurping datasets from other repositories. The following datasets may be defined in split files:

HDF5 attributes on the root level

See above

HDF5 group 'task_descr' (derived from the Task object)

  • pub_date
  • version
  • slug
  • summary
  • description
  • urls
  • publications
  • is_public
  • is_deleted
  • is_current
  • user
  • downloads
  • hits
  • input
  • output
  • performance_measure
  • performance_ordering
  • type (name)
  • data (name)
  • data, held back (name)
  • license
  • tags

HDF5 group 'task'

  • train_idx: Vector or matrix containing the indices of the examples to be used for learning the model. If a matrix, each column specifies one particular split for the purposes of cross-validation (so k-fold cross-validation results in a matrix with k columns).
  • validation_idx: If it exists, it has the same structure as train_idx, containing the indices of the examples to be used for validation (selecting model order, etc.). If NULL, solutions are free to use whatever part of the training set for validation purposes.
  • test_idx: Contains the indices to be used for testing, in the same format as train_idx and validation_idx (never NULL). For tasks where everything is public, these data vectors in the data item look exactly like the training data vectors. For challenges with hidden answers, these data vectors have 'NaN' values for the target.
  • input_variables: List of indices of the variables to be used to predict the target(s).
  • output_variables: The index of the variable to be predicted from the features.
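
To show how these pieces fit together, here is a sketch of applying a split file to a dense dataset (both filenames are hypothetical, and a single split, i.e. a train_idx vector rather than a matrix, is assumed):

    import h5py

    with h5py.File('dataset.h5', 'r') as d, h5py.File('split.h5', 'r') as s:
        X = d['data/data'][:]              # dense data matrix
        task = s['task']
        train_idx = task['train_idx'][:]   # one column per fold if a matrix
        inputs = task['input_variables'][:]
        target = task['output_variables'][()]

        X_train = X[train_idx][:, inputs]  # features for learning
        y_train = X[train_idx][:, target]  # target to be predicted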

Supported formats

The website (and its converter tool) currently supports conversion from and to the following data formats:

to HDF5:
  • libsvm
  • arff
  • csv
  • matlab
  • octave

from HDF5:
  • libsvm
  • arff
  • csv
  • matlab
  • octave
  • xml
  • rdata

When uploading data, files can be compressed with gzip or bzip2, or packed as a single file in a zip archive or tarball. They will be decompressed automatically after upload.

You will find a Python implementation of a converter in the source tarball, under utils/hdf5conv (with scripts/hdf5conv.py as an example of how to use the converter). You can also download an example file.

Acknowledgements

This project is supported by PASCAL (Pattern Analysis, Statistical Modelling and Computational Learning), http://www.pascal-network.org/.