Our HDF5 Format Explained

HDF5 is a data model, library, and file format for storing and managing data. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable and is extensible, allowing applications to evolve in their use of HDF5. The HDF5 Technology suite includes tools and applications for managing, manipulating, viewing, and analyzing data in the HDF5 format.

Internally, we make use of HDF5 because its hierarchical structure allows for efficient access to data attributes. It also allows for storing data in a flexible way, which is paramount given the wide variety of file formats currently in use across the machine learning community. We store two different types of files in HDF5 format: the dataset itself and the split file. More details on this in the sections below.

Please also have a look at the example page to see how you can use the files you download from this site. An example file is available for download as well.
You can download our converter tool from mloss.org and convert data files locally to check whether everything is transformed appropriately, and perhaps even to help debug problems.

HDF5 Attributes and Datasets

The basic abstraction is that a data set is a large collection of objects having the same type. Each object is a fixed length array of features which may be of different types. For example, each object can be an array of mixed categorical and numerical data, but all objects in the dataset have the same mix.

HDF5 attributes on the root level

  • mldata: Internal mldata version; Integer, currently 0
  • name: Name to briefly describe the dataset; String
  • comment: Any extended comment or description of the dataset; String
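
For illustration, a minimal h5py sketch of writing these root-level attributes (the filename and values here are hypothetical):

    import h5py

    with h5py.File('example.h5', 'w') as f:
        f.attrs['mldata'] = 0                       # internal mldata version
        f.attrs['name'] = 'my-dataset'              # brief name of the dataset
        f.attrs['comment'] = 'An example dataset.'  # extended description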

HDF5 group 'data_descr'

  • names: Dataset with names of the variables; Array of Strings
  • ordering: Dataset with ordering of the variables in the original dataset, using short names for each variable; Array of Strings
  • types: Optional dataset (present if the source was an arff file) denoting the data types used. Each entry is one of 'numeric', 'nominal:<value1>,<value2>,...' or 'string'; Array of Strings
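
Continuing the sketch above, the 'data_descr' group might be written like this (the variable names and types are hypothetical):

    import h5py

    with h5py.File('example.h5', 'a') as f:
        descr = f.create_group('data_descr')
        str_dt = h5py.string_dtype()  # variable-length strings
        descr.create_dataset('names', data=['age', 'income', 'label'], dtype=str_dt)
        descr.create_dataset('ordering', data=['age', 'income', 'label'], dtype=str_dt)
        # Only present if the source was an arff file:
        descr.create_dataset('types', data=['numeric', 'numeric', 'nominal:yes,no'],
                             dtype=str_dt)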

HDF5 group 'data'

EITHER single sparse matrix when all attribute types are the same

  • data: The variables themselves; Array of Objects
  • indices: The indices of non-zero values; Array of Integers
  • indptr: The index pointers for the variables' columns; Array of Integers

This represents the Compressed Sparse Column format described at scipy.org and Wikipedia.
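
Under that assumption, the sparse matrix can be reconstructed with scipy (a sketch; 'example.h5' is a hypothetical filename):

    import h5py
    from scipy.sparse import csc_matrix

    with h5py.File('example.h5', 'r') as f:
        g = f['data']
        # data, indices and indptr map directly onto scipy's CSC constructor;
        # the shape is inferred from indptr and the maximum row index.
        X = csc_matrix((g['data'][:], g['indices'][:], g['indptr'][:]))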

OR single dense matrix when all attribute types are the same

  • data: The variables themselves; Array of Objects

OR multiple datasets when variable types are mixed

  • <shortname>: Variable of given name; Array of Objects
  • OR
  • <type><no>: Variable of given type; Array of Objects

Where possible, short names are used as dataset names for better recognition; otherwise the variable type (int, double, str) is used. In the latter case, the datasets are also numbered to ensure uniqueness.
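
A hypothetical mixed-type layout, sketched with h5py (names and values are invented for illustration):

    import h5py
    import numpy as np

    with h5py.File('example.h5', 'a') as f:
        data = f.create_group('data')
        # A short name is available for this variable:
        data.create_dataset('age', data=np.array([23, 41, 35]))
        # No usable short names: fall back to <type><no>, numbered for uniqueness.
        data.create_dataset('str0', data=['low', 'high', 'mid'],
                            dtype=h5py.string_dtype())
        data.create_dataset('double1', data=np.array([1.5, 0.3, 2.7]))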

OPTIONAL vector/matrix with natural labels

  • label: The natural labels split from the original data; Array of Integers/Floats

Natural labels are currently considered for files in LibSVM format.

Split files and Tasks

Note that the distinction between input/output and label/target depends on the TASK, not on the dataset itself! We do, however, have a mechanism in place to create split files automatically while slurping datasets from other repositories. The following datasets may be defined in split files:

HDF5 attributes on the root level

See above

HDF5 group 'task_descr' (derived from the Task object)

  • pub_date
  • version
  • slug
  • summary
  • description
  • urls
  • publications
  • is_public
  • is_deleted
  • is_current
  • user
  • downloads
  • hits
  • input
  • output
  • performance_measure
  • performance_ordering
  • type (name)
  • data (name)
  • data, held back (name)
  • license
  • tags

HDF5 group 'task'

  • train_idx: Vector or matrix containing the indices of the examples to be used for learning the model. If a matrix, each column specifies one particular split for the purposes of cross-validation (so k-fold cross-validation results in a matrix with k columns).
  • validation_idx: If it exists, it has the same structure as train_idx, containing the indices of the examples to be used for validation (selecting model order, etc.). If NULL, solutions are free to use whatever part of the training set for validation purposes.
  • test_idx: Contains the indices to be used for testing, in the same format as train_idx and validation_idx (never NULL). For tasks where everything is public, these data vectors in the data item look exactly like the training data vectors. For challenges with hidden answers, these data vectors have 'NaN' values for the target.
  • input_variables: List of indices of the variables to be used to predict the target(s).
  • output_variables: The index of the variable to be predicted from the features.
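
To show how these pieces fit together, here is a sketch of applying a split file to a dense dataset (both filenames are hypothetical, and a single split, i.e. a train_idx vector rather than a matrix, is assumed):

    import h5py

    with h5py.File('dataset.h5', 'r') as d, h5py.File('split.h5', 'r') as s:
        X = d['data/data'][:]              # dense data matrix
        task = s['task']
        train_idx = task['train_idx'][:]   # one column per fold if a matrix
        inputs = task['input_variables'][:]
        target = task['output_variables'][()]

        X_train = X[train_idx][:, inputs]  # features for learning
        y_train = X[train_idx][:, target]  # target to be predicted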

Supported formats

The website (and its converter tool) currently supports conversion from and to the following data formats:

to HDF5:
  • libsvm
  • arff
  • csv
  • matlab
  • octave

from HDF5:
  • libsvm
  • arff
  • csv
  • matlab
  • octave
  • xml
  • rdata

When uploading data, files can be compressed with gzip or bzip2, or packed as a single file in a zip archive or tarball. They will be decompressed automatically after upload.

You will find a Python implementation of a converter in the source tarball, under utils/hdf5conv (with scripts/hdf5conv.py as an example of how to use the converter). You can also download an example file.

Acknowledgements

This project is supported by PASCAL (Pattern Analysis, Statistical Modelling and Computational Learning), http://www.pascal-network.org/.