mldata :: Repository



Repository

The following kinds of objects are stored in the repository. We illustrate each class with the MNIST data set:

Data: This is "raw data", available in any one of a number of standard formats (for example CSV, ARFF, netCDF, HDF5, ODBC), together with a description of the data, its sources, and possibly related publications. For MNIST, the data are the images with their corresponding labels. These data files will be converted to our flavour of HDF5. You can download our converter tool from mloss.org and convert locally to check whether everything is transformed appropriately, and perhaps even help debug problems.
Task: This is a learning task, consisting of a formal description of input-output relationships based on a data set, and methods for evaluating the results. This might also include fixed partitionings of the base data set into training and test data. For the MNIST example, a learning task might be to perform multiclass classification for all ten classes, using the F1-score as the performance measure.
Method: Essentially capturing the 'Materials and Methods' section of a paper, a Method consists of a specific machine learning method applied to a learning task, together with all the other relevant information: for example, what kind of feature processing has been applied, how the parameters of the learning method have been selected, and what operating system has been used. Ideally, a method also provides a single parameterless piece of code that lets others re-run the experiments, preferably with source code.
Challenge: A challenge is a collection of tasks that are to be solved.
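
As an illustration of the performance measure named in the MNIST task example, the following is a minimal sketch of a multiclass F1-score in plain Python. The text above does not say how the per-class scores are combined; macro-averaging (the unweighted mean over classes) is an assumption made here, and the function names are hypothetical rather than part of the repository.

```python
from collections import Counter


def f1_per_class(y_true, y_pred, labels):
    """Return {label: F1}, treating each label one-vs-rest."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted, but the true class was different
            fn[t] += 1  # the true class t was missed
    scores = {}
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # Assumption: a class with no true and no predicted instances scores 0.
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    scores = f1_per_class(y_true, y_pred, labels)
    return sum(scores.values()) / len(labels)


if __name__ == "__main__":
    # Toy labels standing in for digit classes; not real MNIST results.
    truth = [0, 1, 2, 2]
    guess = [0, 2, 2, 2]
    print(macro_f1(truth, guess, labels=[0, 1, 2]))
```

For the ten-class MNIST task, `labels` would be `range(10)`. A task could equally fix a different averaging scheme, which is why the task description needs to state the measure precisely.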

These descriptions boil down to the following properties for the respective object classes:

Common to all: papers, external urls, free form description, version, time stamps, possibly also back links
Data: container format, source url, measurement details, usage scenario, links to tasks, links to methods
Task: input format, output format, performance measure, links to data, links to methods
Method: details of the computational pipeline
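
The property lists above can be read as a small schema. The sketch below expresses them as Python dataclasses. It is only one possible reading: the class names, field names, types and default values are assumptions chosen for illustration and do not describe the repository's actual storage format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List


@dataclass
class RepositoryObject:
    """Properties common to all object classes."""
    name: str
    description: str = ""                       # free form description
    version: int = 1
    papers: List[str] = field(default_factory=list)
    external_urls: List[str] = field(default_factory=list)
    back_links: List[str] = field(default_factory=list)
    created: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))  # time stamp


@dataclass
class Data(RepositoryObject):
    container_format: str = "HDF5"
    source_url: str = ""
    measurement_details: str = ""
    usage_scenario: str = ""
    task_links: List[str] = field(default_factory=list)
    method_links: List[str] = field(default_factory=list)


@dataclass
class Task(RepositoryObject):
    input_format: str = ""
    output_format: str = ""
    performance_measure: str = ""
    data_links: List[str] = field(default_factory=list)
    method_links: List[str] = field(default_factory=list)


@dataclass
class Method(RepositoryObject):
    pipeline_details: str = ""  # details of the computational pipeline


# Illustrative entries for the MNIST example; the values are placeholders.
mnist = Data(name="MNIST", description="Images with corresponding labels")
task = Task(name="MNIST multiclass", input_format="image",
            output_format="one of ten classes",
            performance_measure="F1-score", data_links=[mnist.name])
mnist.task_links.append(task.name)
```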

Between data sets and tasks, the relationship is one-to-many, as a data set can give rise to many different learning tasks. A method can also be applied to several different tasks, giving rise to solutions. Conversely, a task can have many solutions, but each solution belongs to exactly one learning task. These relationships are illustrated in the image below.
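
The cardinalities described above can also be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions: the `Repository` class and its method names are hypothetical, and a solution is represented simply as a (method, task) pair.

```python
class Repository:
    """Toy model: one data set -> many tasks; a solution pairs one method with one task."""

    def __init__(self):
        self.task_to_data = {}   # each task maps to exactly one data set
        self.solutions = []      # list of (method, task) pairs

    def add_task(self, task, data):
        if task in self.task_to_data:
            raise ValueError(f"task {task!r} already belongs to a data set")
        self.task_to_data[task] = data

    def add_solution(self, method, task):
        if task not in self.task_to_data:
            raise KeyError(f"unknown task {task!r}")
        self.solutions.append((method, task))

    def tasks_for(self, data):
        """All tasks derived from one data set (one-to-many)."""
        return sorted(t for t, d in self.task_to_data.items() if d == data)

    def solutions_for(self, task):
        """All methods that have been applied to one task."""
        return [m for m, t in self.solutions if t == task]

    def tasks_solved_by(self, method):
        """All tasks one method has been applied to."""
        return [t for m, t in self.solutions if m == method]


# Placeholder names, used only to exercise the relationships.
repo = Repository()
repo.add_task("ten-class classification", "MNIST")
repo.add_task("odd-vs-even classification", "MNIST")
repo.add_solution("method A", "ten-class classification")
repo.add_solution("method A", "odd-vs-even classification")
repo.add_solution("method B", "ten-class classification")
```

Here `repo.tasks_for("MNIST")` lists both tasks, while `repo.solutions_for("ten-class classification")` lists the two methods applied to that task.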

Given the described infrastructure, organising a challenge essentially boils down to selecting a group of data sets, splitting each appropriately into training, validation and test data, and choosing a machine learning task. As data storage and the evaluation mechanism are already taken care of by the basic data repository, one merely needs to add an administration interface and a challenge "view" on the data.
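
To make the data-splitting step concrete, here is a minimal sketch of a fixed, reproducible partition into training, validation and test indices. The function name, the 60/20/20 proportions and the seed are assumptions for illustration; a challenge organiser would choose whatever split suits the data.

```python
import random


def split_indices(n, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle range(n) with a fixed seed and cut it into three disjoint parts."""
    if len(fractions) != 3 or abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must be three values summing to 1")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # whatever remains goes to the test set
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_indices(10)
    print(len(train), len(val), len(test))
```

Storing the resulting index lists with the task, rather than re-drawing them, is what makes the partition "fixed" so that all submitted solutions are evaluated on the same test data.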

Disclaimer

We are acting in good faith to make datasets submitted for the use of the scientific community available to everybody. If you are a copyright holder and would like us to remove a dataset, please inform us and we will do so as soon as possible.


Acknowledgements

This project is supported by PASCAL (Pattern Analysis, Statistical Modelling and Computational Learning), http://www.pascal-network.org/.