Machine Learning (ML) is a scientific field comprising both theoretical and empirical results. For methodological advances, one key aspect of reproducible research is the ability to compare a proposed approach with the current state of the art. Such a comparison can be theoretical in nature, but often a detailed theoretical analysis is not possible or may not tell the whole story. In such cases, an empirical comparison is necessary. To produce reproducible machine learning research, three main components need to be easily available:

  • The paper describing the method clearly and comprehensively.
  • The data on which the results are computed.
  • Software (possibly source code) that implements the method and produces the figures and tables of results in the paper.

While the open access idea has already migrated to ML through journals such as the Journal of Machine Learning Research, the field of ML has fallen behind with regard to open source and data.

In the statistics and bioinformatics communities, some headway has been made. The success of R and Bioconductor, as well as projects such as Sweave, has made it possible to embed the code that produces a paper's results in the source of the paper itself. This idea of unifying computation and presentation was pioneered by Jon Claerbout (in Redoc) and David Donoho (in Wavelab). This approach, however, may not be well suited to the machine learning community, as the datasets may be quite large and computations often take significant amounts of time. Instead, we propose separate but interacting units of software, data, and documentation of the scientific result.

Recently, a community software repository has enabled machine learning researchers to register their software and allow other researchers to easily find, download, and reuse software matching their interests; currently, more than 200 projects are listed. Furthermore, the Journal of Machine Learning Research now accepts submissions to its special Open Source Software track, in which papers describing peer-reviewed software can be published, as a further incentive for researchers to publish their software under an open source license. Since the track's inception in October 2007, seven papers have been published in it, with more currently under review. So far, the initiative has been highly successful, but it has focused mostly on the "method" side of making machine learning research more reproducible. Hence we see the need to initiate a companion project which focuses on the free exchange and benchmarking of datasets. Additionally, this new repository will emphasise the precise specification of machine learning tasks: detailed definitions of the datasets to be used (possibly including feature extraction or other preprocessing steps) together with the desired operation to be performed and the relevant performance metric. Finally, a solution to such a task would provide details of how to apply a general software package to this particular problem instance, as well as the numerical performance measures obtained. This project will thus focus on providing a platform for publishing, exchanging, collecting, and discussing such datasets, tasks, and methods for challenging machine learning problems.
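To make the notion of a task specification and its solution concrete, the following sketch shows what such structured records might contain. All field names and example values here are purely illustrative assumptions, not a prescribed repository schema:

```python
from dataclasses import dataclass

@dataclass
class Task:
    """A machine learning task: which data to use, what to do, how to score it.

    Field names are hypothetical, chosen only to illustrate the idea.
    """
    dataset: str             # identifier of the dataset to be used
    preprocessing: list      # feature extraction or other preprocessing steps
    operation: str           # desired operation, e.g. "binary classification"
    metric: str              # relevant performance metric

@dataclass
class Solution:
    """A solution to a task: the software applied and the score obtained."""
    task: Task
    software: str            # name and version of the general software package
    invocation: str          # how the package was configured and run
    score: float             # value of the task's performance metric

# Example instance (all values hypothetical):
task = Task(
    dataset="example-binary-dataset",
    preprocessing=["normalise features to zero mean, unit variance"],
    operation="binary classification",
    metric="accuracy",
)
solution = Solution(
    task=task,
    software="example-learner-1.0",
    invocation="train with default parameters, 10-fold cross-validation",
    score=0.94,
)
```

Separating the task (data, operation, metric) from the solution (software, invocation, score) is what would allow independent solutions to the same task to be benchmarked against one another.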



This project is supported by PASCAL (Pattern Analysis, Statistical Modelling and Computational Learning)