Thread My experience navigating this site
- mgashler (14 years, 8 months ago)
I'm having a lot of difficulty navigating this site. Perhaps if I describe my experience, this might help to clarify what changes are needed.
My goal
I came here in search of some data that I could use to test a new algorithm for a recommendation system. I would be happy with any dataset that contains ratings of multiple items by multiple human users. I suspect that many such datasets exist, and there are probably several on this site. I don't need a dataset with a class label, but I could easily discard an extra attribute.
Hierarchical list of categories
I started searching for a hierarchical list of categories. I was so convinced that such a thing must certainly exist that I spent about 10 minutes clicking around in search of it. Apparently that's not how things are organized here.
The search field
The search field looked promising, so I tried it next. I was hoping to find a big list of results ranked by partial relevance. (Perhaps I have been spoiled by Google.) Instead, every query I attempted returned zero results. After several attempts, I concluded that it must only return exact matches, so I tried every possible grammatical variant of every relevant word I could think of. Still zero matches. So, I decided the search bar was wasting my time.
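To illustrate what "results ranked by partial relevance" could mean here, the following is a minimal sketch, not a description of how the site's search works. The catalogue entries, the `relevance` and `search` functions, and the 0.6 cut-off are all assumptions made up for the example; only Python's standard `difflib` module is used.

```python
from difflib import SequenceMatcher

# Hypothetical catalogue entries; names and summaries are made up for illustration.
DATASETS = [
    {"name": "movie-ratings", "summary": "User ratings of movies for recommender systems"},
    {"name": "friedman-1", "summary": "(No information yet)"},
    {"name": "iris", "summary": "Classic flower classification data"},
]

def relevance(query, text):
    """Score in [0, 1]: for each query word, take its best fuzzy match
    against any word of the text, then average over the query words."""
    q_words = query.lower().split()
    t_words = text.lower().split()
    if not q_words or not t_words:
        return 0.0
    best_per_word = [
        max(SequenceMatcher(None, q, t).ratio() for t in t_words)
        for q in q_words
    ]
    return sum(best_per_word) / len(q_words)

def search(query, datasets, min_score=0.6):
    """Return dataset names whose score reaches min_score, best match first."""
    scored = [
        (relevance(query, d["name"] + " " + d["summary"]), d["name"])
        for d in datasets
    ]
    ranked = sorted((s for s in scored if s[0] >= min_score), reverse=True)
    return [name for _, name in ranked]

# An exact word hit scores 1.0, so "movie-ratings" is ranked first here.
print(search("recommender", DATASETS))
```

With scoring like this, a near miss such as "recommendation" can still reach an entry that mentions "recommender", where an exact-match search returns nothing.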
Sort by
The "Sort by" section looks like it could have been useful. Unfortunately, none of the features by which I could sort were relevant to what I wanted. I really don't know who submitted the data I want, or how many attributes it has. I don't care how many times it has been downloaded, etc. If there were many more attributes, then this could be useful. It would be ideal if I could specify multiple criteria and have it return all matching datasets, but even this would be worthless with so few attributes to work with.
Tag cloud
Next, I thought perhaps I could find it by navigating within the "tag cloud". Is this the nearest equivalent to the hierarchical categories that I was originally expecting? Unfortunately, I could not seem to comprehend the relationships between tags. When I click on one, the tags change (sometimes), but I could never seem to get any closer to relevant tags. Is that just because there is not enough data yet? I could never seem to become confident that I had fully searched the tag cloud. I really wish I could just see a complete alphabetized list of tags--then I could at least know for sure that there wasn't a "recommendations" tag sitting in the tag cloud just beyond the horizon of my current scope.
Brute force
Finally, I decided I was going to just find it the old-fashioned way. I clicked on "Repository->Data", and started reading the descriptions of the data sets. My first impression was that they were all named "Friedman-datasets", had an unknown license, and had a summary of "(No information yet)". By page 8, I finally got past the Friedman datasets, but I couldn't find any information about any of them to indicate whether they were suitable for my needs. I started downloading random datasets, only to find that all comments and meta-data had been stripped. This is when I gave up and wrote this forum post.
Conclusion
The value of all this data is limited by the lack of meta-data.
P.S. I really appreciate the work you guys are doing with this site. I think this project has a lot of potential to do good in the future. Thanks for all your great efforts!
- sonne (14 years, 7 months ago)
Hi!
First of all, thanks for your feedback. I think you haven't found anything because there isn't such a data set on mldata.org yet. The data sets that are currently there were automatically pulled from the libsvm and weka repositories and are basically UCI and some extra data sets. So naturally these don't have a summary or a lot of meta data, because they were not uploaded by a human. So far, the only humanly uploaded data sets are the IDA benchmark data sets that I uploaded.
Now, any suggestions on how we could improve?
- Sure, the search function could be better, showing fuzzy matches and ranking by relevance. But honestly, who will use such a thing in the Google age?
- Which further 'sort by' attributes do you expect?
- Tags, I agree some filtering based on multiple tags would be cool.
I am not sure what the conclusion should be. It seems to me that we are simply missing users that care for their data. I mean the content here is in big contrast to mloss.org (where each software package is humanly uploaded/maintained). So it is a chicken and egg problem. If we get users that care about their data then this will eventually resolve for good...
- mgashler (14 years, 7 months ago)
I see. When I saw that you had 786 datasets (which is many more than I could find elsewhere), I assumed that this site must also have a huge following of users who contributed them. This poor assumption caused me to conclude that the meta-data was just poorly indexed. I see now that it was never available to you in the first place.
Perhaps my confusion could have been avoided with a mechanism to filter out datasets that lacked a descriptive summary. I think if I had been able to filter these out, I would have quickly determined that what I wanted did not exist. I'd also be a lot more comfortable contributing data if I was confident that it wasn't already in there somewhere.
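As a rough illustration of the filter suggested above, combined with the multi-tag filtering mentioned earlier in the thread, here is a minimal sketch. The record layout, the sample entries, and the function names are assumptions for the example and do not reflect the site's actual data model; the placeholder string is the one shown on the site for undescribed datasets.

```python
# Summary text the site shows for datasets nobody has described yet.
PLACEHOLDER = "(No information yet)"

# Hypothetical records; the fields and values are made up for illustration.
DATASETS = [
    {"name": "movie-ratings", "summary": "User ratings of movies",
     "tags": ["recommender system", "user rating"]},
    {"name": "friedman-1", "summary": "(No information yet)", "tags": []},
    {"name": "iris", "summary": "Flower measurements", "tags": ["classification"]},
]

def has_summary(dataset):
    """True if the dataset has a non-empty summary other than the placeholder."""
    summary = dataset.get("summary", "").strip()
    return bool(summary) and summary != PLACEHOLDER

def filter_datasets(datasets, required_tags=(), described_only=True):
    """Return names of datasets that carry every required tag and,
    if described_only is set, also have a real summary."""
    required = {t.lower() for t in required_tags}
    names = []
    for d in datasets:
        if described_only and not has_summary(d):
            continue
        tags = {t.lower() for t in d.get("tags", [])}
        if required <= tags:  # all required tags are present
            names.append(d["name"])
    return names

print(filter_datasets(DATASETS))                   # described datasets only
print(filter_datasets(DATASETS, ["user rating"]))  # described and tagged
```

An empty result from such a filter would tell a visitor quickly that nothing suitable has been described yet, rather than leaving them to page through undescribed entries.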
Unfortunately, mobilizing people is much more difficult than merely adding features. I'll try to encourage people to do so when I get the chance. Perhaps it will just take a lot of time for this idea to sink in with the public.
- mgashler (14 years, 7 months ago)
I found some nice data for recommender systems (http://www.grouplens.org/node/12). Unfortunately, they use a restrictive license that does not permit redistribution. (They want everyone to get the data from their site.) So, how would MLData feel about a feature that lets users submit links to externally-hosted data when they are not permitted to submit the actual data?
pros: It would expand the scope of MLData, and make it more of a one-stop-shop for ML data.
cons: It might help encourage proprietary attitudes that drag at research. There might be some potential to bury the more useful free data. MLData would be powerless to conveniently provide the data in the user's choice of formats. Theoretically, the external host could behave in obnoxious ways, such as change the data without notice, put up a pay-wall or a survey-wall, etc.
- sonne (14 years, 7 months ago)
Did you ask them whether they would provide their data set?
The problem I see is that data hosted under some URL often vanishes (even more so than software) and that all the nice things about mldata (having data in a standard format, being able to define tasks, challenges, submit solutions) become impossible with data we cannot access.
- jaakkopeltonen (14 years, 7 months ago)
Dear MGashler, thanks for your valuable comments.
Our basic search function is currently indeed somewhat restrictive (this might change in future development), and I agree it would be useful to support more "permissive" searches for queries where exact matches are hard to find.
Currently, as Soeren points out, it is fairly easy to use Google for more advanced search purposes. For example, the following Google query would look for any data sets that involve the word "hierarchical": "site:mldata.org/repository/data/viewslug hierarchical". It could be possible to directly add a link to such searches from mldata.org, but this has not been decided. (Note that currently such Google searches return a few dead links too because mldata.org is under rapid development, but this issue will naturally improve once the site reaches its final form.)
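A link to such a search could be generated along the following lines. This is only a sketch: the helper name `site_search_url` is made up, and it assumes Google's usual `https://www.google.com/search?q=...` URL form; the site path is the one from the query above.

```python
from urllib.parse import urlencode

def site_search_url(keyword, site_path="mldata.org/repository/data/viewslug"):
    """Build a Google search URL restricted to pages under site_path."""
    query = f"site:{site_path} {keyword}"
    # urlencode percent-escapes ':' and '/' and turns spaces into '+'.
    return "https://www.google.com/search?" + urlencode({"q": query})

print(site_search_url("hierarchical"))
```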
We do not have a hierarchical classification of data/tasks/solutions/challenges. This is partly because any hierarchy we would use would be from one "viewpoint" only and might not represent well the different qualities of future data sets. Instead, the tag cloud will hopefully be sufficient for these purposes. For example, data sets suitable for recommender systems would hopefully be tagged "recommender system" or "user rating" etc. by the people who submit the data.
The tag cloud currently shows all the available tags - there are not many yet (this is why you were not able to find a good tag for your search), but this will change as more data sets are added and better commented by users. If the tag cloud becomes too big to show in full, we may add some way to browse the tags.
Many of the current data sets in mldata.org have been automatically extracted - while this is already useful for people who are familiar with the data sets, I hope that their descriptions will be improved in the future. Any registered user can help with this.
Thanks again for your comments. We are very interested in making it intuitive and efficient to browse mldata.org.
Best regards,
Jaakko Peltonen
- phoyer (14 years, 7 months ago)
Hi all,
It is definitely true that many researchers and groups who distribute data want people to get it from their own site. To some extent, I think it may be possible to convince such researchers that it is a good idea to host them on dedicated data repositories: More visibility of course, and most academics realize the dangers of key people changing interests/jobs and sites going down (and most people have seen this happen in practice!). As long as the data on mldata.org still can adequately ask people to cite key publications, and give credit to the people who created the data, I think convincing (some of the) data providers may be possible and is the best solution for all involved.
However, not everybody will be convinced (or even reachable), so as you say, should we allow objects which are simply 'links' to other sites providing data? Like all of you, I see both pros and cons to that: Indeed, it would extend the scope (and number of datasets), but all of the 'bonus' features would be lost, and in particular there is the 'bad link' problem that sites change and the data is subsequently lost.
Anyway, one possibility is to allow such objects, and hope to provide 'bonus' features in terms of tools to facilitate working with that data. For instance, for a course I am currently teaching, I am providing the students with some basic functions (for Matlab/Octave and R) that can be used to read in the "20 newsgroups" bag of words dataset, and some example functions for looking at the data (as a sanity check that the data is ok and everything is working properly). Even if the data provider would not allow putting the actual data on mldata.org, it would be nice if users could upload such helpful functions for working with the data on a variety of platforms. Of course, at present such functions can be put in .tar.gz files into data objects, so this is already in principle possible, but one might ask whether something like this (i.e. not uploading data but just uploading a link to some data with some additional code that may be helpful in reading/processing it) should be encouraged or discouraged?
Patrik
- mgashler (14 years, 4 months ago)
Here's my follow-up to this thread:
I ended up finding three datasets that met my needs. I requested permission from the corresponding three owners to upload the data to this site. Two of them replied to my request. Both were particularly concerned that it be made sufficiently clear that every publication resulting from the use of their data must cite them.
Thus, the choice of license became the sticky issue. I mentioned the CC-BY license to one of them. He agreed that this license satisfied his requirements. Upon realizing that this license was not an option for this site, I tried to persuade him to be happy with the ODbL license instead. The ODbL license is nearly identical in spirit to the CC-BY-SA. Both require attribution as the only condition for use, but there is one notable difference between them: CC-BY-SA requires attribution in the manner specified by the author, whereas ODbL permits the user of the work to determine the manner of attribution. Since a citation was the only form of attribution that the authors found to be acceptable, the ODbL was not a perfect fit for their needs. One of them gave permission for me to upload his data under the ODbL, provided that I clearly indicate in the summary and comments that a citation was expected, and I did so.
This causes me to wonder, how specific can an author using the CC-BY-SA be about the manner of attribution? Can I require that my attribution be tattooed on the forehead of the first-born child of everyone who uses my work? (To be clear, I am not trying to suggest that CC-BY-SA needs to be added as an option, I'm just describing what happened.)
Finally, just a couple of usability comments. (Neither of these is a big deal--I'm just letting you know about them.) When I submitted data, the tabbed form confused me for some time. I could not see where to enter summary information. I would prefer a long form to one that spans tabs because forms that span multiple pages are uncommon, and I was not certain whether my data would be splinched because I couldn't tell if both tabs were really part of the same form. Also, after submission there was a "go back" option. I clicked on this because I thought of an additional statement I wanted to add to the summary. All the data was lost, so I had to re-enter it.
- ongchengsoon (14 years, 3 months ago)
Hi mgashler,
Thanks a lot for all the feedback and support of mldata.org.
As an aside, we are currently in dire need of someone to help with changes and maintenance. If you are interested, drop me an email.
Creative Commons (CC) licenses are not ideally suited to data, and are really designed for a more "document" kind of idea, rather than an "open data" kind of idea. http://sciencecommons.org/resources/faq/database-protocol/
How to attribute data? I'm not quite sure. There is a new project called DataCite, which is trying to create a DOI-type identifier. It would be cool to integrate this into mldata, but some programming help is needed. (hint)
The idea behind the tabbing is to not scare the user by the large number of empty fields. Most of the fields are optional anyway, and we don't want to give an impression of being overly structured and requiring lots of info for a dataset to be posted.
Keep those ideas and comments coming!
Thanks again, Cheng
Acknowledgements
This project is supported by PASCAL (Pattern Analysis, Statistical Modelling and Computational Learning)
http://www.pascal-network.org/.