As I’m trying to learn more about machine learning, I spent some time looking for data that I can use. While GitHub is the place to get open source code, there doesn’t seem to be a counterpart for open data. Below are a couple of websites that help you find data.
In Bluemix there is an Analytics Exchange which gives you access to free and open data in categories such as economy and business, leisure, transportation, and others. The screenshot shows a sample dataset which contains reviews from Airbnb.
There are several other websites that help you find datasets. Unfortunately, a lot of datasets are not open, so don’t forget to check the licenses first.
mldata.org provides a repository with a lot of datasets that can be used for machine learning.
The Apache Spark Makers Build hackathon provides a nice list of public datasets as well as city and regional public data (mostly US).
awesome-public-datasets contains a huge list of links to various datasets. There is also a specific category for machine learning with datasets that are often used in machine learning samples.
I’ve also been pointed to some other websites that I still have to check out: UCI Machine Learning Repository, datahub, re3data.org, Open Access Directory, and Kaggle.
One dataset I used for my first Spark applications is the Stack Exchange Data Dump, which covers the various Stack Exchange sites like Stack Overflow (download files, schema).
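To give a feel for working with this dump, here is a minimal sketch in plain Python (not Spark) that counts how often each tag appears on posts. It assumes the layout I have seen in Posts.xml: one `row` element per post, with the tags stored in a `Tags` attribute in the form `<tag1><tag2>`. The sample data and the `count_tags` helper are made up for illustration, so check the schema of the files you actually download, since the tag format may differ.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny made-up sample that mimics the assumed row-per-post layout of Posts.xml.
SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<posts>
  <row Id="1" PostTypeId="1" Score="5" Tags="&lt;python&gt;&lt;spark&gt;" />
  <row Id="2" PostTypeId="2" Score="3" />
  <row Id="3" PostTypeId="1" Score="1" Tags="&lt;spark&gt;" />
</posts>
"""


def count_tags(source):
    """Count tag occurrences in a Posts.xml-style file or file-like object."""
    counts = Counter()
    # iterparse streams the document, which matters for multi-gigabyte dumps.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            # Answers carry no Tags attribute, so default to an empty string.
            tags = elem.get("Tags", "")
            for tag in tags.strip("<>").split("><"):
                if tag:
                    counts[tag] += 1
            # Drop the element's contents once it has been processed.
            elem.clear()
    return counts


tag_counts = count_tags(io.BytesIO(SAMPLE.encode("utf-8")))
print(tag_counts.most_common())
```

For a real dump you would pass the path to the extracted Posts.xml instead of the in-memory sample. The same per-row logic should carry over to a Spark job once the rows are loaded.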