Title says it all: some datasets for teaching data science.
So, you have identified a fascinating new problem to solve with data. You correctly started with a problem and not the data. It seems both beneficial and interesting. Now where do you get the data? Here are 4 steps (in order) for how to find data.
1. Existing Data
The best place to start is the data you currently have. What data does your organization currently collect? How can you get access to that? Start there.
2. Open Data
Then look for industry-specific open data (data that is freely available). Many industries publish data monthly or yearly. Data is also frequently available from government-funded research. If industry-specific data is not available, what other related data is openly available? It is often beneficial to augment your existing data with open data. Here are some lists of open data: Open Data, Part 1 and Open Data, Part 2. Many other lists are available as well.
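To make the idea of augmenting existing data with open data concrete, here is a minimal sketch using only the Python standard library. Everything in it is an assumption for illustration: the column names, the `augment` helper, and the figures are made up and do not come from any real dataset.

```python
import csv
import io

# Hypothetical internal data: monthly sales by state (made-up values).
INTERNAL_CSV = """state,month,sales
IA,2017-01,1200
MN,2017-01,3400
"""

# Hypothetical open data: population by state (made-up values).
OPEN_CSV = """state,population
IA,3100000
MN,5500000
"""


def augment(internal_csv, open_csv, key="state"):
    """Attach the open-data columns to each internal row that shares the key."""
    # Index the open data by the shared key for quick lookups.
    open_by_key = {
        row[key]: row for row in csv.DictReader(io.StringIO(open_csv))
    }
    merged = []
    for row in csv.DictReader(io.StringIO(internal_csv)):
        # Rows with no match in the open data are kept unchanged.
        extra = open_by_key.get(row[key], {})
        merged.append({**row, **{k: v for k, v in extra.items() if k != key}})
    return merged


if __name__ == "__main__":
    for row in augment(INTERNAL_CSV, OPEN_CSV):
        print(row)
```

Note that `csv.DictReader` returns every value as a string, so numeric columns would still need converting before any analysis.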
3. APIs
Next, explore the opportunity of using an API to access data. Many applications already provide API access. An API (Application Programming Interface) allows a person to write some computer code to pull machine-readable data from an existing system. Some are freely available; others have associated costs. Many make the data available in near real-time. Check whether your existing applications offer one. APIs are available for weather, stocks, news, social media, web analytics, and much more.
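As a rough illustration of what "writing some computer code to pull machine-readable data" can look like, here is a small Python sketch. The endpoint URL, the JSON layout, and the function names are all hypothetical. A real API will have its own URL, authentication, and response format, so check its documentation.

```python
import json
from urllib.request import urlopen

# Hypothetical endpoint, for illustration only; it does not exist.
API_URL = "https://api.example.com/v1/weather?city=Chicago"


def parse_readings(payload):
    """Turn a JSON response body (assumed layout) into a list of rows."""
    data = json.loads(payload)
    return [
        {"date": reading["date"], "temp_c": float(reading["temp_c"])}
        for reading in data["readings"]
    ]


def fetch_readings(url=API_URL, timeout=10.0):
    """Download the JSON from the API and parse it into rows."""
    with urlopen(url, timeout=timeout) as response:
        return parse_readings(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Parse a canned response so the sketch runs without network access.
    sample = '{"readings": [{"date": "2017-03-04", "temp_c": "3.5"}]}'
    print(parse_readings(sample))
```

Keeping the parsing separate from the download makes it possible to try the code against a saved response before pointing it at a live service.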
4. Create The Data
The last resort is to create the data yourself. An obvious choice is to run a survey. Be careful, because surveys can be trickier than they first appear: respondents often do not represent the population well, and the result is biased data. Another way to collect data is to change your application so that it begins collecting the desired information. You may even have to build a new application. Sometimes an entire process needs to be created or modified to include methods for collecting the data. This last step usually takes the longest and costs the most money.
March 4, 2017 is Open Data Day.
Open Data Day is an annual celebration across the globe. Over 300 groups around the world schedule activities to use open data for their communities. See if there is a gathering in your area. Also, the focus this year is on:
- Open research data
- Tracking public money flows
- Open data for environment
- Open data for human rights
Our World in Data is a data visualization site for exploring the history of civilization. The site was created by Max Roser. Our World in Data contains tons of information about many aspects of people’s lives. It also includes numerous visuals which can be easily shared or embedded on other sites.
Beware, the site is addictive, and you might spend a lot of time exploring data.
Recently, a number of resources for publicly available datasets have been announced.
- Kaggle becomes the place for Open Data – I think this is big news! Kaggle just announced Kaggle Datasets which aims to be a repository for publicly available datasets. This is great for organizations that want to release data, but do not necessarily want the overhead of running an open data portal. Hopefully it will gain some traction and become an exceptional resource for open data.
- NASA Opens Research – NASA just announced all research papers funded by NASA will be publicly available. It appears the research articles will all be available at PubMed Central, and the data available at NASA’s Data Portal.
- Google Robotics Data – Google continues to do interesting things, and this release is no exception. It is a dataset about how robots grasp objects (Google Brain Robot Data). I am not overly familiar with this topic, so if you want to know more, see their blog post, Deep Learning for Robots.
For more open data options, see Data Sources for Cool Data Science Projects Part 1 and Part 2.
Are you aware of any other resources that have been recently announced? If so, please leave a comment.
I think Google has been hosting public data for a while, but it now has an official location for public datasets stored in BigQuery. You can:
- Access and use the data in your applications
- Request Google to host your own public data set
It will be fun to watch this site expand with more public datasets. Happy Exploration!
Yahoo just released a 1.5 TB dataset of “anonymized user interactions on the news feeds”. If you have been looking for a new dataset to analyze, this just might be it. It contains approximately 110 billion rows of data regarding user-news interactions. Happy data exploring!
Originally published in 2012, the Open Data Handbook has released a second edition. The handbook is a guide for organizations or individuals interested in publishing and/or using open data. Its goal is to ensure that data is open and is put to use as often as possible.
The second edition now includes 3 parts.
- Open Data Guide – The Why?, What? and How? of open data
- Value Stories – Stories of how open data is making a difference
- Resource Library – Videos, presentations, and publications about open data
Ben Wellington gives an excellent TED Talk on open data. He argues that cities need to make more of an effort to release data in a standardized and machine-readable format. This could help cities be safer and more fiscally responsible. He is hoping New York City sets the standard for open data for cities. As a bonus, he is a wonderful storyteller.