Microsoft Research Open Data is a search engine for free datasets available from Microsoft Research. The datasets are primarily aimed at Natural Language Processing (NLP) and computer vision. Take a look if you are in need of a dataset for your next project.
Looking for datasets for your next project? You are in luck because Google just launched Dataset Search. The name is self-explanatory. Go try it out.
The title says it all: Some datasets for teaching data science.
So, you have identified a fascinating new problem to solve with data. You correctly started with a problem and not the data, and the problem looks both beneficial and interesting. Now where do you get the data? Here are four steps, in order, for finding data.
1. Existing Data
The best place to start is the data you currently have. What data does your organization currently collect? How can you get access to that? Start there.
2. Open Data
Then look for industry-specific open data (data that is freely available). Many industries publish data monthly or yearly. Data is also frequently available from government-funded research. If industry-specific data is not available, what other related data is openly available? It is often beneficial to augment your existing data with open data. Here are some lists of open data: Open Data, Part 1 and Open Data, Part 2. There are also many others available.
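To make "augment your existing data with open data" a little more concrete, here is a minimal sketch in Python. It assumes both datasets share a key column. The `augment` function, the column names, and the numbers are all made up for illustration and do not come from any real dataset.

```python
def augment(internal_rows, open_rows, key):
    """Left-join open-data columns onto internal rows by a shared key."""
    # Index the open data by the shared key for quick lookups.
    lookup = {row[key]: row for row in open_rows}
    merged = []
    for row in internal_rows:
        # Rows with no match in the open data are kept unchanged.
        extra = lookup.get(row[key], {})
        # Internal values win if both sources define the same column.
        merged.append({**extra, **row})
    return merged


# Made-up illustrative values, not real figures.
sales = [
    {"region": "North", "units_sold": 120},
    {"region": "South", "units_sold": 95},
]
population = [
    {"region": "North", "population": 50000},
]

combined = augment(sales, population, key="region")
```

In practice the hard part is usually finding a key that both sources agree on (region names, dates, postal codes), so expect to spend some time cleaning that column first.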
3. APIs
Next, explore the opportunity of using an API to access data. Many applications have existing API access. An API (Application Programming Interface) allows a person to write some computer code to pull machine-readable data from an existing system. Some are freely available; others have associated costs. Many make the data available in near real-time. Start by checking whether the applications you already use offer an API. APIs are available for weather, stocks, news, social media, web analytics, and many more.
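If you have never pulled data from an API, the code is often shorter than you might expect. Below is a rough sketch using only the Python standard library. The payload shape (`days`, `date`, `high_c`) and the helper names `fetch_json` and `daily_highs` are assumptions for illustration. A real API will have its own URL, fields, and usually an API key, so check its documentation.

```python
import json
from urllib.request import urlopen


def fetch_json(url, opener=urlopen, timeout=10):
    """Download a URL and decode the JSON body into Python objects."""
    # `opener` is a parameter so the network call can be swapped out in tests.
    with opener(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def daily_highs(payload):
    """Pull (date, high) pairs out of a hypothetical weather payload."""
    # Assumed shape: {"days": [{"date": "...", "high_c": 12.5}, ...]}
    return [(day["date"], day["high_c"]) for day in payload["days"]]
```

You would call `fetch_json` with the endpoint URL from the API's documentation and pass the result to a small function like `daily_highs` that keeps only the fields you need.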
4. Create The Data
The last resort is to begin the creation of data. An obvious choice is to create a survey. Be careful because surveys can be trickier than initially thought. You often do not get good representation and the result is biased data. Another way to collect data is to change your application to begin collecting the desired information. You may even have to build a new application. Sometimes an entire process needs to be created or modified to include methods to collect the data. This last step usually takes the longest and costs the most money.
March 4, 2017 is Open Data Day.
Open Data Day is an annual celebration across the globe. Over 300 groups around the world schedule activities to use open data for their communities. See if there is a gathering in your area. Also, the focus this year is on:
- Open research data
- Tracking public money flows
- Open data for environment
- Open data for human rights
Our World in Data is a data visualization site for exploring the history of civilization. The site was created by Max Roser. Our World in Data contains tons of information about many aspects of people’s lives. It also includes numerous visuals which can be easily shared or embedded on other sites.
Beware, the site is addictive, and you might spend a lot of time exploring data.
Recently, a number of resources for publicly available datasets have been announced.
- Kaggle becomes the place for Open Data – I think this is big news! Kaggle just announced Kaggle Datasets which aims to be a repository for publicly available datasets. This is great for organizations that want to release data, but do not necessarily want the overhead of running an open data portal. Hopefully it will gain some traction and become an exceptional resource for open data.
- NASA Opens Research – NASA just announced all research papers funded by NASA will be publicly available. It appears the research articles will all be available at PubMed Central, and the data available at NASA’s Data Portal.
- Google Robotics Data – Google continues to do interesting things, and this is no exception. It is a dataset about how robots grasp objects (Google Brain Robot Data). I am not overly familiar with this topic, so if you want to know more, see their blog post, Deep Learning for Robots.
For more open data options, see Data Sources for Cool Data Science Projects Part 1 and Part 2.
Are you aware of any other resources that have been recently announced? If so, please leave a comment.