March 4, 2017 is Open Data Day.
Open Data Day is an annual celebration across the globe. Over 300 groups around the world schedule activities to use open data for their communities. See if there is a gathering in your area. Also, the focus this year is on:
- Open research data
- Tracking public money flows
- Open data for environment
- Open data for human rights
Our World in Data is a data visualization site for exploring the history of civilization. The site was created by Max Roser. Our World in Data contains tons of information about many aspects of people’s lives. It also includes numerous visuals (like the one below) which can be easily shared or embedded on other sites.
Beware, the site is addictive, and you might spend a lot of time exploring data.
Recently, a number of resources for publicly available datasets have been announced.
- Kaggle becomes the place for Open Data – I think this is big news! Kaggle just announced Kaggle Datasets which aims to be a repository for publicly available datasets. This is great for organizations that want to release data, but do not necessarily want the overhead of running an open data portal. Hopefully it will gain some traction and become an exceptional resource for open data.
- NASA Opens Research – NASA just announced all research papers funded by NASA will be publicly available. It appears the research articles will all be available at PubMed Central, and the data available at NASA’s Data Portal.
- Google Robotics Data – Google continues to do interesting things, and this topic is definitely that. It is a dataset about how robots grasp objects (Google Brain Robot Data). I am not overly familiar with this topic, so if you want to know more, see their blog post, Deep Learning for Robots.
For more open data options, see Data Sources for Cool Data Science Projects Part 1 and Part 2.
Are you aware of any other resources that have been recently announced? If so, please leave a comment.
DataUSA.io is a huge collection of visualizations displaying U.S. public data. It is fun to browse the visualizations, and there is also an API.
Somewhat lost in the hype of Google’s Cloud Machine Learning announcement (which is itself neat) was the release of Google’s Public Data Sets.
I think Google was already hosting some public data, but now there is an official location for these public data sets, which are stored in BigQuery. You can:
- Access and use the data in your applications
- Request Google to host your own public data set
It will be fun to watch this site expand with more public datasets. Happy Exploration!
Yahoo just released a 1.5 TB dataset of “anonymized user interactions on the news feeds”. If you have been looking for a new dataset to analyze, this just might be it. It contains approximately 110 billion rows of data regarding user-news interactions. Happy data exploring!
Originally published in 2012, the Open Data Handbook has released a second edition. The handbook is meant as a guide for organizations or individuals interested in publishing and/or using open data. The goal is to ensure that data is open and that it is applied as often as possible.
The second edition now includes 3 parts.
- Open Data Guide – The Why?, What? and How? of open data
- Value Stories – Stories of how open data is making a difference
- Resource Library – Videos, presentations, and publications about open data
Following the theme of openness, the Open Data Handbook is open sourced on GitHub. You are free and encouraged to contribute. There is even an extensive contribution guide if you are interested.
Read the official announcement from the Open Knowledge Foundation.
Ben Wellington gives an excellent TED Talk on open data. He argues that cities need to make more of an effort to release data in a standardized and machine-readable format. This could help cities be safer and more fiscally responsible. He hopes New York City will set the standard for open data in cities. As a bonus, he is a wonderful storyteller.
Today, February 21, 2015, is Open Data Day.
What is it?
Around the globe, cities are hosting hackathons centered around open data. The rules are fairly open-ended as long as the event is open and uses open data.
Who is it for?
If you want to get involved, check the list of cities hosting Open Data Day events.
If you are looking for some good datasets to use, try Data Sources for Cool Data Science Projects: Part 1 and Part 2.
I am excited for the first-ever guest posts on the Data Science 101 blog. Dr. Michael Li, Executive Director of The Data Incubator in New York City, is providing 2 great posts (see Part 1) about finding data for your next data science project.
At The Data Incubator, we run a free six-week data science fellowship to help our Fellows land industry jobs. Our hiring partners love considering Fellows who don’t mind getting their hands dirty with data. That’s why our Fellows work on cool capstone projects that showcase those skills. One of the biggest obstacles to successful projects has been getting access to interesting data. Here are some more cool public data sources you can use for your next project:
Data With a Cause:
- Environmental Data: Data on household energy usage is available as well as NASA Climate Data.
- Medical and Biological Data: You can get anything from anonymous medical records, to remote sensor readings for individuals, to data on the genomes of 1,000 individuals.
- Geo Data: Try looking at these Yelp Datasets for venues near major universities and one for major cities in the Southwest. The Foursquare API is another good source. Open Street Map has open data on venues as well.
- Twitter Data: You can get access to Twitter data used for sentiment analysis, network Twitter data, and social Twitter data, in addition to their API.
- Games Data: Datasets for games are available, including a large dataset of poker hands, a dataset of online Dominion games, and datasets of chess games.
- Web Usage Data: Web usage data is a common dataset that companies look at to understand engagement. Available datasets include anonymous usage data for MSNBC, Amazon purchase history (also anonymized), and Wikipedia traffic.
Metasources: these are great sources for other web pages.
- Stanford Network Data: http://snap.stanford.edu/index.html
- Every year, the ACM holds a competition for machine learning called the KDD Cup. Their data is available online.
- UCI maintains archives of data for machine learning.
- US Census Data
- Amazon is hosting Public Datasets on S3
- Kaggle hosts machine-learning challenges and many of their datasets are publicly available
- The cities of Chicago, New York, Washington DC, and SF maintain public data warehouses.
- Yahoo maintains a lot of data on its web properties, which can be obtained by writing to them.
- BigML is a blog that maintains a list of public datasets for the machine learning community.
- Finally, if there’s a website with data you are interested in, crawl for it!
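For the last suggestion, here is a minimal sketch of what "crawling for it" can look like once you have a page's HTML in hand. It uses only the Python standard library and pulls the rows out of an HTML table. The sample page, the `TableExtractor` class, and the `extract_rows` helper are all made up for illustration. A real crawl would first download the page (for example with `urllib.request.urlopen`) and should respect the site's terms of use and robots.txt.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Return all table rows found in an HTML string (hypothetical helper)."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up page standing in for a downloaded document.
SAMPLE_PAGE = """
<html><body>
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Springfield</td><td>30,000</td></tr>
  <tr><td>Shelbyville</td><td>25,000</td></tr>
</table>
</body></html>
"""

rows = extract_rows(SAMPLE_PAGE)
header, records = rows[0], rows[1:]
print(header)
for record in records:
    print(record)
```

Real pages are messier than this sample (nested tables, missing closing tags, data loaded by JavaScript), so treat this as a starting point, not a general-purpose scraper.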
While building your own project cannot replicate the experience of a fellowship at The Data Incubator (our Fellows get amazing access to hiring managers and to nonpublic data sources), we hope this will get you excited about working in data science. And when you are ready, you can apply to be a Fellow!
Got any more data sources? Let us know or leave a comment and we’ll add them to the list!
Additional Sources (added via comments since the post was published)