Tag Archives: opendata

Open Data Day 2015

Today, February 21, 2015, is Open Data Day.

What is it?

Around the globe, cities are hosting hackathons centered around open data. The rules are fairly open-ended as long as the event is open and uses open data.

Who is it for?

  • Designers
  • Developers
  • Statisticians
  • Librarians
  • Citizens

If you want to get involved, check the list of cities hosting Open Data Day events.

If you are looking for some good datasets to use, try Data Sources for Cool Data Science Projects: Part 1 and Part 2.

Data Sources for Cool Data Science Projects: Part 2 – Guest Post


I am excited for the first-ever guest posts on the Data Science 101 blog. Dr. Michael Li, Executive Director of The Data Incubator in New York City, is providing two great posts (see Part 1) about finding data for your next data science project.

At The Data Incubator, we run a free six-week data science fellowship to help our Fellows land industry jobs. Our hiring partners love considering Fellows who don’t mind getting their hands dirty with data. That’s why our Fellows work on cool capstone projects that showcase those skills. One of the biggest obstacles to successful projects has been getting access to interesting data. Here are some more cool public data sources you can use for your next project:

Data With a Cause:

  1. Environmental Data: Data on household energy usage is available as well as NASA Climate Data.
  2. Medical and Biological Data: You can get anything from anonymous medical records, to remote sensor readings for individuals, to data on the genomes of 1000 individuals.

Miscellaneous:

  1. Geo Data: Try looking at these Yelp Datasets for venues near major universities and one for major cities in the Southwest. The Foursquare API is another good source. Open Street Map has open data on venues as well.
  2. Twitter Data: On top of the Twitter API, you can get access to Twitter data used for sentiment analysis, network Twitter data, and social Twitter data.
  3. Games Data: Datasets for games are available, including a large dataset of poker hands, a dataset of online Dominion games, and datasets of chess games.
  4. Web Usage Data: Web usage data is a common dataset that companies look at to understand engagement. Available datasets include Anonymous usage data for MSNBC, Amazon purchase history (also anonymized), and Wikipedia traffic.

Metasources: these are great sources that point to many other datasets.

  1. Stanford Network Data: http://snap.stanford.edu/index.html
  2. Every year, the ACM holds a competition for machine learning called the KDD Cup. Their data is available online.
  3. UCI maintains archives of data for machine learning.
  4. US Census Data
  5. Amazon is hosting Public Datasets on S3
  6. Kaggle hosts machine-learning challenges and many of their datasets are publicly available
  7. The cities of Chicago, New York, Washington DC, and SF maintain public data warehouses.
  8. Yahoo maintains a lot of data on its web properties, which can be obtained by writing to them.
  9. BigML is a blog that maintains a list of public datasets for the machine learning community.
  10. Finally, if there’s a website with data you are interested in, crawl for it!
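
A minimal sketch of that last suggestion, assuming the page you care about simply links to its data files. It uses only the Python standard library and runs against an inline sample page, so the URLs, file names, and the `data_links` helper are all made up for illustration. A real crawl would fetch pages over the network and should respect the site's robots.txt and terms of use.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in `html`, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def data_links(html, base_url, suffixes=(".csv", ".json")):
    """Keep only links that look like downloadable data files."""
    return [u for u in extract_links(html, base_url) if u.lower().endswith(suffixes)]


# Hypothetical page content; a real crawl would download this instead.
SAMPLE_HTML = """
<html><body>
  <a href="/files/crime.csv">Crime data</a>
  <a href="about.html">About</a>
  <a href="https://example.org/data/trips.JSON">Trips</a>
</body></html>
"""

if __name__ == "__main__":
    for url in data_links(SAMPLE_HTML, "https://example.org/open/index.html"):
        print(url)
```

Filtering by file extension is only a rough heuristic; many sites serve data through APIs or download buttons that this sketch would miss.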

 

While building your own project cannot replicate the experience of a fellowship at The Data Incubator (our Fellows get amazing access to hiring managers and to nonpublic data sources), we hope this will get you excited about working in data science. And when you are ready, you can apply to be a Fellow!

Got any more data sources? Let us know or leave a comment and we’ll add them to the list!

 

Additional Sources (added via comments since the post was published)

Data Sources for Cool Data Science Projects: Part 1 – Guest Post


I am excited for the first-ever guest posts on the Data Science 101 blog. Dr. Michael Li, Executive Director of The Data Incubator in New York City, is providing two great posts (see Part 2) about finding data for your next data science project.

At The Data Incubator, we run a free six-week data science fellowship to help our Fellows land industry jobs. Our hiring partners love considering Fellows who don’t mind getting their hands dirty with data. That’s why our Fellows work on cool capstone projects that showcase those skills. One of the biggest obstacles to successful projects has been getting access to interesting data. Here are a few cool public data sources you can use for your next project:

Economic Data:

  1. Publicly Traded Market Data: Quandl is an amazing source of finance data. Google Finance and Yahoo Finance are additional good sources of data. Corporate filings with the SEC are available on EDGAR.
  2. Housing Price Data: You can use the Trulia API or the Zillow API.
  3. Lending data: You can find student loan defaults by university and the complete collection of peer-to-peer loans from Lending Club and Prosper, the two largest platforms in the space.
  4. Home mortgage data: There is data made available by the Home Mortgage Disclosure Act and there’s a lot of data from the Federal Housing Finance Agency available here.

Content Data:

  1. Review Content: You can get reviews of restaurants and physical venues from Foursquare and Yelp (see geodata). Amazon has a large repository of Product Reviews. Beer reviews from Beer Advocate can be found here. Rotten Tomatoes Movie Reviews are available from Kaggle.
  2. Web Content: Looking for web content? Wikipedia provides dumps of their articles. Common Crawl has a large corpus of the internet available. ArXiv maintains all their data available via Bulk Download from AWS S3. Want to know which URLs are malicious? There’s a dataset for that. Music data is available from the Million Songs Database. You can analyze the Q&A patterns on sites like Stack Exchange (including Stack Overflow).
  3. Media Data: There are open annotated articles from the New York Times, the Reuters Dataset, and the GDELT project (a consolidation of many different news sources). Google Books has published NGrams for books going back past 1800.
  4. Communications Data: There’s access to public messages of the Apache Software Foundation and communications among former Enron execs.

Government Data:

  1. Municipal Data: Crime Data is available for the City of Chicago and Washington DC. Restaurant Inspection Data is available for Chicago and New York City.
  2. Transportation Data: NYC Taxi Trips in 2013 are available courtesy of the Freedom of Information Act. There’s bikesharing data from NYC, Washington DC, and SF. There’s also Flight Delay Data from the FAA.
  3. Census Data: Japanese Census Data. US Census data from 2010, 2000, 1990. From census data, the government has also derived time use data. EU Census Data. Check out popular male / female baby names going back to the 19th Century from the Social Security Administration.
  4. World Bank: they have a lot of data available on their website.
  5. Election Data: Political contribution data for the last few US elections can be downloaded from the FEC here and here. Polling data is available from Real Clear Politics.

 

While building your own project cannot replicate the experience of a fellowship at The Data Incubator (our Fellows get amazing access to hiring managers and to nonpublic data sources), we hope this will get you excited about working in data science. And when you are ready, you can apply to be a Fellow!

Got any more data sources? Let us know or leave a comment and we’ll add them to the list!

 

List of Over 200 Data Science College Programs

My previous list of Colleges with Data Science Degrees has grown very large, and numerous people have requested the ability to sort and/or filter. Thus, I built a new list. It is available at: Data Science Colleges. As far as I know, this is the most comprehensive list of data science programs available. Here are some of the features it offers:

  • Over 200 Programs
  • Certificate, Bachelors, Masters, and Doctorate programs included
  • Sort and Filter Programs
  • US and International
  • Program Name
  • Location
  • Online Programs
  • Ability to download the raw data as CSV or JSON

Yes, you read that last one correctly. All the data is freely available for you. If you do use the data for something, I would love to know and potentially blog about it.
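
As a rough sketch of what you could do with the CSV download, here is how the sort and filter features might be reproduced with the Python standard library. The column names and rows below are invented for illustration, so check the header of the actual file before relying on them.

```python
import csv
import io
import json

# Hypothetical rows; the actual column names in the download may differ.
SAMPLE_CSV = """school,program,degree,location,online
Example University,MS in Data Science,Masters,US,yes
Sample College,Certificate in Analytics,Certificate,US,no
Demo Institute,PhD in Data Science,Doctorate,International,no
Test University,MS in Analytics,Masters,International,yes
"""


def load_programs(text):
    """Parse CSV text into a list of dicts, one per program."""
    return list(csv.DictReader(io.StringIO(text)))


def filter_programs(programs, **criteria):
    """Keep programs whose fields equal every given criterion."""
    return [p for p in programs if all(p.get(k) == v for k, v in criteria.items())]


def sort_programs(programs, key):
    """Return programs sorted by one column."""
    return sorted(programs, key=lambda p: p[key])


if __name__ == "__main__":
    programs = load_programs(SAMPLE_CSV)
    online_masters = filter_programs(programs, degree="Masters", online="yes")
    # Print the result as JSON, the other format the list offers.
    print(json.dumps(sort_programs(online_masters, "school"), indent=2))
```

With a real download you would replace `SAMPLE_CSV` with the contents of the file, for example by opening it with `open(path, newline="")` and passing the file object to `csv.DictReader` directly.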

The list will continue to evolve. If you find any broken links or missing programs, please leave a comment. Also, please leave a comment if you can think of ways to improve the list.

Stanford Releases Large Network Datasets

Stanford University has just released a collection of large datasets of network data. When I say network data, I am referring to networks in the mathematical sense (think of a collection of nodes and edges). Here are just a few of the possible categories.

  • Citation Networks
  • Road Networks
  • Web graphs
  • Social Networks such as Twitter
  • and many more

If you are looking to study network data, or just want some practice analyzing big data, this just might be a good place to start.
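
To make the nodes-and-edges idea concrete, here is a small sketch that turns an edge list into an adjacency map and counts each node's degree. The sample data is made up, and the layout (one whitespace-separated edge per line, with "#" comment lines) is an assumption you should verify against whichever file you download.

```python
from collections import defaultdict

# Tiny made-up edge list; real files are far larger and their layout
# should be checked before parsing.
SAMPLE_EDGES = """# FromNodeId ToNodeId
0 1
0 2
1 2
2 3
"""


def read_edges(text):
    """Yield (source, target) integer pairs, skipping comments and blanks."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        source, target = line.split()[:2]
        yield int(source), int(target)


def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbours."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


def degrees(adjacency):
    """Return node -> number of neighbours."""
    return {node: len(neighbours) for node, neighbours in adjacency.items()}


if __name__ == "__main__":
    graph = build_adjacency(read_edges(SAMPLE_EDGES))
    for node, degree in sorted(degrees(graph).items()):
        print(node, degree)
```

The sketch treats every edge as undirected. For a directed dataset such as a web graph or citation network, you would add only `adjacency[source].add(target)` and track in-degree and out-degree separately.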

An Organization for Opendata and Healthcare

Health Data Consortium is an advocacy group focused on helping the healthcare industry respond to the availability of health data. They are currently focused on innovation and the uses of open health data.

Healthcare is currently undergoing some radical changes, and data science is going to play a key role in the future of healthcare. It is great to see the medical field building an official group to define the practice. I hope other industries will follow the lead of the medical field and begin forming their own groups around open data. I am eager to see how the Health Data Consortium progresses over the coming months and years.

Twitter Open Data Grants

Twitter has just announced the idea of a Data Grant. You have to log in with your Twitter account to see the details. The gist is: Twitter will provide you with historical Twitter data for research purposes.

What could you do with this data?

US Government Open Data Initiative

The United States White House has developed Project Open Data to encourage government agencies to share and produce opendata.
According to the Project:

The White House developed Project Open Data – this collection of code, tools, and case studies – to help agencies adopt the Open Data Policy and unlock the potential of government data.

Here are some other materials in the Project.

  • Standards
  • Open Data Licenses
  • Tools
  • Data Formats
  • Case Studies
  • Other Resources

Best of all, the entire project is available on GitHub and contributions are welcomed.

I would love to see other organizations start to do similar things. Some organizations, such as Rackspace, have created support around open source coding projects. However, I am unaware of organizations doing things to standardize and encourage the sharing of opendata.


Does anyone know of any organizations working to create policies for releasing opendata?
If so, please leave a comment.

Open Data Action Plan for France

This information goes along with the post last week, Open Data Could Be Worth $5.4 Trillion Annually. Just last week France released an action plan for open data. Honestly, I have not read the full report, but it is great to see a government create such a plan. See the full report below.

Open Data Could Be Worth $5.4 Trillion Annually

Michael Chui of McKinsey Global Institute provided some clear insights about the benefits of opendata. Here are the 4 characteristics of open data provided by Chui:

  1. Access by Everyone
  2. Formatted for Easy Reading by a Computer
  3. Free (no cost)
  4. Unlimited Rights to redistribute and reuse

Chui also describes how an organization can get the most from its open data. It is not enough to just make the data available; the organization must provide an ecosystem focused around the open data. Here are some of the strategies he discussed.

  • Identify and Prioritize the Correct Data to open
  • Get Developers/Data Scientists (internal/external) Playing with the data
  • Talent
  • Privacy/Policy Issues
  • Platforms & Standards along with metadata

He also mentions the potential economic benefits of open data, ranging from $3.2 trillion to $5.4 trillion annually. For more information on open data, see the latest Report from McKinsey Global about Open Data and/or watch the video below.