3 Questions When Choosing a Data Science Program

I was honored to write a guest blog post for Master’s in Data Science. The site contains a very detailed list of graduate programs in data science. The post I authored is titled:

3 Questions to Ask Before Choosing a Data Science Program

Not to ruin the post, but the 3 questions are:

  1. What is my background?
  2. What are my goals?
  3. Does location matter?

Head on over to Master’s in Data Science to see all the details about why those are 3 important questions.

3 Great Data Science Books You Can Read Now…for free

Just this week, I have become aware of 3 free online books for data science.

Data Visualization with JavaScript

If you are looking for a tutorial to teach you how to make wonderful visualizations on the web, look no further. Data Visualization with JavaScript is a free online book for learning data visualization with JavaScript. It provides tons of examples and step-by-step instructions for creating graphs, charts, and other visualizations. Here is a quick list of the topics:

  • Graphs
  • D3.js
  • Interactive Charts
  • Geographic Plots
  • Timelines

Frontiers in Massive Datasets

Frontiers in Massive Datasets is a report all about how science, business, communications, national security and others need to learn to handle massive amounts of data. Whether the data has been sitting in a database for years or it is now just screaming into the systems, massive data is now a problem for almost every industry. This report covers many of the topics that need to be addressed when dealing with big data. Here is a very brief overview of the topics:

  • Limitations
  • Sampling
  • Building Models from Massive Data
  • Real-time Algorithms
  • 7 Computational Giants of Massive Data Analysis

Foundations of Data Science

Foundations of Data Science is a draft of a textbook written by John Hopcroft and Ravindran Kannan. It is intended to be a text for computer science with more emphasis on probability and statistics than on discrete mathematics. The authors argue that knowing how to work with data is a necessary skill for computer scientists of the future. This is clearly the most technical and academic of the 3 books, but if that is your thing, you should really enjoy browsing through this book. Here are some of the topics:

  • High-Dimensional Space
  • Clustering
  • Algorithms for Massive Data Problems
  • Singular Value Decomposition
  • Graphical Models

Strata + Hadoop World 2014 Videos

John Rauser from Pinterest gives one of the more popular talks from the recent Strata Conference + Hadoop World. The following quote from his talk might pique your interest enough to get you to watch the entire video. Remember, he is speaking to a room with some of the leading data scientists in the world.

Many of the people in this audience are faking it….when it comes to statistics

Many other keynotes, talks, and interviews from Strata + Hadoop World are available on the YouTube playlist.

Open Source Distributed Analytics Engine with SQL interface and OLAP on Hadoop by eBay – Kylin

Ryan Swanstrom:

Very promising open-source BI tool for use with Hadoop.

Originally posted on Big Data Analytics, Data Visualization and Infographics:

What is Kylin?

  • Kylin is an open source Distributed Analytics Engine with SQL interface and multi-dimensional analysis (OLAP) to support extremely large datasets on Hadoop by eBay.

Key Features:

  • Extremely Fast OLAP Engine at Scale:
    • Kylin is designed to reduce query latency on Hadoop for more than 10 billion rows of data
  • ANSI-SQL Interface on Hadoop:
    • Kylin offers ANSI-SQL on Hadoop and supports most ANSI-SQL query functions
  • Interactive Query Capability:
    • Users can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries for the same dataset
  • MOLAP Cube:
    • Users can define a data model and pre-build it in Kylin with more than 10 billion raw data records
  • Seamless Integration with BI Tools:
    • Kylin currently offers integration capability with BI Tools like Tableau.
  • Other Highlights:
    • Job Management and Monitoring
    • Compression and Encoding Support
    • Incremental Refresh of Cubes
    • Leverages HBase Coprocessor to reduce query latency
    • Approximate Query Capability for distinct count (HyperLogLog)
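
To make the MOLAP cube idea above a little more concrete, here is a minimal Python sketch of pre-aggregation: sums are computed once for every combination of dimensions, and a query then becomes a dictionary lookup instead of a scan over the raw rows. This is only an illustration of the general technique, not how Kylin is implemented. The function names (build_cube, query) and the sample columns (site, category, sales) are made up for this example.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate the sum of `measure` for every subset of `dimensions`.

    Illustrative only: a real engine would build and store cuboids in a
    distributed way rather than in a single in-memory dict.
    """
    dimensions = sorted(dimensions)  # fixed order so lookup keys are predictable
    cube = defaultdict(float)
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            for row in rows:
                key = (dims, tuple(row[d] for d in dims))
                cube[key] += row[measure]
    return dict(cube)


def query(cube, **filters):
    """Answer a sum query with a single lookup in the pre-built cube."""
    dims = tuple(sorted(filters))
    return cube.get((dims, tuple(filters[d] for d in dims)), 0.0)


# Hypothetical sample data, invented for this sketch.
rows = [
    {"site": "US", "category": "toys", "sales": 10.0},
    {"site": "US", "category": "books", "sales": 5.0},
    {"site": "DE", "category": "toys", "sales": 7.0},
]

cube = build_cube(rows, ["site", "category"], "sales")
us_total = query(cube, site="US")
toys_total = query(cube, category="toys")
grand_total = query(cube)
```

The trade-off is the usual one for pre-built cubes: the build step costs time and storage up front (the number of dimension combinations grows quickly), and in exchange each query avoids touching the raw records.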

Data Sources for Cool Data Science Projects: Part 2 – Guest Post


I am excited for the first-ever guest posts on the Data Science 101 blog. Dr. Michael Li, Executive Director of The Data Incubator in New York City, is providing 2 great posts (see Part 1) about finding data for your next data science project.

At The Data Incubator, we run a free six-week data science fellowship to help our Fellows land industry jobs. Our hiring partners love considering Fellows who don’t mind getting their hands dirty with data. That’s why our Fellows work on cool capstone projects that showcase those skills. One of the biggest obstacles to successful projects has been getting access to interesting data. Here are some more cool public data sources you can use for your next project:

Data With a Cause:

  1. Environmental Data: Data on household energy usage is available as well as NASA Climate Data.
  2. Medical and Biological Data: You can get anything from anonymous medical records, to remote sensor readings for individuals, to data on the genomes of 1000 individuals.

Miscellaneous:

  1. Geo Data: Try looking at these Yelp Datasets for venues near major universities and one for major cities in the Southwest. The Foursquare API is another good source. Open Street Map has open data on venues as well.
  2. Twitter Data: You can get access to Twitter data used for sentiment analysis, network Twitter data, and social Twitter data, in addition to Twitter's API.
  3. Games Data: Datasets for games are available, including a large dataset of poker hands, a dataset of online Dominion games, and datasets of chess games.
  4. Web Usage Data: Web usage data is a common dataset that companies look at to understand engagement. Available datasets include anonymous usage data for MSNBC, Amazon purchase history (also anonymized), and Wikipedia traffic.

Metasources: these are great sources for other web pages.

  1. Stanford Network Data: http://snap.stanford.edu/index.html
  2. Every year, the ACM holds a competition for machine learning called the KDD Cup. Their data is available online.
  3. UCI maintains archives of data for machine learning.
  4. US Census Data
  5. Amazon is hosting Public Datasets on S3
  6. Kaggle hosts machine-learning challenges and many of their datasets are publicly available
  7. The cities of Chicago, New York, Washington DC, and SF maintain public data warehouses.
  8. Yahoo maintains a lot of data on its web properties, which can be obtained by writing to them.
  9. BigML is a blog that maintains a list of public datasets for the machine learning community.
  10. Finally, if there’s a website with data you are interested in, crawl for it!

 

While building your own project cannot replicate the experience of a fellowship at The Data Incubator (our Fellows get amazing access to hiring managers and to nonpublic data sources), we hope this will get you excited about working in data science. And when you are ready, you can apply to be a Fellow!

Got any more data sources? Let us know or leave a comment and we’ll add them to the list!

 

Additional Sources (added via comments since the post was published)
