How To Learn Data Science? Part 2

Yesterday, I posted about some traditional strategies to acquire data science skills. Today, I will post a nontraditional strategy.

Internet Based

There is hoards of data science information available on the internet for free. With enough personal motivation, a person could learn all the skills necessary for free (or cheap) online. Coursera is probably a great place to start. There are also other good sites such as Udacity, the Kaggle Wiki, other blogs and websites.

The problem with this approach is knowing exactly what to learn. A course in machine learning is great, but data science is more than just machine learning. How do you know what to learn? It would be really nice to have a collection of data science topics and the associated online training materials.

Would this strategy work for you?


How To Learn Data Science?

Based upon the popularity of a previous post about a certificate program from the University of Washington, it appears that many people are interested in learning the skills necessary to become a data scientist. Thus, I decided to compile a list of some of the possible learning strategies.

Traditional College Education

The most obvious path would be to study at a traditional college or university. Colleges and universities are starting to notice the demand for data science skills, and many colleges are currently offering programs to prepare someone as a data scientist. This path is safe and predictable. Do the homework, complete the courses, and get the degree or certificate. Most people are familiar with the process, and it offers few surprises. The problems here are the costs, lack of flexibility, and time involved.

Corporate Training

Companies are now starting to offer training programs for data science. EMC is leading the way in this category with their data science training program. Cloudera also offers lots of training related to hadoop and big data. Wolfram offers data science training with Mathematica. One of the problems with this category is the cost. Another problem is the companies have the tendency to teach and promote their own products. This may leave the student with numerous gaps in the full data science spectrum.

Your Thoughts?

What are you thoughts about the above approaches? What are the positives and negatives? Also, later this week I will be posting some less-traditional approaches to learning data science.

Databases (NoSQL and Relational) are getting more interesting everyday.


With Facebook (s fb) engineers, it appears the high-performance database apple doesn’t far fall from the tree. On Monday, former Facebookers Eric Frenkiel and Nikita Shamgunov (who also spent six years as a senior engineer on Microsoft (s msft) SQL Server) launched a startup called MemSQL that seeks to speed relational databases by taking a page out of the Facebook playbook. The company has raised a $5 million in venture capital thus far from First Round Capital, IA Ventures, NEA, SV Angel, Y Combinator, Paul Buchheit, Ashton Kutcher, Max Levchin and Aaron Levie.

As its name implies, MemSQL achieves its fast performance in part by keeping data in memory, but it doesn’t use memcached like Facebook does to keep its massive MySQL deployment up to speed. Rather, MemSQL takes a lesson learned from HipHop — Facebook’s tool for converting PHP code into faster C++ — and converts SQL to C++.


View original post 258 more words

Profile Of A Data Scientist – Interview

Visit this excellent video interview with David Dietrich, creator of EMC’s data science curriculum. He talks about his experience helping people transition to becoming data scientists.
David lays out a list of 5 traits of a data scientist.

  • Quantitative
  • Technical
  • Skeptical
  • Communication and Collaboration
  • Creative and Curiosity

For a diagram of these 5 traits, see this brief writeup about the profile of a data scientist. Also, see the slides of his latest talk at EMC World 2012.

**Note: I removed the embedded video because it was set to automatically play the video

Kaggle Launches New Products

If you follow the blog, you probably know I am a big fan of Kaggle. Just last week, they announced the launch of 2 new products.

  1. Kaggle Recruit In this competition, the participants are not competing for a cash prize but rather a job interview with a specific company. Currently, Facebook is hosting the first such competition.
  2. Kaggle Prospect In this competition, the participants are trying to come up with the best question to ask. Participants are presented with various related datasets, and the goal is to find which data science question should be asked of the data. The winner gets a small cash prize, and the winning question becomes a regular kaggle competition.

What do you think? Are you excited to try out these new competitions?

New Data Science Certificate Program

Starting in the fall of 2012, the University of Washington will be offering a certificate in Data Science. The program has two sections: one located in Seattle and the other online. The certificate consists of three separate courses each lasting approximately 3 months. Thus the program can be completed in 9 months, and the cost is around $3000.

There are some information sessions later this summer. If you are in Seattle, there is an information session on July 19. If you are interested in the online program, a webinar is scheduled for August 29.

The program content looks quite good. Some of the topics to be covered include: hadoop, NoSQL, machine learning, statistics, graph algorithms, and more. If you are looking to become a data scientist, this just might be the program for you.

I also added this certificate program to my list of College’s offering data science degrees.