Top 5 Places to Get a Data Scientist Job

  1. LinkedIn: They turn data into products better than anyone else.
  2. Facebook: If you are the type of person who loves to analyze people’s lives, there is no better place.
  3. Twitter: Duh, it’s Twitter. Lots of data and lots of possibilities.
  4. Cloudera: Cloudera is a successful Hadoop-based startup. Build tools and explore huge datasets for a variety of industries.
  5. Kaggle: If optimizing algorithms and diving into the data to extract every last ounce of information is your thing, then Kaggle is it. Plus, there is nowhere else you will get to work on so many important problems in such a wide range of domains. Unfortunately, Kaggle is not currently hiring data scientists, but they will most likely be seeking more in the future.

There are many other companies hiring data scientists. Where would you like to be a data scientist?


Top 5 Data Startups

  1. Kaggle: They make data science a sport. Enough said.
  2. DataKind: DataKind may not technically be a startup because it is a nonprofit, but they are doing cool stuff. They match nonprofit organizations with people who love to analyze data and create visualizations.
  3. Cloudera: They call themselves “The Platform for Big Data”. They are working hard to make Hadoop easier to use.
  4. Coursera: Coursera is an education startup, but with two computer science professors as founders, you can bet they are crunching a lot of data about how people learn.
  5. BigML: They are trying to make machine learning available to everyone. Machine Learning as a Service!

In 2013, Learn Data Science via Coursera (a curriculum)

Coursera has some excellent courses coming up in 2013. Here are some potential curriculum paths for someone looking to learn data science.


Either sequence below requires, or at least recommends, some basic programming experience. If you are unfamiliar with programming, you still have a couple of weeks to get familiar with some basic programming concepts. Good places to start are Coursera’s Computer Science 101 or Codecademy’s Python tutorial.

Data Science Curriculum #1

If you are new to programming, this is the recommended sequence. The first course focuses on programming.

Course                          Start Date      Completion Date
Computing for Data Analysis     Jan. 2, 2013    Jan. 25, 2013
Data Analysis                   Jan. 22, 2013   Mar. 15, 2013
Introduction to Data Science    April 2013      June 2013

Data Science Curriculum #2

Course                                     Start Date     Completion Date
Computational Methods for Data Analysis    Jan. 7, 2013   Mar. 15, 2013
Introduction to Data Science               April 2013     June 2013

Additional Courses

Neither of the Coursera machine learning courses (Stanford or University of Washington) is scheduled for 2013, but either would be a great (maybe necessary) follow-up course. Hopefully, one of them will start in July or shortly thereafter.

After completing one of the above sequences combined with a machine learning course, a person should be skilled enough to begin doing useful data science work. (Note: A new job as a data scientist is not guaranteed, but the courses won’t hurt your chances.) Plus, Coursera offers numerous other classes that could be taken at a later time to increase depth in certain areas of data science (Natural Language Processing, Image Processing, and more).

Happy Learning in 2013!

If you are interested in more ways to learn data science, please check out Data Science 201, coming in 2013.

2 Recently Released Open Source Graph-Related Projects

  1. GraphBuilder

    Intel Labs built a tool for constructing mathematical graphs out of large datasets. It is Java-based and works with Hadoop and MapReduce. Intel has released a whitepaper explaining more about GraphBuilder. The code is available on GitHub. A big thanks to Mark Nickel for pointing out this project.

  2. ArangoDB

    ArangoDB is a flexible NoSQL database. It is a document database with the ability to add edges between documents, so it can also serve as a graph database. I had a fun time playing around with the online tutorial and demo. ArangoDB also claims to work as a key/value store. The code is available on GitHub.
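
To make the "documents plus edges" idea concrete, here is a minimal sketch in plain Python. It does not use ArangoDB or its API; the collections, field names, and helper functions are all made up for illustration. It only shows how the same data can be read as key/value pairs, as documents, and as a graph once edges are stored alongside the documents.

```python
from collections import deque

# Hypothetical toy data: a dict standing in for a document collection.
# Looking up by key is the key/value view; the values are the documents.
people = {
    "alice": {"name": "Alice", "city": "Berlin"},
    "bob": {"name": "Bob", "city": "Cologne"},
    "carol": {"name": "Carol", "city": "Munich"},
}

# Edges are ordinary documents that also record where they start and end.
# The "from"/"to" field names are an assumption made for this sketch.
knows = [
    {"from": "alice", "to": "bob", "since": 2010},
    {"from": "bob", "to": "carol", "since": 2012},
]


def neighbors(key, edges):
    """Return the keys reachable from `key` by following one edge."""
    return [edge["to"] for edge in edges if edge["from"] == key]


def reachable(start, edges):
    """Breadth-first traversal: every key reachable from `start`."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        current = queue.popleft()
        for nxt in neighbors(current, edges):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# Document access and graph traversal over the same data.
print(people["alice"]["city"])
print(reachable("alice", knows))
```

A real database would index the edges instead of scanning the whole list on every lookup, but the data model is the same: once documents can point at each other, traversal questions such as "who can Alice reach?" become possible.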

I think the information is interesting, but I also think the charts do a good job of telling the story.

The Bayesian Observer

The average age of first time mothers in the developed countries of the world has been rising for the last ~40 years.

Here is another plot that shows the rate of occurrence of Down Syndrome, a chromosomal defect, as a function of the age of the mother at the time of child birth.

The curve really starts to shoot up at 30. In the UK, the average age of a first time mother is 30 years. It is well known that the fertility rate in women decreases after the age of 30 and drops rapidly after 35. Older mothers are likely to find it harder to have a baby and if they do, then they run a higher risk of chromosomal defects. Given the possibilities of all these negative consequences, the increase in the average age is a bit disturbing. It seems like there is a hidden cost to more women…


Explain Data Science to Anyone

When telling friends and family that I blog about data science, I am frequently asked to explain more. I usually respond with an answer similar to this:

You know the world is generating huge amounts of data every day due to financial transactions, medical records, social networks, and other internet uses. Data Science aims to make better decisions based upon that data. Here are some possibilities. What type of people buy TVs in October? Which patients will get better with this new drug? Who are some other people that you probably already know?

Data Science is all about answering these types of questions with real data instead of assumptions.

I think this explanation could use some refinement. What am I leaving out? What should I remove? How do you explain data science to other people (preferably non-technical or non-data people)?