This list is not intended to map to a set of college courses. It is a listing of the skills a data scientist needs. For a definition of data scientist, see this previous post.
- Calculus – not used directly in data science, but needed to understand the statistics and machine learning
- Matrix Operations
- Regression – Linear and Logistic
- Bayesian Statistics
- R – stats
- Octave – machine learning
- Basic Programming – Java, C/C++, and Python seem to be good language choices
- Machine Learning
- Database Knowledge – not limited to just relational databases
- Data Visualization – how to make data look good: maps, graphs, etc.
- Presentation – storytelling and being comfortable explaining data to others
Is there anything you would add to or remove from the list?
OpenIntro is an organisation that was started to create a free and open source introductory statistics textbook. The book is available as a free PDF download, or it can be purchased in paperback from Amazon for less than $10. If you want to learn statistics or need a little refresher, check it out.
Previously I mentioned that online statistics learning resources are not abundant.
Well, here is a new online book for learning statistics. It is geared towards programmers, and it looks to be a great fit for people wanting to learn data science. Here is a small excerpt from the Preface:
It emphasizes the use of statistics to explore large datasets.
I have only had time to quickly browse the book, but it looks quite good.
Think Stats: Probability and Statistics for Programmers
(The book has a Creative Commons license, so it is free and OK to download)
The tagline for Kaggle is “We’re making data science a sport.” They have successfully created a way to turn data science into a competition. It is fun, and it yields excellent results. There is also a portion of the site dedicated to classroom use, called Kaggle in Class.
Here is how it works. A company that needs some data analyzed can contact Kaggle and host a competition. Data scientists all over the world then compete to find the best solution. The company benefits from having many algorithms and techniques applied to the same data set, far more than it could run on its own. The contestants benefit from networking, pre-cleaned data, and learning from others. It is a win-win situation. Plus, the winner gets prize money.
Currently, the large featured competition is the Heritage Health Prize. It is a $3,000,000 competition to identify individuals who will be admitted to a hospital in the next year. The contest lasts until April 2013.
This is definitely a site I want to be involved with in the future. I just wish they could make it a spectator sport as well.
Statistics – This is a topic that could use some more attention from the online community.
I would love to see Stanford (or Coursera) offer a free online statistics course, much like the other free courses already available online.
I did find a series of YouTube videos by Daniel Judge, a professor in the East Los Angeles College Mathematics Department. The videos start at the very beginning of statistics. I have watched a couple of them, and they seem quite good. Daniel does a nice job of explaining the material. Here is the first video in the series.
Stay tuned to the blog in case other stats options appear online. Also, please leave a comment if you know of some good online statistics resources.