This is not intended to be mapped to a set of college courses. It is intended to be a listing of necessary skills for a data scientist. For a definition of data scientist, see this previous post.
- Calculus – not directly important to data science, but needed to understand the statistics and machine learning
- Matrix Operations
- Regression – Linear and Logistic
- Bayesian Statistics
- R – stats
- Octave – machine learning
- Basic Programming – Java, C/C++, and Python seem to be good language choices
- Machine Learning
- Database Knowledge – not limited to just relational databases
- Data Visualization – how to make data look good: maps, graphs, etc
- Presentation – storytelling; being comfortable explaining data to others
Do you have anything to add/remove from the list?
6 Machine Learning Algorithms
This post provides a nice, quick overview of 6 machine learning algorithms.
- Decision Trees
- Linear Regression
- Neural Networks
- Bayesian Networks
- Support Vector Machines (SVMs)
- Nearest Neighbor
Max Lin of Google Research provides a great slide deck. The title is self-explanatory.
A few days ago, I mentioned that the Stanford Machine Learning class will be starting soon. I thought I should quickly mention some of the topics covered. The list also serves as a great outline for machine learning.
In supervised learning, one has a set of data with features and labels.
- Linear Regression – one/multiple variables
- Gradient Descent – a general algorithm for minimizing a function
- Logistic Regression – This is useful for predicting classification-type results, such as a yes-or-no answer. Does the patient have cancer? Will the customer buy my new product? It can also handle more than 2 outcomes. What color will a person choose (red, blue, green, silver)?
- Neural Networks – A learning algorithm that is modeled after the brain. Think of neurons.
- Support Vector Machines
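To make the gradient descent item above more concrete, here is a minimal sketch that fits a one-variable linear regression by repeatedly stepping against the gradient of the mean squared error. The function name, learning rate, iteration count, and toy data are all assumptions chosen for illustration; they are not taken from the class.

```python
def gradient_descent(xs, ys, learning_rate=0.05, iterations=5000):
    """Fit y = intercept + slope * x by batch gradient descent.

    The cost being minimized is (1 / 2n) * sum of squared errors.
    learning_rate and iterations are illustrative values, not tuned ones.
    """
    intercept, slope = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        # Prediction error for every training example under the current fit.
        errors = [intercept + slope * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost with respect to each parameter.
        grad_intercept = sum(errors) / n
        grad_slope = sum(e * x for e, x in zip(errors, xs)) / n
        # Step downhill, updating both parameters from the same errors.
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope


# Toy data generated from y = 1 + 2x, so the fit should recover those values.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
intercept, slope = gradient_descent(xs, ys)
print(round(intercept, 3), round(slope, 3))
```

If the learning rate is too large the updates overshoot and diverge; if it is too small, many more iterations are needed.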
In unsupervised learning, one has a set of data with features but no labels. Can some structure be found for the data?
- Clustering – The most popular technique is K-means.
- PCA (Principal Components Analysis) – reduces the number of features, which can speed up a learning algorithm
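As a rough illustration of the clustering item above, here is a minimal K-means sketch in plain Python. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The function name, the seeded random initialization, and the toy points are assumptions made for this example.

```python
import random


def k_means(points, k, iterations=100, seed=0):
    """Group points into k clusters; returns (centroids, clusters)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points picked at random.
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            distances = [
                sum((a - b) ** 2 for a, b in zip(point, centroid))
                for centroid in centroids
            ]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its cluster.
        for index, cluster in enumerate(clusters):
            if cluster:  # an empty cluster keeps its previous centroid
                centroids[index] = tuple(
                    sum(coords) / len(cluster) for coords in zip(*cluster)
                )
    return centroids, clusters


# Two visibly separate groups of unlabeled 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))
```

No labels are passed in; the grouping comes only from the distances between points.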
The anomaly detection section covers methods to determine if data is bad. Bad data is considered an anomaly.
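One common way to flag bad data, assumed here for illustration, is to fit a Gaussian (normal) distribution to known-good values and treat anything with a very low probability density as an anomaly. The function names, the sample readings, and the threshold `epsilon` below are all hypothetical.

```python
import math


def fit_gaussian(values):
    """Estimate the mean and variance of known-good values."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, variance


def density(value, mean, variance):
    """Gaussian probability density of value under the fitted model."""
    exponent = -((value - mean) ** 2) / (2 * variance)
    return math.exp(exponent) / math.sqrt(2 * math.pi * variance)


# Hypothetical sensor readings that are all considered normal.
normal_readings = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]
mean, variance = fit_gaussian(normal_readings)

# Hypothetical cutoff; in practice it would be tuned on labeled examples.
epsilon = 0.01


def is_anomaly(value):
    """Flag a value whose density falls below the cutoff."""
    return density(value, mean, variance) < epsilon


print(is_anomaly(10.1), is_anomaly(12.0))
```

A reading close to the others gets a high density and passes, while a reading far from them gets a density near zero and is flagged.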
Like the name says, recommender systems are used to make recommendations. Companies like Netflix use recommender systems to recommend new movies to customers. LinkedIn also recommends people to connect with. This is a fairly hot topic in the tech world right now.
- Content Based (Features)
- Modified Linear Regression
- Non-content Based (No Features)
- Collaborative Filtering
- Matrix Factorization
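To sketch how the non-content based approach can work, the example below factors a small table of known ratings into user and item factor vectors, then uses their dot product to estimate a rating nobody has given yet. This is one simple variant of matrix factorization trained with stochastic gradient descent; the ratings, function names, and hyperparameters are all invented for illustration.

```python
import random


def factorize(ratings, n_users, n_items, n_factors=2, learning_rate=0.01,
              regularization=0.02, epochs=5000, seed=0):
    """Learn user and item factors from a dict of {(user, item): rating}."""
    rng = random.Random(seed)
    user_factors = [[rng.uniform(0.1, 1.0) for _ in range(n_factors)]
                    for _ in range(n_users)]
    item_factors = [[rng.uniform(0.1, 1.0) for _ in range(n_factors)]
                    for _ in range(n_items)]
    for _ in range(epochs):
        for (user, item), rating in ratings.items():
            predicted = sum(p * q for p, q in
                            zip(user_factors[user], item_factors[item]))
            error = rating - predicted
            for f in range(n_factors):
                p, q = user_factors[user][f], item_factors[item][f]
                # Reduce the error on this rating, with a small penalty
                # that keeps the factor values from growing too large.
                user_factors[user][f] += learning_rate * (error * q - regularization * p)
                item_factors[item][f] += learning_rate * (error * p - regularization * q)
    return user_factors, item_factors


# Hypothetical ratings on a 1-5 scale: 4 users, 3 movies, some pairs unrated.
ratings = {
    (0, 0): 5, (0, 1): 4,
    (1, 0): 4, (1, 2): 1,
    (2, 1): 1, (2, 2): 5,
    (3, 0): 1, (3, 2): 4,
}
user_factors, item_factors = factorize(ratings, n_users=4, n_items=3)


def predict(user, item):
    """Estimated rating: dot product of the user and item factors."""
    return sum(p * q for p, q in zip(user_factors[user], item_factors[item]))


# Estimate for a movie that user 0 has not rated.
print(round(predict(0, 2), 2))
```

No movie features are supplied anywhere; the factors are learned only from the ratings themselves, which is what makes this approach non-content based.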
If any of these topics sound interesting to you, sign up for the Stanford Machine Learning class. Professor Andrew Ng will do an excellent job explaining the details.
This is a nice post by Socketware. It gives a short overview of a few machine learning algorithms.
- Recommendation Mining
- Document Clustering
- Document Classification
- Frequent Itemset Mining
In a matter of days, Stanford will begin the second round of the free online machine learning course. I enrolled in the course last fall, and it exceeded all expectations. Professor Andrew Ng is great. The prerequisites are minimal, so don’t worry if your math is a little rusty. Also, the videos are short (around 8 to 12 minutes), so you don’t need large blocks of time set aside. Just watch a video or two during your lunch and you should be able to keep up. There are optional programming assignments and review questions to go along with the videos.
Don’t worry if you fall behind. The videos will still be there. The material you learn is more important than the pace. If you don’t know machine learning, the Stanford class is a great opportunity to get started.
Here is Professor Ng’s introduction to the class.
If I am going to create a blog about becoming a data scientist, I must at least provide some type of definition. One of the best definitions I have read is by Hilary Mason, Chief Scientist at Bit.ly:
A data scientist is someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics, and machine learning.
This definition is short and simple, but there are many more definitions out there. In fact, CITO Research, a site for CIOs and CTOs, set out to define what a data scientist is. They interviewed six leaders in the data science community and posted all of the interviews online. The interviews produced varied results, but focused on some main themes of what a data scientist should know.
After reading Hilary’s definition, the CITO Research interviews, a great post at Quora, and numerous other articles, I created a list of data science skills:
- Machine Learning
- Story Telling (Communication)
- Big Data
I am sure this list will change and evolve over time, but that is where I am going to focus for now. If you have anything to add to the list, please leave a comment. If you are interested in gaining some data science skills, please follow along and let’s learn together.