A few days ago I posted Data Science is more than just Statistics. I did not feel the post was complete so I am adding a part 2.
I received a couple of comments about data scientists being involved with the collection of data. Yes, I would agree that is true. In order for data products to work, the correct data has to be collected. I probably did not explain this very well, but the approach to data collection is different. Typically, a statistics project will run some sort of rigorous experiment to collect data. The experiment will be very controlled and well-understood. In contrast, a data science project will collect data from existing systems, new systems, sites on the web, sensors, and various other places. Most of the data does not come from a very controlled environment. By controlled, I mean a specific number of users, a specific type of user, a specific time frame, and/or set constraints on the environment. This conglomeration of data is one of the reasons data scientists deal with such large datasets (there are other reasons too, such as cheap hardware). I think it is more common for a data science project to deal with the whole population rather than a sample of the population.
Importance of Statistics
In the previous post, I failed to emphasize the importance of statistics. I never said that statistics is not important to data science. Statistics is a critical element of data science. However, if you only know and study statistics, I feel you are missing other key elements of data science.
Thus, I stand by my previous statement, “if you just want to do statistics, join a statistics graduate program. If you want to do data science, join a data science program.” Stay tuned for a post on choosing a data science graduate program.
Do you agree/disagree?
I occasionally get comments and emails similar to the following question:
Should I attend a graduate program in data science or statistics?
I believe there is some concern about the buzzword data science. People are unsure about getting a degree in a buzzword. I understand that. However, whether the term data science lasts or not, the techniques in data science are not going away.
Anyhow, this post is not intended to argue the merits of the term data science. This post is about the comparison of statistics to data science. They are not the same thing. The approach to problems is different from the very beginning.
This is a common approach to a statistics problem. A problem is identified. Then a hypothesis is generated. In order to test that hypothesis, data needs to be collected via a very structured and well-defined experiment. The experiment is run and the hypothesis is validated or invalidated.
On the other hand, the data science approach is slightly different. All of this data has already been collected or is currently being collected. What can be predicted from that data? How can existing data be used to help sell products, increase engagement, reach more people, etc.?
Overall, statistics is more concerned with how the data is collected and why the outcomes happen. Data science is less concerned about collecting data (because it usually already exists) and more concerned about what the outcome is. Data science wants to predict that outcome.
Thus, if you just want to do statistics, join a statistics graduate program. If you want to do data science, join a data science program.
What are your thoughts? Agree/Disagree?
I don’t think anybody does it better than Hans Rosling. In the following video he helps to explain population growth, child mortality, and fossil fuel usage based upon wealth. I love how he uses toy blocks and chips to help visualize his point.
See the original post from the Guardian, Hans Rosling: the man who’s making data cool
The blog post, Central Limit Theorem Visualized in D3, was posted last week.
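For readers who would rather see the idea in numbers than in a picture, here is a minimal sketch of the central limit theorem in plain Python. It is not taken from the D3 post; the helper names (`sample_means`, `exponential_draw`), the exponential source distribution, and the sample sizes are all my own assumptions, chosen only for illustration. The point is that the means of many samples drawn from a skewed distribution cluster around the true mean, with a spread close to the source standard deviation divided by the square root of the sample size.

```python
import random
import statistics


def sample_means(draw, sample_size, num_samples, seed=0):
    """Return the means of `num_samples` samples, each made of `sample_size` draws."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(num_samples)
    ]


def exponential_draw(rng):
    # Hypothetical source distribution: exponential with rate 1,
    # which is skewed and has mean 1 and standard deviation 1.
    return rng.expovariate(1.0)


means = sample_means(exponential_draw, sample_size=50, num_samples=2000)

print("mean of sample means:   ", round(statistics.fmean(means), 3))
print("std dev of sample means:", round(statistics.stdev(means), 3))
print("CLT value, 1 / sqrt(50): ", round(1 / 50 ** 0.5, 3))
```

Plotting a histogram of `means` should give the familiar bell shape, even though the individual draws are far from normally distributed.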
Probabilistic Programming and Bayesian Methods for Hackers is an open source online book. The book is developed with IPython, so it can be read in a variety of formats: web, PDF, or locally with IPython installed.
Also, contributions are welcome via the GitHub repository for the book (or you can email the authors).
This is the first IPython project I have really looked at, and IPython looks very promising.
Win-Vector Blog: Data Science, Machine Learning, and Statistics: what is in a name?
This is an excellent write-up on the differences between:
- Machine Learning
- Data Mining
- Big Data
- Predictive Analytics
- Data Science
I recently saw the article, The Best Data Mining Tools You Can Use for Free in Your Company. It contains a very brief description of each of the following tools.
- Apache Mahout
See The Best Data Mining Tools You Can Use for Free in Your Company for more details, links, and pictures.
Yhat, a new predictive modeling startup, wrote up a nice blog post about 10 R Packages I wish I knew about earlier. It is worth reading through the list.
Special thanks to Mark Nickel for pointing out this link.
Hans Rosling does an excellent job of showing how “not boring” statistics can be. This is a great, informative statistics video. It was originally posted at The Joy of Stats.
Yes, 2013 is the International Year of Statistics. Thus, a video was made.