Data Science is more than just Statistics

I occasionally get comments and emails similar to the following question:

Should I attend a graduate program in data science or statistics?

I believe there is some concern about the buzzword data science. People are unsure about getting a degree in a buzzword. I understand that. However, whether the term data science lasts or not, the techniques in data science are not going away.

Anyhow, this post is not intended to argue the merits of the term data science. This post is about the comparison of statistics to data science. They are not the same thing. The approach to problems is different from the very beginning.

Statistics

This is a common approach to a statistics problem. A problem is identified. Then a hypothesis is generated. In order to test that hypothesis, data needs to be collected via a very structured and well-defined experiment. The experiment is run and the hypothesis is validated or invalidated.
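As a rough illustration of that workflow, here is a minimal Python sketch of testing a hypothesis on data from a designed experiment. The measurements, the function name `permutation_test`, and the choice of a permutation test are all assumptions made for this example; they do not come from a real study, and many other tests would fit the same workflow.

```python
import random
from statistics import mean


def permutation_test(control, treatment, n_permutations=5_000, seed=0):
    """Return an approximate two-sided p-value for a difference in group means.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(mean(treatment) - mean(control))
    pooled = list(control) + list(treatment)
    n_control = len(control)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable,
        # so reshuffle them and recompute the difference in means.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_control:]) - mean(pooled[:n_control]))
        if diff >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Made-up measurements from a hypothetical two-group experiment.
control = [12.1, 11.8, 12.4, 11.9, 12.0, 12.2]
treatment = [12.9, 13.1, 12.7, 13.0, 12.8, 13.2]

p_value = permutation_test(control, treatment)
print(f"approximate p-value: {p_value:.4f}")
```

A small p-value would count as evidence against the hypothesis of "no difference between groups"; the point is that the data exists only because the experiment was designed to answer that one question.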

Data Science

On the other hand, the data science approach is slightly different. All of this data has already been collected or is currently being collected. What can be predicted from that data? How can existing data be used to help sell products, increase engagement, reach more people, etc.?
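To make the contrast concrete, here is a minimal Python sketch of the prediction-first mindset: take data that already exists, fit a simple model on part of it, and check how well it predicts the rest. The `visits` and `purchases` numbers and the helper `fit_line` are invented for this example, and a straight-line fit is only one of many models that could be used.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept.

    Hypothetical helper for illustration only.
    """
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Made-up records standing in for data that was already being logged.
visits = [1, 2, 3, 4, 5, 6, 7, 8]
purchases = [0.9, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 7.9]

# Fit on the earlier records and hold out the latest ones for evaluation.
train_x, test_x = visits[:6], visits[6:]
train_y, test_y = purchases[:6], purchases[6:]

slope, intercept = fit_line(train_x, train_y)
predictions = [slope * x + intercept for x in test_x]

# Mean absolute error on the held-out records measures predictive quality.
mae = mean([abs(p - y) for p, y in zip(predictions, test_y)])
print(f"slope={slope:.3f}, intercept={intercept:.3f}, held-out MAE={mae:.3f}")
```

Nothing here asks why purchases track visits; the model is judged only on how well it predicts records it has not seen.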

Conclusion

Overall, statistics is more concerned with how the data is collected and why the outcomes happen. Data science is less concerned with collecting data (because it usually already exists) and more concerned with what the outcome is. Data science wants to predict that outcome.

Thus, if you just want to do statistics, join a statistics graduate program. If you want to do data science, join a data science program.

Thoughts/Questions

What are your thoughts? Agree/Disagree?

13 thoughts on “Data Science is more than just Statistics”

  1. A data science program will be worthless without a fair amount of statistics. I have a couple of issues with the description above because:
    1) Data science is ALSO about collecting the data (and not just assuming it’s already there). Data scientists are often tasked with supporting IT/Business Users to ensure the RIGHT data is collected (and may be an integral part of the design process of NoSQL databases for instance).
    2) “Data science wants to predict that outcome.” The correct solution to that prediction may involve statistical models such as Linear Regression, Logistic Regression, etc. How does one build, tune, and understand these models without some statistical training?

    Sample size determination, variable selection, etc. are all critically important elements in model building and are commonplace in statistics programs.

    My major concern with “data science” programs is that you end up undereducated in either the mathematics or the computing element. A great data scientist is a “statistician who is a programmer or a programmer who is also a statistician”. Therefore a great data science program will require a heavy investment in both areas (computer science: scripting, programming, etc.; and statistics: experiment design, variable transformation, hypothesis testing, etc.).

    I work in Business Analytics/Data Science and am currently obtaining a master’s in Applied Statistics. I believe it is easier for me to teach myself R/Python/NoSQL/etc. than it is to teach myself complicated mathematics. Others may hold the opposite view and would rather be taught the algorithms and programming, teaching themselves the necessary statistics.

    1. Matthew, I think you raise a really good point here. Data science is not just stats, it’s not just CS, and it’s not just half stats + half CS. In fact, even a double major in stats + CS in undergrad is probably insufficient (though it’s a good start!). Data science is probably not something one can engage in holistically until later in life… probably after at least one MS, or a PhD, and plenty of real-world experience. I would estimate that a good 70% of DS skills come from on-the-job training. Feel free to quibble with my percentage there.

  2. I think your explanation does a great job of distinguishing the two fields and your advice is right on! I would only add that one should look very closely at the course offerings of Data Science programs. Make sure they’re teaching graduate-level topics. The last thing you want to do is spend $35,000 and two years of your life on a degree equivalent to an undergraduate minor in statistics and computer science. Sure, your resume will say MS Data Science, but you still have to pass the technical questions in the interview.

      1. It depends on your exact goals. The salaries in the latest post, http://datascience101.wordpress.com/2014/04/14/big-data-jobs-are-booming/, might help you work out the financials. I think NYU offers a strong program and so does Cal Berkeley, but both are very expensive. However, the extra salary you earn might pay for the extra tuition costs. Anyhow, with the huge number of big data jobs available, I don’t think you can go wrong with any of the programs.

  3. Thanks for the interesting post Ryan. My observations:
    1. Data science also does and needs to collect data, and needs to move towards a “problem forward” approach rather than a “solution backward” one; in that I agree with Matthew above. However, I feel the Data Science programs seem to be tuned more to satisfying today’s practical requirements, where you need to get the basics of both CS and statistics right; in that I agree with you, Ryan.
    2. However, for leadership roles one does need depth in either computer science or statistics. In the future, perhaps even operations research.
    3. I feel the better depth is computer science (programming, databases, distributed computing, etc.) because the incremental business value generated by optimizing or constructing better machine learning algorithms (say) is outpaced by that of constructing and improving ETL processes and data analysis pipelines (say) for bigger and bigger data. Case in point: the Netflix Prize and Kaggle competitions, where after a certain point the improvement from further or alternative analysis (i.e. the incremental business value) carries a lot of opportunity cost in terms of time and money. I look forward to contrasting views on this one.

  4. As a health researcher I see statistics as a way of handling numbers derived from data that is more heavily influenced by mathematics than anything else. The methodologies of the various sciences dictate how the data ought to be collected and sometimes [fortuitously] strongly influence how it IS collected. I think data science comes in strictly after the data is collected and can adjust analyses for the types and methods used during data collection.

    Some health research institutions will call in a data scientist when they have a pile of data and no human being who can analyse it thoroughly, or they give it to someone in their own subject area who is also an applied statistician. There has been an increasing use of auxiliary computer programmers who may not have a grounding in the subject area from which the data arises, or else there is a “clever clogs” among the subject leaders (especially a medical specialist) who has worked up specialist skills in stats programming [or thinks they have] and who takes over the dataset from the unfortunates who have invested the last X years of their lives in collecting it! In recent years I have noticed many health research groups recruiting people with good PhDs in statistics and training them up in the basic science of the subject area, then letting them loose on data.

    From your suggestions I think that data science graduate courses need to offer something comprehensive that teaches a span of skills: science methodologies and philosophies [how about qualitative methodologies for many of the social sciences?], grounding in one or more of the statistical packages used broadly across the field of data analysis, plus grounding in programming using something like Python or R. What the data scientist really needs is a knowledge of data structures, parsing/sorting etc. algorithms, computer memory usage, storage and transfer capacities, data security, network permissions and backup.

    Ethics and security also need thorough coverage, as these can be a lot trickier now that ‘big data’ can pool information from a variety of public and private sources that have different standards of “public” and “private”. I’ve worked as a researcher/data analyst in public hospitals where research staff are not permitted to speak directly to hospital patients, let alone collect anonymous data which just might make its way into a pooled data set without the patient’s express permission for each piece of that data. Tedious, but places have their own rules! So a unit on interpersonal and managerial skills wouldn’t go too far astray for a data scientist who is going to work with information from touchy sources such as taxation, criminal justice, race relations etc. Overall I would love to see some comprehensive graduate courses in data science. Perhaps they might even trial them in a MOOC so that I can give them a try!

  5. I’m currently pursuing my bachelor’s in statistics in India. The program consists of detailed mathematics and statistics. I am unable to figure out whether a traditional master’s in statistics or a data science / business analytics program will add more to my job prospects. Please guide me further.

    1. It depends on whether you want to be a statistician or a data scientist. If you want to be a data scientist, you will need to learn programming and some domain knowledge. What are your future goals?

      Ryan
