Data Science is more than just Statistics: Part 2

A few days ago I posted Data Science is more than just Statistics. I did not feel the post was complete, so I am adding a part 2.

Data Collection

I received a couple of comments about data scientists being involved with the collection of data. Yes, I would agree that is true. In order for data products to work, the correct data has to be collected. I probably did not explain this very well, but the approach to data collection is different. Typically, a statistics project will run some sort of rigorous experiment to collect data. The experiment will be very controlled and well-understood. In contrast, a data science project will collect data from existing systems, new systems, sites on the web, sensors, and various other places. Most of the data does not come from a very controlled environment. By controlled, I mean a specific number of users, a specific type of user, a specific time frame, and/or set constraints on the environment. This conglomeration of data is one of the reasons data scientists deal with such large datasets (there are other reasons too, such as cheap hardware). I think it is more common for a data science project to deal with the whole population rather than a sample of it.

Importance of Statistics

In the previous post, I failed to emphasize the importance of statistics. I never said that statistics is not important to data science. Statistics is a critical element of data science. However, if you only know and study statistics, I feel you are missing other key elements of data science.

Thus, I stand by my previous statement, “if you just want to do statistics, join a statistics graduate program. If you want to do data science, join a data science program.” Stay tuned for a post on choosing a data science graduate program.

Do you agree/disagree?

2 thoughts on “Data Science is more than just Statistics: Part 2”

  1. I do agree, although I feel that you are wading into “rough waters” in this post due to the lack of a standardized definition of Data Science. Opinions will be prevalent until the dust settles over the next few years. I myself was an initial skeptic of the “Data Science” term, as I felt it was a re-branding of mathematical/computer science fields. However, it definitely seems to have a niche, albeit one that borrows bits and pieces from the fields around it.

    I am very much in agreement with your point about Data Scientists working with full data populations, structured/unstructured/incomplete data, and many differing sources. This is where the magic is, in my opinion, as you may be required to determine associations between very complex, highly variable, seemingly uncorrelated data in high volume. This does not typically mean you can gather your data, put up your blinders, and run analysis in isolation, either, as contextual information from the underlying systems and organizational teams is still paramount. Hence the cross-functional aspects of Data Science.

    As an aside, I have been very interested in continuing post-grad work in the Data Science/Machine Learning fields, but I am finding a lack of adoption by all but a few elite US schools (I’m a Canadian). Do you have any opinion on current offerings? (loaded question for the day 😉 )
