I occasionally get comments and emails similar to the following question:
Should I attend a graduate program in data science or statistics?
I believe there is some concern about the buzzword data science. People are unsure about getting a degree in a buzzword. I understand that. However, whether the term data science lasts or not, the techniques in data science are not going away.
Anyhow, this post is not intended to argue the merits of the term data science. This post is about the comparison of statistics to data science. They are not the same thing. The approach to problems is different from the very beginning.
This is a common approach to a statistics problem. A problem is identified. Then a hypothesis is generated. In order to test that hypothesis, data needs to be collected via a very structured and well-defined experiment. The experiment is run and the hypothesis is validated or invalidated.
On the other hand, the data science approach is slightly different. All of this data has already been collected or is currently being collected. What can be predicted from that data? How can existing data be used to help sell products, increase engagement, reach more people, etc.?
Overall, statistics is more concerned with how the data is collected and why the outcomes happen. Data science is less concerned about collecting data (because it usually already exists) and more concerned about what the outcome is. Data science wants to predict that outcome.
Thus, if you just want to do statistics, join a statistics graduate program. If you want to do data science, join a data science program.
What are your thoughts? Agree/Disagree?
This is one of the better descriptions I have seen of what a data scientist does.
They must find interesting, novel, and useful insights about the real world in the data. And they must turn those insights into products and services, and deliver those products and services at a profit.
Notice, data scientists don’t just need to find insights in data. They also need to create profitable products from those insights. I often feel that data products are not seen as being as important as improving the machine learning algorithms, but the data products really are the end goal.
The quote came from the Harvard Business Review article, To Work with Data, You Need a Lab and a Factory.
A very nice slide deck from Jeff Hammerbacher of Cloudera. It goes over k-means clustering and some enhancements.
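For readers who have not seen k-means before, here is a minimal sketch of the basic algorithm (usually called Lloyd's algorithm) in plain Python. It is not taken from the slide deck, and it does not cover any of the enhancements. The function names and the six toy points are my own assumptions, made up purely for illustration.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]

def kmeans(points, k, iterations=20, seed=0):
    """Basic k-means: alternate between assigning points and moving centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        labels = assign(points, centroids)
        updated = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                # Move the centroid to the mean of the points assigned to it.
                updated.append((
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                ))
            else:
                # Keep an empty cluster's centroid where it was.
                updated.append(centroids[i])
        if updated == centroids:  # nothing moved, so the algorithm has converged
            break
        centroids = updated
    return centroids, assign(points, centroids)

# Hypothetical toy data: two obvious groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

On data this cleanly separated, the two centroids should settle on the middle of each group. Real data is rarely that friendly, which is presumably where the enhancements come in.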
Deep Learning is a new term that is starting to appear in the data science/machine learning news.
What is Deep Learning?
According to DeepLearning.net, the definition goes like this:
Deep Learning is a new area of Machine Learning research, which has been introduced with the objective of moving Machine Learning closer to one of its original goals: Artificial Intelligence.
Wikipedia provides the following definition:
Deep learning is set of algorithms in machine learning that attempt to learn layered models of inputs, commonly neural networks. The layers in such models correspond to distinct levels of concepts, where higher-level concepts are defined from lower-level ones, and the same lower-level concepts can help to define many higher-level concepts.
Deep Learning is sometimes referred to as deep neural networks since much of deep learning focuses on artificial neural networks. Artificial neural networks are a technique in computer science modelled after the connections (synapses) of neurons in the brain. Artificial neural networks, sometimes just called neural nets, have been around for about 50 years, but advances in computer processing power and storage are finally allowing neural nets to improve solutions for complex problems such as speech recognition, computer vision, and Natural Language Processing (NLP).
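To make the idea of layered models a bit more concrete, here is a tiny sketch of a two-layer neural net in plain Python. The weights are picked by hand rather than learned, so this only shows how layers stack, not how deep learning trains them. The hidden layer computes two lower-level concepts (roughly OR and AND), and the output layer combines them into a higher-level one (XOR). All names and numbers here are my own assumptions for illustration.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_layer(inputs, weights, biases):
    """One layer: every unit takes a weighted sum of all inputs, then a sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + bias)
        for unit_weights, bias in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Feed the inputs through each layer in turn."""
    activations = inputs
    for weights, biases in layers:
        activations = dense_layer(activations, weights, biases)
    return activations

# Hand-picked (not learned) weights, assumed for this example only.
layers = [
    # Hidden layer: unit 0 acts like OR, unit 1 acts like AND.
    ([[20, 20], [20, 20]], [-10, -30]),
    # Output layer: "OR but not AND", which is XOR.
    ([[20, -20]], [-10]),
]

for a in (0, 1):
    for b in (0, 1):
        output = forward([a, b], layers)[0]
        print(a, b, round(output))
```

A single layer of this kind cannot compute XOR on its own, which is a small hint at why stacking layers matters.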
Hopefully, this blog post provides some inspiration and useful links to help you learn more about deep learning.
How is Deep Learning being applied?
The following talk, Tera-scale Deep Learning, by Quoc V. Le of Stanford gives some indication of the size of problems to be tackled. The talk discusses work being done on a cluster of 2000 machines and more than 1,000,000,000 parameters.
Startup50’s list of 42 Big Data Startups.
The voting is done, but the list contains plenty of startups working in the data science field.
The following video goes well with the previous post about Open Source Alternatives to AWS.
It says a lot for the quality of OpenStack, since one of the world’s most secretive organizations trusts it. OpenStack might be a good option for data teams needing to quickly build and deploy data products.
Note: This post has nothing to do with the recent NSA whistleblower news.
Working with big data can often mean doing some cloud computing. If a public cloud like Amazon AWS is not an option, there are some open source alternatives. They all offer some level of compatibility with the AWS API for both EC2 (compute) and S3 (storage).
- Rackspace OpenStack
- Apache CloudStack
I don’t think anybody does it better than Hans Rosling. In the following video he helps to explain population growth, child mortality, and fossil fuel usage based upon wealth. I love how he uses toy blocks and chips to help visualize his point.
See the original post from the Guardian, Hans Rosling: the man who’s making data cool
The blog post, Central Limit Theorem Visualized in D3, was posted last week.
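If you would rather see the central limit theorem in numbers than in a picture, here is a small simulation sketch in plain Python. It is not related to the D3 code in that post. The function name, sample size, and trial count are my own assumptions. It averages draws from a flat (uniform) distribution and checks that the averages cluster the way a normal distribution would.

```python
import math
import random
import statistics

def sample_means(n, trials, seed=0):
    """Return `trials` averages, each over `n` uniform(0, 1) draws."""
    rng = random.Random(seed)
    return [sum(rng.random() for _ in range(n)) / n for _ in range(trials)]

n = 30
means = sample_means(n=n, trials=5000)

# A uniform(0, 1) draw has mean 0.5 and variance 1/12, so the
# average of n draws should have standard deviation sqrt(1 / (12 * n)).
expected_sd = math.sqrt(1 / (12 * n))
observed_mean = statistics.mean(means)
observed_sd = statistics.stdev(means)

# For a normal distribution, roughly 68% of values fall within one
# standard deviation of the mean.
within_one_sd = sum(abs(m - 0.5) <= expected_sd for m in means) / len(means)

print(observed_mean, observed_sd, expected_sd, within_one_sd)
```

The individual draws are nothing like a bell curve, but their averages should come out very close to one.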