Data Science Methodology
- Problem Formulation – First, identify the problem to be solved. This step is easily overlooked, yet many dollars and hours have been spent solving the wrong problems.
- Obtain The Data – Next, collect new data and/or gather the data that already exists. In almost all cases, this data will need to be transformed and cleansed. It is important to note that this stage does not always involve big data or a data lake.
- Analysis – This is the part of the process where insight is extracted from the data. Commonly, this step will involve creating and optimizing statistical/machine learning models for prediction, but that is not always necessary. Sometimes, the analysis only contains graphs, charts, and basic descriptions of the data.
- Data Product – The end goal of data science is a data product. The insight from the Analysis phase needs to be conveyed to an end user. The data product might be as simple as a slideshow; more commonly it is a website dashboard, a message, an alert, or a recommendation.
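To make the four steps concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the question, the store names, the sales figures, and the function names are all hypothetical, and a real project would replace each function with something far more involved.

```python
from statistics import mean


def formulate_problem():
    # Step 1: state the question before touching any data (hypothetical question).
    return "Which store has the highest average daily sales?"


def obtain_data():
    # Step 2: gather the data, then transform and cleanse it.
    # These raw records are made up; note the stray whitespace and the missing value.
    raw = [
        ("north", "120"),
        ("north", "135"),
        ("south", " 98 "),
        ("south", None),
        ("south", "110"),
    ]
    # Drop missing values and convert the text to numbers.
    return [(store, float(value)) for store, value in raw if value is not None]


def analyze(records):
    # Step 3: extract insight. Here that is a basic description, not a model.
    by_store = {}
    for store, value in records:
        by_store.setdefault(store, []).append(value)
    return {store: mean(values) for store, values in by_store.items()}


def build_data_product(question, averages):
    # Step 4: convey the insight to an end user, here as a one-line message.
    best = max(averages, key=averages.get)
    return f"{question} Answer: {best} ({averages[best]:.1f} per day)"


question = formulate_problem()
averages = analyze(obtain_data())
print(build_data_product(question, averages))
```

Note that the analysis step here is only a group average. That matches the point above that the analysis does not always require a machine learning model.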
Can you think of anything the methodology is missing?
Note: This post is similar to the Data Scientific Method, which I blogged about nearly 2 years ago.
I have frequently been hearing the term data lake. Being the curious person that I am, I decided to go in search of a definition.
Currently, the company Pivotal is responsible for marketing the term. However, I believe the term was originally coined by Dan Woods of CITO Research back in 2011. Anyhow, here is a basic description of a data lake.
A data lake is an information system with the following 2 characteristics:
- A parallel system able to store big data
- A system able to perform computations on the data without moving the data
Currently, Hadoop is the most common technology to implement a data lake, but it might not be that way forever. Thus it is important to distinguish between Hadoop and a data lake. A data lake is a concept, and Hadoop is a technology to implement the concept.
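The second characteristic, computing on the data without moving it, can be illustrated with a toy sketch in Python. This is not Hadoop and not any real data lake API. The `Partition` class, `partial_sum_count`, and `lake_mean` are hypothetical names invented here, and the "nodes" are just in-memory objects. The idea it shows is that the function travels to each partition, and only a small partial result travels back.

```python
from concurrent.futures import ThreadPoolExecutor


class Partition:
    """Hypothetical storage node holding one slice of a large data set."""

    def __init__(self, rows):
        self._rows = rows

    def run(self, func):
        # The computation is sent to where the data lives.
        # Only the (small) result leaves the partition, never the rows.
        return func(self._rows)


def partial_sum_count(rows):
    # Work done locally on each partition.
    return (sum(rows), len(rows))


def lake_mean(partitions):
    # Run the same function on every partition in parallel,
    # then combine the small partial results in one place.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda p: p.run(partial_sum_count), partitions))
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    return total / count


# Three made-up partitions standing in for three storage nodes.
partitions = [Partition([1, 2, 3]), Partition([4, 5]), Partition([6])]
print(lake_mean(partitions))  # 3.5
```

Swapping Hadoop for another technology would change how `Partition.run` is implemented, not the concept, which is the distinction drawn above.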
The following is a recent Strata Talk by Kaushik Das of Pivotal. He discusses how a data lake can be used to create the digital brain.
Stephen Wolfram, founder of Wolfram Research and creator of Mathematica, just announced the new Wolfram Programming Language. This is really exciting and cool, so please take some time to watch the video. I think this might be a game changer in data science.
Professor Ian Witten of The University of Waikato has just begun the second iteration of his online course, Data Mining With Weka. The course started March 3, but there is still time to register and complete it, so hurry. The course lasts 5 weeks and covers how to analyze your own data with Weka.
Weka is an open source tool for machine learning and data mining.