The Data Science Methodology

  1. Problem Formulation – First, identify the problem to be solved. This step is easily overlooked. However, many dollars and hours have been spent solving the wrong problems.
  2. Obtain The Data – Next, collect new data and/or gather the data that already exists. In almost all cases, this data will need to be transformed and cleansed. It is important to note that this stage does not always involve big data or a data lake.
  3. Analysis – This is the part of the process where insight is extracted from the data. Commonly, this step will involve creating and optimizing statistical/machine learning models for prediction, but that is not always necessary. Sometimes, the analysis only contains graphs, charts, and basic descriptions of the data.
  4. Data Product – The end goal of data science is a data product. The insight from the Analysis phase needs to be conveyed to an end user. The data product might be as simple as a slideshow; more commonly it is a website dashboard, a message, an alert, or a recommendation.
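
The four steps above can be sketched as a tiny pipeline. This is only an illustration, not a prescribed implementation: the order records, the function names, and the 0.5 threshold are all made up for the example, and the analysis is deliberately a basic description of the data rather than a model.

```python
from statistics import mean

# Hypothetical raw records; the None and the string "41" stand in for
# the messy values that step 2 says usually need cleansing.
RAW_ORDERS = [
    {"day": "Mon", "orders": 40},
    {"day": "Tue", "orders": "41"},
    {"day": "Wed", "orders": None},
    {"day": "Thu", "orders": 12},
]


def formulate_problem():
    # Step 1: state the question before touching any data.
    return "Which days had unusually low order counts?"


def obtain_data(raw):
    # Step 2: gather the records, drop missing values, and fix types.
    return [
        {"day": r["day"], "orders": int(r["orders"])}
        for r in raw
        if r["orders"] is not None
    ]


def analyze(rows, threshold=0.5):
    # Step 3: a basic description of the data, no model required.
    # A day is flagged when it falls below `threshold` times the mean.
    average = mean(r["orders"] for r in rows)
    low_days = [r["day"] for r in rows if r["orders"] < threshold * average]
    return {"average": average, "low_days": low_days}


def data_product(question, insight):
    # Step 4: convey the insight to an end user, here as a short message.
    if insight["low_days"]:
        return f"{question} Alert: low orders on {', '.join(insight['low_days'])}."
    return f"{question} No unusually low days found."


if __name__ == "__main__":
    question = formulate_problem()
    rows = obtain_data(RAW_ORDERS)
    print(data_product(question, analyze(rows)))
```

Keeping each step as its own function makes it easy to swap one stage out, for example replacing the simple average in `analyze` with a predictive model, without touching the other three.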

Can you think of anything the methodology is missing?

Note: This post is similar to the Data Scientific Method which I blogged about nearly 2 years ago.

About Ryan Swanstrom

Creator of Data Science 101

12 Comments on “The Data Science Methodology”

    1. I think you might not be asking the right question. It depends upon what you want to do. Java is not the preferred language for data analysis and statistical models. Python and R are much easier for that. However, Java can be used with Hadoop.
      A better approach is: “What do I want to learn?” Then go take a class about that. Learn to use the best tools for the job.

    1. Yes, visualization is important. I may not have specified it very well, but I think it falls into the final 2 stages. The Analysis phase is where you would look at some exploratory visualizations, and the Data Product itself may end up being a visualization.

      Thanks for commenting. I hope my response helps clear things up.


  1. One other consideration is that, in a real-world setting, you have to incorporate continuous validation. Your models become stale over time and need to be updated.

  2. Can you help me? I want to apply data science to business cost reduction, which is the topic of my master's project. Which data science techniques would you recommend for cost reduction?
