Tag Archives: data products

The Goal is Data Products: Now How Do We Get There?

The primary output of data science is data products. A data product can be anything from a list of recommendations to a dashboard to a single chart, or any other product that aids in making a more informed decision. In the end, data science should produce some usable results, and those results are the data product. The process used to create those data products needs a bit more formalization. Call it a methodology, process, lifecycle, or workflow, but it needs to exist.

Dr. Kirk Borne provided some thoughts in July 2014 with his article, Raising the Standard in the Big Data Analytics Profession. Data science needs some standards and possibly even a workflow, but the focus on data products cannot be lost.

Data Science is not Software Engineering

Data science is often treated as software engineering because both involve writing code. However, they are not the same thing. Agile methods, waterfall, and Scrum are not pluggable methodologies that can simply be dropped into data science. Data science is more science and less engineering; therefore, it should follow something closer to the scientific method.

Existing Data Science Workflows

Luckily, some options already exist for data science. Much like software engineering, there is not a magic workflow that fits every project. The goal is to find a workflow that best fits the needs of the current project.

CRISP-DM

The oldest and most popular method is CRISP-DM (Cross-Industry Standard Process for Data Mining). CRISP-DM was designed for data mining projects, which are closer to data science than to software engineering, but still not an exact match. The six steps of CRISP-DM are:

  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment
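
As a rough illustration only, the six steps above can be sketched as an ordered pipeline that starts from a question. The `run_workflow` function, the `state` dictionary, and the sample question are hypothetical names invented for this sketch and are not part of CRISP-DM itself. Real CRISP-DM projects also loop back between steps, which this linear version ignores.

```python
from typing import Callable, Dict, List

# The six CRISP-DM step names, in the order listed above.
CRISP_DM_PHASES: List[str] = [
    "Business Understanding",
    "Data Understanding",
    "Data Preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]

# Hypothetical convention: each step takes the project state and returns a new one.
Phase = Callable[[Dict], Dict]


def run_workflow(phases: Dict[str, Phase], question: str) -> Dict:
    """Run the supplied steps in CRISP-DM order, starting from a question."""
    state: Dict = {"question": question}
    completed: List[str] = []
    for name in CRISP_DM_PHASES:
        # Steps that were not supplied are treated as no-ops (an assumption).
        step = phases.get(name, lambda s: s)
        state = step(state)
        completed.append(name)
    state["completed"] = completed
    return state


# Usage: only one step is filled in; the rest pass the state through unchanged.
result = run_workflow(
    {"Data Preparation": lambda s: {**s, "rows": [1, 2, 3]}},
    question="Which products should we recommend?",
)
```

Keeping the question in the state from the first step onward is a design choice made for this sketch: it keeps every later step tied to what the eventual data product is supposed to answer.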

Data Science Project Lifecycle

The Data Science Project Lifecycle is a recent modification/improvement of CRISP-DM with a bit more of an engineering focus. The steps are:

  1. Data acquisition
  2. Data preparation
  3. Hypothesis and modeling
  4. Evaluation and interpretation
  5. Deployment
  6. Operations
  7. Optimization

Data Science Workflow

The Data Science Workflow: Overview and Challenges was presented on the ACM blog in 2013. It was part of a dissertation by Philip Guo. Here are the steps:

  1. Preparation
  2. Analysis
  3. Reflection
  4. Dissemination

Those are three workflow options for data science, but they are not the only ones. Feel free to modify a workflow to best suit the project. It will be exciting to see the new data science workflows that are created in the near future, and to see which ones turn out to be the most beneficial.

One thing a data product must do is help answer a question. Thus, a logical starting point for data science is a good question. Just don’t let the workflow become focused on the process itself, which is often the case in software engineering. Let the focus be on data products.

Note: I have previously written two posts on this topic, and I don’t think either post gets the methodology exactly correct.

7 Browser-based IDEs for Coding on Chromebook

Although an Integrated Development Environment (IDE) is not always essential for data science, building data products often requires a complete development environment and not just statistical analysis. In my quest to perform all my web development and data science work on a Chromebook, a development environment in the browser is essential. Luckily for me and for all of you, a number of companies are working on just that. Full disclosure: some of the links below contain my referral codes.

Name         | Languages Supported
-------------|-----------------------------------------------------------
Nitrous.io   | Ruby, Node.js, Python, PHP or Go
CodeBox      | PHP, Java, Ruby, Node.js, Python, Go, C/C++, and many more
Koding       | ?
Codio        | PHP, Node.js, Ruby, many others
Cloud9 IDE   | Node.js, PHP, and many more
Runnable     | Java, PHP, Python, and more
Codeanywhere | ?

Disclaimer: I have not had a chance to try every option, but I thought I should share them anyhow.

What does a data scientist do?

This is one of the better descriptions I have seen of what a data scientist does.

They must find interesting, novel, and useful insights about the real world in the data. And they must turn those insights into products and services, and deliver those products and services at a profit.

Notice that data scientists don’t just need to find insights in data. They also need to create profitable products from those insights. I often feel that data products are not seen as being as important as improving machine learning algorithms, but the data products really are the end goal.

The quote came from the Harvard Business Review article, To Work with Data, You Need a Lab and a Factory.