Tag Archives: software engineering

Software Engineering Podcasts for Data Science

If you are a former software engineer looking to gain some data science skills, here is a list of podcasts that will most likely interest you.

Software Engineering Daily

A nice podcast that just ran a series of episodes about data science.

Software Engineering Radio

A great software engineering podcast. Here are a couple of topics related to data science.

Enjoy some listening while you are on the train, plane, bus or car.

The Problem with Software Analytics

Software Analytics is the marriage of data science and software engineering.  It hopes to use data generated from software and software engineering processes to provide insights for creating better software.

The following is a quote from a software analytics round table discussion in 2013. All of the round table members are leading academics at prestigious universities.  Obviously, they were chosen because they are very accomplished and know the field.  Now, onto the quote.

Modern software services such as GitHub, BitBucket, Ohloh, Jira, FogBugz, and the like employ wide use of visualization and even bug effort estimation. We can pat ourselves on the backs even if those developers never read a single one of our papers.


Here is the source in IEEE Computer (which most likely you cannot access unless you are an academic): Roundtable: What’s Next in Software Analytics. For the non-academics, an InfoQ reprint is available free online.

The academic research community cannot take credit for what GitHub, BitBucket, and others have done. Yes, the academic research community is doing some excellent work, but most software practitioners are not seeing it because that research is hidden in academic journals. The advancements might have occurred simultaneously and coincidentally, but there is not a clear causal relationship. Unfortunately, the academic research is not getting into the hands of the software practitioners.

I would like to think the target audience of software engineering research would be software engineers, project managers, and developers. However, as this quote points out, those practitioners hardly ever see the research. If the research does not reach the intended audience, then there is a clear problem.  A problem that needs to be fixed.

Unfortunately, I do not yet know what the fix is. If you have any ideas, please leave a comment below.

If there is enough interest, maybe I will start something (just don’t know what that something is).

The Goal is Data Products: Now How Do We Get There?

The primary output of data science is data products. Data products can be anything from a list of recommendations to a dashboard to a single chart or any other product that aids in making a more informed decision. In the end, data science should produce some usable results, and those results are the data product. The process used to create those data products needs a bit more formalization. Call it a methodology, process, lifecycle, or workflow, but it needs to exist.

Dr. Kirk Borne provided some thoughts in July 2014 with his article, Raising the Standard in the Big Data Analytics Profession. Data science needs some standards and possibly even a workflow, but the focus on data products cannot be lost.

[Image: burndown chart]

Data Science is not Software Engineering

Data science is often treated as software engineering because code is written. However, they are not the same thing. Agile methods, waterfall, and scrum are not pluggable methodologies that can be dropped into a data science project unchanged. Data science is more science and less engineering; therefore it should follow a more scientific method.

Existing Data Science Workflows

Luckily, some options already exist for data science. Much like software engineering, there is not a magic workflow that fits every project. The goal is to find a workflow that best fits the needs of the current project.

CRISP-DM

The most popular and oldest method is CRISP-DM. CRISP-DM was designed for data mining projects, which is closer to data science than software engineering, but still not exact. The 6 steps of CRISP-DM are:

  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment
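
To make the six steps a little more concrete, here is a minimal Python sketch that walks through them in order. It is only an illustration and not part of CRISP-DM itself: the toy churn data, the function names, and the simple threshold "model" are all assumptions made up for this example.

```python
from statistics import mean

# Hypothetical toy data: (hours_of_usage, churned) pairs; None marks a missing value.
RAW = [(1.0, 1), (2.0, 1), (None, 0), (8.0, 0), (9.5, 0), (1.5, 1), (7.0, 0)]


def business_understanding():
    # 1. Business Understanding: state the question the data product must answer.
    return "Which customers are likely to churn?"


def data_understanding(rows):
    # 2. Data Understanding: profile the data (here, just count missing values).
    return {"rows": len(rows), "missing": sum(1 for h, _ in rows if h is None)}


def data_preparation(rows):
    # 3. Data Preparation: drop records with missing usage hours.
    return [(h, y) for h, y in rows if h is not None]


def modeling(rows):
    # 4. Modeling: a deliberately simple threshold halfway between the class means.
    churned = mean(h for h, y in rows if y == 1)
    stayed = mean(h for h, y in rows if y == 0)
    threshold = (churned + stayed) / 2
    return lambda hours: 1 if hours < threshold else 0


def evaluation(model, rows):
    # 5. Evaluation: accuracy on the prepared data (no hold-out set, for brevity).
    return mean(1 if model(h) == y else 0 for h, y in rows)


def deployment(model):
    # 6. Deployment: expose the model as a data product, here a plain function.
    return lambda hours: "at risk" if model(hours) == 1 else "ok"


question = business_understanding()
profile = data_understanding(RAW)
clean = data_preparation(RAW)
model = modeling(clean)
accuracy = evaluation(model, clean)
product = deployment(model)

print(question, profile, accuracy, product(1.0), product(9.0))
```

In a real project the steps are rarely this linear. A poor result in the evaluation step would normally send you back to an earlier step, and the model would be evaluated on data it was not fit on.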

Data Science Project Lifecycle

The Data Science Project Lifecycle is a recent modification/improvement of CRISP-DM with a bit more of an engineering focus. The steps can be seen as:

  1. Data acquisition
  2. Data preparation
  3. Hypothesis and modeling
  4. Evaluation and Interpretation
  5. Deployment
  6. Operations
  7. Optimization

Data Science Workflow

The Data Science Workflow: Overview and Challenges was presented on the ACM blog in 2013. It was part of a dissertation by Philip Guo. Here are the steps:

  1. Preparation
  2. Analysis
  3. Reflection
  4. Dissemination

Those are 3 options of workflows for data science. They are not the only options. Feel free to modify the workflows to best suit the project. It will be exciting to see the new workflows for data science that will be created in the near future. It will also be fun to see which ones turn out to be the most beneficial.

One thing a data product must do is help answer a question. Thus, a logical starting point for data science is a good question. Just don’t let the focus of the workflow come down to the process, which is often the case in software engineering. Let the focus be on data products.


Note:
I have previously written 2 posts on this topic, and I don’t think either post gets the methodology exactly correct.

Data Scientist vs Data Engineer

As the field of data science continues to grow and mature, it is nice to begin seeing some distinction in the roles of a data scientist. A new job title gaining popularity is the data engineer. In this post, I lay out some of the distinctions between the 2 roles.

Data Scientist vs Data Engineer Venn Diagram

Data Scientist

A data scientist is responsible for pulling insights from data. It is the data scientist's job to pull data, create models, create data products, and tell a story. A data scientist should typically have interactions with customers and/or executives. A data scientist should love scrubbing a dataset for more and more understanding.

The main goal of a data scientist is to produce data products and tell the stories of the data. A data scientist would typically have stronger statistics and presentation skills than a data engineer.

Data Engineer

Data engineering is more focused on the systems that store and retrieve data. A data engineer will be responsible for building and deploying storage systems that can adequately handle the needs. Sometimes the needs are fast real-time incoming data streams. Other times the needs are massive amounts of large video files. Still other times the needs are many, many reads of the data. In other words, a data engineer needs to build systems that can handle the 3 Vs of big data: volume, velocity, and variety.

The main goal of a data engineer is to make sure the data is properly stored and available to the data scientist and others who need access. A data engineer would typically have stronger software engineering and programming skills than a data scientist.

Conclusion

It is too early to tell if these 2 roles will ever have a clear distinction of responsibilities, but it is nice to see a little separation of responsibilities for the mythical all-in-one data scientist. Both of these roles are important to a properly functioning data science team.

Do you see other distinctions between the roles?