Tag Archives: data engineering

Best Practices for Machine Learning Engineering

Martin Zinkevich, Research Scientist at Google, just compiled a large list (43 to be exact) of best practices for building machine learning systems.

Rules of Machine Learning:
Best Practices for ML Engineering

If you do data engineering or are involved with building data science systems, this document is worth a look.

Data Scientists, Data Engineers, Software Engineers: The Difference According to LinkedIn

The differences between Data Scientists, Data Engineers, and Software Engineers can get a little confusing at times. Here is a guest post by Jake Stein, CEO at Stitch (formerly RJ Metrics), which aims to clear up some of that confusion based on LinkedIn data.

As data grows, so does the expertise needed to manage it. The past few years have seen an increasing distinction between the key roles tasked with managing data: software engineers, data engineers, and data scientists.

More and more we’re seeing data engineers emerge as a subset within the software engineering discipline, but this is still a relatively new trend. Plenty of software engineers are still tasked with moving and managing data.

Our team has released two reports over the past year, one focused on understanding the data science role, one on data engineering. Both of these reports are based on self-reported LinkedIn data. In this post, I’ll lay out the distinctions between these roles and software engineers, but first, here’s a diagram to show you (in very broad strokes) what we saw in the skills breakdown between these three roles:

Data Roles and Skill Sets
A comparison of software engineers vs data engineers vs data scientists

Software Engineer

A software engineer builds applications and systems. Developers will be involved through all stages of this process from design, to writing code, to testing and review. They are creating the products that create the data. Software engineering is the oldest of these three roles, and has established methodologies and tool sets.

Work includes:

  • Frontend and backend development
  • Web apps
  • Mobile apps
  • Operating system development
  • Software design

Data Engineer

A data engineer builds systems that consolidate, store, and retrieve data from the various applications and systems created by software engineers. Data engineering emerged as a niche skill set within software engineering. 40% of all data engineers previously worked as software engineers, making this by far the most common career path into data engineering.

Work includes:

  • Advanced data structures
  • Distributed computing
  • Concurrent programming
  • Knowledge of new & emerging tools: Hadoop, Spark, Kafka, Hive, etc.
  • Building ETL/data pipelines

Data Scientist

A data scientist builds analysis on top of data. This may come in the form of a one-off analysis for a team trying to better understand customer behavior, or a machine learning algorithm that is then implemented into the code base by software engineers and data engineers.

Work includes:

  • Data modeling
  • Machine learning
  • Algorithms
  • Business Intelligence dashboards

Evolving Data Teams

These roles are still evolving. The process of ETL is getting much easier overall as new tools (like Stitch) enter the market, making it easy for software developers to set up and maintain data pipelines. Larger companies are pulling data engineers off the software engineering team entirely and forming a centralized data team where infrastructure and analysis sit together. In some scenarios, data scientists are responsible for both data consolidation and analysis.

At this point, there is no single dominant path, but we expect this rapid evolution to continue. After all, data certainly isn’t getting any smaller.

Data Scientist vs Data Engineer

As the field of data science continues to grow and mature, it is nice to begin seeing some distinction in the roles of a data scientist. A new job title gaining popularity is the data engineer. In this post, I lay out some of the distinctions between the 2 roles.

Data Scientist vs Data Engineer Venn Diagram

Data Scientist

A data scientist is responsible for pulling insights from data. It is the data scientist’s job to pull data, create models, create data products, and tell a story. A data scientist should typically have interactions with customers and/or executives. A data scientist should love scrubbing a dataset for more and more understanding.

The main goal of a data scientist is to produce data products and tell the stories of the data. A data scientist would typically have stronger statistics and presentation skills than a data engineer.

Data Engineer

Data engineering is more focused on the systems that store and retrieve data. A data engineer will be responsible for building and deploying storage systems that can adequately handle the needs. Sometimes the needs are fast, real-time incoming data streams. Other times the needs are massive amounts of large video files. Still other times the needs are many, many reads of the data. In other words, a data engineer needs to build systems that can handle the 3 Vs of big data (volume, velocity, and variety).

The main goal of a data engineer is to make sure the data is properly stored and available to the data scientist and others who need access. A data engineer would typically have stronger software engineering and programming skills than a data scientist.

Conclusion

It is too early to tell if these 2 roles will ever have a clear distinction of responsibilities, but it is nice to see a little separation of responsibilities for the mythical all-in-one data scientist. Both of these roles are important to a properly functioning data science team.

Do you see other distinctions between the roles?

Wanna Be a Data Engineer? – Insight Data Engineering Can Help

Last week, I got the opportunity to spend some time with the team from Insight Data Engineering. They offer a free program that trains people to be data engineers. Then they help those people connect with a job at an impressive company. The program runs a few times a year and consists of 6 intense weeks learning about and working on a data engineering project.

Although the program is free, it does have a highly selective application process. Once accepted, you can expect the following:

  • A beautiful office space in sunny Palo Alto, CA
  • Mentoring from experts in the field
  • Meet and Greets with some of the biggest names in data science
  • Introductions to some of the leading data engineering companies
  • Access to a growing network of program alumni
  • A bright future as a data engineer

Insight Data Engineering is the same company that has run Insight Data Science, a similar type of program but for scientists instead of engineers, for the past 2 years. That program has 100% placement so far, and I don’t see that number ever changing. The program has an excellent advisory board that is actively involved in the program.

The Data Engineering program is actively accepting applications for the next session scheduled to start in September. Hurry, the deadline for applications is July 7, 2014.

Insight Data Engineering – A Free Tuition Training Program

The creators of the Insight Data Science Fellows Program have done it again. This time they have created the Insight Data Engineering Program. The program aims to train highly specialized software engineers who can build big data systems and big data pipelines. Unlike the data science program, the data engineering program does not target people with PhDs. Please visit the Insight Data Engineering website for a white paper with all the details on the program.

Here is an official announcement:

The Insight Data Engineering Fellows Program is a professional training fellowship designed to help engineers from various backgrounds, as well as mathematicians, and computer scientists, transition to careers in data engineering.

  • Tuition free, 6 week, full-time, data engineering training fellowship in Silicon Valley this summer.
  • Alumni network of 70 Insight Fellows who are now data scientists and data engineers at Facebook, LinkedIn, Microsoft, Twitter, Square, Netflix, Airbnb, Palantir, Jawbone and many others.
  • Interview at top technology companies hiring data engineers at the end of the fellowship.

For more information please visit:
http://insightdataengineering.com