Learn to Analyze Big Data with R – Free Course

R is a hugely popular language among data scientists and statisticians. One of the difficulties with open-source R is its memory constraint: all the data needs to fit in memory, typically loaded into a data.frame. Microsoft addresses this problem with the RevoScaleR package of Microsoft R Server. Just launched this week is an edX course, Analyzing Big Data with Microsoft R Server.
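To make the memory constraint concrete, here is a minimal Python sketch of the general idea of processing data in chunks rather than loading it all at once. It is not RevoScaleR's API and is not taken from the course; the function name `chunked_mean`, the `delay` column, and the chunk size are all hypothetical, chosen only for illustration.

```python
import csv
import io
from itertools import islice


def chunked_mean(rows, column, chunk_size=1000):
    """Compute the mean of `column`, holding at most `chunk_size` rows in memory."""
    total = 0.0
    count = 0
    reader = csv.DictReader(rows)
    while True:
        # Pull the next block of rows; earlier blocks are discarded.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    if count == 0:
        raise ValueError("no rows to average")
    return total / count


# Tiny in-memory stand-in for a file that would be too large to load at once.
sample = io.StringIO("delay\n10\n20\n30\n40\n")
print(chunked_mean(sample, "delay", chunk_size=2))  # prints 25.0
```

Only running totals survive between chunks, so memory use depends on the chunk size rather than on the size of the whole data set. Statistics that cannot be built from running totals need more careful algorithms.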

According to the syllabus:

Upon completion, you will know how to use R for big-data problems.

Full Disclosure: I work at Microsoft, and the course instructor, Seth Mottaghinejad, is one of my colleagues.

Data Scientists, Data Engineers, Software Engineers: The Difference According to LinkedIn

The differences between data scientists, data engineers, and software engineers can get a little confusing at times. To help, here is a guest post by Jake Stein, CEO at Stitch (formerly RJ Metrics), which aims to clear up some of that confusion using LinkedIn data.

As data grows, so does the expertise needed to manage it. The past few years have seen an increasing distinction between the key roles tasked with managing data: software engineers, data engineers, and data scientists.

More and more we’re seeing data engineers emerge as a subset within the software engineering discipline, but this is still a relatively new trend. Plenty of software engineers are still tasked with moving and managing data.

Our team has released two reports over the past year, one focused on understanding the data science role, one on data engineering. Both of these reports are based on self-reported LinkedIn data. In this post, I’ll lay out the distinctions between these roles and software engineers, but first, here’s a diagram to show you (in very broad strokes) what we saw in the skills breakdown between these three roles:

Data Roles and Skill Sets
A comparison of software engineers vs data engineers vs data scientists

Software Engineer

A software engineer builds applications and systems. Developers will be involved through all stages of this process from design, to writing code, to testing and review. They are creating the products that create the data. Software engineering is the oldest of these three roles, and has established methodologies and tool sets.

Work includes:

  • Frontend and backend development
  • Web apps
  • Mobile apps
  • Operating system development
  • Software design

Data Engineer

A data engineer builds systems that consolidate, store, and retrieve data from the various applications and systems created by software engineers. Data engineering emerged as a niche skill set within software engineering. 40% of all data engineers previously worked as software engineers, making this by far the most common career path into data engineering.

Work includes:

  • Advanced data structures
  • Distributed computing
  • Concurrent programming
  • Knowledge of new & emerging tools: Hadoop, Spark, Kafka, Hive, etc.
  • Building ETL/data pipelines

Data Scientist

A data scientist builds analysis on top of data. This may come in the form of a one-off analysis for a team trying to better understand customer behavior, or a machine learning algorithm that is then implemented into the code base by software engineers and data engineers.

Work includes:

  • Data modeling
  • Machine learning
  • Algorithms
  • Business Intelligence dashboards

Evolving Data Teams

These roles are still evolving. The process of ETL is getting much easier overall as new tools (like Stitch) enter the market, making it easy for software developers to set up and maintain data pipelines. Larger companies are pulling data engineers off the software engineering team entirely in favor of forming a centralized data team where infrastructure and analysis sit together. In some scenarios, data scientists are responsible for both data consolidation and analysis.

At this point, there is no single dominant path. But we expect this rapid evolution to continue. After all, data certainly isn’t getting any smaller.

Know Your Probability Distributions

In data science and statistics, probability distributions can be very important. I have been meaning to create a listing of them. However, I no longer need to since the fine folks at Cloudera have already created a list at Common Probability Distributions: The Data Scientist’s Crib Sheet.

Learn the distributions and pick a favorite. (My favorite of the common ones is the normal distribution. I also like the Cauchy distribution which is much less common.)
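As a small illustration of why those two favorites behave so differently, here is a Python sketch that draws samples from a standard normal and a standard Cauchy distribution and compares their sample means and medians. It is not taken from the Cloudera crib sheet; the function names and the seed are arbitrary choices for the example.

```python
import math
import random
import statistics


def normal_samples(n, rng):
    """Draw n samples from a standard normal distribution."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def cauchy_samples(n, rng):
    """Draw n samples from a standard Cauchy distribution."""
    # Inverse-CDF sampling: if U ~ Uniform(0, 1), tan(pi * (U - 0.5)) is standard Cauchy.
    return [math.tan(math.pi * (rng.random() - 0.5)) for _ in range(n)]


rng = random.Random(42)  # fixed seed so repeated runs give the same numbers
for n in (100, 10_000, 100_000):
    normal = normal_samples(n, rng)
    cauchy = cauchy_samples(n, rng)
    print(
        f"n={n:>7}  "
        f"normal mean={statistics.fmean(normal):+.3f}  "
        f"cauchy mean={statistics.fmean(cauchy):+.3f}  "
        f"cauchy median={statistics.median(cauchy):+.3f}"
    )
```

The normal sample mean settles toward 0 as n grows. The Cauchy distribution has no defined mean, so its sample mean need not settle however many samples are drawn, while its sample median still converges to 0.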

Our World In Data

Our World in Data is a data visualization site for exploring the history of civilization. The site was created by Max Roser. Our World in Data contains tons of information about many aspects of people’s lives. It also includes numerous visuals (like the one below) which can be easily shared or embedded on other sites.

Beware: the site is addictive, and you might spend a lot of time exploring data.