Category Archives: Guest Post

Data Scientists, Data Engineers, Software Engineers: The Difference According to LinkedIn

The differences between data scientists, data engineers, and software engineers can get a little confusing at times. Thus, here is a guest post provided by Jake Stein, CEO at Stitch (formerly RJMetrics), which aims to clear up some of that confusion based on LinkedIn data.

As data grows, so does the expertise needed to manage it. The past few years have seen an increasing distinction between the key roles tasked with managing data: software engineers, data engineers, and data scientists.

More and more we’re seeing data engineers emerge as a subset within the software engineering discipline, but this is still a relatively new trend. Plenty of software engineers are still tasked with moving and managing data.

Our team has released two reports over the past year: one focused on understanding the data science role, the other on data engineering. Both of these reports are based on self-reported LinkedIn data. In this post, I’ll lay out the distinctions between these roles and software engineers, but first, here’s a diagram to show you (in very broad strokes) what we saw in the skills breakdown between these three roles:

Data Roles and Skill Sets
A comparison of software engineers vs data engineers vs data scientists

Software Engineer

A software engineer builds applications and systems. Developers are involved in all stages of this process, from design to writing code to testing and review. They create the products that create the data. Software engineering is the oldest of these three roles, and has established methodologies and tool sets.

Work includes:

  • Frontend and backend development
  • Web apps
  • Mobile apps
  • Operating system development
  • Software design

Data Engineer

A data engineer builds systems that consolidate, store, and retrieve data from the various applications and systems created by software engineers. Data engineering emerged as a niche skill set within software engineering. 40% of all data engineers previously worked as software engineers, making this the most common career path for data engineers by far.

Work includes:

  • Advanced data structures
  • Distributed computing
  • Concurrent programming
  • Knowledge of new & emerging tools: Hadoop, Spark, Kafka, Hive, etc.
  • Building ETL/data pipelines

Data Scientist

A data scientist builds analyses on top of data. This may come in the form of a one-off analysis for a team trying to better understand customer behavior, or a machine learning algorithm that is then implemented in the code base by software engineers and data engineers.

Work includes:

  • Data modeling
  • Machine learning
  • Algorithms
  • Business Intelligence dashboards

Evolving Data Teams

These roles are still evolving. The process of ETL is getting much easier overall as new tools (like Stitch) enter the market, making it easy for software developers to set up and maintain data pipelines. Larger companies are pulling data engineers off the software engineering team entirely in favor of forming a centralized data team where infrastructure and analysis sit together. In some scenarios, data scientists are responsible for both data consolidation and analysis.

At this point, there is no single dominant path. But we expect this rapid evolution to continue; after all, data certainly isn’t getting any smaller.

Data Science and the Perfect Team

Today, I am proud to welcome a guest post by Claire Gilbert, Data Analyst at Gongos. For more on Gongos, see the description at the end of the post.

It’s fair to say that many who run in business intelligence circles admire the work of Fast Forward Labs CEO and Founder Hilary Mason. Perhaps what resonates most with her fans is the moniker she gives data scientists: ‘awesome nerds,’ those who embody the perfect skillsets of math and stats, coding, and communication. She asserts that these individuals have the technical expertise to conduct the really, really complex work, but also the ability to explain the impact of that work to a non-technical audience.

As insights and analytics organizations strive to assemble their own group of ‘awesome nerds,’ there are two ways to consider Hilary’s depiction. Most organizations struggle by taking the first route: searching for those very expensive, highly rare unicorns, individuals who independently sit at this critical intersection of genius. Besides the fact that it would be even more expensive to clone these data scientists, there is simply not enough bandwidth in their day to deliver on their awesomeness 24/7.

To quote Aristotle, one of history’s earliest scientists, “the whole is greater than the sum of its parts,” which brings us to the notion of the team. Rather than seeking out those highly sought-after individuals with skills in all three camps, consider creating a collective of individuals with skills from each camp. After all, no one person can cover the depth and breadth of an organization’s growing data science needs. It takes both a specialist, such as a mathematician, to dive deep and a multidisciplinary mind to comprehend the breadth in order to truly achieve the perfect team.

Awesome Nerds

Team Dynamics of the Data Kind

The ultimate charge for any data science team is to be a problem-solving machine, one that constantly churns in an ever-changing climate. An increasing abundance of data, which in turn gives rise to once-unanswerable business questions, has led clients to expect new levels of complexity in insights. This chain reaction brings with it a unique set of challenges not previously met by a prescribed methodology. As the sets of inputs become more diverse, so too should the skillsets to answer them. While all three characteristics of the ‘awesome nerd’ are indispensable, it’s the collective of ‘nerds’ that will become the driving force in today’s data world.
True to the construct, no two pieces should operate independent of the third. Furthermore, finding and honing balance within a data science team will result in the highest degree of accuracy and relevancy possible.
Let’s look at the makeup of a perfectly balanced team:

  • Mathematician/Statistician:
    This trained academic builds advanced models based on inputs, while understanding the theory and requirements for the results to be leveraged correctly.
  • Coder/Programmer:
    This hands-on ‘architect’ is in charge of cleaning, managing and reshaping data, as well as building simulators or other highly technical tools that result in user-friendly data.
  • Communicator/Content Expert:
    This business ‘translator’ applies an organizational lens to bring previous knowledge to the table in order to connect technical skill sets to client needs.

It’s the interdependence of these skillsets that completes the team and its ability to deliver fully on the promise of data:
A Mathematician/Statistician’s work relies heavily on the Coder/Programmer’s skills. The notion of garbage-in/garbage-out very much applies here. If the Coder hasn’t sourced and managed the data judiciously, the Mathematician cannot build usable models. Both then rely on the knowledge of the Communicator/Content Expert. Even if the data is perfect and the results statistically correct, the output cannot be acted upon unless it is directly relevant to the business challenge. Furthermore, teams out of balance will be faced with hurdles for which they are not adequately prepared, and output that is not adequately delivered.

To Buy or to Build?

In today’s world of high velocity and high volume of data, companies are faced with a choice. Traditional programmers, like those who have coded surveys and collected data, are currently integrated into the work streams of most insights organizations. However, many of them are not classically trained in math and/or statistics. Likewise, existing quantitative-minded, client-facing talents can be leveraged in the rebuilding of a team. Training existing individuals who have a bent for math and/or stats is possible, yet it is a time-intensive process that calls for patience. If organizations value and believe in their existing talent and choose to go this route, it will then point to the gaps that need to be filled, or bought, to build the ‘perfect’ team.
Organizations have long known the value of data, but no matter how large and detailed it gets, without the human dimension it will fail to live up to its projected $30 billion market value by 2019. The interpretation, distillation and curation of all kinds of data by a team in equilibrium will propel this growth and underscore the importance of data science.
Many people think Hilary’s notion of “awesome nerds” applies only to individuals. But in practice, to realize this kind of market potential, the team must embody the constitution of awesomeness.
As organizations assemble and recruit teams, perhaps their mission statement quite simply should be…
If you can find the nerds, keep them, but in the absence of an office full of unicorns, create one.

About Gongos

Gongos, Inc. is a decision intelligence company that partners with Global 1000 corporations to help build the capability and competency in making great consumer-minded decisions. Gongos brings a consultative approach in developing growth strategies propelled by its clients’ insights, analytics, strategy and innovation groups.

Enlisting the multidisciplinary talents of researchers, data scientists and curators, the company fuels a culture of learning both internally and within its clients’ organizations. Gongos also works with clients to develop strategic frameworks to navigate the change required for executional excellence. It serves organizations in the consumer products, financial services, healthcare, lifestyle, retail, and automotive spaces.

How to Kickstart Your Data Science Career

This is a guest post from Michael Li of The Data Incubator. The Data Incubator runs a free eight-week data science fellowship to help transition their Fellows from academia to industry. This post runs through some of the toolsets you’ll need to know to kickstart your data science career.

 

If you’re an aspiring data scientist but still processing your data in Excel, you might want to upgrade your toolset.  Why?  Firstly, while advanced features like Excel Pivot tables can do a lot, they don’t offer nearly the flexibility, control, and power of tools like SQL, or their functional equivalents in Python (Pandas) or R (Dataframes).  Also, Excel has low size limits, making it suitable for “small data”, not  “big data.”

In this blog entry we’ll talk about SQL.  This should cover your “medium data” needs, which we’ll define as the next level of data, where the rows no longer fit within Excel’s roughly 1 million row limit.  SQL stores data in tables, which you can think of as a spreadsheet layout but with more structure.  Each row represents a specific record (e.g. an employee at your company) and each column of a table corresponds to an attribute (e.g. name, department id, salary).  Critically, all values in a column must be of the same “type”.  Here is a sample of the table Employees:

EmployeeId Name StartYear Salary DepartmentId
1 Bob 2001 10.5 10
2 Sally 2004 20 10
3 Alice 2005 25 20
4 Fred 2004 12.5 20
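If you want to follow along, this sample table is easy to reproduce with Python’s built-in sqlite3 module; the in-memory database below is just an assumption for illustration, not part of the original post:

```python
import sqlite3

# In-memory database for illustration; a real warehouse would live elsewhere.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Mirror the sample Employees table from the post.
cur.execute("""
    CREATE TABLE Employees (
        EmployeeId   INTEGER PRIMARY KEY,
        Name         TEXT,
        StartYear    INTEGER,
        Salary       REAL,
        DepartmentId INTEGER
    )
""")
cur.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?, ?)",
    [(1, "Bob", 2001, 10.5, 10),
     (2, "Sally", 2004, 20, 10),
     (3, "Alice", 2005, 25, 20),
     (4, "Fred", 2004, 12.5, 20)],
)

rows = cur.execute("SELECT * FROM Employees ORDER BY EmployeeId").fetchall()
print(rows)  # one tuple per employee
```

Each tuple that comes back corresponds to one row of the table above, with the Salary column stored as a floating-point number because the column type is REAL.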

SQL has many keywords in its query language, but the ones most relevant to data scientists are SELECT, WHERE, GROUP BY, and JOIN.  We’ll go through each of these individually.

SELECT

SELECT is the foundational keyword in SQL; it retrieves rows and can restrict the output to specific columns.  For example

SELECT Name, StartYear FROM Employees

returns

Name StartYear
Bob 2001
Sally 2004
Alice 2005
Fred 2004

 

WHERE

The WHERE clause filters the rows. For example

SELECT * FROM Employees WHERE StartYear=2004

returns

EmployeeId Name StartYear Salary DepartmentId
2 Sally 2004 20 10
4 Fred 2004 12.5 20
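As a runnable sketch of the same WHERE query (the in-memory sqlite3 setup is an assumption for illustration; the ORDER BY is added only to make the row order deterministic):

```python
import sqlite3

# Rebuild the sample Employees table in memory, then filter rows with WHERE.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE Employees (
        EmployeeId INTEGER PRIMARY KEY, Name TEXT,
        StartYear INTEGER, Salary REAL, DepartmentId INTEGER
    )
""")
cur.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?, ?)",
    [(1, "Bob", 2001, 10.5, 10), (2, "Sally", 2004, 20, 10),
     (3, "Alice", 2005, 25, 20), (4, "Fred", 2004, 12.5, 20)],
)

# WHERE keeps only the rows matching the condition.
hires_2004 = cur.execute(
    "SELECT * FROM Employees WHERE StartYear = 2004 ORDER BY EmployeeId"
).fetchall()
print(hires_2004)  # Sally and Fred, the two 2004 hires
```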

 

GROUP BY

Next, the GROUP BY clause collapses rows that share a value and aggregates each group using functions like COUNT (count) and AVG (average). For example,

SELECT StartYear, COUNT(*) as Num, AVG(Salary) as AvgSalary
FROM Employees
GROUP BY StartYear

returns

StartYear Num AvgSalary
2001 1 10.5
2004 2 16.25
2005 1 25
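The same aggregation can be sketched with sqlite3 (again an in-memory copy of the sample data, which is an assumption for illustration; ORDER BY just fixes the output order):

```python
import sqlite3

# In-memory copy of the sample data; GROUP BY collapses rows sharing a
# StartYear and aggregates each group with COUNT and AVG.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE Employees (
        EmployeeId INTEGER PRIMARY KEY, Name TEXT,
        StartYear INTEGER, Salary REAL, DepartmentId INTEGER
    )
""")
cur.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?, ?)",
    [(1, "Bob", 2001, 10.5, 10), (2, "Sally", 2004, 20, 10),
     (3, "Alice", 2005, 25, 20), (4, "Fred", 2004, 12.5, 20)],
)

by_year = cur.execute("""
    SELECT StartYear, COUNT(*) AS Num, AVG(Salary) AS AvgSalary
    FROM Employees
    GROUP BY StartYear
    ORDER BY StartYear
""").fetchall()
print(by_year)  # one tuple per distinct StartYear
```

Note that the two 2004 hires collapse into a single output row, with AVG computed over both salaries.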

 

JOIN

Finally, the JOIN clause allows us to join in other tables. For example, assume we have a table called Departments:

DepartmentId DepartmentName
10 Sales
20 Engineering

We could use JOIN to combine the Employees and Departments tables based ON the DepartmentId fields:

SELECT Employees.Name AS EmpName, Departments.DepartmentName AS DepName
FROM Employees JOIN Departments
ON Employees.DepartmentId = Departments.DepartmentId;

The results might look like:

EmpName DepName
Bob Sales
Sally Sales
Alice Engineering
Fred Engineering

We’ve ignored a lot of details about joins: e.g. there are actually (at least) 4 types of joins, but hopefully this gives you a good picture.

Conclusion and Further Reading

With these basic commands, you can get a lot of basic data processing done.  Don’t forget that you can nest queries and create really complicated joins.  It’s a lot more powerful than Excel, and gives you much better control of your data.  Of course, there’s a lot more to SQL than what we’ve mentioned; this is only intended to whet your appetite and give you a taste of what you’re missing.

 

And when you’re ready to step it up from “medium data” to “big data”, you should apply for a fellowship at The Data Incubator where we work with current-generation data-processing technologies like MapReduce and Spark!

The Most Popular Skills and Degrees of Today’s Data Scientists

Today, we are lucky to have Daniel Levine of RJMetrics provide a guest post. RJMetrics created an extensive report detailing The State of Data Science. I asked Daniel to provide some results as they relate to the current education of data scientists.

Recently, RJMetrics released a benchmark report that looked to answer many of the questions people have about today’s data scientists, such as how many data scientists there are, what degrees they have, and what skills they possess.

From LinkedIn data on the 11,400 data scientists working now, we can get a much better sense of what types of data scientists companies are hiring, and how senior data scientists differ from their junior counterparts.

Education Levels

While it was typical to see data scientists report multiple degrees, when we looked at the percentages of all distinct bachelor’s, master’s, and doctorate degrees, we found that 42% finished their education with a master’s.

Highest Education Level of Data Scientists

The high number of data scientists who receive graduate degrees (79%) is indicative of the increasing demand for specialists and a desire from data scientists for advanced training.

Additionally, these numbers may indicate that data science is simply attracting highly educated individuals because of its sexy and lucrative career path.

So what does this distribution look like as you climb the corporate ladder? You may assume that the higher the position, the more PhDs; but in fact, across Junior, Senior, and Chief Data Scientists, we saw the highest ratio of PhDs to Master’s at the Senior level.

Data Scientist's Education Level By Seniority

We speculate that the drop from 43% at the Senior level to 35% at the Chief level actually reflects how long those individuals have been in the field. In a study by Heidrick & Struggles titled “Understanding Today’s Chief Data Scientist,” they found that Chief Data Scientists “average nearly 15 years of post-degree commercial (PDC) experience.” What we’re likely seeing in this data is the “first crop” of Chief Data Scientists who earned this title in the field, not in the classroom.

Subjects Studied

When we looked at what data scientists studied during their education, we found that besides Business Administration/Management, they were mostly STEM-focused.

Educational Background of Data Scientists

We believe that Computer Science is so popular because a data scientist without CS skills is at an extreme disadvantage because they won’t be able to extract the data well enough to properly analyze it. DJ Patil and Hilary Mason, in their book Creating a Data Culture, went as far as to say, “a data scientist who lacks the tools to get data from a database into an analysis package and back out again will become a second-class citizen in the technical organization.”

Skills Reported

In analyzing 254,600 records of skills, we found the most popular skills to be more generic than we’d expect. Popular buzz terms like “big data” and “hadoop” didn’t crack the top 10, while programming languages like “r” and “python” are extremely popular among data scientists.

Top 20 Data Science Skills

When the data was sliced by seniority, we saw a major difference between Junior, Senior, and Chief levels. To make these differences easier to digest, we compared each level to the same common denominator: the average data scientist.

Data Science Skills Difference By Seniority

Again, the data on chief data scientists is of particular interest. These C-suite professionals are more likely than both junior and senior data scientists to list skills like “business intelligence,” “analytics,” “leadership,” “strategy,” and “management,” but less likely to list skills on the more technical side, like “python” and “r”.

While it’s true that chief data scientists may be simply emphasizing skills that are more relevant to their position within the company, we also speculate that many chief data scientists assumed these roles by virtue of being in the field longer or having additional qualifications, such as a business degree. Therefore, it is also possible that some chief data scientists never actually learned many of the skills listed by more junior people.

If you’d like more analysis about this data and a more detailed explanation about our methods, you can check out the full State of Data Science.

Free Big Data Analytics Handbook

Brian Liou from Leada was kind enough to provide a guest post about their latest handbook, The Data Analytics Handbook: Big Data Edition.

Data Analytics Handbook
Have you ever wondered what the deal was behind all the hype of “big data”? Well, so did we. In 2014, data science hit peak popularity, and as graduates with degrees in statistics, business, and computer science from UC Berkeley, we found ourselves with a unique skill set that was in high demand. We recognized that as recent graduates, our foundational knowledge was purely theoretical; we lacked industry experience; we also realized that we were not alone in this predicament.

And so, we sought out those who could supplement our knowledge, interviewing leaders, experts, and professionals, the giants in our industry. What began as a quest for the reality behind the buzzwords of “big data” and “data science” quickly turned into The Data Analytics Handbook, the first educational product of our startup Leada (see www.teamleada.com). Thirty-plus interviews and four editions later, the handbook has been downloaded over 30,000 times by readers from all over the world.

In the handbooks, you’ll discover whether “big data” is overblown, what skills your portfolio companies should look for when hiring a data scientist, how leading “big data” and analytics companies interview, and which industries will be most impacted by the disruptive power of data science. We hope you enjoy reading these interviews as much as we enjoyed creating them!
Download all 4 handbooks at www.teamleada.com/handbook

Data Sources for Cool Data Science Projects: Part 2 – Guest Post


I am excited for the first ever guest posts on the Data Science 101 blog. Dr. Michael Li, Executive Director of The Data Incubator in New York City, is providing 2 great posts (see Part 1) about finding data for your next data science project.

At The Data Incubator, we run a free six week data science fellowship to help our Fellows land industry jobs. Our hiring partners love considering Fellows who don’t mind getting their hands dirty with data. That’s why our Fellows work on cool capstone projects that showcase those skills. One of the biggest obstacles to successful projects has been getting access to interesting data. Here are some more cool public data sources you can use for your next project:

Data With a Cause:

  1. Environmental Data: Data on household energy usage is available as well as NASA Climate Data.
  2. Medical and Biological Data: You can get anything from anonymous medical records, to remote sensor readings for individuals, to genome data from the 1000 Genomes Project.

Miscellaneous:

  1. Geo Data: Try looking at these Yelp Datasets for venues near major universities and one for major cities in the Southwest. The Foursquare API is another good source. Open Street Map has open data on venues as well.
  2. Twitter Data: you can get access to Twitter Data used for sentiment analysis, network Twitter Data, social Twitter data, on top of their API.
  3. Games Data: Datasets for games are available, including a large dataset of poker hands, a dataset of online Dominion games, and datasets of chess games.
  4. Web Usage Data: Web usage data is a common dataset that companies look at to understand engagement. Available datasets include Anonymous usage data for MSNBC, Amazon purchase history (also anonymized), and Wikipedia traffic.

Metasources: these are great sources for other web pages.

  1. Stanford Network Data: http://snap.stanford.edu/index.html
  2. Every year, the ACM holds a competition for machine learning called the KDD Cup. Their data is available online.
  3. UCI maintains archives of data for machine learning.
  4. US Census Data
  5. Amazon is hosting Public Datasets on s3
  6. Kaggle hosts machine-learning challenges and many of their datasets are publicly available
  7. The cities of Chicago, New York, Washington DC, and SF maintain public data warehouses.
  8. Yahoo maintains a lot of data on its web properties which can be obtained by writing them.
  9. BigML is a blog that maintains a list of public datasets for the machine learning community.
  10. Finally, if there’s a website with data you are interested in, crawl for it!
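On the crawling point, a minimal link extractor takes only the standard library; this sketch parses a hard-coded sample page (an assumption standing in for HTML you would fetch with urllib in practice):

```python
from html.parser import HTMLParser

# Collect the href of every <a> tag encountered while parsing.
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page content; in a real crawl you would fetch this over HTTP.
sample_page = """
<html><body>
  <a href="/datasets/poker.csv">Poker hands</a>
  <a href="/datasets/chess.pgn">Chess games</a>
</body></html>
"""

parser = LinkExtractor()
parser.feed(sample_page)
print(parser.links)  # the two dataset links found on the page
```

From a list of links like this, a crawler just queues the URLs it hasn’t seen yet and repeats; do check a site’s terms of service and robots.txt before crawling it.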

 

While building your own project cannot replicate the experience of fellowship at The Data Incubator (our fellows get amazing access to hiring managers and access to nonpublic data sources) we hope this will get you excited about working in data science. And when you are ready, you can apply to be a Fellow!

Got any more data sources? Let us know or leave a comment and we’ll add them to the list!

 


Data Sources for Cool Data Science Projects: Part 1 – Guest Post


I am excited for the first ever guest posts on the Data Science 101 blog. Dr. Michael Li, Executive Director of The Data Incubator in New York City, is providing 2 great posts (see Part 2) about finding data for your next data science project.

At The Data Incubator, we run a free six week data science fellowship to help our Fellows land industry jobs. Our hiring partners love considering Fellows who don’t mind getting their hands dirty with data. That’s why our Fellows work on cool capstone projects that showcase those skills. One of the biggest obstacles to successful projects has been getting access to interesting data. Here are a few cool public data sources you can use for your next project:

Economic Data:

  1. Publicly Traded Market Data: Quandl is an amazing source of finance data. Google Finance and Yahoo Finance are additional good sources of data. Corporate filings with the SEC are available on EDGAR.
  2. Housing Price Data: You can use the Trulia API or the Zillow API.
  3. Lending data: You can find student loan defaults by university and the complete collection of peer-to-peer loans from Lending Club and Prosper, the two largest platforms in the space.
  4. Home mortgage data: There is data made available by the Home Mortgage Disclosure Act and there’s a lot of data from the Federal Housing Finance Agency available here.

Content Data:

  1. Review Content: You can get reviews of restaurant and physical venues from Foursquare and Yelp (see geodata). Amazon has a large repository of Product Reviews. Beer reviews from Beer Advocate can be found here. Rotten Tomatoes Movie Reviews are available from Kaggle.
  2. Web Content: Looking for web content? Wikipedia provides dumps of their articles. Common Crawl has a large corpus of the internet available. ArXiv maintains all their data available via Bulk Download from AWS S3. Want to know which URLs are malicious? There’s a dataset for that. Music data is available from the Million Songs Database. You can analyze the Q&A patterns on sites like Stack Exchange (including Stack Overflow).
  3. Media Data: There are open annotated articles from the New York Times, the Reuters Dataset, and the GDELT project (a consolidation of many different news sources). Google Books has published NGrams for books going back past 1800.
  4. Communications Data: There’s access to public messages of the Apache Software Foundation and communications among former Enron execs.

Government Data:

  1. Municipal Data: Crime Data is available for City of Chicago, and Washington DC. Restaurant Inspection Data is available for Chicago and New York City.
  2. Transportation Data: NYC Taxi Trips in 2013 are available courtesy of the Freedom of Information Act. There’s bikesharing data from NYC, Washington DC, and SF. There’s also Flight Delay Data from the FAA.
  3. Census Data: Japanese Census Data. US Census data from 2010, 2000, 1990. From census data, the government has also derived time use data. EU Census Data. Checkout popular male / female baby names going back to the 19th Century from the Social Security Administration.
  4. World Bank: they have a lot of data available on their website.
  5. Election Data: Political contribution data for the last few US elections can be downloaded from the FEC here and here. Polling data is available from Real Clear Politics.

 

While building your own project cannot replicate the experience of fellowship at The Data Incubator (our fellows get amazing access to hiring managers and access to nonpublic data sources) we hope this will get you excited about working in data science. And when you are ready, you can apply to be a Fellow!

Got any more data sources? Let us know or leave a comment and we’ll add them to the list!