Introduction to Microsoft R Open (Webinar)

Tomorrow, January 28, 2016, David Smith will present a webinar titled Introduction to Microsoft R Open. David is the R Community Lead at Microsoft. The webinar will discuss:

  • Introduction to R
  • History of R
  • Enhancements of Microsoft R Open (Microsoft’s enhanced distribution of open-source R)
  • CRAN Time Machine
  • Reproducible Data Analysis

If you are looking to get started with R or get more from R, this webinar will be worth your time.

Plus, the webinar is the first in a series of Microsoft webinars focused on R.


Full Disclosure: I work for Microsoft, and I will be helping (in a very minimal capacity) with the webinar.

Data Science for Social Good Fellowship 2016

Once again, Data Science for Social Good will be hosting a fellowship during the summer of 2016. The summer program trains data scientists by having them work on important and impactful projects. The fellowship is paid and lasts 12 weeks in Chicago, Illinois. The deadline to apply is February 1, 2016, so you need to hurry if you are planning to apply. The projects vary every summer, but a list of potential and past projects includes:

  • Education Enrollment
  • Government Spending
  • Energy Usage
  • Crime Prevention
  • Maternal Mortality
  • And many more….

Also, the program is looking for mentors. If you are an experienced data scientist in either academia or industry, you are encouraged to apply and spend your summer helping to work on some very important problems.

The Most Popular Skills and Degrees of Today’s Data Scientists

Today, we are lucky to have Daniel Levine of RJMetrics provide a guest post. RJMetrics created an extensive report detailing The State of Data Science. I asked Daniel to provide some results as they relate to the current education of data scientists.

Recently, RJMetrics released a benchmark report that looked to answer many of the questions people have about today’s data scientists, such as how many data scientists there are, what degrees they have, and what skills they possess.

From LinkedIn data on the 11,400 data scientists working now, we can get a much better sense of what types of data scientists companies are hiring, and how senior data scientists differ from their junior counterparts.

Education Levels

While it was typical to see data scientists report multiple degrees, when we looked at the percentages of all distinct bachelor’s, master’s, and doctorate degrees, we found that 42% finished their education with a master’s.

Highest Education Level of Data Scientists

The high number of data scientists who receive graduate degrees (79%) is indicative of the increasing demand for specialists and a desire among data scientists for advanced training.

Additionally, these numbers may indicate that data science is simply attracting highly educated individuals because of its sexy and lucrative career path.

So what does this distribution look like as you climb the corporate ladder? You may assume that the higher the position, the more PhDs; but in fact, across Junior, Senior, and Chief Data Scientists, we saw the highest ratio of PhDs to Master’s at the Senior level.

Data Scientist's Education Level By Seniority

We speculate that the drop from 43% at the Senior level to 35% at the Chief level actually reflects how long those individuals have been in the field. In a study titled “Understanding Today’s Chief Data Scientist,” Heidrick & Struggles found that Chief Data Scientists “average nearly 15 years of post-degree commercial (PDC) experience.” What we’re likely seeing in this data is the “first crop” of Chief Data Scientists who earned this title in the field, not in the classroom.

Subjects Studied

When we looked at what data scientists studied during their education, we found that besides Business Administration/Management, they were mostly STEM-focused.

Educational Background of Data Scientists

We believe that Computer Science is so popular because a data scientist without CS skills is at an extreme disadvantage: they won’t be able to extract the data well enough to properly analyze it. DJ Patil and Hilary Mason, in their book Creating a Data Culture, went as far as to say, “a data scientist who lacks the tools to get data from a database into an analysis package and back out again will become a second-class citizen in the technical organization.”

Skills Reported

In analyzing 254,600 records of skills, we found the most popular skills to be more generic than we’d expect. Popular buzz terms like “big data” and “hadoop” didn’t crack the top 10, while programming languages like “r” and “python” are extremely popular among data scientists.

Top 20 Data Science Skills

When the data was sliced by seniority, we saw a major difference between Junior, Senior, and Chief levels. To make these differences easier to digest, we compared each level to the same common denominator: the average data scientist.

Data Science Skills Difference By Seniority
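
As a rough illustration of the comparison described above, here is a minimal Python sketch. The report's exact method is not spelled out in this post, so everything here is an assumption: the function names, the tiny made-up profiles, and the choice to measure each level as the share of profiles listing a skill minus the share across all data scientists.

```python
from collections import Counter


def skill_share(profiles):
    """Return the fraction of profiles that list each skill."""
    if not profiles:
        return {}
    # set() so a skill repeated within one profile is counted once
    counts = Counter(skill for skills in profiles for skill in set(skills))
    return {skill: n / len(profiles) for skill, n in counts.items()}


def difference_from_average(profiles_by_level):
    """Compare each seniority level to the pooled 'average data scientist'."""
    all_profiles = [p for ps in profiles_by_level.values() for p in ps]
    baseline = skill_share(all_profiles)
    result = {}
    for level, profiles in profiles_by_level.items():
        share = skill_share(profiles)
        # Positive value: the level lists the skill more often than average
        result[level] = {
            skill: share.get(skill, 0.0) - base
            for skill, base in baseline.items()
        }
    return result


# Hypothetical toy data, NOT taken from the report
toy_profiles = {
    "junior": [["python", "r"], ["python"]],
    "chief": [["leadership"], ["leadership", "python"]],
}
diffs = difference_from_average(toy_profiles)
print(diffs["chief"]["leadership"])  # positive: over-represented among chiefs
print(diffs["chief"]["python"])      # negative: under-represented among chiefs
```

A ratio to the average would work just as well as a difference; the subtraction is used here only because it keeps skills that a level never lists at a finite value.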

Again, the chief data scientist data is of particular interest. These C-suite professionals are more likely than both junior and senior data scientists to list skills like “business intelligence,” “analytics,” “leadership,” “strategy,” and “management,” but less likely to list skills on the more technical side, like “python” and “r”.

While it’s true that chief data scientists may be simply emphasizing skills that are more relevant to their position within the company, we also speculate that many chief data scientists assumed these roles by virtue of being in the field longer or having additional qualifications, such as a business degree. Therefore, it is also possible that some chief data scientists never actually learned many of the skills listed by more junior people.

If you’d like more analysis about this data and a more detailed explanation about our methods, you can check out the full State of Data Science.

Big Data & Analytics Summit Canada 2016

Early next year (February 2016 to be exact), Canada’s first Cross-Industry Big Data Summit will be held in Toronto. The speaker lineup looks strong with presenters from Twitter, Rogers, Boeing and more. Also, the conference contains both a technical and a business track.

When: February 17th & 18th 2016
Where: Sheraton Centre Toronto
More Info: http://www.bigdatasummitcanada.com/

Below are some further details from the conference organizers:

From retail to telecommunications, e-commerce, transportation and banking, big data is revolutionizing the way Canadian businesses operate. With expanded storage capacity and new forms of data processing, big data analytics is paving the way for more confident corporate decision-making.

Engage in two days of top content intended to help you maintain your competitive advantage as a data-driven organization. Gather key strategies to extract actionable insights, optimize your processes, and strengthen your client relationships.

NEW in 2016: Specialized tracks for Technical and Business Use. Attend sessions most pertinent to your learning needs.

This is your chance to gain the knowledge you need to solve your big data problems and monetize the immense business potential that lies in your untapped data. Engage in dialogue with cross-industry big data innovators and source solutions for immediate implementation within your organization.

The Data Science Industry: Who Does What

The fine folks at DataCamp, a great site for learning data science right in your browser, have come up with another great infographic. This time it compares some of the many job titles in the data science field.

The infographic lays out the roles and skills needed for the following job titles. Note: not all the job roles can be confused with a data scientist, but all the roles can be important when completing an entire data science project.

  • Data Scientist
  • Data Analyst
  • Data Architect
  • Data Engineer
  • Statistician
  • Database Administrator
  • Business Analyst
  • Data & Analytics Manager
The Data Science Industry: Who Does What

Want a Quick Jupyter Notebook?

If you have been hearing about Jupyter (formerly IPython) and have not tried it out, here are a couple of quick, free, and easy options for giving it a try. No installation needed, and no account setup. Just visit a link.

Easy Jupyter Notebook

Try Jupyter and tmpnb are two projects for instantly getting a Jupyter notebook with just a simple URL. Tmpnb was created by Rackspace for Nature, and Try Jupyter is a demo from the main Jupyter website. I believe both projects use the same open source code found on GitHub. They might even be two URLs to the same infrastructure.

The major limitation is that you cannot come back to your notebook later (which is not a problem if you host the Jupyter notebook on your own). The notebooks die after some time of inactivity, but you can always create a new one. For more on the design decisions, see the Rackspace blog post, How did we serve more than 20,000 IPython notebooks for Nature readers? or join the open source project on GitHub.

If you have been wishing to try out Jupyter, it cannot get much easier than these options.

jupyter notebook

Dat – Version Controlled Data

Dat is an open source project focusing on data storage. In particular, the project wants to version control data. What is version control? In short, it tracks the history of changes to something (typically source code files or documents). Dat takes the idea a bit further: the data is versioned at the row level, not the file level. Plus, it is built for collaboration among teams.
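
To make the row-level idea concrete, here is a small Python sketch. It is not Dat's API or implementation; the class and method names are hypothetical, and it only illustrates what it means to keep a history per row instead of per file.

```python
class RowVersionedTable:
    """Toy in-memory table that keeps a change history for every row key.

    Illustrative only: this is NOT how Dat is implemented.
    """

    def __init__(self):
        self._history = {}  # key -> list of (version, row dict or None)
        self._version = 0   # global counter, bumped on every change

    def put(self, key, row):
        """Record a new version of one row and return the version number."""
        self._version += 1
        self._history.setdefault(key, []).append((self._version, dict(row)))
        return self._version

    def delete(self, key):
        """Record a deletion as a version whose contents are None."""
        self._version += 1
        self._history.setdefault(key, []).append((self._version, None))
        return self._version

    def get(self, key, at_version=None):
        """Return the row as of a given version (latest if not specified)."""
        limit = self._version if at_version is None else at_version
        result = None
        for version, row in self._history.get(key, []):
            if version > limit:
                break
            result = row
        return dict(result) if result is not None else None

    def row_log(self, key):
        """List the versions at which this particular row changed."""
        return [version for version, _ in self._history.get(key, [])]


table = RowVersionedTable()
v1 = table.put("alice", {"city": "Chicago"})
v2 = table.put("bob", {"city": "Toronto"})
v3 = table.put("alice", {"city": "Seattle"})

print(table.get("alice"))                 # latest version of the row
print(table.get("alice", at_version=v1))  # the row as it was at v1
print(table.row_log("bob"))               # bob changed once, alice twice
```

With file-level versioning, the change to "alice" would produce a new version of the whole file; here only that row's history grows, and "bob" is untouched.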

Use the online tutorial to learn more.

Dat is currently in beta. This is going to be a very interesting project to watch. I can see many great use cases.

Learning To Be A Data Scientist