The First Ever Data Scientist

It’s probably not who you think. It’s not DJ Patil or Hilary Mason. The first data scientist was Tobias Mayer. Who? Yeah, that’s exactly right, I had never heard of him either. Thankfully, John Rauser, a Data Scientist at Amazon, gave a great talk about this person at Strata New York 2011.

Well, Tobias was an astronomer way back in the mid-1700s. He spent a lot of time observing the libration (wobbling) of the moon, and he came up with the following formula:

$$\beta - x = y\alpha - z\alpha\sin\theta$$

He could measure x, y and z, so he needed to solve for alpha, beta and theta. With 3 observations he would have had 3 equations in 3 unknowns, which he could solve directly. That is when the real problem arose: Tobias had 27 observations instead of 3. He had too much data. This may have been the first known occurrence of big data. For more on Tobias Mayer’s solution, you will need to watch the video below. Hint: he strategically grouped the data.
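To make the grouping hint concrete, here is a minimal Python sketch of the idea: split the observations into three groups, add up the equations in each group, and solve the resulting 3 equations for the 3 unknowns. Everything specific here is my own assumption, not taken from the talk: the function names (solve3, fit_by_grouping, make_observations), the rule of sorting by y before cutting into thirds, and the synthetic data, which are not Mayer's measurements.

```python
import math
import random


def solve3(A, rhs):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, 3):
            factor = M[r][col] / M[col][col]
            for c in range(col, 4):
                M[r][c] -= factor * M[col][c]
    sol = [0.0, 0.0, 0.0]
    for i in (2, 1, 0):
        known = sum(M[i][j] * sol[j] for j in range(i + 1, 3))
        sol[i] = (M[i][3] - known) / M[i][i]
    return sol


def fit_by_grouping(observations):
    """Estimate (alpha, beta, theta) from (x, y, z) triples by summing groups."""
    # Hypothetical grouping rule: sort by y (the coefficient of alpha) and
    # cut the sorted list into three roughly equal groups.
    ordered = sorted(observations, key=lambda obs: obs[1])
    k = len(ordered) // 3
    groups = [ordered[:k], ordered[k:2 * k], ordered[2 * k:]]

    # beta - x = y*alpha - z*alpha*sin(theta) rearranges to
    #   beta - y*a + z*b = x,  with a = alpha and b = alpha*sin(theta),
    # which is linear in (beta, a, b). Adding up the equations in a group
    # gives one equation per group.
    A = [[len(g), -sum(o[1] for o in g), sum(o[2] for o in g)] for g in groups]
    rhs = [sum(o[0] for o in g) for g in groups]
    beta, a, b = solve3(A, rhs)

    ratio = max(-1.0, min(1.0, b / a))  # keep asin in its domain under noise
    return a, beta, math.asin(ratio)


def make_observations(alpha, beta, theta, n=27, noise=0.0, seed=0):
    """Build synthetic (x, y, z) triples; these are NOT Mayer's measurements."""
    rng = random.Random(seed)
    obs = []
    for i in range(n):
        t = -math.pi / 2 + math.pi * i / (n - 1)
        y, z = math.sin(t), math.cos(t)
        x = beta - y * alpha + z * alpha * math.sin(theta) + rng.gauss(0.0, noise)
        obs.append((x, y, z))
    return obs


if __name__ == "__main__":
    data = make_observations(alpha=1.5, beta=14.5, theta=0.3, noise=0.05)
    alpha_hat, beta_hat, theta_hat = fit_by_grouping(data)
    print(alpha_hat, beta_hat, theta_hat)
```

With noise set to 0, the sketch recovers the parameters it was given, because sums of exact equations are still exact. With noise, the errors inside each group of 9 partly cancel when summed, which is the quantitative case for using all 27 observations rather than picking 3.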

Rauser has this to say about why Mayer qualifies as the first data scientist:

“As far as I know, the first time in history that someone made a quantitative argument that more data is better.”

Rauser doesn’t stop there though. The rest of his talk goes on to explain the path to becoming a data scientist and the necessary skillset. Below are the skills he mentions.

  • Math
  • Engineering
  • Writing
  • Skepticism
  • Curiosity

So, do yourself a favor, and take a few minutes to watch this great talk.

As I watched this video, I kept asking myself the same question. Why have I never seen this video before?

GitHub Is Cool: They Like Data

Today, GitHub announced the release of archived public activity data called the GitHub public timeline. The dataset can be queried via the Google BigQuery tool.

To make things even more awesome, GitHub is also hosting a Data Challenge. The challenge is to play around with the data and create the best visualization possible. You had better start now, because the competition ends May 21st. I am not familiar with Google BigQuery, so this might be a good time to learn.

This should not surprise anyone. GitHub is always doing cool things, especially for developer-minded people. If you don’t know, GitHub is the best place for hosting your source code.

Large Scale Text Processing with MapReduce: A Free Textbook

Data-Intensive Text Processing with MapReduce is a free online (PDF) textbook about text processing on large amounts of data. The 1st edition has been available for a couple of years, and a 2nd edition is in the works. Here is a quick overview of some of the topics.

  • MapReduce
  • Graph Algorithms
  • Text Processing

Happy Reading (and Text Processing)!