Data Science: The Paper that Started it All

Although Tobias Mayer may be known as the first data scientist, he did not coin the term data science. According to Wikipedia, the first use of the term data science was in 2001.

"Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics" was published in the April 2001 edition of the International Statistical Review. The author was William S. Cleveland, currently a Professor of Statistics at Purdue University.

The paper proposes a new field of study named data science. It then lists and explains six technical focus areas for a university data science department:

  1. Multidisciplinary Investigations
  2. Models and Methods for Data
  3. Computing with Data
  4. Pedagogy
  5. Tool Evaluation
  6. Theory

For the most part, the paper is still relevant. I did find a couple of good quotes from the paper that deserve comment.

"The primary agents for change should be university departments themselves."

That did not happen. The driving agents for change in the data science field have been some of the newer technology/web companies such as LinkedIn, Twitter, and Facebook (none of which even existed in 2001).

"…knowledge among computer scientists about how to think of and approach the analysis of data is limited, just as the knowledge of computing environments by statisticians is limited. A merger of the knowledge bases would produce a powerful force for innovation."

I think this statement still applies today. The world is just starting to realize the benefits of merging knowledge from computer science and statistics. There is much more work to do. Fortunately, businesses and universities are working to address the merger.

Have you seen the paper before? What are your thoughts on it?

Large-Scale Machine Learning at NYU

New York University is offering a Large-Scale Machine Learning course starting later this month. This is NOT a MOOC, so it is not open to everyone. However, the lecture videos will be posted, and possibly the other class handouts as well. This is not an introductory course, so knowledge of machine learning is a prerequisite. The course is being taught by John Langford of Microsoft Research and Yann LeCun of NYU.

For more about the course, see the original blog announcement.

Data Analysis Landscape

Jeff Leek, instructor of the upcoming Coursera Data Analysis course, wrote a nice blog post, The Landscape of Data Analysis, explaining the topics to be covered in the course. The topics look good. He also made a video explaining how data science fits in with other disciplines such as computer science, medicine, and statistics. The video is short (less than 5 minutes), so it is definitely worth the time.

If The World Were 100 People?

Not only is the topic interesting, but the concept of breaking the global population down into 100 people is brilliant. This infographic is easily understandable, and it conveys a whole lot of information in a clean and concise manner. For more about where the data came from, see the 100 People page.