Introduction to Apache Mahout Slides

Although Apache Mahout is not an absolute beginner's topic in data science, this slide deck gives a nice overview of machine learning and ends with some excellent links. In case you are wondering, Mahout is a scalable machine learning library for very large data sets.

The slides were prepared by Varad Meru.

How To Build Data Science Teams?

Companies everywhere are struggling to assemble data science teams. Here are a couple of videos to help answer the following questions and more.

  • How do you assemble a team?
  • What skills do you need?
  • Where do you look for data scientists?

      DJ Patil, one of the stars of the data science world, answers a bunch of great questions in this talk. It is a couple of years old, but still relevant.

      What are the Characteristics to look for in a Data Scientist?

      • Curiosity
      • Passion for playing with data
      • History of having to manipulate data to solve problems


      What are the Key Data Science Skills?

      • Finding Data Sources
      • Working with large data sets despite constraints
      • Cleaning data
      • Merging data sets
      • Visualization
      • Building tools for others to use


      Where to look for data science team members?

      • Internal
      • Interns
      • Other fields (physics, neurology, sciences)
      • Academic counterparts


      Principles for Data Science Talent

      • Would we be willing to work on a startup together?
      • Can you knock the socks off in 90 days?
      • Will you be doing amazing things?


      David Dietrich of EMC recently added some insight to DJ’s points about building data science teams. His philosophy is that building data science teams is not the goal; developing data science capabilities is the goal. The structure is not nearly as important as the work being done, and different organizations can be successful doing data science in different ways. In the video he lays out the pros and cons of each of the following strategies.

      Strategies to Assemble Data Science Capabilities

      1. Transforming – reposition/add/modify existing teams such as a reporting team
      2. Creating – just start from scratch
      3. As a Service – consultants or websites (new ones are appearing every day)
      4. Crowdsourcing – competitions like the Netflix prize or Kaggle


      Now, go start developing data science capabilities!

Awesome Interactive Infographics

This is a website for creating interactive infographics. Strangely, the site is new to me even though nearly 1 million “infograms” have already been created. Here are some of the features:

  • Use more than 30 different chart types
  • Edit data with a built-in spreadsheet
  • Download the infographic
  • Share on all your favorite social sites

Some Good Interactive Infographics on

Facebook dominates Social Networking

The Real Data on Facebook vs. Google+ is a great article about the popularity of different social networks. All the big social networks (Twitter, Facebook, Google+, LinkedIn, Pinterest, and even Myspace) are included. If you have ever wondered whether Google+ is dead or not, this article will help you out. With the data gathered and analyzed, it is clear that Facebook is currently winning the social media battle.

The best part of the article is an interactive infographic. You can change the view of the infographic for different years and different business segments. Here is a direct link to just the Interactive Infographic, Facebook Dominates Social Networking.

Data Mining Standard Processes

There are a couple of standard processes for approaching data mining problems.


The most common approach is the Cross Industry Standard Process for Data Mining (CRISP-DM).

Steps of CRISP-DM

  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment

The steps are mostly self-explanatory, but the CRISP-DM Wikipedia page has a lengthier description.
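The six steps are usually drawn as a cycle rather than a straight line, so a small sketch may help. The Python below is only an illustration: the `Phase` enum, the `ALLOWED` table, and `is_valid_path` are hypothetical names, and the back-and-forth transitions are an assumption based on the commonly drawn CRISP-DM diagram, not something stated in this post.

```python
from enum import Enum


class Phase(Enum):
    """The six CRISP-DM steps, in the order listed above."""
    BUSINESS_UNDERSTANDING = 1
    DATA_UNDERSTANDING = 2
    DATA_PREPARATION = 3
    MODELING = 4
    EVALUATION = 5
    DEPLOYMENT = 6


# Assumed transitions, following the usual CRISP-DM diagram.
# Some steps can loop back to an earlier one.
ALLOWED = {
    Phase.BUSINESS_UNDERSTANDING: {Phase.DATA_UNDERSTANDING},
    Phase.DATA_UNDERSTANDING: {Phase.BUSINESS_UNDERSTANDING, Phase.DATA_PREPARATION},
    Phase.DATA_PREPARATION: {Phase.MODELING},
    Phase.MODELING: {Phase.DATA_PREPARATION, Phase.EVALUATION},
    Phase.EVALUATION: {Phase.BUSINESS_UNDERSTANDING, Phase.DEPLOYMENT},
    Phase.DEPLOYMENT: set(),
}


def is_valid_path(path):
    """Return True if every consecutive pair of phases is an allowed move."""
    return all(nxt in ALLOWED[cur] for cur, nxt in zip(path, path[1:]))


if __name__ == "__main__":
    straight_through = list(Phase)  # steps 1 to 6 in order
    print(is_valid_path(straight_through))
    # Looping between data preparation and modeling is also allowed.
    print(is_valid_path([Phase.DATA_PREPARATION, Phase.MODELING, Phase.DATA_PREPARATION]))
```

A project log could be checked against `is_valid_path` to see whether the team skipped a step, for example jumping from business understanding straight to modeling.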


The second most popular process for data mining is SEMMA.

Steps of SEMMA

  1. Sample
  2. Explore
  3. Modify
  4. Model
  5. Assess

More details can be found on the SEMMA Wikipedia page.
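To make the five SEMMA steps concrete, here is a minimal sketch in plain Python with one function per step. Everything in it is an assumption made for illustration: the toy data (y = 2x + 1 with some missing x values), the function names, the mean-fill rule in the Modify step, and the choice of a least-squares line as the model. Real projects would use a proper data mining tool for each step.

```python
import random
import statistics


def make_toy_data(n=100):
    """Hypothetical data: y = 2x + 1, with every tenth x missing."""
    return [{"x": None if i % 10 == 0 else float(i), "y": 2.0 * i + 1.0}
            for i in range(n)]


def sample(records, fraction, seed=0):
    """Sample: draw a random subset of the full data set."""
    rng = random.Random(seed)
    k = max(2, int(len(records) * fraction))
    return rng.sample(records, k)


def explore(records):
    """Explore: summarize the sample to spot gaps in the data."""
    known = [r["x"] for r in records if r["x"] is not None]
    return {"rows": len(records),
            "missing_x": len(records) - len(known),
            "mean_x": statistics.mean(known)}


def modify(records, fill_value):
    """Modify: replace missing x values so every row is usable."""
    return [{"x": fill_value if r["x"] is None else r["x"], "y": r["y"]}
            for r in records]


def model(records):
    """Model: fit y = slope * x + intercept by ordinary least squares."""
    xs = [r["x"] for r in records]
    ys = [r["y"] for r in records]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def assess(records, slope, intercept):
    """Assess: mean absolute error of the fitted line on the given rows."""
    return statistics.mean(abs(r["y"] - (slope * r["x"] + intercept))
                           for r in records)


def run_semma(data, fraction=0.5):
    """Run the five steps in order and return what each one produced."""
    subset = sample(data, fraction)
    summary = explore(subset)
    cleaned = modify(subset, summary["mean_x"])
    slope, intercept = model(cleaned)
    # Assess against every row of the full data set that has a known x.
    known = [r for r in data if r["x"] is not None]
    return {"summary": summary, "slope": slope, "intercept": intercept,
            "mae": assess(known, slope, intercept)}


if __name__ == "__main__":
    print(run_semma(make_toy_data()))
```

Keeping each step as its own function mirrors the way the process is described: any one step can be swapped out (a different sampling rule, a different model) without touching the others.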

A Data Science Process?

Other than The Data Scientific Method (which is not a standard), I am not aware of any other process for data science.

Do you know of any processes for data science? Is anyone aware of a group working on standardizing a data science process?

New Berkeley Online Data Science Degree

The University of California at Berkeley just announced a new master's degree in Information and Data Science (MIDS). The program is designed to be completed entirely online, with the exception of a one-week visit to the campus. The program has an approximate cost of $60,000 for the 27 required credits. The curriculum looks good. It includes machine learning, data analysis, visualization, big data processing, and privacy/ethics. The initial class of students will start in January 2014.