Data From eReaders

Reading Your E-book is Reading You, an article in the Wall Street Journal, is an excellent example of data science. Book publishers now know how much of a book readers finish, how long they read, what book they read next, and much more. Read the article to find out more. It also raises some privacy issues.

A big thanks to Mark Nickel for sharing the article with me.

The Data Scientific Method

DJ Patil and Josh Elman, both of Greylock Partners, give an insightful talk at LeWeb London 2012. The most important part was the introduction of the Data Scientific Method.

Data Scientific Method

  1. Start with a Question
  2. Leverage your current data
  3. Create features and run tests
  4. Analyze the results and draw insights
  5. Let the data frame a conversation

How To Learn Data Science? Part 2

Yesterday, I posted about some traditional strategies to acquire data science skills. Today, I will post a nontraditional strategy.

Internet Based

There are hordes of data science resources available on the internet for free. With enough personal motivation, a person could learn all the necessary skills for free (or cheaply) online. Coursera is probably a great place to start. There are also other good sites such as Udacity and the Kaggle Wiki, plus other blogs and websites.

The problem with this approach is knowing exactly what to learn. A course in machine learning is great, but data science is more than just machine learning. How do you know what to learn? It would be really nice to have a collection of data science topics and the associated online training materials.

Would this strategy work for you?

How To Learn Data Science?

Based upon the popularity of a previous post about a certificate program from the University of Washington, it appears that many people are interested in learning the skills necessary to become a data scientist. Thus, I decided to compile a list of some of the possible learning strategies.

Traditional College Education

The most obvious path would be to study at a traditional college or university. Colleges and universities are starting to notice the demand for data science skills, and many colleges are currently offering programs to prepare someone as a data scientist. This path is safe and predictable. Do the homework, complete the courses, and get the degree or certificate. Most people are familiar with the process, and it offers few surprises. The problems here are the costs, lack of flexibility, and time involved.

Corporate Training

Companies are now starting to offer training programs for data science. EMC is leading the way in this category with their data science training program. Cloudera also offers lots of training related to Hadoop and big data. Wolfram offers data science training with Mathematica. One of the problems with this category is the cost. Another problem is that companies tend to teach and promote their own products. This may leave the student with numerous gaps in the full data science spectrum.

Your Thoughts?

What are your thoughts about the above approaches? What are the positives and negatives? Also, later this week I will be posting some less-traditional approaches to learning data science.

Profile Of A Data Scientist – Interview

Visit this excellent video interview with David Dietrich, creator of EMC’s data science curriculum. He talks about his experience helping people transition to becoming data scientists.
David lays out a list of 5 traits of a data scientist.

  • Quantitative
  • Technical
  • Skeptical
  • Communication and Collaboration
  • Creative and Curiosity

For a diagram of these 5 traits, see this brief writeup about the profile of a data scientist. Also, see the slides of his latest talk at EMC World 2012.

**Note: I removed the embedded video because it was set to play automatically.

Kaggle Launches New Products

If you follow the blog, you probably know I am a big fan of Kaggle. Just last week, they announced the launch of 2 new products.

  1. Kaggle Recruit: In this competition, the participants are not competing for a cash prize but rather a job interview with a specific company. Currently, Facebook is hosting the first such competition.
  2. Kaggle Prospect: In this competition, the participants are trying to come up with the best question to ask. Participants are presented with various related datasets, and the goal is to find which data science question should be asked of the data. The winner gets a small cash prize, and the winning question becomes a regular Kaggle competition.

What do you think? Are you excited to try out these new competitions?

New Data Science Certificate Program

Starting in the fall of 2012, the University of Washington will be offering a certificate in Data Science. The program has two sections: one located in Seattle and the other online. The certificate consists of three separate courses each lasting approximately 3 months. Thus the program can be completed in 9 months, and the cost is around $3000.

There are some information sessions later this summer. If you are in Seattle, there is an information session on July 19. If you are interested in the online program, a webinar is scheduled for August 29.

The program content looks quite good. Some of the topics to be covered include Hadoop, NoSQL, machine learning, statistics, graph algorithms, and more. If you are looking to become a data scientist, this just might be the program for you.

I also added this certificate program to my list of colleges offering data science degrees.

Visualization Of Data Science Twitter Users

This is a fun and interactive visualization of 659 Twitter accounts linked to data science.


Bitmarks: Bitly's Data Science One URL At A Time

Earlier this week, Bitly launched a new bookmarking service. They call links/URLs bitmarks instead of bookmarks. It has a nice Chrome Extension and Bitmarklet. So far, I very much like the service.

So, Why Should You Care?

Well, at its core, Bitly is a data science company. This is just another way for Bitly to collect more URLs. I think that is a good thing. Bitly has huge amounts of data created by collecting lots and lots of small things.

What do you think Bitly is doing with all those URLs? I am not completely sure, but I would bet some of it is really neat. Bitly can already track breaking news in near real-time. I will be curious if Bitly can predict the winner of the November presidential election before the news organizations can.

By the way

I have created a Data Science 101 Bitmark Bundle. You are welcome to follow along, although I do not know if there is a way to follow a bundle.

Your First Kaggle Submission

Yesterday, I wrote a post explaining the Kaggle Biological Response competition. If you don’t know, Kaggle is a website for data science competitions. Now it is time to submit a solution. After this post, you should have a spot on the Leaderboard. Granted, it will not be first place but it won’t be last place either. If you have not already done so, please create an account at Kaggle.

Setup Python

For this example, we will use the Python programming language. You will need to perform the following steps to get going. These steps are for Windows machines, but they could very easily be modified for a UNIX/Linux/Mac system.

  1. Install Python 2.7.3 – you need the programming language
  2. Install numpy – for linear algebra and other stuff
  3. Install scipy – for scientific calculations
  4. Install setuptools – easier python package installation
  5. Install scikit-learn – machine learning for python
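
Once the installs finish, it can help to confirm that each package actually imports before moving on. The snippet below is a minimal sketch and not part of the original setup steps; the helper name check_packages is made up for illustration. Note that scikit-learn is imported under the name sklearn.

```python
def check_packages(names):
    """Return a dict mapping each module name to True if it imports cleanly."""
    status = {}
    for name in names:
        try:
            __import__(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    # Import names for the packages listed above (scikit-learn imports as sklearn).
    wanted = ["numpy", "scipy", "setuptools", "sklearn"]
    for name, ok in sorted(check_packages(wanted).items()):
        print("%-12s %s" % (name, "OK" if ok else "MISSING"))
```

Any package reported as MISSING should be reinstalled before continuing.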

Setup A File Structure And Get Data

Next, create a directory on your C drive. Call it whatever you want. I recommend C:/kaggle/bioresponse. Then download and save the file csv_io.py for reading and writing CSV files. Thanks to Ben Hamner of Kaggle for that file. Next, go download the test and train files from Kaggle and save them to your directory.
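
In case csv_io.py is not at hand, the sketch below shows roughly what two such helpers could look like. It is a hypothetical stand-in, not Ben Hamner's file. Only the function names read_data and write_delimited_file come from the code later in this post; the behavior (skip one header row, parse every field as a float, write one item per line) is an assumption.

```python
def read_data(file_name):
    """Read a numeric CSV into a list of rows of floats."""
    samples = []
    with open(file_name) as f:
        f.readline()  # assumption: the first line is a header and is skipped
        for line in f:
            line = line.strip()
            if line:
                samples.append([float(x) for x in line.split(",")])
    return samples


def write_delimited_file(file_path, data, delimiter=","):
    """Write one item per line; non-string rows are joined with the delimiter."""
    with open(file_path, "w") as f_out:
        for row in data:
            if isinstance(row, str):
                f_out.write(row + "\n")
            else:
                f_out.write(delimiter.join(str(x) for x in row) + "\n")
```

If you have the real csv_io.py, use that instead; this version only illustrates the kind of work the file does.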

The Default Solution

If you open the test.csv file, you will notice it has 2501 rows of actual data. Thus, a very simple default solution is to create a submission file with 2501 rows and the number 0.5 on each row. Then go to Kaggle and upload the submission file. I will not provide code for creating that file. There are many ways to do it manually or programmatically. This solution will get you on the Leaderboard near the bottom, but not last.
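
For anyone who would rather script this step, here is one possible sketch, offered as an illustration only. The function name write_constant_submission and the file name default_solution.csv are made up, and the layout (one value per line, no header) is an assumption that mirrors what the logistic regression code in this post writes.

```python
def write_constant_submission(path, n_rows=2501, value=0.5):
    """Write n_rows lines, each holding the same predicted probability."""
    with open(path, "w") as f:
        for _ in range(n_rows):
            f.write("%f\n" % value)
    return n_rows


if __name__ == "__main__":
    # Hypothetical output name; save it next to train.csv and test.csv.
    write_constant_submission("default_solution.csv")
```

Predicting 0.5 everywhere claims no knowledge about any row, which is why it makes a useful baseline to beat.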

A Logistic Regression Solution

Now, if you know a little statistics, you will recognize this problem as a classification problem, since the observed responses are either 0 or 1. Thus logistic regression is a decent algorithm to try. Here is the Python code to run logistic regression.

#!/usr/bin/env python

from sklearn.linear_model import LogisticRegression
import csv_io
import math
import scipy

def main():
    # read in the training file
    train = csv_io.read_data("train.csv")
    # set the training responses (first column of each row)
    target = [x[0] for x in train]
    # set the training features (remaining columns)
    train = [x[1:] for x in train]
    # read in the test file
    realtest = csv_io.read_data("test.csv")

    # code for logistic regression
    lr = LogisticRegression()
    lr.fit(train, target)
    predicted_probs = lr.predict_proba(realtest)

    # write solutions to file, keeping the probability of class 1
    predicted_probs = ["%f" % x[1] for x in predicted_probs]
    csv_io.write_delimited_file("log_solution.csv", predicted_probs)

    print ('Logistic Regression Complete! Submit log_solution.csv to Kaggle')

if __name__=="__main__":
    main()

Raw code can be obtained here (Please use the raw code if you are going to copy/paste).
Save this file as log_regression.py in the directory you created above. Then open the Python GUI. You may need to run the following commands to navigate to the correct directory.

import os
os.chdir("C:/kaggle/bioresponse")

Now you can run the actual logistic regression.

import log_regression
log_regression.main()

Now upload log_solution.csv to Kaggle, and you are playing the game.
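
Before uploading, a quick sanity check of the file can catch obvious mistakes. The sketch below is an optional extra, not part of the original walkthrough. The helper name check_submission is hypothetical, and it assumes the one-probability-per-line layout written by the code above and the 2501 test rows mentioned earlier.

```python
def check_submission(path, expected_rows=2501):
    """Return a list of problems found in a one-probability-per-line file."""
    problems = []
    with open(path) as f:
        lines = [line.strip() for line in f if line.strip()]
    if len(lines) != expected_rows:
        problems.append("expected %d rows, found %d" % (expected_rows, len(lines)))
    for i, text in enumerate(lines, 1):
        try:
            p = float(text)
        except ValueError:
            problems.append("row %d is not a number: %r" % (i, text))
            continue
        # Probabilities must lie between 0 and 1 inclusive.
        if not 0.0 <= p <= 1.0:
            problems.append("row %d is outside [0, 1]: %r" % (i, text))
    return problems
```

Calling check_submission("log_solution.csv") returns an empty list when nothing looks wrong.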


If you performed these steps, I would love to know about it. Thanks for following along, and good luck with Kaggle.