Increase Your Kaggle Score With a Random Forest

Previously, I blogged about submitting your first solution to Kaggle for the Biological Response competition. That technique used logistic regression, and the resulting score was not very good. Now, let's try to improve upon that score. In this example, we will use what is called a random forest. Kaggle claims that random forests have performed well in many of its competitions.

Setup

There is no setup required beyond what was done when submitting your first solution. This example also uses Python as the software tool, along with the same data and directory structure.

The Random Forest Code

Scikit-learn, the machine learning library for Python, has a nice implementation of a random forest. Here is some Python code to run the random forest. A special thanks to Ben Hamner for supplying the basic code.

#!/usr/bin/env python

from sklearn.ensemble import RandomForestClassifier
import csv_io

def main():
    # read in the training file
    train = csv_io.read_data("train.csv")
    # set the training responses
    target = [x[0] for x in train]
    # set the training features
    train = [x[1:] for x in train]
    # read in the test file
    realtest = csv_io.read_data("test.csv")

    # random forest with 150 trees, using all available CPU cores
    rf = RandomForestClassifier(n_estimators=150, min_samples_split=2, n_jobs=-1)
    # fit the training data
    print('fitting the model')
    rf.fit(train, target)
    # run model against test data
    predicted_probs = rf.predict_proba(realtest)

    # keep column 1, the predicted probability of class 1
    predicted_probs = ["%f" % x[1] for x in predicted_probs]
    csv_io.write_delimited_file("random_forest_solution.csv", predicted_probs)

    print('Random Forest Complete! You Rock! Submit random_forest_solution.csv to Kaggle')

if __name__ == "__main__":
    main()

Raw code can be obtained here. (Please use the raw code if you are going to copy/paste.) Now save this file as random_forest.py in the directory (c:/kaggle/bioresponse) you previously created.

Running the code

Then open the Python GUI. You may need to run the following commands to navigate to the correct directory.

import os
os.chdir('c:/kaggle/bioresponse')

Now you can run the actual random forest python code.

import random_forest
random_forest.main()

Results

Now upload random_forest_solution.csv to Kaggle and enjoy moving up the Leaderboard. This score should place you at or near the random forest benchmark. As of today (5/30/2012), that score falls near the middle of the Leaderboard. Note: as the name implies, a random forest has a bit of randomness built into the algorithm, so your results may vary slightly.
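
Also, because of that randomness, your submission will differ slightly from run to run. If you want reproducible results, RandomForestClassifier accepts a random_state parameter that pins the randomness:

rf = RandomForestClassifier(n_estimators=150, min_samples_split=2, n_jobs=-1, random_state=0)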

Once again, if you performed these steps, I would love to know about it. Thanks for following along, and good luck with Kaggle.

MongoDB Funding News

This helps to solidify MongoDB as the leading NoSQL database. Having used MongoDB myself, I can see why the company behind it was able to raise so much funding. MongoDB is fast, simple, and flexible.

An Infographic for Taming Big Data

The gorilla may be a bit too much, but the rest of the infographic contains valuable information. What do companies do with big data? Also, spending on big data is set to increase rapidly in the next few years.

Taming Big Data | A Big Data Infographic
Via: Wikibon Big Data

Religion and the Number of Babies: A Hans Rosling TED Talk

Hans Rosling does it again. Hans may be the best storyteller of data on earth. He has a real gift for turning data into an exciting story. He also asks great questions and has a good wit. The whole goal of this TED Talk is to answer the following question.

Do women of certain religions have more babies?

Your First Kaggle Submission

Yesterday, I wrote a post explaining the Kaggle Biological Response competition. If you don't know, Kaggle is a website for data science competitions. Now it is time to submit a solution. After this post, you should have a spot on the Leaderboard. Granted, it will not be first place, but it won't be last place either. If you have not already done so, please create an account at Kaggle.

Setup Python

For this example, we will use the Python programming language. You will need to perform the following steps to get going. These steps are for Windows machines, but they could easily be adapted for a Unix/Linux/Mac system.

  1. Install Python 2.7.3 – you need the programming language
  2. Install numpy – for linear algebra and other stuff
  3. Install scipy – for scientific calculations
  4. Install setuptools – easier python package installation
  5. Install scikit-learn – machine learning for Python
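
A quick way to confirm the installs worked is to import each package from the Python prompt and print the scikit-learn version:

import numpy
import scipy
import sklearn
print(sklearn.__version__)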

Setup A File Structure And Get Data

Next, create a directory on your C drive. Call it whatever you want; I recommend C:/kaggle/bioresponse. Then download and save the file csv_io.py for reading and writing CSV files. Thanks to Ben Hamner of Kaggle for that file. Finally, download the test and train files from Kaggle and save them to your directory.
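
In case the csv_io.py link ever goes stale, here is a minimal stand-in of my own with the same two functions used in these posts. This is just a sketch, not Ben Hamner's actual file, and it assumes the Kaggle files are comma-delimited with a single header row:

import csv

def read_data(file_name):
    # read a comma-delimited file into a list of lists of floats,
    # skipping the header row (an assumption about the Kaggle files)
    with open(file_name) as f:
        reader = csv.reader(f)
        next(reader)  # skip the header
        return [[float(value) for value in row] for row in reader]

def write_delimited_file(file_name, lines):
    # write one value per line to the output file
    with open(file_name, "w") as f:
        for line in lines:
            f.write(str(line) + "\n")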

The Default Solution

If you opened the test.csv file, you would have noticed it has 2501 rows of actual data. Thus, a very simple default solution is to create a submission file with 2501 rows and the number 0.5 on each row. Then go to Kaggle and upload the submission file. There are many ways to create that file, manually or programmatically. This solution will get you on the Leaderboard near the bottom, but not last.
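
Here is one minimal sketch for producing that default file (the filename is just a suggestion):

# write 2501 rows, each containing the prediction 0.5
with open("default_solution.csv", "w") as f:
    for _ in range(2501):
        f.write("0.5\n")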

A Logistic Regression Solution

Now, if you know a little statistics, you will recognize this as a classification problem, since the observed responses are either 0 or 1. Thus, logistic regression is a decent algorithm to try. Here is the Python code to run logistic regression.

#!/usr/bin/env python

from sklearn.linear_model import LogisticRegression
import csv_io

def main():
    # read in the training file
    train = csv_io.read_data("train.csv")
    # set the training responses
    target = [x[0] for x in train]
    # set the training features
    train = [x[1:] for x in train]
    # read in the test file
    realtest = csv_io.read_data("test.csv")

    # fit a logistic regression model and predict class probabilities
    lr = LogisticRegression()
    lr.fit(train, target)
    predicted_probs = lr.predict_proba(realtest)

    # write the probability of class 1 for each row to the solution file
    predicted_probs = ["%f" % x[1] for x in predicted_probs]
    csv_io.write_delimited_file("log_solution.csv", predicted_probs)

    print('Logistic Regression Complete! Submit log_solution.csv to Kaggle')

if __name__ == "__main__":
    main()

Raw code can be obtained here (Please use the raw code if you are going to copy/paste).
Save this file as log_regression.py in the directory you created above. Then open the Python GUI. You may need to run the following commands to navigate to the correct directory.

import os
os.chdir('c:/kaggle/bioresponse')

Now you can run the actual logistic regression.

import log_regression
log_regression.main()

Now upload log_solution.csv to Kaggle, and you are playing the game.

Results

If you performed these steps, I would love to know about it. Thanks for following along, and good luck with Kaggle.

Get Started With Kaggle – Description

Yesterday, I posted about the popularity of data hackathons. Well, today let’s get started with Kaggle. This is the first of a few simple posts about making your first submission to a Kaggle competition. I also promise you won’t be last place. You won’t be first either. This is an excellent way to start developing your data science skills.

The Problem

The Biological Response competition seems to be a good starting point. The data is fairly straightforward: it consists of rows and columns, where each row represents a molecule. The first column represents a biological response, and the remaining 1776 columns are features of the molecule (technically, calculated molecular descriptors). Unfortunately, the data does not state what each column represents, so domain knowledge of biology is not really helpful.

The Data

For this problem, Kaggle provides two sets of data. The first file is a training set. It includes data with both responses and features, and it is used to train your algorithm. The actual responses are either the value 0 or the value 1. The second file, called the test file, is very similar except it does not contain the responses.

How To Submit A Solution

Your goal as a participant is to run your algorithm against the test file and predict the response. Each predicted response should be a value between 0 and 1. After your algorithm runs, it should produce an output file with the predicted response for each row on a separate line. Your submission file is just a single column.
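
For illustration, the first few lines of a submission file might look like this (these probabilities are made up):

0.712018
0.051342
0.893752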

The Ranking

To submit a solution, you just upload your submission file. Kaggle then compares your predicted responses with the actual responses for the test set. Kaggle knows those values, but it does not share them with participants. The comparison method used for this competition is called Log Loss. For a description of Log Loss, see the Kaggle wiki page about scoring metrics. The goal of this competition is to get the lowest score.
Note: only 2 submissions are allowed per day.
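
For reference, here is a small sketch of the Log Loss calculation as I understand it, where actual holds the true 0/1 responses and predicted holds your probabilities (the clipping keeps the logarithm away from 0):

import math

def log_loss(actual, predicted, eps=1e-15):
    total = 0.0
    for y, p in zip(actual, predicted):
        p = min(max(p, eps), 1 - eps)  # keep p away from 0 and 1
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(actual)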

You Can Do It

That is my brief description of a Kaggle competition. It doesn't sound too hard, does it? Tomorrow, we will step through making our first submission. Go register for an account so you are ready to submit a solution tomorrow. Be careful: once you start Kaggling (I think I just invented that word), you might not want to stop.

Hackathons with Data are Everywhere

It seems that competitions and meetups for hacking data are all over the place. Coding challenges have been around for a long time. Recently, it appears that data is being thrown into the mix. I think the idea is great. Instead of just hacking some app, why not hack with some data that might help people?

GitHub just concluded the GitHub Data Challenge. Also, the world's first global data science hackathon occurred last month. The Silicon Prairie has even hosted a couple of data-centric hackathons. The Omaha World-Herald newspaper organized Hack Omaha to turn government data into something useful. Open Iowa did a very similar thing for developers, designers, and data junkies in the Des Moines area. I did not personally attend either of these events, but I was surprised to see these types of events occurring in the Midwest.

DataKind is busy organizing data dives all over the country, and Kaggle is currently organizing data science competitions for anyone regardless of location. By the way, Kaggle has become one of my favorite sites, and I will be blogging soon about how to quickly get involved.

Anyhow, it appears hackathons with data are here to stay. What data hackathons or competitions do you know about? Are you planning to attend?

Data Science Training Program in New York

If you are in New York City or the surrounding area and you want to learn data science, this post is for you. General Assembly, a technology, design, and entrepreneurship campus in New York City, is running a 12-week Intensive Program in Data Science. The course consists of lectures (twice a week), labs, homework, and a comprehensive project. The instructors are Max Shron of OkCupid fame and Ryan Witt, founder of Opani. The course costs $3,000, but that seems like a fair price for the knowledge gained and a certificate.

Are you aware of any other training programs like this?

Challenge To Future Developers: Start Storing More Data

Dear Future Developers

Please store as much data as possible. Do not worry about the cost of the extra storage disks; the value in the data will far outweigh the cost of the hardware. Here are some examples of data that could be stored but typically is not.

Start storing data about the order in which pages on your site get visited. Where do visitors most often land, and where do they go from there? Is there a path that leads to visitors becoming customers? Is there a path that leads to visitors leaving? Both would be good to know. Given enough of this data, it would be possible to predict what pages eventually lead to the most customers.

Start storing log information in a database. Some places do this, but far too many do not. As developers, we should give this a higher priority. It is never fun to debug a problem only to find the log file has been overwritten. Setting up a database for this would definitely save on debugging time. Plus, the log data could be helpful for spotting trends or parts of the system that frequently have issues. Remember that not all bugs produce errors, so it is important to store all the log data.

Start storing data about the errors that occur and what (screen/page) caused each error. This information is typically stored in a log file somewhere and is too frequently lost after a couple of days. It would be much better to store it in a database for archival purposes. This is closely related to the previous paragraph.

Start storing information about which fields on a form get updated. Then you can notice if users are constantly returning to the same form to update a different field. Maybe the user was unaware that both fields can be updated simultaneously. Rearranging the fields might create a better user experience, and it will decrease the number of updates hitting the database.

Start storing data about which buttons and links users click. This is not just the pages visited but the actual user actions. A good web analytics program can cover some of this, but why not store all of it yourself? Then you can do with it as you please. Which buttons on your site get clicked the most? Is it the color, the location, neither, or both that makes a button popular? Which buttons and links never get clicked? How frequently does the same user click each button? If a user keeps coming back to click the same button, it may indicate a navigation issue. There are some nice usability enhancements that can be made with this data.

Start storing data that you cannot immediately see as useful. The big data movement is continually showing the advantage of having more data. You never know when, or for what, the data will be useful.

Many of the current NoSQL choices would be good candidates for storing the above data. This data will obviously grow very quickly, and speedy inserts are a must. Therefore, a database like MongoDB, Cassandra, or Redis might be a good choice.
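
As a quick illustration, here is a minimal sketch of storing a click event with pymongo. It assumes a local MongoDB instance and pymongo 3 or later; the database, collection, and field names are all made up:

from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["site_analytics"]  # hypothetical database name

# one document per click; the flexible schema makes inserts cheap
db.click_events.insert_one({
    "user_id": "u123",        # made-up identifiers
    "button": "signup",
    "page": "/pricing",
    "clicked_at": datetime.utcnow(),
})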

What other data do you think could be collected? I am sure there are lots of other possibilities. Also, I am going to take myself up on this challenge. I would like to store more information about the software I build.

Sincerely,

Ryan Swanstrom

Easel.ly Launches For Creating Infographics

Easel.ly recently launched. It is a site for easily creating infographics. It looks pretty simple, but I am still not sure I have the artistic skills to make a good-looking infographic.

Infographics are still great for telling the story of your data.