Increase Your Kaggle Score With a Random Forest

Previously, I blogged about submitting your first solution to Kaggle for the Biological Response competition. That technique used logistic regression, and the resulting score was not very good. Now, let’s try to improve upon that score. In this example, we will use what is called a random forest. Kaggle claims that random forests have performed well in many of its competitions.

Setup

There is no setup required beyond what was done when submitting your first solution. This technique also uses Python as the software tool, along with the same data and directory structure.

The Random Forest Code

Scikit-learn, the machine learning library for Python, has a nice implementation of a random forest. Here is some Python code to run the random forest. A special thanks to Ben Hamner for supplying the basic code.

#!/usr/bin/env python

from sklearn.ensemble import RandomForestClassifier
import csv_io

def main():
    # read in the training file
    train = csv_io.read_data("train.csv")
    # set the training responses
    target = [x[0] for x in train]
    # set the training features
    train = [x[1:] for x in train]
    # read in the test file
    realtest = csv_io.read_data("test.csv")

    # random forest code
    rf = RandomForestClassifier(n_estimators=150, min_samples_split=2, n_jobs=-1)
    # fit the training data
    print('fitting the model')
    rf.fit(train, target)
    # run model against test data
    predicted_probs = rf.predict_proba(realtest)

    # keep the probability of the positive class for each test record
    predicted_probs = ["%f" % x[1] for x in predicted_probs]
    csv_io.write_delimited_file("random_forest_solution.csv", predicted_probs)

    print('Random Forest Complete! You Rock! Submit random_forest_solution.csv to Kaggle')

if __name__ == "__main__":
    main()

Raw code can be obtained here. (Please use the raw code if you are going to copy/paste.) Now save this file as random_forest.py in the directory (c:/kaggle/bioresponse) you previously created.
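The script imports the small csv_io helper module from the first-solution post. If you no longer have that file handy, below is a minimal sketch that matches how the helper is used above (read_data and write_delimited_file). It assumes the CSVs are numeric with a single header row, so check it against your copy of the original helper before relying on it. Save it as csv_io.py in the same directory.

import csv

def read_data(file_name):
    # read a CSV file, skip the header row, and convert every value to a float
    with open(file_name) as f:
        reader = csv.reader(f)
        next(reader)  # assumes exactly one header row
        return [[float(value) for value in row] for row in reader]

def write_delimited_file(file_name, rows):
    # write one prediction per line to the output file
    with open(file_name, "w") as f:
        for row in rows:
            f.write(str(row) + "\n")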

Running the code

Then open the Python GUI. You may need to run the following commands to navigate to the correct directory.

import os
os.chdir('c:/kaggle/bioresponse')

Now you can run the actual random forest python code.

import random_forest
random_forest.main()
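
Alternatively, you can skip the GUI and run the script directly from a command prompt in that same directory:

python random_forest.py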

Results

Now upload random_forest_solution.csv to Kaggle and enjoy moving up the Leaderboard. This score should place you at or near the random forest benchmark. As of today (5/30/2012), that score sits roughly in the middle of the Leaderboard. Note: as the name implies, a random forest has a bit of randomness built into the algorithm, so your results may vary slightly.
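
If you want a repeatable score while experimenting, one small tweak (not part of the script above) is to pass a fixed random_state to the classifier, which scikit-learn's RandomForestClassifier accepts:

# fixing the seed makes repeated runs build the same forest
rf = RandomForestClassifier(n_estimators=150, min_samples_split=2, n_jobs=-1, random_state=0)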

Once again, if you performed these steps, I would love to hear about it. Thanks for following along, and good luck with Kaggle.


Comments

12 responses to “Increase Your Kaggle Score With a Random Forest”

  1. Chris

    Hi Ryan,

    Great post! I’ve done something similar on Kaggle wiki here: https://www.kaggle.com/wiki/GettingStartedWithPythonForDataScience/

    If you have additions or comments, please edit away (it’s a wiki, after all!). And by the same token, if you have new things to add, we’d love to see this kind of knowledge up on our wiki so it’s at the fingertips of our users.

    Cheers,
    Chris Clark
    Product Manager, Kaggle

    1. Ryan Swanstrom

      Chris,
      Thanks for leaving that comment. I have seen the Kaggle Wiki. However, I had not seen that specific page. It looks very nice. I will take a further look and see if I have anything to add or update.

      Thanks,
      Ryan

  2. Rakesh

    Looks like with the same code, I got a better rank than you 🙂

    1. Ryan Swanstrom

      That is because a random forest has some “randomness” involved. Results will vary slightly, but usually will be in approximately the same range. Congrats and thanks for commenting.

  3. anonymousguerrillamailblockcom

    Can you please explain why you used predict_proba and not predict? As far as I understand, predict returns the predicted target value for a record while predict_proba returns the probability for a specific target value per record, is that correct?

    1. Ryan Swanstrom

      predict returns 1 or 0
      predict_proba returns the actual probability (.89 or .23)
      Kaggle wants to compare probabilities, not just the 1 or 0.
      Here are the scikit-learn docs:
      http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict_proba
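
      For example (assuming rf has already been fit as in the script above):

      print(rf.predict(realtest[:1]))        # e.g. [1], the predicted class
      print(rf.predict_proba(realtest[:1]))  # e.g. [[0.11, 0.89]], probability of each class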

  4. […] was the ascent of the Random Forest algorithm as witnessed in the numerous top places it got in different Kaggle contests (The Two Most Important Algorithms in Predictive Modeling Today). Since those contests are done on […]

  5. rusty

    Thanks man, you simply rock! Tons of information on data science.

  6. manishranjan

    How can I avoid names, as they throw an error for being strings? Also, your link to GitHub doesn't work.

    1. manishranjan

      Sorry, I mistook this as if you were explaining the Titanic problem on Kaggle, then realized you are solving the bio one. Really sorry.

    2. Ryan Swanstrom

      I did change the GitHub link. Thanks for catching that.
