Tag Archives: kaggle

Recent Resources for Open Data

Recently, a number of resources for publicly available datasets have been announced.

  • Kaggle becomes the place for Open Data – I think this is big news! Kaggle just announced Kaggle Datasets, which aims to be a repository for publicly available datasets. This is great for organizations that want to release data but do not necessarily want the overhead of running an open data portal. Hopefully it will gain some traction and become an exceptional resource for open data.
  • NASA Opens Research – NASA just announced that all research papers it funds will be publicly available. It appears the research articles will all be available at PubMed Central, and the data at NASA’s Data Portal.
  • Google Robotics Data – Google continues to do interesting things, and this dataset is one of them. Google Brain Robot Data is a dataset about how robots grasp objects. I am not overly familiar with this topic, so if you want to know more, see their blog post, Deep Learning for Robots.

For more open data options, see Data Sources for Cool Data Science Projects Part 1 and Part 2.

Are you aware of any other resources that have been recently announced? If so, please leave a comment.

National Data Science Bowl

Kaggle and Booz Allen Hamilton have just launched the National Data Science Bowl, a data science competition hosted on Kaggle.

If you are interested in getting started, a tutorial is available in IPython notebook format. Best of luck!

10 Great R Packages

These slides are targeted at Kaggle competitions, but the R packages can be helpful to anyone using R for data analysis. The slides were created by Xavier Conort, a winner of multiple competitions.

Top 5 Places to Get a Data Scientist Job

  1. LinkedIn – They turn data into products better than anyone else.
  2. Facebook – If you are the type of person who loves to analyze people’s lives, there is no better place.
  3. Twitter – Duh, it’s Twitter. Lots of data and lots of possibilities.
  4. Cloudera – Cloudera is a successful Hadoop-based startup. Build tools and explore huge datasets for a variety of industries.
  5. Kaggle – If optimizing algorithms and really diving into the data to get every last ounce of information is your thing, then Kaggle is it. Plus, there is nowhere else you will get to work on so many important problems in such a wide range of domains. Unfortunately, Kaggle is not currently hiring any data scientists, but they most likely will be seeking more in the future.

There are many other companies hiring data scientists. Where would you like to be a data scientist?

Top 5 Data Startups

  1. Kaggle – They make data science a sport, enough said.
  2. DataKind – DataKind may not technically be a startup because it is a nonprofit, but they are doing cool stuff. They match nonprofit organizations with people who love to analyze data and create visualizations.
  3. Cloudera – They call themselves “The Platform for Big Data”. They are working hard to make Hadoop easier to use.
  4. Coursera – Coursera is an education startup, but with two computer science professors as founders, you can bet they are crunching a lot of data about how people learn.
  5. BigML – They are trying to make machine learning available to everyone. Machine Learning as a Service!

blog.untrod.com: Engineering Practices in Data Science

This is a great post by Chris Clark of Kaggle. It explains some of the primary differences between engineers and statisticians. Both groups have something to learn from each other.


Map of Kaggle Submissions

See this interactive map of Kaggle Submissions. The map is a nice example of data visualization. The data is much easier to see on a map than in a data table. Nice work by Ramzi Ramey of Kaggle.

How To Learn Data Science? Part 2

Yesterday, I posted about some traditional strategies to acquire data science skills. Today, I will post a nontraditional strategy.

Internet Based

There are hordes of data science resources available on the internet for free. With enough personal motivation, a person could learn all the necessary skills for free (or cheaply) online. Coursera is probably a great place to start. There are also other good sites such as Udacity and the Kaggle Wiki, as well as various blogs and websites.

The problem with this approach is knowing exactly what to learn. A course in machine learning is great, but data science is more than just machine learning. How do you know what to learn? It would be really nice to have a collection of data science topics and the associated online training materials.

Would this strategy work for you?

Kaggle Launches New Products

If you follow the blog, you probably know I am a big fan of Kaggle. Just last week, they announced the launch of 2 new products.

  1. Kaggle Recruit – In this competition, the participants are not competing for a cash prize but rather for a job interview with a specific company. Currently, Facebook is hosting the first such competition.
  2. Kaggle Prospect – In this competition, the participants are trying to come up with the best question to ask. Participants are presented with various related datasets, and the goal is to find which data science question should be asked of the data. The winner gets a small cash prize, and the winning question becomes a regular Kaggle competition.

What do you think? Are you excited to try out these new competitions?

Increase Your Kaggle Score With a Random Forest

Previously, I blogged about submitting your first solution to Kaggle for the Biological Response Competition. Well, that technique used Logistic Regression and the resulting score was not very good. Now, let’s try to improve upon that score. In this example, we will use what is called a Random Forest. Kaggle claims that random forests have performed well in many of the competitions.


There is no setup required beyond what was done when submitting your first solution. This technique also uses Python as the software tool, along with the same data and directory structure.

The Random Forest Code

Scikit-learn, the machine learning library for Python, has a nice implementation of a random forest. Here is some Python code to run the random forest. A special thanks to Ben Hamner for supplying the basic code.

#!/usr/bin/env python

from sklearn.ensemble import RandomForestClassifier
import csv_io


def main():
    # read in the training file
    train = csv_io.read_data("train.csv")
    # set the training responses (first column of each row)
    target = [x[0] for x in train]
    # set the training features (remaining columns)
    train = [x[1:] for x in train]
    # read in the test file
    realtest = csv_io.read_data("test.csv")

    # random forest code
    rf = RandomForestClassifier(n_estimators=150, min_samples_split=2, n_jobs=-1)
    # fit the training data
    print('fitting the model')
    rf.fit(train, target)
    # run model against test data
    predicted_probs = rf.predict_proba(realtest)

    # keep the second column of predict_proba, formatted as text
    predicted_probs = ["%f" % x[1] for x in predicted_probs]
    csv_io.write_delimited_file("random_forest_solution.csv", predicted_probs)

    print('Random Forest Complete! You Rock! Submit random_forest_solution.csv to Kaggle')


if __name__ == "__main__":
    main()

Raw code can be obtained here. (Please use the raw code if you are going to copy/paste). Now save this file as random_forest.py in the directory (c:/kaggle/bioresponse) you previously created.
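The script above imports csv_io, a small helper module that is not reproduced in this post. For readers who do not have it, here is a minimal sketch of what such a helper might look like. It is an assumption, not the original module: it assumes the CSV files have a single header row and contain only numeric values, and the optional header and delimiter parameters are hypothetical conveniences.

```python
import csv


def read_data(file_name):
    """Read a numeric CSV file into a list of rows of floats.

    Assumes the first line is a header row and skips it.
    """
    with open(file_name, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row (assumed to be present)
        return [[float(value) for value in row] for row in reader if row]


def write_delimited_file(file_path, data, header=None, delimiter=","):
    """Write one item per line.

    String items are written as-is; list or tuple items are joined
    with the delimiter. The header, if given, is written first.
    """
    with open(file_path, "w") as f:
        if header is not None:
            f.write(delimiter.join(header) + "\n")
        for item in data:
            if isinstance(item, str):
                f.write(item + "\n")
            else:
                f.write(delimiter.join(item) + "\n")
```

If you use this sketch, save it as csv_io.py in the same directory as random_forest.py so the import can find it.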

Running the code

Then open the Python GUI. You may need to run the following commands to navigate to the correct directory.

import os
os.chdir("c:/kaggle/bioresponse")

Now you can run the actual random forest python code.

import random_forest
random_forest.main()


Now upload random_forest_solution.csv to Kaggle and enjoy moving up the Leaderboard. This score should place you at or near the random forest benchmark. As of today (5/30/2012), that score is about in the middle of the Leaderboard. Note: as the name implies, a random forest has a bit of randomness built into the algorithm, so your results may vary slightly.
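If you would like repeatable results while experimenting, scikit-learn's RandomForestClassifier accepts a random_state parameter that fixes the seed. The sketch below is only an illustration: it uses synthetic data from make_classification rather than the competition files, and the seed value, forest size, and the fit_and_predict helper are arbitrary choices of mine.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; the competition files are not used here.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)


def fit_and_predict(seed):
    """Fit a small forest with a fixed seed and return class-1 probabilities."""
    rf = RandomForestClassifier(n_estimators=25, random_state=seed)
    rf.fit(X, y)
    return rf.predict_proba(X)[:, 1]


first = fit_and_predict(seed=42)
second = fit_and_predict(seed=42)

# With the same seed, the two runs should produce identical probabilities.
print((first == second).all())
```

Leaving random_state unset, as in the script above, gives a different forest on each run, which is why your leaderboard score may differ slightly from mine.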

Once again, if you performed these steps, I would love to know about it. Thanks for following along, and good luck with Kaggle.