Tag Archives: data science

New Data Science Certificate Program

Starting in the fall of 2012, the University of Washington will be offering a certificate in Data Science. The program has two sections: one located in Seattle and the other online. The certificate consists of three separate courses each lasting approximately 3 months. Thus the program can be completed in 9 months, and the cost is around $3000.

There are some information sessions later this summer. If you are in Seattle, there is an information session on July 19. If you are interested in the online program, a webinar is scheduled for August 29.

The program content looks quite good. Some of the topics to be covered include: hadoop, NoSQL, machine learning, statistics, graph algorithms, and more. If you are looking to become a data scientist, this just might be the program for you.

I also added this certificate program to my list of colleges offering data science degrees.

Visualization Of Data Science Twitter Users

This is a fun, interactive visualization of 659 Twitter accounts linked to data science.

http://www.greenplum.com/datasciencesummit/community/

Bitmarks: Bitly's Data Science One URL At A Time

Earlier this week, Bitly launched a new bookmarking service. They call links/URLs bitmarks instead of bookmarks. It has a nice Chrome Extension and Bitmarklet. So far, I very much like the service.

So, Why Should You Care?

Well, at its core, Bitly is a data science company. This is just another way for Bitly to collect more URLs. I think that is a good thing. Bitly has huge amounts of data created by collecting lots and lots of small things.

What do you think Bitly is doing with all those URLs? I am not completely sure, but I would bet some of it is really neat. Bitly can already track breaking news in near real-time. I will be curious if Bitly can predict the winner of the November presidential election before the news organizations can.

By the way

I have created a Data Science 101 Bitmark Bundle. You are welcome to follow along, although I do not know if there is a way to follow a bundle.

Your First Kaggle Submission

Yesterday, I wrote a post explaining the Kaggle Biological Response competition. If you don’t know, Kaggle is a website for data science competitions. Now it is time to submit a solution. After this post, you should have a spot on the Leaderboard. Granted, it will not be first place but it won’t be last place either. If you have not already done so, please create an account at Kaggle.

Setup Python

For this example, we can use the Python programming language. You will need to perform the following steps to get going. These steps are for Windows machines, but they could very easily be modified for a UNIX/Linux/Mac system.

  1. Install Python 2.7.3 – you need the programming language
  2. Install numpy – for linear algebra and other stuff
  3. Install scipy – for scientific calculations
  4. Install setuptools – easier python package installation
  5. Install scikit-learn – machine learning for python

Setup A File Structure And Get Data

Next create a directory on your C drive. Call it whatever you want. I recommend C:/kaggle/bioresponse. Then download and save the file csv_io.py for reading and writing CSV files. Thanks to Ben Hamner of Kaggle for that file. Next, go download the test and train files from Kaggle and save to your directory.

The Default Solution

If you open the test.csv file, you will notice it has 2501 rows of actual data. Thus, a very simple default solution is to create a submission file with 2501 rows and the number 0.5 on each row. Then go to Kaggle and upload the submission file. I will not provide code for creating that file. There are many ways to do it manually or programmatically. This solution will get you on the Leaderboard near the bottom, but not last.
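
If you would rather script it, a minimal sketch might look like the following. The function name write_default_submission and the file name default_solution.csv are placeholders of my own, and the layout (one value per line, no header row) is an assumption based on the single-column submission format used in this competition, so check the competition page before uploading.

```python
def write_default_submission(path, n_rows=2501, value=0.5):
    """Write n_rows lines, each holding the same constant prediction.

    Assumption: the submission is a single column with no header row.
    """
    with open(path, "w") as handle:
        for _ in range(n_rows):
            # "%f" matches the formatting used in the logistic regression code
            handle.write("%f\n" % value)
    return n_rows
```

Calling write_default_submission("default_solution.csv") from your competition directory should leave a 2501-line file ready to upload.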

A Logistic Regression Solution

Now, if you know a little statistics, you will recognize this problem as a classification problem, since the observed responses are either 0 or 1. Thus logistic regression is a decent algorithm to try. Here is the Python code to run logistic regression.

#!/usr/bin/env python

from sklearn.linear_model import LogisticRegression
import csv_io
import math
import scipy

def main():
    #read in the training file
    train = csv_io.read_data("train.csv")
    #set the training responses
    target = [x[0] for x in train]
    #set the training features
    train = [x[1:] for x in train]
    #read in the test file
    realtest = csv_io.read_data("test.csv")

    # code for logistic regression
    lr = LogisticRegression()
    lr.fit(train, target)
    predicted_probs = lr.predict_proba(realtest)

    # write solutions to file
    predicted_probs = ["%f" % x[1] for x in predicted_probs]
    csv_io.write_delimited_file("log_solution.csv", predicted_probs)

    print ('Logistic Regression Complete! Submit log_solution.csv to Kaggle')

if __name__=="__main__":
    main()

Raw code can be obtained here (Please use the raw code if you are going to copy/paste).
Save this file as log_regression.py in the directory you created above. Then open the Python GUI. You may need to run the following commands to navigate to the correct directory.

import os
os.chdir('c:/kaggle/bioresponse')

Now you can run the actual logistic regression.

import log_regression
log_regression.main()

Now upload log_solution.csv to Kaggle, and you are playing the game.

Results

If you performed these steps, I would love to know about it. Thanks for following along, and good luck with Kaggle.

Get Started With Kaggle – Description

Yesterday, I posted about the popularity of data hackathons. Well, today let’s get started with Kaggle. This is the first of a few simple posts about making your first submission to a Kaggle competition. I also promise you won’t be in last place. You won’t be in first either. This is an excellent way to start developing your data science skills.

The Problem

The Biological Response competition seems to be a good starting point. The data is fairly straightforward. It consists of rows and columns. Each row represents a molecule. The first column represents a biological response, and the remaining 1776 columns are features of the molecule (technically, calculated molecular descriptors). Unfortunately, the data does not specifically state what each column represents. Thus, domain knowledge of biology is not really helpful.

The Data

For this problem, Kaggle provides 2 sets of data. The first file is a training set. It includes data with responses and features. Obviously it is used for training your algorithm. The actual responses are either the value 0 or the value 1. The second file is very similar except it does not contain the responses. It is called the test file.
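
To make the layout concrete, here is a small sketch that splits training rows into responses and features. The three-feature sample, the column names, and the presence of a header row are all assumptions for illustration (the real file has 1776 feature columns), and split_training_rows is a name I made up.

```python
import csv

# Toy stand-in for the training file: response first, then the features.
# The header row and its column names are hypothetical.
SAMPLE = "response,f1,f2,f3\n1,0.0,0.5,0.1\n0,0.2,0.0,0.9\n"

def split_training_rows(text):
    """Return (targets, features) from CSV text with one header row."""
    reader = csv.reader(text.splitlines())
    next(reader)  # skip the (assumed) header row
    targets, features = [], []
    for row in reader:
        values = [float(cell) for cell in row]
        targets.append(int(values[0]))   # first column: the 0/1 response
        features.append(values[1:])      # remaining columns: the features
    return targets, features

targets, features = split_training_rows(SAMPLE)
```

The test file would be handled the same way, except that every column is a feature because the response column is absent.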

How To Submit A Solution

Your goal as a participant is to run your algorithm against the test file and predict the response. Each predicted response should be a value between 0 and 1. After your algorithm runs it should produce an output file with the predicted response for each row on a separate line. Your submission file is just a single column.
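
Before uploading, it can help to sanity-check the file against that description. The sketch below is one way to do it; check_submission_lines is a hypothetical helper of mine, not something Kaggle provides, and it only checks the rules stated above (one number per line, each between 0 and 1, and the expected row count).

```python
def check_submission_lines(lines, expected_rows):
    """Return a list of problems; an empty list means the format looks fine."""
    problems = []
    if len(lines) != expected_rows:
        problems.append("expected %d rows, found %d" % (expected_rows, len(lines)))
    for number, line in enumerate(lines, 1):
        try:
            value = float(line)
        except ValueError:
            problems.append("line %d is not a number: %r" % (number, line))
            continue
        if not 0.0 <= value <= 1.0:
            problems.append("line %d is outside [0, 1]: %r" % (number, line))
    return problems
```

You could feed it the lines of your submission file, for example with open("log_solution.csv").read().splitlines(), and fix anything it reports.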

The Ranking

To submit a solution, you just upload your submission file. Kaggle then compares your predicted responses with the actual responses for the test set. Kaggle knows those values, but they do not share them with participants. The comparison method used for this competition is called Log Loss. For a description of Log Loss, see the Kaggle Wiki Page about scoring metrics. The goal of this competition is to get the lowest score.
Note: only 2 submissions are allowed per day.
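
If you want a feel for the metric, here is a minimal sketch of the standard binary log loss formula, the negative average of y*log(p) + (1-y)*log(1-p). The clipping constant eps is my assumption to avoid taking log(0); Kaggle's exact implementation details may differ, so treat the Wiki page as the authority.

```python
import math

def log_loss(actual, predicted, eps=1e-15):
    """Average binary log loss; lower is better.

    eps clips predictions away from exactly 0 and 1 (an assumed safeguard).
    """
    total = 0.0
    for y, p in zip(actual, predicted):
        p = min(max(p, eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(actual)
```

With this formula, predicting 0.5 for every row scores ln(2), roughly 0.693, no matter what the true responses are. Confident predictions help when they are right and hurt badly when they are wrong.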

You Can Do It

That is my brief description of a Kaggle competition. It doesn’t sound too hard, does it? Tomorrow, we can step through making our first submission. Go register for an account, so you are ready to submit a solution tomorrow. Be careful: once you start Kaggling (I think I just invented that word), you might not want to stop.

Data Science Training Program in New York

If you are in New York City or the surrounding area and you want to learn data science, this post is for you. General Assembly (a technology, design, and entrepreneurship campus in New York City) is running a 12-week Intensive Program in Data Science. The course consists of lectures (twice a week), labs, homework, and a comprehensive project. The instructors are Max Shron of OkCupid fame and Ryan Witt, founder of Opani. The course does cost $3000, but that seems like a fair price for the knowledge gained and a certificate.

Are you aware of any other training programs like this?

A Data Science Curriculum

This is not intended to be mapped to a set of college courses. It is intended to be a listing of necessary skills for a data scientist. For a definition of data scientist, see this previous post.

Mathematics

  • Calculus – not directly important to data science, but the knowledge is important to understand the statistics and machine learning
  • Matrix Operations

Statistics

  • Regression – Linear and Logistic
  • Bayesian Statistics

Tools

  • Hadoop
  • R – stats
  • Octave – machine learning

Computing

  • Basic Programming – Java, C/C++, and Python seem to be good language choices
  • Machine Learning
  • Database Knowledge – not limited to just relational databases

Communication

  • Data Visualization – how to make data look good: maps, graphs, etc
  • Presentation – story telling, be comfortable explaining data to others
  • Writing

Do you have anything to add/remove from the list?

Tell Someone About Data Science

Please spread the word about why data science is important. If you are excited, others will be too. If you are not sure what to say, here is a list of possible topics.

What can you tell people about data science?

What are some other things you could tell people about data science?

STEM Graduates Quit Because The Material Is Difficult

STEM stands for Science, Technology, Engineering and Mathematics. Due to the difficulty of STEM degrees, it appears many students abandon the degrees in college. While this fact is not surprising, it is still concerning. Our country and world need more good people with STEM skills.

A STEM degree is not essential to becoming a data scientist, but many data scientists have STEM backgrounds. Thus, I thought this information fit well with the Data Science Education Week theme.

How do we convince students to not abandon the STEM degrees?

One solution is to put less emphasis on grades. Grades in STEM courses are typically the lowest on campus, and this causes some students to switch degree programs in order to get better grades. Second, tell young people about some of the cool STEM projects available. Lots of people in Science and Math work on really interesting projects. If you can, tell the world about your projects.

What are some other ways to keep students in STEM programs?

Below is a nice infographic with various numbers about STEM students.

Thanks to Online Engineering Degree for the infographic.

Data Science Courses

This is a nice collection of data science related courses offered at various colleges and universities. It is on a wiki page, so you are free to add links.