Yesterday, I posted about the popularity of data hackathons. Well, today let’s get started with Kaggle. This is the first of a few simple posts about making your first submission to a Kaggle competition. I also promise you won’t be in last place. You won’t be first either. This is an excellent way to start developing your data science skills.
The Biological Response competition seems to be a good starting point. The data is fairly straightforward: it consists of rows and columns. Each row represents a molecule. The first column represents a biological response, and the remaining 1776 columns are features of the molecule (technically, calculated molecular descriptors). Unfortunately, the data does not specifically state what each column represents. Thus, domain knowledge of biology is not really helpful.
For this problem, Kaggle provides 2 sets of data. The first file is a training set. It includes data with responses and features. Obviously it is used for training your algorithm. The actual responses are either the value 0 or the value 1. The second file is very similar except it does not contain the responses. It is called the test file.
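To make the layout concrete, here is a minimal sketch of splitting training rows into responses and features. It assumes the training file is comma-separated with one header row; the column names in the usage example (Activity, D1, D2) are made up for illustration, and the real file has 1776 feature columns.

```python
import csv
import io


def split_training_rows(csv_text):
    """Split training data into (responses, features).

    Assumes comma-separated text with a single header row, the response
    (0 or 1) in the first column, and numeric features in the rest.
    """
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row (assumed to exist)
    responses, features = [], []
    for row in reader:
        if not row:
            continue  # ignore blank lines
        responses.append(int(row[0]))
        features.append([float(value) for value in row[1:]])
    return responses, features


# Tiny made-up sample with two feature columns instead of 1776.
sample = "Activity,D1,D2\n1,0.5,0.1\n0,0.0,0.9\n"
responses, features = split_training_rows(sample)
```

The test file would be read the same way, except that every column is a feature because the response column is absent.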
How To Submit A Solution
Your goal as a participant is to run your algorithm against the test file and predict the response. Each predicted response should be a value between 0 and 1. After your algorithm runs it should produce an output file with the predicted response for each row on a separate line. Your submission file is just a single column.
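The submission format described above can be sketched in a few lines of Python. The function names and the six-decimal formatting are my own choices for illustration, and whether Kaggle expects a header line is not covered here, so check the competition page before uploading.

```python
from pathlib import Path


def format_submission(predictions):
    """Return the submission text: one predicted response per line."""
    for p in predictions:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"prediction {p!r} is not between 0 and 1")
    return "".join(f"{p:.6f}\n" for p in predictions)


def write_submission(predictions, path):
    """Write the single-column submission file to `path`."""
    Path(path).write_text(format_submission(predictions))


# Example: three made-up predictions for three test rows.
text = format_submission([0.5, 0.25, 1.0])
```

Checking the range before writing is a cheap way to avoid wasting one of your daily submissions on a malformed file.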
To submit a solution, you just upload your submission file. Kaggle then compares your predicted responses with the actual responses for the test set. Kaggle knows those values but does not share them with participants. The comparison method used for this competition is called Log Loss. For a description of Log Loss, see the Kaggle Wiki Page about scoring metrics. Lower is better, so the goal of this competition is to get the lowest score.
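If you want a feel for Log Loss before reading the wiki page, here is a minimal sketch of the standard binary formula, the negative average of y*log(p) + (1-y)*log(1-p). The clipping value eps is an assumption on my part to avoid log(0); Kaggle's exact handling of predictions of exactly 0 or 1 may differ.

```python
import math


def log_loss(actual, predicted, eps=1e-15):
    """Binary log loss; `actual` holds 0/1 values, `predicted` probabilities."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one row to score")
    total = 0.0
    for y, p in zip(actual, predicted):
        p = min(max(p, eps), 1 - eps)  # clip so log() never sees 0
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(actual)


# Predicting 0.5 everywhere scores ln(2), roughly 0.693.
baseline = log_loss([1, 0], [0.5, 0.5])
```

Confident wrong answers are punished heavily, which is why predicting a probability between 0 and 1 is usually safer than submitting hard 0 or 1 values.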
Note: only 2 submissions are allowed per day.
You Can Do It
That is my brief description of a Kaggle Competition. It doesn’t sound too hard, does it? Tomorrow, we can step through making our first submission. Go register for an account, so you are ready to submit a solution tomorrow. Be careful: once you start Kaggling (I think I just invented that word), you might not want to stop.
If you are in New York City or the surrounding area and you want to learn data science, this post is for you. General Assembly (a technology, design, and entrepreneurship campus in New York City) is running a 12-week Intensive Program in Data Science. The course consists of lectures (twice a week), labs, homework, and a comprehensive project. The instructors are Max Shron of OkCupid fame and Ryan Witt, founder of Opani. The course does cost $3000, but that seems like a fair price for the knowledge gained and a certificate.
Are you aware of any other training programs like this?
This is not intended to be mapped to a set of college courses. It is intended to be a listing of necessary skills for a data scientist. For a definition of data scientist, see this previous post.
- Calculus – not directly important to data science, but the knowledge is important to understand the statistics and machine learning
- Matrix Operations
- Regression – Linear and Logistic
- Bayesian Statistics
- R – stats
- Octave – machine learning
- Basic Programming – Java, C/C++, and Python seem to be good language choices
- Machine Learning
- Database Knowledge – not limited to just relational databases
- Data Visualization – how to make data look good: maps, graphs, etc
- Presentation – storytelling, being comfortable explaining data to others
Do you have anything to add/remove from the list?
Please spread the word about why data science is important. If you are excited, others will be too. If you are not sure what to say, here is a list of possible topics.
What can you tell people about data science?
What are some other things you could tell people about data science?
STEM stands for Science, Technology, Engineering and Mathematics. Due to the difficulty of STEM degrees, it appears many students abandon the degrees in college. While this fact is not surprising, it is still concerning. Our country and world need more good people with STEM skills.
A STEM degree is not essential to becoming a data scientist, but many data scientists have STEM backgrounds. Thus, I thought this information fit well with the Data Science Education Week theme.
How do we convince students to not abandon the STEM degrees?
One solution is to put less emphasis on grades. Grades in STEM courses are typically the lowest on campus, and this causes some students to switch degree programs in order to get better grades. Second, tell young people about some of the cool STEM projects available. Lots of people in Science and Math work on really interesting projects. If you can, tell the world about your projects.
What are some other ways to keep students in STEM programs?
Below is a nice infographic with various numbers about STEM students.
Thanks to Online Engineering Degree for the infographic.
Data Science Courses
This is a nice collection of data science related courses offered at various colleges and universities. It is on a wiki page so you are free to add links.
This infographic displays the need for colleges and universities to start preparing more data science graduates.
Yes, I just declared this week as Data Science Education Week. As far as I know, I have no authority to do such a thing. Hey, I did it anyway. All week I will be posting information related to data science education. So, pull up a chair and get ready for some serious data science education topics.
I have a problem. This is a problem that I would guess many other people have. I have access to way too much information. I want less, but I also want the best and newest.
How do I find the best and newest information on any topic? There is a lot of new information every day. I spend a lot of time searching the internet for quality information on data science. I would love to be able to visit a page and get the latest and greatest information on data science, statistics, big data, and machine learning. I would be hoping to get news articles and/or blog posts from the last couple of days.
Here is a list of products I have seen and why they are not exactly what I am looking for:
- News.me – This site is close. It emails a daily list of top articles, but the articles only come from the people I follow on Twitter. The problem is that I may not be following the right people. They did just do an excellent blog series about how people get news, so they may be working on something right now.
- Paper.li – This seems very promising, but not all the content is new. It also only updates 1 or 2 times per day. The slow updates make it difficult to get the latest news quickly. After a few weeks of tuning your searches and parameters, this might be really good for getting news on a daily basis. The big problem for me is that I have to wait until the next day to see if my search parameters are correct. It is not built to handle ad hoc queries. Here is the paper I am working on, Data Science 101 News.
- Storify – This is a way to create a collection of information from various social networks. Storify makes it really easy to find social media mentions, but it is not automated and doesn’t save much time. Not that it matters much to me, but the final product is not real pretty.
- Summify – Recently bought by Twitter, so the future is uncertain.
- TweetedTimes – Same problem as News.me. It only generates information based upon people I follow on Twitter.
- Google News – This is good for searching, but I think a better one exists or needs to be created. Plus, the Google Privacy issues are a concern here.
This whole concept of filtering/searching/rating news sounds like a data science problem itself. For starters, given a topic, what information has been tweeted the most? What new information has spread the quickest? This approach could be expanded to include Facebook likes and Google +1’s. There are also numerous other APIs that could be included. What I really want is a product that will do this in real time (or near real time). I want to be able to enter my search terms and get a list of the most recent quality information pertaining to those terms. I guess what I want is a search engine for recent news (but not Google).
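As a rough sketch of the idea, here is one way to rank recent articles by how quickly they have spread. Everything here is hypothetical: the Article fields, the shares-per-hour score, and the two-day cutoff are my own assumptions, and a real product would have to pull the share counts from the social network APIs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Article:
    url: str
    published: datetime
    shares: int  # combined tweets, likes, +1's, etc. (hypothetical count)


def share_velocity(article, now):
    """Shares per hour since publication (age floored at one hour)."""
    age_hours = max((now - article.published).total_seconds() / 3600, 1.0)
    return article.shares / age_hours


def rank_recent(articles, now, max_age=timedelta(days=2)):
    """Keep articles newer than `max_age`, fastest-spreading first."""
    recent = [a for a in articles if now - a.published <= max_age]
    return sorted(recent, key=lambda a: share_velocity(a, now), reverse=True)


# Made-up example data using placeholder URLs.
now = datetime(2012, 3, 20, 12, 0)
articles = [
    Article("http://example.com/a", now - timedelta(hours=10), 100),
    Article("http://example.com/b", now - timedelta(hours=2), 50),
    Article("http://example.com/old", now - timedelta(days=5), 1000),
]
ranked = rank_recent(articles, now)
```

Scoring by shares per hour favors new items that are spreading fast over older items with large totals, which matches the "best and newest" goal, though other scoring choices are certainly possible.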
Does anyone know if a product like this exists or is anyone working to build a similar product?
Update: This idea also goes along with Paul Graham’s Ambitious Startup Idea #1 – A New Search Engine.
Data Without Borders
Jake Porway started Data Without Borders because he attended a hack-a-thon and the groups came up with apps that didn’t really better the world very much. I believe he used the word “unfulfilling” to describe the apps. He decided to create a way to provide organizations (government or non-profit) with access to data scientists. His thinking goes like this. There are lots of data scientists who love to work with data. There are great organizations with lots of data. If the two can be matched together, what amazing things can be done? Data Without Borders hopes to find out.
Data Without Borders organizes a bunch of DataDives, which are weekend hack-a-thons that match up a group of data scientists and developers with data.
Jake concluded with some wonderful remarks:
What if we started using data not just to make better decisions about what kind of movies we wanted to see? What if we started using data to make better decisions about what kind of a world we wanted to see?
What is Data Without Borders looking for?
Jake’s Presentation at PopTech