Data Science Training Program in New York

If you are in New York City or the surrounding area and you want to learn data science, this post is for you. General Assembly, a technology, design, and entrepreneurship campus in New York City, is running a 12-week Intensive Program in Data Science. The course consists of lectures (twice a week), labs, homework, and a comprehensive project. The instructors are Max Shron of OkCupid fame and Ryan Witt, founder of Opani. The course costs $3000, but that seems like a fair price for the knowledge gained and a certificate.

Are you aware of any other training programs like this?

Challenge To Future Developers: Start Storing More Data

Dear Future Developers,

Please store as much data as possible. Do not worry about the cost of the extra storage disks. The value in the data will far outweigh the cost of the hardware. Here are some examples of data that could be stored but is typically not.

Start storing data about the order in which pages on your site get visited. Where do visitors most often land, and where do they go from there? Is there a path that leads to visitors becoming customers? Is there a path that leads to visitors leaving? Both would be good to know. Given enough of this data, it would be possible to predict what pages eventually lead to the most customers.
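As a rough illustration of what this could look like, here is a minimal Python sketch that counts page-to-page transitions per session. The function name, the (session_id, page) record format, and the sample data are all assumptions made up for this example; a real site would read these records from wherever it stores page views.

```python
from collections import Counter


def count_transitions(visits):
    """Count how often visitors move from one page to the next.

    `visits` is an iterable of (session_id, page) tuples, already in
    the order the pages were visited. (Hypothetical record format.)
    """
    last_page = {}           # session_id -> most recent page seen
    transitions = Counter()  # (from_page, to_page) -> count
    for session_id, page in visits:
        previous = last_page.get(session_id)
        if previous is not None:
            transitions[(previous, page)] += 1
        last_page[session_id] = page
    return transitions


# Made-up sample data: three sessions on an imaginary site.
visits = [
    ("s1", "/home"), ("s1", "/pricing"), ("s1", "/signup"),
    ("s2", "/home"), ("s2", "/blog"),
    ("s3", "/home"), ("s3", "/pricing"),
]

transitions = count_transitions(visits)
print(transitions.most_common(1))  # [(('/home', '/pricing'), 2)]
```

With enough real data, the same counts could feed a model that estimates which paths tend to end at a signup page and which tend to end with the visitor leaving.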

Start storing log information in a database. Some places do this, but far too many do not. As developers, we should give this a higher priority. It is never fun to debug a problem only to find the log file has been overwritten. Setting up a database for this would definitely save debugging time. Plus, the log data could be helpful for spotting trends or parts of the system that frequently have issues. Remember that not all bugs produce errors, so it is important to store all the log data.
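One possible way to do this in Python is a custom logging handler that writes every record to a database. The sketch below uses the standard library's logging and sqlite3 modules. The class name SQLiteLogHandler, the app_log table, and its columns are assumptions invented for this example, and an in-memory SQLite database stands in for whatever database you would actually use.

```python
import logging
import sqlite3


class SQLiteLogHandler(logging.Handler):
    """Hypothetical handler that stores each log record as a table row."""

    def __init__(self, connection):
        super().__init__()
        self.connection = connection
        self.connection.execute(
            "CREATE TABLE IF NOT EXISTS app_log ("
            "created REAL, level TEXT, logger TEXT, message TEXT)"
        )

    def emit(self, record):
        try:
            self.connection.execute(
                "INSERT INTO app_log VALUES (?, ?, ?, ?)",
                (record.created, record.levelname,
                 record.name, record.getMessage()),
            )
            self.connection.commit()
        except Exception:
            # Let the logging module report the failure instead of crashing.
            self.handleError(record)


# In-memory database for the demo; a real setup would use a persistent one.
connection = sqlite3.connect(":memory:")
logger = logging.getLogger("demo_app")
logger.setLevel(logging.DEBUG)  # keep everything, not just errors
logger.addHandler(SQLiteLogHandler(connection))

logger.info("user %s logged in", "alice")
logger.warning("slow query on page %s", "/reports")

rows = connection.execute(
    "SELECT level, message FROM app_log ORDER BY rowid"
).fetchall()
print(rows)
```

Because the level is set to DEBUG, informational messages are kept along with warnings and errors, which matches the point that not all bugs produce errors.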

Start storing data about the errors that occur and what (screen/page) caused each error. This information is typically stored in a log file somewhere, and it is too frequently lost after a couple of days. It would be much better to store this information in a database for archival purposes. This is closely related to the point about log data above.

Start storing information about which fields on a form get updated. Then you can notice if users are constantly returning to the same form to update a different field. Maybe the user was unaware that both fields can be updated simultaneously. Rearranging the fields might create a better user experience, and it would decrease the number of updates hitting the database.
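Here is a small sketch of how that could be checked, again using SQLite from the Python standard library. The field_update table, its columns, and the sample rows are assumptions invented for this example. The query looks for users who submitted the same form more than once and changed more than one field across those submissions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE field_update ("
    "user_id TEXT, form TEXT, field TEXT, submitted_at TEXT)"
)

# Made-up sample data: one row per field changed in a form submission.
updates = [
    ("u1", "profile", "email", "2012-05-01T10:00"),
    ("u1", "profile", "phone", "2012-05-01T10:02"),
    ("u2", "profile", "email", "2012-05-01T11:00"),
    ("u3", "profile", "email", "2012-05-02T09:00"),
    ("u3", "profile", "phone", "2012-05-02T09:01"),
]
conn.executemany("INSERT INTO field_update VALUES (?, ?, ?, ?)", updates)

# Users who came back to the same form and updated a different field.
repeat_visitors = conn.execute(
    """
    SELECT user_id, form,
           COUNT(DISTINCT submitted_at) AS submissions,
           COUNT(DISTINCT field) AS fields_changed
    FROM field_update
    GROUP BY user_id, form
    HAVING COUNT(DISTINCT submitted_at) > 1
       AND COUNT(DISTINCT field) > 1
    ORDER BY user_id
    """
).fetchall()
print(repeat_visitors)
```

If many users show up in a result like this for the same form, that would be a hint that the form layout is hiding the fact that several fields can be changed at once.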

Start storing data about which buttons and links users click. This is not just the pages visited but the actual user actions. A good web analytics program can cover some of this, but why not store all of it yourself? Then you can do with it as you please. It would be great to know which buttons on your site users click the most. Is it the color, location, neither, or both that determines a popular button? What buttons and links never get clicked? How frequently does the same user click each button? If a user continues to come back and click the same button, it may indicate a navigation issue. There are some nice usability enhancements that can be made with this data.
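Once the clicks are stored, questions like these take only a few lines to answer. The following Python sketch assumes click events are available as (user_id, element_id) pairs and that you have a list of every button and link on the site. The function name, the record format, and the sample data are all made up for illustration.

```python
from collections import Counter


def summarize_clicks(click_events, known_elements):
    """Summarize stored click events.

    `click_events` is an iterable of (user_id, element_id) tuples and
    `known_elements` lists every button or link on the site.
    (Both formats are assumptions for this sketch.)
    """
    events = list(click_events)
    clicks_per_element = Counter(element for _, element in events)
    never_clicked = sorted(set(known_elements) - set(clicks_per_element))
    clicks_per_user = Counter(events)  # (user_id, element_id) -> count
    return clicks_per_element, never_clicked, clicks_per_user


# Made-up sample data.
click_events = [
    ("u1", "buy"), ("u1", "buy"), ("u1", "buy"),
    ("u2", "buy"), ("u2", "help"),
]
known_elements = ["buy", "help", "share"]

per_element, never_clicked, per_user = summarize_clicks(
    click_events, known_elements
)
print(per_element.most_common(1))  # [('buy', 4)]
print(never_clicked)               # ['share']
print(per_user[("u1", "buy")])     # 3
```

A high count for one user on one element, like the three clicks by "u1" here, is the kind of pattern that might point to a navigation issue.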

Start storing data that you cannot immediately see as useful. The bigdata movement is continually showing the advantage of having more data. You never know when or for what the data will be useful.

Many of the current NoSQL choices would be good candidates to store the above data. This data will obviously grow very quickly, and speedy inserts are a must. Therefore, a database like MongoDB, Cassandra or Redis might be a good choice.

What other data do you think could be collected? I am sure there are lots of other possibilities. Also, I am going to take myself up on this challenge. I would like to store more information about the software I build.


Ryan Swanstrom

Free Online Strata Conference

The May 2012 Strata conference will take place online. The cost is zero, and the event is tomorrow (May 16, 2012). The only catch is that you must register first. The entire conference is scheduled to take place in the morning, so the format looks quick. Judging from other Strata videos I have seen, I would guess this will be a high-quality event.

Thanks to DataGeeks-MSP for alerting me to the conference.

Healthcare, BigData, and Startups

Healthcare is starting to see the value of data science. Here are two data science events aimed at generating value for healthcare.

Call for Startups – HealthStartup III on Big Data: This is an event for connecting healthcare, startups and investors in Europe.
Health 2.0’s Boston Big-Data Code-a-thon: This event starts today. It is a competition to develop some application for the healthcare industry. The application must use bigdata, and the teams have 2 days.

Be A Data Rat

In this video, Jeff Hammerbacher of Cloudera mentions that good data scientists are “data rats.” Athletes are often called “gym rats” if they spend a lot of time in the gym; likewise, Jeff believes “data rats” need to spend a lot of time with data. Having a high level of curiosity is very important.

Jeff also teaches an introductory course in Data Science at Berkeley. In the course, he tries to cover 5 skills that are not typically covered in an undergraduate curriculum.

  1. Data Collection and Integration – know how to acquire and integrate data
  2. Visualization Design – not just chart design but entire dashboard design
  3. Large-scale Experimentation – rapidly design and deploy features to be tested
  4. Causal Inference – you don’t get to design the studies, you just deal with the data
  5. Data Products – how to deploy and evaluate a machine learning algorithm