What Makes a Good Data Scientist?

Jeremy Howard is the Chief Scientist at Kaggle. At the end of this interview from the Strata Conference 2012, he identifies four simple traits that a data scientist needs.

  1. Creativity
  2. Open-mindedness
  3. Tenacity
  4. A Good Skillset

Jeremy Howard of Kaggle at Strata 2012

In this brief interview he covers a range of other data science topics:

  • Big Data is an engineering problem
  • Analytics generate value/insight from data
  • Predictive Modeling is about answering a question – build a model to do that
  • Is Data Science about tools or people? – watch the video for Jeremy’s answer
  • And others…

See this previous post for more videos from Strata 2012.

Did You Miss Strata 2012?

The 2012 Strata Conference, “Making Data Work”, just finished up. If you (like me) were unable to attend, you missed out on the networking and the excitement of being there in person, but you can still glean some knowledge from the videos.

Steve Schoettler “Learning Analytics”

This is a good video about how data can be used to help people learn.
There are many other Strata 2012 videos available as well. See below for links to them.

Other Strata 2012 Videos

See the O’Reilly Strata CA 2012 Playlist on YouTube for more videos. The videos include numerous interviews with the speakers and even a few of the talks. Also, many of the slide decks can be found on the Strata Conference website.

Have fun catching up on everything that happened at Strata Conference 2012.

Machine Learning on Big Data

Max Lin of Google Research provides a great slide deck. The title is self-explanatory.

Link to Data Science Infographic

This infographic does a great job of displaying what a data scientist does and what skills are needed.  Just click and check it out for yourself.

Need A Data Scientist? Probably

In the article Do you need a data scientist?, the following questions get answered:

  • What do data scientists do?
  • Who makes a good data scientist?
  • When is the right time to hire a data scientist?

Hopefully, I will discuss each of these questions in more detail in a later blog post.

To answer the question in the title: if your data is growing, you would probably benefit from a data scientist.

The following is a video that goes along with this topic.

205 Million Dollars of Funding For Big Data Startups

It is a great time to be working on a startup in the Big Data arena. First, the topics of big data and data science are really popular in the tech world right now. Second, it appears that investors are interested as well. Below are two examples:

Accel Partners formed a $100 million fund for startups that are focused on Big Data. The fund is not limited to storage or analysis; it is really all-encompassing. If you have a startup that, in any way, helps people or businesses deal with lots of data, then you are welcome to apply. For more on the fund and how to apply, see the page on the Accel Partners website.

IA Ventures has also set up a $105 million fund for startups that are focused on data.

So if you have a “data” startup, now is a great time to get some funding.

Stanford Machine Learning Class – What is covered

A few days ago, I mentioned that the Stanford Machine Learning class will be starting soon.  I thought I should quickly mention some of the topics covered.  The list also serves as a great outline for machine learning.

Supervised Learning

In supervised learning, one has a set of data with features and labels.

  • Linear Regression – one/multiple variables
  • Gradient Descent – a general algorithm for minimizing a function
  • Logistic Regression – This is useful for predicting classification-type results. For example, are you looking for a yes or no answer? Does the patient have cancer? Will the customer buy my new product? It can also be extended to more than two outcomes: what color will a person choose (red, blue, green, silver)?
  • Neural Networks – A learning algorithm that is modeled after the brain.  Think of neurons.
  • Support Vector Machines
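
To make the first two items a little more concrete, here is a minimal sketch of gradient descent fitting a one-variable linear regression. It is only an illustration under my own assumptions, not code from the class: the function name, the learning rate, the number of steps, and the toy data are all made up for the example.

```python
def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by minimizing the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors with the current parameters
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error w.r.t. w and b
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step downhill, against the gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data (hypothetical) generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))
```

On this toy data the fitted slope and intercept should land very close to 2 and 1. A learning rate that is too large makes the updates diverge, so it usually needs some tuning for real data.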

Unsupervised Learning

In unsupervised learning, one has a set of data with features but no labels. Can some structure be found for the data?

  • Clustering – The most popular technique is K-means.
  • PCA (Principal Components Analysis) – reduces the number of dimensions in the data, which can speed up a learning algorithm
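
As a rough illustration of clustering, the following is a small K-means sketch in plain Python. It is a simplified version under my own assumptions: the starting centroids are passed in by hand (real implementations usually pick them randomly), it runs a fixed number of iterations, and the points are made-up 2-D data.

```python
def k_means(points, centroids, iterations=10):
    """Cluster 2-D points around the given starting centroids."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster
        new_centroids = []
        for c, cluster in zip(centroids, clusters):
            if cluster:
                new_centroids.append((
                    sum(p[0] for p in cluster) / len(cluster),
                    sum(p[1] for p in cluster) / len(cluster),
                ))
            else:
                new_centroids.append(c)  # leave an empty cluster's centroid in place
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data: two obvious groups of three points each
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = k_means(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

The two steps inside the loop (assign points, then move centroids) are the whole algorithm; everything else is bookkeeping.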

Anomaly Detection

This section covers methods for spotting data points that do not fit the pattern of the rest of the data. Such unusual points are considered anomalies.
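
One simple way to flag unusual values is to estimate the mean and standard deviation from data that is known to be normal, then mark any new value that falls too far from the mean. The sketch below assumes a single feature and a cutoff of three standard deviations; the function names, the cutoff, and the readings are my own choices for illustration.

```python
import statistics


def fit_gaussian(training_values):
    """Estimate mean and standard deviation from known-normal data."""
    return statistics.mean(training_values), statistics.pstdev(training_values)


def is_anomaly(value, mean, stdev, threshold=3.0):
    """Flag a value that lies more than `threshold` standard deviations from the mean."""
    return abs(value - mean) > threshold * stdev


# Hypothetical sensor readings that are assumed to be normal
readings = [10.1, 9.8, 10.0, 10.3, 9.9, 10.2, 9.7, 10.0]
mean, stdev = fit_gaussian(readings)

# Check two new values: one typical, one far outside the usual range
flags = [is_anomaly(v, mean, stdev) for v in [10.1, 14.5]]
print(flags)
```

Fitting on clean data first matters: if the outliers are included when estimating the standard deviation, they inflate it and can hide themselves.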

Recommender Systems

Like the name says, recommender systems are used to make recommendations.  Companies like Netflix use recommender systems to recommend new movies to customers.  LinkedIn also recommends people to connect with.  This is a fairly hot topic in the tech world right now.

  • Content Based (Features)
    • Modified Linear Regression
  • Non-content Based (No Features)
    • Collaborative Filtering
    • Matrix Factorization
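
To give a feel for collaborative filtering, here is a small user-based sketch: a missing rating is predicted as a similarity-weighted average of other users' ratings. This is one simple variant chosen for illustration, not necessarily the formulation taught in the class, and the users, movies, and ratings are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = [item for item in a if item in b]
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_rating(ratings, user, item):
    """Predict a rating as the similarity-weighted average of other users' ratings."""
    weighted_sum, weight_total = 0.0, 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        weighted_sum += sim * other_ratings[item]
        weight_total += sim
    return weighted_sum / weight_total if weight_total else None


# Hypothetical ratings on a 1-5 scale
ratings = {
    "ann": {"Alien": 5, "Brazil": 1, "Casablanca": 4},
    "bob": {"Alien": 5, "Brazil": 1, "Casablanca": 5, "Dune": 5},
    "cat": {"Alien": 1, "Brazil": 5, "Dune": 1},
}
prediction = predict_rating(ratings, "ann", "Dune")
print(prediction)
```

Because ann's tastes look much more like bob's than cat's, the prediction is pulled toward bob's high rating. No movie features are needed, only the ratings themselves, which is what makes the approach "non-content based".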

If any of these topics sound interesting to you, sign up for the Stanford Machine Learning class. Professor Andrew Ng will do an excellent job explaining the details.

Another Big Data startup launches.

Hilary Mason at Strata 2011

Hilary talks about data, data people, and the current momentum. She also brings up some current challenges.

Challenges with Data

  • Robust analysis on streams of data (in volume)
  • Store data so that it can be processed in real time
  • Better Education – Good news for this blog
  • Imagination – Stop solving the same problems
  • What to do with the data

Heroku Thinks Sharing Data is Important

Last week, Heroku announced a new feature for its PostgreSQL database service. The new feature is called Data Clip, and it allows users to share the results of an SQL query. It can either store the exact data from when the query was originally run or refresh the query to return the current data. I can definitely see this being useful for debugging code and troubleshooting, which may have been Heroku’s original intent.

I can also see the Data Clip being very useful for data science and quick sharing of relevant data. I doubt the Data Clip can handle huge result sets, but huge data is not always necessary. Sometimes, being able to quickly share data results is just as important. Plus, the Data Clip allows the results to be downloaded in Excel, CSV, JSON, or YAML formats, so the data can be easily manipulated from there.

See an example in action.

Learning To Be A Data Scientist