Tag Archives: cloudera

Cloudera Machine Learning Slides

A very nice slide deck from Jeff Hammerbacher of Cloudera. It goes over k-means clustering and some enhancements.
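For readers who have not seen k-means before, here is a minimal sketch of the basic algorithm (often called Lloyd's algorithm) in plain Python. It is not taken from the slides and does not include any of the enhancements they cover; the function name `kmeans`, its parameters, and the sample points are all made up for illustration.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Basic k-means on a list of equal-length numeric tuples (illustrative only)."""
    rng = random.Random(seed)
    # Start from k distinct input points chosen at random.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        new_assignments = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_assignments == assignments:
            break  # No point changed cluster, so the algorithm has converged.
        assignments = new_assignments
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # An empty cluster keeps its previous centroid.
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments


# Hypothetical data: two small, well-separated groups of 2-D points.
sample_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(sample_points, k=2)
print(centroids, labels)
```

The result depends on the random starting centroids, which is one reason practical implementations usually add smarter initialization or multiple restarts.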

Top 5 Places to Get a Data Scientist Job

  1. LinkedIn: They turn data into products better than anyone else.
  2. Facebook: If you are the type of person who loves to analyze people’s lives, there is no better place.
  3. Twitter: Duh, it’s Twitter. Lots of data and lots of possibilities.
  4. Cloudera: Cloudera is a successful Hadoop-based startup. Build tools and explore huge datasets for a variety of industries.
  5. Kaggle: If optimizing algorithms and really diving into the data to get every last ounce of information is your thing, then Kaggle is it. Plus, there is nowhere else you will get to work on so many important problems in such a wide range of domains. Unfortunately, Kaggle is not currently hiring any data scientists, but they most likely will be seeking more in the future.

There are many other companies hiring data scientists. Where would you like to be a data scientist?

Top 5 Data Startups

  1. Kaggle: They make data science a sport, enough said.
  2. DataKind: DataKind may not technically be a startup because it is a nonprofit, but they are doing cool stuff. They match nonprofit organizations with people who love to analyze data and create visualizations.
  3. Cloudera: They call themselves “The Platform for Big Data”. They are working hard to make Hadoop easier to use.
  4. Coursera: Coursera is an education startup, but with two computer science professors as founders, you can bet they are crunching a lot of data about how people learn.
  5. BigML: They are trying to make machine learning available to everyone. Machine Learning as a Service!

How To Learn Data Science?

Based upon the popularity of a previous post about a certificate program from the University of Washington, it appears that many people are interested in learning the skills necessary to become a data scientist. Thus, I decided to compile a list of some of the possible learning strategies.

Traditional College Education

The most obvious path would be to study at a traditional college or university. Colleges and universities are starting to notice the demand for data science skills, and many colleges are currently offering programs to prepare someone as a data scientist. This path is safe and predictable. Do the homework, complete the courses, and get the degree or certificate. Most people are familiar with the process, and it offers few surprises. The problems here are the costs, lack of flexibility, and time involved.

Corporate Training

Companies are now starting to offer training programs for data science. EMC is leading the way in this category with their data science training program. Cloudera also offers lots of training related to Hadoop and big data. Wolfram offers data science training with Mathematica. One of the problems with this category is the cost. Another problem is that companies tend to teach and promote their own products. This may leave the student with numerous gaps in the full data science spectrum.

Your Thoughts?

What are your thoughts about the above approaches? What are the positives and negatives? Also, later this week I will be posting some less-traditional approaches to learning data science.

Be A Data Rat

In this video, Jeff Hammerbacher of Cloudera mentions that good data scientists are “data rats.” Athletes are often considered “gym rats” if they spend a lot of time in the gym, so Jeff believes “data rats” need to spend a lot of time with data. Having a high level of curiosity is very important.

Jeff also teaches an introductory course in Data Science at Berkeley. In the course, he tries to cover 5 skills that are not typically covered in an undergraduate curriculum.

  1. Data Collection and Integration – know how to acquire and integrate data
  2. Visualization Design – not just chart design but entire dashboard design
  3. Large-scale Experimentation – rapidly design and deploy features to be tested
  4. Causal Inference – you don’t get to design the studies, you just deal with the data
  5. Data Products – how to deploy and evaluate a machine learning algorithm