Introductory Machine Learning Textbook

After teaching an introductory machine learning course for many semesters, Max Welling, a professor at the University of California, Irvine, compiled an introductory textbook titled A First Encounter with Machine Learning (PDF link).


Startups working on Machine Learning as a Service (MLaaS)

  1. BigML – A great interface. Just upload your data and it shows basic information for each column such as a histogram and mean values. See the Gallery for some examples of the final models.
  2. – Just launched, but it looks to be a serious contender. It was started by a team from UC Berkeley.
  3. Precog – Taking a slightly different approach, Precog is both a platform and an online IDE for data science. The IDE supports Quirrel, hyped as R for big data.
  4. Ersatz – Ersatz is currently in private beta, but the team is developing a web platform for building deep neural networks.

Am I missing any startups on this list?

While not really startups, the following two links might also fit here.

  1. Google Prediction API – Cloud-based machine learning tools
  2. PSI Project – A research project at the Australian National University.

A Couple Good Python Resources

In just the past month, a couple of great resources for learning Python have been created.

  1. Getting started with Python: Tips, Tools and Resources – If you are new to Python, this is a great place to start. It contains a brief description and links to books, tutorials, and MOOCs.
  2. Getting Started With Python for Data Scientists – This focuses more on tools specifically for data science.

Together, these links should provide all the resources a person needs to begin doing data science with Python.

10 Big Data Best Practices

10 Big Data Implementation Best Practices

This is a great article and list of topics to remember when working on big data projects. Here is the list.

  1. Gather business requirements before gathering data
  2. Implementing big data is a business decision not IT
  3. Use Agile and Iterative Approach to Implementation
  4. Evaluate data requirements
  5. Ease skills shortage with standards and governance
  6. Optimize knowledge transfer with a center of excellence
  7. Embrace and plan your sandbox for prototype and performance
  8. Align with the cloud operating model
  9. Associate big data with enterprise data
  10. Embed analytics and decision-making using intelligence into operational workflow/routine

See the original article, 10 Big Data Implementation Best Practices, for details.

R Commands for Cleaning Data

These are notes from the Coursera Data Analysis course.

Here are some R commands that might be helpful for cleaning data.

String Replacement

  • sub() replaces the first occurrence
  • gsub() replaces all occurrences
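
To make the difference concrete, here is a minimal sketch in base R. The city names are made-up sample values, not data from the course.

```r
# Hypothetical messy values, invented for illustration
x <- c("New_York_City", "San_Francisco")

# sub() replaces only the first match in each element
first_only <- sub("_", " ", x)

# gsub() replaces every match in each element
all_matches <- gsub("_", " ", x)

print(first_only)
print(all_matches)
```

Both functions treat the first argument as a regular expression by default; pass fixed=TRUE if the pattern should be matched literally.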

Quantitative Variables in Ranges

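  • cut(data$col, seq(0,100, by=10)) assigns each value to the range it falls into; in this example, whether the observation is between 0 and 10, 10 and 20, 20 and 30, and so on. Intervals are closed on the right by default, so a value of exactly 0 becomes NA unless include.lowest=TRUE is set.
  • cut2(data$col, g=6), from the Hmisc package, returns a factor variable with 6 quantile groups
  • cut2(data$col, m=25) returns a factor variable that aims for at least 25 observations in each group (the minimum is a target, not a guarantee)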
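
As a rough illustration of the binning behavior, the sketch below uses only base R cut() on a made-up vector of scores (cut2() is left out because it requires installing Hmisc).

```r
# Hypothetical scores, invented for illustration
scores <- c(5, 10, 37, 88, 100)

# Bin into ranges of width 10; intervals are open on the left, closed on the right
bins <- cut(scores, seq(0, 100, by = 10))

# Count how many observations fall into each range
print(table(bins))

# A value equal to the lowest break is NA unless include.lowest = TRUE
zero_default  <- cut(0, seq(0, 100, by = 10))
zero_included <- cut(0, seq(0, 100, by = 10), include.lowest = TRUE)
print(zero_default)
print(zero_included)
```

Note that 10 lands in the same bin as 5 because the upper bound of each interval is inclusive.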

Manipulating Rows/Columns

  • merge() combines two data frames by their common columns
  • sort() sorts a vector
  • order(data$col, na.last=TRUE) returns the row indices that put col in sorted order, with NA values last
  • data[order(data$col, na.last=TRUE),] reorders the entire data frame based on col
  • melt() from the reshape2 package reshapes data from wide to long format
  • rbind() appends rows to a data frame
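
Here is a small sketch that chains merge(), order(), and rbind() using base R only. The data frames and column names (people, ages, id, name, age) are hypothetical and exist purely for the example.

```r
# Two hypothetical data frames that share an id column
people <- data.frame(id = c(3, 1, 2),
                     name = c("Cy", "Ann", "Bo"),
                     stringsAsFactors = FALSE)
ages <- data.frame(id = c(1, 2, 3), age = c(31, NA, 25))

# merge() joins the two data frames on the shared id column
combined <- merge(people, ages, by = "id")

# order() gives the row indices sorted by age; NA values go last
sorted <- combined[order(combined$age, na.last = TRUE), ]

# rbind() appends a new row with matching column names
extra <- data.frame(id = 4, name = "Di", age = 40, stringsAsFactors = FALSE)
grown <- rbind(sorted, extra)

print(grown)
```

The new row is simply appended at the end, so run order() again afterward if the combined result needs to stay sorted.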

These functions accept additional parameters that do much more. There are also a number of other helpful R functions, but these are enough to get you started. Check the R help (?functionname) for more details.

Strata Videos to Watch

Here are a few of the recent Strata videos I would recommend.