Introductory Machine Learning Textbook

After numerous semesters of teaching an introductory course on machine learning, Max Welling, a professor at the University of California, Irvine, decided to compile an introductory textbook titled A First Encounter with Machine Learning (PDF link).

Startups working on Machine Learning as a Service (MLaaS)

  1. BigML – A great interface. Just upload your data and it shows basic information for each column such as a histogram and mean values. See the Gallery for some examples of the final models.
  2. – Just launched, but it looks to be a serious contender. It was started by a team from UC Berkeley.
  3. Precog – Taking a slightly different approach, Precog is both a platform and an online IDE for data science. The IDE supports Quirrel, hyped as R for big data.
  4. Ersatz – Ersatz is currently in private beta, but they are building a web platform for building deep neural networks.
  5. Predictron Labs – Cloud-based predictive analytics platform

Am I missing any startups on this list?

While not really startups, the following two links might also fit here.

  1. Google Prediction API – Cloud-based machine learning tools
  2. PSI Project – A research project at the Australian National University.

A Couple Good Python Resources

In just the past month, a couple of great resources for learning Python have been created.

  1. Getting started with Python: Tips, Tools and Resources – If you are new to Python, this is a great place to start. It contains a brief description and links to books, tutorials, and MOOCs.
  2. Getting Started With Python for Data Scientists – This focuses more on tools specifically for data science.

Together, these two links should provide all the resources needed to begin doing data science with Python.

Free Textbook and Toolkit: Natural Language Processing with Python

This is an online HTML version of the book Natural Language Processing with Python. The book is a companion to NLTK, a free, open-source toolkit written in Python for Natural Language Processing (NLP).

10 Big Data Best Practices

10 Big Data Implementation Best Practices

This is a great article and list of topics to remember when working on big data projects. Here is the list.

  1. Gather business requirements before gathering data
  2. Implementing big data is a business decision not IT
  3. Use Agile and Iterative Approach to Implementation
  4. Evaluate data requirements
  5. Ease skills shortage with standards and governance
  6. Optimize knowledge transfer with a center of excellence
  7. Embrace and plan your sandbox for prototype and performance
  8. Align with the cloud operating model
  9. Associate big data with enterprise data
  10. Embed analytics and decision-making using intelligence into operational workflow/routine

See the original article, 10 Big Data Implementation Best Practices, for details.

R Commands for Cleaning Data

These are notes from the Coursera Data Analysis course.

Here are some R commands that might prove helpful for cleaning data.

String Replacement

  • sub() replaces the first occurrence
  • gsub() replaces all occurrences
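
The difference between the two is easiest to see side by side. Here is a minimal sketch; the column names are invented for illustration, and fixed = TRUE is used so the dot is matched literally rather than as a regular expression wildcard.

```r
# Hypothetical messy column names, invented for this example
raw_names <- c("Total.Sales.2012", "Unit.Price.USD")

# sub() replaces only the first match in each string
first_only <- sub(".", "_", raw_names, fixed = TRUE)

# gsub() replaces every match in each string
all_matches <- gsub(".", "_", raw_names, fixed = TRUE)

print(first_only)   # "Total_Sales.2012" "Unit_Price.USD"
print(all_matches)  # "Total_Sales_2012" "Unit_Price_USD"
```

Without fixed = TRUE, the pattern "." would match any character, so sub() would replace the first character of each string instead of the first dot.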

Quantitative Variables in Ranges

  • cut(data$col, seq(0,100, by=10)) breaks the data up by the range each observation falls into; in this example, whether the observation is between 0 and 10, 10 and 20, 20 and 30, and so on
  • cut2(data$col, g=6) from the Hmisc package, returns a factor variable with 6 groups
  • cut2(data$col, m=25) from the Hmisc package, returns a factor variable aiming for at least 25 observations in each group
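
Here is a small sketch of cut() using only base R; the values are invented for illustration. cut2() is left out because it requires the Hmisc package to be installed.

```r
# Hypothetical values, invented for this example
data <- data.frame(col = c(5, 10, 15, 42, 99, 100))

# Breaks at 0, 10, ..., 100 give ten intervals of the form (low, high]
data$range <- cut(data$col, seq(0, 100, by = 10))

print(as.character(data$range))
# "(0,10]" "(0,10]" "(10,20]" "(40,50]" "(90,100]" "(90,100]"

# By default the lowest break is excluded, so a value of exactly 0 becomes NA
print(is.na(cut(0, seq(0, 100, by = 10))))  # TRUE

# include.lowest = TRUE closes the first interval to keep such values
lowest <- cut(0, seq(0, 100, by = 10), include.lowest = TRUE)
print(as.character(lowest))  # "[0,10]"
```

Note that each interval is closed on the right by default, which is why 10 lands in (0,10] rather than (10,20].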

Manipulating Rows/Columns

  • merge() for combining data frames
  • sort() sorts a vector
  • order(data$col, na.last=T) returns the indexes that put the rows in order by col, with missing values last
  • data[order(data$col, na.last=T),] reorders the entire data frame based upon the col
  • melt() from the reshape2 package, reshapes data from wide to long format
  • rbind() adds more rows to a data frame
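
Here is a rough sketch of how several of these fit together, using only base R. The data frames and their column names are invented for illustration, and melt() is left out because it requires the reshape2 package.

```r
# Hypothetical data frames, invented for this example
people <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cy"),
                     stringsAsFactors = FALSE)
scores <- data.frame(id = c(1, 2, 3), score = c(88, NA, 75))

# merge() joins on the shared column name ("id") by default
data <- merge(people, scores)

# rbind() appends rows; the new rows need the same column names
new_row <- data.frame(id = 4, name = "Di", score = 92,
                      stringsAsFactors = FALSE)
data <- rbind(data, new_row)

# order() returns row indexes; na.last = TRUE puts missing values at the end
idx <- order(data$score, na.last = TRUE)
sorted <- data[idx, ]

print(idx)               # 3 1 4 2
print(sorted$name)       # "Cy" "Ann" "Di" "Bob"

# sort() works on a single vector and drops NA values by default
print(sort(data$score))  # 75 88 92
```

The key distinction is that sort() returns the sorted values themselves, while order() returns positions, which is what makes it useful for reordering a whole data frame.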

Obviously, these functions have other parameters to do a lot more. There are also a number of other helpful R functions, but these are enough to get you started. Check the R help (?functionname) for more details.

Strata Videos to Watch

Here are a few of the recent Strata videos I would recommend.

Data Science for Social Good Summer Fellowship

The University of Chicago and Argonne National Labs are hosting Data Science for Social Good Summer Fellowship 2013. The Fellowship program is open to students at all levels who are interested in working on real-world social problems. The program takes place in Chicago and the application deadline is April 1, 2013, so apply soon.

Online Textbook Publishing Platform?

About a week ago I posted a link to a free data mining textbook. Hacker News got wind of the book as well, and I am guessing a flood of traffic hit the textbook’s site. The flood happened to take the site completely down for a couple of days. It was a shame because the book is really good.

If you frequently read this blog, you will notice it has quite a number of links to free online textbooks. Each free online textbook is available a bit differently. Most are PDF downloads (either by chapter or the entire book) hosted at some person’s personal website or somewhere on a university’s website.

Here is my question. Does the web have a publishing platform for textbooks? Is there a startup working on something like this?

I am aware of wikibooks, but I just don’t hear much about the quality of the books. As a matter of fact, I just don’t hear much about wikibooks.

Quandl Excel Add-In

A few weeks ago, I blogged about Quandl, a search engine for datasets. Well, they have just released an Excel add-in that allows a person to pull a dataset from Quandl straight into an Excel spreadsheet. It is very new, so Quandl would appreciate your comments and reports of any bugs you may find.