Quandl – A Search Engine for Datasets

I found this site a couple of days ago. Quandl is a new startup building a search engine for datasets. The site indexes a lot of data (over 2 million datasets), and the data can be sorted, filtered, graphed, combined, and then downloaded in many different formats (Excel, JSON, R, CSV, XML). Most of the data is numerical, time series, or both.

If you have been looking for some data to explore, Quandl may be a good place to look.

12 Useful Tips for Machine Learning

Pedro Domingos of the Department of Computer Science and Engineering at the University of Washington has written a very useful paper with tips for machine learning. The paper is titled A Few Useful Things to Know about Machine Learning [pdf].

Below are the 12 useful tips.

  1. LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION
  2. IT’S GENERALIZATION THAT COUNTS
  3. DATA ALONE IS NOT ENOUGH
  4. OVERFITTING HAS MANY FACES
  5. INTUITION FAILS IN HIGH DIMENSIONS
  6. THEORETICAL GUARANTEES ARE NOT WHAT THEY SEEM
  7. FEATURE ENGINEERING IS THE KEY
  8. MORE DATA BEATS A CLEVERER ALGORITHM
  9. LEARN MANY MODELS, NOT JUST ONE
  10. SIMPLICITY DOES NOT IMPLY ACCURACY
  11. REPRESENTABLE DOES NOT IMPLY LEARNABLE
  12. CORRELATION DOES NOT IMPLY CAUSATION

For details and a good explanation of each, see the paper A Few Useful Things to Know about Machine Learning [pdf].
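
Tip 2 (it's generalization that counts) is easy to see in a few lines of Python. The sketch below is only an illustration, not something taken from the paper: the toy task, the 20% label noise, and the function names are all assumptions made up for this example. A 1-nearest-neighbor classifier memorizes its training set, so it scores perfectly on the data it has seen, while the held-out score tells a more honest story.

```python
import random


def make_data(n, rng, noise=0.2):
    """Hypothetical toy task: label is 1 when x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < noise:
            y = 1 - y  # simulate a mislabeled example
        data.append((x, y))
    return data


def nearest_neighbor_predict(train, x):
    """1-nearest-neighbor: copy the label of the closest training point."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def accuracy(train, data):
    """Fraction of `data` labeled correctly by a 1-NN model built on `train`."""
    hits = sum(nearest_neighbor_predict(train, x) == y for x, y in data)
    return hits / len(data)


rng = random.Random(0)          # fixed seed so the run is repeatable
data = make_data(200, rng)
train, test = data[:150], data[150:]

train_acc = accuracy(train, train)  # every point is its own nearest neighbor
test_acc = accuracy(train, test)    # examples the model has never seen

print(f"training accuracy: {train_acc:.2f}")
print(f"held-out accuracy: {test_acc:.2f}")
```

The training accuracy is perfect because each training point matches itself, noise and all. Only the held-out number says anything about how the model would do on new data.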

Also, later this year Pedro Domingos will be teaching a machine learning course via Coursera. Sign up if you are interested.

11 Steps to Data Analysis

Here is a list of steps to data analysis from the Data Analysis course on Coursera.

  1. Define question
  2. Define ideal dataset
  3. Define what data you can access
  4. Obtain the data
  5. Clean the data
  6. Exploratory data analysis
  7. Statistical prediction
  8. Interpret results
  9. Challenge results
  10. Write up results
  11. Create reproducible code for others to recreate

Update: A couple of commenters have suggested adding the following two steps.

  1. Missing Value Analysis
  2. Outlier management
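
To make the two suggested steps a bit more concrete, here is a minimal sketch of missing value analysis and outlier management, followed by a tiny exploratory summary. It is not from the course. The sample numbers, the function names, and the choice of the 1.5 × IQR rule for flagging outliers are all assumptions picked for illustration, and it uses only the Python standard library.

```python
import statistics

# Hypothetical raw measurements; None marks a missing value.
raw = [12.0, 11.5, None, 12.3, 11.9, 250.0, None, 12.1, 11.8, 12.4]


def missing_value_report(values):
    """Return (count of missing values, fraction of values missing)."""
    missing = sum(1 for v in values if v is None)
    return missing, missing / len(values)


def drop_missing(values):
    """Simplest possible handling: discard missing entries."""
    return [v for v in values if v is not None]


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Missing value analysis
missing, fraction = missing_value_report(raw)
clean = drop_missing(raw)

# Outlier management
outliers = iqr_outliers(clean)
kept = [v for v in clean if v not in outliers]

# A very small piece of exploratory data analysis
summary = {
    "n": len(kept),
    "mean": statistics.mean(kept),
    "median": statistics.median(kept),
}

print(f"missing: {missing} ({fraction:.0%})")
print(f"outliers: {outliers}")
print(summary)
```

Dropping rows is only one way to treat missing values and outliers. Depending on the question from step 1, imputing the gaps or keeping and investigating the extreme values may be the better call.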

What do you think? Is anything missing?

Nice GraphDB and NoSQL Talk

This is a wonderful talk by Max DeMarzi (he has a very informative blog as well). If you are new to NoSQL or Graph Databases, I highly recommend this video.

One comment stood out to me:

You’re never gonna run out of nodes when you get to half a trillion…

That is a really big number, but I wonder how many years that statement will stand. If you have any thoughts, please leave a comment.

ChiSC: Max DeMarzi – Is Your Problem a Graph Problem? from 8th Light on Vimeo.

Buffalo Bills to start advanced analytics department

Even the NFL is getting into data analysis these days.

Personal note: Like many American children, I grew up dreaming of playing professional football in the NFL. Also, like many American children, that dream did not come true. Maybe now I could try to make the NFL as a data scientist. I wonder if they have fall training camp for the analytics department. If so, sign me up.