I found this site a couple of days ago. Quandl is a new startup that provides a search engine for datasets. The site has a lot of data (over 2 million datasets), and the data can be sorted, filtered, graphed, combined, and downloaded in several formats (Excel, JSON, R, CSV, XML). Most of the data is numerical, time series, or both.
If you have been looking for some data to explore, Quandl may be a good place to look.
Pedro Domingos of the Department of Computer Science and Engineering at the University of Washington provides a very useful paper with tips for machine learning. The paper is titled A Few Useful Things to Know about Machine Learning [pdf].
Below are the 12 useful tips.
- LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION
- IT’S GENERALIZATION THAT COUNTS
- DATA ALONE IS NOT ENOUGH
- OVERFITTING HAS MANY FACES
- INTUITION FAILS IN HIGH DIMENSIONS
- THEORETICAL GUARANTEES ARE NOT WHAT THEY SEEM
- FEATURE ENGINEERING IS THE KEY
- MORE DATA BEATS A CLEVERER ALGORITHM
- LEARN MANY MODELS, NOT JUST ONE
- SIMPLICITY DOES NOT IMPLY ACCURACY
- REPRESENTABLE DOES NOT IMPLY LEARNABLE
- CORRELATION DOES NOT IMPLY CAUSATION
For details and a good explanation of each, see the paper A Few Useful Things to Know about Machine Learning [pdf].
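To make "it's generalization that counts" and "overfitting has many faces" a bit more concrete, here is a minimal sketch in Python. It is not taken from the paper. The toy dataset, the 20% label noise, and the function names (`make_data`, `predict_1nn`, `predict_threshold`) are all assumptions made up for illustration. A model that memorizes the training set (1-nearest-neighbor) scores perfectly on the data it has already seen, which says little about how it does on held-out data.

```python
import random

def make_data(n, rng, noise=0.2):
    """Toy data (an assumption for this sketch): x in [0, 1), true label is
    x > 0.5, and each label is flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise:
            label = not label
        data.append((x, label))
    return data

def predict_1nn(train, x):
    """Memorizing model: copy the label of the closest training point."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def predict_threshold(x):
    """Simple model: a fixed threshold that happens to match the true pattern."""
    return x > 0.5

def accuracy(predict, data):
    """Fraction of points whose predicted label matches the recorded label."""
    return sum(predict(x) == y for x, y in data) / len(data)

rng = random.Random(0)          # fixed seed so the run is repeatable
train = make_data(200, rng)
test = make_data(200, rng)      # held-out data the models never saw

train_acc_1nn = accuracy(lambda x: predict_1nn(train, x), train)
test_acc_1nn = accuracy(lambda x: predict_1nn(train, x), test)
test_acc_threshold = accuracy(predict_threshold, test)

print(f"1-NN      train accuracy: {train_acc_1nn:.2f}")
print(f"1-NN      test accuracy:  {test_acc_1nn:.2f}")
print(f"threshold test accuracy:  {test_acc_threshold:.2f}")
```

The training accuracy of the memorizing model is perfect by construction, since every training point is its own nearest neighbor. The number worth looking at is the accuracy on the held-out set.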
Also, later this year, Pedro Domingos will be teaching a machine learning course via Coursera. Sign up if you are interested.
Yhat, a new predictive modeling startup, wrote up a nice blog post about 10 R Packages I wish I knew about earlier. It is worth reading through the list.
Special thanks to Mark Nickel for pointing out this link.
Here is a list of Steps to Data Analysis from the Data Analysis Coursera course.
- Define Question
- Define Ideal Dataset
- Define what data you can access
- Obtain the data
- Clean the data
- Exploratory data analysis
- Statistical prediction
- Interpret results
- Challenge results
- Write up results
- Create reproducible code so others can recreate the analysis
Update: A couple of commenters have suggested adding the following two steps.
- Missing Value Analysis
- Outlier management
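As a rough illustration of what those two suggested steps might look like in practice, here is a small Python sketch using only the standard library. Nothing here comes from the Coursera course. The sample rows, the column names, and the helper names (`missing_value_report`, `iqr_outliers`) are hypothetical, and the 1.5 × IQR rule is just one common convention for flagging outliers, not the only option.

```python
import statistics

def missing_value_report(rows, columns):
    """Count how many rows have None for each column."""
    return {c: sum(1 for r in rows if r.get(c) is None) for c in columns}

def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up sample data for illustration only.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 48000},
    {"age": 29, "income": None},
    {"age": 41, "income": 51000},
    {"age": 38, "income": 950000},
    {"age": 36, "income": 50000},
    {"age": 33, "income": 49500},
    {"age": 45, "income": 53000},
    {"age": 27, "income": 47000},
]

# Missing value analysis: how many gaps does each column have?
report = missing_value_report(rows, ["age", "income"])

# Outlier management: flag suspicious incomes among the non-missing values.
incomes = [r["income"] for r in rows if r["income"] is not None]
outliers = iqr_outliers(incomes)

print(report)    # {'age': 1, 'income': 1}
print(outliers)  # [950000]
```

Whether a flagged value gets dropped, capped, or kept is a judgment call that depends on the question being asked, which is probably why these steps fit between cleaning the data and exploratory analysis.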
What do you think? Is anything missing?
This is a wonderful talk by Max DeMarzi (he has a very informative blog as well). If you are new to NoSQL or Graph Databases, I highly recommend this video.
One comment stuck out for me:
You’re never gonna run out of nodes when you get to half a trillion…
That is a really big number, but I wonder how many years that statement will stand. If you have any thoughts, please leave a comment.
ChiSC: Max DeMarzi – Is Your Problem a Graph Problem? from 8th Light on Vimeo.