Microsoft Research Open Data is a search engine for free datasets from Microsoft Research. The datasets are primarily aimed at Natural Language Processing (NLP) and computer vision. Take a look if you need a dataset for your next project.
- A timeline of Deep Learning papers (with download links) written since 2011
- A large collection of Deep Learning papers broken out by specific topic. It also includes ratings.
- A list of papers to complement the Deep Learning Book
These last links are not official academic papers, but they are good resources on deep learning.
Twitter has just announced the idea of a Data Grant. You have to log in with your Twitter account to see the details. The gist is: Twitter will provide you with historical Twitter data for research purposes.
What could you do with this data?
It is back-to-school time, and here are some papers to keep you busy this school year. All the papers are free. This list is far from exhaustive, but these are some important papers in data science and big data.
- PageRank – This is the paper that explains the algorithm behind Google search.
- MapReduce – This paper explains a programming model for processing large datasets. In particular, it is the programming model used in Hadoop.
- Google File System – Hadoop includes HDFS, an open-source version of the distributed file system described in this paper.
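To give a feel for the first paper, here is a minimal power-iteration sketch of PageRank over a hypothetical three-page link graph (the real algorithm, of course, runs over the web-scale link graph):

```python
# Minimal PageRank via power iteration over a toy link graph.
# links maps each page to the pages it links out to.
def pagerank(links, damping=0.85, iterations=50):
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, outbound in links.items():
            if outbound:
                # A page splits its rank evenly across its outbound links.
                share = damping * rank[node] / len(outbound)
                for target in outbound:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in nodes:
                    new_rank[target] += damping * rank[node] / n
        rank = new_rank
    return rank

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

Here "C" ends up with the highest rank because both "A" and "B" link to it, and the ranks sum to 1 (they form a probability distribution over pages).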
These are two of the papers that drove/started the NoSQL debate. Each paper describes a different type of storage system intended to be massively scalable.
- 10 algorithms in data mining | pdf download – This paper covers 10 important data mining algorithms.
- A Few Useful Things to Know about Machine Learning – This paper is filled with tips, tricks, and insights to make machine learning more successful.
- Random Forests – One of the most popular machine learning techniques. It is heavily used in Kaggle competitions, even by the winners.
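The core idea behind the Random Forests paper can be sketched without any library: fit many simple trees to bootstrap samples of the data and let them vote. This toy version uses one-level "stumps" as the trees and synthetic data; a real random forest also picks a random feature subset at each split, which is omitted here for brevity.

```python
import random

def fit_stump(rows):
    """Pick the (feature, threshold, polarity) with the fewest errors.

    rows is a list of ([features], label) pairs with 0/1 labels.
    """
    best = None
    for f in range(len(rows[0][0])):
        for x, _ in rows:
            t = x[f]
            for sign in (1, -1):
                errors = sum(
                    1 for xs, y in rows
                    if (1 if sign * (xs[f] - t) > 0 else 0) != y
                )
                if best is None or errors < best[0]:
                    best = (errors, f, t, sign)
    _, f, t, sign = best
    return lambda x: 1 if sign * (x[f] - t) > 0 else 0

def random_forest(rows, n_trees=15):
    # Bagging: each tree sees a bootstrap sample (drawn with replacement).
    samples = [[random.choice(rows) for _ in rows] for _ in range(n_trees)]
    trees = [fit_stump(sample) for sample in samples]
    # Majority vote of the ensemble.
    return lambda x: 1 if sum(t(x) for t in trees) * 2 > n_trees else 0

# Tiny synthetic dataset: the label depends only on the second feature.
data = [([i, j], 1 if j > 5 else 0) for i in range(10) for j in range(10)]
model = random_forest(data)
```

In practice you would reach for an off-the-shelf implementation (e.g. scikit-learn's `RandomForestClassifier`) rather than rolling your own, but the sketch shows why averaging many noisy trees is so robust.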
Are there any other papers you feel should be on the list?
Pedro Domingos of the Department of Computer Science and Engineering at the University of Washington provides a very useful paper with tips for machine learning. The paper is titled A Few Useful Things to Know about Machine Learning [pdf].
Below are the 12 useful tips.
- LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION
- IT’S GENERALIZATION THAT COUNTS
- DATA ALONE IS NOT ENOUGH
- OVERFITTING HAS MANY FACES
- INTUITION FAILS IN HIGH DIMENSIONS
- THEORETICAL GUARANTEES ARE NOT WHAT THEY SEEM
- FEATURE ENGINEERING IS THE KEY
- MORE DATA BEATS A CLEVERER ALGORITHM
- LEARN MANY MODELS, NOT JUST ONE
- SIMPLICITY DOES NOT IMPLY ACCURACY
- REPRESENTABLE DOES NOT IMPLY LEARNABLE
- CORRELATION DOES NOT IMPLY CAUSATION
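Two of these tips, "it's generalization that counts" and "overfitting has many faces," can be made concrete with a small numeric sketch (all data here is synthetic): a model that memorizes the training set via 1-nearest-neighbor on a pure-noise feature scores perfectly on training data but no better than chance on new data, while a simple rule on the real signal generalizes.

```python
import random

random.seed(42)

def make_data(n):
    # True signal: label = 1 when x > 0; the second feature is pure noise.
    rows = []
    for _ in range(n):
        x, noise = random.uniform(-1, 1), random.uniform(-1, 1)
        rows.append(((x, noise), 1 if x > 0 else 0))
    return rows

train, test = make_data(100), make_data(100)

def memorizer(rows):
    # 1-nearest-neighbor on the noise feature alone: memorizes training
    # data perfectly but has learned nothing about the real signal.
    def predict(point):
        return min(rows, key=lambda r: abs(r[0][1] - point[1]))[1]
    return predict

def threshold_rule(point):
    # The simple rule that matches the true signal.
    return 1 if point[0] > 0 else 0

def error(model, rows):
    return sum(1 for p, y in rows if model(p) != y) / len(rows)

overfit = memorizer(train)
# The memorizer: zero training error, roughly coin-flip test error.
# The threshold rule: low error on both training and test data.
```

Low training error is easy; it is the gap between training and test error that reveals overfitting.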
For details and a good explanation of each, see the paper A Few Useful Things to Know about Machine Learning [pdf].
Also, later this year, Pedro Domingos will be teaching a machine learning course via Coursera. Sign up if you are interested.