The last links are not official academic papers, but they are quite good resources on deep learning.
In the past, the blog has included 7 Important Data Science Papers and 5 More Data Science Papers. Here is another list if you are looking for something to read over the summer.
Twitter has just announced the idea of a Data Grant. You have to log in with your Twitter account to see the details. The gist is: Twitter will provide you with historical Twitter data for research purposes.
What could you do with this data?
It is back-to-school time, and here are some papers to keep you busy this school year. All the papers are free. This list is far from exhaustive, but these are some important papers in data science and big data.
- PageRank – This is the paper that explains the algorithm behind Google search.
- MapReduce – This paper explains a programming model for processing large datasets. In particular, it is the programming model used in Hadoop.
- Google File System – Part of Hadoop is HDFS. HDFS is an open-source implementation of the distributed file system design explained in this paper.
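To make the MapReduce programming model a little more concrete, here is a minimal single-process sketch of the classic word-count example. It only illustrates the map, shuffle, and reduce steps; it is not how Hadoop is implemented, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this illustration. A real system would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence (one process, no cluster)."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "data science uses data"]
    print(word_count(docs))
```

The point of the model is that the map and reduce functions never share state, so a framework can spread them over as many machines as the data requires.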
These are two of the papers that started the NoSQL debate. Each paper describes a different type of storage system intended to be massively scalable.
- Random Forests – One of the most popular machine learning techniques. It is heavily used in Kaggle competitions, even by the winners.
Are there any other papers you feel should be on the list?
Pedro Domingos of the Department of Computer Science and Engineering at the University of Washington provides a very useful paper with tips for machine learning. The paper is titled A Few Useful Things to Know about Machine Learning [pdf].
Below are the 12 useful tips.
- LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION
- IT’S GENERALIZATION THAT COUNTS
- DATA ALONE IS NOT ENOUGH
- OVERFITTING HAS MANY FACES
- INTUITION FAILS IN HIGH DIMENSIONS
- THEORETICAL GUARANTEES ARE NOT WHAT THEY SEEM
- FEATURE ENGINEERING IS THE KEY
- MORE DATA BEATS A CLEVERER ALGORITHM
- LEARN MANY MODELS, NOT JUST ONE
- SIMPLICITY DOES NOT IMPLY ACCURACY
- REPRESENTABLE DOES NOT IMPLY LEARNABLE
- CORRELATION DOES NOT IMPLY CAUSATION
For details and a good explanation of each, see the paper A Few Useful Things to Know about Machine Learning [pdf].
Also, later this year, Pedro Domingos will be teaching a machine learning course via Coursera. Sign up if you are interested.
Springer has just released a new data science journal named EPJ Data Science. The journal is open access, which means that articles are freely available online. The catch is that people who submit articles must pay a fee for publication. Sometimes the fee will be covered by the author’s university or company. Anyhow, if you are interested in data science research, this journal is probably worth following.
Are you interested in academic journals?
Does this excite you?
Yesterday’s announcement included a few main highlights:
- NIH will make 200TB of human genetic variation data freely available on Amazon Web Services
- NSF will provide $2M in support of undergraduate education for studying graphical and visualization techniques for big data
- DoD will announce some prize competitions in the coming months
- Numerous other projects to increase data analysis across the US Government
Here is one note I did find interesting (or funny). One of the speakers mentioned the need for workers with big data skills. He said the workforce needs 159,000 workers with data skills. Then a couple of minutes later he mentioned a need for data-savvy managers: the workforce will need 1,500,000 managers with data knowledge. I just thought those two numbers did not match up well.
Here are a few more links to other articles summarizing the Research Initiative:
Later today (2-3:45 pm ET), the White House will announce a $200M big data research initiative. Appropriately, it is being named the “Big Data Research and Development Initiative.”
The announcement will be broadcast live on Science360.
See this PDF for a listing of big data projects within the US Government.
I am excited to see how this will affect the education and training of data scientists.
What are your thoughts? Is this a good idea?