Tag Archives: research

Deep Learning Research Paper Lists for Summer 2017

The last links are not official academic papers, but they are quite good resources on deep learning.

5 Data Science Research Papers to read in Summer 2017

In the past, the blog has included 7 Important Data Science Papers and 5 More Data Science Papers. Here is another list if you are looking for something to read over the summer.

Twitter Open Data Grants

Twitter has just announced the idea of a Data Grant. You have to log in with your Twitter account to see the details. The gist is: Twitter will provide you with historical Twitter data for research purposes.

What could you do with this data?

7 Important Data Science Papers

It is back-to-school time, and here are some papers to keep you busy this school year. All the papers are free. This list is far from exhaustive, but these are some important papers in data science and big data.

Google Search

  • PageRank – This is the paper that explains the algorithm behind Google search.
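For readers who want a feel for the idea before opening the paper, here is a minimal sketch of PageRank computed by repeated iteration on a toy link graph. The function name `pagerank`, the `toy_web` graph, the damping value of 0.85, and the handling of pages without outgoing links are illustrative assumptions, not details taken from this post or a description of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Toy PageRank. `links` maps each page to the list of pages it links to."""
    # Collect every page that appears as a source or as a link target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution

    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: most pages link to C, so C should rank highest.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

In this toy graph, page D has no incoming links, so it ends up with only the baseline share, while C collects rank from three other pages.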

Hadoop

  • MapReduce – This paper explains a programming model for processing large datasets. In particular, it is the programming model used in Hadoop.
  • Google File System – Part of Hadoop is HDFS. HDFS is an open-source version of the distributed file system explained in this paper.
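The MapReduce programming model can be illustrated with the classic word-count example. The sketch below runs in a single Python process and only mimics the map, shuffle, and reduce steps; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration and are not Hadoop's API.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # In a real cluster the map and reduce calls would run on many machines;
    # here everything runs sequentially in one process.
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data science"])
print(counts)
```

The appeal of the model is that the map and reduce functions are independent per document and per key, which is what lets a framework spread them across many machines.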

NoSQL

These are 2 of the papers that started and drove the NoSQL debate. Each paper describes a different type of storage system intended to be massively scalable.

Machine Learning

Bonus Paper

  • Random Forests – One of the most popular machine learning techniques. It is heavily used in Kaggle competitions, even by the winners.

Are there any other papers you feel should be on the list?

12 Useful Tips for Machine Learning

Pedro Domingos of the Department of Computer Science and Engineering at the University of Washington provides a very useful paper with tips for machine learning. The paper is titled A Few Useful Things to Know about Machine Learning [pdf].

Below are the 12 useful tips.

  1. LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION
  2. IT’S GENERALIZATION THAT COUNTS
  3. DATA ALONE IS NOT ENOUGH
  4. OVERFITTING HAS MANY FACES
  5. INTUITION FAILS IN HIGH DIMENSIONS
  6. THEORETICAL GUARANTEES ARE NOT WHAT THEY SEEM
  7. FEATURE ENGINEERING IS THE KEY
  8. MORE DATA BEATS A CLEVERER ALGORITHM
  9. LEARN MANY MODELS, NOT JUST ONE
  10. SIMPLICITY DOES NOT IMPLY ACCURACY
  11. REPRESENTABLE DOES NOT IMPLY LEARNABLE
  12. CORRELATION DOES NOT IMPLY CAUSATION

For details and a good explanation of each, see the paper A Few Useful Things to Know about Machine Learning [pdf].

Also, later this year, Pedro Domingos will be teaching a machine learning course via Coursera. Sign up if you are interested.

New Data Science Journal

Springer has just released a new data science journal named EPJ Data Science. The journal is open access, which means that articles are freely available online. The catch is that people who submit articles must pay a fee for publication. Sometimes the fee will be covered by the author’s university or company. Anyhow, if you are interested in data science research, this journal is probably worth following.

Are you interested in academic journals?
Does this excite you?

Highlights of the White House BigData Research Initiative

Yesterday’s announcement included a few main highlights:

  • NIH will make 200TB of human genetic variation data freely available on Amazon Web Services
  • NSF will provide $2M in support of undergraduate education for studying graphical and visualization techniques of bigdata
  • DoD will announce some prize competitions in the coming months
  • Numerous other projects to increase data analysis across the US Government

Here is one note I did find interesting (or funny). One of the speakers mentioned the need for workers with bigdata skills. He said the workforce needs 159,000 workers with data skills. Then a couple of minutes later he mentioned a need for data-savvy managers. He said the workforce will need 1,500,000 managers with data knowledge. I just thought those 2 numbers did not match up well.

Here are a few more links to other articles summarizing the Research Initiative:

White House is Announcing $200M BigData Research Initiative

Later today (2:00-3:45 pm ET), the White House will announce a $200M BigData Research Initiative. Appropriately, it is being named the “Big Data Research and Development Initiative.”

The announcement will be broadcast live on Science360.

See this PDF for a listing of bigdata projects within the US Government.

I am excited to see how this will affect the education and training of data scientists.
What are your thoughts? Is this a good idea?