Watch the Strata Livestream and Videos

The Livestream is available at the Strata website. Keynotes and interviews will be happening most of the day Thursday.

If you would like to watch some of the videos and keynotes from today, you can view many of them on the Strata YouTube Channel.

Here is a video to get you started. It is an introduction to Gooru, a search engine for learning.

LaTeX Documents Online

Although not specific to data science, if you write a lot of documents with mathematical notation, you are probably familiar with LaTeX. LaTeX is a typesetting system common for mathematics.

There are now some nice resources to help you write and produce LaTeX documents collaboratively on the web.

Another site, LaTeX Templates, is just what you would guess. It contains a bunch of sample templates to make creating documents a bit quicker and easier.

Data Scientist: Consider the Curriculum

A while back, James Kobielus wrote the article Data Scientist: Consider the Curriculum. It contains one of the best descriptions of a data science curriculum I have seen. The article also includes a list of algorithms and modeling techniques that a data scientist should know. Below is the list from the article.

  • linear algebra
  • basic statistics
  • linear and logistic regression
  • data mining
  • predictive modeling
  • cluster analysis
  • association rules
  • market basket analysis
  • decision trees
  • time-series analysis
  • forecasting
  • machine learning
  • Bayesian and Monte Carlo Statistics
  • matrix operations
  • sampling
  • text analytics
  • summarization
  • classification
  • primary components analysis
  • experimental design
  • unsupervised learning
  • constrained optimization

The list almost looks overwhelming.
Do you think anything is missing from the list?

NYU Launches New Center For Data Science

New York University has just launched some Data Science programs via the new Center for Data Science.

… to establish the country’s leading data science training and research facilities at NYU.

Part of the announcement is an M.S. in Data Science. Applications for the initial class, starting Fall 2013, are now being accepted. The Center for Data Science also plans to offer Ph.D. degrees via the Mathematics, Statistics, and Computer Science departments. I am not sure if an official Ph.D. degree in Data Science is being planned.

This is great news!

Quandl – A Search Engine for Datasets

I just found this site a couple of days ago. Quandl is a new startup that provides a search engine for datasets. The site has a lot of data (over 2 million datasets). The data can be sorted, filtered, graphed, combined, and downloaded in many different formats (Excel, JSON, R, CSV, XML). Most of the data is numerical and/or time series.

If you have been looking for some data to explore, Quandl may be a good place to look.

12 Useful Tips for Machine Learning

Pedro Domingos of the Department of Computer Science and Engineering at the University of Washington provides a very useful paper with tips for machine learning. The paper is titled A Few Useful Things to Know about Machine Learning [pdf].

Below are the 12 useful tips.

  1. LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION
  2. IT’S GENERALIZATION THAT COUNTS
  3. DATA ALONE IS NOT ENOUGH
  4. OVERFITTING HAS MANY FACES
  5. INTUITION FAILS IN HIGH DIMENSIONS
  6. THEORETICAL GUARANTEES ARE NOT WHAT THEY SEEM
  7. FEATURE ENGINEERING IS THE KEY
  8. MORE DATA BEATS A CLEVERER ALGORITHM
  9. LEARN MANY MODELS, NOT JUST ONE
  10. SIMPLICITY DOES NOT IMPLY ACCURACY
  11. REPRESENTABLE DOES NOT IMPLY LEARNABLE
  12. CORRELATION DOES NOT IMPLY CAUSATION
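
As a rough illustration of the first tip, here is a toy sketch in Python that separates the three pieces: a representation (a straight line), an evaluation function (mean squared error), and an optimizer (plain gradient descent). The function names, the data, and the learning rate are all made up for this example; they are not taken from the paper.

```python
# Toy illustration of "learning = representation + evaluation + optimization".
# Every name and number below is a hypothetical choice for this sketch.

def predict(w, b, x):
    """Representation: the model is a straight line y = w*x + b."""
    return w * x + b

def mean_squared_error(w, b, data):
    """Evaluation: average squared gap between predictions and targets."""
    return sum((predict(w, b, x) - y) ** 2 for x, y in data) / len(data)

def fit(data, learning_rate=0.05, steps=2000):
    """Optimization: gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (predict(w, b, x) - y) * x for x, y in data) / n
        grad_b = sum(2 * (predict(w, b, x) - y) for x, y in data) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Four points that lie exactly on the line y = 2x + 1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = fit(data)
print(w, b, mean_squared_error(w, b, data))
```

Swapping any one piece (say, a different model in predict, or a different error measure) leaves the other two in place, which is roughly the point of the decomposition.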

For details and a good explanation of each, see the paper A Few Useful Things to Know about Machine Learning [pdf].

Also, later this year, Pedro Domingos will be teaching a machine learning course via Coursera. Sign up if you are interested.