Big Data: Book Review

Not too long ago, I was in an airport looking for a book to help pass the time. I was thrilled to find a book with Big Data in the title, so I took it and started reading.


Big Data: A Revolution That Will Transform How We Live, Work, and Think

The book is titled Big Data: A Revolution That Will Transform How We Live, Work, and Think. The authors are Viktor Mayer-Schönberger, a professor at Oxford, and Kenneth Cukier, data editor at The Economist.

The book is well written and entertaining. I would like to point out two chapters that really stood out to me. Chapter 4, on correlation, provides an excellent description of why correlation is not the same as causation. It then goes on to argue that, with all the data now available, correlation might be all that is needed. Here is a quote from Chapter 4.

The correlations show what, not why, but as we have seen, knowing what is often good enough.

The final chapter, Chapter 10, is about what is next with big data. It provides a look into the future of where big data will make a difference: global problems, medicine, climate change, physics, sensors, and nearly all other parts of our lives. It also mentions that big data is only going to get bigger.

Also, Chapter 5 introduced me to a new word, datafication. I am still not exactly sure what the definition is. Chapter 9 has a great discussion about privacy: people are losing control over whether information about them is collected, because it IS being collected. They can only hold others accountable for how the information is used.

Overall

The book will not help you master machine learning algorithms (it is not intended for that). It is not a technical book. However, if you are interested in what types of questions can be answered with all your data, this book is great. I believe the book is targeted at business people who are hoping to get a grasp of all the big data talk.

Win-Vector Blog » Data Science, Machine Learning, and Statistics: what is in a name?

This is an excellent write-up on the differences between:

  • Statistics
  • Machine Learning
  • Data Mining
  • Informatics
  • Big Data
  • Predictive Analytics
  • Data Science

Work-Force Science

Until yesterday, I was completely new to the term work-force science. Essentially, work-force science is data analysis applied to Human Resources. It makes more sense than the old gut-feeling approach. If you want to know more, see this excellent article from the New York Times, Big Data, Trying to Build Better Workers. The following quote sums up one of the key findings.

An applicant’s work history is not a good predictor of future results.

New York Times Data Science Articles

In just the last couple of days, the New York Times has published two great articles about data science. Have a look for yourself.

Data Science: The Numbers of our lives – Universities offer courses in data science. This is a nice read about universities starting to offer data science options.

Geek Appeal: New York vs. Seattle – New York and Seattle Compete for Data Science Crown. If you want a PhD, Seattle might currently be winning, but I think NYU and Columbia are trying to change that.

Sites for Data Science Jobs

New data scientist jobs are being posted every day. Because data scientists go by so many different job titles, it can be difficult to find the right job postings. Here is a list of sites that post only data-science-related jobs.

Do you know of any other data science job sites?

Best Free Data Mining Tools

I recently saw the article, The Best Data Mining Tools You Can Use for Free in Your Company. It contains a very brief description of each of the following tools.

  1. RapidMiner
  2. RapidAnalytics
  3. Weka
  4. PSPP
  5. KNIME
  6. Orange
  7. Apache Mahout
  8. jHepWork
  9. Rattle

See The Best Data Mining Tools You Can Use for Free in Your Company for more details, links, and pictures.

Another Columbia Data Science Course

This spring, Columbia University is offering another course on data science. This one is targeted at introductory graduate students in math, and it is not intended to be an advanced machine learning course. The goal is to expose people with strong mathematical skills to some of the ideas from software development and machine learning without sacrificing the statistical theory. The course is titled Columbia Applied Data Science. The link for the lecture notes (PDF) provides a great tutorial on beginning topics in data science. The lecture notes are still under development.