What Dataset Would You Love To Have?

Human Trafficking

I think a dataset related to human trafficking would be interesting. It would need to contain when and where each kidnapping happened and the age of the person kidnapped. It could also contain the eventual location of the victim. I don’t know that any organisation has this data. Many times the kidnappings go unnoticed or the people involved are not allowed to speak about them. I think this data could be used to predict when kidnappings for human trafficking would occur, and thus help prevent the crime.

My Life

Also, I would love a dataset all about my life. I would love to know what factors constitute a better day for me. I would like the dataset to contain the foods I eat, the things I accomplish, sleep (including how often I wake up), exercise, devotion time, a rating of how good I thought the day was, and possibly anything else. I know books and experts say that good food and exercise make people feel better. I would really like to know which factors are most important for me. The problem is: I don’t want to take the time and effort to track all this data. I bet there is an app for it.

Chinese Gender Predictor

This one would just be for fun. Currently, I would enjoy a large dataset with information about child births. The dataset would need to contain the conception date (or due date), mother’s birth date, and child’s gender. I know that hospitals have this type of data, but HIPAA prevents the sharing of medical records. Here is why I would like it. There are numerous Chinese Gender Predictors around. They claim to be able to accurately predict the gender of a baby. Given enough data, this would be a fairly simple thing to validate or invalidate. Just run the Chinese Gender Predictor on each birth and see how often it is correct. If it is correct significantly more than 50% of the time, then the early Chinese may have known something we do not. Otherwise, the Chinese Gender Predictor is not a useful tool. This data would have little impact on bettering the world, but it sounds like a fun little project.
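As a rough sketch of what "significantly more than 50%" could mean, here is a one-sided binomial test in Python. It assumes each prediction is an independent guess that would be right half the time by pure chance. The function name and the counts in the usage lines are hypothetical and only there for illustration; they are not real results.

```python
from math import comb


def p_value_at_least(correct: int, total: int) -> float:
    """Chance of getting `correct` or more right out of `total`
    if every prediction were just a fair coin flip (one-sided binomial test)."""
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    # Count the outcomes with at least `correct` hits, out of 2**total equally likely ones.
    tail = sum(comb(total, k) for k in range(correct, total + 1))
    return tail / 2 ** total


# Hypothetical counts, for illustration only (not real data).
p = p_value_at_least(60, 100)
print(f"p-value: {p:.4f}")
print("better than chance" if p < 0.05 else "no evidence it beats a coin flip")
```

A small p-value (below 0.05 is a common, if arbitrary, cutoff) would suggest the predictor does better than guessing. A large one would mean the data gives no reason to think so.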

Whether it exists or not, what dataset would you love to access?

The Data Scientific Method

DJ Patil and Josh Elman, both of Greylock Partners, gave an insightful talk at LeWeb London 2012. The most important part was the introduction of the Data Scientific Method.

Data Scientific Method

  1. Start with a Question
  2. Leverage your current data
  3. Create features and run tests
  4. Analyze the results and draw insights
  5. Let the data frame a conversation

OpenIntro (Free online Stats Book) is getting a new version. The updates sound good. If you are looking to learn statistics, this is an excellent and cost-effective solution.


We are excited to announce that the Second Edition of OpenIntro Statistics will be released in August! The First Edition will remain available for one more academic year (2012-2013), or longer if there is continued interest. The Second Edition is a further evolution of OpenIntro Statistics and includes the following important changes:

  • New data. Many of the data sets, some just one or two years old, have been swapped out for newer data and studies. We’ve worked hard to ensure that OpenIntro Statistics remains fresh and current.
  • Updated Chapter 1. Data collection is now featured ahead of the summaries and graphics sections. We include a new research study with surprising results to lead off the textbook and engage students. Two new data sets featuring email and census data take the place of the possum and cars data sets that are present in the First Edition. An important new subsection has…


Data Science Education Opportunity

This is a follow-up to two previous posts about How to Learn Data Science the traditional and non-traditional way.

Data Science Education Opportunity

I think there is an opportunity for an online data science training program. Many people wanting to learn data science already have a degree and some of the necessary skills. The online curriculum would have to be flexible enough to allow a person to fill in the gaps of missing knowledge. I would also like the topics to be broken down further than a typical university semester course, into one- or two-week segments. For example, don’t offer one big course in machine learning. Separate it into numerous training segments like logistic regression, support vector machines, and random forests. Proper prerequisites would need to be stated, but this method would allow learners to quickly grasp small chunks of knowledge.

One problem here is the lack of credentials. The online material would need to present a student with some type of certificate/award for completing material. The certificate/award would have to mean something, and not just be a slip of paper. The other problem is the vast amount of time required for creating the training material. Anyhow, I think there is an opportunity for someone or some company to create this curriculum.

What would work for you?

How would you like to learn data science?