Yahoo just released a 1.5 TB dataset of “anonymized user interactions on the news feeds”. If you have been looking for a new dataset to analyze, this just might be it. It contains approximately 110 billion rows of data regarding user-news interactions. Happy data exploring!
Stanford University has just released a collection of large network datasets. By network, I mean the mathematical structure (think of a collection of nodes and edges). Here are just a few of the available categories.
- Citation Networks
- Road Networks
- Web graphs
- Social Networks such as Twitter
- and many more
If you are looking to study network data, or just want some practice analyzing big data, this just might be a good place to start.
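To make the "nodes and edges" idea concrete, here is a minimal sketch in Python. The edge list is a made-up toy example rather than data from the Stanford collection, and the helper name `build_adjacency` is just an illustration. It assumes an undirected network, which would not be right for every category above (citation networks, for instance, are directed).

```python
from collections import defaultdict


def build_adjacency(edges):
    """Turn a list of (node, node) pairs into an undirected adjacency map."""
    adjacency = defaultdict(set)
    for source, target in edges:
        # Undirected assumption: record the edge in both directions.
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


# Toy edge list (hypothetical data, for illustration only).
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
graph = build_adjacency(edges)

# Degree = number of neighbors; a common first statistic to look at.
degrees = {node: len(neighbors) for node, neighbors in graph.items()}
print(degrees)
```

With a real dataset, you would read the pairs from a file instead of typing them in, and the same structure would let you start with simple questions such as which nodes have the most connections.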
I just found this site a couple of days ago. Quandl is a new startup that offers a search engine for datasets. The site really has a lot of data (over 2 million datasets). Plus, the data can be sorted, filtered, graphed, combined, and finally downloaded in many different formats (Excel, JSON, R, CSV, XML). Most of the data is numerical and/or time series.
If you have been looking for some data to explore, Quandl may be a good place to look.
I think a dataset related to human trafficking would be interesting. It would need to contain when and where each kidnapping happened and the age of the person kidnapped. It could also contain the eventual location of the victim. I don’t know that any organisation has this data. Kidnappings often go unnoticed, or the persons involved are not allowed to speak about them. I think this data could be used to predict when kidnappings for human trafficking would occur, and thus help prevent the crime.
Also, I would love a dataset all about my life. I would love to know what factors constitute a better day for me. I would like the dataset to contain the foods I eat, the things I accomplish, sleep (including how often I wake up), exercise, devotion time, a rating of how good I thought the day was, and possibly anything else. I know books and experts say that good food and exercise make people feel better. I would really like to know which factors are most important for me. The problem is: I don’t want to take the time and effort to track all this data. I bet there is an app for it.
Chinese Gender Predictor
This one would just be for fun. Currently, I would enjoy a large dataset with information about child births. The dataset would need to contain the conception date (or due date), mother’s birth date, and child’s gender. I know that hospitals have this type of data, but HIPAA prevents the sharing of medical records. Here is why I would like it. There are numerous Chinese Gender Predictors around. They claim to be able to accurately predict the gender of a baby. Given enough data, this would be a fairly simple thing to validate or invalidate. Just apply the Chinese Gender Predictor to each birth and see how often it is correct. If it is correct significantly more than 50% of the time, then the early Chinese may have known something we do not. Otherwise, the Chinese Gender Predictor is not a useful tool. This data would have little impact for bettering the world, but just sounds like a fun little project.
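The "significantly more than 50%" check could be sketched as an exact one-sided binomial test. This is only one reasonable way to do it, and it assumes each birth is an independent trial and that a pure coin flip is right exactly half the time. The function name `p_value_vs_coin_flip` and the counts used below are made up for illustration, since I do not have real data.

```python
from math import comb


def p_value_vs_coin_flip(correct, total):
    """Chance of getting at least `correct` right out of `total` by guessing 50/50."""
    # Count the outcomes with `correct` or more hits, then divide by all 2**total outcomes.
    tail = sum(comb(total, k) for k in range(correct, total + 1))
    return tail / 2**total


# Hypothetical counts: the predictor was right for 60 of 100 births.
p_value = p_value_vs_coin_flip(60, 100)
print(f"p-value: {p_value:.4f}")
```

A small p-value (a common but arbitrary cutoff is 0.05) would suggest the predictor does better than guessing. A large one means the result is consistent with coin flipping. With only a few hundred births, a tiny real edge would be hard to detect, which is why a large dataset matters here.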
Whether it exists or not, what dataset would you love to access?