Intel Labs built GraphBuilder, a tool for constructing mathematical graphs out of large datasets. It is Java-based and works with Hadoop and MapReduce. Intel has released a whitepaper explaining more about GraphBuilder. The code is available on GitHub. A big thanks to Mark Nickel for pointing out this project.
ArangoDB is a flexible NoSQL database. It is a document database that also lets you add edges between documents, so it can act as a graph database. I had a fun time playing around with the online tutorial and demo. ArangoDB also claims to work as a key/value store. The code is available on GitHub.
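The idea that a document store turns into a graph once you add edges can be sketched in a few lines of Python. This is only an illustration using plain dictionaries and lists. It does not use ArangoDB's API, and the names `documents`, `edges`, `neighbors`, and `reachable` are made up for the example.

```python
# Illustrative only: plain Python data structures, not ArangoDB's API.
# Documents are keyed records; edges are small documents that point
# from one document key to another.
documents = {
    "alice": {"name": "Alice", "role": "data scientist"},
    "bob": {"name": "Bob", "role": "statistician"},
    "carol": {"name": "Carol", "role": "engineer"},
}

# Each edge is itself a document with a source, a target, and a label.
edges = [
    {"from": "alice", "to": "bob", "label": "works_with"},
    {"from": "bob", "to": "carol", "label": "works_with"},
]


def neighbors(key, edge_list):
    """Return the keys reachable from `key` by following one outgoing edge."""
    return [e["to"] for e in edge_list if e["from"] == key]


def reachable(start, edge_list):
    """Return every key reachable from `start`, in breadth-first order."""
    seen = [start]
    queue = [start]
    while queue:
        current = queue.pop(0)
        for nxt in neighbors(current, edge_list):
            if nxt not in seen:
                seen.append(nxt)
                queue.append(nxt)
    return seen[1:]  # drop the starting key itself


if __name__ == "__main__":
    for key in reachable("alice", edges):
        print(documents[key]["name"])
```

Without the `edges` list this is just a key/value lookup of documents. With it, the same records can be traversed like a graph.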
This is a very quick and informative video about data science. What is data science? What makes a good data scientist?
DJ Patil does an excellent job answering both those questions.
Here are his answers for what makes a good data scientist:
- Story Telling
I think the information is interesting, but I also think the charts do a good job of telling the story.
When telling friends and family that I blog about data science, I am frequently asked to explain more. I usually respond with an answer similar to this:
You know the world is generating huge amounts of data every day due to financial transactions, medical records, social networks, and other internet uses. Data Science aims to make better decisions based upon that data. Here are some possibilities. What type of people buy TVs in October? Which patients will get better with this new drug? Who are some other people that you probably already know?
Data Science is all about answering these types of questions with real data instead of assumptions.
I think this explanation could use some refinement. What am I leaving out? What should I remove? How do you explain data science to other people (preferably non-technical or non-data people)?
This is a nice graphic showing where data science is being taught. It appears that data science is being taught all over the country.
- Computer scientists discover statistics and find it useful – Ever wonder why computer scientists are getting all the attention for data science? Well, computer scientists stole ideas from statistics. Read the article and it will make more sense.
- Top 3 Myths About Data Science – Here is a highlight of the myths:
  - Data science is a field for mathematical geeks
  - Learning a tool is the equivalent of learning data science
  - Data scientists will be replaced by artificial intelligence soon
- The Big Data Fallacy And Why We Need To Collect Even Bigger Data – More data is not always better because it does not necessarily mean more information. Read this for a good description of data vs. information vs. insights.
- MLbase – A distributed machine learning system. Here is an academic paper about the system.
- Predictive Analytics and Machine Learning: An Overview (PDF) – This is a very nice slide deck from IBM.
Jeff Hammerbacher, co-founder and Chief Scientist of Cloudera, gives a nice talk about data science. He explains what he has done in the past and what he plans to do in the future.
It is the second video I have posted recently that emphasizes the importance of data science for more than just advertising. Jeff is getting involved with a medical school to see how data can help.
Note: The video is about 45 minutes, but it contains some really good information.
Code School is offering a course titled Try R. The course is completely free and can be completed online through an interactive tutorial. You will learn by doing. If you have been looking to learn R or need a quick refresher, this is probably a very good option.