Tag Archives: nosql

Huge List of Big Data and Machine Learning Technologies

Onur Akpolat has put together A curated list of awesome big data frameworks and resources. The list is very extensive and includes: NoSQL databases, machine learning libraries, frameworks, filesystems and more.

On a similar note, Joseph Misiti has compiled a large list of resources specific to machine learning. The list, titled Awesome Machine Learning, includes resources for various languages, NLP, visualization, and more.

Both lists are on GitHub, so if you notice something missing from either list, feel free to add it. Contributions are welcome.

7 Important Data Science Papers

It is back-to-school time, and here are some papers to keep you busy this school year. All the papers are free. This list is far from exhaustive, but these are some important papers in data science and big data.

Google Search

  • PageRank – This is the paper that explains the algorithm behind Google search.
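For readers who want a feel for the idea before opening the paper, here is a minimal sketch of PageRank as a power iteration over a tiny made-up link graph. The damping factor of 0.85, the iteration count, the handling of pages with no outgoing links, and the toy graph itself are assumptions chosen for illustration; this is not how Google's production system works.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank scores by repeated redistribution of rank.

    `links` maps each page to the list of pages it links to.
    """
    # Collect every page that appears as a source or a target.
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: C is linked to by A, B, and D.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the page with the most incoming links ends up ranked highest, and the page nobody links to ends up lowest.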

Hadoop

  • MapReduce – This paper explains a programming model for processing large datasets. In particular, it is the programming model used in Hadoop.
  • Google File System – Part of Hadoop is HDFS. HDFS is an open-source version of the distributed file system explained in this paper.
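The MapReduce programming model can be sketched on a single machine with the classic word-count example. The function names below (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and are not Hadoop's API; a real cluster would run the map and reduce steps in parallel across many machines and handle the grouping step itself.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group step: collect all values emitted for the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a list of strings."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data big ideas", "data science"])
print(counts)
```

The appeal of the model is that the map and reduce functions know nothing about distribution, so the same two small functions can be applied to one file or to thousands.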

NoSQL

These are 2 of the papers that started and drove the NoSQL debate. Each paper describes a different type of storage system intended to be massively scalable.

Machine Learning

Bonus Paper

  • Random Forests – One of the most popular machine learning techniques. It is heavily used in Kaggle competitions, even by the winners.

Are there any other papers you feel should be on the list?

Nice GraphDB and NoSQL Talk

This is a wonderful talk by Max DeMarzi (he has a very informative blog as well). If you are new to NoSQL or Graph Databases, I highly recommend this video.

One comment stuck out for me:

You’re never gonna run out of nodes when you get to half a trillion…

That is a really big number, but I wonder how many years that statement will stand. If you have any thoughts, please leave a comment.

ChiSC: Max DeMarzi – Is Your Problem a Graph Problem? from 8th Light on Vimeo.

2 Recently Released Open Source Graph-Related Projects

  1. GraphBuilder

    Intel Labs built a tool for constructing mathematical graphs out of large datasets. It is Java-based and works with Hadoop and MapReduce. Intel has released a whitepaper explaining more about GraphBuilder. The code is available on GitHub. A big thanks to Mark Nickel for pointing out this project.

  2. ArangoDB

    ArangoDB is a flexible NoSQL database. It is a document database with the ability to add edges, so it can also act as a graph database. I had a fun time playing around with the online tutorial and demo. ArangoDB also claims to work as a key/value store. The code is available on GitHub.
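To illustrate how a document store turns into a graph once edges are added, here is a rough sketch using plain Python dictionaries. It does not use ArangoDB or its query language; the document keys, the `knows` relationship, and the helper functions are made up for this example, and the `_from`/`_to` field names are only meant to echo the general shape of an edge document.

```python
# Ordinary documents, keyed by an id. Nothing graph-like yet.
documents = {
    "people/alice": {"name": "Alice"},
    "people/bob": {"name": "Bob"},
    "people/carol": {"name": "Carol"},
}

# Edges are just more documents that point from one document to another.
edges = [
    {"_from": "people/alice", "_to": "people/bob", "type": "knows"},
    {"_from": "people/bob", "_to": "people/carol", "type": "knows"},
]


def neighbors(doc_id):
    """Return the ids of documents directly reachable from doc_id."""
    return [edge["_to"] for edge in edges if edge["_from"] == doc_id]


def reachable(start):
    """Follow edges outward from start and return every id that is reached."""
    seen, stack = set(), [start]
    while stack:
        current = stack.pop()
        for nxt in neighbors(current):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


friends_of_friends = reachable("people/alice")
print(sorted(documents[doc_id]["name"] for doc_id in friends_of_friends))
```

The documents themselves never change; adding or removing edge documents is what creates the graph structure on top of them.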

3 Secrets for Aspiring Data Scientists | Software Advice

Michael Koploy wrote 3 Secrets for Aspiring Data Scientists about what it takes to enter a career as a data scientist. He lays out 3 steps:

  1. Sharpen Your Scientific Saw – Hone your math and science skills
  2. Learn the Language of Business – Data Scientists need to explain the data in business terms
  3. Keep Adding to Your Technical Toolbelt – Learn all the tools you can (NoSQL, Excel, Hadoop,…)

The article is a nice read. http://blog.softwareadvice.com/articles/bi/3-career-secrets-for-data-scientists-1101712/

Java and MongoDB Webinars

10gen, the company behind MongoDB, will be offering some free webinars this fall. This webinar series is targeted at using MongoDB with Java. 10gen has been running successful webinars for a long time, so I would highly recommend any or all of the following sessions.

Title                                               Date
Building your first Java Application with MongoDB   Oct. 18, 2012 and Nov. 22, 2012
Building Web Applications with MongoDB and Spring   Nov. 1, 2012
MongoDB on the JVM                                  Nov. 29, 2012
Simplifying Persistence for Java and MongoDB        Dec. 13, 2012