Deep Learning in Java

Deep learning is the hottest topic in all of data science right now. Adam Gibson, cofounder of, has created an open source deep learning library for Java named DeepLearning4j. For those curious, DeepLearning4j is open sourced on GitHub.

Below is a video of Adam introducing deep learning and DeepLearning4j. Also, if you are interested in learning more about deep learning, here are a couple more very helpful links.

CMU Machine Learning Summer School Videos

The CMU Machine Learning Summer School was a 2-week intensive course focused on machine learning for big data. Some of the top academics in machine learning gave presentations. Most of the videos are fairly long (around 1 hour each), but a whole lot of material is covered.

All the CMU Machine Learning Summer School Videos are on YouTube.

Here is one lecture by Alex Smola on Scalable Machine Learning.

Want to Learn SQL? Here is a Great Tutorial!

Mode Analytics, a recently launched site for collaborative data science in the cloud, has published an excellent tutorial for learning SQL.

The tutorial is named SQL School.

This is one of the best SQL tutorials I have seen. Plus, it has the huge added advantage of not requiring you to set up your own database first (the data is already available). Setting up your own database can be a bit overwhelming when you are first learning. So, if you are looking to learn SQL, now is a great time to start.
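For readers who want to try a query locally without installing a database server, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. It is not taken from SQL School; the `orders` table, its columns, the sample rows, and the `top_customers` helper are all hypothetical and exist only for illustration.

```python
import sqlite3


def top_customers(rows, limit=2):
    """Load (customer, amount) rows into an in-memory table and rank customers by total spend.

    The table name, column names, and data are hypothetical.
    """
    conn = sqlite3.connect(":memory:")  # nothing is written to disk
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    query = """
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
        ORDER BY total DESC
        LIMIT ?
    """
    result = conn.execute(query, (limit,)).fetchall()
    conn.close()
    return result


# Hypothetical sample data
sample_rows = [("ann", 20.0), ("bob", 5.0), ("ann", 15.0), ("cy", 12.5)]
print(top_customers(sample_rows))
```

The query covers the basics most tutorials start with: selecting columns, aggregating with GROUP BY, sorting, and limiting the result.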

Stanford Releases Large Network Datasets

Stanford University has just released a collection of large datasets of network data. When I say network data, I am referring to networks in the mathematical sense (think of a collection of nodes and edges). Here are just a few of the possible categories.

  • Citation Networks
  • Road Networks
  • Web graphs
  • Social Networks such as Twitter
  • and many more

If you are looking to study network data, or just want some practice analyzing big data, this just might be a good place to start.
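To make the "nodes and edges" idea concrete, here is a minimal sketch that reads a network from a plain-text edge list and counts each node's connections. The file layout is an assumption (one edge per line as two whitespace-separated integer node IDs, with `#` marking comment lines), so check the actual format of any dataset you download. The toy graph and the helper names `parse_edge_list` and `build_adjacency` are hypothetical.

```python
from collections import defaultdict


def parse_edge_list(text):
    """Parse lines of 'source target' into a list of (int, int) edges.

    Assumed format: whitespace-separated integer IDs, '#' for comment lines.
    """
    edges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        source, target = line.split()[:2]
        edges.append((int(source), int(target)))
    return edges


def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighboring nodes."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


# Hypothetical toy graph with 4 nodes and 4 edges
sample = """# hypothetical toy graph
1 2
1 3
2 3
3 4
"""

edges = parse_edge_list(sample)
adjacency = build_adjacency(edges)
degrees = {node: len(neighbors) for node, neighbors in adjacency.items()}
print(degrees)
```

An adjacency map like this is a reasonable starting point for small graphs; for very large networks a dedicated graph library or a sparse matrix representation is likely a better fit.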

An Organization for Open Data and Healthcare

Health Data Consortium is an advocacy group focused on helping the healthcare industry respond to the availability of health data. They are currently focused on innovation and the uses of open health data.

Healthcare is currently undergoing some radical changes, and data science is going to play a key role in the future of healthcare. It is great to see the medical field building an official group to define the practice. I hope other industries will follow the lead of the medical field and begin forming their own groups around open data. I am eager to see how the Health Data Consortium progresses over the coming months and years.

Analytics Handbook: Book 3 is Free

The team that brought you the Analytics Handbook has freely published the third and final book, titled THE DATA ANALYTICS HANDBOOK RESEARCHERS + ACADEMICS. This book focuses on data science in the research and academic communities. Like the previous 2 books in the series, it includes interviews with top experts in the field. Here are just a few of the people with interviews in this book.

The authors are now working on a new data science training project called Leada. Check it out for more details.

Huge List of Big Data and Machine Learning Technologies

Onur Akpolat has put together A curated list of awesome big data frameworks and resources. The list is very extensive and includes: NoSQL databases, machine learning libraries, frameworks, filesystems and more.

On a similar note, Joseph Misiti has compiled a large list of resources specific to machine learning. The list is titled Awesome Machine Learning, and it includes resources for various languages, NLP, visualization, and more.

Both lists are on GitHub, so if you notice something missing from either list, feel free to add it. Contributions are welcome.

Data Science Productivity Platform

Tristan Zajonc, cofounder of Sense Platform, recently gave a thought-provoking talk at Data Driven NYC. He spoke about the future of data science productivity. According to Tristan:

In the next 2 or 3 years, everybody doing data science should be using a data science productivity platform…a cloud-based data science platform.

In addition to productivity platforms, the methods themselves will see some improvements. Here are 2 that Tristan mentions:

  1. Probabilistic Programming – matching of computer science and Bayesian statistics
  2. Deep Reinforcement Learning – making optimal decisions via deep learning

It is an exciting time for data science. I think the next few years will see much better productivity tools, workflows, and platforms. More on that in an upcoming blog post.

The other videos from Data Driven NYC are also available on YouTube.

Data Scientist vs Data Engineer

As the field of data science continues to grow and mature, it is nice to begin seeing some distinction in the roles of a data scientist. A new job title gaining popularity is the data engineer. In this post, I lay out some of the distinctions between the 2 roles.

Data Scientist vs Data Engineer Venn Diagram

Data Scientist

A data scientist is responsible for pulling insights from data. It is the data scientist's job to pull data, create models, create data products, and tell a story. A data scientist should typically have interactions with customers and/or executives. A data scientist should love scrubbing a dataset for more and more understanding.

The main goal of a data scientist is to produce data products and tell the stories of the data. A data scientist would typically have stronger statistics and presentation skills than a data engineer.

Data Engineer

Data engineering is more focused on the systems that store and retrieve data. A data engineer will be responsible for building and deploying storage systems that can adequately handle the needs of the team. Sometimes the need is to handle fast, real-time incoming data streams. Other times it is to store massive amounts of large video files. Still other times it is to serve many, many reads of the data. In other words, a data engineer needs to build systems that can handle the 3 Vs of big data (volume, velocity, and variety).

The main goal of a data engineer is to make sure the data is properly stored and available to the data scientist and others who need access. A data engineer would typically have stronger software engineering and programming skills than a data scientist.


It is too early to tell if these 2 roles will ever have a clear distinction of responsibilities, but it is nice to see a little separation of responsibilities for the mythical all-in-one data scientist. Both of these roles are important to a properly functioning data science team.

Do you see other distinctions between the roles?