What is a "Data Lake"?

I keep hearing the term "data lake." Being the curious person that I am, I decided to go in search of a definition.

Currently, Pivotal is the company most actively marketing the term. However, the term is generally credited to James Dixon of Pentaho, who used it in 2010, and Dan Woods of CITO Research wrote about it in 2011. Anyhow, here is a basic description of a data lake.

A data lake is an information system with the following two characteristics:

  1. A parallel system able to store big data
  2. A system able to perform computations on the data without moving the data
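
To make the second characteristic a bit more concrete, here is a minimal Python sketch of computing "where the data lives." Everything in it is an assumption for illustration: the `PARTITIONS` lists stand in for blocks of data stored on separate nodes, worker threads stand in for those nodes, and `local_summary` and `global_mean` are hypothetical names. Each partition reduces its own records to a tiny summary, and only those summaries are combined.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical partitions: each list stands in for a block of data on one node.
PARTITIONS = [
    [3, 1, 4, 1, 5],
    [9, 2, 6],
    [5, 3, 5, 8],
]


def local_summary(partition):
    """Runs next to the data and returns only a small (count, total) pair."""
    return len(partition), sum(partition)


def global_mean(partitions):
    """Combine per-partition summaries; the raw records are never gathered."""
    # Threads are a stand-in for separate nodes working in parallel.
    with ThreadPoolExecutor() as pool:
        summaries = list(pool.map(local_summary, partitions))
    count = sum(c for c, _ in summaries)
    total = sum(t for _, t in summaries)
    return total / count


if __name__ == "__main__":
    print(global_mean(PARTITIONS))
```

In a real system the partitions would sit on different machines, so shipping a two-number summary instead of the full records is what keeps the data from moving.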

Currently, Hadoop is the most common technology for implementing a data lake, but it might not stay that way forever. Thus it is important to distinguish between Hadoop and a data lake: a data lake is a concept, and Hadoop is one technology for implementing that concept.

The following is a recent Strata Talk by Kaushik Das of Pivotal. He discusses how a data lake can be used to create the digital brain.
