Insight Data Engineering – A Free Tuition Training Program

The creators of the Insight Data Science Fellows Program have done it again. This time they have created the Insight Data Engineering Program. The program aims to train highly specialized software engineers who can build big data systems and big data pipelines. Unlike the data science program, the data engineering program does not target people with PhDs. Please visit the Insight Data Engineering website for a white paper with all the details on the program.

Here is an official announcement:

The Insight Data Engineering Fellows Program is a professional training fellowship designed to help engineers from various backgrounds, as well as mathematicians and computer scientists, transition to careers in data engineering.

  • Tuition-free, 6-week, full-time data engineering training fellowship in Silicon Valley this summer.
  • Alumni network of 70 Insight Fellows who are now data scientists and data engineers at Facebook, LinkedIn, Microsoft, Twitter, Square, Netflix, Airbnb, Palantir, Jawbone, and many others.
  • Interviews at top technology companies hiring data engineers at the end of the fellowship.

For more information please visit:
http://insightdataengineering.com

DataKind Looking For Local Chapters

DataKind, the organization matching nonprofits and data scientists, is looking for applications for Local Chapters.

The motto for DataKind is:

Let’s use data to change the world.

DataKind hopes to add 3-5 chapters by the end of 2014.  A Chapter will be responsible for building relationships between organizations and data scientists, promoting data science for the social sector, and organizing data events.  If you are a skilled data scientist with a passion to change the world, this might be an excellent opportunity.  What do you think: are you going to apply?


Expand Your Data Science Toolbelt with 3 Predictive Model Tests

The team at Software Advice recently published a slide deck outlining 3 techniques for testing the accuracy of predictive models. The 3 techniques are:

  1. Lift Charts and Decile Tables
  2. Target Shuffling
  3. Bootstrap

Depending upon your situation, goals, and dataset, all 3 are worthy tests. I would say the bootstrap is the most common of the 3, but it is always good to have extra tools in your data science toolbelt. If you are unfamiliar with any of the techniques, see the slides below for a quick overview. For a more detailed description, here is a blog post detailing the 3 techniques: 3 Ways to Test the Accuracy of Your Predictive Models.
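To make technique #2 concrete, here is a minimal sketch of target shuffling using only the Python standard library. The idea: measure how strong your observed relationship is, then repeatedly shuffle the target values (breaking any real link to the inputs) and see how often chance alone matches it. The data and numbers below are made up for illustration; the slide deck and blog post above describe the technique in full.

```python
import random

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(42)
xs = [random.gauss(0, 1) for _ in range(100)]
ys = [2 * x + random.gauss(0, 1) for x in xs]  # a genuine relationship

observed = pearson_r(xs, ys)

# Target shuffling: destroy the x-y pairing and count how often a
# shuffled target produces a correlation as strong as the observed one.
shuffles = 1000
count = 0
shuffled = ys[:]
for _ in range(shuffles):
    random.shuffle(shuffled)
    if abs(pearson_r(xs, shuffled)) >= abs(observed):
        count += 1

p_value = count / shuffles
print(f"observed r = {observed:.2f}, target-shuffling p-value = {p_value:.3f}")
```

If the shuffled targets rarely match the observed score, the model has likely found a real pattern rather than noise. The same loop works with any model and scoring metric in place of the correlation coefficient.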

BrightTALK Data Science Summit

On March 19th and 20th of 2014 (that is Wednesday and Thursday), BrightTALK will be hosting a virtual Data Science Summit. The event starts off with Kirk Borne, Professor at George Mason University, presenting on Data Science: Concepts and Missteps. The event continues with 5 more talks on Wednesday and 3 more talks on Thursday. The presenters are top notch, from places like Cloudera, Twitter, Trulia, and more. You can attend any or all of the webinars.

I have attended BrightTALK webinars before and they do a very good job with them, so this event promises to be very good.

Google Data Science MOOC

Google recently announced the launch of their own Massive Open Online Course (MOOC). The course is titled, Making Sense of Data, and it begins tomorrow, March 18, 2014.

The prerequisites are quite simple. All that is needed is a Google account, a web browser, and a basic knowledge of spreadsheets.

The content of the course will focus on Fusion Tables, which is a new experimental product from Google. Fusion Tables is a web application for visualizing, gathering, and sharing data. I am not familiar with Fusion Tables, but the description sounds very useful.

Here is the promotional video.

The Art of Good Practice – Strata Video

This is definitely not the typical Strata talk. Rodney Mullen, The Godfather of Street Skating, gives a very thought-provoking talk. I do not really follow skateboarding, but Rodney is widely considered one of the best skateboarders on the planet.

I would say the talk applies to data science, but I would also say it applies to life. Here are the high points I took from the talk.

  • Share with others
  • Take risks to improve
  • Be smart enough to know when you are headed in the wrong direction

You should definitely watch the video, because you will likely get something different from the presentation.

By the way, Happy Pi Day!

The Data Science Methodology


  1. Problem Formulation – First, identify the problem to be solved. This step is easily overlooked. However, many dollars and hours have been spent solving the wrong problems.
  2. Obtain The Data – Next, collect new data and/or gather the data that already exists. In almost all cases, this data will need to be transformed and cleansed. It is important to note that this stage does not always involve big data or a data lake.
  3. Analysis – This is the part of the process where insight is extracted from the data. Commonly, this step will involve creating and optimizing statistical/machine learning models for prediction, but that is not always necessary. Sometimes, the analysis only contains graphs, charts, and basic descriptions of the data.
  4. Data Product – The end goal of data science is a data product. The insight from the Analysis phase needs to be conveyed to an end user. The data product might be as simple as a slideshow; more commonly it is a website dashboard, a message, an alert, or a recommendation.
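The four stages above can be sketched as a toy pipeline. Everything here is illustrative (the dataset, product names, and numbers are invented for the example), but it shows how even a tiny analysis flows from problem to product:

```python
# 1. Problem Formulation: which product line has falling weekly sales?

# 2. Obtain The Data: here a hypothetical, already-cleansed dataset.
#    No big data or data lake required.
weekly_sales = {
    "widgets": [120, 115, 108, 101],
    "gadgets": [80, 85, 91, 96],
}

# 3. Analysis: no machine learning needed -- a basic description
#    (average week-over-week change) answers the question.
def avg_change(series):
    return sum(b - a for a, b in zip(series, series[1:])) / (len(series) - 1)

trends = {name: avg_change(sales) for name, sales in weekly_sales.items()}

# 4. Data Product: a simple alert conveyed to an end user.
for name, trend in sorted(trends.items()):
    if trend < 0:
        print(f"ALERT: {name} sales falling by {-trend:.1f} units/week")
```

A real project would have far more work in each stage, but the shape is the same: a question, data to answer it, an analysis, and a product that delivers the insight.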

Can you think of anything the methodology is missing?


Note: This post is similar to the Data Scientific Method which I blogged about nearly 2 years ago.

What is a "Data Lake"?

I have frequently been hearing the term data lake. Being the curious person that I am, I decided to go in search of a definition.

Currently, the company Pivotal is responsible for marketing the term. However, I believe the term was originally coined by Dan Woods of CITO Research back in 2011. Anyhow, here is a basic description of a data lake.

A data lake is an information system with the following 2 characteristics:

  1. A parallel system able to store big data
  2. A system able to perform computations on the data without moving the data

Currently, Hadoop is the most common technology used to implement a data lake, but it might not be that way forever. Thus it is important to distinguish between Hadoop and a data lake. A data lake is a concept, and Hadoop is a technology to implement the concept.

The following is a recent Strata Talk by Kaushik Das of Pivotal. He discusses how a data lake can be used to create the digital brain.