Tag Archives: machine learning

The 3 Stages of Data Science

Businesses everywhere are racing to extract meaningful insight from their data. Many organizations are spinning up data science teams and attacking problems, some more successfully than others. One challenge is determining the organization’s current stage of data science. The next is deciding which stage it wants to reach.

Below are 3 stages of a truly mature data science organization.

1. Dashboards

The beginning stage of data science is dashboards. It is all about answering “How much?” and “What happened?” by looking at reports of historical data. If done well, it might even help an organization answer “Why?” Many organizations will refer to this phase as Business Intelligence.

The dashboard stage can be very expensive for an organization, in terms of people-hours and money. It usually involves investments in:

  1. Data Warehouse or some other storage environment, for storing the data in a single location for easy reporting
  2. ETL (Extract Transform Load) Tools for manipulating, combining, and moving data to the data warehouse
  3. Reporting Tools for displaying the results and allowing users to “explore” the data

Here are some common questions that can be answered via traditional dashboards:

  • How many customers live in each region?
  • What were the sales on Black Friday?
  • How many patients visited the hospital last month?

As you can see, a great deal of value can be gained from this phase alone. It is critical for a business to clearly understand past performance. Unfortunately, this phase is where many businesses stop.
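A dashboard question such as “How many customers live in each region?” boils down to an aggregation over historical records. Here is a minimal sketch in Python. The customer records and the customers_per_region helper are made up for illustration; in practice the data would come from the data warehouse and the counting would likely happen in a reporting tool.

```python
from collections import Counter

# Hypothetical customer records, standing in for rows from a data warehouse.
customers = [
    {"name": "Ana", "region": "Midwest"},
    {"name": "Ben", "region": "South"},
    {"name": "Cho", "region": "Midwest"},
    {"name": "Dee", "region": "West"},
    {"name": "Eli", "region": "South"},
    {"name": "Fay", "region": "Midwest"},
]


def customers_per_region(records):
    """Count how many customer records fall in each region."""
    return Counter(record["region"] for record in records)


counts = customers_per_region(customers)

# Print regions from most to fewest customers, as a dashboard table might.
for region, total in counts.most_common():
    print(f"{region}: {total}")
```

Notice that nothing here is estimated. The answer is a direct count of what already happened.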

2. Machine Learning

The real “science” of data science does not begin until the second stage, which is machine learning. It focuses on estimating quantities that cannot be directly observed. This could be what movies a customer will like, the price of a company’s stock tomorrow, or the causal effect of a particular advertising campaign. Machine learning uses the data from the first phase and applies statistical or other methods to gain additional insights.

Think of machine learning as answering the following:

  • When a customer moves, will he/she spend money at a hardware store?
  • When a credit card purchase is made, what is the probability the charge was fraudulent?
  • What is the expected lifetime value of a new customer?
  • If a hurricane is coming, what will people buy? (Pop-Tarts? It is true.)

Notice the connection between an event and some outcome. The value of machine learning comes from estimating the causal outcome of potential events. This phase is filled with terms such as machine learning, data mining, and statistical modeling. The machine learning stage is all about looking into the future!
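To make the fraud question concrete, here is a rough sketch of estimating the probability that a charge is fraudulent. It assumes a single made-up feature (the charge amount) and a tiny invented history of labeled charges, and it fits a logistic regression with plain gradient descent. The data, function names, and settings are all assumptions for illustration; a real system would use many more features and a proper machine learning library.

```python
import math

# Hypothetical history: (charge amount in hundreds of dollars, 1 if fraud else 0).
history = [
    (0.2, 0), (0.5, 0), (0.8, 0), (1.0, 0), (1.2, 0), (1.5, 0), (3.2, 0),
    (2.8, 1), (3.5, 1), (4.5, 1), (5.0, 1), (6.0, 1),
]


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(data, learning_rate=0.1, epochs=5000):
    """Fit P(fraud) = sigmoid(w * amount + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for amount, label in data:
            error = sigmoid(w * amount + b) - label
            grad_w += error * amount
            grad_b += error
        w -= learning_rate * grad_w / n
        b -= learning_rate * grad_b / n
    return w, b


def fraud_probability(amount, w, b):
    """Estimated probability that a charge of this amount is fraudulent."""
    return sigmoid(w * amount + b)


w, b = fit_logistic(history)

# Score two new, unseen charges.
for amount in (0.3, 5.5):
    print(f"amount={amount}: P(fraud)={fraud_probability(amount, w, b):.2f}")
```

Unlike the dashboard stage, the output is an estimate about a charge that has not been labeled yet, not a count of past events.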

3. Actions

Determining the actions to perform is the third and final phase. It tries to capitalize on the results of machine learning in order to take appropriate actions. The following actions might be suitable for the events identified in the machine learning section above.

  • When a customer moves, send a “welcome to the neighborhood” packet with coupons to nearby hardware stores.
  • Decline the fraudulent charge or deactivate the credit card.
  • If the new customer has very high expected lifetime value, provide some special treatment or offers to ensure the customer becomes a customer for life.
  • When a hurricane is approaching, place Pop-Tarts near the front of the store.

As you can see, good machine learning from the second phase can lead to clear actions.
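One simple way to connect estimates to actions is a set of threshold rules. The sketch below is only an illustration: the choose_actions function, the threshold values, and the action names are assumptions, and a real organization would tune them against the cost of each mistake.

```python
def choose_actions(fraud_probability, expected_lifetime_value,
                   fraud_threshold=0.9, high_value_threshold=10_000):
    """Map model estimates to business actions using simple threshold rules.

    The thresholds are hypothetical defaults chosen for illustration.
    """
    actions = []
    if fraud_probability >= fraud_threshold:
        actions.append("decline charge")
    if expected_lifetime_value >= high_value_threshold:
        actions.append("send special offer")
    return actions


# A likely fraudulent charge from a low-value customer.
print(choose_actions(fraud_probability=0.95, expected_lifetime_value=500))

# A normal charge from a customer with high expected lifetime value.
print(choose_actions(fraud_probability=0.02, expected_lifetime_value=25_000))
```

The rules are deliberately boring. The hard work was done in the second stage, and the third stage turns its estimates into decisions.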

Conclusion

Claiming success in Data Science is all about conquering all three stages. Each stage builds upon the previous stage. If you have put in the effort to complete the first stage, why not continue to the second and third stages?

5 Data Science Research Papers to read in Summer 2017

In the past, the blog has included 7 Important Data Science Papers and 5 More Data Science Papers. Here is another list if you are looking for something to read over the summer.

Best Practices for Machine Learning Engineering

Martin Zinkevich, Research Scientist at Google, just compiled a large list (43 to be exact) of best practices for building machine learning systems.

Rules of Machine Learning:
Best Practices for ML Engineering

If you do data engineering or are involved with building data science systems, this document is worth a look.

Recent Free Online Books for Data Science

This is just a short list of a few books that I have recently discovered online.

  • Model-Based Machine Learning – Chapters of this book become available as they are being written. It introduces machine learning via case studies instead of just focusing on the algorithms.
  • Foundations of Data Science – This is a much more academic-focused book which could be used at the undergraduate or graduate level. It covers many of the topics one would expect: machine learning, streaming, clustering and more.
  • Deep Learning Book – This book was previously available only in HTML form and not complete. Now, it is free and downloadable.

Machine Learning Yearning Book

Andrew Ng [Co-Founder of Coursera, Stanford Professor, Chief Scientist at Baidu, and All-Around Machine Learning Expert] is writing a book during the summer of 2016. The book is titled Machine Learning Yearning. If you visit the site and sign up quickly, you can get draft copies of the chapters as they become available.

Andrew is an excellent teacher. His MOOCs are wildly successful, and I expect his book to be excellent as well.

A Couple of Current Data Science Competitions

Decoding Brain Signals

Microsoft has recently announced a machine learning competition platform. As part of the launch, one of the first competitions is the prediction of brain signals. It has $5000 in prizes, and submissions are accepted through June 30, 2016.

Big Data Viz Challenge

Google and Tableau have teamed up to offer a big data visualization contest. The rules are fairly simple: just create an awesome visualization using at least the GDELT data set. Finalists will receive prizes worth over $5000, and some will even get tours of Tableau and Google facilities. The contest runs through May 16, 2016.

Getting Started with Data Science Specialties

I frequently ask young people, particularly undergraduates, what they plan to do with their future. I am often less than enthused with the responses which sound something like this:

  • I hope to get a job doing statistics.
  • I just want to work with computers.
  • I want to be a data scientist.
  • I just want a job.

The responses are typically vague and void of direction. Most responses involve waiting for someone else to provide the guidance. You do not have to wait. You can get started today.

If you are just interested in getting a job, the rest of this post is not for you. If you want to make an impact with your data science career, the remainder of this post is for you.

Below is an explanation of numerous specialties in data science. You don’t need to learn them all. Just pick one and follow the first step. You will learn more along the way. Don’t stress about which one to pick; there is no wrong answer. Just pick one and start building.

Data Visualization

Data visualization is all about telling a story with data. Do you have a keen eye for color and design? Can you summarize complex data in a few simple charts? If you answer yes to those questions, then you just might be a good fit for data visualization.

First Step: Go to Data.gov and make an infographic

Data Science Educator

Are you the person always explaining your homework to others? This specialty might be for you. You can take a few different paths. One is the traditional university faculty approach. Another is more of a corporate training professional. The world needs both. Plus, if you are entrepreneurial, there are ample opportunities to consult as a data science educator. Businesses realize they need to know data science, and they are looking for training.

First Step: Start a video or blog with tutorials

Data Engineer

A data engineer is typically more interested in systems than in just the machine learning. Data engineers are typically strong in computer science fundamentals. They love to build things that they and others can use. A good data engineer can also spend a lot of time cleaning data.

First Step: Build a solution (hint: Cortana Intelligence Solutions)

Data Programmer

Do you love to program? If so, you just might fall into this category. Data science has many needs for programmers. Everything from cleaning data to building data products needs programming.

First Step: Be on GitHub

Statistical Modeling (Machine Learning)

Some people just love statistical modeling and machine learning. They love to tune models and squeeze the last bit of predictive power from a data set. If you love talking about regression, trees, random forests, AUC, cross-validation, and boosting, then this specialty is most likely for you.

First Step: Enter Kaggle competitions.

Data Science Manager

If you are bossy, it does not mean you will make a good manager. The best managers know how to build strong teams and get out of the way. A manager provides help and overall direction for projects. Plus, he/she should have a solid understanding of how data can help shape a team’s decisions.

First Step: Organize a group to help a non-profit analyze data (Similar to what DataKind does)

Data Science Researcher

A researcher is interested in pushing the boundaries of data science. Are you interested in creating your own machine learning algorithms? Do you want to build the next great data framework? Do you think data science can achieve something no one else has thought to try? If so, being a researcher is for you.

First Step: Go to graduate school

Data Science Unicorn

A data science unicorn is someone who knows all the specialties above and more. A unicorn understands all the topics of data science. Being a unicorn is not attainable for everyone, but a few people have become unicorns. If you think you can be a unicorn, go for it.

First Step: Start at visualization above

In Conclusion

Simple: Pick a specialty and Go Make a Difference!


This post is based upon a talk I gave at Winona State University just before MUDAC. The original title was Go After Your Data Science Dreams.

Yahoo Just Released a Huge Machine Learning Dataset

Yahoo just released a 1.5 TB dataset of “anonymized user interactions on the news feeds”. If you have been looking for a new dataset to analyze, this just might be it. It contains approximately 110 billion rows of data regarding user-news interactions. Happy data exploring!

An executive’s guide to machine learning | McKinsey & Company

via An executive’s guide to machine learning | McKinsey & Company.

A nice read if you are looking for a short introduction to the history and importance of machine learning.

Understanding Machine Learning: From Theory to Algorithms (Free Book Download)

Understanding Machine Learning: From Theory to Algorithms is written by Shai Shalev-Shwartz, Associate Professor at the School of Computer Science and Engineering at The Hebrew University, Israel, and Shai Ben-David, Professor in the School of Computer Science at the University of Waterloo, Canada. The book looks very thorough. Below is just a sampling of the topics covered.

  • Bias-Complexity Tradeoff
  • Model Selection
  • Support Vector Machines
  • Decision Trees
  • Neural Networks
  • Clustering
  • Dimensionality Reduction
  • Feature Selection and Generation
  • Advanced Theory
  • And LOTS LOTS more….

Happy Learning!