Tag Archives: machine learning

Best Practices for Machine Learning Engineering

Martin Zinkevich, Research Scientist at Google, just compiled a large list (43 to be exact) of best practices for building machine learning systems.

Rules of Machine Learning:
Best Practices for ML Engineering

If you do data engineering or are involved with building data science systems, this document is worth a look.

Recent Free Online Books for Data Science

This is just a short list of a few books that I have recently discovered online.

  • Model-Based Machine Learning – Chapters of this book become available as they are being written. It introduces machine learning via case studies instead of just focusing on the algorithms.
  • Foundations of Data Science – This is a much more academic-focused book which could be used at the undergraduate or graduate level. It covers many of the topics one would expect: machine learning, streaming, clustering and more.
  • Deep Learning Book – This book was previously available only in HTML form and not complete. Now, it is free and downloadable.

Machine Learning Yearning Book

Andrew Ng [Co-Founder of Coursera, Stanford Professor, Chief Scientist at Baidu, and All-Around Machine Learning Expert] is writing a book during the summer of 2016. The book is titled Machine Learning Yearning. If you visit the site and sign up quickly, you can get draft copies of the chapters as they become available.

Andrew is an excellent teacher. His MOOCs are wildly successful, and I expect his book to be excellent as well.

A Couple of Current Data Science Competitions

Decoding Brain Signals

Microsoft has recently announced a machine learning competition platform. As part of the launch, one of the first competitions is the prediction of brain signals. It has $5,000 in prizes, and submissions are accepted through June 30, 2016.

Big Data Viz Challenge

Google and Tableau have teamed up to offer a big data visualization contest. The rules are fairly simple: just create an awesome visualization using at least the GDELT data set. Finalists will receive prizes worth over $5,000, and some will even get tours of Tableau and Google facilities. The contest runs through May 16, 2016.

Getting Started with Data Science Specialties

I frequently ask young people, particularly undergraduates, what they plan to do with their future. I am often less than enthused by the responses, which sound something like this:

  • I hope to get a job doing statistics.
  • I just want to work with computers.
  • I want to be a data scientist.
  • I just want a job.

The responses are typically vague and devoid of direction. Most responses involve waiting for someone else to provide the guidance. You do not have to wait. You can get started today.

If you are just interested in getting a job, the rest of this post is not for you. If you want to make an impact with your data science career, the remainder of this post is for you.

Below is an explanation of numerous specialties in data science. You don’t need to learn them all. Just pick one and follow the first step. You will learn more along the way. Don’t stress about which one to pick; there is no wrong answer. Just pick one and start building.

Data Visualization

Data visualization is all about telling a story with data. Do you have a keen eye for color and design? Can you summarize complex data in a few simple charts? If you answer yes to those questions, then you just might be a good fit for data visualization.

First Step: Go to Data.gov and make an infographic

Data Science Educator

Are you the person always explaining your homework to others? This specialty might be for you. You can take a few different paths. One is the traditional university faculty approach. Another is more of a corporate training professional. The world needs both. Plus, if you are entrepreneurial, there are ample opportunities to consult as a data science educator. Businesses realize they need to know data science, and they are looking for training.

First Step: Start a video or blog with tutorials

Data Engineer

A data engineer is typically more interested in systems than in just the machine learning. Data engineers are typically strong in computer science fundamentals. They love to build things that they and others can use. A good data engineer can also expect to spend a lot of time cleaning data.

First Step: Build a solution (hint: Cortana Intelligence Solutions)

Data Programmer

Do you love to program? If so, you just might fall into this category. Data science has many needs for programmers. Everything from cleaning data to building data products needs programming.

First Step: Be on GitHub

Statistical Modeling (Machine Learning)

Some people just love statistical modeling and machine learning. They love to tune models and squeeze the last bit of predictive power from a data set. If you love talking about regression, trees, random forests, AUC, cross-validation and boosting, then this specialty is most likely for you.

First Step: Enter Kaggle competitions.
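
If two of the terms above, AUC and cross-validation, are new to you, here is a minimal sketch of what they mean in plain Python. It is only an illustration: the function names auc and k_fold_indices are made up for this example, and in a real competition you would most likely reach for a library instead.

```python
def auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs where the positive example gets the
    higher score. Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation.
    Every index lands in exactly one test fold."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


# Example: score a set of predictions, then list the folds for 6 rows.
print(auc([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]))
for train, test in k_fold_indices(6, 3):
    print("train:", train, "test:", test)
```

The idea behind cross-validation is that each row is held out exactly once, so the model is always scored on data it did not see during training.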

Data Science Manager

If you are bossy, it does not mean you will make a good manager. The best managers know how to build strong teams and get out of the way. Managers provide help and overall direction for projects. Plus, a manager should have a solid understanding of how data can help shape a team’s decisions.

First Step: Organize a group to help a non-profit analyze data (Similar to what DataKind does)

Data Science Researcher

A researcher is interested in pushing the boundaries of data science. Are you interested in creating your own machine learning algorithms? Do you want to build the next great data framework? Do you think data science can achieve something no one else has thought to try? If so, being a researcher is for you.

First Step: Go to graduate school

Data Science Unicorn

A data science unicorn is someone who knows all the specialties above and more. A unicorn understands all the topics of data science. Being a unicorn is not attainable for everyone, but a few people have become unicorns. If you think you can be a unicorn, go for it.

First Step: Start at visualization above

In Conclusion

Simple: Pick a specialty and Go Make a Difference!


This post is based upon a talk I gave at Winona State University just before MUDAC. The original title was Go After Your Data Science Dreams.

Yahoo Just Released a Huge Machine Learning Dataset

Yahoo just released a 1.5 TB dataset of “anonymized user interactions on the news feeds”. If you have been looking for a new dataset to analyze, this just might be it. It contains approximately 110 billion rows of data regarding user-news interactions. Happy data exploring!

An executive’s guide to machine learning | McKinsey & Company

via An executive’s guide to machine learning | McKinsey & Company.

A nice read if you are looking for a short introduction to the history and importance of machine learning.

Understanding Machine Learning: From Theory to Algorithms (Free Book Download)

Understanding Machine Learning: From Theory to Algorithms is a book by Shai Shalev-Shwartz, Associate Professor at the School of Computer Science and Engineering at The Hebrew University, Israel, and Shai Ben-David, Professor in the School of Computer Science at the University of Waterloo, Canada. The book looks very thorough. Below is just a sampling of the topics covered.

  • Bias-Complexity Tradeoff
  • Model Selection
  • Support Vector Machines
  • Decision Trees
  • Neural Networks
  • Clustering
  • Dimensionality Reduction
  • Feature Selection and Generation
  • Advanced Theory
  • And LOTS LOTS more….

Happy Learning!

One Algorithm To Learn Anything [An Interview with Pedro Domingos, Author of The Master Algorithm]

Releasing today (Sept. 22, 2015) is the fantastic new book, The Master Algorithm, by machine learning expert and University of Washington Computer Science Professor, Pedro Domingos. Recently, I got the opportunity to visit with Dr. Domingos about his new book and machine learning in general. See below for his fears of machine learning, thoughts on education, and tips for learning data science.

Stay tuned later this week for a complete review of the book!

The Master Algorithm book
Why This Book and Why Now?

Dr. Domingos explains the 2 primary reasons why the topic and timing are just perfect for the book.

  1. Real Need – Currently, machine learning is a topic of interest in society. Machine learning and data science are being discussed in news and politics. The one downside is that most people don’t really understand the topic. He does fault machine learning experts for not making the topic understandable to a broader audience. Many of the concepts from machine learning can be explained without complex mathematical formulas. The new book aims to do just that while exposing the topic of machine learning to others.
  2. Unity – Dr. Domingos explains the five camps of machine learning: symbolists, connectionists, evolutionaries, Bayesians, and analogizers. He thinks right now is the time to start thinking and working toward combining the camps to form a single general-purpose learner. More on those camps can be discovered in the book.

What are the limits of the Master Algorithm?

Not many! Dr. Domingos does not think the algorithm will perform magic, but he did state,

“It should truly be able to learn anything given the requisite data.”

The trick will be compiling the “requisite data”.

What are the biggest fears of the Master Algorithm?

As is emphasized numerous times in the book, Dr. Domingos does not envision The Master Algorithm creating bots that will eventually take over the world. No, the real problem is already a concern with machine learning.

Computers are making decisions for humans every day, and sometimes those decisions are wrong.

Also, he thinks machine learning will discover and expose things we do not like about ourselves. Then he envisions some challenges with ownership of that data and the algorithmic results.

How soon will we see the Master Algorithm?

Dr. Domingos is not sure whether the algorithm will be discovered tomorrow, many years from now, or never. He does think the next five years will see some combining of the best parts of the five camps.

What are some problems in the application of machine learning?

He is currently seeing a problem in the practice of applying machine learning. He sees companies take the latest research, which is a good thing, and turn it into a large engineering project. Eventually, those projects hit a wall of being too complex. That is why he thinks companies are going to start combining and refining machine learning projects to make them less complex and more maintainable.

What advice would you give to high school students or undergrads about pursuing machine learning/data science?

Dr. Domingos believes they (high school and undergraduate students) are the primary audience for the book. He did expand on the answer and provide a nice to-do list for people getting into the field of data science and machine learning.

  1. Read The Master Algorithm
  2. Explore further readings – the end of the book contains details on further readings for each chapter
  3. Take an online course (MOOC) – many good choices
  4. Start implementing some algorithms – either on your own projects or in a competition such as Kaggle. This will help you identify some of the common pitfalls.

How do you see machine learning affecting education?

He sees two clear ways in which machine learning will have an impact on education.

  1. Machine learning is something people in every field will need to know. It is becoming the new toolkit.
  2. Machine learning is going to personalize education. MOOCs are already starting to do this, but the future shows much more promise in this specific area.

Do you ever have plans to offer the Coursera Machine Learning course in a live format?

Luckily for us learners, Dr. Domingos does plan to offer the course in a live format. He always intended the course to happen that way, but some unexpected things arose, and the class never ran live. It doesn’t have a scheduled date yet, but the details will be posted on this blog when it does happen. In the meantime, all the lectures are available on the Coursera class page.

Finally, do you have a unique use of machine learning in your own life?

Dr. Domingos and a few other professors at the University of Washington are in the initial steps of a project named eProf, for electronic Professor. The goal is to automate some of the responsibilities of a professor. The project is still in the discussion stages, but he thinks it would make a useful open source project. Hopefully, more to come on eProf in the future!

Remember, check back later this week for a complete review of the book!

Yinyang K-Means: A Drop-In Replacement of the Classic K-Means

This week, Yufei Ding, Yue Zhao, Xipeng Shen, Madanlal Musuvathi, and Todd Mytkowicz will be presenting Yinyang K-means at the 2015 International Conference on Machine Learning.

The algorithm guarantees the same results as traditional K-means, but it runs about an order of magnitude faster.

An abstract of the paper and a PDF download can be accessed at Yinyang K-Means: A Drop-In Replacement of the Classic K-Means with Consistent Speedup.
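
For readers who have not seen the classic algorithm that Yinyang K-means is meant to replace, here is a minimal sketch of traditional K-means (often called Lloyd's algorithm) in plain Python. To be clear, this is the baseline only, not Yinyang K-means itself, and it includes none of the paper's speedups. The function name kmeans and its parameters are my own choices for illustration.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Classic K-means on a list of equal-length numeric tuples.
    Returns (centers, assignment), where assignment[i] is the index
    of the center that points[i] belongs to."""
    rng = random.Random(seed)
    # Start from k input points picked at random.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center
        # (squared Euclidean distance).
        new_assignment = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the algorithm has converged
        assignment = new_assignment
        # Update step: move each center to the mean of its points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment


# Example: two obvious groups of points.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(data, k=2)
print(centers)
print(labels)
```

The assignment step compares every point against every center on every pass, and that is where the classic algorithm spends most of its time.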