Berkeley Undergrad Data Science Course and Textbook

The University of California at Berkeley has started The Berkeley Data Science Education Program. The goal is to build a data science education program throughout the next several years by engaging faculty and students from across the campus. The introductory data science course is targeting freshman and it is taught from a very applicable and interactive environment. The course videos, slides, labs, and notes are freely available for others to use. The course heavily uses Jupyter. Also, the course textbook is online at Computational and Inferential Thinking: The Foundations of Data Science.

A Couple of Current Data Science Competitions

Decoding Brain Signals

Microsoft has recently announced a machine learning competition platform. As part of the launch, one of the first competitions is the prediction of brain signals. It has $5000 in prizes, and submissions are accepted thru June 30, 2016.

Big Data Viz Challenge

Google and Tableau have teamed up to offer a big data visualization contest. The rules are fairly simple, just create an awesome visualization using at least the GDELT data set. Finalist will receive prizes worth over $5000 and even some will get tours of Tableau and Google facilities. The contest runs thru May 16, 2016.

Free Stats book for Computer Scientists

Professor Norm Matloff from the University of California, Davis has published From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science which is an open textbook. It approaches statistics from a computer science perspective. Dr. Matloff has been both a professor of statistics and computer science so he is well suited to write such a textbook. This would a good choice of a textbook for a statistics course targeted at primarily computer scientists. It uses the R programming language. The book starts by building the foundations of probability before entering statistics.

Do’s and Don’ts of Data Science

Don’t Start with the Data
Do Start with a Good Question

Don’t think one person can do it all
Do build a well-rounded team

Don’t only use one tool
Do use the best tool for the job

Don’t brag about the size of your data
Do collect relevant data

Don’t ignore domain knowledge
Do consult a subject matter expert

Don’t publish a table of numbers
Do create informative charts

Don’t use just your own data
Do enhance your analysis with open data

Don’t do all the work yourself
Do partner with local universities

Don’t always build your own tools
Do use lots of open source tools

Don’t keep all your findings to yourself
Do share your analysis and results with the world!


Got any to add? Please leave a comment.

Tips for Future Data Scientists

While preparing a for a recent talk I gave to an undergraduate audience, I started compiling some tips for future data scientists. The tips are intended for students (undergraduate and graduate) or anyone else planning to enter the field of data science.

I asked a few of my data science friends and posted a question on Quora, As a data scientist, what tips would you have for a younger version of yourself?

What follows is a summary of the many tips.

Tips for Data Science

  • Be flexible and adaptable – There is no single tool or technique that always works best.
  • Cleaning data is most of the work – Knowing where to find the right data, how to access the data, and how to properly format/standardize the data is a huge task. It usually takes more time than the actual analysis.
  • Not all building models – Like the previous tip, you must have skills beyond just model building.
  • Know the fundamentals of structuring data – Gain an understanding of relational databases. Also learn how to collect and store good data. Not all data is useful.
  • Document what you do – This is important for others and your future self. Here is a subtip, learn version control.
  • Know the business – Every business has different goals. It is not enough to do analysis just because you love data and numbers. Know how your analysis can make more money, positively impact more customers, or save more lives. This is very important when getting others to support your work.
  • Practice explaining your work – Presentation is essential for data scientists. Even if you think you are an excellent presenter, it always helps to practice. You don’t have to be comfortable in front of an audience, but you must be capable in front of an audience. Take every opportunity you can get to be in front of a crowd. Plus, it helps to build your reputation as an expert.
  • Spreadsheets are useful – Although they lack some of the computational power of other tools, spreadsheets are still widely used and understood by the business world. Don’t be afraid to use a spreadsheet if it can get the job done.
  • Don’t assume the audience understands – Many (non-data science) audiences will not have a solid understanding of math. Most will have lost their basic college and high school mathematics skills. Explain concepts such as correlation and avoid equations. Audiences understand visuals, so use them to explain concepts.
  • Be ready to continually learn – I do not know a single data scientist who has stopped learning. The field is large and expanding daily.
  • Learn the basics – Once you have a firm understanding of the basics in mathematics, statistics, and computer programming; it will be much simpler to continue learning new data science techniques.
  • Be polymath – It helps to be a person with a wide range of knowledge.

Thanks to Chad, Chad, Lee, Buck, and Justin for providing some of the tips.

Getting Started with Data Science Specialties

I frequently ask young people, particularly undergraduates, what they plan to do with their future. I am often less than enthused with the responses which sound something like this:

  • I hope to get a job doing statistics.
  • I just want to work with computers.
  • I want to be a data scientist.
  • I just want a job.

The responses are typically vague and void of direction. Most responses involve waiting for someone else to provide the guidance. You do not have to wait. You can get started today.

If you are just interested in getting a job, the rest of this post is not for you. If you want to make an impact with your data science career, the remainder of this post is for you.

Below is an explanation of numerous specialties in data science. You don’t need to learn them all. Just pick one and follow the first step. You will learn more along the way. Don’t stress about which one to pick, there is no wrong answer. Just pick one and start building.

Data Visualization

Data visualization is all about telling a story with data. Do you have a keen eye for color and design? Can you summarize complex data in a few simple charts? If you answer yes to those questions, then you just might be a good fit for data visualization.

First Step: Go to Data.gov and make an infographic

Data Science Educator

Are you the person always explaining your homework to others? This specialty might be for you. You can take a few different paths. One is the traditional university faculty approach. Another is more of a corporate training professional. The world needs both. Plus, if you are entrepreneurial, there are ample opportunities to consult as a data science educator. Businesses realize they need to know data science, and they are looking for training.

First Step: Start a video or blog with tutorials

Data Engineer

A data engineer is typically more interested in systems than just the machine learning. Data engineers are typically strong with computer science fundamentals. They love to build things that themselves and others can use. A good data engineer can also spend a lot of time cleaning data as well.

First Step: Build a solution (hint: Cortana Intelligence Solutions)

Data Programmer

Do you love to program? If so, you just might fall into this category. Data science has many needs for programmers. Everything from cleaning data to building data products needs programming.

First Step: Be on Github

Statistical Modeling (Machine Learning)

Some people just love the statistical modeling and machine learning. They love to tune models and squeeze the last bit of predictive power from a data set. If you love talking about regression, trees, random forests, AUC, cross-validation and boosting; then this specialty is most likely for you.

First Step: Enter Kaggle competitions.

Data Science Manager

If you are bossy, it does not mean you will make a good manager. The best managers know how to build strong teams and get out of the way. Managers will provide help and overall direction for projects. Plus, he/she should have a solid understanding of how data can help shape a team’s decisions.

First Step: Organize a group to help a non-profit analyze data (Similar to what DataKind does)

Data Science Researcher

A researcher is interested in pushing the boundaries of data science. Are you interested in creating your own machine learning algorithms? Do you want to build the next great data framework? Do you think data science can achieve something no one else has thought to try? If so, being a researcher is for you.

First Step: Go to graduate school

Data Science Unicorn

A data science unicorn is someone that knows all the specialties above and more. A unicorn understands all the topics of data science. Being a unicorn is not attainable for everyone, but a few people have become unicorns. If you think you can be a unicorn, go for it.

First Step: Start at visualization above

In Conclusion,

Simple: Pick a specialty and Go Make a Difference!


This post is based upon a talk I gave at Winona State University just before MUDAC. The original title was Go After Your Data Science Dreams.

Google Announces Public Data Sets

Somewhat lost in the hype of Google’s Cloud Machine Learning announcement (which is itself neat), was the release of Google’s Public Data Sets.

I think this has been previously happening, but now Google has an official location for these public data sets stored in BigQuery. You can:

  • Access and use the data in your applications
  • Request Google to host your own public data set

It will be fun to watch this site expand with more public datasets. Happy Exploration!

Midwest Undergraduate Data Analytics Competition

The 2016 Midwest Undergraduate Data Analytics Competition (MUDAC) will be held at Winona State University in Winona, Minnesota on April 2 and 3.

  • What is MUDAC?
    MUDAC is an intense 2-day analytics competition aimed at undergraduate students. Teams compete to solve a problem posed by an external organization.
  • Who can compete?
    Teams of 3 to 4 undergraduate students attending a school in Minnesota, Wisconsin, Iowa, Illinois, North Dakota, or South Dakota
  • Why attend MUDAC?
    • A fun learning experience
    • Friendly competition
    • Teamwork
    • Meet others with similar inteests
    • Learn about data science/analytic careers
    • Practice preparing and giving a presentation
    • Cash prizes for winning
    • Door prizes

The competition also includes a panel discussion with some local data professionals. I am honored to be one of those panelists.

If you attend or teach at a university in the upper Midwest and you are interested in data science, you should strongly consider bringing a team to MUDAC. I hope to see you there.

Learning To Be A Data Scientist