Tag Archives: statistics

Building Data Science Skills as an Undergraduate

While there are a growing number of universities that offer undergraduate data science degrees, for one reason or another those programs may not be perfect for everyone interested in data science. So, what do you do if you attend a school that does not offer a data science degree? This is a question frequently asked of me, so I thought I would elaborate on my typical response.

You Cannot Know It All

First off, you will never know all there is to know about data science. The field is vast and contains many sub-fields. Thus, as an undergraduate, a good plan is to learn the fundamentals. Then expand your knowledge/expertise as your education and career continue. Data Science is evolving rapidly and it requires continual learning. Hopefully, this is one of the reasons you are interested in the field.

My Recommended Approach

A good plan is to major in computer science or statistics and minor in the other. If your school doesn’t have either of those major, then take as many of those classes as you can. Next, choose a domain specific area such as business, chemistry, psychology, etc.; and gear your elective classes toward that domain area. This approach will give you a solid base understanding of the statistical and computational underpinnings of data science. You should also be well-prepared to find a job or continue your studies in graduate school.

Also, somewhat related, taking an art class or two might not be a bad idea. Visualization is very important to data science. Understanding color palettes and usage of space on a canvas are concepts that will serve you well. Plus, many people strong in computer science and statistical algorithms are lacking in artistic skills.

Some Enhancements to Your Education

If your location allows, consider attending local meetups. Finally, get involved with whatever projects you can (Kaggle, internships, open source, …).

Do you have any advice for undergraduates looking to study data science? If so, please leave a comment.

Are you and undergraduate with questions? Please ask in the comments below.

Know Your Probability Distributions

In data science and statistics, probability distributions can be very important. I have been meaning to create a listing of them. However, I no longer need to since the fine folks at Cloudera have already created a list at Common Probability Distributions: The Data Scientist’s Crib Sheet.

Learn the distributions and pick a favorite. (My favorite of the common ones is the normal distribution. I also like the Cauchy distribution which is much less common.)

Berkeley Undergrad Data Science Course and Textbook

The University of California at Berkeley has started The Berkeley Data Science Education Program. The goal is to build a data science education program throughout the next several years by engaging faculty and students from across the campus. The introductory data science course is targeting freshman and it is taught from a very applicable and interactive environment. The course videos, slides, labs, and notes are freely available for others to use. The course heavily uses Jupyter. Also, the course textbook is online at Computational and Inferential Thinking: The Foundations of Data Science.

Free Stats book for Computer Scientists

Professor Norm Matloff from the University of California, Davis has published From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science which is an open textbook. It approaches statistics from a computer science perspective. Dr. Matloff has been both a professor of statistics and computer science so he is well suited to write such a textbook. This would a good choice of a textbook for a statistics course targeted at primarily computer scientists. It uses the R programming language. The book starts by building the foundations of probability before entering statistics.

Tips for Future Data Scientists

While preparing a for a recent talk I gave to an undergraduate audience, I started compiling some tips for future data scientists. The tips are intended for students (undergraduate and graduate) or anyone else planning to enter the field of data science.

I asked a few of my data science friends and posted a question on Quora, As a data scientist, what tips would you have for a younger version of yourself?

What follows is a summary of the many tips.

Tips for Data Science

  • Be flexible and adaptable – There is no single tool or technique that always works best.
  • Cleaning data is most of the work – Knowing where to find the right data, how to access the data, and how to properly format/standardize the data is a huge task. It usually takes more time than the actual analysis.
  • Not all building models – Like the previous tip, you must have skills beyond just model building.
  • Know the fundamentals of structuring data – Gain an understanding of relational databases. Also learn how to collect and store good data. Not all data is useful.
  • Document what you do – This is important for others and your future self. Here is a subtip, learn version control.
  • Know the business – Every business has different goals. It is not enough to do analysis just because you love data and numbers. Know how your analysis can make more money, positively impact more customers, or save more lives. This is very important when getting others to support your work.
  • Practice explaining your work – Presentation is essential for data scientists. Even if you think you are an excellent presenter, it always helps to practice. You don’t have to be comfortable in front of an audience, but you must be capable in front of an audience. Take every opportunity you can get to be in front of a crowd. Plus, it helps to build your reputation as an expert.
  • Spreadsheets are useful – Although they lack some of the computational power of other tools, spreadsheets are still widely used and understood by the business world. Don’t be afraid to use a spreadsheet if it can get the job done.
  • Don’t assume the audience understands – Many (non-data science) audiences will not have a solid understanding of math. Most will have lost their basic college and high school mathematics skills. Explain concepts such as correlation and avoid equations. Audiences understand visuals, so use them to explain concepts.
  • Be ready to continually learn – I do not know a single data scientist who has stopped learning. The field is large and expanding daily.
  • Learn the basics – Once you have a firm understanding of the basics in mathematics, statistics, and computer programming; it will be much simpler to continue learning new data science techniques.
  • Be polymath – It helps to be a person with a wide range of knowledge.

Thanks to Chad, Chad, Lee, Buck, and Justin for providing some of the tips.

Statistics for Hackers Slides by Jake VanderPlas

Jake provides an excellent slidedeck about using programming simulation to do statistics. Lots of great information packed into these slides.

Data Scientist vs Data Engineer

As the field of data science continues to grow and mature, it is nice to begin seeing some distinction in the roles of a data scientist. A new job title gaining popularity is the data engineer. In this post, I lay out some of the distinctions between the 2 roles.

Data Scientist vs Data Engineer Venn Diagram
Data Scientist vs Data Engineer Venn Diagram

Data Scientist

A data scientist is responsible for pulling insights from data. It is the data scientists job to pull data, create models, create data products, and tell a story. A data scientist should typically have interactions with customers and/or executives. A data scientist should love scrubbing a dataset for more and more understanding.

The main goal of a data scientist is to produce data products and tell the stories of the data. A data scientist would typically have stronger statistics and presentation skills than a data engineer.

Data Engineer

Data Engineering is more focused on the systems that store and retrieve data. A data engineer will be responsible for building and deploying storage systems that can adequately handle the needs. Sometimes the needs are fast real-time incoming data streams. Other times the needs are massive amounts of large video files. Still other times the needs are many many reads of the data.
In other words, a data engineer needs to build systems that can handle the 3 Vs of big data.

The main goal of data engineer is to make sure the data is properly stored and available to the data scientist and others that need access. A data engineer would typically have stronger software engineering and programming skills than a data scientist.

Conclusion

It is too early to tell if these 2 roles will ever have a clear distinction of responsibilities, but it is nice to see a little separation of responsibilities for the mythical all-in-one data scientist. Both of these roles are important to a properly functioning data science team.

Do you see other distinctions between the roles?

R vs Python, The Great Debate

Recently I have seen blogs/articles claiming Python is the best choice for data science and R is the new language for business. Honestly, both articles are truthful and good. Both Python and R are good. Why do we have to choose? Let’s use both.

Here is my opinion. I prefer R to Python when performing exploratory data analysis. R has so many packages for every possible statistical technique. The plots, although not beautiful by default, are quick and easy to create. However, I prefer Python when I need to pull data from an API or build a software system or website. Python is more than just a statistical analysis tool; it is a complete programming language. I might even end up using Java for a project in the near future.

There does not have to be a clear winner or one single language to use. Use the best tool for the job and get on with your data science. In the end, the world cares more what you produced not whether you used R or Python or something else.

Statistics.com Educational Programs

Statistics.com is an online institute for statistics education. They offer a variety of courses related to statistics and data science.

I have not personally taken any courses from statistics.com, but I have heard good things about the programs and courses.

Choosing a Data Science Graduate Program

Due to the large list of Colleges with Data Science Degrees, I receive a number of email inquires with questions about choosing a program. I have not attended any of the programs, and I am not sure how qualified I am to provide guidance. Anyhow, I will do my best to share what information I do have.

Originally, the list started out with 5 schools. Now the list is well over 100 schools, so I have not been able to keep up with all the intricate details of every program. There are not very many undergraduate options, and the list only contains a few PhD programs, so the information here will be focused on pursuing a masters degree.

Start by asking 2 questions:

  1. What are my current data science skills?
  2. What are my future data science goals?

Those 2 questions can provide a lot of guidance. Understand that data science consists of a number of different topic areas:

  1. Mathematical Foundation (Calculus/Matrix Operations)
  2. Computing (DB, programming, machine learning, NoSQL)
  3. Communication (visualization, presentation, writing)
  4. Statistics (regression, trees, classification, diagnostics)
  5. Business (domain specific knowledge)

After seeing the above lists, this is where things get cloudy. Everyone brings a different set of existing skills, and everyone has different future goals. Here are a few scenarios that might clear things up.

Data Scientist

The most common approach is to attempt to build knowledge in all 5 topic areas. If this is your goal, find the topic areas where you are weakest and target a graduate program to help you bolster those weak skills. In the end, you will come out with a broad range of very desired skills.

Specialist

A different approach is to select one topic area and get really, really good. For example, maybe you want to be an expert on machine learning. If that is your goal, then maybe a traditional computer science graduate program is what is best. In the end, you will be well-suited to be an effective member of a data science team or pursue a PhD.

Data Manager

A third and also common approach is from people that want to help fill the expected void of 1.5 million data-savvy managers. These people do not necessarily want to know the deep details of the algorithms, but they would like an understanding of what the algorithms can do and when to use which algorithm. In this case, a graduate program from a business school (MBA) might be a good choice. Just make sure the program also involves coverage from the non-business topics of data science.

Example

I think NYU is the best example of a school that can help a person achieve just about any data science goal. The NYU program is a university-wide initiative, so the program is integrated with many departments (math, CS, Stats, Business, and others). Therefore, a student could possibly tailor a program to reach a variety of future goals. Plus, New York has a lot of companies solving interesting data science problems.

Conclusion

There you have it. It does not narrow the choices down, but it should help to provide some guidance. Other factors to consider are length of a program and/or location.

Good Luck with your decision, and feel free to leave a comment if you have and good/bad experiences with any of the particular graduate programs.