Tag Archives: statistics

Statistics.com Educational Programs

Statistics.com is an online institute for statistics education. They offer a variety of courses related to statistics and data science.

I have not personally taken any courses from statistics.com, but I have heard good things about the programs and courses.

Choosing a Data Science Graduate Program

Because the list of Colleges with Data Science Degrees is so large, I receive a number of email inquiries with questions about choosing a program. I have not attended any of the programs, and I am not sure how qualified I am to provide guidance. Anyhow, I will do my best to share what information I do have.

Originally, the list started out with 5 schools. Now the list is well over 100 schools, so I have not been able to keep up with all the intricate details of every program. There are not very many undergraduate options, and the list only contains a few PhD programs, so the information here will be focused on pursuing a master's degree.

Start by asking 2 questions:

  1. What are my current data science skills?
  2. What are my future data science goals?

Those 2 questions can provide a lot of guidance. Understand that data science consists of a number of different topic areas:

  1. Mathematical Foundation (Calculus/Matrix Operations)
  2. Computing (DB, programming, machine learning, NoSQL)
  3. Communication (visualization, presentation, writing)
  4. Statistics (regression, trees, classification, diagnostics)
  5. Business (domain specific knowledge)

This is where things get cloudy. Everyone brings a different set of existing skills, and everyone has different future goals. Here are a few scenarios that might clear things up.

Data Scientist

The most common approach is to attempt to build knowledge in all 5 topic areas. If this is your goal, find the topic areas where you are weakest and target a graduate program that will help you bolster those weak skills. In the end, you will come out with a broad range of highly desired skills.

Specialist

A different approach is to select one topic area and get really, really good at it. For example, maybe you want to be an expert on machine learning. If that is your goal, then a traditional computer science graduate program may be the best fit. In the end, you will be well-suited to be an effective member of a data science team or to pursue a PhD.

Data Manager

A third and also common approach is taken by people who want to help fill the expected void of 1.5 million data-savvy managers. These people do not necessarily want to know the deep details of the algorithms, but they would like an understanding of what the algorithms can do and when to use which algorithm. In this case, a graduate program from a business school (MBA) might be a good choice. Just make sure the program also covers the non-business topics of data science.

Example

I think NYU is the best example of a school that can help a person achieve just about any data science goal. The NYU program is a university-wide initiative, so the program is integrated with many departments (math, CS, Stats, Business, and others). Therefore, a student could possibly tailor a program to reach a variety of future goals. Plus, New York has a lot of companies solving interesting data science problems.

Conclusion

There you have it. It does not narrow the choices down, but it should help to provide some guidance. Other factors to consider are length of a program and/or location.

Good luck with your decision, and feel free to leave a comment if you have any good/bad experiences with any of the particular graduate programs.

Data Science is more than just Statistics: Part 2

A few days ago I posted Data Science is more than just Statistics. I did not feel the post was complete, so I am adding a part 2.

Data Collection

I received a couple of comments about data scientists being involved with the collection of data. Yes, I would agree that is true. In order for data products to work, the correct data has to be collected. I probably did not explain this very well, but the approach to data collection is different. Typically, a statistics project will run some sort of rigorous experiment to collect data. The experiment will be very controlled and well-understood. In contrast, a data science project will collect data from existing systems, new systems, sites on the web, sensors, and various other places. Most of the data does not come from a very controlled environment. By controlled, I mean a specific number of users, a specific type of user, a specific time frame, and/or set constraints on the environment. This conglomeration of data is one of the reasons data scientists deal with such large datasets (there are other reasons too, such as cheap hardware). I think it is more common for a data science project to deal with the whole population rather than a sample of the population.

Importance of Statistics

In the previous post, I failed to emphasize the importance of statistics. I never said that statistics is not important to data science. Statistics is a critical element of data science. However, if you only know and study statistics, I feel you are missing other key elements of data science.

Thus, I stand by my previous statement, “if you just want to do statistics, join a statistics graduate program. If you want to do data science, join a data science program.” Stay tuned for a post on choosing a data science graduate program.

Do you agree/disagree?

Data Science is more than just Statistics

I occasionally get comments and emails similar to the following question:

Should I attend a graduate program in data science or statistics?

I believe there is some concern about the buzzword data science. People are unsure about getting a degree in a buzzword. I understand that. However, whether the term data science lasts or not, the techniques in data science are not going away.

Anyhow, this post is not intended to argue the merits of the term data science. This post is about the comparison of statistics to data science. They are not the same thing. The approach to problems is different from the very beginning.

Statistics

This is a common approach to a statistics problem. A problem is identified. Then a hypothesis is generated. In order to test that hypothesis, data needs to be collected via a very structured and well-defined experiment. The experiment is run and the hypothesis is validated or invalidated.
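To make the workflow above concrete, here is a minimal sketch of the "run the experiment, then test the hypothesis" step. It uses a two-sample permutation test, which is only one of many possible tests and is my choice for illustration. The function name `permutation_p_value` and the measurements in `control` and `treatment` are hypothetical, not data from any real experiment.

```python
import random
import statistics


def permutation_p_value(control, treatment, n_permutations=10000, seed=0):
    """Estimate a two-sided p-value for a difference in group means.

    Null hypothesis: the group labels do not matter, so shuffling them
    should produce differences as large as the observed one fairly often.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(treatment) - statistics.fmean(control))
    pooled = list(control) + list(treatment)
    n_control = len(control)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled_control = pooled[:n_control]
        shuffled_treatment = pooled[n_control:]
        diff = abs(statistics.fmean(shuffled_treatment)
                   - statistics.fmean(shuffled_control))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations


# Hypothetical measurements from a small controlled experiment.
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3]
treatment = [12.9, 13.1, 12.7, 13.4, 12.8, 13.0]

p_value = permutation_p_value(control, treatment)
print(f"estimated p-value: {p_value:.4f}")
```

A small p-value suggests the observed difference would be unlikely if the treatment had no effect. The conclusion is only as good as the experiment that produced the data, which is why the controlled collection step matters so much in this approach.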

Data Science

On the other hand, the data science approach is slightly different. All of this data has already been collected or is currently being collected, so what can be predicted from that data? How can existing data be used to help sell products, increase engagement, reach more people, etc.?

Conclusion

Overall, statistics is more concerned with how the data is collected and why the outcomes happen. Data science is less concerned about collecting data (because it usually already exists) and more concerned about what the outcome is. Data science wants to predict that outcome.

Thus, if you just want to do statistics, join a statistics graduate program. If you want to do data science, join a data science program.

Thoughts/Questions

What are your thoughts? Agree/Disagree?

Making Data tell a Story

I don’t think anybody does it better than Hans Rosling. In the following video he helps to explain population growth, child mortality, and fossil fuel usage based upon wealth. I love how he uses toy blocks and chips to help visualize his point.

See the original post from the Guardian, Hans Rosling: the man who’s making data cool

A very nice visualization of the Central Limit Theorem

The blog post, Central Limit Theorem Visualized in D3, was posted last week.

The post does 2 very nice things. First, it provides a nice visual of what the central limit theorem means. Second, it displays the wonderful power of the JavaScript library, D3.
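For readers who want to see the theorem in numbers rather than pictures, here is a minimal simulation sketch. It is not the code from the D3 post. It assumes an exponential distribution as the skewed starting point, and the helper names (`sample_means`, `skewness`, `draw_exponential`) are my own for illustration.

```python
import random
import statistics


def sample_means(draw, sample_size, n_samples, seed=0):
    """Return the mean of each of n_samples samples of size sample_size."""
    rng = random.Random(seed)
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(n_samples)
    ]


def skewness(values):
    """Population skewness: roughly 0 for a symmetric, bell-shaped set."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - mean) / sd) ** 3 for v in values)


def draw_exponential(rng):
    """One draw from an exponential distribution with mean 1 (very skewed)."""
    return rng.expovariate(1.0)


# Means of samples of size 1 are just the raw, skewed draws.
means_1 = sample_means(draw_exponential, 1, 5000)
# Means of samples of size 50 should look much closer to a bell curve.
means_50 = sample_means(draw_exponential, 50, 5000)

print("skewness, sample size 1: ", round(skewness(means_1), 2))
print("skewness, sample size 50:", round(skewness(means_50), 2))
print("spread,   sample size 50:", round(statistics.pstdev(means_50), 3))
```

The skewness should drop sharply as the sample size grows, and the spread of the means should shrink toward the population standard deviation divided by the square root of the sample size (about 0.14 here).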

Probabilistic Programming and Bayesian Methods for Hackers Online Book

Probabilistic Programming and Bayesian Methods for Hackers is an open source online book. The book is developed with IPython, so it can be read in a variety of formats: web, PDF, or locally with IPython installed.

Also, contributions are welcome via the GitHub repository for the book (or you can email the authors).

This is the first IPython project I have really looked at, and IPython looks very promising.

Win-Vector Blog » Data Science, Machine Learning, and Statistics: what is in a name?


This is an excellent write-up on the differences between:

  • Statistics
  • Machine Learning
  • Data Mining
  • Informatics
  • Big Data
  • Predictive Analytics
  • Data Science

Best Free Data Mining Tools

I recently saw the article, The Best Data Mining Tools You Can Use for Free in Your Company. It contains a very brief description of each of the following tools.

  1. RapidMiner
  2. RapidAnalytics
  3. Weka
  4. PSPP
  5. KNIME
  6. Orange
  7. Apache Mahout
  8. jHepWork
  9. Rattle

See The Best Data Mining Tools You Can Use for Free in Your Company for more details, links, and pictures.

10 R packages

Yhat, a new predictive modeling startup, wrote up a nice blog post about 10 R Packages I wish I knew about earlier. It is worth reading through the list.


Special Thanks to Mark Nickel for pointing out this link.