
Coursera Class on Recommender Systems

In about a month, the course Introduction to Recommender Systems will begin on Coursera. It is offered by the Computer Science and Engineering Department at the University of Minnesota.

The course is 14 weeks long and has 2 tracks:

  1. Programming Track – 6 different recommender systems will be programmed
  2. Concept Track – great for people who want to know about recommender systems, but don’t want to program

Recommender systems are an important part of data science, and this course looks to provide an excellent in-depth overview of the topic.

How To Build Data Science Teams?

Companies everywhere are struggling to assemble data science teams. Here are a couple of videos to help answer the following questions and more.

  • How do you assemble a team?
  • What skills do you need?
  • Where do you look for data scientists?

DJ Patil, one of the stars of the data science world, answers a bunch of great questions in this talk. It is a couple of years old, but still relevant.

What are the Characteristics to look for in a Data Scientist?

  • Curiosity
  • Passion for playing with data
  • History of having to manipulate data to solve problems

What are the Key Data Science Skills?

  • Finding Data Sources
  • Working with large data sets despite constraints
  • Cleaning data
  • Merging data sets
  • Visualization
  • Building tools for others to use

Where to look for data science team members?

  • Internal
  • Interns
  • Other fields (physics, neurology, sciences)
  • Academic counterparts

Principles for Data Science Talent

  • Would we be willing to work on a startup together?
  • Can you knock the socks off in 90 days?
  • Will you be doing amazing things?

David Dietrich of EMC recently added some insight to DJ’s points about building data science teams. His philosophy: building data science teams is not the goal; developing data science capabilities is. The structure is not nearly as important as the work being done, and different organizations can succeed at data science in different ways. In the video, he lays out the pros and cons of each of the following strategies.

Strategies to Assemble Data Science Capabilities

  1. Transforming – reposition/add/modify existing teams such as a reporting team
  2. Creating – just start from scratch
  3. As a Service – consultants or websites, new ones are appearing every day
  4. Crowdsourcing – competitions like the Netflix prize or Kaggle

Now, go start developing data science capabilities!

Choosing a Data Science Graduate Program

Because of the large list of Colleges with Data Science Degrees, I receive a number of email inquiries about choosing a program. I have not attended any of the programs, so I am not sure how qualified I am to provide guidance. Still, I will do my best to share the information I do have.

Originally, the list started out with 5 schools. Now the list is well over 100 schools, so I have not been able to keep up with all the intricate details of every program. There are not very many undergraduate options, and the list only contains a few PhD programs, so the information here will focus on pursuing a master's degree.

Start by asking 2 questions:

  1. What are my current data science skills?
  2. What are my future data science goals?

Those 2 questions can provide a lot of guidance. Understand that data science consists of a number of different topic areas:

  1. Mathematical Foundation (Calculus/Matrix Operations)
  2. Computing (DB, programming, machine learning, NoSQL)
  3. Communication (visualization, presentation, writing)
  4. Statistics (regression, trees, classification, diagnostics)
  5. Business (domain specific knowledge)

With those two lists in hand, this is where things get cloudy. Everyone brings a different set of existing skills, and everyone has different future goals. Here are a few scenarios that might clear things up.

Data Scientist

The most common approach is to attempt to build knowledge in all 5 topic areas. If this is your goal, find the topic areas where you are weakest and target a graduate program that will help you bolster those skills. In the end, you will come out with a broad range of highly sought-after skills.

Specialist

A different approach is to select one topic area and get really, really good at it. For example, maybe you want to be an expert on machine learning. If that is your goal, then a traditional computer science graduate program may be the best fit. In the end, you will be well-suited to be an effective member of a data science team or to pursue a PhD.

Data Manager

A third, also common, approach comes from people who want to help fill the expected shortage of 1.5 million data-savvy managers. These people do not necessarily want to know the deep details of the algorithms, but they would like an understanding of what the algorithms can do and when to use which algorithm. In this case, a graduate program from a business school (MBA) might be a good choice. Just make sure the program also covers the non-business topics of data science.

Example

I think NYU is the best example of a school that can help a person achieve just about any data science goal. The NYU program is a university-wide initiative, so the program is integrated with many departments (math, CS, Stats, Business, and others). Therefore, a student could possibly tailor a program to reach a variety of future goals. Plus, New York has a lot of companies solving interesting data science problems.

Conclusion

There you have it. It does not narrow the choices down, but it should help to provide some guidance. Other factors to consider are length of a program and/or location.

Good luck with your decision, and feel free to leave a comment if you have any good/bad experiences with any of the particular graduate programs.

Data Science is more than just Statistics: Part 2

A few days ago I posted Data Science is more than just Statistics. I did not feel the post was complete so I am adding a part 2.

Data Collection

I received a couple of comments about data scientists being involved with the collection of data. Yes, I agree that is true. In order for data products to work, the correct data has to be collected. I probably did not explain this very well, but the approach to data collection is different. Typically, a statistics project will run some sort of rigorous experiment to collect data. The experiment will be very controlled and well-understood. In contrast, a data science project will collect data from existing systems, new systems, sites on the web, sensors, and various other places. Most of the data does not come from a very controlled environment. By controlled, I mean a specific number of users, a specific type of user, a specific time frame, and/or set constraints on the environment. This conglomeration of data is one of the reasons data scientists deal with such large datasets (there are other reasons too, such as cheap hardware). I think it is more common for a data science project to deal with the whole population rather than a sample of the population.

Importance of Statistics

In the previous post, I failed to emphasize the importance of statistics. I never said that statistics is not important to data science. Statistics is a critical element of data science. However, if you only know and study statistics, I feel you are missing other key elements of data science.

Thus, I stand by my previous statement, “if you just want to do statistics, join a statistics graduate program. If you want to do data science, join a data science program.” Stay tuned for a post on choosing a data science graduate program.

Do you agree/disagree?

Data Science is more than just Statistics

I occasionally get comments and emails similar to the following question:

Should I attend a graduate program in data science or statistics?

I believe there is some concern about the buzzword data science. People are unsure about getting a degree in a buzzword. I understand that. However, whether the term data science lasts or not, the techniques in data science are not going away.

Anyhow, this post is not intended to argue the merits of the term data science. This post is about the comparison of statistics to data science. They are not the same thing. The approach to problems is different from the very beginning.

Statistics

This is a common approach to a statistics problem. A problem is identified. Then a hypothesis is generated. In order to test that hypothesis, data needs to be collected via a very structured and well-defined experiment. The experiment is run and the hypothesis is validated or invalidated.
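
To make that workflow concrete, here is a minimal sketch of one way to test a hypothesis on data from a controlled experiment. It uses a permutation test, which is just one of many possible tests and my own choice for illustration. The function name, the measurements, and the number of permutations are all made up.

```python
import random

def permutation_test(control, treatment, n_permutations=10_000, seed=0):
    """Estimate a one-sided p-value for 'treatment mean > control mean'."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = sum(treatment) / len(treatment) - sum(control) / len(control)
    pooled = list(control) + list(treatment)
    n_treat = len(treatment)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable,
        # so reshuffle the pooled values and split them again.
        rng.shuffle(pooled)
        shuffled_treat = pooled[:n_treat]
        shuffled_ctrl = pooled[n_treat:]
        diff = (sum(shuffled_treat) / n_treat
                - sum(shuffled_ctrl) / len(shuffled_ctrl))
        if diff >= observed:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_permutations

# Hypothetical measurements from a controlled experiment (made-up numbers).
control = [12.1, 11.8, 12.4, 12.0, 11.9, 12.2]
treatment = [12.9, 13.1, 12.7, 13.4, 12.8, 13.0]

observed_diff, p_value = permutation_test(control, treatment)
print(f"observed difference: {observed_diff:.2f}, p-value: {p_value:.4f}")
```

A small p-value suggests the observed difference would be unlikely if the treatment had no effect. The threshold for "small" is something you would decide before running the experiment.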

Data Science

On the other hand, the data science approach is slightly different. All of this data has already been collected or is currently being collected, so what can be predicted from that data? How can existing data be used to help sell products, increase engagement, reach more people, etc.?
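
As a rough illustration of "what can be predicted from that data," here is a small sketch that fits a straight line to data that already exists and uses it to predict a new value. The numbers and variable names are invented, and a least-squares line is only the simplest possible stand-in for whatever model a real project would use.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data pulled from an existing system (made-up numbers):
# marketing emails sent per week and purchases in that week.
emails_sent = [1, 2, 3, 4, 5]
purchases = [3, 5, 8, 9, 11]

slope, intercept = fit_line(emails_sent, purchases)

# Predict purchases for a week with 6 emails, a value not in the data.
predicted = slope * 6 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, predicted={predicted:.2f}")
```

Notice that no experiment was designed here. The data was already sitting in a system, and the only question was what could be predicted from it.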

Conclusion

Overall, statistics is more concerned with how the data is collected and why the outcomes happen. Data science is less concerned about collecting data (because it usually already exists) and more concerned about what the outcome is. Data science wants to predict that outcome.

Thus, if you just want to do statistics, join a statistics graduate program. If you want to do data science, join a data science program.

Thoughts/Questions

What are your thoughts? Agree/Disagree?

Deep Learning – A Term To Know

Deep Learning is a new term that is starting to appear in the data science/machine learning news.

What is Deep Learning?

According to DeepLearning.net, the definition goes like this:

Deep Learning is a new area of Machine Learning research, which has been introduced with the objective of moving Machine Learning closer to one of its original goals: Artificial Intelligence.

Wikipedia provides the following definition:

Deep learning is a set of algorithms in machine learning that attempt to learn layered models of inputs, commonly neural networks. The layers in such models correspond to distinct levels of concepts, where higher-level concepts are defined from lower-level ones, and the same lower-level concepts can help to define many higher-level concepts.

Deep Learning is sometimes referred to as deep neural networks since much of deep learning focuses on artificial neural networks. Artificial neural networks are a technique in computer science modelled after the connections (synapses) of neurons in the brain. Artificial neural networks, sometimes just called neural nets, have been around for about 50 years, but advances in computer processing power and storage are finally allowing neural nets to improve solutions for complex problems such as speech recognition, computer vision, and Natural Language Processing (NLP).
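
To give a feel for what "layered" means, here is a minimal sketch of a forward pass through a tiny two-layer neural network. The weights are hand-picked for illustration rather than learned, the training step is left out entirely, and the function names are my own. Real deep learning systems are vastly larger and built with specialized libraries.

```python
import math

def dense_layer(inputs, weights, biases):
    """One layer: a weighted sum of the inputs per neuron, then a sigmoid."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(1.0 / (1.0 + math.exp(-total)))
    return outputs

def forward(inputs, layers):
    """Feed the inputs through each layer in turn."""
    activations = inputs
    for weights, biases in layers:
        # Each layer only sees the previous layer's outputs, which is how
        # higher-level features get built from lower-level ones.
        activations = dense_layer(activations, weights, biases)
    return activations

# Hand-picked (not learned) weights, purely for illustration.
layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),  # 3 inputs -> 2 hidden
    ([[1.0, -1.0]], [0.0]),                               # 2 hidden -> 1 output
]

output = forward([1.0, 0.5, -1.0], layers)
print(output)
```

Stacking more layers in the list makes the network "deeper," and learning consists of adjusting the weights and biases so the final output matches known examples.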

Hopefully, this blog post provides some inspiration and useful links to help you learn more about deep learning.

How is Deep Learning being applied?

The following talk, Tera-scale Deep Learning, by Quoc V. Le of Stanford gives some indication of the size of problems to be tackled. The talk discusses work being done on a cluster of 2000 machines and more than 1,000,000,000 parameters.
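
For a sense of how a parameter count reaches a billion, here is a back-of-the-envelope sketch that counts the weights and biases in a stack of fully connected layers. The layer sizes are made up so the total lands near one billion. They are not the architecture from the talk, which I am not attempting to reproduce here.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per input/output pair
        total += n_out         # one bias per output neuron
    return total

# Hypothetical layer sizes, chosen only to reach roughly a billion parameters.
layer_sizes = [40_000, 20_000, 10_000]
n_params = count_dense_parameters(layer_sizes)

# Assuming 4 bytes per parameter (32-bit floats), estimate the memory needed
# just to store the parameters.
gigabytes = n_params * 4 / 1e9

print(f"{n_params:,} parameters, about {gigabytes:.1f} GB as 32-bit floats")
```

Even this toy calculation shows why a single machine struggles with models of this size and why the work is spread across a cluster.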