Category Archives: Learn Data Science

This is a category for all things related to learning data science.

Data Science Education Week

Yes, I just declared this week as Data Science Education Week. As far as I know, I have no authority to do such a thing. Hey, I did it anyway. All week I will be posting information related to data science education. So, pull up a chair and get ready for some serious data science education topics.

Data Science Important To Tech Innovations

Inc Magazine recently reported on 6 Major Tech Innovations for 2012. I was surprised to see how much influence data science has on the list. I would say 4 of the 6 technologies are data science related.

  1. Predictive Technology – Data science and statistics are the driving force here, along with all the data that is now being collected.
  2. Social Analytics – Yet another example of data science and statistics.  Analyzing social networks has become a hot topic.
  3. Speech for Business – This is putting Natural Language Processing (NLP) to work.  The best example is Siri from the iPhone.  Siri is constantly learning because of all the data collected when answering questions.  This area is covered with data science topics (NLP, machine learning, data storage/retrieval).  The end goal is to be able to use data science to help the business.
  4. Business Ready Storage – This might be more hardware related, but data needs to be stored quickly and securely.  It also needs to be available and in a format useful for reporting.

Can you think of other ways data science is being used for innovation?

Startup Idea: A Search Engine For Recent News

The Problem

I have a problem. This is a problem that I would guess many other people have. I have access to way too much information. I want less, but I also want the best and newest.


How do I find the best and newest information on any topic? There is a lot of new information every day. I spend a lot of time searching the internet for quality information on data science. I would love to be able to visit a page and get the latest and greatest information on data science, statistics, bigdata, and machine learning. I would be hoping to get news articles and/or blog posts from the last couple of days.

Possible Solutions

Here is a list of products I have seen and why they are not exactly what I am looking for:

  • – This site is close.  It emails a daily list of top articles, but the articles only come from my twitter followers.  The problem is: I may not be following the right people.  They did just do an excellent blog series about how people get news, so they may be working on something right now.
  • – This seems very promising, but not all the content is new. It also only updates 1 or 2 times per day. The slow updates make it difficult to easily and quickly get the latest news. After a few weeks of training your searches and parameters, this might be really good for getting news on a daily basis. The big problem for me is that I have to wait until the next day to see if my search parameters are correct. It is not built to handle ad hoc queries. Here is the paper I am working on, Data Science 101 News.
  • Storify – This is a way to create a collection of information from various social networks.  Storify makes it really easy to find social media mentions, but it is not automated and doesn’t save much time.  Not that it matters much to me, but the final product is not real pretty.
  • Summify – Recently bought by Twitter, so the future is uncertain.
  • TweetedTimes – Same problem as the first product in this list: it only generates information based upon people I follow on Twitter.
  • Google News – This is good for searching, but I think a better one exists or needs to be created.  Plus, the Google Privacy issues are a concern here.

Better Solution

This whole concept of filtering/searching/rating news sounds like a data science problem itself. For starters, given a topic, what information has been tweeted the most? What new information has spread the quickest? This approach could be expanded to include Facebook likes and Google +1’s. There are also numerous other APIs that could be included. What I really want is a product that will do this in real time (or near real time). I want to be able to enter my search terms and get a list of the most recent quality information pertaining to those terms. I guess what I want is a search engine for recent news (but not Google).
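To make the two starter questions concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `rank_articles` function, the article fields, the 48-hour window, and the sample data are all hypothetical, and no real Twitter, Facebook, or Google API is called. It drops articles older than the window, then ranks what is left either by total shares ("tweeted the most") or by shares per hour since publication ("spread the quickest").

```python
from datetime import datetime, timedelta

def rank_articles(articles, now, key="shares", window_hours=48):
    """Rank recent articles by total shares or by shares per hour.

    `articles` is a list of dicts with hypothetical fields:
    "title", "published" (datetime), and "shares" (int).
    """
    cutoff = now - timedelta(hours=window_hours)
    scored = []
    for article in articles:
        if article["published"] < cutoff:
            continue  # older than the window, so not "recent news"
        # Treat anything younger than one hour as one hour old,
        # so a brand-new article cannot get an inflated rate.
        age_hours = max((now - article["published"]).total_seconds() / 3600, 1.0)
        scored.append({
            "title": article["title"],
            "shares": article["shares"],
            "shares_per_hour": article["shares"] / age_hours,
        })
    return sorted(scored, key=lambda s: s[key], reverse=True)

# Made-up sample data, for illustration only.
now = datetime(2012, 4, 2, 12, 0)
articles = [
    {"title": "Intro to Hadoop", "published": now - timedelta(hours=40), "shares": 200},
    {"title": "New R package", "published": now - timedelta(hours=2), "shares": 60},
    {"title": "Old stats post", "published": now - timedelta(hours=100), "shares": 900},
]

most_shared = [a["title"] for a in rank_articles(articles, now, key="shares")]
fastest = [a["title"] for a in rank_articles(articles, now, key="shares_per_hour")]
print(most_shared)
print(fastest)
```

The two orderings can disagree: an older article may have more total shares while a newer one is spreading faster. A real product would likely need to blend the two signals, and it would also need some way to judge quality, which this sketch does not attempt.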

Does anyone know if a product like this exists or is anyone working to build a similar product?

Update: This idea also goes along with Paul Graham’s Ambitious Startup Idea #1 – A New Search Engine.

Making Data Tell A Story

Jer Thorp gives a nice TED Talk about making data more useful to humans. Jer is really good at making data tell a story. Watch the video to learn some more about what he does with data.

He also mentions an app he helped build. It is called OpenPaths, and it allows iPhone users to freely share their location information with researchers.

Python For Big Data

Travis Oliphant, the CEO of Continuum Analytics, gives a nice presentation on Python and bigdata. He argues that Python is frequently used in bigdata, but it does not get a lot of attention from the bigdata community. The bigdata community only wants to talk about Hadoop. Travis would like to see Python have a larger role in bigdata.

He also provides a quick overview of NumPy and SciPy. Below is a video of his presentation.

How do you feel about Python and bigdata? Do you use Python for bigdata?

Data Scientist Job Analysis

A few weeks ago, I posted 16 Companies Hiring Data Scientists Right Now. I decided to do a bit of analysis on the job posts, so I took all the job postings and compiled them into one file.

The Problem

I wanted to determine 2 things:

  1. What words occurred most often in the job posts?
  2. What words occurred in the most job posts?

The questions are similar but not identical: the first counts every occurrence of a word, while the second counts how many job posts contain the word at least once. I wrote some Java code to answer those questions. The raw results are posted here.
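The difference between the two counts is easy to see in a small sketch. This is not the Java code mentioned above. It is a hypothetical Python illustration, and the `tokenize` and `word_counts` helpers and the three sample postings are made up for the example. It only handles single words, so a phrase such as "machine learning" would need extra handling.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def word_counts(postings):
    """Return two Counters: total occurrences per word, and the
    number of postings that contain each word at least once."""
    total = Counter()
    per_posting = Counter()
    for posting in postings:
        words = tokenize(posting)
        total.update(words)             # every occurrence counts
        per_posting.update(set(words))  # each word counts once per posting
    return total, per_posting

# Made-up postings, for illustration only.
postings = [
    "Data scientist wanted. Must love data and Hadoop.",
    "Analyze data with SQL and statistics.",
    "Statistics PhD preferred; Hadoop experience a plus.",
]

total, per_posting = word_counts(postings)
print(total["data"], per_posting["data"])
```

In this sample, "data" appears three times in total but in only two of the three postings, which is exactly the distinction between the two questions.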


Honestly, nothing too surprising showed up. Not counting the common English words (and, to), the word data was the most popular. It occurred 167 times, and it occurred at least once in all 16 job postings. That makes sense; a data scientist should know about data. I thought Hadoop would occur in all job descriptions, but it only appeared in 11 of the 16. Here are some other words I found interesting:

  • statistical occurred 29 times and in 10 job descriptions
  • analysis occurred 46 times and in 13 job descriptions
  • analytics occurred 22 times and in 6 job descriptions
  • statistics occurred 16 times and in 9 job descriptions
  • machine learning occurred 14 times and in 9 job descriptions
  • phd occurred 11 times and in 11 job descriptions
  • sql occurred 12 times and in 10 job descriptions

On an interesting note, Python and R occurred in more job postings than Java (2 more to be exact).

Does anything in the results strike you as interesting?

Highlights of the White House BigData Research Initiative

Yesterday’s announcement included a few main highlights:

  • NIH will make 200TB of human genetic variation data freely available on Amazon Web Services
  • NSF will provide $2M in support of undergraduate education for studying graphical and visualization techniques of bigdata
  • DoD will announce some prize competitions in the coming months
  • Numerous other projects to increase data analysis across the US Government

Here is one note I did find interesting (or funny). One of the speakers mentioned the need for workers with bigdata skills. He said the workforce needs 159,000 workers with data skills. Then a couple of minutes later he mentioned a need for data-savvy managers. He said the workforce will need 1,500,000 managers with data knowledge. I just thought those 2 numbers did not match up well.


White House is Announcing $200M BigData Research Initiative

Later today (2-3:45 pm ET), the White House will announce a $200M BigData Research Initiative. Appropriately, it is being named the “Big Data Research and Development Initiative.”

The announcement will be broadcast live on Science360.

See this PDF for a listing of bigdata projects within the US Government.

I am excited to see how this will affect the education and training of data scientists.
What are your thoughts? Is this a good idea?

List of Free Courses Online

Having trouble keeping track of what schools offer what courses for free online? Problem solved!

Class Central maintains an updated list of courses from Coursera (Stanford), Udacity, MITx, and others as they become available. Not all of the courses are related to data science, but I still thought it was valuable to share the link.

Check it out and start learning.

Use Data Science to Help The World (Data Without Borders)


Jake Porway started Data Without Borders because he attended a hack-a-thon and the groups came up with apps that didn’t really better the world very much. I believe he used the word “unfulfilling” to describe the apps. He decided to create a way to provide organizations (government or non-profit) with access to data scientists. His thinking goes like this. There are lots of data scientists that love to work with data. There are great organizations with lots of data. If the two can be matched together, what amazing things can be done? Data Without Borders hopes to find out.

Data Without Borders organizes a bunch of DataDives, which are weekend hack-a-thons that match up a group of data scientists and developers with data.

Jake concluded with some wonderful remarks:

What if we started using data not just to make better decisions about what kind of movies we wanted to see? What if we started using data to make better decisions about what kind of a world we wanted to see?

What is Data Without Borders looking for?

Jake’s Presentation at PopTech