Tag Archives: data science

College Graduates Not Ready For Big Data

This infographic displays the need for colleges and universities to start preparing more data science graduates.

Data Science Education Week

Yes, I just declared this week as Data Science Education Week. As far as I know, I have no authority to do such a thing. Hey, I did it anyway. All week I will be posting information related to data science education. So, pull up a chair and get ready for some serious data science education topics.

Startup Idea: A Search Engine For Recent News

The Problem

I have a problem. This is a problem that I would guess many other people have. I have access to way too much information. I want less, but I also want the best and newest.


How do I find the best and newest information on any topic? There is a lot of new information every day. I spend a lot of time searching the internet for quality information on data science. I would love to be able to visit a page and get the latest and greatest information on data science, statistics, big data, and machine learning. Ideally, I would get news articles and blog posts from the last couple of days.

Possible Solutions

Here is a list of products I have seen and why they are not exactly what I am looking for:

  • News.me – This site is close. It emails a daily list of top articles, but the articles only come from the people I follow on Twitter. The problem is that I may not be following the right people. They did just do an excellent blog series about how people get news, so they may be working on something right now.
  • Paper.li – This seems very promising, but not all the content is new. It also updates only once or twice per day, which makes it hard to get the latest news quickly. After a few weeks of tuning your searches and parameters, this might be really good for getting news on a daily basis. The big problem for me is that I have to wait until the next day to see if my search parameters are correct. It is not built to handle ad hoc queries. Here is the paper I am working on: Data Science 101 News.
  • Storify – This is a way to create a collection of information from various social networks. Storify makes it really easy to find social media mentions, but it is not automated and doesn’t save much time. Not that it matters much to me, but the final product is not very pretty.
  • Summify – Recently bought by Twitter, so the future is uncertain.
  • TweetedTimes – Same problem as News.me. It only generates information based on the people I follow on Twitter.
  • Google News – This is good for searching, but I think a better one exists or needs to be created. Plus, Google's privacy issues are a concern here.

Better Solution

This whole concept of filtering, searching, and rating news sounds like a data science problem itself. For starters, given a topic, what information has been tweeted the most? What new information has spread the quickest? This approach could be expanded to include Facebook likes and Google +1s. There are also numerous other APIs that could be included. What I really want is a product that will do this in real time (or near real time). I want to be able to enter my search terms and get a list of the most recent quality information pertaining to those terms. I guess what I want is a search engine for recent news (but not Google).
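To make the "spread the quickest" idea a bit more concrete, here is a rough sketch in Python. It is only an illustration under my own assumptions, not a description of any existing product: the `Article` record, the `share_velocity` score (shares per hour since the link first appeared), and the two-day cutoff are all hypothetical, and the share timestamps would have to come from whatever social APIs are actually available.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Article:
    """Hypothetical record for one link collected for a topic."""
    url: str
    first_seen: datetime   # when the link was first shared
    share_times: list      # one timestamp per share (tweet, like, +1, ...)


def share_velocity(article, now):
    """Shares per hour since the article was first seen.

    The age is floored at one hour so brand-new links do not get
    an inflated score from a near-zero denominator.
    """
    hours = max((now - article.first_seen).total_seconds() / 3600.0, 1.0)
    return len(article.share_times) / hours


def rank_recent(articles, now, max_age=timedelta(days=2), top_n=10):
    """Keep articles first seen within max_age, fastest-spreading first."""
    recent = [a for a in articles if now - a.first_seen <= max_age]
    return sorted(recent, key=lambda a: share_velocity(a, now), reverse=True)[:top_n]
```

Sorting by `len(article.share_times)` instead would answer the "tweeted the most" question. Velocity favors newer links, which is closer to what I am after.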

Does anyone know if a product like this exists or is anyone working to build a similar product?

Update: This idea also goes along with Paul Graham’s Ambitious Startup Idea #1 – A New Search Engine.

Use Data Science to Help The World (Data Without Borders)


Jake Porway started Data Without Borders after he attended a hack-a-thon where the groups came up with apps that didn’t really better the world very much. I believe he used the word “unfulfilling” to describe the apps. He decided to create a way to provide organizations (government or non-profit) with access to data scientists. His thinking goes like this: there are lots of data scientists who love to work with data, and there are great organizations with lots of data. If the two can be matched together, what amazing things can be done? Data Without Borders hopes to find out.

Data Without Borders organizes a bunch of DataDives, which are weekend hack-a-thons that match up a group of data scientists and developers with data.

Jake concluded with some wonderful remarks:

What if we started using data not just to make better decisions about what kind of movies we wanted to see? What if we started using data to make better decisions about what kind of a world we wanted to see?

What is Data Without Borders looking for?

Jake’s Presentation at PopTech

Think Stats – An Online Statistics Book For Programmers

Previously I mentioned that online statistics learning resources are not abundant.

Well, here is a new online book for learning statistics. It is geared towards programmers, and it looks to be a great fit for people wanting to learn data science.  Here is a small excerpt from the Preface:

It emphasizes the use of statistics to explore large datasets.

I have only had time to quickly browse the book, but it looks quite good.

Think Stats: Probability and Statistics for Programmers

(The book has a Creative Commons license, so it is free and OK to download)

Probabilistic Graphical Models Starts Today

The Coursera Probabilistic Graphical Models course officially starts today.  Sign up and start learning.

Kaggle Makes Data Science Fun

The tagline for Kaggle is “We’re making data science a sport.” They have successfully created a way to turn data science into a competition. It is fun, and it yields excellent results. There is also a portion of the site dedicated to classroom use. It is called Kaggle in Class.

Here is how it works. A company that needs some data analyzed can contact Kaggle and host a competition. Then data scientists all over the world can compete to find the best solution. The company benefits from having many algorithms and techniques applied to the same data set, far more than it could try on its own. The contestants benefit from networking, pre-cleaned data, and learning from others. It is a win/win situation. Plus, the winner gets prize money.

Currently, the large featured competition is the Heritage Health Prize. It is a $3,000,000 competition to identify individuals who will be admitted to the hospital in the next year. The contest lasts until April 2013.

This is definitely a site I want to be involved with in the future.  I just wish they could make it a spectator sport as well.

16 Companies Hiring Data Scientists Right Now

Data Scientist is the hot new job for 2012.  Does this job really exist?  Who hires these people? Are companies currently hiring? The answers are: yes, lots of companies, and yes. I decided to spend last night looking for companies that are currently hiring data scientists.  It did not take long to compile a pretty good list.

Data Scientist Job Openings

Company     | Location          | Link
Microsoft   | Redmond, WA       | Microsoft Sr. Data Scientist
Netflix     | Los Gatos, CA     | NetFlix Senior Data Scientist
Kaggle      | San Francisco, CA | Kaggle Data Scientist
Greenplum   | San Mateo, CA     | Greenplum Data Scientist
Last.fm     | London            | Last.fm Data Scientist
Rackspace   | San Antonio, TX   | Rackspace Data Scientist
Amazon      | Seattle, WA       | Amazon Data Scientist/System Architect
Facebook    | Menlo Park, CA    | Facebook Data Scientist
Twitter     | San Francisco, CA | Twitter Data Scientist
LinkedIn    | Mountain View, CA | LinkedIn Data Scientist
Cobalt/ADP  | Cambridge, MA     | Cobalt Data Scientist
Ebay/Paypal | San Jose, CA      | Paypal Data Scientist
Bunchball   | San Jose, CA      | Bunchball Data Scientist
A9          | Palo Alto, CA     | Principal Engineer/Data Scientist
Acxiom      | Little Rock, AR   | Acxiom Data Scientist
Trulia      | San Francisco, CA | Trulia Data Scientist – Data Science Lab

Do you know of any other companies hiring Data Scientists right now?

Learning Statistics for Data Science

Statistics – This is a topic that could use some more attention from the online community.
I would love to see Stanford (or Coursera) offer a free online statistics course, much like their other free online courses.

I did find a series of YouTube videos by Daniel Judge, a professor in the East Los Angeles College Mathematics Department. The videos start at the very beginning of statistics. I have watched a couple of the videos, and they seem quite good. Daniel does a nice job of explaining the material. Here is the first video in the series.

Stay tuned to the blog in case other stats options appear online. Also, please leave a comment if you know of some good online statistics resources.

What Makes a Good Data Scientist?

Jeremy Howard is the Chief Scientist at Kaggle. At the end of this interview, from the Strata Conference 2012, he identified 4 simple traits that a data scientist needs.

  1. Creativity
  2. Open-mindedness
  3. Tenacity
  4. A Good Skillset

Jeremy Howard of Kaggle at Strata 2012

In this brief interview he covers a range of other data science topics:

  • Big Data is an engineering problem
  • Analytics generate value/insight from data
  • Predictive Modeling is about answering a question – build a model to do that
  • Is Data Science about tools or people? – watch the video for Jeremy’s answer
  • And others…

See this previous post for more videos from Strata 2012.