Recent Resources for Open Data

Recently, a number of resources for publicly available datasets have been announced.

  • Kaggle becomes the place for Open Data – I think this is big news! Kaggle just announced Kaggle Datasets, which aims to be a repository for publicly available datasets. This is great for organizations that want to release data but do not necessarily want the overhead of running an open data portal. Hopefully it will gain traction and become an exceptional resource for open data.
  • NASA Opens Research – NASA just announced that all research papers funded by NASA will be publicly available. It appears the research articles will all be available at PubMed Central, and the data at NASA’s Data Portal.
  • Google Robotics Data – Google continues to do interesting things, and this dataset is no exception: it captures how robots grasp objects (Google Brain Robot Data). I am not overly familiar with the topic, so if you want to know more, see their blog post, Deep Learning for Robots.

For more sources of open data, see Data Sources for Cool Data Science Projects Part 1 and Part 2.

Are you aware of any other resources that have been recently announced? If so, please leave a comment.

Data Science Ethical Framework

The UK government has taken the first step in providing a solid grounding for the future of data science ethics. Recently, they published a “beta” version of the Data Science Ethical Framework.

The framework is based around 6 clear principles:

  1. Start with clear user need and public benefit
  2. Use data and tools which have the minimum intrusion necessary
  3. Create robust data science models
  4. Be alert to public perceptions
  5. Be as open and accountable as possible
  6. Keep data secure

See the above link for further details. The framework is somewhat specific to the UK, but it would be nice to see other countries/organizations adopt a similar framework. Even DJ Patil, U.S. Chief Data Scientist, has stressed the importance of ethics in all data science curricula.

Machine Learning Yearning Book

Andrew Ng [Co-Founder of Coursera, Stanford Professor, Chief Scientist at Baidu, and All-Around Machine Learning Expert] is writing a book during the summer of 2016. The book is titled Machine Learning Yearning. If you visit the site and sign up quickly, you can get draft copies of the chapters as they become available.

Andrew is an excellent teacher. His MOOCs are wildly successful, and I expect his book to be excellent as well.

How to Kickstart Your Data Science Career

This is a guest post from Michael Li of The Data Incubator. The Data Incubator runs a free eight-week data science fellowship to help transition their Fellows from academia to industry. This post runs through some of the toolsets you’ll need to know to kickstart your data science career.


If you’re an aspiring data scientist but still processing your data in Excel, you might want to upgrade your toolset. Why? First, while advanced features like Excel pivot tables can do a lot, they don’t offer nearly the flexibility, control, and power of tools like SQL, or their functional equivalents in Python (Pandas) or R (data frames). Also, Excel has low size limits, making it suitable for “small data,” not “big data.”

In this blog entry we’ll talk about SQL. This should cover your “medium data” needs, which we’ll define as data that no longer fits within Excel’s limit of roughly one million rows. SQL stores data in tables, which you can think of as spreadsheets but with more structure. Each row represents a specific record (e.g. an employee at your company), and each column corresponds to an attribute (e.g. name, department id, salary). Critically, all values in a column must be of the same “type.” Here is a sample of the table Employees:

EmployeeId  Name   StartYear  Salary  DepartmentId
1           Bob    2001       10.5    10
2           Sally  2004       20      10
3           Alice  2005       25      20
4           Fred   2004       12.5    20
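If you’d like to follow along without installing a database server, Python’s built-in sqlite3 module is enough. Here’s a minimal sketch that loads the sample data above into an in-memory SQLite database; the table and values come from this post, but the Python scaffolding is just one illustrative way to set things up:

```python
import sqlite3

# Build the sample Employees table in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (EmployeeId INTEGER, Name TEXT, "
             "StartYear INTEGER, Salary REAL, DepartmentId INTEGER)")
conn.executemany("INSERT INTO Employees VALUES (?, ?, ?, ?, ?)", [
    (1, "Bob",   2001, 10.5, 10),
    (2, "Sally", 2004, 20,   10),
    (3, "Alice", 2005, 25,   20),
    (4, "Fred",  2004, 12.5, 20),
])

rows = conn.execute("SELECT * FROM Employees").fetchall()
print(rows)  # four tuples, one per employee
```

Every SQL example that follows can be run against this table with `conn.execute(...)`.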

SQL has many keywords composing its query language, but the ones most relevant to data scientists are SELECT, WHERE, GROUP BY, and JOIN. We’ll go through each of these individually.

SELECT

SELECT is the foundational keyword in SQL: it retrieves rows from a table. SELECT can also restrict the output to specific columns. For example

SELECT Name, StartYear FROM Employees

returns

Name   StartYear
Bob    2001
Sally  2004
Alice  2005
Fred   2004
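Run against the sample table with Python’s standard-library sqlite3 module (the query and data are from the post; the surrounding code is an illustrative sketch), this looks like:

```python
import sqlite3

# Recreate the sample table, then run the column-selecting query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (EmployeeId INTEGER, Name TEXT, "
             "StartYear INTEGER, Salary REAL, DepartmentId INTEGER)")
conn.executemany("INSERT INTO Employees VALUES (?, ?, ?, ?, ?)", [
    (1, "Bob", 2001, 10.5, 10), (2, "Sally", 2004, 20, 10),
    (3, "Alice", 2005, 25, 20), (4, "Fred", 2004, 12.5, 20)])

rows = conn.execute("SELECT Name, StartYear FROM Employees").fetchall()
print(rows)  # [('Bob', 2001), ('Sally', 2004), ('Alice', 2005), ('Fred', 2004)]
```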


WHERE

The WHERE clause filters the rows. For example

SELECT * FROM Employees WHERE StartYear=2004

returns

EmployeeId  Name   StartYear  Salary  DepartmentId
2           Sally  2004       20      10
4           Fred   2004       12.5    20
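The same WHERE filter from a Python script might look like the sketch below (sqlite3 again; the `?` placeholder is a small addition beyond the post’s query — passing values as parameters rather than pasting them into the SQL string is the usual safe practice):

```python
import sqlite3

# Recreate the sample table, then filter rows by start year.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (EmployeeId INTEGER, Name TEXT, "
             "StartYear INTEGER, Salary REAL, DepartmentId INTEGER)")
conn.executemany("INSERT INTO Employees VALUES (?, ?, ?, ?, ?)", [
    (1, "Bob", 2001, 10.5, 10), (2, "Sally", 2004, 20, 10),
    (3, "Alice", 2005, 25, 20), (4, "Fred", 2004, 12.5, 20)])

rows = conn.execute(
    "SELECT * FROM Employees WHERE StartYear = ?", (2004,)).fetchall()
print(rows)  # only Sally's and Fred's rows survive the filter
```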


GROUP BY

Next, the GROUP BY clause combines rows into groups and aggregates them using functions like COUNT (count) and AVG (average). For example,

SELECT StartYear, COUNT(*) as Num, AVG(Salary) as AvgSalary
FROM Employees
GROUP BY StartYear

returns

StartYear  Num  AvgSalary
2001       1    10.5
2004       2    16.25
2005       1    25
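Here is the aggregation as a runnable sqlite3 sketch. One tweak beyond the post’s query: an ORDER BY is added, since SQL makes no promise about the order groups come back in, and the explicit sort makes the output deterministic:

```python
import sqlite3

# Recreate the sample table, then count and average salaries per start year.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (EmployeeId INTEGER, Name TEXT, "
             "StartYear INTEGER, Salary REAL, DepartmentId INTEGER)")
conn.executemany("INSERT INTO Employees VALUES (?, ?, ?, ?, ?)", [
    (1, "Bob", 2001, 10.5, 10), (2, "Sally", 2004, 20, 10),
    (3, "Alice", 2005, 25, 20), (4, "Fred", 2004, 12.5, 20)])

rows = conn.execute("""
    SELECT StartYear, COUNT(*) AS Num, AVG(Salary) AS AvgSalary
    FROM Employees
    GROUP BY StartYear
    ORDER BY StartYear
""").fetchall()
print(rows)  # [(2001, 1, 10.5), (2004, 2, 16.25), (2005, 1, 25.0)]
```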


JOIN

Finally, the JOIN clause allows us to join in other tables. For example, assume we have a table called Departments:

DepartmentId  DepartmentName
10            Sales
20            Engineering

We could use JOIN to combine the Employees and Departments tables based ON the DepartmentId fields:

SELECT Employees.Name AS EmpName, Departments.DepartmentName AS DepName
FROM Employees JOIN Departments
ON Employees.DepartmentId = Departments.DepartmentId;

The results might look like:

EmpName  DepName
Bob      Sales
Sally    Sales
Alice    Engineering
Fred     Engineering
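And the join, end to end, as a sqlite3 sketch with both sample tables (tables and query from the post; the Python wrapper is illustrative). Note the test-style check compares a set, since SQL does not guarantee the row order of a join:

```python
import sqlite3

# Recreate both sample tables, then join them on DepartmentId.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (EmployeeId INTEGER, Name TEXT, "
             "StartYear INTEGER, Salary REAL, DepartmentId INTEGER)")
conn.execute("CREATE TABLE Departments (DepartmentId INTEGER, "
             "DepartmentName TEXT)")
conn.executemany("INSERT INTO Employees VALUES (?, ?, ?, ?, ?)", [
    (1, "Bob", 2001, 10.5, 10), (2, "Sally", 2004, 20, 10),
    (3, "Alice", 2005, 25, 20), (4, "Fred", 2004, 12.5, 20)])
conn.executemany("INSERT INTO Departments VALUES (?, ?)",
                 [(10, "Sales"), (20, "Engineering")])

rows = conn.execute("""
    SELECT Employees.Name AS EmpName, Departments.DepartmentName AS DepName
    FROM Employees JOIN Departments
    ON Employees.DepartmentId = Departments.DepartmentId
""").fetchall()
print(rows)  # each employee paired with their department name
```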

We’ve ignored a lot of details about joins: for example, there are actually (at least) four types of joins. Hopefully, though, this gives you a good picture.

Conclusion and Further Reading

With these basic commands, you can get a lot of basic data processing done. Don’t forget that you can nest queries and create really complicated joins. It’s a lot more powerful than Excel and gives you much better control of your data. Of course, there’s a lot more to SQL than what we’ve mentioned; this is only intended to whet your appetite and give you a taste of what you’re missing.
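As a quick taste of nesting, a subquery lets one query’s result feed into another. For instance, to find employees paid above the company-wide average (an illustrative example, not from the post, again sketched with sqlite3 against the sample table):

```python
import sqlite3

# Recreate the sample table, then use a nested (sub)query:
# find employees whose salary is above the overall average.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (EmployeeId INTEGER, Name TEXT, "
             "StartYear INTEGER, Salary REAL, DepartmentId INTEGER)")
conn.executemany("INSERT INTO Employees VALUES (?, ?, ?, ?, ?)", [
    (1, "Bob", 2001, 10.5, 10), (2, "Sally", 2004, 20, 10),
    (3, "Alice", 2005, 25, 20), (4, "Fred", 2004, 12.5, 20)])

rows = conn.execute("""
    SELECT Name FROM Employees
    WHERE Salary > (SELECT AVG(Salary) FROM Employees)
""").fetchall()
print(rows)  # the overall average is 17, so only Sally and Alice qualify
```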


And when you’re ready to step it up from “medium data” to “big data”, you should apply for a fellowship at The Data Incubator where we work with current-generation data-processing technologies like MapReduce and Spark!

Berkeley Undergrad Data Science Course and Textbook

The University of California at Berkeley has started The Berkeley Data Science Education Program. The goal is to build a data science education program over the next several years by engaging faculty and students from across the campus. The introductory data science course targets freshmen and is taught in a very applied and interactive environment. The course videos, slides, labs, and notes are freely available for others to use, and the course makes heavy use of Jupyter. Also, the course textbook is online at Computational and Inferential Thinking: The Foundations of Data Science.

A Couple of Current Data Science Competitions

Decoding Brain Signals

Microsoft has recently announced a machine learning competition platform. As part of the launch, one of the first competitions involves predicting brain signals. It has $5,000 in prizes, and submissions are accepted through June 30, 2016.

Big Data Viz Challenge

Google and Tableau have teamed up to offer a big data visualization contest. The rules are fairly simple: just create an awesome visualization using at least the GDELT dataset. Finalists will receive prizes worth over $5,000, and some will even get tours of Tableau and Google facilities. The contest runs through May 16, 2016.

Free Stats book for Computer Scientists

Professor Norm Matloff from the University of California, Davis has published From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science, an open textbook that approaches statistics from a computer science perspective. Dr. Matloff has been a professor of both statistics and computer science, so he is well suited to write such a book. It would be a good choice of textbook for a statistics course aimed primarily at computer scientists. The book uses the R programming language, and it builds the foundations of probability before moving into statistics.

Do’s and Don’ts of Data Science

Don’t start with the data
Do start with a good question

Don’t think one person can do it all
Do build a well-rounded team

Don’t only use one tool
Do use the best tool for the job

Don’t brag about the size of your data
Do collect relevant data

Don’t ignore domain knowledge
Do consult a subject matter expert

Don’t publish a table of numbers
Do create informative charts

Don’t use just your own data
Do enhance your analysis with open data

Don’t do all the work yourself
Do partner with local universities

Don’t always build your own tools
Do use lots of open source tools

Don’t keep all your findings to yourself
Do share your analysis and results with the world!


Got any to add? Please leave a comment.

Learning To Be A Data Scientist