Looking for datasets for your next project? You are in luck because Google just launched Dataset Search. The name is self-explanatory. Go try it out.
DJ Patil, former US Chief Data Scientist and data science legend, has a nice video with helpful tips for people looking to get into data science.
- Strive for Curiosity
- Follow Ethics and Security
- Be part of a Team
- Solve a Local Problem
Chartio put together a nice online interactive tutorial for learning SQL.
SQL is close to an essential tool for data science. Rarely is there a project that does not call for some SQL.
Recently, a number of resources for publicly available datasets have been announced.
- Kaggle becomes the place for Open Data – I think this is big news! Kaggle just announced Kaggle Datasets which aims to be a repository for publicly available datasets. This is great for organizations that want to release data, but do not necessarily want the overhead of running an open data portal. Hopefully it will gain some traction and become an exceptional resource for open data.
- NASA Opens Research – NASA just announced all research papers funded by NASA will be publicly available. It appears the research articles will all be available at PubMed Central, and the data available at NASA’s Data Portal.
- Google Robotics Data – Google continues to do interesting things, and this dataset is one of them. It covers how robots grasp objects (Google Brain Robot Data). I am not overly familiar with this topic, so if you want to know more, see their blog post, Deep Learning for Robots.
For more open data options, see Data Sources for Cool Data Science Projects Part 1 and Part 2.
Are you aware of any other resources that have been recently announced? If so, please leave a comment.
Dat is an open source project focusing on data storage. In particular, the project wants to version control data. What is version control? In short, it tracks the history of something (typically source code files or documents). Dat takes the idea a bit further: the data is versioned at the row level, not the file level. Plus, it is built for collaboration among teams.
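To make the file-level versus row-level distinction concrete, here is a minimal sketch in Python. It is not Dat's implementation or API. The functions `file_fingerprint` and `row_diff`, the row keys, and the sample data are all hypothetical, made up for illustration. A file-level check can only say that something changed, while a row-level diff says which rows changed.

```python
import hashlib
import json


def file_fingerprint(rows):
    """File-level view: a single hash for the whole table (hypothetical helper)."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def row_diff(old, new):
    """Row-level view: which keyed rows were added, removed, or changed."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Two versions of a small table, keyed by a made-up row id.
v1 = {
    "p1": {"name": "Ada", "city": "London"},
    "p2": {"name": "Grace", "city": "New York"},
}
v2 = {
    "p1": {"name": "Ada", "city": "Cambridge"},
    "p2": {"name": "Grace", "city": "New York"},
    "p3": {"name": "Alan", "city": "Manchester"},
}

# File level: we only learn that the two versions differ.
files_differ = file_fingerprint(v1) != file_fingerprint(v2)

# Row level: we learn exactly which rows differ and how.
diff = row_diff(v1, v2)
print(files_differ)
print(sorted(diff["added"]), sorted(diff["changed"]), sorted(diff["removed"]))
```

Storing only the row-level differences between versions is one plausible reason row-level versioning suits large, frequently updated datasets, though how Dat actually stores history is not covered here.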
Use the online tutorial to learn more.
Dat is currently in beta. This is going to be a very interesting project to watch. I can see many great use cases.
Robin Murphy has one of the coolest job titles I have ever read, Disaster Roboticist. At Texas A&M, she works on developing advanced robots for disaster recovery.
In this TED talk, she outlines some of the capabilities of the robots and how the robots work. One quote at the end really caught my attention.
So really, “disaster robotics” is a misnomer. It’s not about the robots. It’s about the data.
The robots are collecting data and that data needs analysis!
The National Football League begins its regular season tonight. One feature you might not hear about is the addition of 2 RFID sensors on every player. Each stadium is equipped with receivers (not wide receivers) to capture the data emitted from the RFID tags. Once collected, the data can be used to track each player's position, movement, speed, and acceleration. A company called Zebra Technologies is implementing the system.
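As a rough illustration of how speed and acceleration can be derived from position data, here is a small Python sketch using finite differences. It is not how Zebra Technologies' system works. The sample format (time in seconds, x and y in meters), the function names, and the numbers are all assumptions made for the example.

```python
import math


def speeds(samples):
    """Estimate speed between consecutive position samples.

    samples: list of (t_seconds, x_meters, y_meters), sorted by time.
    Returns a list of (t, speed), stamped at the end of each interval.
    """
    out = []
    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        distance = math.hypot(x1 - x0, y1 - y0)
        out.append((t1, distance / dt))
    return out


def accelerations(speed_series):
    """Estimate acceleration as the change in speed over time."""
    out = []
    for (t0, v0), (t1, v1) in zip(speed_series, speed_series[1:]):
        out.append((t1, (v1 - v0) / (t1 - t0)))
    return out


# Made-up track for one player: three position fixes, one second apart.
track = [(0.0, 0.0, 0.0), (1.0, 3.0, 4.0), (2.0, 9.0, 12.0)]

v = speeds(track)        # speed in meters per second
a = accelerations(v)     # acceleration in meters per second squared
print(v)
print(a)
```

Real tracking data would be noisier and sampled far more often, so smoothing before differencing would likely be needed.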
It is a bit early to know exactly what the NFL teams will do with the data, but I think the NFL should open up the data. Analysis could be done for fantasy football. Data scientists could come up with some creative data visualizations. Plus, I think it contains great academic research potential.
As a side note, I am sure someone would start building some apps for the Microsoft Surface tablets.
See more at The IoT comes to the NFL
Just this week, I have become aware of 3 free online books for data science.
- Interactive Charts
- Geographic Plots
Frontiers in Massive Datasets
Frontiers in Massive Datasets is a report all about how science, business, communications, national security and others need to learn to handle massive amounts of data. Whether the data has been sitting in a database for years or it is now just screaming into the systems, massive data is now a problem for almost every industry. This report covers many of the topics that need to be addressed when dealing with big data. Here is a very brief overview of the topics:
- Building Models from Massive Data
- Real-time Algorithms
- 7 Computational Giants of Massive Data Analysis
Foundations of Data Science
Foundations of Data Science is a draft of a textbook written by John Hopcroft and Ravindran Kannan. It is intended to be a computer science text that emphasizes probability and statistics rather than discrete mathematics. The authors argue that working with data is a necessary skill for computer scientists of the future. This is clearly the most technical and academic of the 3 books, but if that is your thing, you should really enjoy browsing through it. Here are some of the topics.
- High-Dimensional Space
- Algorithms for Massive Data Problems
- Singular Value Decomposition
- Graphical Models
Here is a video of the final presentations of a data hackathon. You can watch the pitches, questions, and winners. If you are considering attending a data hackathon, this video should give you a good idea of what to expect at the end of a hackathon.
This video comes from the Critical Data Marathon held in London and Boston during September. This specific data hackathon focuses on health and medical data. I hope to post the next time Critical Data schedules a hackathon.
Have you attended a data hackathon? What was it like?