Martin Zinkevich, a Research Scientist at Google, just compiled a list of 43 best practices for building machine learning systems.
Rules of Machine Learning: Best Practices for ML Engineering
If you do data engineering or are involved with building data science systems, this document is worth a look.
Recently, a number of resources for publicly available datasets have been announced.
- Kaggle becomes the place for Open Data – I think this is big news! Kaggle just announced Kaggle Datasets, which aims to be a repository for publicly available datasets. This is great for organizations that want to release data but do not necessarily want the overhead of running an open data portal. Hopefully it will gain some traction and become an exceptional resource for open data.
- NASA Opens Research – NASA just announced that all research papers funded by NASA will be publicly available. It appears the research articles will all be available at PubMed Central, and the data will be available at NASA’s Data Portal.
- Google Robotics Data – Google continues to do interesting things, and this dataset is definitely one of them. It covers how robots grasp objects (Google Brain Robot Data). I am not overly familiar with this topic, so if you want to know more, see their blog post, Deep Learning for Robots.
For more open data options, see Data Sources for Cool Data Science Projects Part 1 and Part 2.
Are you aware of any other resources that have been recently announced? If so, please leave a comment.
Microsoft recently announced a machine learning competition platform. As part of the launch, one of the first competitions is the prediction of brain signals. It has $5000 in prizes, and submissions are accepted through June 30, 2016.
Google and Tableau have teamed up to offer a big data visualization contest. The rules are fairly simple: just create an awesome visualization using at least the GDELT data set. Finalists will receive prizes worth over $5000, and some will even get tours of Tableau and Google facilities. The contest runs through May 16, 2016.
Somewhat lost in the hype of Google’s Cloud Machine Learning announcement (which is itself neat) was the release of Google’s Public Data Sets.
I think this was already happening informally, but now Google has an official location for these public data sets stored in BigQuery. You can:
- Access and use the data in your applications
- Request Google to host your own public data set
It will be fun to watch this site expand with more public datasets. Happy Exploration!
Google recently announced the launch of their own Massive Open Online Course (MOOC). The course is titled Making Sense of Data, and it begins tomorrow, March 18, 2014.
The prerequisites are quite simple. All that is needed is a Google account, a web browser, and a basic knowledge of spreadsheets.
The content of the course will focus on Fusion Tables, which is a new experimental product from Google. Fusion Tables is a web application for visualizing, gathering, and sharing data. I am not familiar with Fusion Tables, but the description sounds very useful.
Here is the promotional video.
The Real Data on Facebook vs. Google+ is a great article about the popularity of different social networks. All the big social networks (Twitter, Facebook, Google+, LinkedIn, Pinterest, and even Myspace) are included. If you have ever wondered whether Google+ is dead or not, this article will help you out. With the data gathered and analyzed, it is clear that Facebook is currently winning the social media battle.
The best part of the article is an interactive infographic. You can change the view of the infographic for different years and different business segments. Here is a direct link to just the Interactive Infographic, Facebook Dominates Social Networking.
Today, GitHub announced the release of archived public activity data called the GitHub public timeline. The dataset can be queried via the Google BigQuery tool.
To make things even more awesome, GitHub is also hosting a Data Challenge. The challenge is to play around with the data and create the best visualization possible. You had better start now, because the competition ends May 21st. I am not familiar with Google BigQuery, so this might be a good time to learn.
This should not surprise anyone. GitHub is always doing cool things, especially for developer-minded people. If you don’t know, GitHub is the best place for hosting your source code.
Google with Incorrect Spelling
Just the other day, I was googling for “strata conference” information. I noticed that I had mistakenly typed “starta” instead of “strata”. I proceeded to backspace the incorrect letters and fix the spelling. Later in the day, I also noticed that I frequently mistype the letters “ar” and “ra”. That got me thinking.
Does Google Know How Poorly I Spell?
Since Google Instant was released in 2010, Google has been able to track every keystroke I type into the Google Search box. Thus, Google knows when I hit the backspace key. Using some data analysis, Google should be able to answer the following questions:
- What words are most commonly misspelled in Google searches? I would guess the answer would be a good indicator of the most commonly misspelled words in general.
- What words do I misspell the most often?
- How many letters get typed after the misspelling?
- What percentage of Google searches are completed without a backspace?
- Do people in certain parts of the world/country have better spelling?
- How often do people backspace a correctly spelled word, just to then spell it incorrectly? This could be amusing. I would also like to know which words.
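To make a couple of these questions concrete, here is a minimal Python sketch of how a keystroke log could be analyzed. Everything about the data format is my own assumption: each search is a list of single key presses, a backspace is recorded as `"\b"`, and the function names are hypothetical. I have no idea how Google actually stores or analyzes this data.

```python
from typing import Iterable, List

# Assumption: a backspace key press is recorded as "\b" in the log.
BACKSPACE = "\b"


def replay(keys: Iterable[str]) -> str:
    """Rebuild the final query text from a sequence of key presses."""
    buffer: List[str] = []
    for key in keys:
        if key == BACKSPACE:
            if buffer:  # ignore a backspace on an empty search box
                buffer.pop()
        else:
            buffer.append(key)
    return "".join(buffer)


def count_backspaces(keys: Iterable[str]) -> int:
    """Count how many times the backspace key was pressed in one search."""
    return sum(1 for key in keys if key == BACKSPACE)


def backspace_free_share(sessions: Iterable[List[str]]) -> float:
    """Fraction of searches that were typed without any backspace."""
    sessions = list(sessions)
    if not sessions:
        return 0.0
    clean = sum(1 for keys in sessions if count_backspaces(keys) == 0)
    return clean / len(sessions)


# My "starta" search: type it wrong, delete four letters, retype them.
starta_search = list("starta") + [BACKSPACE] * 4 + list("rata")
clean_search = list("google")

print(replay(starta_search))
print(count_backspaces(starta_search))
print(backspace_free_share([starta_search, clean_search]))
```

The same idea would extend to the other questions, for example grouping searches by region before computing the backspace-free share.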
Misspellings Don’t Matter
As it turns out, the misspellings don’t really matter that much. Google is smart enough to fix many spelling errors.
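For a rough feel of how a misspelling can be mapped back to the intended word, here is a small Python sketch using the standard library's `difflib.get_close_matches`. This is only an illustration of fuzzy string matching, not how Google corrects spelling. The word list, the `suggest` function, and the 0.6 cutoff are all assumptions of mine.

```python
import difflib
from typing import List, Optional

# Hypothetical vocabulary; a real search engine would draw on far more data.
KNOWN_WORDS = ["strata", "conference", "google", "spark"]


def suggest(word: str,
            vocabulary: List[str] = KNOWN_WORDS,
            cutoff: float = 0.6) -> Optional[str]:
    """Return the closest known word, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(suggest("starta"))
print(suggest("confrence"))
print(suggest("zzzz"))
```

With this tiny word list, “starta” comes back as “strata”, while a string that resembles nothing in the list returns `None`.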
What other spelling questions could Google answer?