Don’t start with the data
Do start with a good question
Don’t think one person can do it all
Do build a well-rounded team
Don’t only use one tool
Do use the best tool for the job
Don’t brag about the size of your data
Do collect relevant data
Don’t ignore domain knowledge
Do consult a subject matter expert
Don’t publish a table of numbers
Do create informative charts
Don’t use just your own data
Do enhance your analysis with open data
Don’t do all the work yourself
Do partner with local universities
Don’t always build your own tools
Do use lots of open source tools
Don’t keep all your findings to yourself
Do share your analysis and results with the world!
Got any to add? Please leave a comment.
Dat is an open source project focusing on data storage. In particular, the project aims to bring version control to data. What is version control? In short, it tracks the history of something (typically source code files or documents). Dat takes the idea a bit further: data is versioned at the row level rather than the file level. Plus, it is built for collaboration among teams.
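To make the row-level idea concrete, here is a minimal Python sketch. It is not Dat's API or implementation; the `RowVersionedTable` class, its methods, and the sample station data are all hypothetical, made up for illustration. It only shows that changing one row adds history to that row and leaves the others alone.

```python
from dataclasses import dataclass, field


@dataclass
class RowVersionedTable:
    """Toy table (hypothetical, not Dat) that keeps a history per row key."""

    # Maps a row key to the list of values that row has held, oldest first.
    history: dict = field(default_factory=dict)

    def put(self, key, row):
        """Store a new version of one row and return its 1-based version number."""
        versions = self.history.setdefault(key, [])
        versions.append(dict(row))  # copy so later edits by the caller don't alter history
        return len(versions)

    def get(self, key, version=None):
        """Return the latest version of a row, or a specific earlier one."""
        versions = self.history[key]
        if version is None:
            return dict(versions[-1])
        if not 1 <= version <= len(versions):
            raise IndexError(f"row {key!r} has no version {version}")
        return dict(versions[version - 1])


# Sample data is invented for the example.
table = RowVersionedTable()
table.put("station-1", {"city": "Fargo", "temp_c": -12})
table.put("station-2", {"city": "Austin", "temp_c": 18})
table.put("station-1", {"city": "Fargo", "temp_c": -9})  # only this row gains a version

print(len(table.history["station-1"]))  # versions recorded for station-1
print(len(table.history["station-2"]))  # versions recorded for station-2
print(table.get("station-1", version=1))  # the original reading is still available
```

With file-level versioning, the third `put` would create a new version of the whole dataset. Here only `station-1` picks up a second entry, which is the distinction the paragraph above describes.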
Use the online tutorial to learn more.
Dat is currently in beta. This is going to be a very interesting project to watch. I can see many great use cases.
Originally published in 2012, the Open Data Handbook has released a second edition. The handbook is a guide for organizations and individuals interested in publishing or using open data. The goal is to ensure that data is open and that it is used as often as possible.
The second edition now includes 3 parts.
- Open Data Guide – The Why?, What? and How? of open data
- Value Stories – Stories of how open data is making a difference
- Resource Library – Videos, presentations, and publications about open data
Following the theme of openness, the Open Data Handbook is open sourced on GitHub. You are free and encouraged to contribute. There is even an extensive contribution guide if you are interested.
Read the official announcement from the Open Knowledge Foundation.
Today, February 21, 2015, is Open Data Day.
What is it?
Around the globe, cities are hosting hackathons centered around open data. The rules are fairly open-ended as long as the event is open and uses open data.
Who is it for?
If you want to get involved, check the list of cities hosting Open Data Day events.
If you are looking for some good datasets to use, try Data Sources for Cool Data Science Projects: Part 1 and Part 2.
Working with big data can often mean doing some cloud computing. If a public cloud like Amazon Web Services (AWS) is not an option, there are some open source alternatives. They all offer some level of compatibility with the AWS API for both EC2 (compute) and S3 (storage).
- Rackspace OpenStack
- Apache CloudStack
I recently saw the article, The Best Data Mining Tools You Can Use for Free in Your Company. It contains a very brief description of each of the following tools.
- Apache Mahout
See The Best Data Mining Tools You Can Use for Free in Your Company for more details, links, and pictures.
50 Top Open Source Tools for Big Data – Datamation.
The list is about 6 months old, but it still covers all the ones I would have listed and quite a few more.
Open source software can be great. TechCrunch lists 5 fairly new open source technologies for big data.
This is probably a good list to pay attention to for the near future.
- SAP HANA
If you are unfamiliar with some of the software on the list, please read the article for more details.
Big Data Right Now: Five Trendy Open Source Technologies | TechCrunch.