Tag Archives: open source

Do’s and Don’ts of Data Science

Don’t start with the data
Do start with a good question

Don’t think one person can do it all
Do build a well-rounded team

Don’t only use one tool
Do use the best tool for the job

Don’t brag about the size of your data
Do collect relevant data

Don’t ignore domain knowledge
Do consult a subject matter expert

Don’t publish a table of numbers
Do create informative charts

Don’t use just your own data
Do enhance your analysis with open data

Don’t do all the work yourself
Do partner with local universities

Don’t always build your own tools
Do use lots of open source tools

Don’t keep all your findings to yourself
Do share your analysis and results with the world!


Got any to add? Please leave a comment.

Dat – Version Controlled Data

Dat is an open source project focused on data storage, and in particular on version control for data. What is version control? In short, it tracks the history of changes to something (typically source code files or documents). Dat takes the idea a bit further: the data is versioned at the row level rather than the file level. It is also built for collaboration among teams.
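To make the row-level idea concrete, here is a minimal sketch in Python. It is not Dat's API or storage format; the `RowVersionedTable` class and its method names are made up for illustration. It only shows what it means to keep a separate history for each row instead of one history for the whole file.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class RowVersionedTable:
    """Toy in-memory table with one history per row (hypothetical, not Dat's API)."""

    _history: Dict[str, List[Dict[str, Any]]] = field(default_factory=dict)

    def put(self, key: str, row: Dict[str, Any]) -> int:
        """Append a new version of a single row; return its 1-based version number."""
        versions = self._history.setdefault(key, [])
        versions.append(dict(row))  # copy so later edits by the caller don't rewrite history
        return len(versions)

    def get(self, key: str, version: Optional[int] = None) -> Dict[str, Any]:
        """Return the latest version of a row, or a specific earlier version."""
        versions = self._history[key]
        index = len(versions) - 1 if version is None else version - 1
        if not 0 <= index < len(versions):
            raise KeyError(f"row {key!r} has no version {version}")
        return dict(versions[index])

    def version_count(self, key: str) -> int:
        """Number of versions recorded for one row."""
        return len(self._history[key])


table = RowVersionedTable()
table.put("city:1", {"name": "Springfield", "population": 30000})
table.put("city:2", {"name": "Shelbyville", "population": 25000})
# Updating one row adds a version to that row only; "city:2" is untouched.
table.put("city:1", {"name": "Springfield", "population": 31000})
```

With file-level versioning, the update to `city:1` would produce a new version of the entire file. Here only that row gains a second version, and the earlier value is still available through `table.get("city:1", version=1)`.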

Use the online tutorial to learn more.

Dat is currently in beta. This is going to be a very interesting project to watch. I can see many great use cases.

The New Open Data Handbook

Originally published in 2012, the Open Data Handbook has been released in a second edition. The handbook is a guide for organizations and individuals interested in publishing or using open data. Its goal is to ensure that data is open and that it is put to use as often as possible.

The second edition now includes three parts.

  1. Open Data Guide – The Why?, What? and How? of open data
  2. Value Stories – Stories of how open data is making a difference
  3. Resource Library – Videos, presentations, and publications about open data

Following the theme of openness, the Open Data Handbook is open sourced on GitHub. You are free and encouraged to contribute. There is even an extensive contribution guide if you are interested.

Read the official announcement from the Open Knowledge Foundation.

Open Data Day 2015

Today, February 21, 2015, is Open Data Day.

What is it?

Around the globe, cities are hosting hackathons centered around open data. The rules are fairly open-ended as long as the event is open and uses open data.

Who is it for?

  • Designers
  • Developers
  • Statisticians
  • Librarians
  • Citizens

If you want to get involved, check the list of cities hosting Open Data Day events.

If you are looking for some good datasets to use, try Data Sources for Cool Data Science Projects: Part 1 and Part 2.

Open Source Alternatives to AWS

Working with big data often means doing some cloud computing. If a public cloud like Amazon Web Services (AWS) is not an option, there are some open source alternatives. They all offer some level of compatibility with the AWS API for both EC2 (compute) and S3 (storage).

  1. Rackspace OpenStack
  2. Apache CloudStack
  3. Eucalyptus
  4. OpenNebula
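In practice, AWS API compatibility usually means an existing AWS client can be pointed at a different endpoint. The sketch below assumes the third-party boto3 library and a private cloud that exposes an S3-compatible endpoint. The endpoint URL, the credentials, and the helper names `s3_client_kwargs` and `make_s3_client` are placeholders for illustration, and whether a given deployment accepts these calls depends on how it is configured.

```python
from typing import Dict


def s3_client_kwargs(endpoint_url: str, access_key: str, secret_key: str,
                     region: str = "us-east-1") -> Dict[str, str]:
    """Collect the settings that redirect an S3 client to a non-AWS endpoint."""
    return {
        "endpoint_url": endpoint_url,          # the private cloud, not amazonaws.com
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        "region_name": region,                 # many S3-compatible stores ignore this
    }


def make_s3_client(endpoint_url: str, access_key: str, secret_key: str):
    """Build a boto3 S3 client for an S3-compatible service (requires boto3)."""
    import boto3  # third-party; imported here so the helper above works without it

    return boto3.client("s3", **s3_client_kwargs(endpoint_url, access_key, secret_key))


# Placeholder values; replace with the endpoint and keys issued by your own cloud.
kwargs = s3_client_kwargs(
    "https://s3.private-cloud.example",
    "EXAMPLE_ACCESS_KEY",
    "EXAMPLE_SECRET_KEY",
)
```

If the endpoint really is S3 compatible, code written against the returned client (for example `client.list_buckets()`) should work the same way it does against AWS, which is the main appeal of these alternatives.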

Best Free Data Mining Tools

I recently saw the article, The Best Data Mining Tools You Can Use for Free in Your Company. It contains a very brief description of each of the following tools.

  1. RapidMiner
  2. RapidAnalytics
  3. Weka
  4. PSPP
  5. KNIME
  6. Orange
  7. Apache Mahout
  8. jHepWork
  9. Rattle

See The Best Data Mining Tools You Can Use for Free in Your Company for more details, links, and pictures.

50 Top Open Source Tools for Big Data – Datamation

50 Top Open Source Tools for Big Data – Datamation.

The list is about 6 months old, but it still covers all the ones I would have listed and quite a few more.

Big Data Right Now: Five Trendy Open Source Technologies | TechCrunch

Open source software can be great. TechCrunch lists 5 fairly new open source technologies for big data. This is probably a good list to pay attention to for the near future.

  • Storm
  • Drill
  • R
  • Gremlin
  • SAP HANA

If you are unfamiliar with some of the software on the list, please read the article for more details.

Big Data Right Now: Five Trendy Open Source Technologies | TechCrunch.