Tag Archives: github

Awesome Data Science Colleges List

I recently compiled a huge list of colleges and universities with data science-related degree programs. The compiled list is available on GitHub as Awesome Data Science Colleges.

I encourage you to contribute to the list if you know of missing programs.

Tools For Writing a Data Science Dissertation

Writing a dissertation can be a long and difficult task. It takes dedication, a good topic, a helpful advisor, some meetings, and a bit of paperwork. However, it is not impossible, and here are some tools to make it easier (hopefully).

This is not intended to be a guide for selecting a topic. I am not qualified to provide that type of advice, but I will say this: choose both a topic and an advisor you find interesting. This is intended to be a collection of tools I found useful during my journey. I do not think the list is specific to data science; it could easily apply to mathematics, statistics, computer science, engineering, or any other highly quantitative field.

All these tools have free versions to get you started. A few have discounted upgrades for students.

  • Use an online LaTeX tool such as ShareLaTeX.
    How does this tool benefit you? It saves you from having to install LaTeX, stores the history of your previous document versions, and lets you write on any machine with an internet connection. In addition, ShareLaTeX has existing templates for many universities. Students can even get half-priced premium accounts to collaborate and sync with GitHub and Dropbox. While LaTeX is not perfect, I do not know of any better tool for writing mathematical documents.
  • Use GitHub to store your data and source code.
    At some point, hopefully you will want to share your results. GitHub is the de facto standard for sharing open source code. It also works very well for storing data, even large datasets. You might also discover another open source project you want to get involved with. As a definite bonus, many non-academic employers encourage applicants to share a GitHub account during the application process. Thus, the sooner you start, the better.
  • Use a Cloud Computing Platform such as Sense.
    Don’t spend your time building a cluster of computers unless your dissertation topic involves cluster computing. Solve your own problem, not infrastructure problems. Sense and others provide access to massive computing power at low cost. Plus, Sense provides collaboration, sharing, scheduling, notifications, analysis recreation, and many other features you might find beneficial.
  • Use Creately for creating diagrams.
    Creating flowcharts and technical diagrams can be a pain, especially if you do not have expensive diagram software. Creately is a simple solution to this problem.

There is your list of helpful tools for writing a data science dissertation. Do you have any tools you think I missed? If so, please leave a comment.

Huge List of Big Data and Machine Learning Technologies

Onur Akpolat has put together A curated list of awesome big data frameworks and resources. The list is very extensive and includes NoSQL databases, machine learning libraries, frameworks, filesystems, and more.

On a similar note, Joseph Misiti has compiled a large list of resources specific to machine learning. The list is titled Awesome Machine Learning, and it includes resources for various languages, NLP, visualization, and more.

Both lists are on GitHub, so if you notice something missing from either list, feel free to add it. Contributions are welcome.

Hackathons with Data are Everywhere

It seems that competitions and meetups for hacking data are all over the place. Coding challenges have been around for a long time. Recently, it appears that data is being thrown into the mix. I think the idea is great. Instead of just hacking some app, why not hack with some data that might help people?

GitHub just concluded the GitHub Data Challenge. Also, the World’s first ever global data science hackathon occurred last month. The Silicon Prairie has even hosted a couple of data-centric hackathons. The Omaha World-Herald newspaper organized Hack Omaha to turn government data into something useful. Open Iowa also did a very similar thing for developers, designers, and data junkies in the Des Moines area. I did not personally attend either of these events, but I was surprised to see similar events occurring in the Midwest.

DataKind is busy organizing data dives all over the country, and Kaggle is currently organizing data science competitions for anyone regardless of location. By the way, Kaggle has become one of my favorite sites, and I will be blogging soon about how to quickly get involved.

Anyhow, it appears hackathons with data are here to stay. What data hackathons or competitions do you know about? Are you planning to attend?

GitHub Is Cool: They Like Data

Today, GitHub announced the release of archived public activity data called the GitHub public timeline. The dataset can be queried via the Google BigQuery tool.

To make things even more awesome, GitHub is also hosting a Data Challenge. The challenge is to play around with the data and create the best visualization possible. You had better start now, because the competition ends May 21st. I am not familiar with Google BigQuery, so this might be a good time to learn.

This should not surprise anyone. GitHub is always doing cool things, especially for developer-minded people. If you don’t know, GitHub is the best place for hosting your source code.