Writing a dissertation can be a long and difficult task. It takes dedication, a good topic, a helpful advisor, some meetings, and a bit of paperwork. However, it is not impossible, and here are some tools to make it easier (hopefully).
This is not intended to be a guide for selecting a topic. I am not qualified to provide that type of advice, but I will say: choose both a topic and an advisor you find interesting. This is intended to be a collection of tools I found useful during my journey. I do not think the list is specific to data science; it could easily apply to mathematics, statistics, computer science, engineering, or any other highly quantitative field.
All these tools have free versions to get you started. A few have discounted upgrades for students.
Use an online tool such as ShareLaTeX.
How does this tool benefit you? It saves you from having to install a version of LaTeX, stores a history of your previous versions of the document, and allows you to write on any machine with an internet connection. In addition, ShareLaTeX has existing templates for many universities. Students can even get half-priced premium accounts to collaborate and sync with GitHub and Dropbox. While LaTeX is not perfect, I do not know of any better tool for writing mathematical documents.
Use GitHub to store your data and source code.
At some point, hopefully you will want to share your results. GitHub is the de facto standard for sharing open source code. It also works very well for storing data, even large datasets. You might also discover another open source project you want to get involved with. As a definite bonus, many future non-academic employers encourage a GitHub account during the application process. Thus, the sooner you start, the better.
Use a Cloud Computing Platform such as Sense.
Don’t spend your time building a cluster of computers unless your dissertation topic involves cluster computing. Solve your own problem, not infrastructure problems. Sense and others provide access to massive computing power at low cost. Plus, Sense provides collaboration, sharing, scheduling, notifications, analysis recreation, and many other features you might find beneficial.
Use Creately for creating diagrams.
Creating flowcharts and technical diagrams can be a pain, especially if you do not have expensive diagram software. Creately is a simple solution to this problem.
There is your list of helpful tools for writing a data science dissertation. Do you have any tools you think I missed? If so, please leave a comment.
Human trafficking is the trade in humans, most commonly for the purpose of sexual slavery, forced labor or commercial sexual exploitation for the trafficker or others; or for the extraction of organs or tissues, including surrogacy and ova removal; or for providing a spouse in the context of forced marriage.
Human trafficking is modern-day slavery, and at any time more than 20 million people worldwide are victims (see the US Trafficking In Persons Report). Sex trafficking is a specific form of human trafficking for the purposes of sexual exploitation.
What does this have to do with Data Science?
Well, data is being collected about victims of human trafficking and online advertisements targeting potential victims. Lots of data is being collected and not enough analysis is being done. Luckily, some organizations have teamed up to help fight human trafficking and sex slavery.
Startups like Palantir and SumAll are getting involved in the fight against human trafficking. Startups are not alone; governments and universities are also getting involved.
SumAll.org, a non-profit spin-off of the startup SumAll.com, is an organization that provides data analytic capabilities to non-profit organizations making a social impact for good. One of the first projects of SumAll.org was Human Trafficking.
Rescue Forensics is a Y Combinator startup helping law enforcement collect online data to capture and prosecute human traffickers.
DARPA and Carnegie Mellon University are jointly working to use Natural Language Processing (NLP), computer vision, and machine learning to identify online ads used by sex traffickers.
Thorn is an organization that is specifically trying to stop the trafficking of children.
The issue of human trafficking is very complicated, and it will take many years and many people to solve. Therefore, it is a great time for you to get involved. Below are some organizations seeking volunteers.
DataKind is an organization working to match up data scientists with data from non-profits. DataKind runs a number of data-dives/hackathons and special projects centered around using data science to help produce a positive global impact. Although DataKind works with a number of organizations and not all are focused on human trafficking, DataKind does work with a number of organizations that are fighting for global human rights. If you are interested in becoming involved with DataKind, please fill out the Get Involved Form. They are looking for volunteers regardless of your location.
SumAll.org is actively seeking volunteers and interns. They are seeking volunteers for data science, visualization, blogging, and KPI monitoring. Like DataKind, not all the projects are human trafficking, but they are focused on changing the world.
edX has just announced a new series of Big Data courses. The series consists of 2 courses focused on Apache Spark. If you are not familiar with Spark, it is a very fast engine for large-scale data processing. It claims to perform up to 100 times faster than Hadoop. Here are the 2 courses:
I just finished my PhD in the Computational Science and Statistics program at South Dakota State University. My dissertation focused on the area of software analytics, sometimes called Data-Driven Software Engineering. Specifically, how does a Software Development Organization evaluate itself? Students have a G.P.A. (Grade Point Average), but organizations do not have a similar evaluation method.
The dissertation introduces the C.R.I. (Cumulative Result Indicator) to provide a single number to evaluate the performance of a software development organization. The C.R.I. focuses on 5 primary elements of a Software Development Organization.
The C.R.I. demonstrates what data needs to be calculated and how that data can be used to create a score. Naturally, this solution will not work in every situation, but it does provide a consistent method for evaluation, and it is flexible enough to allow only some of the elements or even additional elements.
That is the brief 1-minute overview of the dissertation. Feel free to read more of the details in the document below.
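To make the idea of a flexible composite score concrete, here is a minimal sketch in Python. It is not the C.R.I. formula from the dissertation. It assumes each element has already been scored on a common 0 to 100 scale and simply averages whichever elements are supplied. The function name and the element names (quality, schedule, cost) are hypothetical and chosen only for illustration.

```python
def composite_score(element_scores):
    """Average the supplied element scores into a single number.

    Assumption (not from the dissertation): every element is already
    scored on a common 0-100 scale and all elements count equally.
    """
    if not element_scores:
        raise ValueError("at least one element score is required")
    for name, score in element_scores.items():
        if not 0 <= score <= 100:
            raise ValueError(f"{name} score must be between 0 and 100")
    return sum(element_scores.values()) / len(element_scores)


# Hypothetical element names, used only to show the flexibility:
# the same function works with three elements or with only two.
full = composite_score({"quality": 80, "schedule": 70, "cost": 90})
partial = composite_score({"quality": 80, "schedule": 70})
```

Because the function only looks at the elements it is given, an organization could drop an element it does not track or add a new one without changing the calculation.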
This is the first in a series of posts on Data-Driven Software Engineering. In the next few weeks, I will be posting more about the topic. Some posts will be excerpts from the dissertation, and others will be new thoughts on the topic. Stay Tuned!
Originally published in 2012, the Open Data Handbook has released a second edition. The handbook is meant to be used as a guide for organizations or individuals interested in publishing or using open data. The goal is to ensure that data is open and that it is applied as often as possible.
The second edition now includes 3 parts.
Open Data Guide – The Why?, What? and How? of open data
Value Stories – Stories of how open data is making a difference
Resource Library – Videos, presentations, and publications about open data
Following the theme of openness, the Open Data Handbook is open sourced on GitHub. You are free and encouraged to contribute. There is even an extensive contribution guide if you are interested.
The Data ScienceTech Institute (DSTI) in France is starting 2 new master’s degree programs in data science. Both programs are highly innovative and offer a strong industry focus. Classes begin in October 2015, and each program is limited to 30 students. Therefore, if you are interested, it is important to apply as soon as possible.
The other day, the faculty at DSTI were announced. I am honored to say I was selected as one of the faculty. Thus, I will serve as a visiting faculty member for portions of the program.
DSTI offers 2 master’s degree programs:
Data Scientist Designer – Located in Paris, this 2-year program is part-time and focused on working professionals looking to transition or enhance skills in the data science field. The course will rotate between 2 and 3 days a week.
Executive Big Data Analyst – Located in Nice along the French Riviera, this program is a more traditional intensive 16-month program targeting full-time students.
If you are in France or Europe or interested in studying in France, the programs from DSTI are definitely worth a look.