It can be a long and difficult task. It takes dedication, a good topic, a helpful advisor, some meetings, and a bit of paperwork. However, it is not impossible, and here are some tools to make it easier (hopefully).
This is not intended to be a guide for selecting a topic. I am not qualified to provide that type of advice, but I will say, choose both a topic and an advisor you find interesting. This is intended to be a collection of tools I found useful during my journey. I do not think the list is specific to data science; it could easily apply to: mathematics, statistics, computer science, engineering, or any other highly quantitative field.
All these tools have free versions to get you started. A few have discounted upgrades for students.
- Use an online tool such as ShareLaTeX.
How does this tool benefit you? It saves you from having to install LaTeX, stores a history of previous versions of your document, and lets you write on any machine with an internet connection. In addition, ShareLaTeX has existing templates for many universities. Students can even get half-priced premium accounts to collaborate and sync with GitHub and Dropbox. While LaTeX is not perfect, I do not know of any better tool for writing mathematical documents.
- Use GitHub to store your data and source code.
At some point, you will hopefully want to share your results. GitHub is the de facto standard for sharing open source code. It also works very well for storing data, even large datasets. You might also discover another open source project you want to get involved with. As a definite bonus, many future non-academic employers encourage a GitHub account during the application process. Thus, the sooner you start, the better.
- Use a Cloud Computing Platform such as Sense.
Don’t spend your time building a cluster of computers unless your dissertation topic involves cluster computing. Solve your own problem, not infrastructure problems. Sense and similar platforms provide access to massive computing power at low cost. Plus, Sense provides collaboration, sharing, scheduling, notifications, analysis recreation, and many other features you might find beneficial.
- Use Creately for creating diagrams.
Creating flowcharts and technical diagrams can be a pain, especially if you do not have expensive diagram software. Creately is a simple solution to this problem.
There is your list of helpful tools for writing a data science dissertation. Do you have any tools you think I missed? If so, please leave a comment.
The Twitter hashtag #SoDS is being used in 2015 to help people track and share what they are learning. The hashtag originated on the Becoming Data Scientist blog.
I recently wrote a post for Sense about a number of freely available learning opportunities this summer, Start Learning with the Summer of Data Science. The post covers:
- MOOCs starting soon
- Large list of open-access journals
If you are interested, go check out the post and start your #SoDS. Hurry, many of the opportunities start very soon.
What is Sex Trafficking?
It is a form of human trafficking and according to Wikipedia, human trafficking
is the trade in humans, most commonly for the purpose of sexual slavery, forced labor or commercial sexual exploitation for the trafficker or others; or for the extraction of organs or tissues, including surrogacy and ova removal; or for providing a spouse in the context of forced marriage.
Human trafficking is modern-day slavery, and at any time more than 20 million people worldwide are victims (see the US Trafficking In Persons Report). Sex trafficking is a specific form of human trafficking for the purposes of sexual exploitation.
What does this have to do with Data Science?
Well, data is being collected about victims of human trafficking and online advertisements targeting potential victims. Lots of data is being collected and not enough analysis is being done. Luckily, some organizations have teamed up to help fight human trafficking and sex slavery.
Startups like Palantir and SumAll are getting involved in the fight against human trafficking. Startups are not alone; governments and universities are also getting involved.
- Palantir is working with Google and Polaris, an organization focused on eliminating human slavery in the U.S. and globally, to coordinate efforts of local human trafficking hotlines. Together they created the Global Human Trafficking Hotline Network. See the latest update from Google Ideas.
- SumAll.org, a non-profit spin-off of the startup SumAll.com, is an organization that provides data analytic capabilities to non-profit organizations making a social impact for good. One of the first projects of SumAll.org was Human Trafficking.
- Rescue Forensics is a Y Combinator startup helping law enforcement collect online data to capture and prosecute human traffickers.
- DARPA and Carnegie Mellon University are jointly working to use Natural Language Processing (NLP), computer vision, and machine learning to identify online ads used by sex traffickers.
- Thorn is an organization that is specifically trying to stop the trafficking of children.
- Even the Clinton Foundation is involved.
How can you get involved?
The issue of human trafficking is very complicated, and it will take many years and many people to solve. Therefore, it is a great time for you to get involved. Below are some organizations seeking volunteers:
- DataKind is an organization working to match up data scientists with data from non-profits. DataKind runs data-dives/hackathons and special projects centered on using data science to produce a positive global impact. Not all of the organizations DataKind works with are focused on human trafficking, but several are fighting for global human rights. If you are interested in becoming involved with DataKind, please fill out the Get Involved Form. They are looking for volunteers regardless of location.
- SumAll.org is actively seeking volunteers and interns. They are seeking volunteers for data science, visualization, blogging, and KPI monitoring. Like DataKind, not all the projects are human trafficking, but they are focused on changing the world.
- There are also many local organizations working to fight human trafficking. Find an organization near you via the Global Modern Slavery Directory.
- Samaritan’s Purse, although not specifically seeking data volunteers, does a lot of work to combat human trafficking across the globe.
If you are a victim or have information about human trafficking, please call the National Human Trafficking Resource Center at 1 (888) 373-7888.
If you know of any other organizations working on data analysis of human trafficking, please leave a comment.
edX has just announced a new series of Big Data courses. The series consists of 2 courses focused on Apache Spark. If you are not familiar with Spark, it is a very fast engine for large-scale data processing. It claims to perform up to 100 times faster than Hadoop. Here are the 2 courses:
- Introduction to Big Data with Apache Spark
- Scalable Machine Learning
The first course starts June 1, 2015, and lasts four weeks. The second course starts in late June and lasts five weeks.
The courses are free but verifiable certificates can be purchased for $50 per course.
If you have been hoping to learn Spark, this might be just the opportunity you were waiting for.
I just finished my PhD in the Computational Science and Statistics program at South Dakota State University. My dissertation focused on the area of software analytics, sometimes called Data-Driven Software Engineering. Specifically, how does a Software Development Organization evaluate itself? Students have a G.P.A. (Grade Point Average), but organizations do not have a similar evaluation method.
The dissertation introduces the C.R.I. (Cumulative Result Indicator) to provide a single number to evaluate the performance of a software development organization. The C.R.I. focuses on 5 primary elements of a Software Development Organization.
C.R.I. demonstrates what data needs to be calculated, and how that data can be used to create a score. Naturally, this solution will not work in every situation, but it does provide a consistent method for evaluation, and it is flexible to allow only some of the elements or even additional elements.
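To make the idea of a flexible single-number score concrete, here is a minimal sketch. It is not the C.R.I. calculation from the dissertation: the element names, the 0 to 100 scale, and the plain average are all assumptions chosen for illustration. It only shows how a score can still be produced when some elements are left out.

```python
from statistics import mean


def composite_score(element_scores):
    """Average whichever element scores are present.

    A value of None means the element was not measured and is skipped.
    This simple average is an illustrative assumption, not the
    dissertation's actual formula.
    """
    available = [s for s in element_scores.values() if s is not None]
    if not available:
        raise ValueError("at least one element score is required")
    return mean(available)


# Hypothetical element names and values, for illustration only.
scores = {"quality": 80.0, "schedule": 60.0, "cost": None}
overall = composite_score(scores)
```

Adding another element is just another key in the dictionary, which is the kind of flexibility described above.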
That is the brief 1-minute overview of the dissertation. Feel free to read more of the details in the document below.
The source and data files are available on GitHub, Dissertation Scoring SDO.
You can also see results of the analysis on Sense, Scoring an SDO.
This is the first in a series of posts on Data-Driven Software Engineering. In the next few weeks, I will be posting more about the topic. Some posts will be excerpts from the dissertation, and others will be new thoughts on the topic. Stay Tuned!
A new edition of Mining Massive Datasets is now available. It is used for a number of data mining courses at colleges across the US (and globe). Here are just a few of the topics from the book.
- Recommendation Systems
- Dimensionality Reduction
- Social Network Analysis
Originally published in 2012, the Open Data Handbook has released a second edition. The handbook is intended as a guide for organizations or individuals interested in publishing or using open data. The goal is to ensure that data is open and that it is applied as often as possible.
The second edition now includes 3 parts.
- Open Data Guide – The Why?, What? and How? of open data
- Value Stories – Stories of how open data is making a difference
- Resource Library – Videos, presentations, and publications about open data
Following the theme of openness, the Open Data Handbook is open source on GitHub. You are free and encouraged to contribute. There is even an extensive contribution guide if you are interested.
Read the official announcement from the Open Knowledge Foundation.