Tools For Writing a Data Science Dissertation

Writing a dissertation can be a long and difficult task. It takes dedication, a good topic, a helpful advisor, some meetings, and a bit of paperwork. However, it is not impossible, and here are some tools that will hopefully make it easier.

This is not intended to be a guide for selecting a topic. I am not qualified to provide that type of advice, but I will say this: choose both a topic and an advisor you find interesting. This is intended to be a collection of tools I found useful during my journey. I do not think the list is specific to data science; it could easily apply to mathematics, statistics, computer science, engineering, or any other highly quantitative field.

All these tools have free versions to get you started. A few have discounted upgrades for students.

  • Use an online LaTeX tool such as ShareLaTeX.
    How does this tool benefit you? It saves you from having to install a version of LaTeX, stores a history of previous versions of your document, and allows you to write on any machine with an internet connection. In addition, ShareLaTeX has existing templates for many universities. Students can even get half-priced premium accounts to collaborate and sync with GitHub and Dropbox. While LaTeX is not perfect, I do not know of any better tool for writing mathematical documents.
  • Use GitHub to store your data and source code.
    At some point in time, hopefully you will want to share your results. GitHub is the de facto standard for sharing open source code. It also works very well for storing data, even large datasets. You might also discover another open source project you want to get involved with. As a definite bonus, many future non-academic employers encourage a GitHub account during the application process. Thus, the sooner you start, the better.
  • Use a Cloud Computing Platform such as Sense.
    Don’t spend your time building a cluster of computers unless your dissertation topic involves cluster computing. Solve your own problem, not infrastructure problems. Sense and others provide access to massive computing power at low cost. Plus, Sense provides collaboration, sharing, scheduling, notifications, analysis recreation, and many other features you might find beneficial.
  • Use Creately for creating diagrams.
    Creating flowcharts and technical diagrams can be a pain, especially if you do not have expensive diagram software. Creately is a simple solution to this problem.

There is your list of helpful tools for writing a data science dissertation. Do you have any tools you think I missed? If so, please leave a comment.

2015 Summer of Data Science Learning

The Twitter hashtag #SoDS is being used in 2015 to help people track and share what they are learning. The hashtag originated on the Becoming Data Scientist blog.

I recently wrote a post for Sense about a number of freely available learning opportunities this summer, Start Learning with the Summer of Data Science. The post covers:

  • MOOCs starting soon
  • Large list of open-access journals

If you are interested, go check out the post and start your #SoDS. Hurry, many of the opportunities start very soon.

Data Science, Startups, and Sex Trafficking

What is Sex Trafficking?

It is a form of human trafficking. According to Wikipedia, human trafficking

is the trade in humans, most commonly for the purpose of sexual slavery, forced labor or commercial sexual exploitation for the trafficker or others; or for the extraction of organs or tissues, including surrogacy and ova removal; or for providing a spouse in the context of forced marriage.

Human trafficking is modern-day slavery, and at any time more than 20 million people worldwide are victims (see the US Trafficking In Persons Report). Sex trafficking is a specific form of human trafficking for the purposes of sexual exploitation.

What does this have to do with Data Science?

Well, data is being collected about victims of human trafficking and online advertisements targeting potential victims. Lots of data is being collected and not enough analysis is being done. Luckily, some organizations have teamed up to help fight human trafficking and sex slavery.

Startups like Palantir and SumAll are getting involved in the fight against human trafficking. Startups are not alone; governments and universities are also getting involved.

  • Palantir is working with Google and Polaris, an organization focused on eliminating human slavery in the U.S. and globally, to coordinate efforts of local human trafficking hotlines. Together they created the Global Human Trafficking Hotline Network. See the latest update from Google Ideas.
  • SumAll.org, a non-profit spin-off of the startup SumAll.com, is an organization that provides data analytic capabilities to non-profit organizations making a social impact for good. One of the first projects of SumAll.org was Human Trafficking.
  • Rescue Forensics is a Y Combinator startup helping law enforcement collect online data to capture and prosecute human traffickers.
  • DARPA and Carnegie Mellon University are jointly working to use Natural Language Processing (NLP), computer vision, and machine learning to identify online ads used by sex traffickers.
  • Thorn is an organization that is specifically trying to stop the trafficking of children.
  • Even the Clinton Foundation is involved.

How can you get involved?

The issue of human trafficking is very complicated, and it will take many years and many people to solve. Therefore, it is a great time for you to get involved. Below are some organizations seeking volunteers:

  • DataKind is an organization working to match up data scientists with data from non-profits. DataKind runs a number of data-dives/hackathons and special projects centered around using data science to help produce a positive global impact. DataKind works with many organizations, and although not all of them focus on human trafficking, several are fighting for global human rights. If you are interested in becoming involved with DataKind, please fill out the Get Involved Form. They are looking for volunteers regardless of your location.
  • SumAll.org is actively seeking volunteers and interns for data science, visualization, blogging, and KPI monitoring. Like DataKind, not all of its projects involve human trafficking, but they are focused on changing the world.
  • There are also many local organizations working to fight human trafficking. Find an organization near you via the Global Modern Slavery Directory.
  • Samaritan’s Purse, although not specifically seeking data volunteers, does a lot of work to combat human trafficking across the globe.

If you are a victim or have information about human trafficking, please call the National Human Trafficking Resource Center at 1 (888) 373-7888.

If you know of any other organizations working on data analysis of human trafficking, please leave a comment.

Learn Apache Spark this Summer with edX

edX has just announced a new series of Big Data courses. The series consists of 2 courses focused on Apache Spark. If you are not familiar with Spark, it is a very fast engine for large-scale data processing. It claims to perform up to 100 times faster than Hadoop MapReduce. Here are the 2 courses:

  1. Introduction to Big Data with Apache Spark
  2. Scalable Machine Learning

The first course starts June 1, 2015, and lasts four weeks. The second course starts in late June and lasts five weeks.

The courses are free, but verified certificates can be purchased for $50 per course.

If you have been hoping to learn Spark, this might be just the opportunity you were waiting for.

Scoring A Software Development Organization With A Single Number

I just finished my PhD in the Computational Science and Statistics program at South Dakota State University. My dissertation focused on the area of software analytics, sometimes called Data-Driven Software Engineering. Specifically, how does a Software Development Organization evaluate itself? Students have a G.P.A. (Grade Point Average), but organizations do not have a similar evaluation method.

The dissertation introduces the C.R.I. (Cumulative Result Indicator) to provide a single number to evaluate the performance of a software development organization. The C.R.I. focuses on 5 primary elements of a Software Development Organization.

  1. Quality
  2. Availability
  3. Satisfaction
  4. Schedule
  5. Requirements

C.R.I. demonstrates what data needs to be calculated, and how that data can be used to create a score. Naturally, this solution will not work in every situation, but it does provide a consistent method for evaluation, and it is flexible to allow only some of the elements or even additional elements.
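To make the idea of a single score concrete, here is a minimal sketch in Python. It is not the formula from the dissertation. It assumes each element has already been reduced to a number on a common scale, and it simply takes a weighted average, much like a G.P.A. The function name `cumulative_score`, the weights, and the sample values are all hypothetical and chosen only for illustration.

```python
from typing import Mapping, Optional


def cumulative_score(
    element_scores: Mapping[str, float],
    weights: Optional[Mapping[str, float]] = None,
) -> float:
    """Combine per-element scores into one number with a weighted average.

    This is an illustrative aggregation, not the dissertation's C.R.I. formula.
    Elements that are missing from element_scores are simply left out.
    """
    if not element_scores:
        raise ValueError("at least one element score is required")
    if weights is None:
        # Assumption: every element counts equally unless weights are given.
        weights = {name: 1.0 for name in element_scores}
    total_weight = sum(weights[name] for name in element_scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(
        score * weights[name] for name, score in element_scores.items()
    )
    return weighted_sum / total_weight


# Hypothetical scores for the five elements, each on a 0-100 scale.
scores = {
    "quality": 80.0,
    "availability": 90.0,
    "satisfaction": 70.0,
    "schedule": 60.0,
    "requirements": 100.0,
}

overall = cumulative_score(scores)

# Only some of the elements, with quality weighted more heavily.
partial = cumulative_score(
    {"quality": 80.0, "schedule": 60.0},
    weights={"quality": 3.0, "schedule": 1.0},
)

print(overall, partial)
```

Because the function only looks at the elements it is given, dropping an element or adding a new one requires no code changes, which mirrors the flexibility described above.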

That is the brief 1-minute overview of the dissertation. Feel free to read more of the details in the document below.

The source and data files are available on GitHub, Dissertation Scoring SDO.

You can also see results of the analysis on Sense, Scoring an SDO.

This is the first in a series of posts on Data-Driven Software Engineering. In the next few weeks, I will be posting more about the topic. Some posts will be excerpts from the dissertation, and others will be new thoughts on the topic. Stay Tuned!

Free Book, Mining Massive Datasets, 2nd Edition

A new edition of Mining Massive Datasets is now available. It is used for a number of data mining courses at colleges across the US (and the globe). Here are just a few of the topics from the book.

  • Map-reduce
  • Clustering
  • Recommendation Systems
  • Dimensionality Reduction
  • Social Network Analysis

The New Open Data Handbook

Originally published in 2012, the Open Data Handbook has released a second edition. The handbook serves as a guide for organizations or individuals interested in publishing and/or utilizing open data. The goal is to ensure that data is open and that it is applied as often as possible.

The second edition now includes 3 parts.

  1. Open Data Guide – The Why?, What? and How? of open data
  2. Value Stories – Stories of how open data is making a difference
  3. Resource Library – Videos, presentations, and publications about open data

Following the theme of open, the Open Data Handbook is open sourced on GitHub. You are free and encouraged to contribute. There is even an extensive contribution guide if you are interested.

Read the official announcement from the Open Knowledge Foundation.

Data Science Wars: R vs. Python

The great team over at DataCamp, an online site for learning R, has put together another wonderful infographic. This time, the topic is Data Science Wars (R versus Python). This has been a rather hot topic for quite some time. I even wrote about the debate back in 2013, R vs Python, The Great Debate.

DataCamp did an impressive job packing so much information into a single infographic. Some of the topics covered are:

  • History
  • Who uses the language?
  • Community
  • Purpose of the language
  • Popularity
  • And way more great stuff

Enough about the description. Have a look for yourself. It is packed with great arguments for your next “R vs Python” debate.

R vs Python for data analysis

Deep Learning in 2015 at Oxford

Nando de Freitas taught a deep learning course at the University of Oxford. All of the videos are freely available. The playlist is a bit out of order, but starting with Lecture 1 is probably the best approach.

Data Science Tech Institute Visiting Faculty

The Data ScienceTech Institute (DSTI) in France is starting 2 new master’s degree programs in data science. Both programs are highly innovative and offer a strong industry focus. Classes begin in October 2015, and each program is limited to 30 students. Therefore, if you are interested, it is important to apply as soon as possible.

The other day, the faculty at DSTI were announced. I am honored to say I was selected as one of the faculty. Thus, I will serve as a visiting faculty member for portions of the program.

DSTI offers 2 master’s degree programs:

  1. Data Scientist Designer – Located in Paris, this 2-year program is part-time and focused on working professionals looking to transition or enhance skills in the data science field. The course will rotate between 2 and 3 days a week.
  2. Executive Big Data Analyst – Located in Nice along the French Riviera, this program is a more traditional intensive 16-month program targeting full-time students.

If you are in France or Europe or interested in studying in France, the programs from DSTI are definitely worth a look.