The Problem with Software Analytics

Software Analytics is the marriage of data science and software engineering.  It hopes to use data generated from software and software engineering processes to provide insights for creating better software.

The following is a quote from a software analytics round table discussion in 2013. All of the round table members are leading academics at prestigious universities.  Obviously, they were chosen because they are very accomplished and know the field.  Now, onto the quote.

Modern software services such as GitHub, BitBucket, Ohloh, Jira, FogBugz, and the like employ wide use of visualization and even bug effort estimation. We can pat ourselves on the backs even if those developers never read a single one of our papers.


Here is the source in IEEE Computer (which you most likely cannot access unless you are an academic): Roundtable: What’s Next in Software Analytics. For non-academics, an InfoQ reprint is available free online.

The academic research community cannot take credit for what GitHub, BitBucket, and others have done. Yes, the academic research community is doing some excellent work, but most software practitioners are not seeing it because that research is hidden in academic journals. The advancements might have occurred simultaneously and coincidentally, but there is no clear causal relationship. Unfortunately, the academic research is not getting into the hands of software practitioners.

I would like to think the target audience of software engineering research would be software engineers, project managers, and developers. However, as this quote points out, those practitioners hardly ever see the research. If the research does not reach its intended audience, then there is a clear problem, and it needs to be fixed.

Unfortunately, I do not yet know what the fix is. If you have any ideas, please leave a comment below.

If there is enough interest, maybe I will start something (I just don’t know yet what that something is).

Spark Summit 2015 Livestream

Apache Spark is currently one of the hottest technologies in data science. That popularity makes Spark Summit 2015 one of the top conferences. Luckily, the conference organizers were kind enough to set up a free Spark Summit 2015 Livestream of the event. Here is a small glimpse of what will be covered:

  • Updates from Matei Zaharia, creator of Spark
  • Spark at NASA
  • Innovation with Spark
  • Three tracks of talks:
    1. Developer
    2. Data Science
    3. Applications
  • And much, much more

The livestream begins today, June 15, 2015, and continues through Wednesday. It appears keynotes and all three tracks will be available via livestream. If you cannot physically make the event, then this is probably the next best thing.

Tools For Writing a Data Science Dissertation

Writing a dissertation can be a long and difficult task. It takes dedication, a good topic, a helpful advisor, some meetings, and a bit of paperwork. However, it is not impossible, and here are some tools to make it easier (hopefully).

This is not intended to be a guide for selecting a topic. I am not qualified to provide that type of advice, but I will say this: choose both a topic and an advisor you find interesting. This is intended to be a collection of tools I found useful during my journey. I do not think the list is specific to data science; it could easily apply to mathematics, statistics, computer science, engineering, or any other highly quantitative field.

All these tools have free versions to get you started. A few have discounted upgrades for students.

  • Use an online \LaTeX tool such as ShareLaTeX.
    How does this tool benefit you? It saves you from having to install a version of \LaTeX, stores a history of your previous versions of the document, and allows you to write on any machine with an internet connection. In addition, ShareLaTeX has existing templates for many universities. Students can even get half-priced premium accounts to collaborate and sync with GitHub and Dropbox. While \LaTeX is not perfect, I do not know of any better tool for writing mathematical documents.
  • Use GitHub to store your data and source code.
    At some point in time, hopefully you will want to share your results. GitHub is the de facto standard for sharing open source code. It also works very well for storing data, even large datasets. You might also discover another open source project you want to get involved with. As a definite bonus, many future non-academic employers encourage a GitHub account during the application process. Thus, the sooner you start the better.
  • Use a Cloud Computing Platform such as Sense.
    Don’t spend your time building a cluster of computers unless your dissertation topic involves cluster computing. Solve your own problem, not infrastructure problems. Sense and others provide access to massive computing power at low cost. Plus, Sense provides collaboration, sharing, scheduling, notifications, analysis recreation, and many other features you might find beneficial.
  • Use Creately for creating diagrams.
    Creating flowcharts and technical diagrams can be a pain, especially if you do not have expensive diagram software. Creately is a simple solution to this problem.

There is your list of helpful tools for writing a data science dissertation. Do you have any tools you think I missed? If so, please leave a comment.

2015 Summer of Data Science Learning

The Twitter hashtag #SoDS is being used in 2015 to help people track and share what they are learning. The hashtag originated on the Becoming a Data Scientist blog.

I recently wrote a post for Sense about a number of freely available learning opportunities this summer, Start Learning with the Summer of Data Science. The post covers:

  • MOOCs starting soon
  • Large list of open-access journals

If you are interested, go check out the post and start your #SoDS. Hurry, many of the opportunities start very soon.

