Data Science for Social Good Summer Fellowship

The University of Chicago and Argonne National Labs are hosting the Data Science for Social Good Summer Fellowship 2013. The Fellowship program is open to students at all levels who are interested in working on real-world social problems. The program takes place in Chicago and the application deadline is April 1, 2013, so apply soon.

Online Textbook Publishing Platform?

About a week ago I posted a link to a free data mining textbook. Hacker News got wind of the book as well, and I am guessing a flood of traffic hit the textbook’s site. The flood happened to take the site completely down for a couple of days. It was a shame because the book is really good.

If you frequently read this blog, you will notice it has quite a number of links to free online textbooks. Each free online textbook is available a bit differently. Most are PDF downloads (either by chapter or the entire book) hosted at some person’s personal website or somewhere on a university’s website.

Here is my question. Does the web have a publishing platform for textbooks? Is there a startup working on something like this?


I am aware of Wikibooks, but I just don’t hear much about the quality of the books. As a matter of fact, I just don’t hear much about Wikibooks at all.

Big Data Journal: 5 articles to highlight

The inaugural issue of Big Data was published a few weeks ago. The journal is excellent. The articles are relevant, readable, and free. In the first issue, most of the articles were not super technical (meaning there were not a lot of equations or algorithms). I would like to highlight just 5 of the articles (feel free to read the others as well).

  1. Making Sense of Big Data – A nice brief discussion of the term big data and some goals for the journal.
  2. Big Data For Development – This is an introduction to United Nations Global Pulse, an initiative to use data to better understand human well-being.
  3. Broad Data: Exploring the Emerging Web of Data – This article is all about dealing with the explosion of open data becoming available.
  4. Data Science and Its Relationship to Big Data and Data-Driven Decision Making – The title is pretty self-explanatory. The article points out 7 fundamental concepts of data science.
  5. Educating the Next Generation of Data Scientists – This is a roundtable discussion all about data science and data science education.

Definition of Big Data

Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.

This definition is provided by Edd Dumbill, Editor-in-Chief of Big Data. It appeared in the March 2013 issue in the article Making Sense of Big Data.

R Graph Commands for Data Analysis

This post is notes from the Coursera Data Analysis Course.

Here are some basic R commands for creating some graphs.

Exploratory Graphs


boxplot
barchart
hist
plot
density
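
As a rough sketch of how these commands might be called, here they are applied to the built-in mtcars data set. The data set and column choices are my own assumptions for illustration, not from the course. The sketch uses base R's barplot() for the bar chart, since barchart() is provided by the lattice package, and it passes the result of density() to plot(), because density() only computes the estimate and does not draw anything.

```r
# Exploratory graphs on the built-in mtcars data (an assumed example data set;
# substitute your own data frame and columns).
data(mtcars)

# Distribution of miles per gallon for each cylinder count
boxplot(mpg ~ cyl, data = mtcars)

# Number of cars per cylinder count (base R bar chart)
barplot(table(mtcars$cyl))

# Histogram of miles per gallon
hist(mtcars$mpg)

# Scatterplot of weight against miles per gallon
plot(mtcars$wt, mtcars$mpg)

# density() returns a kernel density estimate; plot() draws it
dens <- density(mtcars$mpg)
plot(dens)
```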

Final Graphs for a report

Final graphs need to look a little nicer. They must also have informative labels and a title and possibly a legend.

plot(data$column1, data$column2, pch=19, col='blue', cex=0.5,
xlab='X axis label', ylab='Y axis label', main='Title of Graph',
cex.lab=2, cex.axis=1.5)

legend(100,200, legend='Legend Info', col='blue', pch=19, cex=0.5)

Multipanels

It is often useful to display more than one graph at a time. Here is some code to display 2 graphs horizontally on the same panel.

par(mfrow=c(1,2))
plot(data$column1, data$column2)
plot(data$column3, data$column4)

Figure Captions


mtext(text='some caption')

Create a PDF


pdf(file='myfile.pdf',height=4,width=8)
par(mfrow=c(1,3))
hist(...)
mtext(text='caption',side=3,line=1)
plot(...)
mtext(...)
boxplot(...)
mtext(...)
dev.off()

A very similar thing can be done for PNG image files. Just use png() at the beginning instead.
Use dev.copy2pdf(file='myfile.pdf') to save an existing graph to a file.
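
Here is a minimal sketch of the PNG variant mentioned above. The file name, the mtcars data, and the plot choices are assumptions for illustration. One difference worth noting: pdf() takes its width and height in inches, while png() takes them in pixels by default, so the numbers need to be larger.

```r
# Write two side-by-side graphs to a PNG file.
# The output path and the mtcars data are assumed for this example.
out_file <- file.path(tempdir(), "myfile.png")

# png() sizes are in pixels by default (pdf() uses inches)
png(filename = out_file, width = 800, height = 400)
par(mfrow = c(1, 2))
hist(mtcars$mpg, main = "Histogram of mpg", xlab = "Miles per gallon")
mtext(text = "caption 1", side = 3, line = 0.3)
boxplot(mpg ~ cyl, data = mtcars, main = "mpg by cylinders",
        xlab = "Cylinders", ylab = "Miles per gallon")
mtext(text = "caption 2", side = 3, line = 0.3)
dev.off()  # close the device so the file is actually written
```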

9 problems with Real World Regression

This list comes from the Coursera Data Analysis Course.

Linear and Logistic Regression are some of the most common techniques applied in data analysis. Here is a list of possible problems with regression in the real world.

  1. Confounders – variable that is correlated with both the outcome and other variables in the model
  2. Complicated Interactions – how do the covariates interact
  3. Skewness – the data are not evenly distributed and lean heavily to one side or the other
  4. Outliers – data points that don’t fit the pattern
  5. Non-linear Patterns – not all datasets can be fit with a straight line
  6. Variance Changes – the spread of the outcome is not constant across the range of the data
  7. Units/Scale issues – make sure the units are standard across the model
  8. Overloading Regression – too much complexity
  9. Correlation does not imply Causation
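
To make a couple of these concrete, here is a small simulated sketch in R. The data, variable names, and coefficients are all made up for illustration. It shows a confounder (item 1) creating an apparent effect that shrinks toward zero once the confounder is added to the model, and then draws a residual plot, which is one common way to look for non-linear patterns and variance changes (items 5 and 6).

```r
# Simulated data: z is a confounder that drives both x and y.
# x has no direct effect on y. All values here are assumptions.
set.seed(1)
n <- 500
z <- rnorm(n)            # confounder
x <- z + rnorm(n)        # covariate correlated with z
y <- 2 * z + rnorm(n)    # outcome depends on z only

naive    <- lm(y ~ x)      # ignores the confounder
adjusted <- lm(y ~ x + z)  # includes the confounder

# Compare the estimated effect of x in the two models
print(coef(naive)["x"])
print(coef(adjusted)["x"])

# Residuals vs fitted values: a curve suggests a non-linear pattern,
# a funnel shape suggests the variance is changing
plot(fitted(adjusted), resid(adjusted), pch = 19, cex = 0.5,
     xlab = "Fitted values", ylab = "Residuals",
     main = "Residuals vs Fitted")
abline(h = 0, col = "blue")
```

In the naive model x looks strongly related to y, even though the simulation gives it no direct effect. That is also a small reminder of item 9.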

What other problems do you find when using regression on real-world data? Are any missing from this list?