Big Data Journal: 5 articles to highlight

The inaugural issue of Big Data was published a few weeks ago. The journal is excellent. The articles are relevant, readable, and free. In the first issue, most of the articles were not highly technical (meaning there were not many equations or algorithms). I would like to highlight just 5 of the articles (feel free to read the others as well).

  1. Making Sense of Big Data – A nice brief discussion of the term big data and some goals for the journal.
  2. Big Data For Development – This is an introduction to United Nations Global Pulse, an initiative to use data to better understand human well-being.
  3. Broad Data: Exploring the Emerging Web of Data – This article is all about dealing with the explosion of open data becoming available.
  4. Data Science and Its Relationship to Big Data and Data-Driven Decision Making – The title is pretty self-explanatory. The article points out 7 fundamental concepts of data science.
  5. Educating the Next Generation of Data Scientists – This is a roundtable discussion all about data science and data science education.

Definition of Big Data

Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.

This definition is provided by Edd Dumbill, Editor-in-Chief of Big Data. It appeared in the March 2013 issue, in the article Making Sense of Big Data.

R Graph Commands for Data Analysis

This post contains notes from the Coursera Data Analysis Course.

Here are some basic R commands for creating some graphs.

Exploratory Graphs


boxplot
barplot
hist
plot
density
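
As a rough illustration, here is how the commands above might be called on an actual data frame. The built-in mtcars data set and the choice of columns are assumptions made for this sketch, not something from the course notes. Note that barplot() is the base-graphics bar chart function, and that density() only computes the estimate, so it needs plot() to draw it.

```r
# Exploratory graphs on the built-in mtcars data set (assumed example data)
data(mtcars)

boxplot(mpg ~ cyl, data = mtcars)   # spread of mpg for each cylinder count
barplot(table(mtcars$cyl))          # number of cars per cylinder count
hist(mtcars$mpg)                    # histogram of mpg
plot(mtcars$wt, mtcars$mpg)         # scatterplot of weight against mpg

dens <- density(mtcars$mpg)         # kernel density estimate (no drawing yet)
plot(dens)                          # draw the estimated density curve
```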

Final Graphs for a report

Final graphs need to look a little nicer. They must also have informative labels and a title and possibly a legend.

plot(data$column1, data$column2, pch=19, col='blue', cex=0.5,
xlab='X axis label', ylab='Y axis label', main='Title of Graph',
cex.lab=2, cex.axis=1.5)

legend(100,200, legend='Legend Info', col='blue', pch=19, cex=0.5)

Multipanels

It is often useful to display more than one graph at a time. Here is some code to display 2 graphs horizontally on the same panel.

par(mfrow=c(1,2))
plot(data$column1, data$column2)
plot(data$column3, data$column4)
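
One thing to watch for: par(mfrow=...) stays in effect for later plots on the same device. A minimal sketch of one way to restore the previous layout afterwards is below. It assumes the built-in mtcars data set, and relies on par() returning the old settings when it changes them.

```r
# Two plots side by side, then restore the original layout
data(mtcars)

old_par <- par(mfrow = c(1, 2))   # par() returns the settings it replaced
plot(mtcars$wt, mtcars$mpg, main = 'Weight vs. MPG')
plot(mtcars$hp, mtcars$mpg, main = 'Horsepower vs. MPG')
par(old_par)                      # back to one plot per panel
```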

Figure Captions


mtext(text='some caption')

Create a PDF


pdf(file='myfile.pdf',height=4,width=8)
par(mfrow=c(1,3))
hist(...)
mtext(text='caption',side=3,line=1)
plot(...)
mtext(...)
boxplot(...)
mtext(...)
dev.off()

A very similar thing can be done for PNG image files. Just use png() at the beginning instead.
Use dev.copy2pdf(file='myfile.pdf') to save an existing graph to a file.
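
For reference, a filled-in version of the PDF recipe might look like the sketch below. The mtcars data set, the columns, the captions, and the temporary output path are all assumptions chosen so the example runs on its own. Keep in mind that pdf() takes its height and width in inches, while png() defaults to pixels.

```r
# Write a three-panel figure to a PDF file (all data and labels are assumed examples)
data(mtcars)
out_file <- file.path(tempdir(), 'myfile.pdf')   # hypothetical output location

pdf(file = out_file, height = 4, width = 8)      # size in inches
par(mfrow = c(1, 3))                             # one row, three panels
hist(mtcars$mpg, main = '', xlab = 'Miles per gallon')
mtext(text = '(a) histogram', side = 3, line = 1)
plot(mtcars$wt, mtcars$mpg, pch = 19, xlab = 'Weight', ylab = 'Miles per gallon')
mtext(text = '(b) scatterplot', side = 3, line = 1)
boxplot(mpg ~ cyl, data = mtcars, xlab = 'Cylinders', ylab = 'Miles per gallon')
mtext(text = '(c) boxplot', side = 3, line = 1)
invisible(dev.off())                             # close the device so the file is written
```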

9 problems with Real World Regression

This list comes from the Coursera Data Analysis Course.

Linear and Logistic Regression are some of the most common techniques applied in data analysis. Here is a list of possible problems with regression in the real world.

  1. Confounders – a variable that is correlated with both the outcome and other variables in the model
  2. Complicated Interactions – how do the covariates interact
  3. Skewness – the data is not evenly distributed, but heavy to one side or the other
  4. Outliers – data points that don’t fit the pattern
  5. Non-linear Patterns – not all datasets can be fit with a straight line
  6. Variance Changes – the spread of the outcome is not constant across the range of the data
  7. Units/Scale issues – make sure the units are standard across the model
  8. Overloading Regression – too much complexity
  9. Correlation does not imply Causation
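
To make the first item concrete, here is a small simulation sketch of a confounder. Everything in it (the variable names, the coefficients, the sample size) is made up for illustration. The outcome y depends only on z, but because x also depends on z, a regression of y on x alone reports a clear slope that mostly disappears once z is included.

```r
# Simulated confounding: z drives both x and y; x has no direct effect on y
set.seed(42)
n <- 1000
z <- rnorm(n)             # the confounder
x <- z + rnorm(n)         # x is correlated with z
y <- 2 * z + rnorm(n)     # y depends on z only

naive    <- lm(y ~ x)     # leaves the confounder out
adjusted <- lm(y ~ x + z) # includes the confounder

coef(naive)['x']          # clearly non-zero, even though x does nothing
coef(adjusted)['x']       # close to zero once z is in the model
```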

What other problems do you find when using regression on real-world data? Do you know of other problems that are missing from this list?

Data Mining Map

Dr. Saed Sayad, a Professor at the University of Toronto, has created a great diagram about the elements of Data Mining. It is a map (a mathematical tree structure) that shows many of the common techniques in data mining and when to apply each one. Note: you can click on individual elements in the map for more details.

Programmer's Guide to Data Mining – A free ebook

Ron Zacharski is currently writing a data mining book, A Programmer’s Guide to Data Mining. The book is targeted at programmers who want to know when and how to apply recommendation engines and other data mining techniques. The book is still in the writing phase, but I can say the first couple of chapters are excellent. The book will always be available for free download.

If you are a programmer who is looking to add some recommendations to a website, I would highly suggest taking a look at this book.

First Steps to Data Analysis in R

This post contains notes from the Coursera Data Analysis Course.

Here are some basic R commands that should be useful for obtaining data and looking at it in R. Ideally, these commands are useful for steps 4, 5, and 6 of the 11 Steps to Data Analysis.

Load the data and just look at it


download.file('http://location.com', 'localfile.csv')
data <- read.csv('localfile.csv')
dim(data)
names(data)
quantile(data$column)
hist(data$column)
head(data)
summary(data)
str(data)
unique(data$column)
length(unique(data$column))
table(data$column)  # count of how many times each value appears in the column
table(data$column1, data$column2)

any(data$column < 100)
all(data$column > 100)

colSums(data)
colMeans(data, na.rm=TRUE)
rowMeans(data, na.rm=T)
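
The commands above use placeholder names, so here is a self-contained sketch that can be pasted into R as-is. It assumes the built-in mtcars data set and writes it to a temporary CSV file as a stand-in for the download.file() step; the file path and column choices are illustrative only.

```r
# Round trip through a CSV file, then a first look at the data
local_file <- file.path(tempdir(), 'localfile.csv')   # hypothetical local path
write.csv(mtcars, local_file, row.names = FALSE)      # stand-in for download.file()

data <- read.csv(local_file)
dim(data)                     # number of rows and columns
names(data)                   # column names
quantile(data$mpg)            # min, quartiles, and max of one column
length(unique(data$cyl))      # how many distinct values a column has
table(data$cyl)               # count of each value
table(data$cyl, data$gear)    # cross-tabulation of two columns

any(data$mpg < 15)            # is at least one value below 15?
all(data$mpg > 10)            # are all values above 10?

colMeans(data, na.rm = TRUE)  # works here because every column is numeric
```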

Look for missing values


is.na(data$column)
sum(is.na(data$column))
table(data$column, useNA="ifany")
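
To try these on data that really does have gaps, the built-in airquality data set is one option (its use here is an assumption for the sake of the example). The sketch below counts the missing values in one column and then in every column at once.

```r
# Missing-value checks on the built-in airquality data set (assumed example data)
data(airquality)

ozone_missing <- sum(is.na(airquality$Ozone))      # TRUE counts as 1, so this counts NAs
ozone_missing

missing_per_column <- colSums(is.na(airquality))   # NA count for every column
missing_per_column

table(is.na(airquality$Ozone))                     # missing vs. present, side by side
colMeans(airquality, na.rm = TRUE)                 # column means, ignoring the NAs
```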

For more information on any R command, type ? followed by the command name in the R console. For example, if you want to know more about the dim command, just type ?dim

D3.js Gallery Data

I believe Christophe Viau put this list together. It is a very impressive list of D3.js examples. Each example includes the graph and the code to generate it.

D3.js Gallery Data – temporarily in view mode – Google Docs.

A more interactive and visual view of the examples can be found at this new, not yet complete, D3 Gallery.