Category Archives: Data Analysis

R Commands for Cleaning Data

This post is notes from the Coursera Data Analysis Course.

Here are some R commands that might prove helpful for cleaning data.

String Replacement

  • sub() replaces the first occurrence
  • gsub() replaces all occurrences
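
As a small illustration of the difference, here is a sketch using made-up strings (the vector x and its contents are assumptions for the example, not data from the course):

```r
# Hypothetical column names to clean up
x <- c("first_name_raw", "last_name_raw")

# sub() changes only the first match in each element
first_only <- sub("_", " ", x)

# gsub() changes every match in each element
all_matches <- gsub("_", " ", x)

print(first_only)
print(all_matches)
```

Both functions take a regular expression as the first argument, so characters such as . or ( need escaping, or fixed=TRUE, when they should be matched literally.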

Quantitative Variables in Ranges

  • cut(data$col, seq(0,100, by=10)) breaks the data up by the range it falls into, in this example: whether the observation is between 0 and 10, 10 and 20, 20 and 30, and so on
  • cut2(data$col, g=6) from the Hmisc package returns a factor variable with 6 groups
  • cut2(data$col, m=25) returns a factor variable with at least 25 observations in each group
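
Here is a minimal sketch of cut() on a made-up vector (scores is an assumption for the example). Since cut2() needs the Hmisc package, the second half uses base R quantile() breaks instead, which is only roughly similar in spirit to cut2(g=...):

```r
# Hypothetical values between 0 and 100
scores <- c(5, 15, 25, 95, 50)

# Fixed-width bins: (0,10], (10,20], ..., (90,100]
bins <- cut(scores, seq(0, 100, by = 10))
print(table(bins))

# Quantile-based bins in base R (a stand-in for cut2, not the same function)
q_breaks <- quantile(scores, probs = seq(0, 1, by = 0.5))
halves <- cut(scores, q_breaks, include.lowest = TRUE)
print(table(halves))
```

By default the intervals are closed on the right, so a value of exactly 10 lands in (0,10] and a value of exactly 0 becomes NA unless include.lowest=TRUE is set.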

Manipulating Rows/Columns

  • merge() for combining data frames
  • sort() sorts a vector
  • order(data$col, na.last=T) returns the indexes that put the column in sorted order, with missing values last
  • data[order(data$col, na.last=T),] reorders the entire data frame based upon the col
  • melt() in the reshape2 package, this is for reshaping data
  • rbind() adding more rows to a data frame
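
A short sketch that strings merge(), order(), and rbind() together. The two data frames and their column names are invented for the example:

```r
# Hypothetical data frames sharing an "id" column
people <- data.frame(id = c(3, 1, 2), name = c("Cy", "Al", "Bo"),
                     stringsAsFactors = FALSE)
scores <- data.frame(id = c(1, 2, 3), score = c(90, NA, 75))

# Combine the two data frames on the shared key
combined <- merge(people, scores, by = "id")

# Reorder rows by score, keeping rows with a missing score at the end
by_score <- combined[order(combined$score, na.last = TRUE), ]

# Append one more row; the column names must match
extra <- data.frame(id = 4, name = "Di", score = 60,
                    stringsAsFactors = FALSE)
with_extra <- rbind(combined, extra)

print(by_score)
print(with_extra)
```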

Obviously, these functions have other parameters to do a lot more. There are also a number of other helpful R functions, but these are enough to get you started. Check the R help (?functionname) for more details.

R Graph Commands for Data Analysis

This post is notes from the Coursera Data Analysis Course.

Here are some basic R commands for creating some graphs.

Exploratory Graphs


boxplot
barplot
hist
plot
density
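
As a rough sketch of how these might be called, here they are applied to the built-in mtcars dataset. The choice of dataset and columns is an assumption for illustration, and the bar chart uses the base graphics barplot() function:

```r
x <- mtcars$mpg  # built-in example data: miles per gallon

boxplot(x, ylab = "Miles per gallon")

# Bar chart of how many cars have each cylinder count
barplot(table(mtcars$cyl), xlab = "Cylinders", ylab = "Count")

# hist() also returns the bin counts it plotted
h <- hist(x, xlab = "Miles per gallon", main = "")

# Scatterplot of two variables
plot(mtcars$wt, x, xlab = "Weight", ylab = "Miles per gallon")

# density() computes the estimate; plot() draws it
d <- density(x)
plot(d, main = "")
```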

Final Graphs for a report

Final graphs need to look a little nicer. They must also have informative labels and a title and possibly a legend.

plot(data$column1, data$column2, pch=19, col='blue', cex=0.5,
xlab='X axis label', ylab='Y axis label', main='Title of Graph',
cex.lab=2, cex.axis=1.5)

legend(100,200, legend='Legend Info', col='blue', pch=19, cex=0.5)

Multipanels

It is often useful to display more than one graph at a time. Here is some code to display 2 graphs side by side in the same figure.

par(mfrow=c(1,2))
plot(data$column1, data$column2)
plot(data$column3, data$column4)

Figure Captions


mtext(text='some caption')

Create a PDF


pdf(file='myfile.pdf',height=4,width=8)
par(mfrow=c(1,3))
hist(...)
mtext(text='caption',side=3,line=1)
plot(...)
mtext(...)
boxplot(...)
mtext(...)
dev.off()

A very similar thing can be done for PNG image files. Just use png() at the beginning instead.
Use dev.copy2pdf(file='myfile.pdf') to save an existing graph to a file.
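
For a version that runs as written, here is a sketch that fills in the placeholders with the built-in mtcars dataset and writes to a temporary file. The dataset, file name, and caption text are all assumptions for the example:

```r
# Write to a temporary location so nothing in the working directory is touched
out_file <- file.path(tempdir(), "example_panels.pdf")

pdf(file = out_file, height = 4, width = 8)
par(mfrow = c(1, 2))  # one row, two panels

hist(mtcars$mpg, main = "Histogram of mpg", xlab = "Miles per gallon")
mtext(text = "(a)", side = 1, line = 4)

boxplot(mpg ~ cyl, data = mtcars, main = "mpg by cylinders",
        xlab = "Cylinders", ylab = "Miles per gallon")
mtext(text = "(b)", side = 1, line = 4)

dev.off()  # close the device so the file is actually written

print(file.exists(out_file))
```

Forgetting dev.off() is a common reason for an empty or unreadable output file.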

9 problems with Real World Regression

This list comes from the Coursera Data Analysis Course.

Linear and Logistic Regression are some of the most common techniques applied in data analysis. Here is a list of possible problems with regression in the real world.

  1. Confounders – variable that is correlated with both the outcome and other variables in the model
  2. Complicated Interactions – how do the covariates interact
  3. Skewness – is the data not evenly distributed, heavy to one side or the other
  4. Outliers – data points that don’t fit the pattern
  5. Non-linear Patterns – not all datasets can be fit with a straight line
  6. Variance Changes – the variability of the outcome is not constant across the data
  7. Units/Scale issues – make sure the units are standard across the model
  8. Overloading Regression – too much complexity
  9. Correlation does not imply Causation
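
To make one item on the list concrete, here is a small sketch of problem 4 (outliers). The data are made up: ten points that lie exactly on a line, then the same points with a single value replaced by an extreme one.

```r
x <- 1:10
y <- 2 * x + 1  # exactly linear, slope 2

fit_clean <- lm(y ~ x)

# Replace the last observation with an extreme value
y_out <- y
y_out[10] <- 100
fit_outlier <- lm(y_out ~ x)

# Compare the estimated slopes
print(coef(fit_clean)[["x"]])
print(coef(fit_outlier)[["x"]])
```

A single point at the edge of the x range is enough to pull the fitted slope well away from 2, which is why plotting the data before fitting is worth the time.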

What other problems do you find when using regression on real-world data? Are there any missing from this list?

First Steps to Data Analysis in R

This post is notes from the Coursera Data Analysis Course.

Here are some basic R commands that should be useful for obtaining data and looking at data in R. Ideally these commands are useful for steps 4, 5, and 6 of the 11 Steps to Data Analysis.

Load the data and just look at it


download.file('http://location.com', 'localfile.csv')
data <- read.csv('localfile.csv')
dim(data)
names(data)
quantile(data$column)
hist(data$column)
head(data)
summary(data)
str(data)
unique(data$column)
length(unique(data$column))
table(data$column)  # count of how many times each value appears in the column
table(data$column1, data$column2)

any(data$column < 100)
all(data$column > 100)

colSums(data)
colMeans(data, na.rm=T)
rowMeans(data, na.rm=T)
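
Here is a sketch of a few of these commands on a tiny made-up data frame, so the output is easy to check by eye. The column names and values are assumptions for the example:

```r
# Hypothetical data frame with one text column and two numeric columns
data <- data.frame(
  group = c("a", "b", "a", "c", "b", "a"),
  value = c(10, 250, NA, 40, 120, 95),
  other = c(1, 2, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

print(dim(data))                    # number of rows, then columns
print(table(data$group))            # how often each group appears
print(length(unique(data$group)))   # number of distinct groups

# any() and all() return NA when the answer depends on a missing value,
# so na.rm = TRUE is used here to ignore the NA
print(any(data$value < 100, na.rm = TRUE))
print(all(data$value > 100, na.rm = TRUE))

# colSums() and colMeans() need numeric columns only
numeric_cols <- data[, c("value", "other")]
col_totals <- colSums(numeric_cols, na.rm = TRUE)
col_means <- colMeans(numeric_cols, na.rm = TRUE)
print(col_totals)
print(col_means)
```

Calling colSums() on a data frame that still contains a text column stops with an error, so selecting the numeric columns first is a safe habit.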

Look for missing values


is.na(data$column)
sum(is.na(data$column))
table(data$column, useNA="ifany")
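
A minimal sketch of the same checks on a made-up vector (the values are assumptions for the example):

```r
# Hypothetical column with two missing values
column <- c(4, NA, 7, 4, NA)

print(is.na(column))              # TRUE wherever a value is missing

n_missing <- sum(is.na(column))   # TRUE counts as 1, so this counts the NAs
print(n_missing)

# Include NA as its own category in the counts, if any are present
counts <- table(column, useNA = "ifany")
print(counts)

# Keep only the observed values
complete <- column[!is.na(column)]
print(complete)
```

Note that the useNA value is lowercase: "ifany" is accepted, while "ifAny" is rejected with an error.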

For more information on any R command, just type ? followed by the command name in the R console. For example, if you want to know more about the dim command, just type ?dim

Levels of Data Analysis

The list is ordered according to the level of difficulty.

  • Descriptive: just describe the data, common for census type of data
  • Exploratory: find relationships that were not clear beforehand, useful for defining future studies, remember correlation does not imply causation
  • Inferential: use a small dataset to say something about a larger population, most common goal of statistical analysis
  • Predictive: use data from some object to predict something (values) for another object, important to measure the right values and to use as much data as possible
  • Causal: what happens to one variable when you force another variable to change, usually requires a randomized study, this is the gold standard of data analysis
  • Mechanistic: understanding the exact changes in variables that lead to changes in other variables for individual objects, typically from engineering and physical sciences, data analysis can be used to infer the parameters if the equations are known

This list comes from information presented in the first week of the Coursera Data Analysis class.

Data Analysis by Data Type

Data analysis is performed in many different fields and on many different types of data. Most fields call it something different. The following list comes straight from Jeff Leek’s Data Analysis Coursera class.

Name of Data Analysis by Data Type

The type of analysis is very similar for all fields, but what separates data science and machine learning from the others is the 3 V’s of big data. Data science and machine learning deal with a greater Volume of data, Variety of data, and Velocity (speed at which new data appears) of data. Because it is becoming cheaper and easier to store massive amounts of data than ever before, I think the other fields are beginning to realize the potential in big data. Signal processing is definitely becoming an area with big data, due to the fact that electrical sensors are everywhere.

What are your thoughts? Do you see any real differences in the data analysis performed for the data types above?

Data Analysis at Coursera

The Coursera Data Analysis course started yesterday. This course would be an excellent follow-up to the Computing with Data Analysis course. For a bit more about the course, check out this video explaining the content. The course consists of lectures, quizzes, and some data analysis assignments. There is still plenty of time to sign up and start analyzing.