R Commands for Cleaning Data

This post collects notes from the Coursera Data Analysis course.

Here are some R commands that are helpful for cleaning data.

String Replacement

  • sub() replaces the first match of a pattern in each string
  • gsub() replaces all matches of a pattern in each string
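
To make the difference concrete, here is a minimal sketch using made-up column names (raw_names is a hypothetical vector, not from the course). Both functions treat the pattern as a regular expression by default; passing fixed = TRUE matches it literally instead.

```r
# Hypothetical column names with underscores to clean up
raw_names <- c("first_name_x", "last_name_y")

# sub() changes only the first "_" in each string
first_only <- sub("_", " ", raw_names)

# gsub() changes every "_" in each string
all_matches <- gsub("_", " ", raw_names)

print(first_only)   # "first name_x" "last name_y"
print(all_matches)  # "first name x" "last name y"

# "." is a regex wildcard, so use fixed = TRUE to match a literal dot
no_dots <- gsub(".", "", "a.b.c", fixed = TRUE)
print(no_dots)      # "abc"
```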

Quantitative Variables in Ranges

  • cut(data$col, seq(0,100, by=10)) bins each value by the interval it falls into, in this example: (0,10], (10,20], (20,30], and so on up to (90,100]. Intervals are closed on the right by default, so a value of exactly 0 becomes NA unless include.lowest=TRUE is set
  • cut2(data$col, g=6), from the Hmisc package, returns a factor variable with 6 quantile groups
  • cut2(data$col, m=25) returns a factor variable that aims for at least 25 observations in each group
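
The binning behavior is easier to see on a small example. The sketch below uses made-up values (scores and x are hypothetical). The cut2() part assumes the Hmisc package is installed, so it is guarded and skipped otherwise.

```r
# Made-up values, including both endpoints of the range
scores <- c(0, 5, 10, 15, 99, 100)

# Default: intervals are open on the left, closed on the right
binned <- cut(scores, breaks = seq(0, 100, by = 10))
print(binned)  # the 0 becomes NA; 10 falls into (0,10]

# include.lowest = TRUE closes the first interval on the left: [0,10]
binned_incl <- cut(scores, breaks = seq(0, 100, by = 10),
                   include.lowest = TRUE)
print(table(binned_incl))

# cut2() lives in Hmisc, which is not part of base R
if (requireNamespace("Hmisc", quietly = TRUE)) {
  set.seed(1)
  x <- runif(100, min = 0, max = 100)
  quartiles <- Hmisc::cut2(x, g = 4)  # 4 quantile groups
  print(table(quartiles))
}
```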

Manipulating Rows/Columns

  • merge() combines two data frames by matching on common columns
  • sort() sorts a vector
  • order(data$col, na.last=TRUE) returns the row indexes that would put data$col in sorted order, with NA values last
  • data[order(data$col, na.last=TRUE),] reorders the entire data frame based on col
  • melt() in the reshape2 package reshapes data from wide to long format
  • rbind() appends rows to a data frame
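
The row and column functions above can be sketched together on two small made-up data frames (people, scores, and extra are hypothetical). The melt() call assumes the reshape2 package is installed, so it is guarded.

```r
# Two hypothetical data frames sharing an "id" column
people <- data.frame(id = c(1, 2, 3),
                     name = c("Ann", "Bob", "Cy"),
                     stringsAsFactors = FALSE)
scores <- data.frame(id = c(3, 1, 2), score = c(NA, 72, 55))

# merge() joins on the shared column; the result is sorted by id
merged <- merge(people, scores, by = "id")

# order() gives row indexes; NA scores go last
ordered <- merged[order(merged$score, na.last = TRUE), ]
print(ordered)

# sort() works on a vector and drops NA values by default
sorted_scores <- sort(merged$score)
print(sorted_scores)  # 55 72

# rbind() appends rows; column names must match
extra <- data.frame(id = 4, name = "Di", score = 90,
                    stringsAsFactors = FALSE)
combined <- rbind(merged, extra)

# melt() stacks the measured columns into variable/value pairs
if (requireNamespace("reshape2", quietly = TRUE)) {
  long <- reshape2::melt(combined, id.vars = c("id", "name"))
  print(long)
}
```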

These functions take additional parameters that let them do much more, and R has many other helpful functions, but these are enough to get you started. Check the R help (?functionname) for more details.