R is a hugely popular language among data scientists and statisticians. One of the difficulties with open-source R is its memory constraint: all the data must be loaded into memory, typically as a data.frame. Microsoft addresses this problem with the RevoScaleR package in Microsoft R Server. Just launched this week is an edX course on Analyzing Big Data with Microsoft R Server.
According to the syllabus:
Upon completion, you will know how to use R for big-data problems.
Full Disclosure: I work at Microsoft, and the course instructor, Seth Mottaghinejad, is one of my colleagues.
Professor Norm Matloff from the University of California, Davis has published From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science, an open textbook. It approaches statistics from a computer science perspective. Dr. Matloff has been a professor of both statistics and computer science, so he is well suited to write such a textbook. This would be a good choice of textbook for a statistics course aimed primarily at computer scientists. It uses the R programming language. The book starts by building the foundations of probability before moving on to statistics.
If you are looking to get started in the field of data science in 2014, then DataCamp just might be the site for you. DataCamp, formerly DataMind, provides interactive data analysis tutorials in the browser. The data analysis is taught using R.
The DataCamp platform provides:
Courses to learn data science
A platform to create new courses
If you are familiar with Codecademy, DataCamp follows very much the same model, but for data analysis instead of programming. This is definitely a site to watch in 2014.
Here is my opinion. I prefer R to Python when performing exploratory data analysis. R has so many packages for every possible statistical technique. The plots, although not beautiful by default, are quick and easy to create. However, I prefer Python when I need to pull data from an API or build a software system or website. Python is more than just a statistical analysis tool; it is a complete programming language. I might even end up using Java for a project in the near future.
There does not have to be a clear winner or one single language to use. Use the best tool for the job and get on with your data science. In the end, the world cares more about what you produced than whether you used R or Python or something else.
These programs helped Engineers and Managers transform into Hadoop Developers/Data Scientists, get industry certifications, revolutionize their workspace and establish exciting careers.
•Taught by experts who are Carnegie Mellon, Johns Hopkins and Stanford University’s alumni with Fortune 50 experience
•Applied and interactive classes
•Classes ranked among the top 1% and 5% of all classes in the world on Piazza
•1/3rd the cost of other similar programs
•95% Success with Cloudera and EMC2
Here are some R commands that might prove helpful for cleaning data.
sub() replaces the first occurrence of a pattern
gsub() replaces all occurrences of a pattern
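As a quick illustration of the difference between the two, here is a minimal sketch. The raw_names vector is made up for this example and is not from any real dataset.

```r
# Hypothetical messy labels, invented purely for illustration
raw_names <- c("first_name_1", "last_name_2")

# sub() replaces only the first match in each string
first_only <- sub("_", " ", raw_names)

# gsub() replaces every match in each string
all_matches <- gsub("_", " ", raw_names)

print(first_only)
print(all_matches)
```

Both functions work element by element on a character vector, so they can be applied directly to a data frame column.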
Quantitative Variables in Ranges
cut(data$col, seq(0,100, by=10)) breaks the data up by the range it falls into; in this example, whether the observation is between 0 and 10, 10 and 20, 20 and 30, and so on
cut2(data$col, g=6) from the Hmisc package returns a factor variable with 6 groups
cut2(data$col, m=25) returns a factor variable that aims for at least 25 observations in each group
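Here is a small sketch of binning with cut(). The scores vector is invented for illustration. The cut2() call is guarded because it assumes the Hmisc package is installed, which may not be the case on your machine.

```r
# Hypothetical scores, invented purely for illustration
scores <- c(5, 15, 25, 95, 100)

# Bin into ranges of width 10. By default each interval is open on the
# left and closed on the right, e.g. (0,10], so a value of exactly 0
# would become NA unless include.lowest = TRUE is passed.
binned <- cut(scores, breaks = seq(0, 100, by = 10))
print(table(binned))

# cut2() lives in the Hmisc package; only run it if Hmisc is available
if (requireNamespace("Hmisc", quietly = TRUE)) {
  grouped <- Hmisc::cut2(scores, g = 2)  # ask for 2 quantile groups
  print(table(grouped))
}
```

The result of cut() is a factor, so it can be passed straight to table() or used as a grouping variable.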
merge() combines data frames on shared columns
sort() sorts a vector
order(data$col, na.last=TRUE) returns the indices that put the column in sorted order, with missing values last
data[order(data$col, na.last=TRUE),] reorders the entire data frame based upon the col
melt() from the reshape2 package reshapes data from wide to long format
rbind() adds more rows to a data frame
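The ordering and combining functions can be sketched together using base R only. The people and cities data frames below are made up for illustration.

```r
# Hypothetical data frames, invented purely for illustration
people <- data.frame(id = c(3, 1, 2), age = c(40, NA, 25))
cities <- data.frame(id = c(1, 2, 3), city = c("Oslo", "Lima", "Pune"))

# order() returns row positions; na.last = TRUE puts the missing age last
by_age <- people[order(people$age, na.last = TRUE), ]

# merge() joins the two data frames on the shared "id" column
joined <- merge(people, cities, by = "id")

# rbind() appends rows; the column names must match
more_people <- rbind(people, data.frame(id = 4, age = 31))

print(by_age)
print(joined)
print(more_people)
```

Note that merge() sorts its result by the key column by default, so the row order of joined may differ from the row order of people.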
Obviously, these functions have other parameters to do a lot more. There are also a number of other helpful R functions, but these are enough to get you started. Check the R help (?functionname) for more details.