Tag Archives: R

Learn to Analyze Big Data with R – Free Course

R is a hugely popular language among data scientists and statisticians. One of the difficulties with open-source R is its memory constraint: all the data needs to be loaded into memory, typically as a data.frame. Microsoft addresses this problem with the RevoScaleR package of Microsoft R Server. Just launched this week is an edX course, Analyzing Big Data with Microsoft R Server.

According to the syllabus:

Upon completion, you will know how to use R for big-data problems.

Full Disclosure: I work at Microsoft, and the course instructor, Seth Mottaghinejad, is one of my colleagues.

Free Stats book for Computer Scientists

Professor Norm Matloff from the University of California, Davis has published From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science, an open textbook. It approaches statistics from a computer science perspective. Dr. Matloff has been a professor of both statistics and computer science, so he is well suited to write such a textbook. This would be a good choice of textbook for a statistics course targeted primarily at computer scientists. It uses the R programming language. The book starts by building the foundations of probability before moving on to statistics.

Introduction to Microsoft R Open (Webinar)

Tomorrow, January 28, 2016, David Smith will present a webinar titled Introduction to Microsoft R Open. David is the R Community Lead at Microsoft. The webinar will discuss:

  • Introduction to R
  • History of R
  • Enhancements of Microsoft R Open (Microsoft’s enhanced distribution of open-source R)
  • CRAN Time Machine
  • Reproducible Data Analysis

If you are looking to get started with R or get more from R, this webinar will be worth your time.

Plus, the webinar is the first in a series of Microsoft webinars focused on R.

Full Disclosure: I work for Microsoft, and I will be helping (in a very minimal capacity) with the webinar.

Data Science Wars: R vs. Python

The great team over at DataCamp, an online site for learning R, has put together another wonderful infographic. This time, the topic is Data Science Wars (R versus Python). This has been a rather hot topic for quite some time. I even wrote about the debate back in 2013 in R vs Python, The Great Debate.

DataCamp did an impressive job packing so much information into a single infographic. Some of the topics covered are:

  • History
  • Who uses the language?
  • Community
  • Purpose of the language
  • Popularity
  • And way more great stuff

Enough about the description. Have a look for yourself. It is packed with great arguments for your next “R vs Python” debate.

R vs Python for data analysis

Learn Data Science Online with DataCamp

If you are looking to get started in the field of data science in 2014, then DataCamp just might be the site for you. DataCamp, formerly DataMind, provides a tutorial for interactive data analysis in the browser. The data analysis is taught using R.

The DataCamp platform provides:

  1. Courses to learn data science
  2. A Platform to create new courses

If you are familiar with Codecademy, DataCamp follows very much the same model, except that it teaches data analysis instead of general programming. This is definitely a site to watch in 2014.

The Interactive Data Analysis Tutorial

Interactive Data Analysis Tutorial

DataCamp Profile

DataCamp profile

R vs Python, The Great Debate

Recently I have seen blogs/articles claiming Python is the best choice for data science and R is the new language for business. Honestly, both articles are truthful and good. Both Python and R are good. Why do we have to choose? Let’s use both.

Here is my opinion. I prefer R to Python when performing exploratory data analysis. R has so many packages for every possible statistical technique. The plots, although not beautiful by default, are quick and easy to create. However, I prefer Python when I need to pull data from an API or build a software system or website. Python is more than just a statistical analysis tool; it is a complete programming language. I might even end up using Java for a project in the near future.

There does not have to be a clear winner or one single language to use. Use the best tool for the job and get on with your data science. In the end, the world cares more about what you produced than whether you used R or Python or something else.

International School of Engineering Programs Beginning Soon

I recently received the following information.

International School of Engineering is announcing their 3rd batch of live e-Learning certificate programs starting 4-Sep-2013 in “Engineering Big Data with R and Hadoop Ecosystem” and “Essentials of Applied Predictive Analytics” (http://goo.gl/kHckP).

These programs helped Engineers and Managers transform into Hadoop Developers/Data Scientists, get industry certifications, revolutionize their workspace and establish exciting careers.


  • Taught by experts who are Carnegie Mellon, Johns Hopkins and Stanford University’s alumni with Fortune 50 experience
  • Applied and interactive classes
  • Classes ranked among the top 1% and 5% of all classes in the world in Piazza
  • 1/3rd the cost of other similar programs
  • 95% Success with Cloudera and EMC2

For details visit http://goo.gl/bPJEF

For any queries mail us at elearning@insofe.edu.in or call us at +91 9502334561/2/3

R Commands for Cleaning Data

This post contains notes from the Coursera Data Analysis Course.

Here are some R commands that might be helpful for cleaning data.

String Replacement

  • sub() replaces the first occurrence of a pattern
  • gsub() replaces all occurrences of a pattern
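
As a minimal sketch of the difference between the two, the example below uses a made-up string (the value of x is purely illustrative). The fixed = TRUE argument treats the pattern as a literal string rather than a regular expression.

```r
# Hypothetical example string, not from the course data
x <- "2013_survey_data_raw"

# sub() stops after the first match
first_only <- sub("_", " ", x, fixed = TRUE)

# gsub() keeps going until every match is replaced
all_of_them <- gsub("_", " ", x, fixed = TRUE)

print(first_only)   # "2013 survey_data_raw"
print(all_of_them)  # "2013 survey data raw"
```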

Quantitative Variables in Ranges

  • cut(data$col, seq(0,100, by=10)) breaks the data up by the range it falls into, in this example: whether the observation is between 0 and 10, 10 and 20, 20 and 30, and so on
  • cut2(data$col, g=6) from the Hmisc package, returns a factor variable with 6 quantile groups
  • cut2(data$col, m=25) from the Hmisc package, returns a factor variable with at least 25 observations in each group
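
Here is a small sketch of cut() on a made-up vector (scores is a hypothetical name). One thing worth knowing: by default the intervals are open on the left and closed on the right, so a value equal to the lowest break becomes NA unless include.lowest = TRUE is set. The cut2() call only runs if Hmisc happens to be installed.

```r
# Hypothetical data for illustration
scores <- c(5, 12, 27, 50, 99)

# Bin into (0,10], (10,20], ..., (90,100]
bins <- cut(scores, seq(0, 100, by = 10))
print(as.character(bins))  # "(0,10]" "(10,20]" "(20,30]" "(40,50]" "(90,100]"

# A value sitting exactly on the lowest break is not binned by default
print(cut(0, seq(0, 100, by = 10)))                         # NA
print(cut(0, seq(0, 100, by = 10), include.lowest = TRUE))  # [0,10]

# cut2() lives in Hmisc; skip quietly if the package is missing
if (requireNamespace("Hmisc", quietly = TRUE)) {
  groups <- Hmisc::cut2(scores, g = 2)  # roughly equal-sized groups
  print(table(groups))
}
```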

Manipulating Rows/Columns

  • merge() combines two data frames on their common columns
  • sort() sorts a vector
  • order(data$col, na.last=T) returns the row indexes that put col in sorted order, with NAs last
  • data[order(data$col, na.last=T),] reorders the entire data frame based upon the col
  • melt() in the reshape2 package, reshapes data from wide to long format
  • rbind() adds more rows to a data frame
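
The sketch below strings several of these functions together on two tiny made-up data frames (people and ages are hypothetical names, not from the course). The melt() call is guarded because reshape2 is not part of base R.

```r
# Hypothetical data frames for illustration
people <- data.frame(id = c(3, 1, 2), name = c("Cy", "Al", "Bo"),
                     stringsAsFactors = FALSE)
ages <- data.frame(id = c(1, 2, 3), age = c(40, NA, 25))

# merge() joins on the shared column and sorts the result by it
combined <- merge(people, ages, by = "id")

# sort() returns the sorted values and drops NA by default
sorted_ages <- sort(combined$age)

# order() returns row indexes instead; na.last = TRUE keeps NA rows at the end
idx <- order(combined$age, na.last = TRUE)
by_age <- combined[idx, ]

# rbind() appends rows; column names must match
extended <- rbind(by_age,
                  data.frame(id = 4, name = "Di", age = 31,
                             stringsAsFactors = FALSE))
print(extended)

# melt() needs reshape2; skip quietly if it is not installed
if (requireNamespace("reshape2", quietly = TRUE)) {
  long <- reshape2::melt(combined, id.vars = c("id", "name"))
  print(long)
}
```

Note that sort() and order() treat missing values differently: sort() discards them unless told otherwise, while order() keeps every row, which is why it is the one to use for reordering a data frame.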

Obviously, these functions have other parameters to do a lot more. There are also a number of other helpful R functions, but these are enough to get you started. Check the R help (?functionname) for more details.