Computing For Data Analysis Week 1 Overview

Week 1 of the Computing for Data Analysis course focused mostly on getting R and RStudio installed, then moved on to some of the basics of the R language. Here are some of the topics:

• History of R
• How to get help: `help()`
• Data types in R
  • numeric (real numbers)
  • character (strings)
  • integer (whole numbers)
  • complex (numbers with an imaginary part)
  • logical (TRUE/FALSE)
• Groupings of data
  • vector (all elements of the same data type)
    ```
    v <- c(1.4, 2.5, 1.7)   # numeric vector
    v <- 1:10               # integer sequence from 1 to 10
    ```
  • list (elements can be of different data types)
    ```
    lst <- list("a", 3.5, TRUE, "word", 4+5i)
    ```
  • matrix (2-dimensional vector)
    ```
    m <- matrix(1:20, nrow=4, ncol=5)
    ```
  • factor (categorical data)
    ```
    f <- factor(c("big","small","big","big"))
    table(f)   # counts per category
    ```
• Missing Values
  • NaN (Not a Number), tested with `is.nan()`
  • NA (Not Available), tested with `is.na()`
• Reading/Writing data
  ```
  d <- read.table("file.txt")
  d <- read.csv("file.csv")
  write.table(d, "outFile.txt")   # the object to write comes first
  ```
• Better Reading of data: read a few rows first to learn the column classes, then pass them to the full read
  ```
  initial <- read.csv("data.csv", nrows=10)
  classes <- sapply(initial, class)
  fullData <- read.csv("data.csv", nrows=2000, colClasses=classes)
  ```
• The `str()` function for displaying information about the structure of an object
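
Here is a minimal sketch that tries out a few of the topics above in one short R session. The data frame, its column names, and the temporary file are made up for illustration; they are not from the course materials.

```r
# Data types: class() reports the type of a value
class(1.5)     # "numeric"
class("a")     # "character"
class(2L)      # "integer" (the L suffix asks for an integer)
class(4+5i)    # "complex"
class(TRUE)    # "logical"

# Missing values: is.na() is TRUE for both NA and NaN,
# while is.nan() is TRUE only for NaN
x <- c(1, NA, 0/0)
na_flags  <- is.na(x)     # FALSE TRUE TRUE
nan_flags <- is.nan(x)    # FALSE FALSE TRUE

# Reading data: write a small, hypothetical data set to a temporary
# file so the example does not depend on any existing file
df <- data.frame(id    = 1:20,
                 score = seq(0.5, 10, by = 0.5),
                 label = rep(c("big", "small"), 10))
path <- tempfile(fileext = ".csv")
write.csv(df, path, row.names = FALSE)

# Read the first few rows, record each column's class,
# then reuse those classes for the full read
initial  <- read.csv(path, nrows = 10)
classes  <- sapply(initial, class)
fullData <- read.csv(path, colClasses = classes)

# str() gives a compact summary of the result
str(fullData)
```

One caveat with this approach: the sample rows need to be representative. If a column looks like integers in the first rows but contains decimals later, the full read will fail with the classes taken from the sample.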

If you hurry, there still might be time to enroll in the class and finish the homework for full credit. Week 1 was not too intensive.

Coursera Computational Methods of Data Analysis Started

The Coursera class Computational Methods of Data Analysis started yesterday. There is still plenty of time to enroll in the class.

This course assumes a good familiarity with calculus, linear algebra, and some basic programming. Thus, if your math background is weak or needs a refresher, you may not want to take this course. However, if you have a solid math background, the course jumps right into Fourier analysis. The course topics look good, and image analysis is one of the central themes of the course. Matlab ($99 for the student edition) is the recommended software; however, Octave (free) is acceptable.

Elements of Statistical Learning Textbook (Free)

The Elements of Statistical Learning textbook is available for free. It is a classic, widely used textbook for statistics and machine learning. Here is a far-from-complete list of some of the topics:

• Supervised Learning
  • Linear/Logistic Regression
  • Regularization
  • Model Selection
  • Trees
  • Neural Networks
  • Support Vector Machines
  • Random Forests
• Unsupervised Learning
  • Clustering

As you can see, the book is quite extensive.

Note: This book has been available for quite a while, but I realized I had not added a link to it on my blog.

Publisher Looking for Data Science Authors

New Street Communications is looking for authors. According to the call for proposals:

…especially interested to hear from professionals in the fields of IT, Data Science, Big Data and Cloud Computing.

If you have ever thought about writing a data science book, now might be a good time.

UC Berkeley Course Lectures: Analyzing Big Data with Twitter

The link includes videos and lecture notes from the course.

50 Top Open Source Tools for Big Data – Datamation

The list is about 6 months old, but it still covers all the ones I would have listed and quite a few more.

Top ten algorithms in data mining (2007) [pdf] | Hacker News

The discussion below the link is also very good.

If you are curious, here are the 10 algorithms, and the paper is displayed below.

1. C4.5
2. k-Means
3. SVM
4. Apriori
5. EM
6. PageRank