The list is ordered according to the level of difficulty.
- Descriptive: just describe the data; common for census-type data
- Exploratory: find relationships that were not clear beforehand; useful for defining future studies; remember that correlation does not imply causation
- Inferential: use a small dataset to say something about a larger population; the most common goal of statistical analysis
- Predictive: use data from some objects to predict values for another object; it is important to measure the right variables and to use as much data as possible
- Causal: find out what happens to one variable when you force another variable to change; usually requires a randomized study; this is the gold standard of data analysis
- Mechanistic: understand the exact changes in variables that lead to changes in other variables for individual objects; typical of engineering and the physical sciences; if the equations are known, data analysis can be used to infer their parameters
This list comes from information presented in the first week of the Coursera Data Analysis class.
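To make the difference between the first and third items concrete, here is a minimal R sketch. The heights are simulated, hypothetical data (not from the course), and `t.test()` is just one common way to get a confidence interval; the course may use other tools.

```r
# Hypothetical data: simulated heights (cm) standing in for a large population.
set.seed(42)
population <- rnorm(100000, mean = 170, sd = 10)

# Descriptive: summarize the data you actually have (census-style).
pop_mean <- mean(population)

# Inferential: draw a small sample and estimate the population mean,
# with a 95% confidence interval to express the uncertainty.
smp <- sample(population, size = 50)
fit <- t.test(smp)
sample_mean <- unname(fit$estimate)
ci <- as.numeric(fit$conf.int)

cat("Population mean:", round(pop_mean, 2), "\n")
cat("Sample mean:    ", round(sample_mean, 2), "\n")
cat("95% CI:         ", round(ci, 2), "\n")
```

The descriptive step only reports what is in the data, while the inferential step uses 50 values to make a hedged statement about all 100,000.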
The Coursera Data Analysis course started yesterday. This course would be an excellent follow-up to the Computing for Data Analysis course. For a bit more about the course, check out this video explaining the content. The course consists of lectures, quizzes, and some data analysis assignments. There is still plenty of time to sign up and start analyzing.
Jeff Leek, instructor of the upcoming Coursera Data Analysis course, wrote up a nice blog post, The Landscape of Data Analysis, explaining the topics to be covered in the course. The topics look good. He also made a video explaining how data science fits in with other disciplines such as computer science, medicine, and statistics. The video is short (less than 5 minutes), so it is definitely worth the time.
Week 1 of the Computing for Data Analysis course focused mostly on getting R and RStudio installed, and then on some of the basics of the R language. Here are some of the topics:
- History of R
- How to get help
- Data types in R
  - numeric (real numbers)
  - character (strings)
  - integer (whole numbers)
  - complex (numbers with an imaginary part)
  - logical (TRUE/FALSE)
- Groupings of data
  - vector (all elements share one data type)
    v <- c(1.4, 2.5, 1.7)
    v <- 1:10
  - list (elements need NOT share a data type)
    lst <- list("a", 3.5, TRUE, "word", 4+5i)
  - matrix (2-dimensional vector)
    m <- matrix(1:20, nrow=4, ncol=5)
- Factor is for categorical data
    f <- factor(c("big","small","big","big"))
- Missing values
  - is.nan() (Not a Number)
  - is.na() (Not Available)
- Reading/writing data
    d <- read.table("file.txt")
    d <- read.csv("file.csv")
- Reading data more efficiently: read a few rows first to learn the column classes, then pass them to the full read
    initial <- read.csv("data.csv", nrows=10)
    classes <- sapply(initial, class)
    fullData <- read.csv("data.csv", nrows=2000, colClasses=classes)
- str() function for displaying information about the structure of an object
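The two-pass read and the missing-value checks from the list can be tried end to end with a small sketch. The CSV contents, column names, and the temporary file are made up for illustration; only `read.csv()`, `sapply()`, `str()`, `is.na()`, and `is.nan()` come from the topics above.

```r
# Write a small, made-up CSV to a temporary file so the example is self-contained.
path <- tempfile(fileext = ".csv")
example <- data.frame(
  id    = 1:5,
  size  = c("big", "small", "big", NA, "big"),
  score = c(1.4, 2.5, NA, 1.7, 3.0),
  stringsAsFactors = FALSE
)
write.csv(example, path, row.names = FALSE)

# Pass 1: read only a few rows and record the class of each column.
initial <- read.csv(path, nrows = 3)
classes <- sapply(initial, class)

# Pass 2: read the whole file, telling R the column classes up front.
fullData <- read.csv(path, colClasses = classes)
str(fullData)

# is.na() flags missing values; is.nan() flags only "Not a Number".
# NaN counts as NA, but NA does not count as NaN.
x <- c(1, NA, 0 / 0)
is.na(x)   # TRUE for both NA and NaN
is.nan(x)  # TRUE only for NaN

# Count the missing scores read back from the file.
missing_scores <- sum(is.na(fullData$score))
```

One caveat with this approach: the classes are guessed from the first few rows only, so a column that looks like an integer (or is entirely missing) in those rows but holds something else later can make the second read fail. Sampling more rows in the first pass reduces that risk.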
If you hurry, there still might be time to enroll in the class and finish the homework for full credit. Week 1 was not too intensive.
The Coursera class Computational Methods of Data Analysis started yesterday. There is still plenty of time to enroll in the class.
This course assumes a good familiarity with calculus, linear algebra, and some basic programming. Thus, if your math background is weak or needs a refresher, you may not want to take this course. However, if you have a solid math background, the course dives right into Fourier analysis. The course topics look good, and image analysis is one of the central themes of the course. The software Matlab ($99 for the student edition) is recommended; however, Octave (free) is acceptable.
Coursera’s Computing for Data Analysis starts today. Enroll now and start learning R and data analysis.
Coursera has some excellent courses coming up in 2013. Here are some potential curriculum paths for someone looking to learn data science.
Either sequence requires (or at least recommends) some basic programming experience. If you are unfamiliar with programming, you still have a couple of weeks to get familiar with some basic programming concepts. Some good places to start would be either Coursera’s Computer Science 101 or Codecademy’s Python tutorial.
Data Science Curriculum #1
If you are new to programming, this would be the recommended sequence. The first course focuses on programming.
Data Science Curriculum #2
Neither of the Coursera machine learning courses (Stanford or U of Washington) is scheduled for 2013, but either of them would be a great (maybe necessary) follow-up course. Hopefully, one of those courses will be starting in July or shortly thereafter.
After completing one of the above sequences combined with a machine learning course, a person should be skilled enough to begin doing useful data science work. (Note: A new job as a data scientist is not guaranteed, but the courses won’t hurt your chances.) Plus, Coursera offers numerous other classes that could be taken at a later time to increase depth in certain areas of data science (Natural Language Processing, Image Processing, and more).
Happy Learning in 2013!
If you are interested in more ways to learn data science, please check out Data Science 201, coming in 2013.
Yesterday, Coursera announced that students will soon be able to earn college credits for some of the courses. See the blog post with the college credit announcement.
Just announced: Coursera has added 17 new universities. Those universities include Columbia and Brown, as well as a few international universities.
A few notable courses for data science are: a new machine learning course from the University of Washington, Linear Algebra from Brown, and Natural Language Processing by Michael Collins from Columbia.
See the following pages to find out what other courses are now available.