**Process mining is a bridge between data mining and business process modeling**. Process Mining can be used to study event and log files to extract meaning.

The Coursera course, Process Mining: Data science in Action, starts November 12, 2014.


Machine Learning is a term that can mean different things to different people. Andrew Ng, cofounder of Coursera and Professor at Stanford, provides two definitions in his popular Machine Learning Course. The first definition comes from Arthur Samuel around 1959.

Field of study that gives computers the ability to learn without being explicitly programmed.

The second definition, a bit more formal and rigorous, comes from Tom Mitchell’s 1997 *Machine Learning* textbook, which defines a *well-posed learning problem* as:

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Machine learning can be broken down into a few categories. The two most popular are supervised and unsupervised learning. A couple other categories are recommender systems and reinforcement learning.

Probably the most common category of machine learning, *supervised learning* is concerned with fitting a model to labeled data. Labeled data is data that has the correct answer supplied. Regression and Classification are the most common types of problems in supervised learning.
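A minimal sketch in R, using the built-in *mtcars* dataset to stand in for labeled data (*mpg* is the known answer supplied for every row):

```r
# Supervised learning sketch: regression on labeled data.
# Every row of mtcars carries the correct answer (mpg), so we can
# fit a model mapping the features (wt, hp) to that label.
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)                              # coefficients, R-squared, etc.
predict(fit, newdata = data.frame(wt = 3, hp = 110))
```

The same pattern applies to classification; only the model (e.g. *glm()* with a binomial family) changes.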

*Unsupervised learning* deals with unlabeled data. Therefore, the goal of unsupervised learning is to find structure in the data. Clustering is probably the most common technique.
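A quick sketch, using the built-in *iris* measurements with the species labels deliberately withheld:

```r
# Unsupervised learning sketch: k-means clustering on unlabeled data.
# We hand the algorithm only the four measurement columns and let it
# discover structure (here, three clusters) on its own.
set.seed(42)                     # k-means starts randomly; fix the seed
km <- kmeans(iris[, 1:4], centers = 3)
table(km$cluster)                # how many observations fell in each cluster
```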

*Recommender systems* deal with making recommendations based upon previously collected data. *Reinforcement learning* is concerned with maximizing the reward of a given agent (a person, a business, etc.).

Most of the above information comes from the Coursera Machine Learning Course. There is still time to sign up since the first assignments are not due until the end of the week.

The highly anticipated Coursera class, Introduction to Data Science, started yesterday. It looks good so far. Why not join 72,000 other students interested in learning data science?

This post is notes from the Coursera Data Analysis Course.

Here are some R commands that may prove helpful for cleaning data.

- **sub()** – replaces the first occurrence of a pattern
- **gsub()** – replaces all occurrences

- **cut(data$col, seq(0,100, by=10))** – breaks the data up by the range it falls into; in this example, whether the observation is between 0 and 10, 10 and 20, 20 and 30, and so on
- **cut2(data$col, g=6)** – returns a factor variable with 6 groups (*cut2* is in the Hmisc package)
- **cut2(data$col, m=25)** – returns a factor variable with at least 25 observations in each group

- **merge()** – combines data frames
- **sort()** – sorts a vector
- **order(data$col, na.last=T)** – returns the indexes that would order the rows
- **data[order(data$col, na.last=T),]** – reorders the entire data frame based upon *col*
- **melt()** – in the reshape2 package; reshapes data
- **rbind()** – adds more rows to a data frame

Obviously, these functions have other parameters to do a lot more. There are also a number of other helpful R functions, but these are enough to get you started. Check the R help (?functionname) for more details.
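As a small worked example (the data here is made up purely for illustration), *gsub()* and *cut()* might be combined like this:

```r
# Hypothetical cleaning pass on a messy character vector.
x <- c("  low ", "HIGH", "low")
x <- gsub("\\s", "", x)          # strip every whitespace character
x <- tolower(x)                  # standardize case -> "low" "high" "low"

# Bin numeric scores into ranges of 10, as in the cut() example above.
scores <- c(5, 23, 47, 61, 88)
bins <- cut(scores, seq(0, 100, by = 10))
table(bins)                      # count of scores falling in each range
```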

This post is notes from the Coursera Data Analysis Course.

Here are some basic R commands for creating some graphs.

```r
boxplot()    # box-and-whisker plots
barchart()   # from the lattice package; the base R equivalent is barplot()
hist()       # histograms
plot()       # scatterplots and general plotting
density()    # kernel density estimates (pass the result to plot())
```

Final graphs need to look a little nicer than quick exploratory plots. They should have informative axis labels, a title, and possibly a legend.

```r
plot(data$column1, data$column2, pch=19, col='blue', cex=0.5,
     xlab='X axis label', ylab='Y axis label', main='Title of Graph',
     cex.lab=2, cex.axis=1.5)
legend(100, 200, legend='Legend Info', col='blue', pch=19, cex=0.5)
```

It is often useful to display more than one graph at a time. Here is some code to display 2 graphs horizontally on the same panel.

```r
par(mfrow=c(1,2))                # 1 row, 2 columns of plots
plot(data$column1, data$column2)
plot(data$column3, data$column4)
mtext(text='some caption')
```

To send graphs to a PDF file instead of the screen, open the *pdf()* device first and close it with *dev.off()* when done:

```r
pdf(file='myfile.pdf', height=4, width=8)
par(mfrow=c(1,3))
hist(...)
mtext(text='caption', side=3, line=1)
plot(...)
mtext(...)
boxplot(...)
mtext(...)
dev.off()
```

A very similar thing can be done for PNG image files. Just use *png()* at the beginning instead.

Use *dev.copy2pdf(file='myfile.pdf')* to save an existing graph to a file.

This list comes from the Coursera Data Analysis Course.

Linear and Logistic Regression are some of the most common techniques applied in data analysis. Here is a list of possible problems with regression in the real world.

- **Confounders** – a variable that is correlated with both the outcome and other variables in the model
- **Complicated Interactions** – how do the covariates interact?
- **Skewness** – the data is not evenly distributed; it is heavy to one side or the other
- **Outliers** – data points that don’t fit the pattern
- **Non-linear Patterns** – not all datasets can be fit with a straight line
- **Variance Changes** – the spread of the data changes across its range (heteroskedasticity)
- **Units/Scale Issues** – make sure the units are standard across the model
- **Overloading Regression** – too much complexity
- **Correlation does not imply Causation**
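Several of these problems can be spotted with R's built-in diagnostic plots. A sketch on simulated data (simulated here precisely so the non-linearity is known in advance):

```r
# Simulate a non-linear relationship, then fit a straight line to it.
set.seed(1)
x <- runif(100, 0, 10)
y <- x^2 + rnorm(100, sd = 5)    # true pattern is curved, not linear
fit <- lm(y ~ x)                 # deliberately mis-specified model
par(mfrow = c(2, 2))
plot(fit)                        # residual plots expose the curvature,
                                 # outliers, and changing variance
```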

What other problems do you find when using regression on real-world data? Do you know of other problems that are missing from this list?

This post is notes from the Coursera Data Analysis Course.

Here are some basic R commands that should be useful for obtaining data and looking at data in R. Ideally, these commands are useful for steps 4, 5, and 6 of the 11 Steps to Data Analysis.

```r
download.file('http://location.com', 'localfile.csv')
data <- read.csv('localfile.csv')

dim(data)                           # number of rows and columns
names(data)                         # column names
quantile(data$column)
hist(data$column)
head(data)
summary(data)
str(data)
unique(data$column)
length(unique(data$column))
table(data$column)                  # count of how many times each value appears
table(data$column1, data$column2)

any(data$column < 100)
all(data$column > 100)

colSums(data)
colMeans(data, na.rm=T)
rowMeans(data, na.rm=T)

is.na(data$column)
sum(is.na(data$column))
table(data$column, useNA="ifany")   # include NA values in the counts
```

For more information on any R command, just type *?* followed by the command name in the R console. For example, if you want to know more about the *dim* command, type *?dim*.
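The same commands can be tried without the download step on a dataset that ships with base R, such as *airquality*:

```r
# Inspecting the built-in airquality dataset with the commands above.
dim(airquality)                  # 153 rows, 6 columns
names(airquality)
summary(airquality)
sum(is.na(airquality$Ozone))     # 37 missing ozone readings
table(airquality$Month, useNA="ifany")
```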
