After numerous semesters of teaching an introductory course on machine learning, Max Welling, a professor at the University of California, Irvine, decided to compile an introductory textbook titled A First Encounter with Machine Learning (PDF link).
- BigML – A great interface. Just upload your data and it shows basic information for each column such as a histogram and mean values. See the Gallery for some examples of the final models.
- Wise.io – Just launched, but it looks to be a serious contender. It was started by a team from UC Berkeley.
- Precog – Taking a slightly different approach, Precog is both a platform and an online IDE for data science. The IDE supports Quirrel, hyped as R for big data.
- Ersatz – Ersatz is currently in private beta; the team is building a web platform for creating deep neural networks.
Am I missing any startups on this list?
While not really startups, the following two links might also fit here.
In just the past month, a couple of great resources for learning Python have been created.
- Getting started with Python: Tips, Tools and Resources – If you are new to Python, this is a great place to start. It contains a brief description and links to books, tutorials, and MOOCs.
- Getting Started With Python for Data Scientists – This focuses more on tools specifically for data science.
Together, these links should provide all the resources necessary to begin doing some data science with Python.
This is a great article and list of topics to remember when working on big data projects. Here is the list.
- Gather business requirements before gathering data
- Implementing big data is a business decision not IT
- Use Agile and Iterative Approach to Implementation
- Evaluate data requirements
- Ease skills shortage with standards and governance
- Optimize knowledge transfer with a center of excellence
- Embrace and plan your sandbox for prototype and performance
- Align with the cloud operating model
- Associate big data with enterprise data
- Embed analytics and decision-making using intelligence into operational workflow/routine
See the original article, 10 Big Data Implementation Best Practices, for details.
These are notes from the Coursera Data Analysis course.
Here are some R commands that might prove helpful for cleaning data.
- sub() replaces the first occurrence
- gsub() replaces all occurrences
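The difference between the two is easiest to see side by side. Below is a minimal sketch in base R; the column names in `raw_names` are invented for illustration and do not come from the course.

```r
# Hypothetical messy column names, invented for illustration
raw_names <- c("user_id_1", "first_name_x", "last_name_x")

# sub() replaces only the first match in each string
first_only <- sub("_", ".", raw_names, fixed = TRUE)

# gsub() replaces every match in each string
all_matches <- gsub("_", ".", raw_names, fixed = TRUE)

print(first_only)
print(all_matches)
```

Here `fixed = TRUE` treats the pattern as a literal string instead of a regular expression, which is usually the safer choice for simple replacements.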
Quantitative Variables in Ranges
- cut(data$col, seq(0,100, by=10)) breaks the data up by the range it falls into, in this example: whether the observation is between 0 and 10, 10 and 20, 20 and 30, and so on
- cut2(data$col, g=6) from the Hmisc package, returns a factor variable with 6 groups
- cut2(data$col, m=25) from the Hmisc package, returns a factor variable with at least 25 observations in each group
- merge() combines data frames
- sort() sorts a vector
- order(data$col, na.last=T) returns the indexes that put the rows in order
- data[order(data$col, na.last=T),] reorders the entire data frame based upon the col
- melt() from the reshape2 package, reshapes data from wide to long format
- rbind() adds more rows to a data frame
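Several of the functions above can be tried together on a small example. This is a rough sketch using base R only, so cut2() and melt() are left out because they need the Hmisc and reshape2 packages. The data frames `scores`, `groups`, and `extra` are made up for illustration. One thing to watch: by default cut() builds intervals that are open on the left, so a value of exactly 0 would fall outside (0,10] unless include.lowest=TRUE is set.

```r
# Hypothetical data frame, invented for illustration
scores <- data.frame(
  id    = c(1, 2, 3, 4),
  score = c(35, 5, NA, 72)
)

# cut() assigns each value to a range; by default intervals are (lower, upper]
scores$bin <- cut(scores$score, seq(0, 100, by = 10))

# order() returns row indexes; na.last = TRUE puts missing values at the end
sorted <- scores[order(scores$score, na.last = TRUE), ]

# merge() joins two data frames on a shared column, here "id"
groups <- data.frame(id = c(1, 2, 3, 4), group = c("a", "b", "a", "b"))
merged <- merge(scores, groups, by = "id")

# rbind() appends rows; the column names of both data frames must match
extra <- data.frame(id = 5, score = 50)
combined <- rbind(scores[, c("id", "score")], extra)

print(sorted)
print(merged)
print(combined)
```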
Obviously, these functions have other parameters to do a lot more. There are also a number of other helpful R functions, but these are enough to get you started. Check the R help (?functionname) for more details.
Here are a few of the recent Strata videos I would recommend.
- Video Games: The Biggest Big Data Challenge – Video Games generate Big Data
- Big Data on Small Devices: Data Science goes Mobile – How data can help build better mobile apps
- Distributed Environmental Data: On the Ground at the Data Sensing Lab – Using sensors to track people at a conference. It is hard to explain and very interesting, so just watch the video.