If you have the necessary background in math, statistics, and computer science, then it is a good time to learn some skills specific to data science. Coursera recently launched a course devoted to the subject, titled Introduction to Data Science. The course is taught by Bill Howe of the University of Washington’s eScience Institute. I believe it is an excellent place to start, and I am very excited about it.
Other Data Science Learning Resources
Here is a list of other materials that could be helpful for learning data science.
Statistics is an important component of data science. Thus, it would be nice to have some resources available.
Learn Statistics For Free Online
Well, here is a list of free statistics resources available online. All of these are fairly introductory, but I am guessing more advanced topics will be coming from these same organizations.
In addition to the free resources online, there are other options as well.
- Statistics.com – courses are about $400-$500 but programs lead to certificates
- Most local colleges offer courses in statistics
What other resources are available for learning statistics?
Math is one of the key building blocks of data science. While you cannot do a lot of data science with just calculus and linear algebra, both topics are essential for more advanced topics in data science such as machine learning, algorithms, and advanced statistics. Here are some freely available resources for learning both topics.
Matrix Operations/Linear Algebra
Other Math Options
The following two courses from Coursera may be good for a person learning to think mathematically.
O’Reilly and data scientist DJ Patil just released a new free report titled Data Jujitsu: The Art of Turning Data Into Product. If you are interested in building data products, the report is excellent and definitely worth your time.
What does the report cover?
The report provides a definition for a data product. It then covers a process for taking an idea from concept to reality. The main point is to use some shortcuts and get the product out fast. Then if people like the product, and only then, spend some time really enhancing the algorithms.
It is no surprise that I enjoy talking about data science and using data to make better decisions. Unfortunately, not everyone is as excited about data as I am, and I occasionally face some resistance. One of the most common arguments goes something like this:
Why would I put in the extra work to find data that will tell me what I already know?
Because you might not really know what you think you know.
People are very comfortable making decisions based on gut feelings, assumptions, and personal biases. They obviously think the decisions are the correct ones, and many times the gut feeling is correct. The problem is those times when the gut feeling leads to the wrong decision. In those cases, I would prefer to have some good data behind my choice. Here are a couple of scenarios:
A Website Scenario
We should choose the white background because it produces 75% more signups than the blue background.
We should choose white because I got a good feeling people will like it better.
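A claim like the first one can be checked with a simple A/B comparison. The sketch below is only an illustration, assuming made-up counts (70 signups out of 1,000 visitors for white, 40 out of 1,000 for blue) chosen so the lift works out to 75%. The function names are hypothetical too. It computes the relative lift and runs a two-proportion z-test to ask whether the difference could plausibly be chance.

```python
import math


def relative_lift(signups_a, visitors_a, signups_b, visitors_b):
    """Relative increase of A's signup rate over B's (0.75 means 75% more)."""
    rate_a = signups_a / visitors_a
    rate_b = signups_b / visitors_b
    return (rate_a - rate_b) / rate_b


def two_proportion_z_test(signups_a, visitors_a, signups_b, visitors_b):
    """Return (z, two-sided p-value) for the difference in signup rates."""
    rate_a = signups_a / visitors_a
    rate_b = signups_b / visitors_b
    # Pooled rate under the null hypothesis that both backgrounds perform the same.
    pooled = (signups_a + signups_b) / (visitors_a + visitors_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, not real measurements.
white = (70, 1000)  # (signups, visitors) for the white background
blue = (40, 1000)   # (signups, visitors) for the blue background

lift = relative_lift(*white, *blue)
z, p_value = two_proportion_z_test(*white, *blue)
print(f"lift: {lift:.0%}, z: {z:.2f}, p-value: {p_value:.4f}")
```

With these made-up counts the lift is 75% and the p-value is below 0.01, so the data would support the choice. With much smaller samples the same 75% lift could easily be noise, which is exactly what a gut feeling cannot tell you.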
A Healthcare Scenario
You should try these pills because numerous tests have shown good results for people your age and weight.
I got a hunch these pills will make you better.
Both kinds of decisions are made every day. Surely a doctor would not knowingly prescribe a harmful drug, but many times the prescription is just the standard drug matching the symptoms. Are all patients standard? Personally, I would feel more confident about the first decision in each scenario. But hey, I like data.
Why are people so skeptical about using data for better decision making? I do not know. Maybe some people are afraid the data might not match their assumptions. Maybe they just don’t want to put in the extra effort. I probably need some better data to properly answer this question.
Have you faced similar resistance? How do you convince people to use data for decision making?
In case you missed the announcement yesterday, Coursera added 12 new universities and over 100 new courses. The exciting part for people learning data science is a new category of courses: Statistics, Data Analysis, and Scientific Computing. None of the courses have started yet. Most are scheduled for this fall or early 2013. The courses look very good.
Are you excited about these new courses?
This infographic by Deep Blue Analytics does a nice job of explaining why there is so much excitement around big data.
- The world is generating a lot of data
- There are not enough people to analyze that data
Springer has just released a new data science journal named EPJ Data Science. The journal is open access, which means that articles are freely available online. The catch is that people who submit articles must pay a fee for publication. Sometimes the fee will be covered by the author’s university or company. Anyhow, if you are interested in data science research, this journal is probably worth following.
Are you interested in academic journals?
Does this excite you?
Electronic Doctor Visit
I recently received a message from one of the local hospitals. It stated that I can now have an electronic visit with my doctor. Here is how I understand it works. I fill out a brief questionnaire explaining some of my symptoms and submit it online. Within one day, my doctor will review my submission and respond. Obviously, this electronic visit should only be used for minor medical issues such as a common cold or a prescription update.
Being the type of person I am, I initially questioned why the hospital was really doing this. Sure, the hospital will be able to help more patients and make more money, but is there something more?
Think of the data that is collected in this process: a patient-entered description of the symptoms and the doctor’s diagnosis. It appears the hospital is building a training set that pairs descriptions of symptoms with diagnoses. It is a very short step to apply a machine learning algorithm or two and totally automate the process. Maybe this is already done and my doctor just signs off on the result.
Here is how I envision the system working:
- Use some natural language processing to identify the symptoms
- Match the symptoms to some known illness via machine learning
- Report the diagnosis and treatment
- Prescribe medicine if necessary
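The steps above can be sketched in a few lines of Python. This is a toy illustration, not how any hospital actually does it: simple keyword matching stands in for natural language processing, a set-overlap (Jaccard) score stands in for a trained machine learning model, and the symptom vocabulary, illness profiles, and function names are all made up for the example. Prescribing is left out. The sketch just flags low-confidence cases for the doctor.

```python
import re

# Hypothetical vocabulary: phrase a patient might type -> canonical symptom.
SYMPTOM_TERMS = {
    "runny nose": "runny_nose",
    "sneezing": "sneezing",
    "sore throat": "sore_throat",
    "cough": "cough",
    "fever": "fever",
    "body aches": "body_aches",
    "itchy eyes": "itchy_eyes",
}

# Hypothetical illness profiles, invented for illustration only.
KNOWN_ILLNESSES = {
    "common cold": {"runny_nose", "sneezing", "sore_throat", "cough"},
    "flu": {"fever", "body_aches", "cough", "sore_throat"},
    "seasonal allergies": {"sneezing", "itchy_eyes", "runny_nose"},
}


def extract_symptoms(text):
    """Step 1: find known symptom phrases in the free text (stand-in for NLP)."""
    normalized = re.sub(r"\s+", " ", text.lower())
    return {name for phrase, name in SYMPTOM_TERMS.items() if phrase in normalized}


def match_illness(symptoms):
    """Step 2: pick the illness whose profile overlaps most (stand-in for ML)."""
    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    scores = {name: jaccard(symptoms, profile)
              for name, profile in KNOWN_ILLNESSES.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


def electronic_visit(text, min_score=0.5):
    """Step 3: report a diagnosis, or hand off to the doctor when unsure."""
    symptoms = extract_symptoms(text)
    illness, score = match_illness(symptoms)
    confident = score >= min_score
    return {
        "symptoms": symptoms,
        "diagnosis": illness if confident else None,
        "score": score,
        "needs_doctor_review": not confident,
    }


result = electronic_visit("I have a runny nose, sneezing, and a sore throat since Monday.")
print(result["diagnosis"], result["score"])
```

A real system would learn the matching step from the questionnaire and diagnosis pairs the hospital collects, rather than from a hand-written table like the one here.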
What Do You Think?
How do you feel about this process? I am sure there are some companies working on just this problem. Who are those companies?
Note: Yes, I know this data is currently collected by hospitals, but a human (nurse or doctor) interprets what another human is saying before entering the data. The electronic visit just made me realize how easy it would be to automate a doctor’s job for common problems.
Jeffrey M. Stanton, member of Syracuse University’s iSchool, just released an open-source ebook about data science. Obviously this book is intended to be used in the curriculum for the new Data Science Certificate Program. In particular, it will be used for two courses on analytics and visualization.
The book is available in the iTunes store or as a PDF. See the book website to get your copy.