Writing a data science dissertation can be a long and difficult task. It takes dedication, a good topic, a helpful advisor, some meetings, and a bit of paperwork. However, it is not impossible, and here are some tools that will hopefully make it easier.
This is not intended to be a guide for selecting a topic. I am not qualified to provide that type of advice, but I will say this: choose both a topic and an advisor you find interesting. This is intended to be a collection of tools I found useful during my journey. I do not think the list is specific to data science; it could easily apply to mathematics, statistics, computer science, engineering, or any other highly quantitative field.
All these tools have free versions to get you started. A few have discounted upgrades for students.
- Use an online tool such as ShareLaTeX.
How does this tool benefit you? It saves you from having to install a version of LaTeX, stores a history of your previous versions of the document, and allows you to write on any machine with an internet connection. In addition, ShareLaTeX has existing templates for many universities. Students can even get half-priced premium accounts to collaborate and sync with GitHub and Dropbox. While LaTeX is not perfect, I do not know of any better tool for writing mathematical documents.
- Use GitHub to store your data and source code.
At some point, you will hopefully want to share your results. GitHub is the de facto standard for sharing open source code. It also works very well for storing data, even large datasets. You might also discover another open source project you want to get involved with. As a definite bonus, many future non-academic employers ask for a GitHub account during the application process. Thus, the sooner you start, the better.
- Use a Cloud Computing Platform such as Sense.
Don’t spend your time building a cluster of computers unless your dissertation topic involves cluster computing. Solve your own problem, not infrastructure problems. Sense and similar platforms provide access to massive computing power at little or no cost. Plus, Sense provides collaboration, sharing, scheduling, notifications, analysis recreation, and many other features you might find beneficial.
- Use Creately for creating diagrams.
Creating flowcharts and technical diagrams can be a pain, especially if you do not have expensive diagram software. Creately is a simple solution to this problem.
There is your list of helpful tools for writing a data science dissertation. Do you have any tools you think I missed? If so, please leave a comment.
Coursera has some excellent courses coming up in 2013. Here are some potential curriculum paths for someone looking to learn data science.
Either sequence requires, or at least recommends, some basic programming experience. If you are unfamiliar with programming, you still have a couple of weeks to get familiar with some basic programming concepts. Some good places to start would be either Coursera’s Computer Science 101 or Codecademy’s Python tutorial.
Data Science Curriculum #1
If you are new to programming, this would be the recommended sequence. The first course focuses on programming.
Data Science Curriculum #2
Neither of the Coursera machine learning courses (Stanford or U of Washington) is scheduled for 2013, but either of them would be a great (maybe necessary) follow-up course. Hopefully, one of those courses will be starting in July or shortly thereafter.
After completing one of the above sequences combined with a machine learning course, a person should be skilled enough to begin doing useful data science work. (Note: A new job as a data scientist is not guaranteed, but the courses won’t hurt your chances.) Plus, Coursera offers numerous other classes that could be taken at a later time to increase depth in certain areas of data science (Natural Language Processing, Image Processing, and more).
Happy Learning in 2013!
If you are interested in more ways to learn data science, please check out Data Science 201, coming in 2013.
If you have the necessary background in math, statistics, and computer science, then it is a good time to learn some skills specific to data science. Coursera recently launched a course devoted to data science, titled Introduction to Data Science. The course is taught by Bill Howe of the University of Washington’s eScience Institute. I am very excited about this course and believe it is an excellent place to start.
Other Data Science Learning Resources
Here is a listing of other materials that could be helpful for learning data science.
Many aspects of computer science are fundamental to data science. A good data scientist has to be able to transform/extract/manipulate lots of data. Computer programming is the main technique for such operations. Here are numerous resources to help you learn the fundamentals of computer science.
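To make "transform/extract/manipulate" a little more concrete, here is a minimal sketch in Python. The data, the column names, and the function names are all made up for illustration; real data would more likely come from a file or a database.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw data, invented for this example.
RAW_CSV = """city,temperature_f
Austin,95
Boston,70
Austin,99
Boston,74
"""


def fahrenheit_to_celsius(temp_f):
    """Transform a single Fahrenheit reading into Celsius."""
    return (temp_f - 32) * 5 / 9


def average_celsius_by_city(csv_text):
    """Extract the readings from CSV text and average them per city."""
    readings = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        temp_c = fahrenheit_to_celsius(float(row["temperature_f"]))
        readings[row["city"]].append(temp_c)
    return {city: sum(temps) / len(temps) for city, temps in readings.items()}


if __name__ == "__main__":
    for city, average in sorted(average_celsius_by_city(RAW_CSV).items()):
        print(f"{city}: {average:.1f} C")
```

The same three steps (read the raw data, convert it, summarize it) show up in most data work, just with larger inputs and better tools.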
Online Computer Science Courses: Introductory Level
If you are not familiar with computer programming, this list is a good place to start.
Online Computer Science Courses: More Advanced
Two More Helpful Resources
Stack Overflow is a great site for getting answers to your programming questions. It is good for beginners as well as more advanced programmers. Also, if you start writing a lot of code, GitHub is a great place to store that code.
Statistics is an important component of data science. Thus, it would be nice to have some resources available.
Learn Statistics For Free Online
Well, here is a list of free statistics resources available online. All of these are fairly introductory, but I am guessing more advanced topics will be coming from these same organizations.
In addition to the free resources online, there are other options as well.
- Statistics.com – courses are about $400-$500 but programs lead to certificates
- Most local colleges offer courses in statistics
What other resources are available for learning statistics?
Math is one of the key building blocks of data science. While you cannot do a lot of data science with just calculus and linear algebra, both topics are essential for more advanced topics in data science such as machine learning, algorithms, and advanced statistics. Here are some freely available resources for learning both topics.
Matrix Operations/Linear Algebra
Other Math Options
The following two courses from Coursera may be good for a person learning to think mathematically.
This is not intended to be mapped to a set of college courses. It is intended to be a listing of necessary skills for a data scientist. For a definition of data scientist, see this previous post.
- Calculus – not directly important to data science, but the knowledge is important for understanding the statistics and machine learning
- Matrix Operations
- Regression – Linear and Logistic
- Bayesian Statistics
- R – stats
- Octave – machine learning
- Basic Programming – Java, C/C++, and Python seem to be good language choices
- Machine Learning
- Database Knowledge – not limited to just relational databases
- Data Visualization – how to make data look good: maps, graphs, etc.
- Presentation – storytelling, being comfortable explaining data to others
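As a small illustration of how a few of these skills fit together (basic programming, a little statistics, and linear regression), here is a sketch of fitting a straight line by ordinary least squares in plain Python. The function name and the sample numbers are invented for this example; in practice you would more likely reach for R, Octave, or a Python library.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations of x and y, and sum of squared deviations of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Made-up points that lie exactly on the line y = 2x + 1.
    slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
    print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The formula for the slope is where the calculus comes in: it is what you get when you minimize the sum of squared errors.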
Do you have anything to add/remove from the list?