Jenna Dutcher, community relations manager for the datascience@berkeley online master’s program, interviewed more than 40 thought leaders to answer this one simple question: What is big data? (Full disclosure: I was honored to be asked to provide a definition for the list.)
The answers are quite diverse and definitely worth reading.
I thought Hal Varian, Chief Economist at Google, provided one of the simplest and best definitions.
Big data means data that cannot fit easily into a standard relational database.
See the full list of What is Big Data?
Which definition is your favorite? How would you define big data?
It was a 2-week intensive course focused on machine learning for big data. Some of the top academics in machine learning gave presentations. Most of the videos are fairly long (around 1 hour each), but a whole lot of material is covered.
All the CMU Machine Learning Summer School videos are on YouTube.
Here is one lecture by Alex Smola on Scalable Machine Learning.
Onur Akpolat has put together A curated list of awesome big data frameworks and resources. The list is very extensive and includes: NoSQL databases, machine learning libraries, frameworks, filesystems and more.
On a similar note, Joseph Misiti has compiled a large list of machine learning specific resources. The list is titled, Awesome Machine Learning, and it includes resources for various languages, NLP, visualization, and more.
Both lists are on GitHub, so if you notice something missing from either list, feel free to add it. Contributions are welcome.
Lots of Big Data Jobs
iCrunchData, one of the most popular data science job sites, keeps an index of the data science job market. The index recently passed 500,000 big data jobs posted online. That is a phenomenal number, and it shows the massive need for more people with big data skills. Also of note, analytics jobs are at nearly 250,000 and even statistics jobs are approaching 70,000, according to the index.
The Jobs Pay Well
DataJobs, another popular data science job site, recently published Big Data Salaries: An Inside Look. DataJobs breaks down the salaries by job title and experience level. Here are some of the details:
- An entry-level data analyst should expect a yearly salary in the range of $50,000 to $75,000. A more experienced data analyst can expect as much as $110,000.
- The range for a data scientist goes from $85,000 up to $170,000.
- An analytics manager, depending upon the number of direct reports, can command a salary of up to $240,000 with 10 or more direct reports.
- A big data engineer can expect a salary of $70,000 to $165,000, depending upon level of experience and the company.
If you have the right skills, right now is an excellent time to find a big data job. If you don’t yet have the skills, it is a good time to start learning because the current trend of open big data jobs is showing no signs of slowing down.
Big Dive is a 5-week Big Data training program offered in Italy. The program is not free, but it has a great layout of topics. The program focuses on 3 main themes:
- Data Science
Big Dive runs from early June to mid-July, and the admissions deadline is April 27, 2014.
This infographic is packed with good data. I especially enjoyed the section about big data startups that were acquired in 2013.
The topic of internet security has been around for many years, but recently data science and security have joined forces. Many security applications collect vast amounts of data, and many of them make decisions based upon observed activity. Data science can help collect and organize all the past activity, and machine learning can then be used to classify new activity as malicious or not. Here are 2 recent articles on the combination of security and data science.
It is back-to-school time, and here are some papers to keep you busy this school year. All the papers are free. This list is far from exhaustive, but these are some important papers in data science and big data.
- PageRank – This is the paper that explains the algorithm behind Google search.
- MapReduce – This paper explains a programming model for processing large datasets. In particular, it is the programming model used in Hadoop.
- Google File System – Part of Hadoop is HDFS. HDFS is an open-source version of the distributed file system explained in this paper.
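For readers who have not seen the MapReduce programming model before, here is a minimal single-machine sketch of the idea, using the classic word-count example. It is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for illustration. A real framework would run the map and reduce steps in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence (illustrative, not distributed)."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    docs = ["big data is big", "data science uses big data"]
    print(word_count(docs))
```

The point of the model is that the map and reduce functions never share state, so a framework is free to spread them over many workers and only has to coordinate the grouping step in between.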
These are 2 of the papers that started the NoSQL debate. Each paper describes a different type of storage system intended to be massively scalable.
- Random Forests – One of the most popular machine learning techniques. It is heavily used in Kaggle competitions, even by the winners.
Are there any other papers you feel should be on the list?
Alteryx is offering the book, Big Data Analytics For Dummies, for free. If you are new to the term big data, this book provides a brief (about 40 pages) overview of the topic and what big data should be able to do for your company.
You have to register, but it is worth it for the free book.