R is a hugely popular language among data scientists and statisticians. One of the difficulties with open-source R is the memory constraint: all the data needs to be loaded into memory as a data.frame. Microsoft addresses this problem with the RevoScaleR package of the Microsoft R Server. Just launched this week is an EdX course on Analyzing Big Data with Microsoft R Server.
According to the syllabus:
Upon completion, you will know how to use R for big-data problems.
Full Disclosure: I work at Microsoft, and the course instructor, Seth Mottaghinejad, is one of my colleagues.
The differences between data scientists, data engineers, and software engineers can get a little confusing at times. Thus, here is a guest post provided by Jake Stein, CEO at Stitch (formerly RJ Metrics), which aims to clear up some of that confusion based upon LinkedIn data.
As data grows, so does the expertise needed to manage it. The past few years have seen an increasing distinction between the key roles tasked with managing data: software engineers, data engineers, and data scientists.
More and more we’re seeing data engineers emerge as a subset within the software engineering discipline, but this is still a relatively new trend. Plenty of software engineers are still tasked with moving and managing data.
Our team has released two reports over the past year, one focused on understanding the data science role, one on data engineering. Both of these reports are based on self-reported LinkedIn data. In this post, I’ll lay out the distinctions between these roles and software engineers, but first, here’s a diagram to show you (in very broad strokes) what we saw in the skills breakdown between these three roles:
A software engineer builds applications and systems. Developers will be involved through all stages of this process from design, to writing code, to testing and review. They are creating the products that create the data. Software engineering is the oldest of these three roles, and has established methodologies and tool sets.
Frontend and backend development
Operating system development
A data engineer builds systems that consolidate, store, and retrieve data from the various applications and systems created by software engineers. Data engineering emerged as a niche skill set within software engineering. 40% of all data engineers previously worked as software engineers, making this the most common career path for data engineers by far.
Advanced data structures
Knowledge of new & emerging tools: Hadoop, Spark, Kafka, Hive, etc.
Building ETL/data pipelines
A data scientist builds analysis on top of data. This may come in the form of a one-off analysis for a team trying to better understand customer behavior, or a machine learning algorithm that is then implemented into the code base by software engineers and data engineers.
Business Intelligence dashboards
Evolving Data Teams
These roles are still evolving. The process of ETL is getting much easier overall as new tools (like Stitch) enter the market, making it easy for software developers to set up and maintain data pipelines. Larger companies are pulling data engineers off the software engineering team entirely in favor of forming a centralized data team where infrastructure and analysis sit together. In some scenarios data scientists are responsible for both data consolidation and analysis.
At this point, there is no single dominant path. But we expect this rapid evolution to continue; after all, data certainly isn’t getting any smaller.
Our World in Data is a data visualization site for exploring the history of civilization. The site was created by Max Roser. Our World in Data contains tons of information about many aspects of people’s lives. It also includes numerous visuals (like the one below) which can be easily shared or embedded on other sites.
Beware, the site is addicting, and you might spend a lot of time exploring data.
Today, I am proud to welcome a guest post by Claire Gilbert, Data Analyst at Gongos. For more on Gongos, see the description at the end of the post.
It’s fair to say that for those who run in business intelligence circles, many admire the work of Fast Forward Labs CEO and Founder Hilary Mason. Perhaps what resonates most with her fans is the moniker she places on data scientists as being ‘awesome nerds’—those who embody the perfect skillsets of math and stats, coding, and communication. She asserts that these individuals have the technical expertise to not only conduct the really, really complex work—but also have the ability to explain the impact of that work to a non-technical audience.
As insights and analytics organizations strive to assemble their own group of ‘awesome nerds,’ there are two ways to consider Hilary’s depiction. Most organizations struggle by taking the first route—searching for those very expensive, highly rare unicorns—individuals that independently sit at this critical intersection of genius. Besides the fact that it would be even more expensive to clone these data scientists, there is simply not enough bandwidth in their day to fulfill on their awesomeness 24/7.
To quote Aristotle, one of the earliest scientists of our time, “the whole is greater than the sum of its parts,” which brings us to the notion of the team. Rather than seeking out those highly sought-after individuals with skills in all three camps, consider creating a collective of individuals with skills from each camp. After all, no one person can solve for the depth and breadth of an organization’s growing data science needs. It takes a specialist such as a mathematician to dive deep; as well as a multidisciplinary mind who can comprehend the breadth, to truly achieve the perfect team.
Team Dynamics of the Data Kind
The ultimate charge for any data science team is to be a problem-solving machine, one that constantly churns in an ever-changing climate. An increasing abundance of data, which in turn gives rise to once-unanswerable business questions, has led clients to expect new levels of complexity in insights. This chain reaction brings with it a unique set of challenges not previously met by a prescribed methodology. As the sets of inputs become more diverse, so too should the skillsets to answer them. While all three characteristics of the ‘awesome nerd’ are indispensable, it’s the collective of ‘nerds’ that will become the driving force in today’s data world.
True to the construct, no two pieces should operate independent of the third. Furthermore, finding and honing balance within a data science team will result in the highest degree of accuracy and relevancy possible.
Let’s look at the makeup of a perfectly balanced team:
This trained academic builds advanced models based on inputs, while understanding the theory and requirements for the results to be leveraged correctly.
This hands-on ‘architect’ is in charge of cleaning, managing and reshaping data, as well as building simulators or other highly technical tools that result in user-friendly data.
This business ‘translator’ applies an organizational lens to bring previous knowledge to the table in order to connect technical skill sets to client needs.
It’s the interdependence of these skillsets that completes the team and its ability to deliver fully on the promise of data:
A Mathematician/Statistician’s work relies heavily on the Coder/Programmer’s skills. The notion of garbage-in/garbage-out very much applies here. If the Coder hasn’t sourced and managed the data judiciously, the Mathematician cannot build usable models. Both then rely on the knowledge of the Communicator/Content Expert. Even if the data is perfect, and the results statistically correct, the output cannot be activated against unless it is directly relevant to the business challenge. Furthermore, teams out of balance will be faced with hurdles for which they are not adequately prepared, and output that is not adequately delivered.
To Buy or to Build?
In today’s world of high velocity and high volume of data, companies are faced with a choice. Traditional programmers like those who have coded surveys and collected data are currently integrated in the work streams of most insights organizations. However, many of them are not classically trained in math and/or statistics. Likewise, existing quantitative-minded, client-facing talents can be leveraged in the rebuilding of a team. Training either of these existing individuals who have a bent in math and/or stats is possible, yet is a time-intensive process that calls for patience. If organizations value and believe in their existing talent and choose to go this route, it will then point to the gaps that need to be filled—or bought—to build the ‘perfect’ team.
Organizations have long known the value of data, but no matter how large and detailed it gets, without the human dimension, it will fail to live up to its $30 billion valuation by 2019. The interpretation, distillation and curation of all kinds of data by a team in equilibrium will propel this growth and underscore the importance of data science.
Many people think Hilary’s notion of “awesome nerds” applies only to individuals. But in practice, if we are to realize this kind of market potential, the team must embody the constitution of awesomeness.
As organizations assemble and recruit teams, perhaps their mission statement quite simply should be…
“If you can find the nerds, keep them, but in the absence of an office full of unicorns, create one.”
Gongos, Inc. is a decision intelligence company that partners with Global 1000 corporations to help build the capability and competency in making great consumer-minded decisions. Gongos brings a consultative approach in developing growth strategies propelled by its clients’ insights, analytics, strategy and innovation groups.
Enlisting the multidisciplinary talents of researchers, data scientists and curators, the company fuels a culture of learning both internally and within its clients’ organizations. Gongos also works with clients to develop strategic frameworks to navigate the change required for executional excellence. It serves organizations in the consumer products, financial services, healthcare, lifestyle, retail, and automotive spaces.
Recently, a number of resources for publicly available datasets have been announced.
Kaggle becomes the place for Open Data – I think this is big news! Kaggle just announced Kaggle Datasets which aims to be a repository for publicly available datasets. This is great for organizations that want to release data, but do not necessarily want the overhead of running an open data portal. Hopefully it will gain some traction and become an exceptional resource for open data.
NASA Opens Research – NASA just announced all research papers funded by NASA will be publicly available. It appears the research articles will all be available at PubMed Central, and the data available at NASA’s Data Portal.
Google Robotics Data – Google continues to do interesting things, and this topic is definitely that. It is a dataset about how robots grasp objects (Google Brain Robot Data). I am not overly familiar with this topic, so if you want to know more, see their blog post, Deep Learning for Robots.
The UK government has taken the first step in providing a solid grounding for the future of data science ethics. Recently, they published a “beta” version of the Data Science Ethical Framework.
The framework is based around 6 clear principles:
Start with clear user need and public benefit
Use data and tools which have the minimum intrusion necessary
Create robust data science models
Be alert to public perceptions
Be as open and accountable as possible
Keep data secure
See the above link for further details. The framework is somewhat specific to the UK, but it would be nice to see other countries and organizations adopt a similar framework. Even DJ Patil, U.S. Chief Data Scientist, has stated the importance of ethics in all data science curricula.
Andrew Ng [Co-Founder of Coursera, Stanford Professor, Chief Scientist at Baidu, and All-Around Machine Learning Expert] is writing a book during the summer of 2016. The book is titled Machine Learning Yearning. If you visit the site and sign up quickly, you can get draft copies of the chapters as they become available.
Andrew is an excellent teacher. His MOOCs are wildly successful, and I expect his book to be excellent as well.
This is a guest post from Michael Li of The Data Incubator. The Data Incubator runs a free eight-week data science fellowship to help transition their Fellows from Academia to Industry. This post runs through some of the toolsets you’ll need to know to kickstart your Data Science Career.
If you’re an aspiring data scientist but still processing your data in Excel, you might want to upgrade your toolset. Why? Firstly, while advanced features like Excel Pivot tables can do a lot, they don’t offer nearly the flexibility, control, and power of tools like SQL, or their functional equivalents in Python (Pandas) or R (Dataframes). Also, Excel has low size limits, making it suitable for “small data”, not “big data.”
In this blog entry we’ll talk about SQL. This should cover your “medium data” needs, which we’ll define as the next level of data, where the number of rows exceeds Excel’s limit of roughly 1 million rows. SQL stores data in tables, which you can think of as a spreadsheet layout but with more structure. Each row represents a specific record (e.g. an employee at your company) and each column of a table corresponds to an attribute (e.g. name, department id, salary). Critically, all values in a given column must be of the same “type”. Here is a sample of the table Employees:
SQL has many keywords which compose its query language but the ones most relevant to data scientists are SELECT, WHERE, GROUP BY, JOIN. We’ll go through these each individually.
SELECT is the foundational keyword in SQL. SELECT can also filter on columns. For example
SELECT Name, StartYear FROM Employees
The WHERE clause filters the rows. For example
SELECT * FROM Employees WHERE StartYear=2004
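To try the two queries above without installing a database server, one option is SQLite through Python's built-in sqlite3 module. The following is only a minimal sketch: the Employees rows are hypothetical values invented for illustration (they are not the sample table from this post), and the ORDER BY clauses are added just to make the output order predictable.

```python
import sqlite3

# In-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees ("
    "Name TEXT, DepartmentId INTEGER, StartYear INTEGER, Salary REAL)"
)

# Hypothetical rows, invented purely for illustration.
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?)",
    [
        ("Alice", 1, 2004, 90000.0),
        ("Bob", 2, 2004, 70000.0),
        ("Carol", 1, 2010, 80000.0),
    ],
)

# SELECT with a column list keeps only the named columns.
name_and_year = conn.execute(
    "SELECT Name, StartYear FROM Employees ORDER BY Name"
).fetchall()

# WHERE keeps only the rows that satisfy the condition.
started_2004 = conn.execute(
    "SELECT * FROM Employees WHERE StartYear = ? ORDER BY Name", (2004,)
).fetchall()

print(name_and_year)
print(started_2004)
```

Passing 2004 as a parameter (the ? placeholder) rather than pasting it into the query string is a habit worth forming early, since it avoids quoting problems when the value comes from user input.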
Next, the GROUP BY clause allows for combining rows using different functions like COUNT (count) and AVG (average). For example,
SELECT StartYear, COUNT(*) as Num, AVG(Salary) as AvgSalary
FROM Employees
GROUP BY StartYear
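Here is a small runnable sketch of the same kind of aggregation, again using Python's sqlite3 module with an in-memory database. The rows are hypothetical and exist only to show how GROUP BY collapses rows that share a StartYear; note that the query names its source table with FROM Employees.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (Name TEXT, StartYear INTEGER, Salary REAL)")

# Hypothetical rows: two employees started in 2004, one in 2010.
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        ("Alice", 2004, 90000.0),
        ("Bob", 2004, 70000.0),
        ("Carol", 2010, 80000.0),
    ],
)

# One output row per distinct StartYear, with a count and an average salary.
by_year = conn.execute(
    "SELECT StartYear, COUNT(*) AS Num, AVG(Salary) AS AvgSalary "
    "FROM Employees "
    "GROUP BY StartYear "
    "ORDER BY StartYear"
).fetchall()

for start_year, num, avg_salary in by_year:
    print(start_year, num, avg_salary)
```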
Finally, the JOIN clause allows us to join in other tables. For example, assume we have a table called Departments:
We could use JOIN to combine the Employees and Departments tables based ON the DepartmentId fields:
SELECT Employees.Name AS EmpName, Departments.DepartmentName AS DepName
FROM Employees JOIN Departments
ON Employees.DepartmentId = Departments.DepartmentId;
The results might look like:
We’ve ignored a lot of details about joins: e.g. there are actually (at least) 4 types of joins, but hopefully this gives you a good picture.
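To give a feel for why the join type matters, the sketch below compares a plain (inner) JOIN with a LEFT JOIN using Python's sqlite3 module. Both tables contain hypothetical rows invented for this illustration; in particular, Carol is deliberately given a DepartmentId that has no match in Departments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (Name TEXT, DepartmentId INTEGER)")
conn.execute("CREATE TABLE Departments (DepartmentId INTEGER, DepartmentName TEXT)")

# Hypothetical rows; DepartmentId 3 intentionally has no matching department.
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [("Alice", 1), ("Bob", 2), ("Carol", 3)],
)
conn.executemany(
    "INSERT INTO Departments VALUES (?, ?)",
    [(1, "Engineering"), (2, "Sales")],
)

query = (
    "SELECT Employees.Name AS EmpName, Departments.DepartmentName AS DepName "
    "FROM Employees {join} Departments "
    "ON Employees.DepartmentId = Departments.DepartmentId "
    "ORDER BY Employees.Name"
)

# Inner join: only employees whose department exists are returned.
inner_rows = conn.execute(query.format(join="JOIN")).fetchall()

# Left join: every employee is returned; a missing department shows up as None.
left_rows = conn.execute(query.format(join="LEFT JOIN")).fetchall()

print(inner_rows)
print(left_rows)
```

The inner join silently drops Carol, while the left join keeps her with an empty department name. Which behavior you want depends on the question being asked.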
Conclusion and Further Reading
With these basic commands, you can get a lot of basic data processing done. Don’t forget that you can nest queries and create really complicated joins. It’s a lot more powerful than Excel, and gives you much better control of your data. Of course, there’s a lot more to SQL than what we’ve mentioned and this is only intended to whet your appetite and give you a taste of what you’re missing.
For a tutorial about R’s Dataframes, check out this page.
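As one small illustration of nesting, the sketch below uses a subquery inside a WHERE clause to find employees who earn more than the average salary. It runs against an in-memory SQLite database via Python's sqlite3 module, and the rows are hypothetical values chosen only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (Name TEXT, Salary REAL)")

# Hypothetical rows; the average salary here is 80000.
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [("Alice", 90000.0), ("Bob", 70000.0), ("Carol", 80000.0)],
)

# The inner SELECT computes a single value (the average salary),
# which the outer WHERE then uses as its threshold.
above_average = conn.execute(
    "SELECT Name FROM Employees "
    "WHERE Salary > (SELECT AVG(Salary) FROM Employees) "
    "ORDER BY Name"
).fetchall()

print(above_average)
```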
And when you’re ready to step it up from “medium data” to “big data”, you should apply for a fellowship at The Data Incubator where we work with current-generation data-processing technologies like MapReduce and Spark!