Data Science and the Perfect Team

Today, I am proud to welcome a guest post by Claire Gilbert, Data Analyst at Gongos. For more on Gongos, see the description at the end of the post.

It’s fair to say that for those who run in business intelligence circles, many admire the work of Fast Forward Labs CEO and Founder Hilary Mason. Perhaps what resonates most with her fans is the moniker she places on data scientists as being ‘awesome nerds’—those who embody the perfect skillsets of math and stats, coding, and communication. She asserts that these individuals have the technical expertise to not only conduct the really, really complex work—but also have the ability to explain the impact of that work to a non-technical audience.

As insights and analytics organizations strive to assemble their own group of ‘awesome nerds,’ there are two ways to consider Hilary’s depiction. Most organizations struggle by taking the first route—searching for those very expensive, highly rare unicorns—individuals that independently sit at this critical intersection of genius. Besides the fact that it would be even more expensive to clone these data scientists, there is simply not enough bandwidth in their day to fulfill on their awesomeness 24/7.

To quote Aristotle, one of the earliest scientists of our time, “the whole is greater than the sum of its parts,” which brings us to the notion of the team. Rather than seeking out those highly sought-after individuals with skills in all three camps, consider creating a collective of individuals with skills from each camp. After all, no one person can solve for the depth and breadth of an organization’s growing data science needs. It takes a specialist such as a mathematician to dive deep; as well as a multidisciplinary mind who can comprehend the breadth, to truly achieve the perfect team.

Awesome Nerds

Team Dynamics of the Data Kind

The ultimate charge for any data science team is to be a problem-solving machine—one that constantly churns in an ever-changing climate. Faced with an increasing abundance of data, which in turn gives rise to once-unanswerable business questions, has led clients to expect new levels of complexity in insights. This chain reaction brings with it a unique set of challenges not previously met by a prescribed methodology. As the sets of inputs become more diverse, so too should the skillsets to answer them. While all three characteristics of the ‘awesome nerd’ are indispensable, it’s the collective of ‘nerds’ that will become the driving force in today’s data world.
True to the construct, no two pieces should operate independent of the third. Furthermore, finding and honing balance within a data science team will result in the highest degree of accuracy and relevancy possible.
Let’s look at the makeup of a perfectly balanced team:

  • Mathematician/Statistician:
    This trained academic builds advanced models based on inputs, while understanding the theory and requirements for the results to be leveraged correctly.
  • Coder/Programmer:
    This hands-on ‘architect’ is in charge of cleaning, managing and reshaping data, as well as building simulators or other highly technical tools that result in user-friendly data.
  • Communicator/Content Expert:
    This business ‘translator’ applies an organizational lens to bring previous knowledge to the table in order to connect technical skill sets to client needs.

It’s the interdependence of these skillsets that completes the team and its ability to deliver fully on the promise of data:
A Mathematician/Statistician’s work relies heavily on the Coder/Programmer’s skills. The notion of garbage-in/garbage-out very much applies here. If the Coder hasn’t sourced and managed the data judiciously, the Mathematician cannot build usable models. Both then rely on the knowledge of the Communicator/Content Expert. Even if the data is perfect, and the results statistically correct, the output cannot be activated against unless it is directly relevant to the business challenge. Furthermore, teams out of balance will be faced with hurdles for which they are not adequately prepared, and output that is not adequately delivered.

To Buy or to Build?

In today’s world of high velocity and high volume of data, companies are faced with a choice. Traditional programmers like those who have coded surveys and collected data are currently integrated in the work streams of most insights organizations. However, many of them are not classically trained in math and/or statistics. Likewise, existing quantitative-minded, client-facing talents can be leveraged in the rebuilding of a team. Training either of these existing individuals who have a bent in math and/or stats is possible, yet is a time-intensive process that calls for patience. If organizations value and believe in their existing talent and choose to go this route, it will then point to the gaps that need to be filled—or bought—to build the ‘perfect’ team.
Organizations have long known the value of data, but no matter how large and detailed it gets, without the human dimension, it will fail to live up to its $30 billion valuation by 2019. The interpretation, distillation and curation of all kinds of data by a team in equilibrium will propel this growth and underscore the importance of data science.
Many people think Hilary’s notion of “awesome nerds” applies only to individuals. But in practice, we must realize this kind of market potential, the team must embody the constitution of awesomeness.
As organizations assemble and recruit teams, perhaps their mission statement quite simply should be…
If you can find the nerds, keep them, but in the absence of an office full of unicorns, create one.

About Gongos

Gongos, Inc. is a decision intelligence company that partners with Global 1000 corporations to help build the capability and competency in making great consumer-minded decisions. Gongos brings a consultative approach in developing growth strategies propelled by its clients’ insights, analytics, strategy and innovation groups.

Enlisting the multidisciplinary talents of researchers, data scientists and curators, the company fuels a culture of learning both internally and within its clients’ organizations. Gongos also works with clients to develop strategic frameworks to navigate the change required for executional excellence. It serves organizations in the consumer products, financial services, healthcare, lifestyle, retail, and automotive spaces.

Recent Resources for Open Data

Recently, a number of resources for publicly available datasets have been announced.

  • Kaggle becomes the the place for Open Data – I think this is big news! Kaggle just announced Kaggle Datasets which aims to be a repository for publicly available datasets. This is great for organizations that want to release data, but do not necessarily want the overhead of running an open data portal. Hopefully it will gain some traction and become an exceptional resource for open data.
  • NASA Opens Research – NASA just announced all research papers funded by NASA will be publicly available. It appears the research articles will all be available at PubMed Central, and the data available at NASA’s Data Portal.
  • Google Robotics Data – Google continues to do interesting things, and this topic is definitely that. It is a dataset about how robots grasp objects (Google Brain Robot Data). I am not overly familiar with this topic, so if you want to know more, see their blog post, Deep Learning for Robots.

For more options of open data, see Data Sources for Cool Data Science Projects Part 1 and Part 2.

Are you aware of any other resources that have been recently announced? If so, please leave a comment.

Data Science Ethical Framework

The UK government has taken the first step in providing a solid grounding for the future of data science ethics. Recently, they published a “beta” version of the Data Science Ethical Framework.

The framework is based around 6 clear principles:

  1. Start with clear user need and public benefit
  2. Use data and tools which have the minimum intrusion necessary
  3. Create robust data science models
  4. Be alert to public perceptions
  5. Be as open and accountable as possible
  6. Keep data secure

See the above link for further details. The framework is somewhat specific to the UK, but it would be nice to see other countries/organizations adopt a similar framework. Even DJ Patil, U.S. Chief Data Scientist, has stated the importance of ethics in all data science curriculum.

Machine Learning Yearning Book

Andrew Ng [Co-Founder of Coursera, Stanford Professor, Chief Scientist at Baidu, and All-Around Machine Learning Expert] is writing a book during the summer of 2016. The book is titled, Machine Learning Yearning. It you visit the site and signup quickly you can get draft copies of the chapters as they become available.

Andrew is an excellent teacher. His MOOCs are wildly successful, and I expect his book to be excellent as well.

How to Kickstart Your Data Science Career

This is a guest post from Michael Li of The Data Incubator. The The Data Incubator runs a free eight week data science fellowship to help transition their Fellows from Academia to Industry. This post runs through some of the toolsets you’ll need to know to kickstart your Data Science Career.


If you’re an aspiring data scientist but still processing your data in Excel, you might want to upgrade your toolset.  Why?  Firstly, while advanced features like Excel Pivot tables can do a lot, they don’t offer nearly the flexibility, control, and power of tools like SQL, or their functional equivalents in Python (Pandas) or R (Dataframes).  Also, Excel has low size limits, making it suitable for “small data”, not  “big data.”

In this blog entry we’ll talk about SQL.  This should cover your “medium data” needs, which we’ll define as the next level of data where the rows do not fit the 1 million row restriction in Excel.  SQL stores data in tables, which you can think of as a spreadsheet layout but with more structure.  Each row represents a specific record, (e.g. an employee at your company) and each column of a table corresponds to an attribute (e.g. name, department id, salary).  Critically, each column must be of the same “type”.  Here is a sample of the table Employees:

EmployeeId Name StartYear Salary DepartmentId
1 Bob 2001 10.5 10
2 Sally 2004 20 10
3 Alice 2005 25 20
4 Fred 2004 12.5 20

SQL has many keywords which compose its query language but the ones most relevant to data scientists are SELECT, WHERE, GROUP BY, JOIN.  We’ll go through these each individually.


SELECT is the foundational keyword in SQL. SELECT can also filter on columns.  For example

SELECT Name, StartYear FROM Employees


Name StartYear
Bob 2001
Sally 2004
Alice 2005
Fred 2004



The WHERE clause filters the rows. For example

SELECT * FROM Employees WHERE StartYear=2004


EmployeeId Name StartYear Salary DepartmentId
2 Sally 2004 20 10
4 Fred 2004 12.5 20



Next, the GROUP BY clause allows for combining rows using different functions like COUNT (count) and AVG (average). For example,

SELECT StartYear, COUNT(*) as Num, AVG(Salary) as AvgSalary
GROUP BY StartYear


StartYear Num AvgSalary
2001 1 10.5
2004 2 16.25
2005 1 25



Finally, the JOIN clause allows us to join in other tables. For example, assume we have a table called Departments:

DepartmentId DepartmentName
10 Sales
20 Engineering

We could use JOIN to combine the Employees and Departments tables based ON the DepartmentId fields:

SELECT Employees.Name AS EmpName, Departments.DepartmentName AS DepName
FROM Employees JOIN Departments
ON Employees.DepartmentId = Departments.DepartmentId;

The results might look like:

EmpName DepName
Bob Sales
Sally Sales
Alice Engineering
Fred Engineering

We’ve ignored a lot of details about joins: e.g. there are actually (at least) 4 types of joins, but hopefully this gives you a good picture.

Conclusion and Further Reading

With these basic commands, you can get a lot of basic data processing done.  Don’t forget, that you can nest queries and create really complicated joins.  It’s a lot more powerful than Excel, and gives you much better control of your data.  Of course, there’s a lot more to SQL than what we’ve mentioned and this is only intended to wet your appetite and give you a taste of what you’re missing.


And when you’re ready to step it up from “medium data” to “big data”, you should apply for a fellowship at The Data Incubator where we work with current-generation data-processing technologies like MapReduce and Spark!

Berkeley Undergrad Data Science Course and Textbook

The University of California at Berkeley has started The Berkeley Data Science Education Program. The goal is to build a data science education program throughout the next several years by engaging faculty and students from across the campus. The introductory data science course is targeting freshman and it is taught from a very applicable and interactive environment. The course videos, slides, labs, and notes are freely available for others to use. The course heavily uses Jupyter. Also, the course textbook is online at Computational and Inferential Thinking: The Foundations of Data Science.

A Couple of Current Data Science Competitions

Decoding Brain Signals

Microsoft has recently announced a machine learning competition platform. As part of the launch, one of the first competitions is the prediction of brain signals. It has $5000 in prizes, and submissions are accepted thru June 30, 2016.

Big Data Viz Challenge

Google and Tableau have teamed up to offer a big data visualization contest. The rules are fairly simple, just create an awesome visualization using at least the GDELT data set. Finalist will receive prizes worth over $5000 and even some will get tours of Tableau and Google facilities. The contest runs thru May 16, 2016.

Free Stats book for Computer Scientists

Professor Norm Matloff from the University of California, Davis has published From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science which is an open textbook. It approaches statistics from a computer science perspective. Dr. Matloff has been both a professor of statistics and computer science so he is well suited to write such a textbook. This would a good choice of a textbook for a statistics course targeted at primarily computer scientists. It uses the R programming language. The book starts by building the foundations of probability before entering statistics.

Do’s and Don’ts of Data Science

Don’t Start with the Data
Do Start with a Good Question

Don’t think one person can do it all
Do build a well-rounded team

Don’t only use one tool
Do use the best tool for the job

Don’t brag about the size of your data
Do collect relevant data

Don’t ignore domain knowledge
Do consult a subject matter expert

Don’t publish a table of numbers
Do create informative charts

Don’t use just your own data
Do enhance your analysis with open data

Don’t do all the work yourself
Do partner with local universities

Don’t always build your own tools
Do use lots of open source tools

Don’t keep all your findings to yourself
Do share your analysis and results with the world!

Got any to add? Please leave a comment.

Learning To Be A Data Scientist