This is just a short list of a few books that I have recently discovered online.
Model-Based Machine Learning – Chapters of this book become available as they are being written. It introduces machine learning via case studies instead of just focusing on the algorithms.
Foundations of Data Science – This is a much more academic-focused book which could be used at the undergraduate or graduate level. It covers many of the topics one would expect: machine learning, streaming, clustering and more.
Today, I am proud to welcome a guest post by Claire Gilbert, Data Analyst at Gongos. For more on Gongos, see the description at the end of the post.
It’s fair to say that for those who run in business intelligence circles, many admire the work of Fast Forward Labs CEO and Founder Hilary Mason. Perhaps what resonates most with her fans is the moniker she places on data scientists as being ‘awesome nerds’—those who embody the perfect skillsets of math and stats, coding, and communication. She asserts that these individuals have the technical expertise to not only conduct the really, really complex work—but also have the ability to explain the impact of that work to a non-technical audience.
As insights and analytics organizations strive to assemble their own group of ‘awesome nerds,’ there are two ways to consider Hilary’s depiction. Most organizations struggle by taking the first route—searching for those very expensive, highly rare unicorns—individuals that independently sit at this critical intersection of genius. Besides the fact that it would be even more expensive to clone these data scientists, there is simply not enough bandwidth in their day to fulfill on their awesomeness 24/7.
To quote Aristotle, one of the earliest scientists of our time, “the whole is greater than the sum of its parts,” which brings us to the notion of the team. Rather than seeking out those highly sought-after individuals with skills in all three camps, consider creating a collective of individuals with skills from each camp. After all, no one person can solve for the depth and breadth of an organization’s growing data science needs. It takes a specialist such as a mathematician to dive deep; as well as a multidisciplinary mind who can comprehend the breadth, to truly achieve the perfect team.
Team Dynamics of the Data Kind
The ultimate charge for any data science team is to be a problem-solving machine, one that constantly churns in an ever-changing climate. An increasing abundance of data, which in turn gives rise to once-unanswerable business questions, has led clients to expect new levels of complexity in insights. This chain reaction brings with it a unique set of challenges not previously met by a prescribed methodology. As the sets of inputs become more diverse, so too should the skillsets to answer them. While all three characteristics of the ‘awesome nerd’ are indispensable, it’s the collective of ‘nerds’ that will become the driving force in today’s data world.
True to the construct, no two pieces should operate independent of the third. Furthermore, finding and honing balance within a data science team will result in the highest degree of accuracy and relevancy possible.
Let’s look at the makeup of a perfectly balanced team:
Mathematician/Statistician
This trained academic builds advanced models based on inputs, while understanding the theory and requirements for the results to be leveraged correctly.
Coder/Programmer
This hands-on ‘architect’ is in charge of cleaning, managing and reshaping data, as well as building simulators or other highly technical tools that result in user-friendly data.
Communicator/Content Expert
This business ‘translator’ applies an organizational lens to bring previous knowledge to the table in order to connect technical skill sets to client needs.
It’s the interdependence of these skillsets that completes the team and its ability to deliver fully on the promise of data:
A Mathematician/Statistician’s work relies heavily on the Coder/Programmer’s skills. The notion of garbage-in/garbage-out very much applies here. If the Coder hasn’t sourced and managed the data judiciously, the Mathematician cannot build usable models. Both then rely on the knowledge of the Communicator/Content Expert. Even if the data is perfect, and the results statistically correct, the output cannot be activated against unless it is directly relevant to the business challenge. Furthermore, teams out of balance will be faced with hurdles for which they are not adequately prepared, and output that is not adequately delivered.
To Buy or to Build?
In today’s world of high velocity and high volume of data, companies are faced with a choice. Traditional programmers like those who have coded surveys and collected data are currently integrated in the work streams of most insights organizations. However, many of them are not classically trained in math and/or statistics. Likewise, existing quantitative-minded, client-facing talents can be leveraged in the rebuilding of a team. Training either of these existing individuals who have a bent in math and/or stats is possible, yet is a time-intensive process that calls for patience. If organizations value and believe in their existing talent and choose to go this route, it will then point to the gaps that need to be filled—or bought—to build the ‘perfect’ team.
Organizations have long known the value of data, but no matter how large and detailed it gets, without the human dimension, it will fail to live up to its $30 billion valuation by 2019. The interpretation, distillation and curation of all kinds of data by a team in equilibrium will propel this growth and underscore the importance of data science.
Many people think Hilary’s notion of “awesome nerds” applies only to individuals. But in practice, to realize this kind of market potential, the team must embody the constitution of awesomeness.
As organizations assemble and recruit teams, perhaps their mission statement quite simply should be…
“If you can find the nerds, keep them, but in the absence of an office full of unicorns, create one.”
Gongos, Inc. is a decision intelligence company that partners with Global 1000 corporations to help build the capability and competency in making great consumer-minded decisions. Gongos brings a consultative approach in developing growth strategies propelled by its clients’ insights, analytics, strategy and innovation groups.
Enlisting the multidisciplinary talents of researchers, data scientists and curators, the company fuels a culture of learning both internally and within its clients’ organizations. Gongos also works with clients to develop strategic frameworks to navigate the change required for executional excellence. It serves organizations in the consumer products, financial services, healthcare, lifestyle, retail, and automotive spaces.
The UK government has taken the first step in providing a solid grounding for the future of data science ethics. Recently, they published a “beta” version of the Data Science Ethical Framework.
The framework is based around 6 clear principles:
Start with clear user need and public benefit
Use data and tools which have the minimum intrusion necessary
Create robust data science models
Be alert to public perceptions
Be as open and accountable as possible
Keep data secure
See the above link for further details. The framework is somewhat specific to the UK, but it would be nice to see other countries/organizations adopt a similar framework. Even DJ Patil, U.S. Chief Data Scientist, has stated the importance of ethics in all data science curricula.
Andrew Ng [Co-Founder of Coursera, Stanford Professor, Chief Scientist at Baidu, and All-Around Machine Learning Expert] is writing a book during the summer of 2016. The book is titled Machine Learning Yearning. If you visit the site and sign up quickly, you can get draft copies of the chapters as they become available.
Andrew is an excellent teacher. His MOOCs are wildly successful, and I expect his book to be excellent as well.
This is a guest post from Michael Li of The Data Incubator. The Data Incubator runs a free eight-week data science fellowship to help transition their Fellows from Academia to Industry. This post runs through some of the toolsets you’ll need to know to kickstart your Data Science Career.
If you’re an aspiring data scientist but still processing your data in Excel, you might want to upgrade your toolset. Why? Firstly, while advanced features like Excel Pivot tables can do a lot, they don’t offer nearly the flexibility, control, and power of tools like SQL, or their functional equivalents in Python (Pandas) or R (Dataframes). Also, Excel has low size limits, making it suitable for “small data”, not “big data.”
In this blog entry we’ll talk about SQL. This should cover your “medium data” needs, which we’ll define as the next level of data, where the rows exceed the roughly 1 million row limit in Excel. SQL stores data in tables, which you can think of as a spreadsheet layout but with more structure. Each row represents a specific record (e.g. an employee at your company) and each column of a table corresponds to an attribute (e.g. name, department id, salary). Critically, all values in a column must be of the same “type”. Here is a sample of the table Employees:
SQL has many keywords which compose its query language, but the ones most relevant to data scientists are SELECT, WHERE, GROUP BY, and JOIN. We’ll go through each of these individually.
SELECT is the foundational keyword in SQL: it retrieves data from a table and can also filter on columns. For example,
SELECT Name, StartYear FROM Employees
The WHERE clause filters the rows. For example
SELECT * FROM Employees WHERE StartYear=2004
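To make the two queries above concrete, here is a minimal Python sketch that runs them against an in-memory SQLite database using the standard sqlite3 module. The three sample rows are invented for illustration and are not the Employees data from this post; the ORDER BY clauses are added only so the output order is predictable.

```python
import sqlite3

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees "
    "(Name TEXT, DepartmentId INTEGER, StartYear INTEGER, Salary REAL)"
)

# Hypothetical sample rows, made up for this sketch.
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?)",
    [
        ("Alice", 1, 2004, 70000.0),
        ("Bob", 2, 2004, 60000.0),
        ("Carol", 1, 2010, 80000.0),
    ],
)

# SELECT with a column list returns only those columns.
name_and_year = conn.execute(
    "SELECT Name, StartYear FROM Employees ORDER BY Name"
).fetchall()

# WHERE keeps only the rows that satisfy the condition.
started_2004 = conn.execute(
    "SELECT * FROM Employees WHERE StartYear = 2004 ORDER BY Name"
).fetchall()

print(name_and_year)
print(started_2004)
```

Each result comes back as a list of tuples, one tuple per row, which makes it easy to compare what SELECT and WHERE each remove.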
Next, the GROUP BY clause allows for combining rows using different functions like COUNT (count) and AVG (average). For example,
SELECT StartYear, COUNT(*) as Num, AVG(Salary) as AvgSalary
FROM Employees
GROUP BY StartYear
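As a rough illustration of how GROUP BY collapses rows, the sketch below runs a grouped query in Python with the standard sqlite3 module. The sample rows are hypothetical, and the query names the Employees table explicitly in a FROM clause, which any runnable query needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (Name TEXT, StartYear INTEGER, Salary REAL)")

# Hypothetical sample rows, made up for this sketch.
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?)",
    [
        ("Alice", 2004, 70000.0),
        ("Bob", 2004, 60000.0),
        ("Carol", 2010, 80000.0),
    ],
)

# One output row per distinct StartYear, with a count and an average
# computed over the rows that fall into each group.
per_year = conn.execute(
    "SELECT StartYear, COUNT(*) AS Num, AVG(Salary) AS AvgSalary "
    "FROM Employees "
    "GROUP BY StartYear "
    "ORDER BY StartYear"
).fetchall()

for start_year, num, avg_salary in per_year:
    print(start_year, num, avg_salary)
```

Three input rows become two output rows here, because two of the made-up employees share a start year.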
Finally, the JOIN clause allows us to join in other tables. For example, assume we have a table called Departments:
We could use JOIN to combine the Employees and Departments tables based ON the DepartmentId fields:
SELECT Employees.Name AS EmpName, Departments.DepartmentName AS DepName
FROM Employees JOIN Departments
ON Employees.DepartmentId = Departments.DepartmentId;
The results might look like:
We’ve ignored a lot of details about joins: e.g. there are actually (at least) 4 types of joins, but hopefully this gives you a good picture.
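To give a feel for why the join type matters, here is a small Python sketch (standard sqlite3 module, hypothetical rows) that compares two of the join types: a plain JOIN, which is an inner join, and a LEFT JOIN. The employee with no department is an invented case added to show the difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (Name TEXT, DepartmentId INTEGER)")
conn.execute("CREATE TABLE Departments (DepartmentId INTEGER, DepartmentName TEXT)")

# Hypothetical sample rows; Carol deliberately has no department.
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [("Alice", 1), ("Bob", 2), ("Carol", None)],
)
conn.executemany(
    "INSERT INTO Departments VALUES (?, ?)",
    [(1, "Engineering"), (2, "Sales")],
)

query = (
    "SELECT Employees.Name AS EmpName, Departments.DepartmentName AS DepName "
    "FROM Employees {join} Departments "
    "ON Employees.DepartmentId = Departments.DepartmentId "
    "ORDER BY Employees.Name"
)

# Inner join: only employees with a matching department row survive.
inner_rows = conn.execute(query.format(join="JOIN")).fetchall()

# Left join: every employee is kept; a missing department shows up as None.
left_rows = conn.execute(query.format(join="LEFT JOIN")).fetchall()

print(inner_rows)
print(left_rows)
```

The inner join silently drops the unmatched employee, while the left join keeps her with an empty department name, so the choice between them depends on whether unmatched records should appear in the result.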
Conclusion and Further Reading
With these basic commands, you can get a lot of basic data processing done. Don’t forget that you can nest queries and create really complicated joins. It’s a lot more powerful than Excel, and gives you much better control of your data. Of course, there’s a lot more to SQL than what we’ve mentioned, and this is only intended to whet your appetite and give you a taste of what you’re missing.
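As one small example of a nested query, the Python sketch below (standard sqlite3 module, hypothetical rows) uses an inner SELECT to compute the average salary and an outer SELECT to find the employees who earn more than that average. It is only an illustration of the idea, not a query from this post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (Name TEXT, Salary REAL)")

# Hypothetical sample rows, made up for this sketch.
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [("Alice", 70000.0), ("Bob", 60000.0), ("Carol", 80000.0)],
)

# The inner query produces a single number (the average salary);
# the outer query compares each row's Salary against it.
above_average = conn.execute(
    "SELECT Name FROM Employees "
    "WHERE Salary > (SELECT AVG(Salary) FROM Employees) "
    "ORDER BY Name"
).fetchall()

print(above_average)
```

In a spreadsheet this would typically take a helper cell for the average plus a filter, whereas here it is a single statement.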
For a tutorial about R’s Dataframes, check out this page.
And when you’re ready to step it up from “medium data” to “big data”, you should apply for a fellowship at The Data Incubator where we work with current-generation data-processing technologies like MapReduce and Spark!
While preparing for a recent talk I gave to an undergraduate audience, I started compiling some tips for future data scientists. The tips are intended for students (undergraduate and graduate) or anyone else planning to enter the field of data science.
Be flexible and adaptable – There is no single tool or technique that always works best.
Cleaning data is most of the work – Knowing where to find the right data, how to access the data, and how to properly format/standardize the data is a huge task. It usually takes more time than the actual analysis.
It is not all model building – Like the previous tip, you must have skills beyond just model building.
Know the fundamentals of structuring data – Gain an understanding of relational databases. Also learn how to collect and store good data. Not all data is useful.
Document what you do – This is important for others and your future self. Here is a subtip: learn version control.
Know the business – Every business has different goals. It is not enough to do analysis just because you love data and numbers. Know how your analysis can make more money, positively impact more customers, or save more lives. This is very important when getting others to support your work.
Practice explaining your work – Presentation is essential for data scientists. Even if you think you are an excellent presenter, it always helps to practice. You don’t have to be comfortable in front of an audience, but you must be capable in front of an audience. Take every opportunity you can get to be in front of a crowd. Plus, it helps to build your reputation as an expert.
Spreadsheets are useful – Although they lack some of the computational power of other tools, spreadsheets are still widely used and understood by the business world. Don’t be afraid to use a spreadsheet if it can get the job done.
Don’t assume the audience understands – Many (non-data science) audiences will not have a solid understanding of math. Most will have lost their basic college and high school mathematics skills. Explain concepts such as correlation and avoid equations. Audiences understand visuals, so use them to explain concepts.
Be ready to continually learn – I do not know a single data scientist who has stopped learning. The field is large and expanding daily.
Learn the basics – Once you have a firm understanding of the basics in mathematics, statistics, and computer programming, it will be much simpler to continue learning new data science techniques.
Be a polymath – It helps to be a person with a wide range of knowledge.
Thanks to Chad, Chad, Lee, Buck, and Justin for providing some of the tips.
I frequently ask young people, particularly undergraduates, what they plan to do with their future. I am often less than enthused with the responses which sound something like this:
I hope to get a job doing statistics.
I just want to work with computers.
I want to be a data scientist.
I just want a job.
The responses are typically vague and void of direction. Most responses involve waiting for someone else to provide the guidance. You do not have to wait. You can get started today.
If you are just interested in getting a job, the rest of this post is not for you. If you want to make an impact with your data science career, the remainder of this post is for you.
Below is an explanation of numerous specialties in data science. You don’t need to learn them all. Just pick one and follow the first step. You will learn more along the way. Don’t stress about which one to pick; there is no wrong answer. Just pick one and start building.
Data visualization is all about telling a story with data. Do you have a keen eye for color and design? Can you summarize complex data in a few simple charts? If you answer yes to those questions, then you just might be a good fit for data visualization.
First Step: Go to Data.gov and make an infographic
Data Science Educator
Are you the person always explaining your homework to others? This specialty might be for you. You can take a few different paths. One is the traditional university faculty approach. Another is more of a corporate training professional. The world needs both. Plus, if you are entrepreneurial, there are ample opportunities to consult as a data science educator. Businesses realize they need to know data science, and they are looking for training.
First Step: Start a video or blog with tutorials
A data engineer is typically more interested in systems than just the machine learning. Data engineers are typically strong with computer science fundamentals. They love to build things that they and others can use. A good data engineer can also spend a lot of time cleaning data.
Some people just love the statistical modeling and machine learning. They love to tune models and squeeze the last bit of predictive power from a data set. If you love talking about regression, trees, random forests, AUC, cross-validation and boosting, then this specialty is most likely for you.
If you are bossy, it does not mean you will make a good manager. The best managers know how to build strong teams and get out of the way. Managers provide help and overall direction for projects. Plus, a manager should have a solid understanding of how data can help shape a team’s decisions.
First Step: Organize a group to help a non-profit analyze data (Similar to what DataKind does)
Data Science Researcher
A researcher is interested in pushing the boundaries of data science. Are you interested in creating your own machine learning algorithms? Do you want to build the next great data framework? Do you think data science can achieve something no one else has thought to try? If so, being a researcher is for you.
First Step: Go to graduate school
Data Science Unicorn
A data science unicorn is someone that knows all the specialties above and more. A unicorn understands all the topics of data science. Being a unicorn is not attainable for everyone, but a few people have become unicorns. If you think you can be a unicorn, go for it.
First Step: Start at visualization above
Simple: Pick a specialty and Go Make a Difference!
This post is based upon a talk I gave at Winona State University just before MUDAC. The original title was Go After Your Data Science Dreams.
If you are looking to learn data science but do not have the time or money for a full master’s degree, Data Society might be your answer. Data Society is an online data analysis skills training program that is designed by educators and curated by data science experts. The learning experience is online and includes:
Printable step-by-step guides
Reusable Coding Templates
Opportunities to build a Portfolio
There is one other completely awesome feature of Data Society. For every membership purchased, they provide a free membership to help someone in need. Data Society is currently running a Kickstarter to build a community for learning data science. Your support would be greatly appreciated (I am not involved in the project but I am always happy to share innovative educational opportunities for data science).
Recently, I was able to get a brief interview with Merav Yuravlivker, one of the founders of Data Society.
There are many data science learning resources on the web. How is Data Society different?
We understand that most people do want to learn these skills, but don’t feel like they have the time, the money or the background. We eliminate all those barriers to entry by providing short lessons that are taught intuitively with real data sets. It’s the first platform that’s designed with working professionals in mind. Not only do we teach our students how to analyze data, but we also have a separate track for managers that teaches them how to implement data-driven strategies in their teams, how to hire a data scientist, and how to communicate effectively with their employees. Our courses are not just videos. Each course includes ready-made data analysis templates in R that decrease the time it takes to do the work, a step-by-step printable guide that can be used as a reference for every stage of the analysis, and live, dynamic forums where students can get all of their questions answered by the Data Society team as well as other students. In short, we provide everything someone needs to learn new skills in a much shorter amount of time.
What is the Kickstarter about?
Our Kickstarter campaign is about building that community around learning data science and helping others solve problems – we’ve already released the first three courses in our curriculum, and we’re excited to give our supporters an opportunity to see exactly how their contributions can make an impact. Our mission is to increase data literacy across the workforce – we know that data analysis skills are widely valued and sought-after, which is why we’re partnering with non-profits who help veterans and low-income individuals get back to work. For every membership bought off of Kickstarter, we will give one to someone who can use these skills to become more marketable and improve their life.
Is there anything else you would like to tell me about Data Society?
The most frequent compliment we get from students is that they didn’t feel intimidated to learn data analysis skills. As an educator, that is the biggest reward to me because we’re opening up possibilities for individuals who didn’t think that they had the ability to analyze data and pull insights from it. Our goal is not to turn everyone into a data scientist, but rather to give everyone the ability and confidence to get new data, look at the data they already have, ask “How can this data help me solve this problem?”, and then discover those insights that will help them make better decisions.