Tag Archives: data science

Recent Resources for Open Data

Recently, a number of resources for publicly available datasets have been announced.

  • Kaggle becomes the place for Open Data – I think this is big news! Kaggle just announced Kaggle Datasets which aims to be a repository for publicly available datasets. This is great for organizations that want to release data, but do not necessarily want the overhead of running an open data portal. Hopefully it will gain some traction and become an exceptional resource for open data.
  • NASA Opens Research – NASA just announced all research papers funded by NASA will be publicly available. It appears the research articles will all be available at PubMed Central, and the data available at NASA’s Data Portal.
  • Google Robotics Data – Google continues to do interesting things, and this topic is definitely that. It is a dataset about how robots grasp objects (Google Brain Robot Data). I am not overly familiar with this topic, so if you want to know more, see their blog post, Deep Learning for Robots.

For more open data options, see Data Sources for Cool Data Science Projects Part 1 and Part 2.

Are you aware of any other resources that have been recently announced? If so, please leave a comment.

Data Science Ethical Framework

The UK government has taken the first step in providing a solid grounding for the future of data science ethics. Recently, they published a “beta” version of the Data Science Ethical Framework.

The framework is based around 6 clear principles:

  1. Start with clear user need and public benefit
  2. Use data and tools which have the minimum intrusion necessary
  3. Create robust data science models
  4. Be alert to public perceptions
  5. Be as open and accountable as possible
  6. Keep data secure

See the above link for further details. The framework is somewhat specific to the UK, but it would be nice to see other countries and organizations adopt a similar framework. Even DJ Patil, U.S. Chief Data Scientist, has stated the importance of ethics in all data science curricula.

Machine Learning Yearning Book

Andrew Ng [Co-Founder of Coursera, Stanford Professor, Chief Scientist at Baidu, and All-Around Machine Learning Expert] is writing a book during the summer of 2016. The book is titled Machine Learning Yearning. If you visit the site and sign up quickly, you can get draft copies of the chapters as they become available.

Andrew is an excellent teacher. His MOOCs are wildly successful, and I expect his book to be excellent as well.

How to Kickstart Your Data Science Career

This is a guest post from Michael Li of The Data Incubator. The Data Incubator runs a free eight-week data science fellowship to help transition their Fellows from Academia to Industry. This post runs through some of the toolsets you’ll need to know to kickstart your Data Science Career.

 

If you’re an aspiring data scientist but still processing your data in Excel, you might want to upgrade your toolset. Why? First, while advanced features like Excel pivot tables can do a lot, they don’t offer nearly the flexibility, control, and power of tools like SQL or their functional equivalents in Python (pandas) or R (data frames). Second, Excel has low size limits, making it suitable for “small data” but not “big data.”

In this blog entry we’ll talk about SQL. This should cover your “medium data” needs, which we’ll define as the next level up: data with more rows than Excel’s limit of roughly 1 million rows per worksheet. SQL stores data in tables, which you can think of as a spreadsheet layout but with more structure. Each row represents a specific record (e.g. an employee at your company), and each column of a table corresponds to an attribute (e.g. name, department id, salary). Critically, all the values in a given column must be of the same “type”. Here is a sample of the table Employees:

EmployeeId  Name   StartYear  Salary  DepartmentId
1           Bob    2001       10.5    10
2           Sally  2004       20      10
3           Alice  2005       25      20
4           Fred   2004       12.5    20
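To experiment with this table without installing a database server, one option is SQLite through Python's built-in sqlite3 module. This is only a sketch: the choice of SQLite and the column types (INTEGER, TEXT, REAL) are assumptions, since the post does not name a database or declare types.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Declare a type for each column (the types are assumed, not from the post).
conn.execute("""
    CREATE TABLE Employees (
        EmployeeId   INTEGER PRIMARY KEY,
        Name         TEXT,
        StartYear    INTEGER,
        Salary       REAL,
        DepartmentId INTEGER
    )
""")

employees = [
    (1, "Bob", 2001, 10.5, 10),
    (2, "Sally", 2004, 20, 10),
    (3, "Alice", 2005, 25, 20),
    (4, "Fred", 2004, 12.5, 20),
]
conn.executemany("INSERT INTO Employees VALUES (?, ?, ?, ?, ?)", employees)

# List the distinct storage types found in the Salary column.
salary_types = conn.execute(
    "SELECT DISTINCT typeof(Salary) FROM Employees"
).fetchall()
print(salary_types)
```

Note that SQLite is more lenient about column types than most database systems, so stricter engines will reject mismatched values outright rather than converting them.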

SQL has many keywords which compose its query language, but the ones most relevant to data scientists are SELECT, WHERE, GROUP BY, and JOIN. We’ll go through each of these individually.

SELECT

SELECT is the foundational keyword in SQL: it retrieves data from a table. By naming columns, SELECT can also restrict the result to just those columns. For example

SELECT Name, StartYear FROM Employees

returns

Name   StartYear
Bob    2001
Sally  2004
Alice  2005
Fred   2004
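The SELECT query above can be tried out with Python's built-in sqlite3 module. This is a minimal sketch, assuming SQLite as the database and the column types shown in the CREATE TABLE statement. The ORDER BY clause is an addition: without it, SQL does not guarantee the order in which rows come back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (EmployeeId INTEGER, Name TEXT, "
    "StartYear INTEGER, Salary REAL, DepartmentId INTEGER)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?, ?)",
    [(1, "Bob", 2001, 10.5, 10), (2, "Sally", 2004, 20, 10),
     (3, "Alice", 2005, 25, 20), (4, "Fred", 2004, 12.5, 20)],
)

# Only the two named columns are returned, one tuple per row.
cursor = conn.execute(
    "SELECT Name, StartYear FROM Employees ORDER BY EmployeeId"
)
column_names = [col[0] for col in cursor.description]
rows = cursor.fetchall()

print(column_names)
for row in rows:
    print(row)
```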

 

WHERE

The WHERE clause filters the rows. For example

SELECT * FROM Employees WHERE StartYear=2004

returns

EmployeeId  Name   StartYear  Salary  DepartmentId
2           Sally  2004       20      10
4           Fred   2004       12.5    20
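When the filter value comes from a program rather than being typed into the query, it is usually safer to pass it as a parameter than to paste it into the SQL string. The sketch below assumes SQLite via Python's sqlite3 module, where `?` is the placeholder; other database drivers may use a different placeholder style.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (EmployeeId INTEGER, Name TEXT, "
    "StartYear INTEGER, Salary REAL, DepartmentId INTEGER)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?, ?)",
    [(1, "Bob", 2001, 10.5, 10), (2, "Sally", 2004, 20, 10),
     (3, "Alice", 2005, 25, 20), (4, "Fred", 2004, 12.5, 20)],
)

start_year = 2004  # the value to filter on

# WHERE keeps only the rows whose StartYear matches the parameter.
rows = conn.execute(
    "SELECT * FROM Employees WHERE StartYear = ? ORDER BY EmployeeId",
    (start_year,),
).fetchall()

for row in rows:
    print(row)
```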

 

GROUP BY

Next, the GROUP BY clause combines rows that share a value into groups, which can then be summarized with aggregate functions like COUNT (count) and AVG (average). For example,

SELECT StartYear, COUNT(*) as Num, AVG(Salary) as AvgSalary
FROM Employees
GROUP BY StartYear

returns

StartYear  Num  AvgSalary
2001       1    10.5
2004       2    16.25
2005       1    25
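The grouping can be checked by running the same query against a small in-memory table. This sketch assumes SQLite via Python's sqlite3 module. The ORDER BY clause is an addition to make the row order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (EmployeeId INTEGER, Name TEXT, "
    "StartYear INTEGER, Salary REAL, DepartmentId INTEGER)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?, ?)",
    [(1, "Bob", 2001, 10.5, 10), (2, "Sally", 2004, 20, 10),
     (3, "Alice", 2005, 25, 20), (4, "Fred", 2004, 12.5, 20)],
)

# One output row per distinct StartYear, with a count and an average.
summary = conn.execute("""
    SELECT StartYear, COUNT(*) AS Num, AVG(Salary) AS AvgSalary
    FROM Employees
    GROUP BY StartYear
    ORDER BY StartYear
""").fetchall()

for start_year, num, avg_salary in summary:
    print(start_year, num, avg_salary)
```

Sally and Fred both started in 2004, so their rows collapse into a single group whose average salary is (20 + 12.5) / 2 = 16.25.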

 

JOIN

Finally, the JOIN clause allows us to join in other tables. For example, assume we have a table called Departments:

DepartmentId  DepartmentName
10            Sales
20            Engineering

We could use JOIN to combine the Employees and Departments tables based ON the DepartmentId fields:

SELECT Employees.Name AS EmpName, Departments.DepartmentName AS DepName
FROM Employees JOIN Departments
ON Employees.DepartmentId = Departments.DepartmentId;

The results might look like:

EmpName  DepName
Bob      Sales
Sally    Sales
Alice    Engineering
Fred     Engineering

We’ve ignored a lot of details about joins (e.g. there are actually at least 4 types of joins), but hopefully this gives you a good picture.
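To give a feel for how join types differ, the sketch below contrasts a plain (inner) JOIN with a LEFT JOIN. It assumes SQLite via Python's sqlite3 module. The fifth employee, Dana, and her DepartmentId of 30 are hypothetical additions, made up so that one row has no matching department.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (EmployeeId INTEGER, Name TEXT, "
    "StartYear INTEGER, Salary REAL, DepartmentId INTEGER)"
)
conn.execute(
    "CREATE TABLE Departments (DepartmentId INTEGER, DepartmentName TEXT)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?, ?)",
    [(1, "Bob", 2001, 10.5, 10), (2, "Sally", 2004, 20, 10),
     (3, "Alice", 2005, 25, 20), (4, "Fred", 2004, 12.5, 20),
     # Hypothetical row: department 30 does not exist in Departments.
     (5, "Dana", 2006, 15, 30)],
)
conn.executemany(
    "INSERT INTO Departments VALUES (?, ?)",
    [(10, "Sales"), (20, "Engineering")],
)

query = """
    SELECT Employees.Name AS EmpName, Departments.DepartmentName AS DepName
    FROM Employees {join_type} Departments
    ON Employees.DepartmentId = Departments.DepartmentId
    ORDER BY Employees.EmployeeId
"""

# Inner join: only employees with a matching department are kept.
inner_rows = conn.execute(query.format(join_type="JOIN")).fetchall()

# Left join: every employee is kept; a missing department shows as None.
left_rows = conn.execute(query.format(join_type="LEFT JOIN")).fetchall()

print(inner_rows)
print(left_rows)
```

The inner join drops Dana because no department matches, while the left join keeps her with an empty department name.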

Conclusion and Further Reading

With these basic commands, you can get a lot of basic data processing done. Don’t forget that you can nest queries and create really complicated joins. It’s a lot more powerful than Excel, and gives you much better control of your data. Of course, there’s a lot more to SQL than what we’ve mentioned and this is only intended to whet your appetite and give you a taste of what you’re missing.
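As one illustration of nesting, a query can use the result of another query inside its WHERE clause. The sketch below, again assuming SQLite via Python's sqlite3 module, finds employees who earn more than the average salary. The question itself is an example chosen here, not one from the post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees (EmployeeId INTEGER, Name TEXT, "
    "StartYear INTEGER, Salary REAL, DepartmentId INTEGER)"
)
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?, ?, ?, ?)",
    [(1, "Bob", 2001, 10.5, 10), (2, "Sally", 2004, 20, 10),
     (3, "Alice", 2005, 25, 20), (4, "Fred", 2004, 12.5, 20)],
)

# The inner SELECT computes the average salary once;
# the outer SELECT keeps the rows above that value.
above_average = conn.execute("""
    SELECT Name, Salary
    FROM Employees
    WHERE Salary > (SELECT AVG(Salary) FROM Employees)
    ORDER BY EmployeeId
""").fetchall()

print(above_average)
```

With the sample data the average salary is 17, so only Sally and Alice are returned.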

 

And when you’re ready to step it up from “medium data” to “big data”, you should apply for a fellowship at The Data Incubator where we work with current-generation data-processing technologies like MapReduce and Spark!

Berkeley Undergrad Data Science Course and Textbook

The University of California at Berkeley has started The Berkeley Data Science Education Program. The goal is to build a data science education program over the next several years by engaging faculty and students from across the campus. The introductory data science course targets freshmen and is taught in a very applied and interactive environment. The course videos, slides, labs, and notes are freely available for others to use. The course heavily uses Jupyter. Also, the course textbook is online at Computational and Inferential Thinking: The Foundations of Data Science.

Do’s and Don’ts of Data Science

Don’t Start with the Data
Do Start with a Good Question

Don’t think one person can do it all
Do build a well-rounded team

Don’t only use one tool
Do use the best tool for the job

Don’t brag about the size of your data
Do collect relevant data

Don’t ignore domain knowledge
Do consult a subject matter expert

Don’t publish a table of numbers
Do create informative charts

Don’t use just your own data
Do enhance your analysis with open data

Don’t do all the work yourself
Do partner with local universities

Don’t always build your own tools
Do use lots of open source tools

Don’t keep all your findings to yourself
Do share your analysis and results with the world!


Got any to add? Please leave a comment.

Tips for Future Data Scientists

While preparing for a recent talk I gave to an undergraduate audience, I started compiling some tips for future data scientists. The tips are intended for students (undergraduate and graduate) or anyone else planning to enter the field of data science.

I asked a few of my data science friends and posted a question on Quora: “As a data scientist, what tips would you have for a younger version of yourself?”

What follows is a summary of the many tips.

Tips for Data Science

  • Be flexible and adaptable – There is no single tool or technique that always works best.
  • Cleaning data is most of the work – Knowing where to find the right data, how to access the data, and how to properly format/standardize the data is a huge task. It usually takes more time than the actual analysis.
  • It is not all building models – Like the previous tip, you must have skills beyond just model building.
  • Know the fundamentals of structuring data – Gain an understanding of relational databases. Also learn how to collect and store good data. Not all data is useful.
  • Document what you do – This is important for others and your future self. Here is a subtip: learn version control.
  • Know the business – Every business has different goals. It is not enough to do analysis just because you love data and numbers. Know how your analysis can make more money, positively impact more customers, or save more lives. This is very important when getting others to support your work.
  • Practice explaining your work – Presentation is essential for data scientists. Even if you think you are an excellent presenter, it always helps to practice. You don’t have to be comfortable in front of an audience, but you must be capable in front of an audience. Take every opportunity you can get to be in front of a crowd. Plus, it helps to build your reputation as an expert.
  • Spreadsheets are useful – Although they lack some of the computational power of other tools, spreadsheets are still widely used and understood by the business world. Don’t be afraid to use a spreadsheet if it can get the job done.
  • Don’t assume the audience understands – Many (non-data science) audiences will not have a solid understanding of math. Most will have lost their basic college and high school mathematics skills. Explain concepts such as correlation and avoid equations. Audiences understand visuals, so use them to explain concepts.
  • Be ready to continually learn – I do not know a single data scientist who has stopped learning. The field is large and expanding daily.
  • Learn the basics – Once you have a firm understanding of the basics in mathematics, statistics, and computer programming, it will be much simpler to continue learning new data science techniques.
  • Be a polymath – It helps to be a person with a wide range of knowledge.

Thanks to Chad, Chad, Lee, Buck, and Justin for providing some of the tips.

Getting Started with Data Science Specialties

I frequently ask young people, particularly undergraduates, what they plan to do with their future. I am often less than enthused with the responses which sound something like this:

  • I hope to get a job doing statistics.
  • I just want to work with computers.
  • I want to be a data scientist.
  • I just want a job.

The responses are typically vague and void of direction. Most responses involve waiting for someone else to provide the guidance. You do not have to wait. You can get started today.

If you are just interested in getting a job, the rest of this post is not for you. If you want to make an impact with your data science career, the remainder of this post is for you.

Below is an explanation of numerous specialties in data science. You don’t need to learn them all. Just pick one and follow the first step. You will learn more along the way. Don’t stress about which one to pick; there is no wrong answer. Just pick one and start building.

Data Visualization

Data visualization is all about telling a story with data. Do you have a keen eye for color and design? Can you summarize complex data in a few simple charts? If you answer yes to those questions, then you just might be a good fit for data visualization.

First Step: Go to Data.gov and make an infographic

Data Science Educator

Are you the person always explaining your homework to others? This specialty might be for you. You can take a few different paths. One is the traditional university faculty approach. Another is more of a corporate training professional. The world needs both. Plus, if you are entrepreneurial, there are ample opportunities to consult as a data science educator. Businesses realize they need to know data science, and they are looking for training.

First Step: Start a video or blog with tutorials

Data Engineer

A data engineer is typically more interested in systems than just the machine learning. Data engineers are typically strong in computer science fundamentals. They love to build things that they and others can use. A good data engineer can also spend a lot of time cleaning data.

First Step: Build a solution (hint: Cortana Intelligence Solutions)

Data Programmer

Do you love to program? If so, you just might fall into this category. Data science has many needs for programmers. Everything from cleaning data to building data products needs programming.

First Step: Be on GitHub

Statistical Modeling (Machine Learning)

Some people just love statistical modeling and machine learning. They love to tune models and squeeze the last bit of predictive power from a data set. If you love talking about regression, trees, random forests, AUC, cross-validation, and boosting, then this specialty is most likely for you.

First Step: Enter Kaggle competitions.

Data Science Manager

If you are bossy, it does not mean you will make a good manager. The best managers know how to build strong teams and get out of the way. Managers provide help and overall direction for projects. Plus, a manager should have a solid understanding of how data can help shape a team’s decisions.

First Step: Organize a group to help a non-profit analyze data (Similar to what DataKind does)

Data Science Researcher

A researcher is interested in pushing the boundaries of data science. Are you interested in creating your own machine learning algorithms? Do you want to build the next great data framework? Do you think data science can achieve something no one else has thought to try? If so, being a researcher is for you.

First Step: Go to graduate school

Data Science Unicorn

A data science unicorn is someone who knows all the specialties above and more. A unicorn understands all the topics of data science. Being a unicorn is not attainable for everyone, but a few people have become unicorns. If you think you can be a unicorn, go for it.

First Step: Start at visualization above

In Conclusion

Simple: Pick a specialty and Go Make a Difference!


This post is based upon a talk I gave at Winona State University just before MUDAC. The original title was Go After Your Data Science Dreams.

Midwest Undergraduate Data Analytics Competition

The 2016 Midwest Undergraduate Data Analytics Competition (MUDAC) will be held at Winona State University in Winona, Minnesota on April 2 and 3.

  • What is MUDAC?
    MUDAC is an intense 2-day analytics competition aimed at undergraduate students. Teams compete to solve a problem posed by an external organization.
  • Who can compete?
    Teams of 3 to 4 undergraduate students attending a school in Minnesota, Wisconsin, Iowa, Illinois, North Dakota, or South Dakota
  • Why attend MUDAC?
    • A fun learning experience
    • Friendly competition
    • Teamwork
    • Meet others with similar interests
    • Learn about data science/analytic careers
    • Practice preparing and giving a presentation
    • Cash prizes for winning
    • Door prizes

The competition also includes a panel discussion with some local data professionals. I am honored to be one of those panelists.

If you attend or teach at a university in the upper Midwest and you are interested in data science, you should strongly consider bringing a team to MUDAC. I hope to see you there.

Data Society – A new Data Science Learning Platform

If you are looking to learn data science but do not have the time or money for a full master’s degree, Data Society might be your answer. Data Society is an online data analysis skills training program that is designed by educators and curated by data science experts. The learning experience is online and includes:

  • Training Videos
  • Printable step-by-step guides
  • Forums
  • Reusable Coding Templates
  • Data Sets
  • Opportunities to build a Portfolio

There is one other completely awesome feature of Data Society. For every membership purchased, they provide a free membership to help someone in need. Data Society is currently running a Kickstarter to build a community for learning data science. Your support would be greatly appreciated (I am not involved in the project but I am always happy to share innovative educational opportunities for data science).

Recently, I was able to get a brief interview with Merav Yuravlivker, one of the founders of Data Society.

There are many data science learning resources on the web, how is Data Society different?

We understand that most people do want to learn these skills, but don’t feel like they have the time, the money or the background. We eliminate all those barriers to entry by providing short lessons that are taught intuitively with real data sets. It’s the first platform that’s designed with working professionals in mind. Not only do we teach our students how to analyze data, but we also have a separate track for managers that teaches them how to implement data-driven strategies in their teams, how to hire a data scientist, and how to communicate effectively with their employees. Our courses are not just videos; each course includes ready-made data analysis templates in R that decrease the time it takes to do the work, a step-by-step printable guide that can be used as a reference for every stage of the analysis, and live, dynamic forums where students can get all of their questions answered by the Data Society team as well as other students. In short, we provide everything someone needs to learn new skills in a much shorter amount of time.

What is the Kickstarter about?

Our Kickstarter campaign is about building that community around learning data science and helping others solve problems – we’ve already released the first three courses in our curriculum, and we’re excited to give our supporters an opportunity to see exactly how their contributions can make an impact. Our mission is to increase data literacy across the workforce – we know that data analysis skills are widely valued and sought-after, which is why we’re partnering with non-profits who help veterans and low-income individuals get back to work. For every membership bought off of Kickstarter, we will give one to someone who can use these skills to become more marketable and improve their life.

Is there anything else you would like to tell me about Data Society?

The most frequent compliment we get from students is that they didn’t feel intimidated to learn data analysis skills. As an educator, that is the biggest reward to me because we’re opening up possibilities for individuals who didn’t think that they had the ability to analyze data and pull insights from it. Our goal is not to turn everyone into a data scientist, but rather to give everyone the ability and confidence to get new data, look at the data they already have, ask “How can this data help me solve this problem?”, and then discover those insights that will help them make better decisions.

Why Data Science?