Tag Archives: data scientist

What is on the Microsoft Data Science Certification Exam?

I took and passed the exam during the beta period. These are my memories of the topics on the exam. You can get this information as the Microsoft Azure Data Scientist Checklist.

General Overview

Below is the basic structure of the DP-100: Designing and Implementing a Data Science Solution on Azure. Passing the exam will qualify you for the Azure Data Scientist Associate certification. It can be taken in a traditional exam center or at your home (but you will be watched via video camera).

  • 180 minutes
  • 60 questions (45 Q&A, 15 about Case Studies)

The exam can be broken down into 4 components: Machine Learning, Azure ML Studio, Azure Products, and Python. Below is a breakdown of the topics I remember from the exam.

Machine Learning

These are topics which would be covered in a traditional machine learning course. Here are some of the specific topics I remember.

  • Evaluation of Linear Regression
  • Evaluation of Classification
  • Positive/negative skew
  • Poisson regression
  • Fisher’s exact test
  • Pearson
  • PCA
  • Deep learning – high-level, what is is for
  • Neural Networks (RNN vs CNN vs DCN vs GAN)

Azure ML Studio

Azure ML Studio is a major focus of the exam, so you need to be fluent in how to use it. Questions ranged from the basics of how to import data all the way to specifics about certain modules.

  • Model Building
  • Model Evalutation
  • SMOTE
  • MICE
  • feature extraction
  • missing data questions
  • AutoML

Azure Products

There were a number of questions from this category. The question would present you a scenario problem and ask which products would be useful for solving the problem. The questions did not go very deep into any of the products, but you will need to know the purpose of these products.

  • Azure Machine Learning Service
  • Blob storage – specifically how to get data in/out
  • Azure Notebooks
  • Azure Cognitive Services (high level)
  • Kubernetes
  • HDInsight
  • Data Science Virtual Machine

Python

Python was the language of choice for the exam, so focus on it.

  • Scikit-learn
  • Azure Machine Learning SDK for Python
  • Hyperparameters

Not on the exam

The following topics were not covered on my exam. The exam questions are pulled from a pool of questions, so it is possible these topics may be cover on a different person’s exam. In any case, these are definitely not major portions of the exam.

  • R
  • Power BI
  • Publishing Azure ML models

Again, if you want this information in an easy to follow checklist, just visit, Microsoft Azure Data Scientist Checklist.

Real Talk with A Data Scientist: The Future of Data Wrangling

Sponsored Post by T.J. DeGroat of Springboard

At Springboard, we recently sat down with Michael Beaumier, a data scientist at Google, to discuss his transition into the field, what the interview process is like, the future of data wrangling, and the advice he has for aspiring data professionals.

The full video Q&A is below, but here are some of the highlights.

You started off with a Ph.D. in physics and now you’re a data scientist. How did that transition happen?

I think in physics one of the things that attracted me most to the field that I studied, which was particle physics, was the ability to leverage computer science mathematical modeling and data visualization to solve big questions. The longer I was in that field, the more I realized that that process of problem-solving with those techniques and tools, I actually enjoyed that more than the physics itself. So it was a natural transition for me to move to a career that felt like it had higher impact, a wide variety of work, and good career opportunity—and data science was that was that move for me.

Before Google, you worked at Mercedes. What were some of the projects that you worked on there?

The big problem that we were solving at Mercedes is basically: how do you build machine learning into an environment where there’s limited-to-no data connectivity? One of those pieces was: how do you do that in a car interface with a very low-power computer like a Raspberry Pi and basically create an interface that can customize itself to the users’ needs based on context? This was released as MBUX in 2017 and that was a pretty cool project because there were a lot of challenges that you wouldn’t normally have to solve if you had access to a massive data set or connectivity.

We pulled questions from some of our Springboard students and this one is from Miguel, who is in our data science vertical. He asked, “Do you have a process for thinking about the data before you start implementing the tools?”

That’s a great question and I think I would add another piece to that, which is that it kind of depends on where you’re at. The Mercedes answer would be that you have to build the tools, and usually the way that you approach that is you think about the data you have and what kind of features you have. And specifically, where is the information hiding that sort of links your input to output? And what kind of transformation do you need to apply to the data set to enable modeling? That’s what I would say.

The key is to think about—do all your work up front, I would say, because models are kind of a dime a dozen, with scikit-learn you can just like, it’s the same API so you can just model.fit a thousand times and then combine the outputs and then do a model on top of that. But all the work now, I think, is: how do you think about engineering features and building the model into a pipeline?

Another one of our data science students wants to know a little bit more about your interview process at Google. What was that like and how did you search for your job?

I actually interviewed at Google twice. The first time I failed utterly and the second time I got through. There are many paths into interviewing at Google and I think a pretty common one, if you don’t know anyone at Google, is that a recruiter reaches out to you—once you have a job and that’s all posted on LinkedIn and your LinkedIn profile looks pretty good, you’re gonna get contacted by Facebook, Amazon, Google, Netflix. Like, they’re gonna talk to you. So when you take the recruiter intake process, usually what they will do is they’ll kind of hold your hand through the whole process. The first piece is a technical screen. If the recruiter feels like that’s unnecessary, they may move you directly to an onsite.

I’ve interviewed with a lot of companies and one of the things that I always do is I just ask the recruiter: what are the metrics that you will be evaluating me on in these interviews and what topics will be covered? I think some people feel like that’s off-limits, but, you know, this isn’t school, you’re allowed to ask what’s on the test before you get the test, so I encourage people to do that. it’s not cheating.

I think a lot of people, when they interview, they get really fixated on “I got to get to the right answer” and maybe freeze because they get really nervous. And these are totally natural reactions. I would say with the interview process, if you think of it more of a performance, you’ll have a much better time, you’ll probably enjoy it more, and I think the outcome will be better.

We also have a question from Guy, who is one of our community managers. He wanted to know, “How do you develop a knack for data wrangling? Do you have a process for that?”

Honestly, I think that data wrangling is going to be one of the things that is replaced by modeling. There’s a whole field called data learning, where basically a model tries to learn—you have a data set and it may be super messy and you just throw that into a data learning model and the model figures out what’s important. OK, but that’s not the question. The question is: how do you develop an intuition for data?

So, it’s tricky because sometimes it, I would say it depends on the size of the data set. When the data set becomes very large in terms of the number of observations or features… you know, I work on projects that regularly have tens of thousands of features and so that just can’t fit into your head as a human being, so I think the only way in that space that you can develop an intuition for data is actually from a modeling perspective. So, you try models, you understand what over and underfitting is, and how to tune your model…

In the case where you have a smaller number of features, I think the best way to build an intuition for that is to just try it. There’s this data set that I send out to people that’s from the UCI Machine Learning Repository. It’s the Auto MPG Data Set and I love it because it has a mix of categorical features. It has numeric features that are really actually categorical features and then the goal is: can you model the miles per gallon of these cars? It’s so rich and it’s so small and there’s only like 300 observations and like eight features, but you can do a lot with it. And so I think the crucial piece to developing intuition is to practice. And if you’re not practicing, you can’t develop that intuition.

Just to kind of like put the nail in that, you need to be comfortable with handling categorical features, you need to understand how to turn numerical features into categorical features, you need to understand how to treat numerical features, you need to understand what normalization is and what models are robust to feature normalization and what models it doesn’t matter for. And there’s no easy way, I would just say: try it. And when people ask me, like, help me prepare for data science interviews, I just tell them to do data challenges over and over again and then come talk to me and tell me how you feel about the static data challenges.

We have one last question from a student, Cory. He asks, “How important is SQL in comparison to Python in 2019?”

That’s an interesting question because I think people love putting SQL on this mega pedestal as like a core thing that all data scientists have to use. My opinion on it is that if your data set fits into memory, it’s like it literally doesn’t matter what you do because computers are fast and human brains are slow and you’re not going to notice the difference between doing something in a pandas dataframe versus an R data frame versus SQL.

However, you have to know SQL. Like, the universe has decided that in data science interviews, you must know SQL. So, the question is: what’s important in 2019? I would say if your data set doesn’t fit into memory, then you have to use SQL and if it does, it doesn’t matter—and there’s like a million caveats to that statement.

This Q&A has been lightly edited and condensed for clarity.

Ready to start (or grow) your data science career? Check out Springboard’s Data Science Career Track—you’ll learn the skills and get the personalized guidance you need to land the job you want.

Help set the standards for a Data Scientist

The field of data science is moving fast. People are claiming to be data scientists; yet the knowledge, experience, and backgrounds of those people can be very different. Different is not bad. However, there a little standards around what exactly a data scientist is.

Sticking with this week’s theme of “What is a Data Scientist”, an organization titled, Initiative for Analytics and Data Science Standards (IADSS) has kicked-off a research study at global scale. The study aims to gain insight about the analytics profession in the industry and help support the development of standards regarding analytics role definitions, required skills and career advancement paths. This will help set some industry standards which in turn could support the healthy growth of the analytics market.  

If you want to be a part of this initiative and help collectively define industry standards, I encourage you to take part in the research. The survey takes approximately 5 minutes and answers for the survey will be kept anonymous. More details are provided at introduction pages of the survey at Data Science Industry Standards Research Survey

Currently, over 12 million users on LinkedIn claim to have data science and analytics capabilities. The field could use some standards around different roles and necessary skills.

Typical Data Scientist 2019

The profile of a data scientist is changing slightly as the profession becomes more solidified. Data Science 365 conducts a study to determine some of the characteristics of a “typical data scientist.” The below infographic covers a wealth of information from programming languages used to educational backgrounds to locations. It is definitely worth looking at to understand the attributes of a data scientist in 2019.

typical data scientist 2019 infographic

Data Science in 30 Minutes with Jake Porway of DataKind

This month, The Data Incubator is hosting another free webinar on data science. This time it features an interview with Jake Porway, founder of DataKind. Jake and DataKind are doing amazing things, so I hope you can check out the webinar.

Below are the webinar details:

Data Science in 30 Minutes: Using Data Science in the Service of Humanity

Abstract: With his non-profit, DataKind, Jake Porway connects data scientists with social causes. Picture Doctors Without Borders for data nerds—with machine learning and AI engineers parachuting in to help the UN with humanitarian tasks, like tracking virus outbreaks using mobile data. “Data is like a bucket of crude oil,” Porway says. “Potentially great, but only if someone knows how to refine it (data scientists) and someone else has vehicles that will run on it (the social sector).” In this talk, Porway discusses the strategies of DataKind, its projects and the future of big data to service humanity.

The free online webinar is:

Thu, November 15, 2018 @ 5:30 PM – 6:30 PM EST

You can register at Data Science in 30 minutes.

Learn Data Science Youtube Channel

Transitioning to a career in data science can be full of unanswered questions. I am here to help you get answers to those questions.

Today, I am launching a new Youtube Channel, Learn Data Science. I will select a question and make a video providing an answer. I will provide some of the answers, and I may have some guests answer the questions as well.

If you have any questions about becoming a data scientist, please leave a comment.

A Data Science Career with Kirk Borne, Free Webinar

Once again, The Data Incubator, is hosting another Data Science in 30 minutes webinar. This one features the career of Kirk Borne.

Renowned data scientist, Kirk Borne will take viewers on a journey through his career in science and technology explaining how the industry-and himself have evolved over the last 4 decades. Starting with skipping lunches in high school to a systematic twitter obsession, Kirk will shed light on his road to success in the data science industry.

Kirk is universally considered one of the most (if not the most) influential voices in data science. If you are interested in a career in data science, this is a webinar you will not want to miss.

The webinar is 5:30 Eastern Time on August 29, 2017, and registrations are currently being accepted. It is free.