Real Talk with A Data Scientist: The Future of Data Wrangling

Sponsored Post by T.J. DeGroat of Springboard

At Springboard, we recently sat down with Michael Beaumier, a data scientist at Google, to discuss his transition into the field, what the interview process is like, the future of data wrangling, and the advice he has for aspiring data professionals.

The full video Q&A is below, but here are some of the highlights.

You started off with a Ph.D. in physics and now you’re a data scientist. How did that transition happen?

I think in physics one of the things that attracted me most to the field that I studied, which was particle physics, was the ability to leverage computer science mathematical modeling and data visualization to solve big questions. The longer I was in that field, the more I realized that that process of problem-solving with those techniques and tools, I actually enjoyed that more than the physics itself. So it was a natural transition for me to move to a career that felt like it had higher impact, a wide variety of work, and good career opportunity—and data science was that was that move for me.

Before Google, you worked at Mercedes. What were some of the projects that you worked on there?

The big problem that we were solving at Mercedes is basically: how do you build machine learning into an environment where there’s limited-to-no data connectivity? One of those pieces was: how do you do that in a car interface with a very low-power computer like a Raspberry Pi and basically create an interface that can customize itself to the users’ needs based on context? This was released as MBUX in 2017 and that was a pretty cool project because there were a lot of challenges that you wouldn’t normally have to solve if you had access to a massive data set or connectivity.

We pulled questions from some of our Springboard students and this one is from Miguel, who is in our data science vertical. He asked, “Do you have a process for thinking about the data before you start implementing the tools?”

That’s a great question and I think I would add another piece to that, which is that it kind of depends on where you’re at. The Mercedes answer would be that you have to build the tools, and usually the way that you approach that is you think about the data you have and what kind of features you have. And specifically, where is the information hiding that sort of links your input to output? And what kind of transformation do you need to apply to the data set to enable modeling? That’s what I would say.

The key is to think about—do all your work up front, I would say, because models are kind of a dime a dozen, with scikit-learn you can just like, it’s the same API so you can just model.fit a thousand times and then combine the outputs and then do a model on top of that. But all the work now, I think, is: how do you think about engineering features and building the model into a pipeline?

Another one of our data science students wants to know a little bit more about your interview process at Google. What was that like and how did you search for your job?

I actually interviewed at Google twice. The first time I failed utterly and the second time I got through. There are many paths into interviewing at Google and I think a pretty common one, if you don’t know anyone at Google, is that a recruiter reaches out to you—once you have a job and that’s all posted on LinkedIn and your LinkedIn profile looks pretty good, you’re gonna get contacted by Facebook, Amazon, Google, Netflix. Like, they’re gonna talk to you. So when you take the recruiter intake process, usually what they will do is they’ll kind of hold your hand through the whole process. The first piece is a technical screen. If the recruiter feels like that’s unnecessary, they may move you directly to an onsite.

I’ve interviewed with a lot of companies and one of the things that I always do is I just ask the recruiter: what are the metrics that you will be evaluating me on in these interviews and what topics will be covered? I think some people feel like that’s off-limits, but, you know, this isn’t school, you’re allowed to ask what’s on the test before you get the test, so I encourage people to do that. it’s not cheating.

I think a lot of people, when they interview, they get really fixated on “I got to get to the right answer” and maybe freeze because they get really nervous. And these are totally natural reactions. I would say with the interview process, if you think of it more of a performance, you’ll have a much better time, you’ll probably enjoy it more, and I think the outcome will be better.

We also have a question from Guy, who is one of our community managers. He wanted to know, “How do you develop a knack for data wrangling? Do you have a process for that?”

Honestly, I think that data wrangling is going to be one of the things that is replaced by modeling. There’s a whole field called data learning, where basically a model tries to learn—you have a data set and it may be super messy and you just throw that into a data learning model and the model figures out what’s important. OK, but that’s not the question. The question is: how do you develop an intuition for data?

So, it’s tricky because sometimes it, I would say it depends on the size of the data set. When the data set becomes very large in terms of the number of observations or features… you know, I work on projects that regularly have tens of thousands of features and so that just can’t fit into your head as a human being, so I think the only way in that space that you can develop an intuition for data is actually from a modeling perspective. So, you try models, you understand what over and underfitting is, and how to tune your model…

In the case where you have a smaller number of features, I think the best way to build an intuition for that is to just try it. There’s this data set that I send out to people that’s from the UCI Machine Learning Repository. It’s the Auto MPG Data Set and I love it because it has a mix of categorical features. It has numeric features that are really actually categorical features and then the goal is: can you model the miles per gallon of these cars? It’s so rich and it’s so small and there’s only like 300 observations and like eight features, but you can do a lot with it. And so I think the crucial piece to developing intuition is to practice. And if you’re not practicing, you can’t develop that intuition.

Just to kind of like put the nail in that, you need to be comfortable with handling categorical features, you need to understand how to turn numerical features into categorical features, you need to understand how to treat numerical features, you need to understand what normalization is and what models are robust to feature normalization and what models it doesn’t matter for. And there’s no easy way, I would just say: try it. And when people ask me, like, help me prepare for data science interviews, I just tell them to do data challenges over and over again and then come talk to me and tell me how you feel about the static data challenges.

We have one last question from a student, Cory. He asks, “How important is SQL in comparison to Python in 2019?”

That’s an interesting question because I think people love putting SQL on this mega pedestal as like a core thing that all data scientists have to use. My opinion on it is that if your data set fits into memory, it’s like it literally doesn’t matter what you do because computers are fast and human brains are slow and you’re not going to notice the difference between doing something in a pandas dataframe versus an R data frame versus SQL.

However, you have to know SQL. Like, the universe has decided that in data science interviews, you must know SQL. So, the question is: what’s important in 2019? I would say if your data set doesn’t fit into memory, then you have to use SQL and if it does, it doesn’t matter—and there’s like a million caveats to that statement.

This Q&A has been lightly edited and condensed for clarity.

Ready to start (or grow) your data science career? Check out Springboard’s Data Science Career Track—you’ll learn the skills and get the personalized guidance you need to land the job you want.

Data Science News for April 29, 2019

Here is the latest data science news for the week of April 29, 2019.

From Data Science 101

General Data Science

What do you think? Did I miss any big news in the data science world?

GoLang for Data Science

While it is not one of the popular programming languages for data science, The Go Programming Language (aka Golang) has surfaced for me a few times in the past few years as an option for data science. I decided to do some searching and find some conclusions about whether golang is a good choice for data science.

Popularity of Go and Data Science

As the following figure from Google Trends demonstrates, golang and data science became trendy topics at about the same time and grew at a similar rate.

The timely trends may have created the desire to merge the two technologies together.

Golang Projects for Data Science

Some internet searching will reveal a number of interesting Golang/Data Science projects on Github. Unfortunately, many of the projects had good initial traction but have dwindled in activity over the last couple years. Below is a listing of some of the data science related projects for Golang.

  • Gopher Data – Gophers doing data analysis, no schedule events, last blog post was 2017
  • Gopher Notes – Golang in Jupyter Notebooks
  • Lgo – Interactive programming with Jupyter for Golang
  • Gota – Data frames for Go, “The API is still in flux so use at your own risk.”
  • qframe – Immutable data frames for Go, better speed than Gota but not as well documented
  • GoLearn – Machine Learning for Go
  • Gorgonia – Library for machine learning in Go
  • Go Sklearn – Port of sci-kit learn from Python, still active but only a couple committers, early but promising
  • Gonum – Numerical library for Go, very promising and active

Golang Data Science Books

There have even been a couple books written about the topic.

Thoughts from the Community

The “Go for Data Science” debate has been discussed numerous times over the past few years. Below is a listing of some of those discussions and the key take aways.

Reasons to use Golang for Data Science

  • Performance
  • Concurrency
  • Strong Developer Ecosystem
  • Basic Data Science packages are available

Reasons Not to use Golang for Data Science

  • Limited support from the data science community for Golang
  • Significantly increased time for exploratory analysis
  • Less flexibility to try other optimization and ML techniques
  • The data science community has not really adopted golang

Summary

In short, Golang is not widely used for exploratory data science, but rewriting your algorithms in Golang might be a good idea.

Finding Azure Updates

Microsoft Azure has an abundance of data science capabilities (and non-data science capabilities). It can be challenging to keep up with the latest updates/releases. Luckily, Azure has a page to let you know exactly what has changed. You just need to know where to find it, and the following video will help you find that page.

Also, if you are still interested in earning a Microsoft Data Science Certification, join the Study Group.

Top Companies to work for if you are a data scientist

LinkedIn’s 2017 report had put Data Scientist as the second fastest growing profession and it’s number one on 2019’s list of most promising jobs. There are three main reasons why data science has been rated as a top job according to research. Firstly, the number of available job openings is rapidly increasing and the highest in comparison to other jobs, data science has an extremely high job satisfaction rating, and the median annual salary base is undeniably desirable.

While data science is unquestionably a fantastic career path regarding the impressive ratings and the fact that it is such an in-demand job, statistics show that there will be no slowing down for the surprisingly rapid increase for the demand of data scientists around the globe.

Checkout the top 5 companies to work for if you are a data scientist based on employee reviews, job satisfaction ratings, and CEO approval.

#1 Dataiku

Dataiku is a top-rated computer software company that was founded in 2013 and its headquarters can be found in New York. This company develops collaborative data science software and according to Glassdoor reviews, 99% of the employees that work for Dataiku would recommend working at this company and 100% approve of the CEO. This shows that the vast majority of the employees are satisfied with the company and they are also a top choice for data science and machine learning positions based on annual pay packages.

Checkout: Dataiku Careers

#2 StreamSets

StreamSets was founded in 2014, its headquarter is located in San Francisco, California. The company develops a DataOps platform that can allow business to manage streaming data flows. An impressive 98% of individuals employed at this company would recommend it to their friends and 100% of the employees here also approve of the CEO. StreamSets is a top option for data management and integration.

Checkout: StreamSets Careers

#3 1010 Data

1010 Data has its headquarter in the New York and the company has over 15 years of experience in handling data analytics with over 850 clients across various industries. It is ranked as the third best company to opt for as a data scientist, 1010 Data is also a great option with 96% of employees recommending the company and 99% of employees approving of the CEO.

Checkout: 1010 Data Careers

#4 Reltio

Reltio is based in Redwood Shores, California and the company was founded in 2011. This top-rated company is recommended by 96% of its employees and a top choice for data management and integration. Even though it is fourth on the list according to statistics, it is still a fantastic company to expand your experience as a data scientist.

Checkout: Reltio Careers

#5 Looker

Looker was founded in 2012 and its headquarters are located in Santa Cruz, California. Looker is suggested as a great company to opt for by 95% of their happy employees and 93% of the employees that work at Looker approve of the CEO. This company is great for business analytics.

Checkout: Looker Careers

How can you get a job as a data scientist?

Having a degree in Data Science, Computer Science, Mathematics, Statistics, Social Science, Engineering with additional knowledge of Python, R Programming, Hadoop increases the possibility of getting a starting position job. Plenty of universities offer specialized data science program both online and offline. In recent times, we have observed a rise in online masters in data science, because of the convenience it offers to professionals, especially those looking to switch careers.

Build a portfolio using real data to complete projects that can showcase your abilities as a data scientist. You could also opt for an internship to further develop your skills and knowledge as a data scientist.

%d bloggers like this: