Tag Archives: data engineering

Getting Your First Job in Data Science

Getting your first data science job might be challenging, but it’s possible to achieve this goal with the right resources.

Before jumping into a data science career, there are a few questions you should be able to answer:

  • How do you break into the profession?
  • What skills do you need to become a data scientist?
  • Where are the best data science jobs?

First, it’s important to understand what data science is. To do data science, you have to be able to process large datasets and utilize programming, math, and technical communication skills. You also need to have a sense of intellectual curiosity to understand the world through data. To help complete the picture around data science, let’s dive into the different roles within data science.

The Different Data Science Roles

Data science teams come together to solve some of the hardest data problems an organization might face. Each member of the team brings a different part of the skill set required to complete a project from end to end.

Data Scientists

Data scientists are the bridge between programming and algorithmic thinking. A data scientist can run a project from end to end. They can clean large amounts of data, explore data sets to find trends, build predictive models, and create a story around their findings.

Data Analysts

Data analysts sift through data and provide helpful reports and visualizations. You can think of this role as the first step on the way to a job as a data scientist or as a career path in and of itself.

Data Engineers

Data engineers typically handle large amounts of data and lay the groundwork for data scientists to do their jobs effectively. They are responsible for managing database systems, scaling data architecture to multiple servers, and writing complex queries to sift through the data.

The Data Science Process

Now that you have a general understanding of the different roles within data science, you might be asking yourself, “What do data scientists actually do?”

Data scientists can appear to be wizards who pull out their crystal balls (MacBook Pros), chant a bunch of mumbo-jumbo (machine learning, random forests, deep networks, Bayesian posteriors) and produce amazingly detailed predictions of what the future will hold.

Data science isn’t magic mumbo-jumbo though, and the more precisely we can clarify this, the better. The power of data science comes from a deep understanding of statistics, algorithms, programming, and communication. More importantly, data science is about applying these skill sets in a disciplined and systematic manner. We apply these skill sets via the data science process. Let’s look at the data science process broken down into 6 steps.

Step 1: Frame the problem

Before you can start solving a problem, you need to ask the right questions so you can frame the problem.

Step 2: Collect the raw data needed for your problem

Now, you should think through what raw data you need to solve your problem and find ways to get that data.
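
To make this concrete, here is a minimal Python sketch of collecting raw data from a hypothetical REST API and a CSV export using requests and pandas. The URL, file name, and columns are placeholders for illustration, not part of any real project:

```python
import pandas as pd
import requests

# Pull raw records from a hypothetical REST API (placeholder URL).
response = requests.get("https://example.com/api/orders", timeout=30)
response.raise_for_status()
api_records = pd.DataFrame(response.json())

# Load additional raw data that was exported from another system as CSV.
csv_records = pd.read_csv("orders_export.csv")

# Stack both sources into one raw dataset for the next step.
raw_data = pd.concat([api_records, csv_records], ignore_index=True)
print(f"Collected {len(raw_data)} raw records")
```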

Step 3: Process the data for analysis

After you collect the data, you’ll need to begin processing it and checking for common errors that could corrupt your analysis.
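
As a rough sketch of what that processing might look like in pandas, assuming the hypothetical orders data from the previous step (column names like order_date and amount are made up for illustration):

```python
import pandas as pd

# The raw dataset collected in step 2 (file and column names are hypothetical).
raw_data = pd.read_csv("orders_export.csv")

# Check for common errors that could corrupt the analysis.
print(raw_data.isna().sum())        # missing values per column
print(raw_data.duplicated().sum())  # number of duplicate rows

# Fix what was found: drop duplicates, parse dates, and remove unusable rows.
clean = (
    raw_data.drop_duplicates()
    .assign(order_date=lambda df: pd.to_datetime(df["order_date"], errors="coerce"))
    .dropna(subset=["order_date", "amount"])
)
```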

Step 4: Explore the data

Once you have finished cleaning your data, you can start looking into it to find useful patterns.
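
Continuing the same hypothetical example, a quick exploration pass in pandas might look like this (the `clean` DataFrame comes from the step 3 sketch, and the region column is again a placeholder):

```python
# Summary statistics for every numeric column.
print(clean.describe())

# How are the records distributed across a hypothetical categorical column?
print(clean["region"].value_counts())

# Which numeric columns move together?
print(clean.select_dtypes("number").corr())
```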

Step 5: Perform in-depth analysis

Now, you will be applying your statistical, mathematical and technological knowledge to find every insight you can in the data.
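
If the question is predictive, this step often means fitting and validating a model. Here is a minimal scikit-learn sketch, still using the hypothetical cleaned dataset and made-up feature columns from the earlier steps:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical features and target taken from the cleaned dataset above.
X = clean[["quantity", "unit_price"]]
y = clean["amount"]

# Cross-validation gives a more honest estimate than a single train/test split.
model = RandomForestRegressor(n_estimators=200, random_state=0)
scores = cross_val_score(model, X, y, cv=5, scoring="r2")
print(f"R^2 across folds: {scores.mean():.3f} +/- {scores.std():.3f}")
```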

Step 6: Communicate the results of the analysis

The last step in the data science process is presenting your insights in an elegant manner. Make sure your audience knows exactly what you found.
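
One simple way to do that is a single clear chart rather than a wall of numbers. A small matplotlib sketch, again based on the hypothetical dataset from the earlier steps:

```python
import matplotlib.pyplot as plt

# Aggregate the finding you want to highlight, then plot it simply.
monthly = clean.groupby(clean["order_date"].dt.to_period("M"))["amount"].sum()

fig, ax = plt.subplots(figsize=(8, 4))
monthly.plot(kind="bar", ax=ax)
ax.set_title("Monthly order volume")
ax.set_xlabel("Month")
ax.set_ylabel("Total amount")
fig.tight_layout()
plt.show()
```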

If you worked as a data scientist, you would apply this process to your work every day.

What’s next?

Before you jump into data science and start working through the data science process, there are some things you need to learn to become a data scientist.

Most data scientists use a combination of skills every day. The skills necessary to become a data scientist include an analytical mindset, mathematics, data visualization, and business knowledge, just to name a few.

In addition to having the skills, you’ll then need to learn how to use modern data science tools. Hadoop, SQL, Python, R, and Excel are some of the tools you’ll need to be familiar with. Each tool plays a different role in the data science process.
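
To illustrate how two of those tools fit together, here is a hedged sketch of using SQL from Python: the query does the filtering and aggregation in the database, and pandas picks up the result for analysis. The database file and table are hypothetical:

```python
import sqlite3

import pandas as pd

# Connect to a hypothetical local SQLite database.
conn = sqlite3.connect("warehouse.db")

# SQL does the heavy lifting of filtering and aggregating...
query = """
    SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue
    FROM orders
    GROUP BY region
"""

# ...and pandas picks up the result for further analysis in Python.
summary = pd.read_sql_query(query, conn)
conn.close()
print(summary)
```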

If you’re ready to learn more about data science, take a deeper look at the skills necessary to become a data scientist, and learn how to get a job in the field, download Springboard’s comprehensive 60-page guide on how to get your first job in data science.



About Springboard: At Springboard, we’re building an educational experience that empowers our students to thrive in technology careers. Through our online workshops, we have prepared thousands of people for careers in data science.

GoLang for Data Science

While it is not one of the most popular programming languages for data science, the Go programming language (aka Golang) has surfaced for me a few times in the past few years as an option for data science. I decided to do some searching and draw some conclusions about whether Golang is a good choice for data science.

Popularity of Go and Data Science

As the following figure from Google Trends demonstrates, Golang and data science became trendy topics at about the same time and grew at a similar rate.

These parallel trends may have created the desire to merge the two technologies.

Golang Projects for Data Science

Some internet searching will reveal a number of interesting Golang data science projects on GitHub. Unfortunately, many of the projects had good initial traction but have dwindled in activity over the last couple of years. Below is a listing of some of the data science-related projects for Golang.

  • Gopher Data – Gophers doing data analysis; no scheduled events, and the last blog post was in 2017
  • Gopher Notes – Golang in Jupyter Notebooks
  • Lgo – Interactive programming with Jupyter for Golang
  • Gota – Data frames for Go, “The API is still in flux so use at your own risk.”
  • qframe – Immutable data frames for Go, better speed than Gota but not as well documented
  • GoLearn – Machine Learning for Go
  • Gorgonia – Library for machine learning in Go
  • Go Sklearn – Port of scikit-learn from Python; still active but with only a couple of committers; early but promising
  • Gonum – Numerical library for Go, very promising and active

Golang Data Science Books

There have even been a couple of books written about the topic.

Thoughts from the Community

The “Go for Data Science” debate has come up numerous times over the past few years. Below is a listing of some of those discussions and the key takeaways.

Reasons to use Golang for Data Science

  • Performance
  • Concurrency
  • Strong Developer Ecosystem
  • Basic Data Science packages are available

Reasons Not to use Golang for Data Science

  • Limited support from the data science community for Golang
  • Significantly increased time for exploratory analysis
  • Less flexibility to try other optimization and ML techniques
  • The data science community has not really adopted Golang

Summary

In short, Golang is not widely used for exploratory data science, but rewriting performance-critical algorithms in Golang once they are settled might be a good idea.

Microsoft Launches Data Science Certifications

Read to the end to learn more about a new study group I will be launching.

In late January 2019, Microsoft launched 3 new certifications aimed at data scientists and data engineers. For a while, Microsoft has been experimenting with different approaches to training and credentials. They launched the Microsoft Professional Program in Data Science back in 2017. While it provided great content, it did not result in either a college diploma or an official Microsoft certification. Now Microsoft is in the process of restructuring its certifications to be more role-focused. Here are details about the 3 certifications of interest to data scientists and data engineers.

1. Azure Data Scientist Associate

Exams Required:

For more details and to register, go to the Azure Data Scientist Associate page.

2. Azure AI Engineer Associate

Exams Required:

For more details and to register, go to the Azure AI Engineer Associate page.

3. Azure Data Engineer Associate

Exams Required:

For more details and to register, go to the Azure Data Engineer Associate page.

Exam Details

As of March 2019, the exams are in a beta phase, and the details of what they cover are sparse and vague.

New Study Group

Are you interested in taking one or all of the exams? I am organizing a study group/community. Sign up to get the latest details from Microsoft Data Science Certification Study Group.

Full Disclosure: I am not a Microsoft Employee, and this group is not sponsored or endorsed by Microsoft. I am just excited about the certifications and hoping to help others (and myself) prepare.

Azure Functions for Data Science

Data scientists do more than build fancy AI and machine learning models. They often need to get involved with the data acquisition process. It is common for data to be pulled from other databases or even an API. Plus, the models need to be deployed. These tasks fall to the data scientist to solve (unless there is a data engineer willing to help). Recently, I have discovered Azure Functions to be an extremely useful tool for these types of tasks.

What are Azure Functions?

Simply stated, Azure Functions are pieces of code that run. More formally stated,

Azure Functions is a serverless compute service that allows code to run on demand without the need to manage servers or hardware infrastructure.

This is exactly what a data scientist needs to solve the tasks mentioned above. I, for one, do not enjoy managing servers (hardware or virtual). I have done it before, but I find it time-consuming and tedious. It is just not my thing. Thus, I happily welcome the serverless capabilities of Azure Functions. I just focus on the code and get the task completed.

Because the code does not always need to be running, Azure Functions invoke your code based upon specified triggers. Once a trigger is activated, the code begins to run. The following list provides some examples of the available triggers, and a minimal sketch of a timer-triggered function follows the list.

Triggers for Azure Functions:
  1. Timer – Set a timer to run the Azure Function as often as you like. Timing is specified with a cron expression.
  2. HTTP Rest call – Have some other code fire off an HTTP request to start the Azure Function.
  3. Blob storage – Run the Azure Function whenever a new file is added to a Blob storage account.
  4. Event Hubs – Event Hubs are often used for collecting real-time data, and this integration offers Azure Functions the ability to run when a real-time event occurs.
  5. Others – Cosmos DB, Service Bus, IoT Hub, and GitHub events can also trigger an Azure Function.
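
To give a feel for the programming model, here is a hedged sketch of a timer-triggered function in Python. The exact layout depends on the Functions runtime and language worker you use, and the schedule mentioned in the comment is just an example NCRONTAB expression:

```python
import logging

import azure.functions as func


def main(mytimer: func.TimerRequest) -> None:
    # The schedule itself (e.g. "0 */5 * * * *" for every five minutes) is
    # declared in the accompanying function.json binding, not in the code.
    if mytimer.past_due:
        logging.info("The timer is running late")

    # Do the periodic chore here: pull fresh data, refresh a model, etc.
    logging.info("Timer trigger fired")
```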

What can Azure Functions Do?

Once you begin to understand the concept, you can quickly see some of the possibilities. Without having to configure servers or virtual machines, the following tasks become much simpler:

  • Reading and writing data from a database
  • Processing images
  • Interacting with an HTTP endpoint
  • Automating decisions in real-time
  • Computing descriptive statistics
  • Creating your own endpoint for other data scientists to call
  • Automatically analyzing code after commits

Programming Languages for Azure Functions

As of August 2018, full support is provided for C#, JavaScript, and F#. Experimental support is provided for Batch, PowerShell, Python, and TypeScript. Python can be used to create an HTTP endpoint, which would allow someone to quickly stand up an endpoint for running machine learning models via scikit-learn or another Python module. Unfortunately, R is not yet available, but Microsoft has a lot invested in R, so I expect support eventually.
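
As a hedged sketch of that idea, an HTTP-triggered Python function could load a pre-trained scikit-learn model and score incoming JSON requests. The model file, feature names, and request format below are all hypothetical, and the exact function signature depends on the runtime version you target:

```python
import json

import azure.functions as func
import joblib

# Load a hypothetical pre-trained scikit-learn model once per worker process.
model = joblib.load("model.joblib")


def main(req: func.HttpRequest) -> func.HttpResponse:
    try:
        payload = req.get_json()
    except ValueError:
        return func.HttpResponse("Send a JSON body", status_code=400)

    # Hypothetical feature order expected by the model.
    features = [[payload["quantity"], payload["unit_price"]]]
    prediction = model.predict(features)[0]

    return func.HttpResponse(
        json.dumps({"prediction": float(prediction)}),
        mimetype="application/json",
    )
```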

Simplify Tasks for Data Science

Next time you have a data science task which requires a little coding, consider using an Azure Function to run the code. It will most likely save you some deployment and configuration time. Then you can quickly get back to optimizing those fancy AI models.

See the video below for a quick demonstration of how to create an Azure function via a web browser (no IDE needed).

Best Practices for Machine Learning Engineering

Martin Zinkevich, Research Scientist at Google, just compiled a large list (43 to be exact) of best practices for building machine learning systems.

Rules of Machine Learning: Best Practices for ML Engineering

If you do data engineering or are involved with building data science systems, this document is worth a look.