Occasionally a product in Microsoft Azure will go down. Luckily, Azure has a status page to tell you which servers and services are down. Here is a quick video to help you find that status page.
While it is not one of the popular programming languages for data science, the Go programming language (aka Golang) has surfaced for me a few times over the past few years as an option for data science work. I decided to do some digging and draw some conclusions about whether Golang is a good choice for data science.
Popularity of Go and Data Science
As the following figure from Google Trends demonstrates, golang and data science became trendy topics at about the same time and grew at a similar rate.
The parallel trends may have encouraged efforts to merge the two technologies.
Golang Projects for Data Science
Some internet searching will reveal a number of interesting Golang/Data Science projects on GitHub. Unfortunately, many of these projects gained good initial traction but have dwindled in activity over the last couple of years. Below is a listing of some of the data science related projects for Golang.
- Gopher Data – Gophers doing data analysis; no scheduled events, and the last blog post was in 2017
- Gopher Notes – Golang in Jupyter Notebooks
- Lgo – Interactive programming with Jupyter for Golang
- Gota – Data frames for Go, “The API is still in flux so use at your own risk.”
- qframe – Immutable data frames for Go, better speed than Gota but not as well documented
- GoLearn – Machine Learning for Go
- Gorgonia – Library for machine learning in Go
- Go Sklearn – Port of scikit-learn from Python; still active but with only a couple of committers, early but promising
- Gonum – Numerical library for Go, very promising and active
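To give a feel for what working in Go looks like before reaching for any of these libraries, basic descriptive statistics can be computed with nothing but the standard library. This is a minimal sketch (the function names are my own, not from any of the projects above):

```go
package main

import (
	"fmt"
	"math"
)

// mean returns the arithmetic mean of xs.
func mean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// stdDev returns the sample standard deviation of xs.
func stdDev(xs []float64) float64 {
	m := mean(xs)
	ss := 0.0
	for _, x := range xs {
		ss += (x - m) * (x - m)
	}
	return math.Sqrt(ss / float64(len(xs)-1))
}

func main() {
	data := []float64{2, 4, 4, 4, 5, 5, 7, 9}
	fmt.Printf("mean=%.2f stddev=%.2f\n", mean(data), stdDev(data))
}
```

It is verbose compared to a one-liner in pandas or NumPy, which is exactly the trade-off the projects above try to soften.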
Golang Data Science Books
There have even been a couple books written about the topic.
- Go Machine Learning Projects (2018) – this book uses gonum and gorgonia in the examples
- Machine Learning with Go (2017)
Thoughts from the Community
The “Go for Data Science” debate has come up numerous times over the past few years. Below is a listing of some of those discussions and their key takeaways.
- Machine Learning with Go? on Reddit
“and once you know what you are going to do, implementing the training and deploying in Go is much better”
- Golang for Data Science on Reddit
“most likely it won’t be go and be one of more academia adopted languages like Python, MatLab or Julia”
- Data Science Gophers – O’Reilly Blog Post
- Moving From Python to Go – Towards Data Science blog post
not data science specific, but helpful
- Go for Data Science
by the author of gorgonia and one of the books above; includes a great presentation slide deck
- Why we switched from Python to Go
also not data science specific
- Go vs Python 3 Benchmarks
Go performs much better than Python on benchmarks
- Can Go really be that much faster than Python? on Stack Overflow
“Go really can be that much faster than python”
Reasons to use Golang for Data Science
- Strong Developer Ecosystem
- Basic Data Science packages are available
Reasons Not to use Golang for Data Science
- Limited support for and adoption of Golang within the data science community
- Significantly increased time for exploratory analysis
- Less flexibility to try other optimization and machine learning techniques
In short, Golang is not a strong choice for exploratory data science, but once a model is proven, rewriting your algorithms in Golang for deployment and performance might be a good idea.
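To illustrate the "rewrite it in Go" path, here is a sketch of simple linear regression fit by ordinary least squares using only the standard library; the function name and data are illustrative:

```go
package main

import "fmt"

// fitLine returns the slope and intercept of the least-squares
// line y = slope*x + intercept through the points (xs[i], ys[i]).
func fitLine(xs, ys []float64) (slope, intercept float64) {
	n := float64(len(xs))
	var sumX, sumY, sumXY, sumXX float64
	for i := range xs {
		sumX += xs[i]
		sumY += ys[i]
		sumXY += xs[i] * ys[i]
		sumXX += xs[i] * xs[i]
	}
	slope = (n*sumXY - sumX*sumY) / (n*sumXX - sumX*sumX)
	intercept = (sumY - slope*sumX) / n
	return slope, intercept
}

func main() {
	xs := []float64{1, 2, 3, 4}
	ys := []float64{2, 4, 6, 8} // points on the line y = 2x
	m, b := fitLine(xs, ys)
	fmt.Printf("slope=%.1f intercept=%.1f\n", m, b)
}
```

A model like this, prototyped in Python, compiles in Go to a single static binary with no runtime dependencies, which is a big part of the deployment appeal.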
Read to the end to learn more about a new study group I will be launching.
In late January 2019, Microsoft launched three new certifications aimed at data scientists and data engineers. For a while now, Microsoft has been experimenting with different approaches to training and credentials. They launched the Microsoft Professional Program in Data Science back in 2017. While it provides great content, it did not result in either a college diploma or an official Microsoft certification. Now Microsoft is in the process of restructuring certifications to be more role-focused. Here are details about the three certifications of interest to data scientists and data engineers.
1. Azure Data Scientist Associate
For more details and to register, go to the Azure Data Scientist Associate page.
2. Azure AI Engineer Associate
For more details and to register, go to the Azure AI Engineer Associate page.
3. Azure Data Engineer Associate
For more details and to register, go to the Azure Data Engineer Associate page.
As of March 2019, the exams are in a beta phase, and the details of what each exam covers are still sparse and vague.
New Study Group
Are you interested in taking one or all of the exams? I am organizing a study group/community. Sign up to get the latest details from Microsoft Data Science Certification Study Group.
Full Disclosure: I am not a Microsoft Employee, and this group is not sponsored or endorsed by Microsoft. I am just excited about the certifications and hoping to help others (and myself) prepare.
Data scientists do more than build fancy AI and machine learning models. They often need to get involved with the data acquisition process. It is common for data to be pulled from other databases or even an API. Plus, the models need to be deployed. These tasks fall to the data scientist to solve (unless there is a data engineer willing to help). Recently, I have found Azure Functions to be an extremely useful tool for these kinds of tasks.
What are Azure Functions?
Simply stated, Azure Functions are pieces of code that run on demand. More formally stated,
Azure Functions is a serverless compute service that allows code to run on demand without the need to manage servers or hardware infrastructure.
This is exactly what a data scientist needs to solve the tasks mentioned above. I, for one, do not enjoy managing servers (hardware or virtual). I have done it before, but I find it time-consuming and tedious. It is just not my thing. Thus, I happily welcome the serverless capabilities of Azure Functions. I just focus on the code and get the task completed.
Because the code does not always need to be running, Azure Functions invoke the code based upon specified triggers. Once the trigger is activated, the code will begin to run. The following list provides some examples of the triggers available.
Triggers for Azure Functions:
- Timer – Set a timer to run the Azure Function as often as you like. Timing is specified with a cron expression.
- HTTP Rest call – Have some other code fire off an HTTP request to start the Azure Function.
- Blob storage – Run the Azure Function whenever a new file is added to a Blob storage account.
- Event Hubs – Event Hubs are often used for collecting real-time data, and this integration offers Azure Functions the ability to run when a real-time event occurs.
- Others – Cosmos DB, Service Bus, IoT Hub, and GitHub events can also trigger an Azure Function.
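For example, a timer trigger is declared in the function's `function.json` binding configuration. The schedule uses a six-field cron expression that includes a seconds field; the binding name below is illustrative, and this schedule runs the function every five minutes:

```json
{
  "bindings": [
    {
      "name": "myTimer",
      "type": "timerTrigger",
      "direction": "in",
      "schedule": "0 */5 * * * *"
    }
  ]
}
```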
What can Azure Functions Do?
Once you begin to understand the concept, you can quickly see some of the possibilities. Without having to configure servers or virtual machines, the following tasks become much simpler:
- Reading and writing data from a database
- Processing images
- Interacting with an HTTP endpoint
- Automating decisions in real-time
- Computing descriptive statistics
- Creating your own endpoint for other data scientists to call
- Automatically analyzing code after commits
Programming Languages for Azure Functions
At the time of writing, Azure Functions officially supports C#, JavaScript, F#, and Java, with Python support in preview.
Simplify Tasks for Data Science
Next time you have a data science task which requires a little coding, consider using an Azure Function to run the code. It will most likely save you some deployment and configuration time. Then you can quickly get back to optimizing those fancy AI models.
See the video below for a quick demonstration of how to create an Azure function via a web browser (no IDE needed).
Pablo Casas has published a book freely available online, Data Science Live Book. To quote from the book,
It is a book about data preparation, data analysis and machine learning.
The book is open source, and the code examples are written in R.
Martin Zinkevich, Research Scientist at Google, just compiled a large list (43 to be exact) of best practices for building machine learning systems.
If you do data engineering or are involved with building data science systems, this document is worth a look.
The differences between data scientists, data engineers, and software engineers can get a little confusing at times. Thus, here is a guest post from Jake Stein, CEO at Stitch (formerly RJMetrics), which aims to clear up some of that confusion based upon LinkedIn data.
As data grows, so does the expertise needed to manage it. The past few years have seen an increasing distinction between the key roles tasked with managing data: software engineers, data engineers, and data scientists.
More and more we’re seeing data engineers emerge as a subset within the software engineering discipline, but this is still a relatively new trend. Plenty of software engineers are still tasked with moving and managing data.
Our team has released two reports over the past year, one focused on understanding the data science role, one on data engineering. Both of these reports are based on self-reported LinkedIn data. In this post, I’ll lay out the distinctions between these roles and software engineers, but first, here’s a diagram to show you (in very broad strokes) what we saw in the skills breakdown between these three roles:
A software engineer builds applications and systems. Developers are involved in all stages of this process, from design to writing code to testing and review. They create the products that create the data. Software engineering is the oldest of these three roles and has established methodologies and tool sets.
- Frontend and backend development
- Web apps
- Mobile apps
- Operating system development
- Software design
A data engineer builds systems that consolidate, store, and retrieve data from the various applications and systems created by software engineers. Data engineering emerged as a niche skill set within software engineering. 40% of all data engineers were previously working as a software engineer, making this the most common career path for data engineers by far.
- Advanced data structures
- Distributed computing
- Concurrent programming
- Knowledge of new & emerging tools: Hadoop, Spark, Kafka, Hive, etc.
- Building ETL/data pipelines
A data scientist builds analysis on top of data. This may come in the form of a one-off analysis for a team trying to better understand customer behavior, or a machine learning algorithm that is then implemented into the code base by software engineers and data engineers.
- Data modeling
- Machine learning
- Business Intelligence dashboards
Evolving Data Teams
These roles are still evolving. The process of ETL is getting much easier overall as new tools (like Stitch) enter the market, making it easy for software developers to set up and maintain data pipelines. Larger companies are pulling data engineers off the software engineering team entirely and forming a centralized data team where infrastructure and analysis sit together. In some scenarios, data scientists are responsible for both data consolidation and analysis.
At this point, there is no single dominant path. But we expect this rapid evolution to continue; after all, data certainly isn't getting any smaller.