All posts by Ryan Swanstrom

Cloud Data Science News Beta #2

Here are this week's major announcements and news for doing data science in the cloud.

Microsoft Azure

Amazon Web Services

Google Cloud

If you would like to get the Cloud Data Science News as an email, you can sign up for the Cloud Data Science Newsletter.

Storing Data On a Piece of Glass – Microsoft’s Project Silica

Last week at Microsoft Ignite 2019, Microsoft Research announced Project Silica, an amazing new technology for storing data on a piece of glass. I was lucky enough to get the opportunity to sit down with Antony Rowstron, Deputy Lab Director at Microsoft Research, and ask him all my questions.

[Image: An interview with Project Silica researcher, Ant Rowstron]

What is Project Silica?

Project Silica is a research endeavor to store digital data on a piece of quartz glass. The technology already exists and can store files the size of the original Superman movie or the Windows 10 operating system.

The technology is still very new, so it will be years before it is ready for production.

How is Data Stored?

A femtosecond laser is fired inside the glass, using multiple pulses to form a voxel. A voxel can be thought of as a tiny iceberg within the glass. It is shaped like a teardrop and is approximately 1 micron (1 micrometer) in size. Each voxel can store multiple bits; the data it encodes depends upon the size and orientation of the voxel.

Many layers of voxels are stored on a piece of glass. The sample I got to see had 20 layers. The sample from the keynote had 74 layers. Current capabilities allow for hundreds of layers.

Why Quartz Glass?

The glass needs to be transparent because, in order to read the data, a microscope-like device must be able to focus on different layers. That means it has to be able to see through the upper layers.

Quartz glass is purer than window pane glass. Plus, it is a readily available substance that the world already produces.

Most importantly,

“The properties of the voxel formations are a function of the glass.”

–Ant Rowstron, Microsoft Research

How is Data Read?

A separate device, a computer-controlled microscope, is used to read the data. It first focuses on the layer of interest, and a set of polarization images is taken. These images are then processed to determine the orientation and size of the voxels. The process is repeated for the other layers.

The images are fused using machine learning, specifically a convolutional neural network. In addition,

“There are about 8 tracks which have well-known data written in them. … If in 100 years’ time it doesn’t read, we can retrain the ML from the tracks we have. It is a self-describing media.”

–Ant Rowstron on how data is read from Project Silica
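
To make the decoding idea concrete, here is a purely illustrative toy sketch in Python. It is not Microsoft's pipeline: it assumes, for illustration only, that each voxel encodes two bits as one of four polarization orientations and decodes by nearest angle, whereas the real system infers both orientation and size from polarization images with a convolutional neural network.

```python
import numpy as np

# Toy assumption: each voxel encodes 2 bits as one of four
# polarization orientations (0, 45, 90, 135 degrees).
ORIENTATIONS = np.array([0.0, 45.0, 90.0, 135.0])

def decode_voxel(measured_angle: float) -> int:
    """Map a measured polarization angle to the nearest symbol (0-3)."""
    diffs = np.abs(ORIENTATIONS - (measured_angle % 180.0))
    diffs = np.minimum(diffs, 180.0 - diffs)  # angles wrap at 180 degrees
    return int(np.argmin(diffs))

def decode_layer(measured_angles):
    """Decode one focal layer; the reader repeats this for each layer."""
    return [decode_voxel(a) for a in measured_angles]

# Noisy readings of four voxels, ideally symbols 0, 1, 2, 3:
print(decode_layer([2.0, 47.5, 88.0, 133.0]))  # -> [0, 1, 2, 3]
```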

What are the Use Cases?

Project Silica is being created as a long-term archival storage medium. Previous technologies, such as tape and hard disks, were designed before the cloud existed. They have limitations around temperature, humidity, air quality, and lifespan. Project Silica avoids those limitations.

“This is a technology designed just for the [cloud] datacenter.”

–Ant Rowstron, Microsoft Research

How Durable is it?

The quartz glass will not deteriorate, which is one of its best qualities for this application. The glass can retain its stored data after being submerged in boiling water, put in a flame, scratched with steel wool, shaken, or microwaved.

The voxels will still be there after thousands of years.

It can, however, be smashed with a hammer or shattered like any other piece of glass.

What Do You Envision as the Future of Project Silica?

A wall covered in plates of glass, with a robotic arm that fetches a piece of glass, carries it to a reader, and returns it to the wall once it has been read.

What is Next?

There are three things that are really important to any storage technology:

  1. Density – how much data can fit in a given amount of space
  2. Write throughput – how fast the data can be written
  3. Read throughput – how quickly the data can be read

Microsoft is going to push hard on all three. Currently, a piece of glass the size of an optical disc can store more data than an optical disc, and the read/write process has become 1,000 times faster than it was at the beginning of the project. Theoretically, the piece of glass I am holding in the image below should be able to hold hundreds of terabytes.

[Image: Holding Project Silica with Ant Rowstron]

More Information

If you are looking for more information, you can read the research paper, Glass: A New Media for a New Era, or listen to the Microsoft Research Podcast episode, Optics for the Cloud.

Microsoft has also published a short video demonstrating some of the capabilities.

Cloud Data Science News Beta #1

Welcome to the first beta edition of Cloud Data Science News. This will cover major announcements and news for doing data science in the cloud. If the first few are well received, this will become a weekly segment.

Microsoft Azure

  • Azure Arc
    You can now run Azure services anywhere you can run Kubernetes (on-premises, at the edge, or in another cloud).
  • Azure Synapse Analytics
    This is the future of data warehousing. It combines data warehousing and data lakes under a single query interface for a simple and fast analytics service.
  • SQL Server 2019
    SQL Server 2019 is now generally available.
  • Data Science Announcements from Microsoft Ignite
    Many other services were announced, such as Azure Quantum, Project Silica, R support in Azure ML, and Visual Studio Online.

Amazon Web Services

  • Call for Research Proposals
    Amazon is seeking proposals for impactful research in artificial intelligence and machine learning. If you are at a university or non-profit, you can ask for cash and/or AWS credits.
  • AWS ParallelCluster for Machine Learning
    AWS ParallelCluster is an open-source cluster management tool. It can be used to do distributed machine learning on AWS.

Google Cloud

If you would like to get the Cloud Data Science News as an email, you can sign up for the Cloud Data Science Newsletter.

Data Science News from Microsoft Ignite 2019

Microsoft just held one of its largest conferences of the year, and a few major announcements were made that pertain to the cloud data science world. Here they are, in order of importance (in my opinion).

Azure Synapse

I think this announcement will have a very large and immediate impact. Azure Synapse Analytics can be seen as a merger of Azure SQL Data Warehouse and Azure Data Lake. Synapse lets you use SQL to query petabytes of data, both relational and non-relational, with amazing speed.

Azure Arc

Azure Arc allows deployment and management of Azure services in any environment that can run Kubernetes, letting Azure manage a completely hybrid infrastructure spanning Azure, on-premises, IoT, and other cloud environments. It is now possible to deploy an Azure SQL Database to a virtual machine running on Amazon Web Services (AWS) and manage it from Azure. It’s true, I saw it happen this week.

R Support for Azure Machine Learning

Azure Machine Learning now has a new web interface, and it just gained support for the R programming language (Python support has been available for a while). Azure Machine Learning is an environment that helps with all aspects of data science, from data cleaning to model training to deployment.

Others

There were a few other interesting announcements which are not completely specific to data scientists, but are worth mentioning.

Visual Studio Online

This is exactly what it sounds like: an Integrated Development Environment (IDE) in your browser. I have not had a chance to try it out yet, so I am not sure of its use case for data science. Years ago, many companies attempted to do this, but most are no longer around. Hopefully, now is a better time for an in-browser IDE, and Visual Studio Online will succeed.

Project Silica

Without question, this was the coolest thing unveiled at Ignite. Microsoft Research has come up with a technique to store data in a piece of glass. They call it Project Silica. I was fortunate to be able to speak with one of the lead researchers on the project, so I will be sharing more from that interview later. It is fascinating, but probably years from implementation.

Azure Quantum

I have been ignoring quantum computing for a while now, but it is time for that to stop. Microsoft is a (qu)bit late to the game, but it is making some impressive progress. The Quantum Development Kit (QDK) has been released, and Azure Quantum is in private preview, so you can sign up to be an early adopter.

Those are the big data science announcements of the week.

Fundamentals of Data Mining

Today we are generating more data than ever before; 90 percent of the data in the world was generated over the last two years alone. On its own, this data does not make any sense unless patterns can be identified within it. Data mining is the process of discovering those patterns and is therefore also known as Knowledge Discovery from Data (KDD).

A definition from the book ‘Data Mining: Practical Machine Learning Tools and Techniques’, written by Ian Witten and Eibe Frank, describes data mining as follows:

Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. … Machine learning provides the technical basis for data mining. It is used to extract information from the raw data in databases…

Overview

Take the example of a big supermarket that has a large number of records of customer purchases. Conventionally, this data is mostly used for inventory management or budgeting. However, by using sophisticated data mining tools and diligently scanning through the data for patterns never seen before, the supermarket's management can learn which combinations of products its customers purchase most often, and how seasonality and other factors influence their purchasing decisions. This is the basis of recommender systems, which are now used to boost sales by recommending products to frequent customers based on their previous purchases, and consequently increase customer satisfaction.
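
As a rough illustration of the idea, the sketch below counts how often pairs of products appear in the same basket, using made-up purchase data; real recommender systems use more sophisticated techniques such as association-rule mining or collaborative filtering.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase records; each inner list is one customer's basket.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "cereal"],
    ["bread", "butter", "cereal"],
]

# Count how often each pair of products appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(set(basket)), 2):
        pair_counts[pair] += 1

# The most frequent pairs are candidate "bought together" recommendations.
print(pair_counts.most_common(2))  # [(('bread', 'butter'), 3), ...]
```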

As you may expect, this process can be extremely arduous and time-consuming, and it may require a high level of acumen to come up with valuable and actionable insights. Going through such a large amount of data, where trends might not be obvious, can be painful and demotivating. Learning some useful data mining procedures can therefore prove beneficial.

You might be wondering what benefit you can get out of these techniques. As taught in Data Science Dojo's data science bootcamp, you will gain improved prediction and forecasting for your product. An in-depth analysis of trends offers managers a much more reliable way to plan and forecast, and it helps them make evidence-based decisions rather than relying on mere conjecture and intuition. Additionally, it enables an organization to use its resources optimally and enhance the customer experience. How mining techniques can be leveraged to fulfill an organization's goals is discussed below.

Data Mining Process

The complexity of the data mining process varies with the size and kind of data an organization has and the goals it aims to fulfill. In most cases, however, a generic process underlies all such activities.

Domain Knowledge

The foremost step of this process is to possess relevant domain knowledge about the problem at hand. To anyone without background knowledge of the business, a large pile of data may look like a collection of junk. Only with that knowledge can one get a sense of what sort of data is required, which properties of the data are relevant, and how they could be used to solve the problem. Once these questions are answered, it becomes easier to stay focused, allocate resources properly, and eventually attain a productive result. This step guides the discovery method and allows discovered patterns to be expressed in concise terms and at different levels of abstraction.

Data Collection

After defining the goals in the previous step, it is essential to collect data. This could involve using data that already exists in a company's database, obtaining data from external sources, or collecting new data through survey forms filled out by customers. Some experts hold the view that an organization should collect as much data as possible, even if its use is unclear at an early stage.

Data Cleaning and Preprocessing

Following collection comes the most onerous step of all: data cleaning and preprocessing. In simple terms, this step involves dealing with missing values and outliers and removing noise or other misleading components that may cause false conclusions. It also includes transforming the data into the form required by the mining procedure. This step can take a lot of resources, effort, and patience.
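
A minimal sketch of what this step can look like in practice, using pandas and made-up column names; real pipelines are usually far more involved.

```python
import pandas as pd

# Hypothetical raw sales data; columns are made up for illustration.
df = pd.DataFrame({
    "units_sold": [12, None, 15, 400, 14, 13],
    "region": ["north", "north", None, "south", "south", "south"],
})

# Fill missing values: median for numeric, mode for categorical.
df["units_sold"] = df["units_sold"].fillna(df["units_sold"].median())
df["region"] = df["region"].fillna(df["region"].mode()[0])

# Clip extreme outliers to the 1st-99th percentile range.
low, high = df["units_sold"].quantile([0.01, 0.99])
df["units_sold"] = df["units_sold"].clip(low, high)
print(df)
```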

Analysis and Interpretation

The most crucial part begins by limiting the data to its most important features and creating new, useful features from combinations of the existing ones. The data is then carefully analyzed, and hidden patterns are extracted using mining algorithms. The models built by these algorithms can be evaluated against appropriate metrics to verify their credibility; the choice of metric depends on the nature of the problem. A problem involving the detection of fraudulent activity, for example, might focus on false negatives (missed fraud) as its key evaluation metric. The patterns discovered in this step are interpreted using various visualization and reporting techniques and made comprehensible for other team members.
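
For instance, here is a small sketch of evaluating a hypothetical fraud model against the false-negative count and recall, using scikit-learn and made-up labels.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = fraudulent, 0 = legitimate.
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# A missed fraud (false negative) is costly, so recall on the
# positive class is a natural evaluation metric for this problem.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"false negatives: {fn}, recall: {recall_score(y_true, y_pred):.2f}")
```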

Deployment

Finally, the insights are used to take action and make the important business decisions that solve the problem, realizing the value of the entire laborious process. The success of this process can be assessed by how much value it brings to your business.

Data Mining Models

The models used for data mining fall primarily under two main types: supervised and unsupervised. The former refers to models trained on labeled data, while the latter refers to models that work with unlabeled data. These models can be further classified as described below.

Classification

Classification is a supervised learning technique in which a known structure is generalized to distinguish instances in new data. Based on the data, the model creates sets of discrete rules that split the data so as to group the highest proportion of similar target values together. Banks use classification to predict whether a client will default on a loan payment, based on the client's activities.
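
A minimal scikit-learn sketch of the loan-default example, with made-up features and labels.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical features: [income (thousands), existing debt (thousands)]
X = [[30, 40], [80, 10], [25, 35], [90, 5], [40, 30], [75, 8]]
y = [1, 0, 1, 0, 1, 0]  # 1 = defaulted, 0 = repaid

# A shallow decision tree learns discrete splitting rules from the data.
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(clf.predict([[50, 20]]))  # predicted label for a new client
```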

Regression

Regression analysis is a statistical method for examining the relationship between two or more variables. It is a supervised learning technique used in predictive analytics to estimate a continuous value from one or more input variables. For example, companies use regression algorithms to forecast sales in future months based on the sales data of previous months.
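
A minimal scikit-learn sketch of the sales-forecasting example, with made-up monthly figures.

```python
from sklearn.linear_model import LinearRegression

# Hypothetical monthly sales: month index -> units sold.
X = [[1], [2], [3], [4], [5], [6]]
y = [100, 110, 125, 130, 145, 150]

# Fit a line through past months and extrapolate to future ones.
model = LinearRegression().fit(X, y)
print(model.predict([[7], [8]]))  # forecast for the next two months
```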

Anomaly Detection

Also known as outlier detection, anomaly detection is an unsupervised learning technique used to find rare occurrences or suspicious events in your data. Unusual data points may point to a problem or rare event that warrants further investigation. Anomaly detection can be used in network security to find external intrusions or suspicious user activity, for instance, a hacker opening connections on uncommon ports or protocols.
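
A small sketch of the idea using scikit-learn's Isolation Forest (one of several anomaly detection algorithms) on made-up connection counts.

```python
from sklearn.ensemble import IsolationForest

# Hypothetical connections opened per host; the last point is suspicious.
X = [[20], [22], [19], [21], [23], [20], [500]]

# Isolation Forest flags points that are easy to isolate as anomalies.
detector = IsolationForest(contamination=0.15, random_state=0).fit(X)
print(detector.predict(X))  # -1 marks anomalies, 1 marks normal points
```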

Clustering

Another unsupervised learning method, clustering is the practice of assigning labels to unlabeled data using the patterns that exist within it. It helps uncover structure in data by grouping similar data points together. For example, clustering is used to group a large set of documents into categories based on their content.
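
A minimal sketch of the document-grouping example, using TF-IDF features and k-means in scikit-learn on made-up documents.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents covering two rough topics.
docs = [
    "stock market prices fell sharply",
    "investors worry about market volatility",
    "the team won the championship game",
    "players celebrate a big playoff win",
]

# Vectorize the text, then group documents by similarity.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. [0, 0, 1, 1]: similar documents share a cluster
```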

Common Applications

Data mining can be used effectively in marketing to create customer segments based on purchasing patterns extracted from behavioral analysis. These segments can drive targeted advertising that varies with the customer profile and, as a result, increases the conversion rate.

As mentioned above, financial organizations have been using data mining to detect fraudulent transactions. Previous transaction data can be analyzed to extract patterns and find anomalies that help distinguish such transactions. These methods are becoming more effective at combating fraud and at anticipating atypical activity that might otherwise go unnoticed.

In the area of natural language processing, data mining, often referred to as text mining, can be extremely useful for analyzing large volumes of news, social media, or other text data. Text mining techniques have evolved to become contextually aware and can, for example, reveal public sentiment toward a particular product; a firm could use them to understand how the general public feels about its newly launched device. They also help discover the topics the public is discussing around a particular subject, and they are used to detect fake news on social media as well.

About The Author

Rahim Rasool is an Associate Data Scientist at Data Science Dojo (DSD), where he helps create learning material for DSD's data science bootcamp. He holds a bachelor's degree in electrical engineering from the National University of Sciences and Technology. He has a strong interest in machine learning, astronomy, and history.

Note: Data Science 101 is proud to have this sponsored post from Data Science Dojo.

Why You should Attend SQLSaturday – An Interview with John Byrnes

A few weeks ago, I attended my first SQLSaturday event. I brought along my camera and was lucky enough to record a couple of interviews. This is one of them.

I sat down with John Byrnes and we discussed:

  • Where has SQL taken his career?
  • What does the SQLSaturday community mean to him?
  • Why should someone attend a SQLSaturday event?

[Video: SQLSaturday Interview with John Byrnes]