Category Archives: Learn Data Science

This is a category for all things related to learning data science.

Storing Data On a Piece of Glass – Microsoft’s Project Silica

Last week at Microsoft Ignite 2019, Microsoft Research unveiled Project Silica, an amazing new technology for storing data on a piece of glass. I was lucky enough to sit down with Antony Rowstron, Deputy Lab Director at Microsoft Research, and ask him all my questions.

[Image: An interview with Project Silica researcher, Ant Rowstron]

What is Project Silica?

Project Silica is a research endeavor to store digital data on a piece of quartz glass. The technology already works and has been used to store files the size of the original Superman movie or the Windows 10 operating system.

The technology is still very new, so it will be years before it is put into production.

How is Data Stored?

A femtosecond laser is fired inside the glass, using multiple pulses to form a voxel. A voxel can be thought of as a tiny iceberg within the glass; it is shaped like a teardrop and is approximately 1 micron (1 micrometer) in size. Each voxel can store multiple bits, and the data encoded depends upon the size and orientation of the voxel.
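
As a back-of-envelope illustration, the number of bits per voxel follows from how many distinct size and orientation states can be reliably distinguished. The counts below are invented; the actual encoding parameters have not been published in this detail:

  import math

  # Invented numbers for illustration only; these are not Project Silica's
  # actual encoding parameters.
  orientations = 4   # assumed number of distinguishable voxel orientations
  sizes = 2          # assumed number of distinguishable voxel sizes

  states = orientations * sizes
  bits_per_voxel = math.log2(states)
  print(f"{states} states per voxel -> {bits_per_voxel:.0f} bits")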

Many layers of voxels are stored on a piece of glass. The sample I got to see had 20 layers. The sample from the keynote had 74 layers. Current capabilities allow for hundreds of layers.

Why Quartz Glass?

The glass needs to be transparent because, in order to read the data, a microscope-like device must be able to focus on different layers. That means it has to see through the layers above.

Quartz glass is purer than window pane glass. Plus, it is a readily available substance that the world already produces.

Most importantly,

“The properties of the voxel formations is a function of glass.”

–Ant Rowstron, Microsoft Research

How is Data Read?

A separate device, a computer-controlled microscope, is used to read the data. First, it focuses on the layer of interest and a set of polarization images is taken. These images are then processed to determine the orientation and size of the voxels. The process is repeated for the other layers.

The images are fused using machine learning, specifically a convolutional neural network (CNN). In addition,

“There are about 8 tracks which have well-known data written in them. … If in 100 years’ time it doesn’t read, we can retrain the ML from the tracks we have. It is a self-describing media.”

–Ant Rowstron on how data is read from Project Silica
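
To make the reading pipeline concrete, here is a toy sketch of how a convolutional network could classify a voxel's orientation from a stack of polarization images. Every detail below (the number of polarization angles, patch size, layer sizes, and orientation classes) is invented for illustration; Microsoft has not published its actual model:

  import torch
  import torch.nn as nn

  N_POLARIZATIONS = 6  # assumed number of polarization angles imaged
  N_ORIENTATIONS = 4   # assumed number of distinguishable orientations

  class VoxelOrientationNet(nn.Module):
      """Hypothetical CNN: polarization image patch -> orientation logits."""
      def __init__(self):
          super().__init__()
          self.features = nn.Sequential(
              nn.Conv2d(N_POLARIZATIONS, 16, kernel_size=3, padding=1),
              nn.ReLU(),
              nn.MaxPool2d(2),
              nn.Conv2d(16, 32, kernel_size=3, padding=1),
              nn.ReLU(),
              nn.AdaptiveAvgPool2d(1),
          )
          self.classifier = nn.Linear(32, N_ORIENTATIONS)

      def forward(self, x):
          # x: (batch, N_POLARIZATIONS, height, width) patch around one voxel
          return self.classifier(self.features(x).flatten(1))

  model = VoxelOrientationNet()
  patch = torch.randn(1, N_POLARIZATIONS, 32, 32)  # one random fake patch
  print(model(patch).shape)  # torch.Size([1, 4]) orientation logits

The well-known calibration tracks Rowstron mentions would supply labeled examples for retraining such a model decades later.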

What are the Use Cases?

Project Silica is being created as a long-term archival storage medium. Previous technologies such as tape and hard disks were designed before the cloud existed. They have limitations around temperature, humidity, air quality, and lifespan. Project Silica avoids those limitations.

“This is a technology designed just for the [cloud] datacenter.”

–Ant Rowstron, Microsoft Research

How Durable is it?

The quartz glass will not deteriorate, which is one of its best qualities for this application. The glass can retain its stored data after being submerged in boiling water, put in a flame, scratched with steel wool, shaken, or microwaved.

The voxels will still be there after thousands of years.

It can, however, be smashed with a hammer or shattered like any other piece of glass.

What Do You Envision as the Future of Project Silica?

A wall covered in plates of glass, with a robotic arm that fetches a piece of glass, carries it to a reader, and returns it to the wall once it has been read.

What is Next?

There are three things which are really important to any storage technology.

  1. Density – how much data can fit in a given amount of space
  2. Write throughput – how fast the data can be written
  3. Read throughput – how quickly the data can be read

Microsoft is going to push hard on all three. Currently, a piece of glass the size of an optical disc can store more data than the optical disc itself, and the read/write process is already 1,000 times faster than it was at the beginning of the project. Theoretically, the piece of glass I am holding in the image below should be able to hold hundreds of terabytes.

[Image: Holding Project Silica with Ant Rowstron]

More Information

If you are looking for more information, you can read the research paper, Glass: A New Media for a New Era, or listen to the Microsoft Research Podcast episode, Optics for the Cloud.

Microsoft has also published a short video demonstrating some of the capabilities.

Fundamentals of Data Mining

Today we are generating more data than ever before; by some estimates, 90 percent of the world's data was generated over the last two years alone. On its own, this data does not make any sense unless related patterns can be identified within it. Data mining is the process of discovering these patterns and is therefore also known as Knowledge Discovery from Data (KDD).

A definition from the book ‘Data Mining: Practical Machine Learning Tools and Techniques’ by Ian Witten and Eibe Frank describes data mining as follows:

Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. … Machine learning provides the technical basis for data mining. It is used to extract information from the raw data in databases…

Overview

Take the example of a big supermarket that has a large number of records of customer purchases. Conventionally, in many supermarkets, this data is mostly used for inventory management or budgeting. However, by using sophisticated data mining tools to diligently scan the data for patterns that were never seen before, the supermarket's management can learn which combinations of products are most often purchased together and how seasonality and other factors influence purchasing decisions. This is the basis of ‘recommender systems’, which are now being implemented to boost sales by recommending products to frequent customers based on their previous purchases, which in turn increases customer satisfaction.
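
As a toy illustration of how such patterns can be found, counting which pairs of products co-occur in the same basket is a first step toward these recommendations. The transactions below are made up:

  from collections import Counter
  from itertools import combinations

  # Made-up purchase records; each inner list is one customer's basket.
  baskets = [
      ["bread", "milk", "eggs"],
      ["bread", "milk"],
      ["milk", "eggs", "butter"],
      ["bread", "butter"],
      ["bread", "milk", "butter"],
  ]

  # Count how often each pair of products appears in the same basket.
  pair_counts = Counter()
  for basket in baskets:
      pair_counts.update(combinations(sorted(set(basket)), 2))

  print(pair_counts.most_common(2))  # ('bread', 'milk') bought together 3 times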

As you may expect, this process can be extremely arduous and time-consuming, and may require a high level of acumen to come up with valuable and actionable insights. Combing through such a large amount of data, where trends might not be obvious, can be painful and demotivating. Therefore, learning some useful data mining procedures may prove beneficial.

You might be wondering what benefits you can get out of these techniques. As taught in Data Science Dojo's data science bootcamp, they improve prediction and forecasting for your product. An in-depth analysis of trends offers managers a much more reliable way to plan and forecast. It also boosts their decision-making, as decisions become evidence-based rather than mere conjecture and intuition. Additionally, it enables an organization to utilize resources optimally and enhance the customer experience. How mining techniques can be leveraged to fulfill organizational goals is discussed below.

Data Mining Process

The complexity of the entire data mining process varies with the size and kind of data an organization has and the goals it wants to achieve. In most cases, however, a generic process underlies all such activities.

Domain Knowledge

The foremost step of this process is to possess relevant domain knowledge about the problem at hand. To anyone without background knowledge of the business, a large pile of data may look like a collection of junk. Only with that knowledge can one get a sense of what sort of data is required, which properties of the data are relevant, and how they could be used to solve the problem. Once these questions are answered, it becomes easy to stay focused, allocate resources properly, and eventually attain a productive result. This step guides the discovery process and allows discovered patterns to be expressed in concise terms and at different levels of abstraction.

Data Collection

After defining the goals in the previous step, it is essential to collect data. This could involve using data that already exists in a company's database, acquiring data from external sources, or collecting new data, for example through survey forms filled out by customers. Some experts hold the view that an organization should collect as much data as possible, even if its value is not obvious at an early stage.

Data Cleaning and Preprocessing

Following the collection step comes the most onerous step of all: data cleaning and preprocessing. In simple terms, this step involves dealing with missing values and outliers and removing noise or other misleading components that may lead to false conclusions. It also includes transforming the data into the form required by the mining procedure. This step may take a lot of resources, effort, and patience to perform.
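
As a brief sketch of what this step can look like in practice, here are a few typical cleaning operations in pandas. The file name, column names, and thresholds are all hypothetical:

  import numpy as np
  import pandas as pd

  # Hypothetical dataset; the file and columns are made up for illustration.
  df = pd.read_csv("transactions.csv")

  # Drop rows where the target column is missing.
  df = df.dropna(subset=["purchase_amount"])

  # Fill missing ages with the median age.
  df["age"] = df["age"].fillna(df["age"].median())

  # Remove extreme outliers: keep amounts within 3 standard deviations.
  mean, std = df["purchase_amount"].mean(), df["purchase_amount"].std()
  df = df[(df["purchase_amount"] - mean).abs() <= 3 * std]

  # Transform a skewed feature into a form many algorithms handle better.
  df["log_amount"] = np.log1p(df["purchase_amount"])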

Analysis and Interpretation

The most crucial part begins by limiting the data to the most important features and creating new, useful features from combinations of the existing ones. The data is carefully analyzed, and mining algorithms are used to extract hidden patterns from it. The models created by these algorithms can be evaluated against appropriate metrics to verify their credibility; the choice of metric depends on the nature of the problem. A problem involving the detection of fraudulent activity, for instance, might use the false-negative rate as its evaluation metric. The patterns discovered in this step are then interpreted using various visualization and reporting techniques and made comprehensible for other team members.
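
As a minimal sketch of such an evaluation, consider a fraud-detection model where false negatives (missed frauds) are the costly error. The labels and predictions below are invented:

  from sklearn.metrics import confusion_matrix, recall_score

  # Invented ground truth (1 = fraud) and model predictions.
  y_true = [0, 0, 1, 1, 0, 1, 0, 1]
  y_pred = [0, 0, 1, 0, 0, 1, 1, 1]

  tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
  print(f"missed frauds (false negatives): {fn}")      # 1
  # Recall measures the share of actual frauds that were caught, so
  # maximizing it directly reduces the false-negative rate.
  print(f"recall: {recall_score(y_true, y_pred):.2f}")  # 0.75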

Deployment

Finally, the insights are used to take action and make important business decisions to solve the problem and leverage the entire laborious process undertaken. The success of this process can be assessed by how much value it brings to your business.

Data Mining Models

The models used for data mining can be primarily distinguished into two main types: supervised and unsupervised. The former term is used for models trained on labeled data, whereas unsupervised learning refers to models trained on unlabeled data. These models can be further classified as described below.

Classification

Classification is a supervised learning technique in which a known structure is generalized to distinguish instances in new data. Based on the data, the model creates a set of discrete rules that split the data so as to group the highest proportion of similar target values together. Banks use classification to predict whether a client will default on a loan based on the client's activities.
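
A minimal sketch of this idea with scikit-learn, using fabricated client data:

  from sklearn.tree import DecisionTreeClassifier

  # Fabricated client records: [annual income (k$), existing debt (k$)]
  X = [[50, 5], [30, 20], [80, 10], [25, 30], [60, 2], [20, 25]]
  y = [0, 1, 0, 1, 0, 1]  # 1 = defaulted on a loan

  # The tree learns discrete splitting rules from the labeled examples.
  model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
  print(model.predict([[40, 18]]))  # predicted class for a new client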

Regression

Regression analysis is a statistical method for examining the relationship between two or more variables. It is a supervised learning technique used in predictive analytics to predict a continuous value from one or more variables. For example, companies use regression algorithms to forecast sales in future months based on the sales data of previous months.
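
A small sketch of such a forecast with scikit-learn, on fabricated sales figures:

  from sklearn.linear_model import LinearRegression

  # Fabricated monthly sales; the month index is the single feature.
  months = [[1], [2], [3], [4], [5], [6]]
  sales = [100, 110, 125, 130, 142, 155]

  # Fit a line through past sales and extrapolate one month ahead.
  model = LinearRegression().fit(months, sales)
  print(model.predict([[7]]))  # forecast for month 7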

Anomaly Detection

Also known as outlier detection, anomaly detection is an unsupervised learning technique used to find rare occurrences or suspicious events in your data. The unusual data points may point to a problem or rare event worth further investigation. Anomaly detection can be used in network security to find external intrusions or suspicious user activity, for instance a hacker opening connections on uncommon ports or protocols.
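
A brief sketch of that scenario with scikit-learn's IsolationForest; the connection records are fabricated:

  from sklearn.ensemble import IsolationForest

  # Fabricated connection records: [port number, bytes transferred (KB)]
  X = [[443, 120], [80, 95], [443, 110], [80, 100], [31337, 9000]]

  # The forest isolates points that look unlike the rest of the data.
  detector = IsolationForest(contamination=0.2, random_state=0).fit(X)
  print(detector.predict(X))  # -1 marks the anomalous connection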

Clustering

Another unsupervised learning method, clustering is the practice of assigning labels to unlabeled data using the patterns that exist within it. It helps uncover structure in data by grouping similar data points together. For example, clustering is used to group a large set of documents into categories based on their content.
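
A minimal sketch of grouping documents by topic, with a made-up corpus:

  from sklearn.cluster import KMeans
  from sklearn.feature_extraction.text import TfidfVectorizer

  # Made-up documents; real corpora would be far larger.
  docs = [
      "the stock market rose today",
      "stock prices and the market rally",
      "the football team won the match",
      "a great match for the football team",
  ]

  # Represent each document by word weights, then cluster them.
  X = TfidfVectorizer().fit_transform(docs)
  labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
  print(labels)  # documents on the same topic share a cluster label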

Common Applications

Data mining can be effectively used in marketing to create customer segments based on purchasing patterns extracted from behavioral analysis. These segments can be used to create targeted advertising that varies according to the customer profile and, as a result, increases the conversion rate.

As mentioned above, financial organizations have been using data mining to detect fraudulent transactions. Previous transaction data can be analyzed to extract patterns and flag anomalies that distinguish such transactions. These methods are becoming more effective at combating fraud and anticipating atypical activities that might otherwise go unnoticed.

In the area of natural language processing, data mining, there referred to as text mining, can be extremely useful when analyzing large volumes of news, social media, or other text data. Text mining techniques have evolved to become contextually aware and can gauge sentiment toward a particular product; a firm could use them to understand how the general public feels about its newly launched device. Similarly, text mining helps discover the topics the public is discussing around a particular subject, and it is also used to detect fake news on social media.
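
As a toy sketch of product-sentiment mining, a classifier can be trained on a handful of labeled reviews; everything below is fabricated:

  from sklearn.feature_extraction.text import TfidfVectorizer
  from sklearn.linear_model import LogisticRegression
  from sklearn.pipeline import make_pipeline

  # Tiny fabricated training set; 1 = positive sentiment.
  texts = [
      "love this new device, works great",
      "amazing battery life and screen",
      "terrible phone, broke in a week",
      "awful experience, would not buy again",
  ]
  labels = [1, 1, 0, 0]

  # TF-IDF features feed a simple linear classifier.
  clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
  clf.fit(texts, labels)
  print(clf.predict(["the battery life is amazing"]))  # likely [1]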

About The Author

Rahim Rasool is an Associate Data Scientist at Data Science Dojo (DSD), where he helps create learning material for DSD's data science bootcamp. He holds a bachelor's degree in electrical engineering from the National University of Sciences and Technology. He has a great interest in machine learning, astronomy, and history.

Note: Data Science 101 is proud to have this sponsored post from Data Science Dojo.

Why You Should Attend SQLSaturday – An Interview with John Byrnes

A few weeks ago, I attended my first SQLSaturday event. I brought along my camera and was lucky enough to record a couple of interviews. This is one of those interviews.

I sat down with John Byrnes and we discussed:

  • Where has SQL taken his career?
  • What does the SQLSaturday community mean to him?
  • Why should someone attend a SQLSaturday event?

[Video: SQLSaturday Interview with John Byrnes]

Open Source Data Science Projects 2019

A number of impactful new open source projects have been released lately.

Open Source Data Science Projects

  • Pythia – from Facebook, for deep learning with vision and language, “such as answering questions related to visual data and automatically generating image captions”
  • InterpretML – from Microsoft, a “package for training interpretable models and explaining blackbox systems”
  • MLJ – from the Alan Turing Institute, a machine learning toolbox for Julia
  • Plato – from Uber, a conversational AI platform

Is the list missing a project released in 2019? If so, please leave a comment.

Nuts About Data Book Review

Just released this week, Nuts about Data is a fun introductory book about the data science process. Meor Amer tells a witty story about squirrels, mining for nuts, teamwork, and survival. It brings together the entire data science lifecycle, from asking questions to final storytelling.

It is a quick read and really fun. I highly recommend it and hope you enjoy it.