Fundamentals of Data Mining

We are generating more data today than ever before; by a widely cited estimate, 90 percent of the world's data was generated in the last two years alone. On its own, this data does not mean much unless related patterns are identified within it. Data mining is the process of discovering these patterns and is therefore also known as Knowledge Discovery from Data (KDD).

A definition from the book ‘Data Mining: Practical Machine Learning Tools and Techniques’ by Ian Witten and Eibe Frank describes data mining as follows:

Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. The idea is to build computer programs that sift through databases automatically, seeking regularities or patterns. Strong patterns, if found, will likely generalize to make accurate predictions on future data. … Machine learning provides the technical basis for data mining. It is used to extract information from the raw data in databases…


Take the example of a big supermarket that holds a large number of records of customer purchases. Conventionally, in many supermarkets, this data is used mostly for inventory management or budgeting. However, by using sophisticated data mining tools to diligently scan the data for patterns never seen before, the supermarket management can learn which combinations of products are most often purchased together and how seasonality and other factors influence purchasing decisions. This is the basis of ‘recommender systems’, which are now widely deployed to boost sales by recommending products to frequent customers based on their previous purchases, which in turn increases customer satisfaction.
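The pattern-finding idea behind such recommendations can be illustrated with a small sketch. The baskets below are hypothetical, and the approach (counting the support of product pairs) is a deliberately minimal stand-in for real association-rule miners such as Apriori:

```python
from itertools import combinations
from collections import Counter

# Hypothetical transaction data: each basket is a set of purchased items.
baskets = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
    {"bread", "butter"},
]

# Count how often each pair of products appears together (its "support").
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Keep only pairs bought together in at least 40% of baskets.
min_support = 0.4
frequent = {pair: n / len(baskets)
            for pair, n in pair_counts.items()
            if n / len(baskets) >= min_support}

for pair, support in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(pair, round(support, 2))
```

Here (bread, milk) turns up in 3 of 5 baskets, so a store seeing a customer buy bread might recommend milk. Real systems work the same way at a vastly larger scale and with more sophisticated scoring.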

As you may expect, this process can be extremely arduous and time-consuming and may require a high level of acumen to come up with valuable and actionable insights. Going through such a large amount of data, where trends might not be obvious, could get painful and demotivating. Therefore, learning some useful data mining procedures may prove beneficial in this regard.

You might be wondering what benefits these techniques can bring. As taught in Data Science Dojo’s data science bootcamp, they give you improved prediction and forecasting with respect to your product. An in-depth analysis of trends offers managers a much more reliable way to plan and forecast. It also boosts their decision-making, as decisions become evidence-based rather than mere conjecture and intuition. Additionally, it enables an organization to utilize resources optimally and enhance the customer experience. How mining techniques can be leveraged to fulfill organizational goals is discussed below.

Data Mining Process

The complexity of the entire data mining mechanism can vary according to the size and kind of data an organization has and the aims that are required to be fulfilled. However, in most cases, there will be a generic process that underlies all such activities.

Domain Knowledge

The foremost step of this process is to possess relevant domain knowledge regarding the problem at hand. To anyone looking at a large pile of data, it may seem like a collection of junk unless they have background knowledge of the business. Only then will they be able to get a sense of what sort of data they require, which properties of the data are relevant, and how the data could be used to solve the problem at hand. Once these questions are answered, it becomes easy to stay focused, allocate resources properly, and eventually attain a productive result. This step guides the discovery method and allows discovered patterns to be expressed in concise terms and at different levels of abstraction.

Data Collection

After defining the goals in the previous step, it is essential to collect data. This could involve using data that already exists in a company’s database, obtaining data from external sources, or collecting new data through survey forms filled in by customers. Some experts hold the view that an organization should collect as much data as possible, even if its use is not obvious at an early stage.

Data Cleaning and Preprocessing

Following the collection step comes the most onerous step of all: data cleaning and preprocessing. In simple terms, this step involves dealing with missing values and outliers and removing noise or other misleading components that may lead to false conclusions. It also includes transforming the data into the form required by the mining procedure. This step may take a great deal of resources, effort, and patience.
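As a small illustration of what cleaning can look like in practice, the sketch below imputes missing values and drops an outlier from a hypothetical sales column. The median and the median absolute deviation are used deliberately, since, unlike the mean and standard deviation, they are not distorted by the outlier itself:

```python
import statistics

# Hypothetical raw column of daily sales figures, with missing values (None)
# and an obvious data-entry outlier.
raw = [120.0, 115.0, None, 130.0, 9999.0, 125.0, None, 118.0]

# 1. Impute missing values with the median of the observed values.
observed = [x for x in raw if x is not None]
median = statistics.median(observed)
filled = [x if x is not None else median for x in raw]

# 2. Drop values more than 3 median absolute deviations (MAD) from the median.
mad = statistics.median(abs(x - median) for x in observed)
cleaned = [x for x in filled if abs(x - median) <= 3 * mad]

print(len(cleaned))  # the 9999.0 entry has been dropped
```

Had the mean and standard deviation been used instead, the 9999.0 entry would have inflated both so much that it would have passed its own outlier test, which is why robust statistics are a common choice at this step.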

Analysis and Interpretation

The most crucial part begins by limiting the data to the most important features and engineering new, useful features from combinations of the existing ones. The data is carefully analyzed, and mining algorithms extract hidden patterns from it. The models built with these algorithms can then be evaluated against appropriate metrics to verify their credibility. The choice of metric depends on the nature of the problem: a fraud-detection problem, for example, might treat the false-negative rate as its key metric, since missing a fraudulent transaction is usually costlier than flagging a legitimate one. The patterns discovered in this step are then interpreted using visualization and reporting techniques so that other team members can understand them.
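The fraud example makes the metric choice concrete. The sketch below, using made-up labels, counts false negatives and computes recall, the fraction of actual fraud cases the model caught:

```python
# Hypothetical labels: 1 = fraudulent, 0 = legitimate.
actual    = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0]

true_pos  = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
false_neg = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
false_pos = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)

# Recall: what fraction of the actual fraud cases did the model catch?
# A missed fraud (false negative) usually costs far more than a false alarm.
recall = true_pos / (true_pos + false_neg)
print(true_pos, false_neg, false_pos, recall)
```

With one fraud case missed out of four, recall is 0.75; a fraud team would likely tune the model to push this number up even at the price of more false positives.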


Finally, the insights are used to take action and make important business decisions to solve the problem and leverage the entire laborious process undertaken. The success of this process can be assessed by how much value it brings to your business.

Data Mining Models

The models used for data mining fall primarily into two types: supervised and unsupervised. The former refers to models trained on labeled data, whereas the latter works with unlabeled data. These models can be further classified as described below.


Classification

Classification is a supervised learning technique in which a known structure is generalized to distinguish instances in new data. Based on the data, the model creates sets of discrete rules that split and group the highest proportion of similar target values together. Banks, for example, use classification to predict whether a client will default on a loan payment based on the client’s account activity.
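The loan-default example can be sketched as a minimal one-rule classifier (a "decision stump"). The training data and feature names here are invented for illustration:

```python
# Hypothetical labeled data: (missed_payments, outstanding_balance) -> defaulted?
train = [
    ((0, 1200.0), False),
    ((1,  500.0), False),
    ((4, 3000.0), True),
    ((3, 2500.0), True),
    ((0,  300.0), False),
    ((5, 4000.0), True),
]

# A one-rule "decision stump": pick the missed-payments threshold that
# best separates defaulters from non-defaulters on the training data.
def accuracy(threshold):
    return sum((x[0] >= threshold) == label for x, label in train) / len(train)

best = max(range(0, 6), key=accuracy)

def predict(missed_payments, balance):
    # The stump uses only the missed-payments feature; a real model
    # (a decision tree, say) would combine many such rules over all features.
    return missed_payments >= best

print(best, predict(4, 2000.0), predict(0, 900.0))
```

On this toy data the learned rule is "flag clients with 2 or more missed payments", which classifies the training set perfectly; real classifiers are simply much deeper stacks of rules like this one.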


Regression

Regression analysis is a statistical method for examining the relationship between two or more variables. It is a supervised learning technique used in predictive analytics to estimate a continuous value from one or more input variables. For example, companies use regression algorithms to forecast sales in future months based on the sales data of previous months.
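The sales-forecasting example can be sketched with ordinary least squares fit by hand. The monthly figures below are made up and deliberately follow a clean linear trend:

```python
# Hypothetical monthly sales figures (month index -> sales in units).
months = [1, 2, 3, 4, 5, 6]
sales  = [110.0, 118.0, 126.0, 134.0, 142.0, 150.0]

# Ordinary least squares for one predictor: fit sales = a + b * month.
n = len(months)
mean_x = sum(months) / n
mean_y = sum(sales) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(months, sales))
     / sum((x - mean_x) ** 2 for x in months))
a = mean_y - b * mean_x

# Forecast sales for month 7.
forecast = a + b * 7
print(round(b, 2), round(forecast, 1))
```

The fitted slope of 8 units per month extends the trend to a forecast of 158 units for month 7. Real sales data is noisier and usually needs seasonality terms, but the fitting principle is the same.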

Anomaly Detection

Also known as outlier detection, anomaly detection is an unsupervised learning technique used to find rare occurrences or suspicious events in your data. Unusual data points may indicate a problem or rare event worth further investigation. Anomaly detection can be used in network security to find external intrusions or suspicious user activity, for instance a hacker opening connections on uncommon ports or protocols.
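A minimal version of this idea flags any observation that lies far from the rest. The connection counts below are hypothetical, and the z-score rule is one of the simplest possible detectors:

```python
import statistics

# Hypothetical counts of connections opened per minute by one host.
connections = [4, 5, 3, 6, 4, 5, 4, 98, 5, 3]

mean = statistics.mean(connections)
stdev = statistics.stdev(connections)

# Flag any minute whose count lies more than 2 standard deviations
# above the mean as a potential intrusion attempt.
anomalies = [(i, c) for i, c in enumerate(connections)
             if (c - mean) / stdev > 2]
print(anomalies)
```

The burst of 98 connections in minute 7 is flagged while normal traffic is not. Production systems use the same separate-the-unusual logic with richer features and models.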


Clustering

Another unsupervised learning method, clustering is the practice of assigning labels to unlabeled data using the patterns that exist in it. It helps uncover structure in the data by grouping similar data points together. For example, clustering is used to group a large set of documents into categories based on their content.
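Clustering can be sketched with a bare-bones k-means loop. The points and the fixed starting centers below are chosen for illustration so the result is deterministic; real implementations pick initial centers more carefully:

```python
# Hypothetical 2-D points (e.g. two numeric features of documents).
points = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.2)]
centers = [(0.0, 0.0), (10.0, 10.0)]  # deliberately chosen starting centers

def closest(p):
    # Index of the center nearest to point p (squared Euclidean distance).
    return min(range(len(centers)),
               key=lambda i: (p[0] - centers[i][0]) ** 2
                           + (p[1] - centers[i][1]) ** 2)

for _ in range(10):  # a few refinement passes is plenty for this toy data
    # Assign each point to its nearest center, then move each center
    # to the mean of its assigned points.
    groups = [[p for p in points if closest(p) == i]
              for i in range(len(centers))]
    centers = [(sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g))
               for g in groups]

labels = [closest(p) for p in points]
print(labels)
```

The six points split cleanly into two groups without any labels being provided, which is exactly the behavior used to group similar documents at scale.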

Common Applications

Data mining can be effectively used in marketing to create customer segments based on their purchasing patterns that can be extracted from behavioral analysis. This could be used for creating targeted advertising that varies according to the customer profile and as a result, increases the conversion rate.

As mentioned above, financial organizations use data mining to detect fraudulent transactions. Previous transaction data can be analyzed to extract patterns and find anomalies that help distinguish such transactions. These methods are becoming more effective at combating fraud and at anticipating atypical activity that might otherwise go unnoticed.

In the area of natural language processing, data mining, here referred to as text mining, can be extremely useful when analyzing large volumes of news, social media, or other text data. Text mining techniques have evolved to become contextually aware and can reveal sentiment toward a particular product: a firm could use them to understand how the general public feels about its newly launched device. They can also help discover the topics the public is discussing around a particular subject, and are used to detect fake news on social media as well.
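The simplest form of sentiment analysis just counts words from positive and negative lexicons; as noted above, real systems are contextually aware, but the toy sketch below (with an invented lexicon and invented reviews) shows the basic idea:

```python
# A toy lexicon-based sentiment sketch: score = positive hits - negative hits.
positive = {"great", "love", "excellent", "fast"}
negative = {"poor", "slow", "broken", "disappointing"}

reviews = [
    "love the new device, excellent screen and fast charging",
    "battery life is disappointing and the camera is poor",
    "great value, though shipping was slow",
]

def score(text):
    words = [w.strip(",.") for w in text.lower().split()]
    return sum(w in positive for w in words) - sum(w in negative for w in words)

scores = [score(r) for r in reviews]
print(scores)
```

A positive score suggests positive sentiment, a negative score the opposite, and zero a mixed review. Contextual models improve on this by handling negation ("not great") and sarcasm, which pure word counting misses.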

About The Author

Rahim Rasool is an Associate Data Scientist at Data Science Dojo (DSD) where he helps create learning material for DSD’s data science bootcamp. He holds a bachelor’s in electrical engineering from National University of Sciences and Technology. He possesses great interest in machine learning, astronomy and history.

Note: Data Science 101 is proud to have this sponsored post from Data Science Dojo.

Data Mining and Analysis Textbook (Free Download)

Mohammed J. Zaki, Computer Science Professor at RPI, and Wagner Meira Jr., Computer Science Professor at Universidade Federal de Minas Gerais, have written the textbook Data Mining and Analysis: Fundamental Concepts and Algorithms. The book is currently available as a PDF download.

Based upon the chapters, the book looks very good. It contains large sections on data analysis, clustering, and classification. The final book will be published sometime in 2014.

Data Mining MOOC

The University of Waikato in New Zealand will be offering a free online course titled, Data Mining with Weka.

Weka is a widely used toolkit for data mining and machine learning, developed at the University of Waikato.

Don’t wait too long to sign up; the course starts September 9, 2013.

Here is a video of the instructor of the course providing a brief overview.

Data Mining Standard Processes

There are a couple of standard processes for approaching data mining problems.


The most common approach is Cross Industry Standard Process for Data Mining (CRISP-DM).

Steps of CRISP-DM

  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation
  6. Deployment

The steps are mostly self-explanatory, but the CRISP-DM Wikipedia page has a lengthier description.


The second most popular process for data mining is SEMMA.

Steps of SEMMA

  1. Sample
  2. Explore
  3. Modify
  4. Model
  5. Assess

More details can be found on the SEMMA Wikipedia page.

A Data Science Process?

Other than The Data Scientific Method (which is not a standard), I am not aware of any other process for data science.

Do you know of any processes for data science? Is anyone aware of a group working on standardizing a data science process?

Best Free Data Mining Tools

I recently saw the article, The Best Data Mining Tools You Can Use for Free in Your Company. It contains a very brief description of each of the following tools.

  1. RapidMiner
  2. RapidAnalytics
  3. Weka
  4. PSPP
  5. KNIME
  6. Orange
  7. Apache Mahout
  8. jHepWork
  9. Rattle

See The Best Data Mining Tools You Can Use for Free in Your Company for more details, links, and pictures.