Data preprocessing is a data mining technique used to transform raw data into a useful and efficient format.
The steps involved are:
- Data Cleaning
- Data Integration
- Data Transformation
- Data Reduction
- Data Discretization and Concept Hierarchy Generation
Data Cleaning
Data cleaning in data mining is the process of detecting and removing incomplete, noisy, and inconsistent data.
We can handle incomplete data by the following methods:
- Ignore the tuples - This approach is suitable only when the dataset we have is quite large and multiple values are missing within a tuple.
- Fill the Missing values - We can fill the missing values manually, by attribute mean or the most probable value.
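A minimal sketch of these options, assuming pandas and a small hypothetical table; the column names and values are made up for illustration.

```python
import pandas as pd

# Hypothetical dataset with missing values (NaN)
df = pd.DataFrame({
    "age":    [25, None, 31, 40, None, 29],
    "income": [50_000, 62_000, None, 58_000, 61_000, 60_000],
})

# Option 1: ignore the tuples (drop rows that contain any missing value)
dropped = df.dropna()

# Option 2: fill missing values with the attribute mean
filled_mean = df.fillna(df.mean(numeric_only=True))

# Option 3: fill with the most probable value (approximated here by the column mode)
filled_mode = df.fillna(df.mode().iloc[0])

print(filled_mean)
```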
Noisy data (meaningless data) can be handled by the following methods:
- Binning Method - This method works on sorted data to smooth it. The data is divided into segments of equal size, and each segment is then smoothed independently (a code sketch follows this list).
There are three approaches to performing smoothing:
- Smoothing by bin means: each value in a bin is replaced by the mean value of the bin.
- Smoothing by bin median: each value in a bin is replaced by the median value of the bin.
- Smoothing by bin boundaries: the minimum and maximum values in a given bin are identified as the bin boundaries; each bin value is then replaced by the closest boundary value.
Approach:
- Sort the array of the given data set.
- Divide the range into N intervals, each containing approximately the same number of samples (equal-depth partitioning).
- Replace the values in each bin with the bin mean, median, or boundary values.
Example
Sorted data: 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
Smoothing by bin median:
- Bin 1: 9, 9, 9, 9
- Bin 2: 24, 24, 24, 24
- Bin 3: 29, 29, 29, 29
- Regression - Data can be smoothed by fitting it to a regression function.
- Clustering - This approach groups similar data into a cluster. Values that fall outside of the set of clusters may be considered outliers.
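The sketch below reproduces the worked binning example above in plain NumPy. The bin count, the rounding of bin means, and the choice of the upper-middle element as the median of an even-sized bin are assumptions made so that the output matches the numbers shown.

```python
import numpy as np

data = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])  # already sorted
bins = np.split(data, 3)   # equal-depth partitioning: 3 bins of 4 values each

def smooth(values, method):
    if method == "means":
        return np.full_like(values, round(values.mean()))
    if method == "medians":
        # upper-middle element used as the median of an even-sized bin (assumption)
        return np.full_like(values, np.sort(values)[len(values) // 2])
    if method == "boundaries":
        lo, hi = values.min(), values.max()
        # replace each value by whichever bin boundary is closer
        return np.where(values - lo <= hi - values, lo, hi)
    raise ValueError(method)

for method in ("means", "medians", "boundaries"):
    print(method, [smooth(b, method).tolist() for b in bins])
```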
Data Integration
Data integration is a data preprocessing technique that combines data from multiple sources and provides users with a unified view of these data.
Data integration is formally defined as a triple <G, S, M>, where:
G stands for the global schema,
S stands for the set of heterogeneous source schemas,
M stands for the mappings between queries on the source schemas and the global schema.
There are two major approaches to data integration: the “tight coupling” approach and the “loose coupling” approach.
* Tight Coupling - In this approach, data is combined from different sources into a single physical location through the process of ETL (Extraction, Transformation, and Loading); a small sketch follows this list.
* Loose Coupling - Here, the data remains only in the actual source databases, and an interface translates queries against the unified view into queries on the individual sources.
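As a loose sketch of the tight coupling approach, the example below simulates a tiny ETL run with pandas: two hypothetical source tables with different schemas are extracted, transformed to a common (global) schema, and loaded into a single combined table. All table and column names are made up.

```python
import pandas as pd

# Hypothetical source tables with heterogeneous schemas
crm = pd.DataFrame({"cust_id": [1, 2], "full_name": ["Ada Lovelace", "Alan Turing"]})
erp = pd.DataFrame({"customer": [2, 3], "name": ["Alan Turing", "Grace Hopper"]})

# Transformation: map each source schema onto the global schema (id, name)
crm_global = crm.rename(columns={"cust_id": "id", "full_name": "name"})
erp_global = erp.rename(columns={"customer": "id"})

# Loading: combine into one physical table, dropping duplicate entities
warehouse = (pd.concat([crm_global, erp_global])
               .drop_duplicates(subset="id")
               .reset_index(drop=True))
print(warehouse)
```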
Issues in Data Integration
- Schema Integration
- Redundancy
- Detection and resolution of data value conflicts
Data Transformation
Data transformation is the process of converting data into forms appropriate for the mining process. It involves the following techniques:
- Normalization - Data values are scaled so that they fall within a specified range, such as 0.0 to 1.0 (a sketch follows this list).
- Attribute Selection - In this strategy, new attributes are constructed from the given set of attributes to help the mining process.
- Generalization - Here, low-level data are replaced with high-level data by climbing concept hierarchies. For example, the attribute “city” can be generalized to “country”.
- Smoothing - This removes noise from the data set using techniques such as binning, regression, and clustering.
- Aggregation - Data are stored and presented in a summary format, for example by rolling detailed records up into totals or averages.
- Discretization - This is done to replace the raw values of a numeric attribute by ranges or conceptual levels.
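A minimal sketch of min-max normalization, assuming a small made-up attribute; each value is rescaled into the range 0.0 to 1.0.

```python
import numpy as np

values = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])  # hypothetical attribute values

# Min-max normalization: rescale every value into the range [0.0, 1.0]
normalized = (values - values.min()) / (values.max() - values.min())
print(normalized)  # [0.    0.125 0.25  0.5   1.   ]
```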
Data Reduction
Data reduction techniques can be applied to obtain a reduced representation of the data, which aims to increase storage efficiency and reduce data storage and analysis costs. The main data reduction techniques are:
- Data Cube Aggregation - Aggregation operation is applied to data for the construction of the data cube.
- Numerosity Reduction - This enables storing a model of the data instead of the whole data, e.g., regression models.
- Dimensionality Reduction - This reduces the size of data by encoding mechanisms. It can be lossy or lossless. If the original data can be retrieved after reconstruction from the compressed data, the reduction is called lossless; otherwise it is called lossy. Two effective methods of dimensionality reduction are wavelet transforms and PCA (Principal Component Analysis); a PCA sketch follows this section.
- Attribute Subset Selection - The goal of attribute subset selection is to find a minimum set of attributes such that dropping the remaining irrelevant attributes does not affect the quality of the mining results. Common procedures are:
1. Stepwise Forward Selection
* This procedure starts with an empty set of attributes as the minimal set.
* The most relevant attributes are chosen (those with the minimum p-value) and added to the minimal set.
* In each iteration, one attribute is added to the reduced set.
2. Stepwise Backward Elimination
* Here, all the attributes are considered in the initial set of attributes.
* In each iteration, the attribute whose p-value is higher than the significance level is eliminated from the set.
3. Combination of Forward Selection and Backward Elimination
* The stepwise forward selection and backward elimination are combined to select the relevant attributes most efficiently.
* This is the most common technique which is generally used for attribute selection.
4. Decision Tree Induction
* In this approach, a decision tree is used for attribute selection.
* It constructs a flowchart-like structure in which each internal node denotes a test on an attribute.
* Each branch corresponds to an outcome of the test, and each leaf node denotes a class prediction.
* Attributes that do not appear in the tree are considered irrelevant and hence discarded.
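As an illustrative sketch of dimensionality reduction, the example below applies scikit-learn's PCA to synthetic data; the data-generation scheme and the 95% variance threshold are assumptions, not prescriptions.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical data: 100 tuples with 10 correlated attributes
# (generated from 3 underlying factors plus a little noise)
latent = rng.normal(size=(100, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(100, 10))

# Keep just enough principal components to retain ~95% of the variance (lossy reduction)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)                 # e.g. (100, 10) -> (100, 3)
print("variance retained:", pca.explained_variance_ratio_.sum())
```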
Data Discretization and Concept Hierarchy Generation
Data discretization techniques can be used to divide the range of a continuous attribute into intervals. This leads to a concise, easy-to-use, knowledge-level representation of mining results.
Discretization techniques can be categorized, based on the direction in which they proceed, as top-down or bottom-up.
The concept hierarchy method can be used to reduce the data by collecting and replacing low-level concepts with higher-level concepts.
Typical methods include:
* Binning
* Cluster Analysis
* Histogram Analysis
* Entropy-Based Discretization
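A small sketch of top-down discretization combined with a simple concept hierarchy, using pandas.cut on a hypothetical "age" attribute; the number of intervals and the concept labels are assumptions.

```python
import pandas as pd

ages = pd.Series([4, 12, 19, 23, 35, 41, 58, 63, 77])  # hypothetical continuous attribute

# Top-down discretization into equal-width intervals
intervals = pd.cut(ages, bins=3)

# A simple concept hierarchy: replace numeric intervals with higher-level concepts
concepts = pd.cut(ages, bins=3, labels=["young", "middle-aged", "senior"])

print(pd.DataFrame({"age": ages, "interval": intervals, "concept": concepts}))
```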