Big Data Mining

Big Data Mining: Data Mining with Big Data, Issues, Case Study, Clustering on Big Data, Limitations of the MapReduce Framework, Case Study - Graph Algorithms on MapReduce

Data mining uses tools such as statistical models, machine learning, and visualization to "mine" (extract) useful data and patterns from big data, whereas big data processing handles high-volume, high-velocity data, which is challenging for older databases and analysis programs.


 

ISSUES AND CHALLENGES

The analysis of big data involves multiple distinct phases: data acquisition and recording; information extraction and cleaning; data integration, aggregation, and representation; query processing, data modeling, and analysis; and interpretation. Each of these phases introduces challenges. Heterogeneity, scale, timeliness, complexity, and privacy are key challenges of big data mining.

Heterogeneity and Incompleteness

The difficulty of big data analysis stems both from its large scale and from the presence of mixed data that follows different patterns or rules (heterogeneous mixture data) in what is collected and stored. In complicated heterogeneous mixture data, several patterns and rules coexist, and the properties of the patterns vary greatly. Data can be both structured and unstructured, and transforming it into a structured format for later analysis is a major challenge in big data mining, so new techniques have to be adopted for dealing with such data.

Incomplete data refers to missing field values for some samples. Such gaps create uncertainty during data analysis, and managing them correctly is itself a challenge.
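One common way to manage incompleteness is imputation: filling a missing field value with an estimate derived from the observed values. The sketch below (plain Python; the record layout and the choice of mean imputation are assumptions for illustration) replaces missing numeric fields with the column mean:

```python
def impute_mean(rows, col):
    """Replace missing (None) values in one column with the mean of
    the observed values in that column -- a simple way to manage
    incompleteness before analysis."""
    observed = [r[col] for r in rows if r[col] is not None]
    mean = sum(observed) / len(observed)
    return [{**r, col: mean} if r[col] is None else r for r in rows]

records = [
    {"age": 30, "income": 52000},
    {"age": None, "income": 48000},   # missing age field
    {"age": 40, "income": None},      # missing income field
]
filled = impute_mean(impute_mean(records, "age"), "income")
```

Imputation keeps every sample usable, but it injects an assumption into the data, which is one reason handling incompleteness correctly is hard.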

Scale and complexity

Managing large and rapidly increasing volumes of data is a challenging issue. Traditional software tools are not sufficient for these volumes, and data analysis, organization, retrieval, and modeling are also difficult due to the scalability and complexity of the data that needs to be analyzed.

Timeliness

As the size of the data sets to be processed increases, analysis takes longer, yet in some situations the results are needed immediately. For example, a suspected fraudulent credit card transaction should ideally be flagged before it completes, by preventing the transaction from taking place at all.

Security & Privacy Challenges

As big data expands the sources of data it can use, the trustworthiness of each data source needs to be verified, and techniques should be explored to identify maliciously inserted data.

Security of big data can be enhanced by techniques such as authentication, authorization, encryption, and audit trails.
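As one illustration of an audit trail, log entries can be hash-chained so that tampering with any past record becomes detectable. This is a minimal sketch using Python's standard library (the log format is an assumption for the example), not a full security mechanism:

```python
import hashlib

GENESIS = "0" * 64  # placeholder digest before the first entry

def append_entry(log, message):
    """Append a tamper-evident entry: each record stores the SHA-256
    of (previous digest + message), chaining the whole history."""
    prev = log[-1][1] if log else GENESIS
    digest = hashlib.sha256((prev + message).encode()).hexdigest()
    log.append((message, digest))
    return log

def verify(log):
    """Recompute the chain; any altered message breaks every later link."""
    prev = GENESIS
    for message, digest in log:
        if hashlib.sha256((prev + message).encode()).hexdigest() != digest:
            return False
        prev = digest
    return True

trail = []
append_entry(trail, "user=alice action=read table=sales")
append_entry(trail, "user=bob action=update table=sales")
```

Because each digest covers the previous one, rewriting an old entry without recomputing the entire suffix of the log is immediately visible to `verify`.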


Clustering on Big Data

  • A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.
  • Clustering is the process of organizing a set of abstract objects into classes of similar objects.
  • A cluster of data objects can be treated as one group.
  • In cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.
  • The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.

 Clustering methods can be classified into the following categories −

  • Partitioning Method (k-means)
  • Hierarchical Method
  • Density-based Method
  • Grid-Based Method
  • Model-Based Method
  • Constraint-based Method

 1. Partitioning Method

Suppose we are given a database of ‘n’ objects. A partitioning method constructs ‘k’ partitions of the data, where each partition represents a cluster and k ≤ n. That is, it classifies the data into k groups, which satisfy the following requirements −

  • Each group contains at least one object. 
  • Each object must belong to exactly one group. 

2. Hierarchical Methods 
This method creates a hierarchical decomposition of the given set of data objects. Hierarchical methods are classified on the basis of how the hierarchical decomposition is formed. There are two approaches:

  • Agglomerative: This is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.  
  • Divisive: This is a "top-down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
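The agglomerative ("bottom-up") approach can be sketched in a few lines of Python. This is a minimal illustration (the toy data and the choice of single linkage, i.e., closest-pair distance between clusters, are assumptions for the example), merging the two closest clusters until k remain:

```python
import math

def single_linkage(a, b):
    """Distance between two clusters = distance of their closest pair."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerative(points, k):
    """Bottom-up clustering: start with one cluster per point and
    repeatedly merge the two closest clusters until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest linkage distance.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i].extend(clusters.pop(j))    # merge j into i
    return clusters

data = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9)]
result = agglomerative(data, k=2)
```

Keeping the full merge sequence instead of stopping at k would yield the dendrogram that hierarchical methods are usually drawn as.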

3. Density-based Method  

This method is based on the notion of density. The basic idea is to keep growing a given cluster as long as the density in the neighborhood exceeds some threshold, i.e., the neighborhood of a given radius around each data point in the cluster has to contain at least a minimum number of points.
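A minimal density-based sketch in the style of DBSCAN makes the idea concrete (the eps radius, min_pts threshold, and toy data are assumptions for illustration): a cluster keeps growing while each newly reached point still has a dense enough neighborhood, and sparse points are labeled as noise.

```python
import math

def neighbors(points, idx, eps):
    """Indices of all points within radius eps of points[idx]."""
    return [j for j, q in enumerate(points) if math.dist(points[idx], q) <= eps]

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN-style clustering: grow a cluster from each
    unvisited core point (one whose eps-neighborhood holds at least
    min_pts points).  Points in no dense region get label -1 (noise)."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(points, i, eps)
        if len(seeds) < min_pts:
            labels[i] = -1                     # not dense enough: noise (for now)
            continue
        labels[i] = cluster_id
        while seeds:                           # keep growing while dense
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster_id         # noise reachable from a core point joins
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            more = neighbors(points, j, eps)
            if len(more) >= min_pts:           # j is itself a core point: expand further
                seeds.extend(more)
        cluster_id += 1
    return labels

data = [(0.0, 0.0), (0.0, 0.3), (0.3, 0.0),
        (5.0, 5.0), (5.0, 5.3), (5.3, 5.0),
        (20.0, 20.0)]                          # isolated point -> noise
labels = dbscan(data, eps=0.5, min_pts=3)
```

Unlike k-means, this needs no predefined number of clusters and naturally separates outliers as noise.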

4. Grid-based Method
In this method, the objects together form a grid: the object space is quantized into a finite number of cells that form a grid structure. The major advantage of this method is its fast processing time, which depends only on the number of cells in each dimension of the quantized space.
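A minimal sketch of the quantization step in Python (the cell size and toy data are assumptions for illustration): each object is mapped to the grid cell containing it, so subsequent processing works on the occupied cells rather than on the raw points.

```python
from collections import defaultdict

def grid_cells(points, cell_size):
    """Quantize the object space into cells of side cell_size: each
    point maps to the cell holding its coordinates, so later steps
    cost time proportional to the number of occupied cells rather
    than the number of points."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(int(x // cell_size) for x in p)  # integer cell coordinates
        cells[key].append(p)
    return cells

data = [(0.2, 0.4), (0.7, 0.9), (3.1, 3.3), (3.8, 3.9)]
cells = grid_cells(data, cell_size=1.0)
```

A grid-based clusterer would then merge adjacent dense cells instead of comparing individual points, which is where the speed advantage comes from.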

K-means Algorithm

  • K-means clustering is a well-known partitioning method.
  • In it, objects are classified as belonging to one of K groups.
  • The result of a partitioning method is a set of K clusters, with each object of the data set belonging to exactly one cluster.
  • Each cluster may have a centroid or a cluster representative. For real-valued data, the arithmetic mean of the attribute vectors of all objects in a cluster provides an appropriate representative; other kinds of data may require alternative types of centroid.

Steps Of K-means Clustering Algorithm 

  • The K-means clustering algorithm classifies a given data set into K clusters; the value of K (the number of clusters) is fixed in advance by the user.
  • First, a centroid is chosen for each cluster; then each data point is assigned to the cluster whose centroid is at minimum distance from it.
  • Euclidean distance is used to calculate the distance of a data point from a particular centroid.
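The steps above can be sketched in plain Python. This is a minimal illustration (the toy data, fixed seed, and convergence test are assumptions for the example), not a production implementation:

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means: pick k initial centroids, then alternate the
    assignment and centroid-update steps until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # initial centroids (k fixed by user)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: recompute each centroid as the mean of its members;
        # an empty cluster keeps its old centroid (one simple way to handle
        # the empty-cluster problem).
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:          # converged
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated groups in 2-D.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
        (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, clusters = kmeans(data, k=2)
```

Because the two groups are far apart, the alternation settles on the natural 3/3 split regardless of which data points are drawn as initial centroids.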
Advantages
  • K-means clustering is fast, robust, and easy to understand. If the data sets are well separated from each other, it gives its best results.
  • The clusters do not overlap and are non-hierarchical in nature.

Disadvantages

  • The computational complexity is higher than that of some other algorithms.
  • The cluster centers (and hence the number of clusters) must be predefined.
  • Handling empty clusters: another problem with K-means is that empty clusters can be generated during execution, if no data points are allocated to a cluster during the assignment phase.

Challenges of clustering big data
The challenges of clustering big data are characterized by three main components:
1. Volume: as the scale of the data generated by modern technologies rises exponentially, clustering methods become computationally expensive and do not scale up to very large datasets.
2. Velocity: this refers to the speed at which data arrives in the system. Dealing with high-velocity data requires the development of more dynamic clustering methods to derive useful information in real time.
3. Variety: current data are heterogeneous and mostly unstructured, which makes managing, merging, and governing the data extremely challenging.


Conventional clustering algorithms cannot handle the complexity of big data for the reasons above. For example, the k-means problem is NP-hard even when the number of clusters is small. Consequently, scalability is a major challenge in big data. Traditional clustering methods were developed to run on a single machine, and various techniques are used to improve their performance. For instance, sampling performs clustering on a sample of the data and then generalizes the result to the whole dataset; this reduces the amount of memory needed to process the data but results in lower accuracy. Another technique is feature reduction, where the dataset is projected into a lower-dimensional space to speed up the mining process.
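Both techniques can be sketched briefly in Python (the Gaussian random-projection matrix, the sampling fraction, and the toy data are assumptions for illustration):

```python
import random

def random_projection(points, out_dim, seed=0):
    """Feature reduction: project points into a lower-dimensional space
    using a random Gaussian matrix, a cheap transform that roughly
    preserves pairwise distances."""
    rng = random.Random(seed)
    in_dim = len(points[0])
    # One random direction per output dimension.
    matrix = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]
    scale = 1 / out_dim ** 0.5
    return [
        tuple(scale * sum(w * x for w, x in zip(row, p)) for row in matrix)
        for p in points
    ]

def sample(points, fraction, seed=0):
    """Sampling: cluster a random subset instead of the full dataset,
    then generalize -- trading accuracy for memory."""
    rng = random.Random(seed)
    k = max(1, int(len(points) * fraction))
    return rng.sample(points, k)

high_dim = [tuple(float(i + j) for j in range(50)) for i in range(100)]  # 100 points, 50-D
low_dim = random_projection(high_dim, out_dim=5)   # 50-D -> 5-D
subset = sample(high_dim, fraction=0.1)            # cluster only 10% of the points
```

A clustering algorithm would then be run on `low_dim` or `subset`, with the loss of accuracy traded against the reduction in memory and computation.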

 

Comparing: RDBMS vs. Hadoop

             Traditional RDBMS        Hadoop / MapReduce
Data Size    Gigabytes (Terabytes)    Petabytes (Exabytes)
Access       ...

https://www.analyticsvidhya.com/blog/2016/11/an-introduction-to-clustering-and-different-methods-of-clustering/
