Hierarchical Clustering Method - BIRCH

 

Hierarchical Clustering Method

A hierarchical clustering method works by grouping data objects into a tree of clusters. Hierarchical clustering methods can be further classified into two types:
  • Agglomerative hierarchical clustering (Bottom-Up) - This bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters until all of the objects are in a single cluster or until certain termination conditions are satisfied. Most hierarchical clustering methods belong to this category (a short sketch follows this list).
  • Divisive hierarchical clustering (Top-Down) - This top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces until each object forms a cluster on its own or until it satisfies certain termination conditions.
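
As a minimal illustration of the bottom-up strategy, the sketch below runs agglomerative clustering with scikit-learn on a few hand-made 2-D points; the toy data, the choice of n_clusters=2, and the average linkage are illustrative assumptions, not part of the original text.

# Minimal agglomerative (bottom-up) clustering sketch.
# Assumptions: toy 2-D points and n_clusters=2 chosen only for illustration.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[3, 4], [2, 6], [4, 5], [4, 7], [3, 8], [10, 10], [11, 9]])

# Each point starts as its own cluster; the closest clusters are merged
# repeatedly until only n_clusters remain (the termination condition here).
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(X)
print(labels)  # e.g. [0 0 0 0 0 1 1]
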

BIRCH

  • BIRCH stands for Balanced Iterative Reducing and Clustering Using Hierarchies, which uses hierarchical methods to cluster and reduce data.
  • BIRCH only needs to scan the data set in a single pass to perform clustering.
  • Given n d-dimensional data objects or points in a cluster, we can define the centroid x0 (the mean of the points), the radius R (the average distance from the member points to the centroid), and the diameter D (the average pairwise distance within the cluster). All three can be computed directly from the CF triplet introduced below.
  • The BIRCH algorithm uses a tree structure, generally called the Clustering Feature Tree (CF Tree), to build clusters. Each node of this tree is composed of several Clustering Features (CFs).
  • The Clustering Feature tree structure is similar to a balanced B+ tree.
  • Each node, including the leaf nodes, holds several CFs; the CFs of internal nodes contain pointers to child nodes, and all leaf nodes are linked by a doubly linked list.
  • Each CF is a triplet, which can be represented as (N, LS, SS),
Where N is the number of sample points summarized by the CF,
LS is the vector sum of the sample points in the CF (one entry per feature dimension),
SS is the sum of squares of the sample points' feature values in the CF.
  • For example, suppose a CF in one node of the CF Tree summarizes the following 5 samples: (3,4), (2,6), (4,5), (4,7), (3,8). Then the corresponding triplet is N = 5, LS = (3+2+4+4+3, 4+6+5+7+8) = (16, 30), and SS = (3²+2²+4²+4²+3², 4²+6²+5²+7²+8²) = (54, 190). A small sketch computing these values, and the centroid, radius, and diameter derived from them, follows.
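
As a rough sketch (not the reference implementation from the BIRCH paper), the snippet below maintains a CF triplet for the five example points and derives the centroid, radius, and diameter from it; the class name, the per-dimension form of SS, and the toy data are assumptions made for illustration.

# Minimal Clustering Feature (CF) sketch: (N, LS, SS) for one subcluster.
# Assumptions: per-dimension SS, Euclidean distance, toy 2-D points.
import numpy as np

class ClusteringFeature:
    def __init__(self, dim):
        self.N = 0                  # number of points summarized
        self.LS = np.zeros(dim)     # linear sum of the points
        self.SS = np.zeros(dim)     # per-dimension sum of squares

    def add(self, x):
        x = np.asarray(x, dtype=float)
        self.N += 1
        self.LS += x
        self.SS += x * x

    def centroid(self):             # x0 = LS / N
        return self.LS / self.N

    def radius(self):               # R = sqrt(sum(SS)/N - ||LS/N||^2)
        return np.sqrt(np.sum(self.SS) / self.N - np.sum(self.centroid() ** 2))

    def diameter(self):             # D = sqrt((2N*sum(SS) - 2*||LS||^2) / (N*(N-1)))
        num = 2 * self.N * np.sum(self.SS) - 2 * np.sum(self.LS ** 2)
        return np.sqrt(num / (self.N * (self.N - 1)))

cf = ClusteringFeature(dim=2)
for p in [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]:
    cf.add(p)

print(cf.N, cf.LS, cf.SS)   # 5 [16. 30.] [ 54. 190.]
print(cf.centroid())        # [3.2 6. ]
print(cf.radius(), cf.diameter())
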
Algorithm

Phase 1: Build the CF Tree. Load the data into memory by building a cluster-feature tree (the CF Tree described above). Optionally, condense this initial CF Tree into a smaller CF Tree.
Phase 2: Global Clustering. Apply an existing clustering algorithm to the leaf entries of the CF Tree. Optionally, refine these clusters with additional passes over the data. A usage sketch follows.
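
To make the two phases concrete, here is a hedged sketch using scikit-learn's Birch implementation; the threshold, branching_factor, n_clusters values, and the synthetic data are illustrative assumptions. Passing n_clusters applies the global-clustering phase to the leaf subclusters of the CF Tree.

# Sketch of the two BIRCH phases with scikit-learn (illustrative parameters).
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

# Phase 1: the CF Tree is built while the data is scanned once;
# 'threshold' bounds the subcluster radius, 'branching_factor' bounds CFs per node.
# Phase 2: the leaf subclusters are clustered globally into n_clusters groups.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)

print(len(model.subcluster_centers_))  # number of leaf subclusters (CF entries)
print(np.bincount(labels))             # sizes of the two final clusters
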

Advantages
  • Memory savings: all samples remain on disk, and the CF Tree stores only the CF entries and the corresponding pointers.
  • Fast clustering: building the CF Tree requires only one scan of the training set, and insertions, deletions, and modifications of the CF Tree are very fast.
  • Noise points can be identified, and the data set can be pre-clustered as a preliminary preprocessing step for other algorithms.
Disadvantages
  • Since the CF Tree limits the number of CFs per node, the clustering result may differ from the true category distribution.
  • Clustering quality degrades on high-dimensional feature spaces.
  • If the clusters in the data set are not roughly hyper-spherical (convex), the clustering result is poor.



