ML Basic #6 | Clustering


Introduction

  • Major Clustering Algorithms

    • (1) Distribution Based Clustering : EM Algorithm

    • (2) Centroid Based Clustering : K-means

    • (3) Connectivity Based Models : Hierarchical Clustering

    • (4) Density Based Models : DBSCAN

  • Spark built-in Clustering



1. Distribution Based Clustering

  • Data points are grouped based on the probability of belonging to a Gaussian distribution; the EM algorithm alternates between estimating each point's membership probabilities and updating the parameters of the mixture of Gaussians.
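
  • Example : a minimal sketch of this idea using scikit-learn's GaussianMixture, which fits a Gaussian mixture by EM; the data and parameter values here are illustrative assumptions:

from sklearn.mixture import GaussianMixture
import numpy as np

# Two illustrative blobs of 2-D points
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [9, 7]])

# Fit a 2-component Gaussian mixture via EM, then read off assignments
gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.predict(X))        # hard cluster index per point, e.g. [0 0 0 1 1 1]
print(gm.predict_proba(X))  # soft membership probability per component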


2. Centroid Based Clustering

  • An iterative algorithm in which data is organized into clusters based on how close data points are to the centre (centroid) of each cluster

2.1 K-Means

  • Introduction : assigns data points to clusters based on the shortest distance to the cluster centroids

  • Methods : minimize the sum of distances between data points and their respective cluster centroids

    1. $k$ initial means (centroids) are randomly generated within the data domain

    2. $k$ clusters are created by associating every observation with the nearest mean (centroid)

    3. The centroid of each of the $k$ clusters becomes the new mean

    4. Repeat steps 2 and 3 until convergence is reached (assignments no longer change); see the sketch below
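
  • Example : an illustration of these steps using scikit-learn's KMeans, which implements the assign/update loop above (Lloyd's algorithm); the data and k=2 are assumptions for the example:

from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [9, 7]])

# k=2 random initial centroids; iterate assignment and centroid updates
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster index per point
print(kmeans.cluster_centers_)  # final centroids (the cluster means)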



3. Connectivity Based Clustering (Hierarchical Clustering)

  • This is similar to the centroid model and is also known as Hierarchical clustering. It is a cluster analysis method that seeks to build a hierarchy of clusters.

3.1 Agglomerative Clustering

  • Introduction : performs hierarchical clustering using a bottom-up approach

  • Methods : each observation starts in its own cluster, and clusters are successively merged together.

  • Usage :

    • sklearn.cluster.AgglomerativeClustering : scikit-learn's implementation of this bottom-up merging (see the sketch below)
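
  • Example : a minimal sketch, reusing the same illustrative toy data as the other examples in this note:

from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [9, 7]])

# Each point starts in its own cluster; pairs of clusters are merged
# bottom-up until n_clusters remain (default linkage is 'ward')
agg = AgglomerativeClustering(n_clusters=2).fit(X)
print(agg.labels_)  # cluster index per point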



4. Density Based Models

  • Searches for areas of dense data points and assigns those areas to the same cluster.

4.1 DBSCAN (Density Based Spatial Clustering of Applications with Noise)

  • Introduction : finds core samples of high density and expands clusters from them.

  • Methods

    1. Calculate the pairwise distances of the given dataset

    2. If at least MinPts data points lie within distance eps of the target data point, the target becomes a core point; otherwise it is noise (or a border point, if it lies within eps of some core point)

  • Usage

    • sklearn.cluster.DBSCAN : supports a "precomputed" distance metric and tree-based neighbor searches (ball tree / k-d tree) to speed up the pairwise distance computation; see the examples below

from sklearn.cluster import DBSCAN
import numpy as np

# Two dense groups plus one far-away outlier
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
clustering = DBSCAN(eps=3, min_samples=2).fit(X)
print(clustering.labels_)  # [ 0  0  0  1  1 -1] -- label -1 marks noise
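
  • Example : the "precomputed" metric mentioned above can be exercised like this; a hedged sketch where the distance matrix is built with sklearn.metrics.pairwise_distances:

from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
D = pairwise_distances(X)  # precompute the full pairwise distance matrix
clustering = DBSCAN(eps=3, min_samples=2, metric='precomputed').fit(D)
print(clustering.labels_)  # [ 0  0  0  1  1 -1], same result as above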

4.2 HDBSCAN (Hierarchical DBSCAN, 2013)

  • Introduction : HDBSCAN extends DBSCAN by converting it into a hierarchical clustering algorithm, and then using a technique to extract a flat clustering based on the stability of clusters.

  • Methods (see the usage sketch after these steps) :

    1. Transform the space according to the density/sparsity : the single-linkage algorithm at the core of the method is very sensitive to noise, so a more robust distance metric is needed

      • Mutual Reachability Distance : the maximum of the core distances of $a$ and $b$ and the direct distance between them, where $\mathrm{core}_k(x)$ is the distance from $x$ to its $k$-th nearest neighbor

        $$ d_{\mathrm{mreach}\text{-}k}(a, b) = \max\{\mathrm{core}_k(a),\ \mathrm{core}_k(b),\ d(a, b)\} $$

    2. Build the minimum spanning tree of the distance-weighted graph : each graph node is a data sample, and each edge weight is the mutual reachability distance between its endpoints

    3. Construct a cluster hierarchy of connected components.

    4. Condense (Cut) the cluster hierarchy based on minimum cluster size

    5. Extract the stable clusters from the condensed tree : we want to choose clusters that persist and have a longer lifetime in the hierarchy
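
  • Usage

    • A minimal sketch, assuming scikit-learn >= 1.3, where sklearn.cluster.HDBSCAN is available (the standalone hdbscan package offers a similar interface); the data and min_cluster_size value are illustrative:

from sklearn.cluster import HDBSCAN
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

# min_cluster_size is the minimum size used when condensing the hierarchy
hdb = HDBSCAN(min_cluster_size=2).fit(X)
print(hdb.labels_)  # cluster index per point; -1 marks noise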


4.3 OPTICS (Ordering Points to Identify the Clustering Structure)

  • Introduction : OPTICS is closely related to DBSCAN; it finds core samples of high density and expands clusters from them. Unlike DBSCAN, it keeps the cluster hierarchy for a variable neighborhood radius, which makes it better suited for large datasets.

    • OPTICS does not require the number of clusters to be set in advance; instead, it extracts the clustering structure of the data and produces a reachability plot

      • This gives the user more flexibility in selecting the number of clusters, by cutting the reachability plot at a certain point.

  • Methods (see the sketch after these steps) :

    • Define a density threshold parameter, EPS, which controls the minimum density of clusters.

    • For each point in the dataset, calculate the distance to its k-nearest neighbors.

    • Starting from an arbitrary point, calculate the reachability distance of each point in the dataset, based on the density of its neighbors.

    • Order the points based on their reachability distance and create the reachability plot.

    • Extract clusters from the reachability plot by grouping points that are close to each other and have similar reachability distances.
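
  • Example : a hedged sketch using sklearn.cluster.OPTICS; the data and parameter values are illustrative assumptions:

from sklearn.cluster import OPTICS
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

# min_samples plays the role of MinPts; no cluster count is given up front
clust = OPTICS(min_samples=2).fit(X)
print(clust.labels_)                         # cluster index per point; -1 is noise
print(clust.reachability_[clust.ordering_])  # the values behind the reachability plot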



