Introduction
- Major Clustering Algorithms
  - (1) Distribution Based Clustering : EM Algorithm
  - (2) Centroid Based Clustering : K-means
  - (3) Connectivity Based Models
  - (4) Density Based Models : DBSCAN
Spark built-in Clustering
1. Distribution based Clustering
- Data points are grouped based on the probability of belonging to a Gaussian distribution.
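- A minimal sketch (added for illustration, not part of the original notes; the data is made up) using scikit-learn's GaussianMixture, which fits a mixture of Gaussians with the EM algorithm:

from sklearn.mixture import GaussianMixture
import numpy as np

# toy data: two loose groups plus an outlier (illustrative only)
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

# fit a 2-component Gaussian mixture with the EM algorithm
gm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gm.predict(X))        # hard assignment to the most probable component
print(gm.predict_proba(X))  # soft (probabilistic) assignments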
2. Centroid Based Clustering
- An iterative algorithm in which data is organized into clusters based on how close data points are to the centre (centroid) of the clusters
2.1 K-Means
- Introduction : assigns data points to clusters based on the shortest distance to the cluster centroids
- Methods : minimize the sum of distances between data points and their respective cluster centroids (see the sketch after this list)
  - $k$ initial means (centroids) are randomly generated within the data domain
  - $k$ clusters are created by associating every observation with the nearest mean (centroid)
  - The centroid of each of the $k$ clusters becomes the new mean
  - Repeat the above two steps until convergence is reached
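- A minimal sketch of the loop above (added for illustration; scikit-learn's KMeans on made-up data):

from sklearn.cluster import KMeans
import numpy as np

# toy data: two loose groups plus an outlier (illustrative only)
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

# k-means: initialize k centroids, assign points to the nearest centroid,
# recompute centroids, repeat until assignments stop changing
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(km.labels_)           # cluster index of each point
print(km.cluster_centers_)  # final centroids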
3. Connectivity Based Clustering (Hierarchical Clustering)
- This is similar to the centroid model and is also known as Hierarchical clustering. It is a cluster analysis method that seeks to build a hierarchy of clusters.
3.1 Agglomerative Clustering
- Introduction : performs hierarchical clustering using a bottom-up approach.
- Methods : each observation starts in its own cluster, and clusters are successively merged together.
- Usage:
  - sklearn.cluster.AgglomerativeClustering : the linkage criterion determines the metric used for the merge strategy (see the sketch below)
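- A minimal sketch (added for illustration; made-up data), where linkage="ward" merges the pair of clusters that least increases within-cluster variance:

from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

# bottom-up: every point starts as its own cluster; the two closest
# clusters (under the linkage criterion) are merged until 2 remain
agg = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)
print(agg.labels_)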
4. Density Based Models
- Searches for areas of dense data points and assigns those areas to the same cluster.
4.1 DBSCAN (Density Based Spatial Clustering of Applications with Noise)
- Introduction : finds core samples of high density and expands clusters from them.
- Methods (see the sketch after this list)
  - Calculate the pairwise distances of the given dataset
  - If there are at least MinPts data points within distance eps of a point, that point is a core point; otherwise it is noise (unless it lies within eps of a core point, in which case it joins that cluster as a border point)
  - Clusters are grown by connecting core points that lie within eps of each other
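- A minimal sketch of the core-point test (added for illustration; the data, eps, and MinPts values are made up):

import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
eps, min_pts = 3, 2

# pairwise Euclidean distances
dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# a point is a core point if at least MinPts points (counting itself)
# lie within distance eps of it; everything else is border or noise
is_core = (dist <= eps).sum(axis=1) >= min_pts
print(is_core)  # >> [ True  True  True  True  True False]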
Usuage
- sklearn.cluster.DBSCAN : provides faster clustering by means of “precomputed”
metric
and Tree-basedalgorithm
to calc pointwise distances
- sklearn.cluster.DBSCAN : provides faster clustering by means of “precomputed”
- Example:
from sklearn.cluster import DBSCAN
import numpy as np
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
clustering = DBSCAN(eps=3, min_samples=2).fit(X)
print(clustering.labels_)  # >> [ 0  0  0  1  1 -1]
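- To illustrate the "precomputed" option mentioned above (an added sketch, not part of the original example), the same clustering can be run on an explicit distance matrix:

from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

# precompute the pairwise distance matrix once and reuse it
D = pairwise_distances(X)

# metric="precomputed" tells DBSCAN that fit() receives distances, not points
clustering = DBSCAN(eps=3, min_samples=2, metric="precomputed").fit(D)
print(clustering.labels_)  # >> [ 0  0  0  1  1 -1]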
4.2 HDBSCAN (Hierarchical DBSCAN, 2013)
- Introduction : HDBSCAN extends DBSCAN by converting it into a hierarchical clustering algorithm, and then using a technique to extract a flat clustering based on the stability of clusters.
- Methods :
  - Transform the space according to density/sparsity : the single-linkage algorithm at the core of the clustering is very sensitive to noise, so we need a more robust metric for distance calculation
  - Mutual Reachability Distance : with $core_k(x)$ defined as the distance from $x$ to its $k$-th nearest neighbor,
    $$ MRD(a, b) = \max\{\, core_k(a),\ core_k(b),\ d(a, b) \,\} $$
    i.e. the largest of $a$'s core distance, $b$'s core distance, and the plain distance $d(a, b)$ (see the sketch after this list)
  - Build the minimum spanning tree of the distance-weighted graph : each graph node is a data sample, and each edge weight is the MRD score (fig. 1)
  - Construct a cluster hierarchy of connected components.
  - Condense (cut) the cluster hierarchy based on minimum cluster size
  - Extract the stable clusters from the condensed tree : we want to choose clusters that persist and have a longer lifetime.
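- A minimal sketch of the mutual reachability transform and the spanning-tree step (added for illustration; $k$ and the data are made up):

import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from sklearn.neighbors import NearestNeighbors

X = np.array([[1.0, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])
k = 3

# core_k(x): distance from x to its k-th nearest neighbor
# (the query point counts as its own first neighbor here)
core_k = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(X)[0][:, -1]

# plain pairwise distances d(a, b)
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# MRD(a, b) = max(core_k(a), core_k(b), d(a, b))
mrd = np.maximum(np.maximum(core_k[:, None], core_k[None, :]), d)

# minimum spanning tree of the MRD-weighted graph (the next step above)
np.fill_diagonal(mrd, 0)
print(minimum_spanning_tree(mrd).toarray())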
4.3 OPTICS (Ordering Points to Identify the Clustering Structure)
- Introduction : OPTICS is closely related to DBSCAN: it finds core samples of high density and expands clusters from them. Unlike DBSCAN, it keeps the cluster hierarchy for a variable neighborhood radius, which makes it better suited for large datasets.
- OPTICS does not require setting the number of clusters in advance; instead, it extracts the clustering structure of the data and produces a reachability plot
  - This gives the user more flexibility in selecting the number of clusters, by cutting the reachability plot at a certain point.
- Methods (see the sketch after this list):
  - Define a density threshold parameter, eps, which controls the minimum density of clusters.
  - For each point in the dataset, calculate the distance to its k-nearest neighbors.
  - Starting from an arbitrary point, calculate the reachability distance of each point in the dataset, based on the density of its neighbors.
  - Order the points by their reachability distance and create the reachability plot.
  - Extract clusters from the reachability plot by grouping points that are close to each other and have similar reachability distances.
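- A minimal sketch (added for illustration; made-up data) using scikit-learn's OPTICS, including a DBSCAN-style cut of the reachability plot:

import numpy as np
from sklearn.cluster import OPTICS, cluster_optics_dbscan

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80]])

# OPTICS orders the points and records each point's reachability distance
clust = OPTICS(min_samples=2).fit(X)
print(clust.reachability_[clust.ordering_])  # values of the reachability plot

# cutting the reachability plot at eps = 3 recovers a DBSCAN-style labeling
labels = cluster_optics_dbscan(
    reachability=clust.reachability_,
    core_distances=clust.core_distances_,
    ordering=clust.ordering_,
    eps=3,
)
print(labels)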