
CURE Algorithm

CURE (Clustering Using Representatives) is an efficient cluster analysis algorithm for large databases. Compared to the k-means method, the algorithm is more robust to outliers and is able to identify clusters that are not spherical, as well as clusters with large variation in size.


Disadvantages of Traditional Algorithms

The popular k-means algorithm minimizes the squared error:

$E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - m_i)^2,$

where $m_i$ is the mean (centroid) of cluster $C_i$.

When clusters differ greatly in size or geometry, the squared-error approach can split large clusters in order to reduce the total squared error, which is not always desirable. Hierarchical clustering algorithms suffer from the same problem, since none of the usual inter-cluster distance measures ($d_{min}$, $d_{mean}$) handles clusters of differing shapes well. In addition, their running time becomes prohibitive when the number of points $n$ is large.
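As a concrete reading of the formula above, the following minimal sketch computes the squared error E for a given assignment of points to clusters (the function and variable names are illustrative, not from any particular library):

    import numpy as np

    def squared_error(clusters):
        """Total k-means squared error E for a list of point arrays.

        clusters : list of (n_i, d) arrays, one array per cluster C_i
        """
        total = 0.0
        for pts in clusters:
            m_i = pts.mean(axis=0)              # centroid of cluster C_i
            total += ((pts - m_i) ** 2).sum()   # sum of squared deviations
        return total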

The problem with the BIRCH algorithm is that, when forming the final clusters after step 3, it uses only the centroids of the clusters and assigns each point to the cluster with the nearest centroid. Using only centroids to redistribute points works poorly when the clusters do not have uniform sizes and shapes.

CURE Clustering Algorithm

To avoid the problems caused by non-uniform cluster sizes and shapes, CURE employs a hierarchical clustering algorithm that adopts a middle ground between the centroid-based and the all-point extremes. In CURE, a constant number c of well-scattered points is selected from each cluster, and these points are shrunk toward the centroid of the cluster by a fixed fraction. The shrunken points are then used as representatives of the cluster. At each step of CURE's hierarchical clustering, the two clusters with the closest pair of representative points are merged. This enables CURE to correctly identify clusters of arbitrary shape and makes it less sensitive to outliers.
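The selection and shrinking of representative points can be sketched as follows. This is a minimal illustration, with greedy farthest-point selection used as one plausible way to obtain well-scattered points; the function name and default values are ours, not fixed by the description above:

    import numpy as np

    def cluster_representatives(points, c=5, alpha=0.5):
        """Pick c well-scattered points of a cluster and shrink them
        toward its centroid by the fraction alpha.

        points : (n, d) array of the cluster's points
        """
        centroid = points.mean(axis=0)
        # Farthest-point selection: start with the point farthest from the
        # centroid, then repeatedly add the point farthest from those
        # already chosen.
        chosen = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
        while len(chosen) < min(c, len(points)):
            dists = np.min([np.linalg.norm(points - q, axis=1) for q in chosen],
                           axis=0)
            chosen.append(points[np.argmax(dists)])
        scattered = np.array(chosen)
        # Shrinking toward the centroid damps the effect of outliers.
        return scattered + alpha * (centroid - scattered)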

The running time of the algorithm is O(n² log n), which makes it rather expensive for large inputs, and its space complexity is O(n).

Because of this computational cost, the algorithm cannot be applied directly to a large database. The following enhancements address this issue.

  • Random sampling: random sampling makes it possible to handle large data sets, and in the typical case the sample fits in main memory (RAM). Random sampling represents a trade-off between accuracy and efficiency.
  • Partitioning: the basic idea is to partition the sample space into p partitions, each containing n/p points. In a first pass, each partition is clustered until the number of clusters in it falls to n/(pq) for some constant $q \geq 1$, leaving n/q partial clusters in total; a second pass then clusters these partial clusters. Only the representative points need to be stored, since the merge procedure requires only the representatives of the clusters being merged to compute the representatives of the combined cluster. Partitioning the input reduces the execution time.
  • Labeling data on disk: given only the representative points of the k clusters, the remaining data points must also be assigned to clusters. To do this, a randomly selected subset of the representative points of each of the k clusters is used, and each data point is assigned to the cluster containing the representative closest to it (see the sketch after this list).
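A minimal sketch of the labeling step, assuming the representatives of each cluster are held as an array of points (the function and argument names are illustrative):

    import numpy as np

    def label_points(points, representatives):
        """Assign every point to the cluster with the nearest representative.

        points          : (n, d) array of data points
        representatives : list of (c_i, d) arrays, one per cluster
        Returns an (n,) integer array of cluster indices.
        """
        labels = np.empty(len(points), dtype=int)
        for i, p in enumerate(points):
            # Distance from p to the nearest representative of each cluster.
            nearest = [np.min(np.linalg.norm(reps - p, axis=1))
                       for reps in representatives]
            labels[i] = int(np.argmin(nearest))
        return labels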

Pseudo-code

CURE (number of points, k)

Input: a set of points S

Output: k clusters

  • For each cluster u (initially each input point), u.mean stores the centroid of the cluster's points and u.rep stores the set of c representative points of the cluster (initially c = 1, since each cluster contains a single point). u.closest stores the cluster closest to u.
  • All input points are inserted into a k-d tree T.
  • Each input point is treated as a separate cluster; u.closest is computed for each cluster u, and each cluster is inserted into a heap Q (clusters are ordered by increasing distance from u to u.closest).
  • While size(Q) > k:
      • Remove the top element u of the heap Q and merge it with its closest cluster v = u.closest, then compute the new representative points of the merged cluster w.
      • Remove u and v from T and Q.
      • For every cluster x in Q, update x.closest and relocate x within the heap.
      • Insert w into Q.
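For concreteness, below is a compact Python sketch of this loop. It omits the k-d tree and the heap (so it runs in roughly O(n³) time rather than O(n² log n)) and reuses the cluster_representatives helper sketched earlier; it illustrates the merge logic, not the tuned algorithm:

    import numpy as np

    def cure_cluster(points, k, c=5, alpha=0.5):
        """Agglomerative CURE loop: repeatedly merge the two clusters whose
        representative points are closest, until k clusters remain.

        points : (n, d) array; returns a list of k lists of point indices.
        """
        # Each cluster starts as a single point that is its own representative.
        clusters = [{"members": [i], "rep": points[i:i + 1]}
                    for i in range(len(points))]

        def rep_distance(a, b):
            # Distance between the closest pair of representatives of a and b.
            return min(np.linalg.norm(p - q)
                       for p in a["rep"] for q in b["rep"])

        while len(clusters) > k:
            # Find the pair of clusters with the closest representatives.
            i, j = min(((i, j)
                        for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))),
                       key=lambda ij: rep_distance(clusters[ij[0]],
                                                   clusters[ij[1]]))
            members = clusters[i]["members"] + clusters[j]["members"]
            merged = {"members": members,
                      "rep": cluster_representatives(points[members], c, alpha)}
            del clusters[j]   # j > i, so delete j first
            del clusters[i]
            clusters.append(merged)
        return [cl["members"] for cl in clusters]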

Availability

  • The open-source pyclustering library includes an implementation of the CURE algorithm in Python and C++.
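A typical invocation looks like the following; the constructor and method names match pyclustering's documented interface, though the exact signature may differ between versions:

    import numpy as np
    from pyclustering.cluster.cure import cure

    data = np.random.rand(100, 2).tolist()   # pyclustering expects a list of points
    cure_instance = cure(data, 2)            # request 2 clusters
    cure_instance.process()                  # run the algorithm
    clusters = cure_instance.get_clusters()  # lists of point indices, one per cluster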

See also

  • K-means method
