Clustering by fast search and find of density peaks
Alex Rodriguez and Alessandro Laio
Cluster analysis is aimed at classifying elements into categories on the basis of their similarity. Its applications range from astronomy to bioinformatics, bibliometrics, and pattern recognition. We propose an approach based on the idea that cluster centers are characterized by a higher density than their neighbors and by a relatively large distance from points with higher densities. This idea forms the basis of a clustering procedure in which the number of clusters arises intuitively, outliers are automatically spotted and excluded from the analysis, and clusters are recognized regardless of their shape and of the dimensionality of the space in which they are embedded. We demonstrate the power of the algorithm on several test cases.

Clustering algorithms attempt to classify elements into categories, or clusters, on the basis of their similarity. Several different clustering strategies have been proposed (1), but no consensus has been reached even on the definition of a cluster. In K-means (2) and K-medoids (3) methods, clusters are groups of data characterized by a small distance to the cluster center. An objective function, typically the sum of the distances to a set of putative cluster centers, is optimized (3–6) until the best cluster center candidates are found. However, because a data point is always assigned to the nearest center, these approaches are not able to detect nonspherical clusters (7). In distribution-based algorithms, one attempts to reproduce the observed realization of data points as a mix of predefined probability distribution functions (8); the accuracy of such methods depends on the capability of the trial probability to represent the data. Clusters with an arbitrary shape are easily detected by approaches based on the local density of data points.
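The nearest-center assignment that limits K-means to roughly spherical clusters is easy to see in code. Below is a minimal sketch of Lloyd's algorithm, a standard K-means iteration; the implementation details (random initialization, fixed iteration count) are illustrative choices, not those of the cited works. Every point is assigned to its nearest center, and each center then moves to the mean of its assigned points.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm for K-means (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Initialize centers at k distinct randomly chosen data points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center -- the
        # step that makes the method blind to nonspherical cluster shapes.
        d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points,
        # keeping the old center if a cluster ends up empty.
        centers = np.array([points[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers
```

Because the assignment step partitions space into convex regions around the centers, an elongated or ring-shaped cluster is necessarily split across several centers, which is the nonspherical failure mode noted above.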
In density-based spatial clustering of applications with noise (DBSCAN) (9), one chooses a density threshold, discards as noise the points in regions with densities lower than this threshold, and assigns to different clusters disconnected regions of high density. However, choosing an appropriate threshold can be nontrivial, a drawback not present in the mean-shift clustering method (10, 11). There, a cluster is defined as a set of points that converge to the same local maximum of the density distribution function. This method allows the finding of nonspherical clusters but works only for data defined by a set of coordinates and is computationally costly. Here, we propose an alternative approach. Similar to the K-medoids method, it has its basis only in the distance between data points. Like DBSCAN and the mean-shift method, it is able to detect nonspherical clusters and to automatically find the correct number of clusters.
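The proposed approach rests on two per-point quantities, defined formally in what follows: a local density and the distance to the nearest point of higher density. As an illustration only, the following Python sketch computes both quantities from pairwise distances; the Gaussian-kernel density with cutoff parameter d_c and the Euclidean distance computation are illustrative assumptions, not choices fixed by the text here.

```python
import numpy as np

def density_peaks(points, d_c):
    """Illustrative density-peaks step: compute, for every point, a local
    density rho and delta, the distance to the nearest denser point."""
    # Pairwise Euclidean distances; the method itself needs only distances.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    # Smooth (Gaussian-kernel) density estimate; subtract the self-term.
    rho = np.exp(-(d / d_c) ** 2).sum(axis=1) - 1.0
    delta = np.empty(len(points))
    for i in range(len(points)):
        higher = np.where(rho > rho[i])[0]
        # By convention the globally densest point takes the largest distance.
        delta[i] = d[i].max() if higher.size == 0 else d[i, higher].min()
    return rho, delta
```

Candidate cluster centers are then the points for which both quantities are anomalously large; for well-separated groups of points, the density peak of each group combines high rho with a delta on the order of the inter-group distance, while every other point has a small delta.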
The cluster centers are defined, as in the mean-shift method, as local maxima in the density of data points. However, unlike the mean-shift method, our procedure does not require embedding the data in a vector space and maximizing explicitly the density field for each data point. The algorithm has its basis in the assumptions that cluster centers are surrounded by neighbors with lower local density and that they are at a relatively large distance from any points with a higher local density. For each data point i, we compute two quantities: its local density ρ_i and its distance δ_i from points of higher density. Both these quantities depend only on the distances between data points, which are assumed to satisfy the triangular inequality. The local density ρ_i of data point i is defined as