Subquadratic Approximation Algorithms For Clustering Problems in High Dimensional Spaces
Allan Borodin, Rafail Ostrovsky and Yuval Rabani
Abstract:
One of the central problems in data mining, computational biology,
statistical analysis, computer vision, geographic analysis, pattern
recognition, distributed protocols, and hierarchical web search
engines (like Yahoo) is classifying data according to some clustering
rule. Often the data is noisy, so even approximate classification is
of great importance. The difficulty of such classification stems from
the fact that the data typically has many incomparable attributes,
which leads to clustering problems in high-dimensional spaces. Since
they require measuring the distance between every pair of data
points, standard algorithms for computing exact clustering solutions
use quadratic or "nearly quadratic" running time, i.e.,
O(d·n^(2−α(d))), where n is the number of data points, d is the
dimension of the space, and α(d) approaches 0 as d grows. In this
paper, we show (for three fairly natural clustering rules) that an
approximate solution can be computed much more efficiently.
More specifically, for agglomerative clustering (used, for example,
in the Alta Vista search engine), …
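
To make the quadratic barrier above concrete, the following is a
minimal Python sketch of exact single-linkage agglomerative
clustering; it is an illustration of why the exact approach is
quadratic, not the paper's algorithm, and the function name and NumPy
usage are our own assumptions. Building the all-pairs distance matrix
alone already costs Θ(d·n^2) time, which is precisely the cost the
paper's subquadratic approximation algorithms avoid.

import numpy as np

def exact_single_linkage(points, k):
    """Naive agglomerative (single-linkage) clustering of n points in
    R^d down to k clusters. Illustrative sketch only: the all-pairs
    distance matrix below costs Theta(d * n^2) time and Theta(n^2)
    space, the quadratic barrier discussed in the abstract."""
    n = points.shape[0]
    # Theta(d * n^2): Euclidean distance between every pair of points,
    # via ||x - y||^2 = ||x||^2 - 2 x.y + ||y||^2.
    sq = np.sum(points ** 2, axis=1)
    dist = np.sqrt(np.maximum(
        sq[:, None] - 2.0 * points @ points.T + sq[None, :], 0.0))
    np.fill_diagonal(dist, np.inf)
    active = list(range(n))   # ids of the current clusters
    labels = np.arange(n)     # cluster id assigned to each point
    while len(active) > k:
        # find the closest pair of active clusters
        sub = dist[np.ix_(active, active)]
        p, q = np.unravel_index(np.argmin(sub), sub.shape)
        a, b = active[p], active[q]
        # single linkage: distance to the merged cluster is the minimum
        dist[a, :] = np.minimum(dist[a, :], dist[b, :])
        dist[:, a] = dist[a, :]
        dist[a, a] = np.inf
        labels[labels == b] = a   # merge cluster b into cluster a
        active.remove(b)
    return labels

For example, exact_single_linkage(np.random.randn(1000, 64), k=10)
clusters 1,000 points in 64 dimensions, but the n^2 cost of this exact
method quickly becomes infeasible at the data sizes that motivate the
approximation algorithms in the paper.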
Comment:
Appeared in the Machine Learning Journal, Special Issue: Theoretical
Advances in Data Clustering (Guest Editors: Nina Mishra and Rajeev
Motwani), 56(1-3): 153-167, 2004.
Preliminary version in the proceedings of the 31st ACM Symposium on
Theory of Computing (STOC-99).