**
Streaming K-Means on well-Clusterable Data
**

*
Vladimir Braverman,
Adam Meyerson,
Rafail Ostrovskyy,
Alan Roytmans,
Michael Shindler,
Brian Tagiku
*

**
Abstract:
**

One of the central problems in data-analysis is K-means clustering. In recent years, considerable attention in the literature addressed the "streaming" variant of this problem, culminating in a series of results (Har-Peled and Mazumdar; Frahling and Sohler; Frahling, Monemizadeh, and Sohler; Chen) that produced a (1=E)-approximation for K-means clustering in the streaming setting. Unfortunately, since optimizing the K-means objective is Max-SNP hard, all algorithms that achieve a (1+E)-approximation must take time exponential in K unless P=NP.

Thus, to avoid exponential dependence on K, some additional assumptions must be made to guarantee high quality approximation and polynomial running time. A recent paper of Strovsky,Rabani, Schulman, and Swamy (FOCS 2006 Introduced the very natural assumption of "data separability" the assumption closely reflects how K-means clustering in the non-streaming setting with polynomial running time even for large values of K. Their work left open a natural and important question: Are similar results possible in a streaming setting? This is the question we answer in this paper, albeit using substantially different techniques.

We show a near-optimal streaming approximation algorithm for K-means in high -dimensional Euclidean space with sublinear memory and single pass, under the same data separability assumption. Our algorithm offers significant improvements in both space and running time over previous work while yielding asymptotically best-possible performance (assuming that the running time must fully polynomial and P=/ NP)

The novel techniques we develop along the way imply a number of additional results:we provide a high-probability performance guarantee for online facility location (in contrast, Meyerson's FOCS 2001 algorithm gave bounds only in expectation); we develop a constant
approximation method for general class of semi-metric clustering problems; we improve (even without Qseparability) by a logarithmic factor space requirements for streaming constant-approximation for K-median; finally we design a "re-sampling method" in a streaming setting to convert any constant approximation for clustering to a [1+O](Q)^{2
-approximation for Q-separable data.
}

**comment:**
SODA 2011:26-40

Fetch PDF file of the paper

Back to Publications List |