The Zen of Cluster Counting
Sunday, May 25th, 2008Problem I’m using k-means (or insert-clustering-gizmo-here) algorithm. How many clusters shall I partition my data into?
Solution Consider the scale of the clustering, which is naively the zoom setting at which the data is plotted. At a wide enough zoom, all data is one cluster, at telephoto, each point is a cluster. Scale is chosen at the outset of the problem determined not by clustering algorithm but by what is to be achieved by the clustering.
Let’s characterize the correct number of clusters at a given scale.
- A big Tibshirani Gap: Across-group variance of n-grouped similarly-distributed random data is much higher than the across-group variance of n-grouped given data.
- How big is big? Try various values of n and pick the largest.
- How do you generate similarly distributed radom data in high dimensions??
- Low across-iteration variance in variance: If the number of clusters hits the sweet spot, the grouping will be stable across iterations; i.e. a global minima will exist for minimum variance which can be attained several times. For n-clustered data, the variance measure at each iteration will be stable.
- What’s the variance for multi-dimensional data? For a basic implementation: the trace of covariance matrix.
- How many iterations? Thousands of them.