Archive for the ‘statistics’ Category

Bon Statistiques: The Chernoff Bound

Thursday, August 13th, 2009

The beauty of the Chernoff bound, like everything admirable, lies in the minimalism.

In the simplest form, it does this: Give me a random number between -1 and 1, with mean 0 and variance \sigma. The Chernoff bound can tell you themaximum possible probability of it being greater than a number \alpha, say p.

If the probability were any higher than p, the variance would need to be bumped up above the given \sigma. So this probability is intimately linked to \sigma.

Sketch of proof:
- Prove for single random variable
- Extend proof to two random variables.
- Invoke mathematical induction.

Ref: http://cseweb.ucsd.edu/~klevchen/techniques/chernoff.pdf

The Zen of Cluster Counting

Sunday, May 25th, 2008

Problem I’m using k-means (or insert-clustering-gizmo-here) algorithm. How many clusters shall I partition my data into?

Solution Consider the scale of the clustering, which is naively the zoom setting at which the data is plotted. At a wide enough zoom, all data is one cluster, at telephoto, each point is a cluster. Scale is chosen at the outset of the problem determined not by clustering algorithm but by what is to be achieved by the clustering.

Let’s characterize the correct number of clusters at a given scale.

  1. A big Tibshirani Gap: Across-group variance of n-grouped similarly-distributed random data is much higher than the across-group variance of n-grouped given data.
    • How big is big? Try various values of n and pick the largest.
    • How do you generate similarly distributed radom data in high dimensions??
  2. Low across-iteration variance in variance: If the number of clusters hits the sweet spot, the grouping will be stable across iterations; i.e. a global minima will exist for minimum variance which can be attained several times. For n-clustered data, the variance measure at each iteration will be stable.
    • What’s the variance for multi-dimensional data? For a basic implementation: the trace of covariance matrix.
    • How many iterations? Thousands of them.

Clustering, Gap Statistic, Multidimensional Scaling (miscellany from Ch. 13-14)

Saturday, May 3rd, 2008

(Really useless without the author’s color illustrations.)

Exhibits for a superficial introduction to K-means clustering, extended to Gaussian mixture modeling, followed by a description of the Gap Statistic and multidimensional scaling.