Archive for the ‘math’ Category

Fast Cross-correlogram

Monday, May 26th, 2008

Problem I want to compute a cross-correlogram. Fast. Can it not be done quick and dirty in the spectral domain?

Solution Cross-correlation is straightforward to compute in the spectral domain: it is the inverse transform of ft(sig1)*conj(ft(sig2)). Computing the cross-correlogram requires successive windowed cross-correlations. MATLAB's specgram computes exactly such successive windowed transforms: can we leverage it?
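A minimal sketch of the idea in Python, using only the standard library. The function names (fft, xcorr_fft, cross_correlogram) are made up for illustration, and the small radix-2 FFT stands in for whatever FFT routine you actually use. It assumes real signals and a power-of-two window length, and it computes the circular cross-correlation of each window, so zero-pad the windows if wrap-around matters.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT (unnormalized); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def xcorr_fft(sig1, sig2):
    """Circular cross-correlation r[m] = sum_n sig1[n+m] * sig2[n] of real signals."""
    n = len(sig1)
    f1, f2 = fft(sig1), fft(sig2)
    prod = [a * b.conjugate() for a, b in zip(f1, f2)]  # ft(sig1)*conj(ft(sig2))
    return [v.real / n for v in fft(prod, inverse=True)]  # inverse transform

def cross_correlogram(sig1, sig2, win, hop):
    """One cross-correlation per window; rows are time, columns are lag."""
    rows = []
    for start in range(0, len(sig1) - win + 1, hop):
        rows.append(xcorr_fft(sig1[start:start + win], sig2[start:start + win]))
    return rows
```

With sig1 equal to sig2 delayed by two samples, the peak of xcorr_fft(sig1, sig2) sits at lag 2.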

The Zen of Cluster Counting

Sunday, May 25th, 2008

Problem I’m using the k-means (or insert-clustering-gizmo-here) algorithm. How many clusters shall I partition my data into?

Solution Consider the scale of the clustering, which is naively the zoom setting at which the data is plotted. At a wide enough zoom, all the data is one cluster; at telephoto, each point is a cluster. The scale is chosen at the outset of the problem, and it is determined not by the clustering algorithm but by what the clustering is meant to achieve.

Let’s characterize the correct number of clusters at a given scale.

  1. A big Tibshirani Gap: The within-group variance (pooled across groups) of n-grouped similarly-distributed random data is much higher than the within-group variance of the n-grouped given data.
    • How big is big? Try various values of n and pick the one with the largest gap.
    • How do you generate similarly distributed random data in high dimensions?
  2. Low across-iteration variance in variance: If the number of clusters hits the sweet spot, the grouping will be stable across iterations; i.e. a global minimum of the variance exists and is attained again and again. For correctly n-clustered data, the variance measure will be nearly the same at each iteration.
    • What’s the variance for multi-dimensional data? For a basic implementation: the trace of covariance matrix.
    • How many iterations? Thousands of them.
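A rough sketch of criterion 2 in Python, standard library only. Everything named here (kmeans_wss, wss_spread, make_blobs) is hypothetical, the k-means is a bare-bones Lloyd iteration, and the variance measure is the within-cluster sum of squares (the trace of the within-cluster scatter). The demo uses 20 restarts rather than thousands just to keep it quick.

```python
import random
import statistics

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_wss(points, k, rng, n_iter=50):
    """Run Lloyd's k-means once; return the within-cluster sum of squares."""
    centers = rng.sample(points, k)  # random initial centers drawn from the data
    for _ in range(n_iter):
        # Assignment step: nearest center for every point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Update step: move each center to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return sum(sq_dist(p, centers[l]) for p, l in zip(points, labels))

def wss_spread(points, k, restarts=20, seed=0):
    """Mean and variance of the variance measure across random restarts."""
    rng = random.Random(seed)
    scores = [kmeans_wss(points, k, rng) for _ in range(restarts)]
    return statistics.mean(scores), statistics.pvariance(scores)

def make_blobs(seed=1):
    """Toy data: three Gaussian blobs in 2-D (an assumption for the demo)."""
    rng = random.Random(seed)
    centres = [(0, 0), (10, 0), (0, 10)]
    return [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
            for cx, cy in centres for _ in range(30)]
```

Calling wss_spread(make_blobs(), k) for a range of k and looking for a k whose second return value is small is the "variance in variance" check described above.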

Support Vector Classifiers and Machines 12.1-12.3

Wednesday, April 23rd, 2008

Hastie, Tibshirani & Friedman, Chapter 12.1-12.3, (without the authors’ color illustrations)

An introduction to support vector classifiers, developed as a formalization of the notion of an optimal separating hyperplane, leading to support vector machines by basis expansion of the data. The kernel trick for easy adoption of arbitrary basis expansions (even when the basis functions themselves are not known) and the use of SVMs for regression are also presented.
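To make the kernel trick concrete, here is a small Python sketch (function names are illustrative, not from the slides). For 2-D inputs, the degree-2 polynomial kernel (1 + <x, y>)^2 gives the same number as an inner product in an explicit six-term basis expansion, so the expansion never has to be formed.

```python
import math

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel on 2-D inputs: (1 + <x, y>)^2."""
    return (1 + x[0] * y[0] + x[1] * y[1]) ** 2

def poly2_features(x):
    """Explicit basis expansion whose inner product equals poly2_kernel."""
    r2 = math.sqrt(2)
    return (1, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1])

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.5, -2.0), (0.5, 3.0)
# Both routes agree up to rounding; the kernel needs only one 2-D inner product.
same = abs(poly2_kernel(x, y) - dot(poly2_features(x), poly2_features(y))) < 1e-9
```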


R from 0 to “What seems to be the problem, Officer?”

Tuesday, April 8th, 2008

Links for learning R, a clone of the statistical processing language S from Bell Labs, ordered roughly along the abscissa of the learning curve.

PRIM, MARS, Hierarchical Mix of Experts

Thursday, April 3rd, 2008

Hastie, Tibshirani & Friedman, Chapter 9.1-9.7, (without the authors’ color illustrations)

A discussion of methods to handle data with high dimensionality by assuming some structure in the data. The data is broken down into regions and approximated as either piecewise contours (PRIM), or linear splines (MARS), or by linear/logistic functions (mixture of experts). Also discussed are techniques to handle missing predictors, and the running time of these techniques.
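As a tiny illustration of the linear-spline building block used by MARS, here is a Python sketch of a reflected pair of hinge functions, (x - t)+ and (t - x)+, and a one-variable additive model built from them. The names and the (coef, knot, sign) term format are assumptions made for this example.

```python
def hinge_pair(x, knot):
    """Reflected pair of hinge functions: (x - knot)+ and (knot - x)+."""
    return max(0.0, x - knot), max(0.0, knot - x)

def mars_predict(x, intercept, terms):
    """Piecewise-linear fit: terms is a list of (coef, knot, sign),
    with sign +1 selecting (x - knot)+ and sign -1 selecting (knot - x)+."""
    total = intercept
    for coef, knot, sign in terms:
        right, left = hinge_pair(x, knot)
        total += coef * (right if sign > 0 else left)
    return total
```

For example, mars_predict(x, 1.0, [(2.0, 1.0, +1)]) is flat at 1.0 for x below the knot at 1.0 and rises with slope 2.0 above it.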


Python Resources for Scientists, Engineers, and Statisticians

Tuesday, March 18th, 2008

Ordered roughly by relevance for getting to know the kool tool:

LaTeX Math appears in font other than Computer Modern

Wednesday, March 12th, 2008

Problem: I want Computer Modern fonts in my presentation, but these weird text-land fonts show up in my equations. CM are there in dvi and ps (and even pdf) when I use gv for preview, but disappear in Ad0b3 Reader.

(Btw, I needed to use a symbol, and to use it I used the package txfonts (or pxfonts). That couldn’t hurt?)

Resolution: txfonts adjusts Times for mathematics, and pxfonts adjusts Palatino. Thus, these packages affect all the math, and are not “just for a few symbols” packages.

The symbol you’re looking for — if you found it in tx/pxfonts — will definitely be available from some other package (latex pre-defined or amsfonts, etc.); so look harder in symbols-letter.pdf (or symbols-a4.pdf) and DON’T use tx/pxfonts if Times/Palatino math is not your thing.

Model Assessment and Selection

Tuesday, March 4th, 2008

Hastie, Tibshirani & Friedman, Chapter 7.1-7.9, (without the authors’ color illustrations)

A discussion of methods to evaluate the performance of models from a class in order to choose a model that balances explanation, generalization and interpretability. The progress of ideas is as follows: various measures of performance error, bias-variance decomposition for squared error, estimation of in-sample error using the optimism estimate and training error, Mallows’ C_p statistic, AIC, BIC, Minimum Description Length, and the Vapnik-Chervonenkis dimension as the measure of complexity of a model.
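A small Python sketch of how AIC and BIC trade fit against complexity. It assumes the common unnormalized forms, AIC = -2 loglik + 2d and BIC = -2 loglik + d log N, for a Gaussian model with d parameters and N observations; the book may scale these differently. The function names and the numbers in the example are invented for illustration.

```python
import math

def gaussian_loglik(rss, n):
    """Maximized Gaussian log-likelihood given the residual sum of squares."""
    sigma2 = rss / n  # maximum-likelihood variance estimate
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)

def aic(loglik, d):
    return -2 * loglik + 2 * d

def bic(loglik, d, n):
    return -2 * loglik + math.log(n) * d

# Hypothetical comparison: a 3-parameter model against a 6-parameter model
# that fits only slightly better on N = 100 points.
n = 100
small = gaussian_loglik(rss=50.0, n=n)
large = gaussian_loglik(rss=49.5, n=n)
aic_gap = aic(large, 6) - aic(small, 3)        # positive: AIC prefers the small model
bic_gap = bic(large, 6, n) - bic(small, 3, n)  # larger still: BIC penalizes size harder
```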


Organized Sources of Knowledge Online

Tuesday, February 12th, 2008

Linear Methods for Classification

Thursday, January 31st, 2008

Hastie, Tibshirani & Friedman, Chapter 4.1-4.3, (without the authors’ color illustrations)

A discussion of linear methods for classification, using discriminant functions derived by two methods: regression on indicator functions, and approximation by Gaussian functions (of equal or different covariance, leading to linear discriminants and quadratic discriminants respectively). A note on how Fisher arrives at the same discriminants from a different viewpoint concludes this lecture.
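For the equal-covariance Gaussian case, the linear discriminant for class k is x' S^-1 mu_k - 0.5 mu_k' S^-1 mu_k + log(pi_k), and a point goes to the class with the largest score. Below is a minimal Python sketch; the function names are illustrative, and the inverse covariance is assumed to be supplied already rather than estimated from data.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def lda_score(x, mean, cov_inv, prior):
    """Linear discriminant: x' S^-1 mu - 0.5 mu' S^-1 mu + log(prior)."""
    s_mu = [dot(row, mean) for row in cov_inv]  # S^-1 mu
    return dot(x, s_mu) - 0.5 * dot(mean, s_mu) + math.log(prior)

def classify(x, means, cov_inv, priors):
    """Index of the class with the largest discriminant score."""
    scores = [lda_score(x, m, cov_inv, p) for m, p in zip(means, priors)]
    return max(range(len(scores)), key=scores.__getitem__)
```

With identity covariance and equal priors this reduces to picking the nearest class mean; unequal priors shift the boundary toward the rarer class's mean.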
