Projection pursuit (the search for the best projection) is a statistical method that seeks a projection of multidimensional data for which a chosen projection quality function attains its maximum.
Scope
Although people perceive visual information well, they can only analyze images of low dimension. Multidimensional data therefore cannot be inspected visually in a direct way. The problem is addressed by considering two- or three-dimensional projections of the data. The projections are visualized with the usual tools: scatter plots, histograms, box plots and so on.
Method Idea
We need to decide which projection is the most “interesting”. One approach to automating this choice rests on the following considerations (for clarity, we consider projection onto a straight line). A projection is considered “uninteresting” if the histogram of the projected data resembles a normal density, as in Figure 1.
Projections with a bimodal (two-peaked) distribution, as in Figure 2, are considered “interesting”.
The bimodal distribution is more interesting because it suggests that the data may contain two clusters.
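The distinction can be illustrated with a small sketch on synthetic data. The two-cluster data set and the helper names `project` and `central_fraction` are assumptions made for this illustration only; the fraction of points near the center is a crude stand-in for looking at a histogram, not a projection index used in the literature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): two clusters separated along the first coordinate.
n = 1000
cluster_a = rng.normal(loc=[-3.0, 0.0], scale=1.0, size=(n, 2))
cluster_b = rng.normal(loc=[3.0, 0.0], scale=1.0, size=(n, 2))
X = np.vstack([cluster_a, cluster_b])
X = X - X.mean(axis=0)  # center the data


def project(X, alpha):
    """Project the rows of X onto the line spanned by alpha (normalized to unit length)."""
    alpha = np.asarray(alpha, dtype=float)
    alpha = alpha / np.linalg.norm(alpha)
    return X @ alpha


def central_fraction(z, half_width=1.0):
    """Fraction of projected points lying within half_width of zero."""
    return np.mean(np.abs(z) < half_width)


z_interesting = project(X, [1.0, 0.0])    # clusters stay separated: bimodal histogram
z_uninteresting = project(X, [0.0, 1.0])  # clusters overlap: roughly normal histogram

frac_bimodal = central_fraction(z_interesting)
frac_unimodal = central_fraction(z_uninteresting)
print(f"near the center: bimodal {frac_bimodal:.3f}, unimodal {frac_unimodal:.3f}")
```

In the bimodal projection the region around the center is almost empty, while in the roughly normal projection most of the mass sits there.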
Projection Index
To automate the search for the most “interesting” projection, one uses a specially chosen quality function, often called an index. The projection that maximizes the quality function is declared the best. Searching for a projection of multidimensional data by maximizing a projection quality function is called projection pursuit. The choice of index determines how useful the result will be. Several choices of the projection quality function are described below.
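Search for a one-dimensional projection
We introduce the notation. Let $X$ be a $p$-dimensional random vector; we assume throughout that it is centered, i.e. $\operatorname{E} X = 0$.
Let $\alpha$ denote a $p$-dimensional numerical vector. Finding the best projection amounts to finding this vector, and the projection then takes the form $\alpha^{\top} X$.
For a sample of $n$ observations, the data matrix $\mathcal{X}$ has dimension $n \times p$, $\alpha \in \mathbb{R}^{p}$, and $\mathcal{X}\alpha$ is the projected data.
The simplest index is the variance of the linear combination, $I(\alpha) = \operatorname{Var}(\alpha^{\top} X)$, under the normalization constraint $\alpha^{\top}\alpha = 1$. Maximizing this index yields the direction of the first principal component.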
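A minimal sketch of the variance index, assuming synthetic data and the hypothetical helper name `variance_index`. Under the unit-norm constraint, the maximizer is the eigenvector of the sample covariance matrix with the largest eigenvalue, which the code checks against randomly drawn directions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic correlated data (assumption), centered column-wise.
A = np.array([[3.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.5, 0.5]])
X = rng.normal(size=(500, 3)) @ A.T
X = X - X.mean(axis=0)


def variance_index(X, alpha):
    """Variance of the projected data for the direction alpha, rescaled to unit norm."""
    alpha = alpha / np.linalg.norm(alpha)
    return np.var(X @ alpha)


# bias=True divides by n, matching the default of np.var.
S = np.cov(X, rowvar=False, bias=True)
eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
alpha_best = eigvecs[:, -1]           # eigenvector of the largest eigenvalue
best_value = variance_index(X, alpha_best)

# No random direction should beat the leading eigenvector.
random_dirs = rng.normal(size=(1000, 3))
random_values = np.array([variance_index(X, a) for a in random_dirs])
print(best_value, random_values.max())
```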
Common Projection Indices
Friedman and Tukey Approach
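Jerome Friedman and John Tukey (1974) measured how “interesting” a multidimensional distribution is by considering the index
$$I_{FT,h}(\alpha) = \frac{1}{n} \sum_{i=1}^{n} \hat f_{h,\alpha}(\alpha^{\top} X_i),$$
where $\hat f_{h,\alpha}$ denotes the kernel density estimate obtained from the projected data,
$$\hat f_{h,\alpha}(z) = \frac{1}{n} \sum_{j=1}^{n} K_h(z - \alpha^{\top} X_j),$$
and $K_h(u) = K(u/h)/h$ is a kernel with bandwidth $h$.
If the multidimensional random variable $X$ has a normal distribution, then each projection $z = \alpha^{\top} X$ has a standard normal distribution, provided that $\|\alpha\| = 1$ and $X$ is centered and sphered (scaled to identity covariance). Changes in $I_{FT,h}(\alpha)$ as $\alpha$ varies therefore indicate deviations from normality.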
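The following sketch evaluates a kernel-based index of this form, the average over the sample of the kernel density estimate of the projected data, on a grid of directions in the plane. The Gaussian kernel, the bandwidth `h=0.3`, the synthetic two-cluster data, the grid search and the helper names `sphere` and `friedman_tukey_index` are all assumptions of the illustration, not choices prescribed by the method.

```python
import numpy as np


def sphere(X):
    """Center the data and rescale it to identity covariance (Mahalanobis transformation)."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return Xc @ eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T


def friedman_tukey_index(z, h=0.3):
    """Average of the kernel density estimate over the projected points z."""
    u = (z[:, None] - z[None, :]) / h
    kernel = np.exp(-0.5 * u ** 2) / (h * np.sqrt(2.0 * np.pi))
    return kernel.mean()  # mean over all n*n pairs = (1/n) * sum_i f_hat(z_i)


rng = np.random.default_rng(2)
n = 200
X = np.vstack([
    rng.normal(loc=[-3.0, 0.0], scale=1.0, size=(n, 2)),
    rng.normal(loc=[3.0, 0.0], scale=1.0, size=(n, 2)),
])
Z = sphere(X)

# Crude search: evaluate the index for unit vectors on a grid of angles.
angles = np.linspace(0.0, np.pi, 60, endpoint=False)
directions = np.column_stack([np.cos(angles), np.sin(angles)])
index_values = np.array([friedman_tukey_index(Z @ a) for a in directions])

alpha_best = directions[np.argmax(index_values)]
index_best = index_values.max()
index_orthogonal = friedman_tukey_index(Z @ np.array([0.0, 1.0]))
print(alpha_best, index_best, index_orthogonal)
```

On this data the index peaks near the direction that separates the two clusters and is clearly lower in the orthogonal, roughly normal direction. A practical implementation would replace the grid with a numerical optimizer, since a grid does not scale beyond a few dimensions.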
The Hodges and Lehmann Approach
The projection index is defined as $I(f) = \int f^2(z)\,dz$, where $f$ is the density of the projection $z = \alpha^{\top} X$ of the multidimensional random variable $X$. Very often the density cannot be computed explicitly, or it is much more convenient to use an estimate of it instead.
Hodges and Lehmann (1956) showed that if $\operatorname{E} z = 0$ and $\operatorname{Var} z = 1$, then the minimum of $I(f)$ is attained at the Epanechnikov density, which has the form $f(z) = \max\{0,\, c\,(b^2 - z^2)\}$, where $c = 3/(20\sqrt{5})$ and $b = \sqrt{5}$. This is a parabolic density function that is equal to zero outside the interval $(-\sqrt{5}, \sqrt{5})$. Thus, under this index the Epanechnikov density is the least interesting, and a large index value indicates a large deviation from the parabolic form.
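As a quick check, assuming the constants $c = 3/(20\sqrt{5})$ and $b = \sqrt{5}$ for the unit-variance parabolic density, direct integration confirms that it integrates to one and has unit variance, and that its index $\int f^2$ lies below that of the standard normal density $\varphi$:

```latex
\begin{align*}
\int_{-b}^{b} c\,(b^2 - z^2)\,dz
  &= \frac{4}{3}\,c\,b^3
   = \frac{3}{20\sqrt{5}} \cdot \frac{4}{3} \cdot 5\sqrt{5} = 1, \\
\int_{-b}^{b} z^2\,c\,(b^2 - z^2)\,dz
  &= \frac{4}{15}\,c\,b^5
   = \frac{3}{20\sqrt{5}} \cdot \frac{4}{15} \cdot 25\sqrt{5} = 1, \\
\int_{-b}^{b} c^2\,(b^2 - z^2)^2\,dz
  &= \frac{16}{15}\,c^2\,b^5
   = \frac{9}{2000} \cdot \frac{16}{15} \cdot 25\sqrt{5}
   = \frac{3\sqrt{5}}{25} \approx 0.268, \\
\int_{-\infty}^{\infty} \varphi^2(z)\,dz
  &= \frac{1}{2\sqrt{\pi}} \approx 0.282 .
\end{align*}
```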
An alternative to the Hodges-Lehmann index is based on the negative entropy, i.e. $I_E(f) = \int f(z) \log f(z)\,dz$.
If $\operatorname{E} z = 0$ and $\operatorname{Var} z = 1$, then the minimum of $I_E$ is attained at the standard normal density. This property is an advantage of the index over the previous one.
Indeed, intuitively the normal distribution is “less interesting” than the Epanechnikov distribution. Thus, using the index $I_E$, we measure the deviation of the distribution from normality.
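A numerical sketch of this property: the negative entropy $\int f \log f$ is evaluated by a simple Riemann sum for three densities with zero mean and unit variance. The grid, the inclusion of a uniform density for comparison and the helper name `entropy_index` are assumptions of the illustration.

```python
import numpy as np

# Integration grid (assumption): wide enough for the normal tails to be negligible.
z = np.linspace(-10.0, 10.0, 200001)
dz = z[1] - z[0]

# Three densities with zero mean and unit variance.
normal = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
b = np.sqrt(5.0)
c = 3.0 / (20.0 * np.sqrt(5.0))
epanechnikov = np.maximum(0.0, c * (b ** 2 - z ** 2))
half_width = np.sqrt(3.0)
uniform = np.where(np.abs(z) <= half_width, 1.0 / (2.0 * half_width), 0.0)


def entropy_index(f):
    """Riemann-sum approximation of the integral of f * log(f), with 0 * log(0) taken as 0."""
    safe = np.where(f > 0.0, f, 1.0)  # log(1) = 0 where the density vanishes
    return np.sum(f * np.log(safe)) * dz


I_normal = entropy_index(normal)
I_epan = entropy_index(epanechnikov)
I_uniform = entropy_index(uniform)
print(I_normal, I_epan, I_uniform)
```

The normal density gives the smallest value, equal to $-\tfrac{1}{2}\log(2\pi e)$, so both other densities register as deviations from normality.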
Fisher's Approach
As another index, we can consider the Fisher information, $I_F(f) = \int \frac{\{f'(z)\}^2}{f(z)}\,dz$.
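For orientation, the value of the Fisher information index at the standard normal density $\varphi$ follows in one line. The notation $I_F(f) = \int (f')^2 / f$ is assumed here:

```latex
\begin{align*}
\varphi'(z) &= -z\,\varphi(z), \\
I_F(\varphi) &= \int \frac{\{\varphi'(z)\}^2}{\varphi(z)}\,dz
             = \int z^2\,\varphi(z)\,dz = 1 .
\end{align*}
```

Among densities with unit variance this is the smallest possible value, so this index also treats the normal distribution as the least interesting one.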
Evaluating the entropy index is computationally expensive and time-consuming, which makes it inconvenient in practice and motivates the search for approximations to it.
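Jones and Sibson's Approach
Jones and Sibson (1987) suggested considering deviations from the normal density in the form $f(x) = \varphi(x)\{1 + \varepsilon(x)\}$, where the function $\varepsilon$ satisfies the conditions
$$\int \varphi(u)\,\varepsilon(u)\,u^{r}\,du = 0 \quad \text{for } r = 0, 1, 2.$$
To simplify the calculation of the Jones-Sibson index, it is convenient to work with the cumulants of the standardized projection $z$: $\kappa_3 = \operatorname{E}(z^3)$ and $\kappa_4 = \operatorname{E}(z^4) - 3$.
Since the standard normal density satisfies $\kappa_3 = \kappa_4 = 0$, the index should at least include information up to the level of symmetric deviations ($\kappa_3$ or $\kappa_4$ nonzero) from normality. The simplest such index is a positive definite quadratic form in $\kappa_3$ and $\kappa_4$. It must be invariant under a change of sign of the data: $\alpha^{\top} X$ and $-\alpha^{\top} X$ must show the same kind of deviation from normality. Note that $\kappa_3$ is odd, i.e. $\kappa_3(\alpha^{\top} X) = -\kappa_3(-\alpha^{\top} X)$, whereas $\kappa_4$ is even, i.e. $\kappa_4(\alpha^{\top} X) = \kappa_4(-\alpha^{\top} X)$. Hence the quadratic form in $\kappa_3$ and $\kappa_4$ measuring deviation from normality cannot include the mixed term $\kappa_3 \kappa_4$.
Therefore, the index proposed by Jones and Sibson is
$$I_{JS}(\alpha) = \frac{1}{12}\left\{\kappa_3^2(\alpha^{\top} X) + \frac{1}{4}\,\kappa_4^2(\alpha^{\top} X)\right\}.$$
This index in fact approximates the difference $\int f \log f - \int \varphi \log \varphi$.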
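A minimal sketch of a moment-based index of this kind, computed as $(\kappa_3^2 + \kappa_4^2/4)/12$ from the sample skewness and excess kurtosis of the standardized projection. The synthetic samples and the function name `jones_sibson_index` are assumptions of the illustration.

```python
import numpy as np


def jones_sibson_index(z):
    """Moment-based index (kappa3^2 + kappa4^2 / 4) / 12 of a one-dimensional projection."""
    z = np.asarray(z, dtype=float)
    z = (z - z.mean()) / z.std()    # standardize: zero mean, unit variance
    kappa3 = np.mean(z ** 3)        # third cumulant (skewness)
    kappa4 = np.mean(z ** 4) - 3.0  # fourth cumulant (excess kurtosis)
    return (kappa3 ** 2 + 0.25 * kappa4 ** 2) / 12.0


rng = np.random.default_rng(3)
z_normal = rng.normal(size=2000)
z_bimodal = np.concatenate([
    rng.normal(loc=-3.0, scale=1.0, size=1000),
    rng.normal(loc=3.0, scale=1.0, size=1000),
])

idx_normal = jones_sibson_index(z_normal)
idx_bimodal = jones_sibson_index(z_bimodal)
idx_flipped = jones_sibson_index(-z_bimodal)  # sign reversal leaves the index unchanged
print(idx_normal, idx_bimodal, idx_flipped)
```

Because only sample moments are needed, the index is much cheaper to evaluate than a kernel-based entropy estimate, at the price of sensitivity to outliers through the third and fourth powers.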
Implementation Issues
Projection pursuit can give interesting results, but its implementation has a number of drawbacks. First, the results are difficult to interpret correctly. Second, the computation can take a long time and require a large amount of memory. In addition, the projection that a human would visually judge to be the best may differ from the solution found by the search. These problems have not yet been resolved; there is no “canonical” version of the method, and research is ongoing.
Literature
- Peter J. Huber, Projection Pursuit (invited paper), Harvard University, The Annals of Statistics, 13, No. 2 (1985), 435-475.
- Jerome H. Friedman, Exploratory projection pursuit, J. Amer. Statist. Assoc. 82 (1987), 249-266.
See also
- Clustering
- Factor analysis
- Principal component analysis