Clever Geek Handbook

Search for the best projection

Projection pursuit is a statistical method that searches for a projection of multidimensional data for which a chosen projection quality function reaches its maximum.

Scope

Although people have good visual perception of information, they can analyze only images of small dimension. When multidimensional data are analyzed, direct visual perception is therefore impractical. This problem is addressed by considering two- or three-dimensional projections of the data. The usual tools are used to visualize these projections: scatter plots, histograms, box plots, and so on.

Method Idea

[Figure 1: histogram of projected data with a normal (unimodal) distribution]

It is necessary to decide which projection is the most "interesting". One approach to automating the choice of the most "interesting" projection is based on the following considerations (for clarity, we consider projection onto a straight line). A projection of the data is considered "uninteresting" if its histogram resembles a normal density, as in Figure 1.

[Figure 2: histogram of projected data with a bimodal distribution]

Projections with a bimodal distribution, as in Figure 2, are considered "interesting".

A bimodal distribution is considered more interesting because it indicates the possible presence of two clusters in the data.

Projection Index

To automate the search for the most "interesting" projection, a specially chosen quality function, often called an index, is used. The projection that maximizes the quality function is declared the best. Searching for a projection of multidimensional data by maximizing some projection quality function is called projection pursuit. The choice of index determines how useful the result will be. Several options for the projection quality function are described below.
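As an illustration of this general scheme (a sketch of my own, not taken from the references), a crude projection-pursuit loop can be written as follows: draw random unit directions, score each projection with an index function, and keep the best. The function names and the random-search strategy are illustrative choices; practical implementations use numerical optimization instead.

```python
import math
import random

def project(data, a):
    """Project each p-dimensional point onto the direction a."""
    return [sum(xi * ai for xi, ai in zip(x, a)) for x in data]

def random_unit_vector(p, rng):
    """A direction drawn uniformly on the unit sphere (normalized Gaussians)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def variance_index(z):
    """Example quality function: sample variance of the projected data."""
    m = sum(z) / len(z)
    return sum((v - m) ** 2 for v in z) / len(z)

def projection_pursuit(data, index, n_trials=1000, seed=0):
    """Random search for the unit direction maximizing the index."""
    rng = random.Random(seed)
    p = len(data[0])
    best_a, best_val = None, float("-inf")
    for _ in range(n_trials):
        a = random_unit_vector(p, rng)
        val = index(project(data, a))
        if val > best_val:
            best_a, best_val = a, val
    return best_a, best_val
```

With two well-separated clusters along the first coordinate, the variance index drives the chosen direction toward that axis.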

Search for one-dimensional projection

We introduce some notation. Let $X$ be a $p$-dimensional random vector; we will further assume that the vector is centered, i.e. $E(X)=0$.

Let $a$ denote a $p$-dimensional numerical vector; finding this vector is the task of the best-projection search, and the projection takes the form $a^{T}X$.

In this case the data matrix has dimension $p\times 1$: the vector $X$ itself is the data matrix.

The index $I(a)$ is then defined as the variance of the linear combination $a^{T}X$, under the additional normalization condition $a^{T}a=1$.
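For the variance index this constrained maximization has a closed-form answer that the text does not spell out: since $\mathrm{Var}(a^{T}X)=a^{T}Sa$, where $S$ is the covariance matrix, the maximizer under $a^{T}a=1$ is the leading eigenvector of $S$ (the first principal component). A minimal pure-Python sketch using power iteration (function names are illustrative):

```python
import math

def matvec(S, a):
    """Matrix-vector product S a."""
    return [sum(S[i][j] * a[j] for j in range(len(a))) for i in range(len(S))]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def variance_of_projection(S, a):
    """I(a) = Var(a^T X) = a^T S a for centered X with covariance S."""
    return dot(a, matvec(S, a))

def top_eigenvector(S, iters=500):
    """Power iteration: converges to the eigenvector of the largest eigenvalue
    (assuming a unique dominant eigenvalue and a non-orthogonal start)."""
    a = [1.0] * len(S)
    for _ in range(iters):
        v = matvec(S, a)
        norm = math.sqrt(dot(v, v))
        a = [c / norm for c in v]
    return a
```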

Common Projection Indices

Friedman and Tukey Approach

Jerome Friedman and John Tukey (1974) measured how "interesting" the multidimensional distribution of $X$ is by considering the index

$I_{FT,\,h}(a)=n^{-1}\sum_{i=1}^{n}\hat{f}_{h,a}^{2}(a^{T}X_{i})$,

where $\hat{f}_{h,a}$ denotes a kernel density estimate computed from the projected data,

$\hat{f}_{h,a}(z)=n^{-1}\sum_{j=1}^{n}K_{h}(z-a^{T}X_{j})$.

If the multidimensional random variable $X$ has a normal distribution, then every projection $z=a^{T}X$ has a standard normal distribution, provided $\|a\|=1$ and $X$ is centered with unit covariance. Changes in $I_{FT,\,h}(a)$ with respect to $a$ therefore indicate deviations from normality.
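The Friedman–Tukey index is straightforward to estimate from a sample. The following sketch (my own; the Gaussian kernel and the hand-picked bandwidth $h$ are assumptions, not prescribed by the text) evaluates the kernel density estimate at each projected point and averages its square:

```python
import math

def ft_index(z, h):
    """Friedman-Tukey-type index n^{-1} * sum_i fhat_h(z_i)^2 for projected
    data z, using a Gaussian kernel density estimate with bandwidth h."""
    n = len(z)

    def fhat(t):
        # Kernel density estimate at t: average of scaled Gaussian kernels.
        s = sum(math.exp(-0.5 * ((t - zj) / h) ** 2) for zj in z)
        return s / (n * h * math.sqrt(2.0 * math.pi))

    return sum(fhat(zi) ** 2 for zi in z) / n
```

Tightly clustered projections yield high density values, and hence a larger index, than widely spread ones.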

The Hodges and Lehmann Approach

The projection index is defined as $\int (f')^{2}$, where $f$ is the density of the distribution of the data. Very often the density cannot be calculated explicitly, or it is much more convenient to use an estimate of the density instead.

Hodges and Lehmann (1956) showed that if $E(X)=0$ and $D(X)=1$, then the minimum of $\int (f')^{2}$ is attained at the Epanechnikov density, which has the form $f(z)=\max\{0,\;c(b^{2}-z^{2})\}$, where $c=\frac{3}{20\sqrt{5}}$ and $b=\sqrt{5}$. This is a parabolic density function that equals zero outside the interval $(-\sqrt{5},\;\sqrt{5})$. Thus, with this index, the Epanechnikov density is the least interesting, and a large index value indicates a large deviation from the parabolic form.
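As a quick numerical sanity check (not part of the original text), one can verify that this density integrates to one, has unit variance, and yields a smaller $\int (f')^{2}$ than the standard normal density:

```python
import math

B = math.sqrt(5.0)
C = 3.0 / (20.0 * math.sqrt(5.0))

def epanechnikov(z):
    """The Epanechnikov density f(z) = max{0, c(b^2 - z^2)}."""
    return max(0.0, C * (B * B - z * z))

def phi(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def integrate(g, lo=-6.0, hi=6.0, n=100000):
    """Midpoint-rule integration of g over [lo, hi]."""
    dz = (hi - lo) / n
    return sum(g(lo + (i + 0.5) * dz) for i in range(n)) * dz

def deriv_sq_integral(g, lo=-6.0, hi=6.0, n=100000, eps=1e-5):
    """Numerical integral of (g')^2 via central differences."""
    dz = (hi - lo) / n
    total = 0.0
    for i in range(n):
        z = lo + (i + 0.5) * dz
        d = (g(z + eps) - g(z - eps)) / (2.0 * eps)
        total += d * d * dz
    return total
```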

An alternative Hodges–Lehmann index is based on maximizing the entropy, i.e. $\int (-f\log f)$.

If $E(X)=0$ and $D(X)=1$, then the minimum of the index $\int f\log f$ is attained at the standard normal density. This property is an advantage of this index over the previous variant.

Indeed, it seems intuitively plausible that the normal distribution is "less interesting" than the Epanechnikov distribution. Thus, using the index $\int f\log f$, we measure the deviation of a distribution from the normal one.

Fisher's Approach

As another index, one can consider the Fisher information, $\int (f')^{2}/f$.

Calculating such entropy-type indices involves considerable computational difficulty and can take a long time, which is, of course, inconvenient.

Jones and Sibson's approach

Jones and Sibson (1987) proposed considering deviations from the normal density in the form $f(x)=\varphi(x)\{1+\varepsilon(x)\}$, where the function $\varepsilon$ satisfies the conditions

$\int \varphi(u)\,\varepsilon(u)\,u^{r}\,du=0$ for $r=0,\;1,\;2$.
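These conditions say that the perturbation $\varepsilon$ leaves the total mass, the mean, and the variance of $\varphi$ unchanged. As an illustration (my choice of example, not from the text), $\varepsilon(u)=u^{3}-3u$, a multiple of the third Hermite polynomial, satisfies all three, which can be checked numerically:

```python
import math

def phi(u):
    """Standard normal density."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def eps(u):
    """Example perturbation: a multiple of the Hermite polynomial H3."""
    return u ** 3 - 3.0 * u

def moment_condition(r, lo=-8.0, hi=8.0, n=100000):
    """Integral of phi(u) * eps(u) * u^r over [lo, hi] by the midpoint rule."""
    du = (hi - lo) / n
    total = 0.0
    for i in range(n):
        u = lo + (i + 0.5) * du
        total += phi(u) * eps(u) * u ** r * du
    return total
```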

To simplify the computation of the Jones–Sibson index, it is convenient to pass to the cumulants $\kappa_{3}=\mu_{3}=E(X^{3})$ and $\kappa_{4}=\mu_{4}-3=E(X^{4})-3$.

Since the standard normal density satisfies $\kappa_{3}=\kappa_{4}=0$, the index should at least capture deviations from normality up to the level where $\kappa_{3}$ or $\kappa_{4}$ is nonzero (symmetric or asymmetric deviations). The simplest such index is a positive definite quadratic form in $\kappa_{3}$ and $\kappa_{4}$. It must also be invariant under a change of sign of the data: from $a^{T}X$ and $-a^{T}X$ we must obtain the same measure of deviation from normality. Note that $\kappa_{3}$ is odd, i.e. $\kappa_{3}(a^{T}X)=-\kappa_{3}(-a^{T}X)$, while $\kappa_{4}$ is even, i.e. $\kappa_{4}(a^{T}X)=\kappa_{4}(-a^{T}X)$. A quadratic form in $\kappa_{3}$ and $\kappa_{4}$ measuring the deviation from normality therefore does not include the mixed term $\kappa_{3}\kappa_{4}$.

Therefore, the index proposed by Jones and Sibson is

$I_{JS}(a)=\{\kappa_{3}^{2}(a^{T}X)+\kappa_{4}^{2}(a^{T}X)/4\}/12$.

This index in fact measures the difference $\int f\log f-\int \varphi\log\varphi$.
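The Jones–Sibson index is easy to compute from sample moments. A minimal sketch (the explicit standardization step is my addition; the text assumes the data are already centered with unit variance):

```python
import math

def standardize(z):
    """Center the sample and scale it to unit variance."""
    n = len(z)
    m = sum(z) / n
    s = math.sqrt(sum((v - m) ** 2 for v in z) / n)
    return [(v - m) / s for v in z]

def js_index(z):
    """Jones-Sibson index {k3^2 + k4^2 / 4} / 12 from sample cumulants."""
    u = standardize(z)
    n = len(u)
    k3 = sum(v ** 3 for v in u) / n        # sample skewness
    k4 = sum(v ** 4 for v in u) / n - 3.0  # sample excess kurtosis
    return (k3 ** 2 + k4 ** 2 / 4.0) / 12.0
```

A two-point sample at $\pm 1$ has $\kappa_{3}=0$ and $\kappa_{4}=-2$, giving an index of exactly $1/12$, larger than that of a uniform sample.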

Implementation Issues

The search for the best projection can give interesting results, but its implementation has many drawbacks. First, it is difficult to interpret the results correctly. Second, running the method can take a long time and require a fairly large amount of memory. In addition, there remain discrepancies between what a human would visually judge to be the best projection and the solution the search produces. These problems have not yet been resolved; there is no "canonical" version of the method, and research remains active.

Literature

  • Peter J. Huber, Projection Pursuit (invited paper), The Annals of Statistics, 13, No. 2 (1985), 435–475.
  • Jerome H. Friedman, Exploratory projection pursuit, J. Amer. Statist. Assoc., 82 (1987), 249–266.

See also

  • Clustering
  • Factor analysis
  • Principal component analysis
Source: https://ru.wikipedia.org/w/index.php?oldid=83899228

