
Minimum description length principle

The minimum description length (MDL) principle is a formalization of Occam's razor in which the best hypothesis (a model and its parameters) for a given set of data is the one that leads to the best compression of the data. The MDL principle was introduced by Jorma Rissanen in 1978 [1]. It is an important concept in information theory and computational learning theory [2] [3] [4].

Overview

Any set of data can be represented as a string of symbols from a finite (say, binary) alphabet.

[The MDL principle] is based on the following insight: any regularity in a given set of data can be used to compress the data, i.e., to describe it using fewer symbols than are needed to describe the data literally. (Grünwald, 1998) [5]

MDL is a theory of inductive and statistical inference that starts from the idea that all statistical learning is about finding regularities in data, and that the best hypothesis for describing the regularities in the data is also the one that compresses the data most. Like other statistical methods, it can be used to learn the parameters of a model from some data. Usually, however, standard statistical methods assume that the general form of the model is fixed. The main strength of the MDL principle is that it can also be used to select the general form of a model together with its parameters. The quantity of interest (sometimes just a model, sometimes just its parameters, sometimes both) is called a hypothesis. The basic idea is to consider a two-stage (lossless) code that encodes the data $D$ by first encoding a hypothesis $H$ from the set of considered hypotheses $\mathcal{H}$, and then encoding $D$ "with the help of" $H$; in the simplest context this just means "encoding the deviations of the data from the predictions made by $H$":

$$L(D) = \min_{H \in \mathcal{H}} \bigl( L(H) + L(D \mid H) \bigr)$$

The hypothesis $H$ that achieves this minimum is then regarded as the best explanation of the data $D$. As a simple example, take a regression problem: the data $D$ consist of a sequence of points $D = (x_1, y_1), \ldots, (x_n, y_n)$, and the set $\mathcal{H}$ is the set of all polynomials from $X$ to $Y$. To describe a polynomial $H$ of degree (say) $k$, one would first have to discretize its parameters to some precision and describe this precision (a natural number); then one would describe the degree $k$ (another natural number) and, finally, the $k+1$ parameters; the total length is $L(H)$. One would then describe the points in $D$ using some fixed code for the x-values, followed by a code for the $n$ deviations $y_i - H(x_i)$.
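
As an illustration (not part of the original article), the following Python sketch implements such a two-stage code for polynomial regression and picks the degree that minimizes the total code length. The coding scheme is an assumption made for the example: each of the $k+1$ quantized parameters is charged roughly $\tfrac{1}{2}\log_2 n$ bits, and the deviations are charged the ideal code length $-\log_2$ of a Gaussian density with a known standard deviation; the name two_part_code_length is purely illustrative.

    import numpy as np

    def two_part_code_length(x, y, degree, sigma=1.0):
        """Crude two-stage MDL score L(H) + L(D|H) for a degree-k polynomial fit.

        Illustrative assumptions: each of the k+1 parameters costs about
        0.5*log2(n) bits after quantization, and the deviations y_i - H(x_i)
        are encoded with an ideal code for a Gaussian of standard deviation sigma.
        """
        n = len(x)
        coeffs = np.polyfit(x, y, degree)          # fit H by least squares
        residuals = y - np.polyval(coeffs, x)      # deviations y_i - H(x_i)

        l_hypothesis = 0.5 * (degree + 1) * np.log2(n)        # L(H)
        log_density = (-0.5 * np.log(2 * np.pi * sigma ** 2)
                       - residuals ** 2 / (2 * sigma ** 2))
        l_data_given_h = -np.sum(log_density) / np.log(2)     # L(D|H), nats -> bits
        return l_hypothesis + l_data_given_h

    # Pick the degree that minimizes L(H) + L(D|H).
    rng = np.random.default_rng(0)
    x = np.linspace(-1.0, 1.0, 100)
    y = 2 * x ** 2 - x + rng.normal(scale=0.3, size=x.size)
    best = min(range(8), key=lambda k: two_part_code_length(x, y, k, sigma=0.3))
    print("selected degree:", best)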

In practice one often (but not always) works with a statistical model. For example, one associates each polynomial $H$ with the corresponding conditional distribution stating that, given $X$, $Y$ is normally distributed with mean $H(X)$ and some variance $\sigma^2$, which may be fixed or added as a free parameter. The set of hypotheses $\mathcal{H}$ then reduces to the assumption of a linear model, $Y = H(X) + \epsilon$, with $H$ a polynomial.

Moreover, one is often not directly interested in specific parameter values, but only, say, in the degree of the polynomial. In that case one takes $\mathcal{H}$ to be $\mathcal{H} = \{\mathcal{H}_0, \mathcal{H}_1, \ldots\}$, where each $\mathcal{H}_j$ represents the hypothesis that the data are best described by a polynomial of degree j. One then encodes the data $D$ given hypothesis $\mathcal{H}_j$ using a code designed so that, whenever some hypothesis $H \in \mathcal{H}_j$ fits the data well, the code length $L(D \mid H)$ is short. The design of such codes is called universal coding. There are various types of universal codes one could use; they often give similar lengths for long data sequences but differ for short ones. The 'best' ones (in the sense that they have a minimax optimality property) are the normalized maximum likelihood (NML) or Shtarkov codes. A very useful class of codes is the Bayesian marginal likelihood codes. For exponential families of distributions, when Jeffreys' prior is used and the parameter space is suitably restricted, these asymptotically coincide with the NML codes; this brings MDL theory into close contact with objective Bayesian model selection, in which Jeffreys' prior is also sometimes adopted, albeit for different reasons.
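
To make the notion of a universal code concrete, the sketch below (an illustration, not from the source) computes two universal code lengths for the Bernoulli model class: the NML (Shtarkov) code length and the Bayesian marginal likelihood code length under Jeffreys' prior. The function names are illustrative and SciPy is assumed to be available; for growing sample size the two lengths should differ by only a small amount, in line with the asymptotic coincidence mentioned above.

    import numpy as np
    from scipy.special import betaln, gammaln

    def nml_code_length(h, n):
        """NML (Shtarkov) code length in bits for a sequence of n coin flips with h heads."""
        ks = np.arange(n + 1)
        with np.errstate(divide="ignore", invalid="ignore"):
            log_ml = np.where((ks == 0) | (ks == n), 0.0,
                              ks * np.log(ks / n) + (n - ks) * np.log(1 - ks / n))
        log_binom = gammaln(n + 1) - gammaln(ks + 1) - gammaln(n - ks + 1)
        log_complexity = np.logaddexp.reduce(log_binom + log_ml)  # log of the Shtarkov sum
        log_ml_h = 0.0 if h in (0, n) else h * np.log(h / n) + (n - h) * np.log(1 - h / n)
        return (log_complexity - log_ml_h) / np.log(2)

    def jeffreys_code_length(h, n):
        """Bayesian marginal-likelihood code length in bits with the Jeffreys prior Beta(1/2, 1/2)."""
        log_marginal = betaln(h + 0.5, n - h + 0.5) - betaln(0.5, 0.5)
        return -log_marginal / np.log(2)

    for n in (100, 1000, 10000):
        h = n // 2
        print(n, round(nml_code_length(h, n), 2), round(jeffreys_code_length(h, n), 2))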

MDL compared with Solomonoff's theory of inductive inference

To select the hypothesis that captures the most regularity in the data, one looks for the hypothesis with which the best compression can be achieved. In order to do so, a code for compressing the data is fixed. Perhaps the most general code one could use is a (Turing-complete) computer language: a program that outputs the data is written in this language, and this program then effectively represents the data. The length of the shortest program that outputs the data is called the Kolmogorov complexity of the data. This is the central idea of Ray Solomonoff's idealized theory of inductive inference, which served as an inspiration for MDL.

However, this mathematical theory does not provide a practical method of inference. The most important reasons for this are the following:

  • Kolmogorov complexity is uncomputable: there is no algorithm that, given an arbitrary data sequence, outputs the shortest program reproducing that data.
  • Kolmogorov complexity depends on which computer language is used. The choice of language is arbitrary, but it affects the complexity only up to an additive constant, and for this reason constant terms are disregarded in the theory of Kolmogorov complexity. In practice, however, often only a small amount of data is available, so such constants can have a very large influence on the results: good results cannot be guaranteed when working with limited data.

MDL attempts to remedy this by:

  • Restricting the set of allowed codes in such a way that it becomes possible (computable) to find the shortest code length for the data, relative to the allowed codes.
  • Choosing codes that are reasonably efficient regardless of the data at hand. This point is captured by the notion of a "universal code".

One of the most important properties of MDL methods is that they provide natural protection against overfitting, because they implement a trade-off between the complexity of the hypothesis (model class) and the complexity of the data [3].

MDL Example

A coin is flipped 1000 times and the sequence of heads and tails is recorded. Consider two model classes:

  • The first is a code that represents the outcomes with a 0 for heads and a 1 for tails. This code represents the hypothesis that the coin is fair (that is, heads and tails are equally likely). The code length according to this code is always exactly 1000 bits.
  • The second consists of all codes that are efficient for a coin with some specific bias, representing the hypothesis that the coin is not fair. Say we observe 510 heads and 490 tails. Then the code length according to the best code in the second model class is shorter than 1000 bits.

For this reason a naive statistical method might choose the second model as a better explanation of the data. However, an MDL approach would construct a single code based on the hypothesis, rather than simply using the best code in the class. This code could be the normalized maximum likelihood code or a Bayesian code. If such a code is used, the total code length based on the second model class would be larger than 1000 bits. Therefore, the conclusion that necessarily follows from the MDL approach is that there is insufficient evidence for the hypothesis of a biased coin, even though the best element of the second model class fits the data better.
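
The following sketch (an illustration, not part of the source) makes these code lengths explicit, assuming NumPy and SciPy are available: the fair-coin code always costs exactly 1000 bits, the best element of the biased-coin class costs slightly less, and the NML code for the whole class costs more than 1000 bits once the logarithm of the Shtarkov sum (the parametric complexity) is added.

    import numpy as np
    from scipy.special import gammaln

    n, h = 1000, 510

    # Code length under the fair-coin hypothesis: always exactly n bits.
    fair_bits = n

    # Best element of the biased-coin class: plug in the maximum-likelihood estimate h/n.
    ml_bits = -(h * np.log2(h / n) + (n - h) * np.log2(1 - h / n))

    # NML code length: ML fit plus log2 of the Shtarkov sum over all sequences of length n.
    ks = np.arange(1, n)  # the endpoints k = 0 and k = n each contribute log-ML = 0
    log_ml = ks * np.log(ks / n) + (n - ks) * np.log(1 - ks / n)
    log_binom = gammaln(n + 1) - gammaln(ks + 1) - gammaln(n - ks + 1)
    log_complexity = np.logaddexp.reduce(np.append(log_binom + log_ml, [0.0, 0.0]))
    nml_bits = ml_bits + log_complexity / np.log(2)

    print(f"fair-coin code         : {fair_bits} bits")
    print(f"best biased-coin fit   : {ml_bits:.2f} bits")   # slightly below 1000
    print(f"NML code for the class : {nml_bits:.2f} bits")  # above 1000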

MDL notation

Central to MDL theory is the one-to-one correspondence between code length functions and probability distributions (this follows from the Kraft-McMillan inequality). For any probability distribution $P$ one can construct a code $C$ such that the length (in bits) of $C(x)$ equals $-\log_2 P(x)$; this code minimizes the expected code length. Conversely, given a code $C$, one can construct a probability distribution $P$ such that the same holds. (Rounding issues are ignored here.) In other words, searching for an efficient code is equivalent to searching for a good probability distribution.
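
A small sketch of this correspondence (illustrative only): starting from a distribution, the Shannon code lengths $\lceil -\log_2 P(x) \rceil$ satisfy the Kraft-McMillan inequality, so a prefix code with those lengths exists, and from any set of code lengths one can go back to a (sub-)distribution $2^{-L(x)}$.

    import math

    # From a distribution to integer code lengths L(x) = ceil(-log2 P(x)).
    P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
    lengths = {x: math.ceil(-math.log2(p)) for x, p in P.items()}

    # Kraft-McMillan: sum of 2^-L(x) <= 1 guarantees a prefix code with these lengths exists.
    kraft_sum = sum(2.0 ** -l for l in lengths.values())
    assert kraft_sum <= 1.0

    # And back: from code lengths to a (sub-)probability distribution Q(x) = 2^-L(x).
    Q = {x: 2.0 ** -l for x, l in lengths.items()}
    print(lengths, kraft_sum, Q)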

Related Concepts

The MDL principle is strongly connected to probability theory and statistics through the correspondence between codes and probability distributions mentioned above. This has led some researchers to view the MDL principle as equivalent to Bayesian inference: the code lengths of the model and of the data in MDL correspond to the prior probability and the marginal likelihood, respectively, in the Bayesian framework [6].

While Bayesian machinery is often useful for constructing efficient MDL codes, the MDL framework also accommodates other, non-Bayesian codes. An example is the Shtarkov normalized maximum likelihood code, which plays a central role in current MDL theory but has no equivalent in Bayesian inference. Furthermore, Rissanen stresses that we should make no assumptions about the true data-generating process: in practice, a model class is typically a simplification of reality and therefore does not contain any code or probability distribution that is true in any objective sense [7] [8]. In the last-mentioned reference, Rissanen bases the mathematical underpinning of MDL on the Kolmogorov structure function.

According to the MDL philosophy, Bayesian methods should be dismissed if they are based on unsafe priors that would lead to poor results. The priors that are acceptable from an MDL point of view also tend to be favoured in so-called objective Bayesian analysis; there, however, the motivation is usually different [9].

Other systems

MDL was not the first information-theoretic approach to learning; as early as 1968 Wallace and Boulton pioneered a related concept called minimum message length (MML). The difference between MDL and MML is a source of continuing confusion. Superficially, the methods appear largely equivalent, but there are some significant differences, especially in interpretation:

  • MML is a fully subjective Bayesian approach: it starts from the idea that one represents one's beliefs about the data-generating process in the form of a prior distribution. The MDL principle avoids assumptions about the data-generating process.
  • Both methods make use of two-part codes: the first part always represents the information that one is trying to learn, such as the index of a model class (model selection) or parameter values (parameter estimation); the second part encodes the data given the information in the first part. The difference between the methods is that the MDL literature recommends moving unwanted parameters to the second part of the code, where they can be represented together with the data using a so-called one-part code, which is often more efficient than a two-part code. In the original description of MML, all parameters are encoded in the first part, so all parameters are learned.
  • In the MML formulation, each parameter is stated to exactly the precision that results in the optimal overall message length: the preceding example might arise if some parameter was originally considered "possibly useful" for a model but was subsequently found to be unable to help explain the data. In the MDL framework, the focus is more on comparing model classes than individual models, and it is more natural to approach the same question by comparing the class of models that explicitly include such a parameter with some other class that does not.

See also

  • Algorithmic Information Theory
  • Inductive inference
  • Kolmogorov complexity
  • Minimum message length
  • Occam's razor

Notes

  1. ↑ Rissanen, 1978, pp. 465–658.
  2. ↑ Minimum Description Length. University of Helsinki. Retrieved July 3, 2010. Archived from the original on February 18, 2010.
  3. ↑ 1 2 Grünwald, 2007 .
  4. ↑ Grünwald, Myung, Pitt, 2005 .
  5. ↑ Grünwald, 2004 .
  6. ↑ MacKay, 2003 .
  7. ↑ Rissanen, Jorma. Homepage of Jorma Rissanen. Retrieved July 3, 2010.
  8. ↑ Rissanen, 2007 .
  9. ↑ Nannen, 2010 .

Literature

  • Rissanen J. Modeling by shortest data description // Automatica. - 1978. - Vol. 14, no. 5. - DOI: 10.1016/0005-1098(78)90005-5.
  • Peter D. Grünwald. The Minimum Description Length Principle. - Cambridge, Massachusetts; London, England: MIT Press, 2007. - ISBN 978-0-262-07281-6.
  • Advances in Minimum Description Length: Theory and Applications / Peter D. Grünwald, In Jae Myung, Mark A. Pitt. - Cambridge, Massachusetts; London, England: MIT Press , 2005. - (Neural Information Processing). - ISBN 0-262-07262-9 .
  • Peter Grünwald. A tutorial introduction to the minimum description length principle. - 2004.
  • Rissanen J. Information and Complexity in Statistical Modeling . - Springer, 2007. - (Information Science and Statistics). - ISBN 0-387-36610-5 .
  • Volker Nannen. A short introduction to Model Selection, Kolmogorov Complexity and Minimum Description Length // preprint. - 2010.
  • David MacKay. Information Theory, Inference, and Learning Algorithms . - Cambridge University Press , 2003.

Further Reading

  • Minimum Description Length on the Web , by the University of Helsinki. Features readings, demonstrations, events and links to MDL researchers.
  • Homepage of Jorma Rissanen , containing lecture notes and other recent material on MDL.
  • Advances in Minimum Description Length , MIT Press , ISBN 0-262-07262-9 .