
Probability

A simple example: the probability that a die shows a "5" is $\tfrac{1}{6}$, the same as for any other face of the die.

Probability is the degree (a relative measure, a quantitative assessment) of the possibility that a certain event will occur. When the grounds for a possible event actually to occur outweigh the opposite grounds, the event is called probable, otherwise improbable or unlikely. The preponderance of positive grounds over negative ones (and vice versa) can vary in degree, so that the probability (and improbability) is greater or lesser [1]. For this reason probability is often assessed at a qualitative level, especially when a more or less precise quantitative assessment is impossible or extremely difficult. Various gradations of "levels" of probability are possible [2].

The study of probability from the mathematical point of view is a separate discipline, the theory of probability [1]. In probability theory and mathematical statistics the concept of probability is formalized as a numerical characteristic of an event, a probability measure (or its value): a measure on a set of events (subsets of the set of elementary events) taking values from $0$ to $1$. The value $1$ corresponds to a certain event. An impossible event has probability 0 (the converse is, in general, not always true). If the probability of an event is $p$, the probability of its non-occurrence is $1-p$. In particular, a probability of $1/2$ means that the occurrence and the non-occurrence of the event are equally likely.

The classical definition of probability is based on the concept of equally possible outcomes. The probability is the ratio of the number of outcomes favorable to the event to the total number of equally possible outcomes. For example, the probability of "heads" or "tails" in a random coin toss is 1/2, provided that only these two possibilities exist [3] and that they are equally possible. This classical "definition" of probability can be generalized to the case of infinitely many possible outcomes: for example, if an event can occur with equal probability at any point (the number of points is infinite) of some bounded region of space (or of the plane), then the probability that it occurs in some part of this admissible region equals the ratio of the volume (area) of that part to the volume (area) of the region of all possible points.

The empirical "definition" of probability is associated with the frequency of occurrence of an event, based on the idea that with a sufficiently large number of trials the frequency should tend to the objective degree of possibility of the event. In the modern presentation of probability theory, probability is defined axiomatically, as a special case of abstract measure theory. Nevertheless, the link between the abstract measure and the probability, which expresses the degree of possibility of an event, is precisely the frequency of its observation.

A probabilistic description of phenomena has become widespread in modern science, in particular in econometrics and in the statistical physics of macroscopic (thermodynamic) systems, where even with a classical deterministic description of the motion of individual particles a deterministic description of the whole system of particles is neither practically possible nor expedient. In quantum physics the processes themselves are of a probabilistic nature.

History

 
Christiaan Huygens probably published the first book on probability theory

Background to the concept of probability

The need for the concept of probability and for research in this direction was historically associated with gambling, especially with dice. Before the notion of probability appeared, combinatorial problems were formulated about counting the number of possible outcomes when several dice are thrown, as well as the problem of dividing the stakes between players when the game is stopped early. The first problem, for three dice, was "solved" in 960 by Bishop Wibold of Cambrai [4]. He counted 56 variants. However, this number does not in fact reflect the number of equally probable possibilities, since each of the 56 variants can be realized in a different number of ways. In the first half of the 13th century Richard de Fournival took these aspects into account. Although he also arrived at the number 56, in his reasoning he took into account that, for example, "the same number of points on three dice can be obtained in six ways". Based on his reasoning it can already be established that the number of equally possible variants is 216. Subsequently many treated this problem incorrectly. The number of equally possible outcomes when throwing three dice was first clearly computed by Galileo Galilei, who raised 6 (the number of outcomes for one die) to the power 3 (the number of dice): 6³ = 216. He also compiled a table of the number of ways of obtaining the various sums of points.

Problems of the second type were formulated at the end of the 15th century by Luca Pacioli, who proposed the first (generally speaking, erroneous) solution [4]. His solution was to divide the stake in proportion to the games already won. Significant further progress at the beginning of the 16th century is associated with the names of the Italian scholars Gerolamo Cardano and N. Tartaglia. Cardano gave a correct count of the number of cases when throwing two dice (36). He also, for the first time, related the number of cases in which a certain number comes up on at least one die (11) to the total number of outcomes (which corresponds to the classical definition of probability): 11/36. Similarly, for three dice he considered, for example, that nine points can be obtained in a number of ways equal to 1/9 of the "whole series" (that is, of the total number of equally possible outcomes, 216). Cardano did not formally introduce the concept of probability, but in essence he considered the relative number of outcomes, which is essentially equivalent to considering probabilities. In embryonic form, ideas related to the law of large numbers can also be found in Cardano. Concerning the problem of dividing the stake, Cardano suggested taking into account the number of remaining games that still need to be won. N. Tartaglia also commented on Pacioli's solution and proposed his own (generally speaking, also erroneous) solution.

Galileo's merit also lies in extending the field of research to errors of observation. He was the first to point out the inevitability of errors and to classify them into systematic and random (a classification still in use today).

 
Carl Friedrich Gauss

The emergence of concepts and probability theory

The first works on probability date back to the 17th century: the correspondence of the French scientists B. Pascal and P. Fermat (1654) and the work of the Dutch scientist C. Huygens (1657), which gave the earliest known scientific treatment of probability [5]. In fact, Huygens was already operating with the concept of mathematical expectation. The Swiss mathematician Jacob Bernoulli established the law of large numbers for a scheme of independent trials with two outcomes (published posthumously, 1713).

In the 18th and early 19th centuries probability theory was developed in the works of A. de Moivre (England, 1718), P. Laplace (France), C. Gauss (Germany) and S. Poisson (France). Probability theory began to be applied in the theory of observation errors, which developed in connection with the needs of geodesy and astronomy, and in the theory of gunnery. The law of distribution of errors was in essence first proposed by Laplace, first as an exponential dependence on the error without regard to sign (1774), and then as an exponential function of the square of the error (1778). The latter law is usually called the Gaussian or normal distribution. Bernoulli (1778) introduced the principle of the product of the probabilities of simultaneous events. Adrien-Marie Legendre (1805) developed the method of least squares.

 
Andrei Nikolaevich Kolmogorov

In the second half of the 19th century the development of probability theory is associated with the work of the Russian mathematicians P. L. Chebyshev, A. M. Lyapunov and A. A. Markov (Sr.), as well as with the work on mathematical statistics of A. Quetelet (Belgium) and F. Galton (England) and on statistical physics of L. Boltzmann (Austria), which laid the foundation for a substantial expansion of the problems of probability theory. The logical (axiomatic) scheme for constructing the foundations of probability theory that is most widespread today was developed in 1933 by the Soviet mathematician A. N. Kolmogorov.

Probability Definitions

Classical Definition

The classical "definition" of probability proceeds from the notion of equal possibility as an objective property of the phenomena under study. Equipossibility is a primitive (undefined) concept and is established from general considerations of symmetry of the phenomena studied. For example, when tossing a coin it is assumed that, because of the presumed symmetry of the coin, the homogeneity of its material and the randomness (impartiality) of the toss, there are no grounds to prefer "tails" over "heads" or vice versa, that is, these outcomes can be considered equally possible (equiprobable).

Along with the concept of equal possibility, in the general case the classical definition also requires the concept of an elementary event (outcome) that is either favorable or unfavorable to the event A in question. We are speaking of outcomes whose occurrence excludes the possibility of the other outcomes; these are incompatible elementary events. For example, when a die is thrown, the appearance of a particular number excludes the appearance of the other numbers.

The classical definition of probability can be formulated as follows:

The probability of a random event A is the ratio of the number n of pairwise incompatible, equally probable elementary events constituting the event A to the number N of all possible elementary events:

$$P(A) = \frac{n}{N}$$

For example, suppose two dice are thrown. The total number of equally possible outcomes (elementary events) is 36 (since for each of the 6 possible outcomes of one die there are 6 possible outcomes of the other). Let us estimate the probability of obtaining seven points. Seven points can be obtained only with the following combinations of the two dice: 1 + 6, 2 + 5, 3 + 4, 4 + 3, 5 + 2, 6 + 1. That is, there are 6 equally possible outcomes favorable to obtaining 7 points out of the 36 possible outcomes of the throw, so the probability equals 6/36 or, after reduction, 1/6. For comparison: the probability of obtaining 12 points or 2 points is only 1/36, six times smaller.
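The classical count above is easy to verify by directly enumerating the equally possible outcomes. Below is a minimal Python sketch (purely illustrative, not part of the original article); it assumes fair dice and uses the event "the sum equals 7" from the example:

```python
from fractions import Fraction
from itertools import product

# All 36 equally possible outcomes of throwing two dice.
outcomes = list(product(range(1, 7), repeat=2))

# Outcomes favorable to the event "the sum of points equals 7".
favorable = [o for o in outcomes if sum(o) == 7]

# Classical definition: P(A) = n / N.
p_seven = Fraction(len(favorable), len(outcomes))
print(p_seven)  # 1/6

# For comparison: the probability of 12 points.
p_twelve = Fraction(sum(1 for o in outcomes if sum(o) == 12), len(outcomes))
print(p_twelve)  # 1/36
```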

Geometric Definition

Although the classical definition is intuitive and grew out of practice, it at least cannot be applied directly when the number of equally possible outcomes is infinite. A vivid example of an infinite number of possible outcomes is a bounded geometric region G, say in the plane, with area S. A randomly "thrown" "point" can land with equal probability at any point of this region. The problem is to determine the probability that the point falls into some subregion g with area s. In this case, generalizing the classical definition, one arrives at the geometric definition of the probability of falling into the subregion $g$:

$$P(A) = \frac{s}{S}$$

In view of the equal possibility, this probability does not depend on the shape of the region g but only on its area. This definition can naturally be generalized to a space of any dimension, where instead of area one uses the notion of "volume". Moreover, it is exactly this definition that leads to the modern axiomatic definition of probability. The notion of volume is generalized to that of a measure on some abstract set, on which one imposes the requirements that "volume" possesses in the geometric interpretation: first of all, non-negativity and additivity.
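The geometric definition also suggests a numerical illustration: if points are "thrown" uniformly into a region, the fraction that lands in a subregion approximates the ratio of areas s/S. The sketch below is a hedged example; the choice of the unit square as G and of a quarter of the unit disk as g is an assumption made only for illustration:

```python
import math
import random

random.seed(0)
N = 100_000

# Region G: the unit square, area S = 1.
# Subregion g: the quarter of the unit disk inside it, area s = pi/4.
hits = sum(1 for _ in range(N)
           if random.random() ** 2 + random.random() ** 2 <= 1.0)

estimate = hits / N            # approximates s / S = pi / 4
print(estimate, math.pi / 4)   # the two numbers should be close
```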

Frequency (Statistical) Definition

When more complex problems are considered, the classical definition runs into insurmountable difficulties. In particular, in some cases it may be impossible to identify the equally possible cases. Even for a coin, as is well known, there is a clearly non-equipossible chance of it landing on its edge, which cannot be estimated from theoretical considerations (one can only say that it is unlikely, and this is rather a practical consideration). Therefore, already at the dawn of probability theory an alternative "frequency" definition of probability was proposed. Formally, the probability can be defined as the limit of the frequency of observation of the event A, assuming homogeneity of the observations (that is, sameness of all observation conditions) and their independence from one another:

$$P(A) = \lim_{N \to \infty} \frac{n}{N},$$

where $N$ is the number of observations and $n$ is the number of occurrences of the event $A$.

Although this definition points rather to a method of estimating an unknown probability, by means of a large number of homogeneous and independent observations, it nevertheless reflects the content of the concept of probability. Namely, if a certain probability is attributed to an event as an objective measure of its possibility, this means that under fixed conditions and with repeated trials we must obtain a frequency of its occurrence close to $p$ (the closer, the more observations there are). This, in fact, is the original meaning of the concept of probability; it is based on an objectivist view of natural phenomena. Below we consider the so-called laws of large numbers, which provide the theoretical basis (within the modern axiomatic approach described below), in particular, for the frequency estimation of probability.
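As a hedged illustration of the frequency definition, the following sketch simulates repeated independent trials and prints the observed frequency n/N for growing N; the "true" probability p_true = 0.5 of the simulated event is an assumption of the example:

```python
import random

random.seed(1)
p_true = 0.5                 # assumed probability of the event ("heads")

for N in (10, 100, 10_000, 1_000_000):
    # Count the occurrences of the event in N independent trials.
    n = sum(1 for _ in range(N) if random.random() < p_true)
    print(N, n / N)          # the frequency n/N approaches p_true as N grows
```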

Axiomatic definition

In the modern mathematical approach probability is defined by Kolmogorov's axiomatics. It is assumed that some space of elementary events $X$ is given. Subsets of this space are interpreted as random events. The union (sum) of some subsets (events) is interpreted as the event consisting in the occurrence of at least one of these events. The intersection (product) of subsets (events) is interpreted as the event consisting in the occurrence of all of these events. Disjoint sets are interpreted as incompatible events (their joint occurrence is impossible). Accordingly, the empty set means an impossible event.

A probability (probability measure) is a measure (numerical function) $\mathbf{P}$ defined on the set of events and having the following properties:

  • Non-negativity: $\forall A \subset X\colon \mathbf{P}(A) \geqslant 0$;
  • Additivity: the probability of the occurrence of at least one (that is, of the sum) of pairwise incompatible events equals the sum of the probabilities of these events; in other words, if $A_i A_j = \varnothing$ for $i \neq j$, then $\mathbf{P}\left(\sum_i A_i\right) = \sum_i \mathbf{P}(A_i)$;
  • Finiteness (boundedness by one): $\mathbf{P}(X) = 1$.

If the space of elementary events X is finite, then the stated additivity condition for two arbitrary incompatible events is sufficient; additivity then follows for any finite number of incompatible events. In the case of an infinite (countable or uncountable) space of elementary events, however, this condition is not enough. So-called countable or sigma-additivity is required, that is, the additivity property must hold for any at most countable family of pairwise incompatible events. This is needed to ensure the "continuity" of the probability measure.

A probability measure may not be defined for all subsets of the set $X$. It is assumed to be defined on some sigma-algebra $\Omega$ of subsets [6]. These subsets are called measurable with respect to the given probability measure, and it is they that are the random events. The collection $(X, \Omega, \mathbf{P})$, that is, the set of elementary events, the sigma-algebra of its subsets and the probability measure, is called a probability space.
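For a finite space of elementary events the axioms can be made concrete very simply: take all subsets as the sigma-algebra and define the measure by summing non-negative weights of elementary events. The sketch below is only an illustration; the fair-die weights and the events A and B are assumptions of the example:

```python
from fractions import Fraction

# Space of elementary events X for one die and the weights of its points.
X = {1, 2, 3, 4, 5, 6}
weight = {x: Fraction(1, 6) for x in X}   # non-negative, sums to 1

def P(event):
    """Probability measure on subsets of X (here the sigma-algebra is all subsets)."""
    return sum(weight[x] for x in event)

A = {2, 4, 6}                 # "an even number of points"
B = {5, 6}                    # "at least five points"

assert P(set()) == 0 and P(X) == 1              # impossible and certain events
assert P(A | B) == P(A) + P(B) - P(A & B)       # addition theorem (see below)
print(P(A), P(B), P(A | B))                     # 1/2 1/3 2/3
```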

Probability Properties

The basic properties of probability are easiest to determine based on the axiomatic definition of probability.

1) The probability of the impossible event (the empty set $\varnothing$) is zero:

$$\mathbf{P}\{\varnothing\} = 0;$$

This follows from the fact that each event can be represented as the sum of this event and an impossible event, which, due to the additivity and finiteness of the probability measure, means that the probability of an impossible event must be zero.

2) If an event A is contained ("enters") in an event B, that is $A \subset B$, so that the occurrence of event A entails the occurrence of event B, then:

$$\mathbf{P}\{A\} \leqslant \mathbf{P}\{B\};$$

This follows from the non-negativity and additivity of the probability measure, since the event $B$ may "contain", besides the event $A$, some other events incompatible with $A$.

3) The probability of every event $A$ lies between 0 and 1, that is, it satisfies the inequalities:

$$0 \leqslant \mathbf{P}\{A\} \leqslant 1;$$

The first part of the inequality (non-negativity) is asserted axiomatically, and the second follows from the previous property, given that every event is contained in $X$, while for $X$ it is axiomatically assumed that $\mathbf{P}\{X\} = 1$.

4) The probability of the event $B \setminus A$, where $A \subset B$, that is, of the occurrence of the event $B$ without the occurrence of the event $A$, equals:

$$\mathbf{P}\{B \setminus A\} = \mathbf{P}\{B\} - \mathbf{P}\{A\};$$

This follows from the additivity of probability for incompatible events and from the fact that the events $A$ and $B \setminus A$ are incompatible by assumption, while their sum equals the event $B$.

5) The probability of the event $\bar{A}$, opposite to the event $A$, equals:

$$\mathbf{P}\{\bar{A}\} = 1 - \mathbf{P}\{A\};$$

This follows from the previous property if one takes the whole space $X$ as the set $B$ and uses the fact that $\mathbf{P}\{X\} = 1$.

6) (Addition theorem for probabilities) The probability of the occurrence of at least one of (that is, of the sum of) two arbitrary (not necessarily incompatible) events $A$ and $B$ equals:

$$\mathbf{P}\{A + B\} = \mathbf{P}\{A\} + \mathbf{P}\{B\} - \mathbf{P}\{AB\}.$$

This property can be obtained by representing the union of two arbitrary sets as the union of two disjoint sets: the first set and the difference between the second set and the intersection of the original sets, $A + B = A + (B \setminus (AB))$. Hence, using the additivity of probability for disjoint sets and the formula for the probability of a difference of sets (see property 4), we obtain the required property.

Conditional Probability

Bayes Formula

The probability of the occurrence of an event $A$ given that an event $B$ has occurred is called the conditional probability of $A$ (under this condition) and is denoted $P(A|B)$. The simplest way to derive the formula for the conditional probability is to start from the classical definition of probability. For the two events $A$ and $B$, consider the following set of incompatible events: $A\bar{B}, AB, \bar{A}B, \bar{A}\bar{B}$, which exhaust all possible outcomes (such a set of events is called complete; see below). Let the total number of equally possible outcomes be $n$. If the event $B$ has already occurred, the equally possible outcomes are restricted to the two events $AB$ and $\bar{A}B$, which together are equivalent to the event $B$; let the number of such outcomes be $n_B$. Of these outcomes, only those belonging to the event $AB$ are favorable to the event $A$; denote their number by $n_{AB}$. Then, by the classical definition of probability, the probability of the event $A$ given the event $B$ equals $P(A|B) = n_{AB}/n_B$. Dividing numerator and denominator by the total number of equally possible outcomes $n$ and applying the classical definition again, we finally obtain the conditional probability formula:

$$P(A|B) = \frac{P(AB)}{P(B)}.$$

This implies the so-called multiplication theorem for probabilities:

$$P(AB) = P(B)\,P(A|B).$$

By symmetry one can similarly show that $P(AB) = P(A)\,P(B|A)$, from which the Bayes formula follows:

$$P(A|B) = \frac{P(A)\,P(B|A)}{P(B)}$$
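Within the classical setting, the conditional probability and the Bayes formula can again be checked by direct enumeration. The following sketch is illustrative only; the two-dice events A ("the sum is 7") and B ("the first die is even") are assumptions chosen for the example:

```python
from fractions import Fraction
from itertools import product

outcomes = set(product(range(1, 7), repeat=2))
A = {o for o in outcomes if sum(o) == 7}        # "the sum is 7"
B = {o for o in outcomes if o[0] % 2 == 0}      # "the first die is even"

def P(event):
    return Fraction(len(event), len(outcomes))

# Conditional probability by restricting the outcomes, and by the formula P(AB)/P(B).
print(Fraction(len(A & B), len(B)))             # 3 favorable out of 18, i.e. 1/6
print(P(A & B) / P(B))                          # 1/6 again

# Bayes formula: P(A|B) = P(A) * P(B|A) / P(B), with P(B|A) = P(AB)/P(A).
print(P(A) * (P(A & B) / P(A)) / P(B))          # 1/6 once more
```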

Event Independence

Events A and B are called independent if the probability of occurrence of one of them does not depend on whether the other event has occurred. In terms of conditional probability this means that $P(A|B) = P(A)$, from which it follows that for independent events the following equality holds:

$$P(AB) = P(A)\,P(B).$$

Within the axiomatic approach this formula is taken as the definition of the independence of two events. For an arbitrary (finite) collection of events $A_i$, their joint (mutual) independence means that the probability of their joint occurrence equals the product of their probabilities:

$$P(A_1 A_2 \cdots A_n) = P(A_1)\,P(A_2) \cdots P(A_n).$$

The conditional probability formula derived above (within the classical definition of probability) serves, under the axiomatic definition of probability, as the definition of conditional probability. Accordingly, as a consequence of the definitions of independent events and of conditional probability, one obtains the equality of the conditional and unconditional probabilities of the event.
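The defining equality P(AB) = P(A)P(B) can be checked directly in the same finite two-dice model. The events below are hypothetical examples chosen for illustration:

```python
from fractions import Fraction
from itertools import product

outcomes = set(product(range(1, 7), repeat=2))

def P(event):
    return Fraction(len(event), len(outcomes))

A = {o for o in outcomes if o[0] == 6}          # "the first die shows 6"
B = {o for o in outcomes if o[1] % 2 == 0}      # "the second die shows an even number"
D = {o for o in outcomes if sum(o) >= 10}       # "the sum is at least 10"

print(P(A & B) == P(A) * P(B))   # True: events determined by different dice are independent
print(P(A & D) == P(A) * P(D))   # False: D clearly depends on the first die
```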

Total Probability and Bayes Formula

A set of events $A_i$, at least one of which necessarily (with probability one) occurs as a result of the trial, is called complete. This means that such a set of events exhausts all possible outcomes. Formally, within the axiomatic approach, this means that $\sum_i A_i = X$. If these events are incompatible, then within the classical definition this means that the sum of the numbers of elementary events favoring each of these events equals the total number of equally possible outcomes.

Let $A_i$ be a complete set of pairwise incompatible events. Then for any event $B$ the following formula for computing its probability holds (the total probability formula):

$$P(B) = \sum_{i=1}^{n} P(B|A_i)\,P(A_i)$$

Then the Bayes formula given above, taking the total probability into account, can be written as follows:

$$P(A_j|B) = \frac{P(A_j)\,P(B|A_j)}{\sum_{i=1}^{n} P(A_i)\,P(B|A_i)}$$

This formula is the basis of an alternative approach to probability - a Bayesian or subjective approach (see below).
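A hedged numerical illustration of the total probability formula and the Bayes formula: the three hypotheses and all probabilities below are invented solely for the example.

```python
# Complete set of pairwise incompatible hypotheses A_i with prior probabilities.
prior = {"A1": 0.5, "A2": 0.3, "A3": 0.2}          # sums to 1
# Assumed conditional probabilities P(B | A_i) of the event B under each hypothesis.
likelihood = {"A1": 0.1, "A2": 0.4, "A3": 0.8}

# Total probability formula: P(B) = sum_i P(B|A_i) P(A_i).
p_B = sum(likelihood[a] * prior[a] for a in prior)

# Bayes formula: posterior probabilities P(A_j | B).
posterior = {a: likelihood[a] * prior[a] / p_B for a in prior}

print(p_B)        # approximately 0.33
print(posterior)  # e.g. P(A3|B) = 0.16 / 0.33, roughly 0.485
```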

Probability and random variables

The most important particular case in which "probability" is applied is the probability of obtaining, as a result of a test or an observation, one or another numerical value of some measured (observed) quantity. It is assumed that before the test (observation) the exact value of this quantity is unknown, that is, there is an obvious uncertainty, usually related (except in quantum physics) to the impossibility of taking into account all the factors that affect the result. Such quantities are called random. In modern probability theory the concept of a random variable is formalized: it is defined as a function of "chance", a function on the space of elementary events. With this definition it is not the elementary events themselves that are observed but "realizations", specific values of the random variable. For example, when a coin is tossed, "tails" or "heads" comes up. If we introduce a function that assigns the number 1 to "tails" and 0 to "heads", we obtain a random variable as a function of these outcomes. The concept of a random variable is generalized to functions mapping the space of elementary events into a space of arbitrary nature; accordingly, the notions of a random vector, a random set and so on can be introduced. Usually, however, a random variable means a numerical function (quantity).

Abstracting from the formalization just described, the space of elementary events can be understood as the set of possible values of a random variable. The sigma-algebra of subsets consists of arbitrary intervals on the numerical axis and their various (countable) unions and intersections. The probability measure in this case is called the distribution of the random variable. It is enough to specify the probability measure on intervals of the form $(-\infty; x)$, since an arbitrary interval can be represented as a union or intersection of such intervals. It is assumed that to each interval of this form there corresponds a certain probability $F(x) = P(X < x)$, that is, a certain function of the possible values $x$. Such a function is called the integral, cumulative, or simply the distribution function of the random variable. If this function is differentiable (the corresponding random variables are then called continuous), an analytically often more convenient function is introduced, the distribution density, which is the derivative of the distribution function: $f(x) = F'(x)$. For discrete random variables, instead of a density (which does not exist in this case), one can directly use the distribution series $p_i$, the probability of the $i$-th value. The corresponding distribution function is related to the distribution series by $F(x) = \sum_{x_i < x} p_i$. The probability that the random variable lies in a given interval $(x_1, x_2)$ is defined as the difference of the values of the distribution function at the endpoints of this interval. In terms of the distribution density it is the corresponding integral of the density over the given interval (for a discrete random variable it is simply the sum of the probabilities of the values in this interval).
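For a discrete random variable, the relation between the distribution series, the distribution function F(x) = P(X < x), and the probability of falling into an interval can be written out directly. A minimal sketch, assuming a fair die as the distribution series:

```python
from fractions import Fraction

# Distribution series p_i of a discrete random variable (a fair die is assumed).
values = [1, 2, 3, 4, 5, 6]
p = {x: Fraction(1, 6) for x in values}

def F(x):
    """Distribution function F(x) = P(X < x) = sum of p_i over x_i < x."""
    return sum(p[v] for v in values if v < x)

# Probability of falling into the interval [x1, x2) as a difference of F values.
x1, x2 = 2, 5
print(F(x2) - F(x1))    # P(2 <= X < 5) = 3/6 = 1/2
```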

 
The Galton board demonstrates the normal distribution

The distribution of a random variable characterizes it completely. However, individual characteristics of this distribution are often used. First of all, this is the mathematical expectation of the random variable, the average expected value of the random variable weighted by the probabilities of its values, and the variance (dispersion), the mean square of the deviation of the random variable from its mathematical expectation. In some cases other characteristics are used, among which skewness and kurtosis are the most important. These indicators are special cases of the so-called moments of the distribution.

There are standard distribution laws that are often used in practice. First of all, this is the normal distribution (Gaussian distribution). It is fully characterized by two parameters, the mathematical expectation and the variance. Its widespread use is connected, in particular, with the so-called limit theorems (see below). When testing hypotheses, the chi-squared distribution, Student's distribution and the Fisher distribution often arise. In the analysis of discrete random variables the binomial distribution, the Poisson distribution and others are considered. The gamma distribution is also often considered; particular cases of it are the exponential distribution and the chi-squared distribution mentioned above. Naturally, the distributions used in practice are not limited to these.

Often in practice, on the basis of a priori considerations, it is assumed that the probability distribution of a given random variable belongs to some distribution known up to parameters, for example the normal distribution with unknown mathematical expectation and variance (these two parameters uniquely determine the whole normal distribution). The task of the statistical sciences (mathematical statistics, econometrics, etc.) is then to estimate the values of these parameters in the most efficient (accurate) way. There are criteria by which one can establish the degree of "truth" of the corresponding estimation methods. Usually at least consistency, unbiasedness and efficiency of the estimates within a certain class of estimates are required.
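As an illustration of parametric estimation: for a normal distribution with unknown mathematical expectation and variance, the usual estimates are the sample mean and the sample variance. The sketch below is only an example; the "true" parameter values and the sample size are arbitrary assumptions made for the simulation:

```python
import random
import statistics

random.seed(2)
mu_true, sigma_true = 3.0, 2.0          # unknown in practice; assumed here for simulation
sample = [random.gauss(mu_true, sigma_true) for _ in range(10_000)]

mu_hat = statistics.fmean(sample)       # estimate of the mathematical expectation
var_hat = statistics.variance(sample)   # unbiased estimate of the variance
print(mu_hat, var_hat)                  # close to 3.0 and 4.0
```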

In practice, nonparametric methods for estimating distributions are also used.

The laws of large numbers

Of major importance in probability theory and its applications is a group of theorems usually united under the name "law of large numbers", or limit theorems. Without resorting to rigorous formulations, one can say, for example, that under some weak conditions the average value of independent identically distributed random variables tends to their mathematical expectation as the number of these random variables becomes sufficiently large. If independent observations of the same random variable are considered as such a set of random variables, this means that the average of the sample observations should tend to the true (unknown) mathematical expectation of this random variable. This is the law of large numbers in Chebyshev's form. It provides the basis for obtaining the corresponding estimates.

A very particular but very important case is the Bernoulli scheme: independent trials in each of which some event either occurs or does not. It is assumed that in every trial the probability of the event occurring is the same and equals $p$ (but it is unknown). This scheme can be reduced to the case of an average value by introducing the formal random variable X, the indicator of the occurrence of the event: it equals 1 when the event occurs and 0 when it does not. For such a random variable the mathematical expectation also equals $p$. The average value of this random variable is then simply the frequency of the event. By the above theorem this average (the frequency) must tend to the true mathematical expectation of the random variable, that is, to the unknown probability $p$. Thus, as the number of observations grows, the frequency of the event can be used as a good estimate of the unknown probability. This is the so-called Bernoulli law of large numbers, which was historically the first law of large numbers. More rigorously, one can at least assert that the probability that the frequency deviates from $p$ by more than some given amount $\varepsilon$ tends to zero for any value of $\varepsilon$. A more general result (the Glivenko-Cantelli theorem) is that the empirical distribution as a whole tends to the true probability distribution as the number of observations increases.

Alongside these theorems there is the so-called central limit theorem, which gives the limiting probability distribution of the average: under certain weak conditions the average value of observations of a random variable, for a sufficiently large number of observations, has approximately a normal distribution (regardless of the initial distribution of the random variable itself). This is the case, for example, for the average of independent identically distributed random variables. In particular, this theorem also applies to the Bernoulli scheme. In general, the number of occurrences of the event A in n trials has a binomial distribution; however, for a sufficiently large number of observations this distribution, according to the theorem indicated, tends to the normal distribution, in this case with mathematical expectation $np$ and variance $np(1-p)$, where $p$ is the probability of occurrence of the event A in each trial. This is stated in the local and integral de Moivre-Laplace theorems. The conclusion above also follows from this, namely: the average value of the indicator random variable of the event, that is, the frequency of occurrence of the event in the trials, will in the limit have mathematical expectation $p$ and variance $p(1-p)/n$, which tends to zero as the number of trials increases. Thus, the frequency tends to the true probability of occurrence of the event as the number of independent trials increases, and we know the distribution of the frequency for a sufficiently large number of observations.
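A short simulation sketch of the Bernoulli law of large numbers and of the de Moivre-Laplace (normal) approximation; the success probability p and the number of trials n are assumptions of the example:

```python
import math
import random

random.seed(3)
p, n = 0.3, 10_000                 # assumed success probability and number of trials

# One run of the Bernoulli scheme: the frequency approaches p (law of large numbers).
k = sum(1 for _ in range(n) if random.random() < p)
print(k / n)                       # close to 0.3

# De Moivre-Laplace: the count k is approximately normal with mean np and variance np(1-p),
# so the frequency k/n has mean p and standard deviation sqrt(p(1-p)/n).
std_freq = math.sqrt(p * (1 - p) / n)
print(std_freq)                    # about 0.0046: typical deviation of the frequency from p
```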

Bayesian Approach to Probability

The objective (frequency) approach described above is based on the assumption that there is an objective uncertainty inherent in the phenomena being studied. In the alternative Bayesian approach, uncertainty is interpreted subjectively - as a measure of our ignorance. In the framework of the Bayesian approach, probability is understood as the degree of confidence in the truth of a judgment - subjective probability.

The idea of the Bayesian approach is to pass from a priori knowledge to a posteriori knowledge by taking the observed phenomena into account. The essence of the Bayesian approach follows from the Bayes formula described above. Let there be a complete set of hypotheses $A_i$, and let the probabilities of the validity of these hypotheses (the degrees of confidence in them) be estimated from a priori considerations. The completeness of the set means that at least one of the hypotheses is true, and the sum of the a priori probabilities $p(A_i)$ equals 1. Also, for the event $B$ under study, the probabilities $P(B|A_i)$ are assumed known from a priori considerations: the probability of the occurrence of the event $B$ given that the hypothesis $A_i$ is valid. Then, using the Bayes formula, one can determine the posterior probabilities $P(A_j|B)$, that is, the degree of confidence in the validity of the hypothesis $A_j$ after the event $B$ has occurred. The procedure can in fact be repeated, taking the new probabilities as priors and performing the trial again, thereby iteratively refining the posterior probabilities of the hypotheses.

In particular, in contrast to the basic approach to estimating distributions of random variables, where it is assumed that the values of the unknown distribution parameters are estimated on the basis of observations, the Bayesian approach assumes that the parameters are also random variables (from the point of view of our ignorance of their values). The hypotheses are the various possible values of the parameters, and certain a priori densities $p(\theta)$ of the unknown parameters are assumed to be given. The posterior distribution serves as the estimate of the unknown parameters. Suppose that as a result of observations some values $x$ of the random variable under study are obtained. Then, assuming the likelihood, that is, the probability (density) $p(x|\theta)$ of obtaining this sample for given values of the parameters, is known, the Bayes formula (in this case its continuous analogue, where densities are used instead of probabilities and summation is replaced by integration) yields the posterior probability (density) $p(\theta|x)$ of the parameters for this sample.
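A standard concrete case of this scheme is the estimation of the unknown probability p of a Bernoulli scheme with a Beta prior, for which the posterior is again a Beta distribution. This conjugate example is a common illustration rather than something stated in the text above; the prior parameters and the observed data are assumptions:

```python
# Beta(a, b) prior density over the unknown probability p of the event.
a, b = 1.0, 1.0                  # uniform prior on [0, 1] (an assumption of the example)

# Observed data: outcomes of independent trials (1 = event occurred), assumed for illustration.
data = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

# Bayesian update: with a Bernoulli likelihood the posterior is Beta(a + successes, b + failures).
successes = sum(data)
failures = len(data) - successes
a_post, b_post = a + successes, b + failures

posterior_mean = a_post / (a_post + b_post)   # point summary of the posterior for p
print(a_post, b_post, posterior_mean)         # 8.0 4.0 and roughly 0.667
```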

Probability, Information, and Entropy

Suppose there are $N$ equally probable outcomes. The degree of uncertainty of the experiment in this situation can be characterized by the number $H = \log_2 N$. This quantity, introduced in 1928 by the communications engineer Hartley, characterizes the information one must have in order to know which of the $N$ equally possible outcomes takes place, that is, in order to reduce the uncertainty of the experiment to zero. The simplest way to find this out is to ask questions of the form "is the number of the outcome less than half of N?"; if so, a similar question can be asked about one of the halves (depending on the answer), and so on. Answering each such question reduces the uncertainty. In total, exactly $H$ such questions are needed to remove the uncertainty completely. More formally, the outcome numbers can be written in binary notation; then $H$ is the number of bits needed for such a representation, that is, the amount of information in bits with which the realization of equally possible outcomes can be encoded. In general, the unit of information may be different, so in principle the logarithm can be taken to any base.

In the general case (the outcomes are not necessarily equally probable), the amount of information associated with the realization of one of $N$ outcomes with probabilities $p_i$ (it is assumed that $\sum_i p_i = 1$) is defined as follows (Shannon's formula):

$$H = \sum_i p_i \log \frac{1}{p_i} = -\sum_i p_i \log p_i = -E(\log p)$$

where $E$ denotes mathematical expectation.

Obviously, when all outcomes are equally probable ($p_i = 1/N$) we obtain the relation already known, $H = \log N$. For a continuous random variable this formula must use the distribution density instead of the probabilities and the corresponding integral instead of the sum.
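Shannon's formula is straightforward to compute directly. A small sketch comparing the equiprobable case with a skewed distribution; the particular probabilities are chosen arbitrarily for illustration:

```python
import math

def entropy(probs, base=2):
    """Shannon entropy H = -sum p_i log p_i (in bits for base 2)."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

N = 8
print(entropy([1 / N] * N))                # log2(8) = 3 bits, the Hartley (equiprobable) case
print(entropy([0.5, 0.25, 0.125, 0.125]))  # 1.75 bits: less uncertainty than log2(4) = 2
```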

The quantity so defined is called information, amount of information, information entropy, and so on. Note that this definition of information abstracts from any content of the information, from the content of the specific outcomes: the amount of information is determined solely from the probabilities. Shannon called the quantity $H$ entropy because of its similarity to thermodynamic entropy. The latter concept was first introduced by Rudolf Clausius in 1865, and the probabilistic interpretation of entropy was given by Ludwig Boltzmann in 1877. The entropy of a macroscopic system is a measure of the number of possible microstates for a given macrostate (more precisely, it is proportional to the logarithm of the number of microstates, the statistical weight), or a measure of the "internal disorder" of the macrosystem.

Probability and Quantum Physics

In quantum mechanics the state of a system (particle) is characterized by a wave function (more generally, a state vector), a complex-valued function of "coordinates" whose squared modulus is interpreted as the probability density of obtaining given values of the "coordinates". According to modern concepts the probabilistic determination of the state is complete, and the reason for the probabilistic nature of quantum physics is not any "hidden" factors: it is due to the nature of the processes themselves. In quantum physics any mutual transformations of various particles that are not forbidden by particular conservation laws are possible, and these mutual transformations are governed by laws, namely probabilistic laws. According to modern ideas it is fundamentally impossible to predict either the moment of a mutual transformation or its specific result; one can speak only of the probabilities of various transformation processes. Instead of exact classical quantities, in quantum physics only the average values (mathematical expectations) of these quantities can be estimated, for example the mean lifetime of a particle.

Probability in other areas

Besides the question of the probability of a fact, there can arise, both in law and in morals (from a certain ethical point of view), the question of how probable it is that a particular fact constitutes a violation of the general law. This question, which serves as the main motive in the religious jurisprudence of the Talmud, also gave rise in Roman Catholic moral theology (especially from the end of the 16th century) to very complex systematic constructions and an enormous dogmatic and polemical literature (see Probabilism) [1].

See also

  • Measure of a set
  • Risk
  • Total probability formula
  • Bernoulli formula
  • Conditional probability
  • Random variable
  • Randomness
  • Measurement (quantum mechanics)
  • Quantum probability
  • Uncertainty principle
  • Transition probability
  • The paradox of laws
  • Monte Carlo Method

Notes

  1. ↑ 1 2 3 V. S. Soloviev. Probability // Brockhaus and Efron Encyclopedic Dictionary: in 86 volumes (82 volumes and 4 supplementary). St. Petersburg, 1890-1907.
  2. ↑ Thus, for example, in jurisprudence, when a fact subject to trial is established on the basis of witness testimony, it always remains, strictly speaking, only probable, and it is necessary to know how substantial this probability is. In Roman law a fourfold division was adopted here: probatio plena (where the probability practically turns into certainty), then probatio minus plena, then probatio semiplena major, and finally probatio semiplena minor. Note also that in Roman usage the word for probability is etymologically related to the word for honesty.
  3. ↑ Generally speaking, there is a non-zero (but negligible) probability that the coin will, for example, land on its edge, so the true probability of heads or tails is slightly less than 1/2.
  4. ↑ 1 2 Gnedenko B. V. A Course in Probability Theory: Textbook. 6th ed., revised and supplemented. Moscow: Nauka, Main Editorial Office of Physical and Mathematical Literature, 1988. 448 p. Pp. 386-387.
  5. ↑ Abrams, William. A Brief History of Probability. Second Moment. <http://www.secondmoment.org/articles/probability.php>. Retrieved November 10, 2017.
  6. ↑ More precisely, it is assumed that the measure is defined at least on some semiring of subsets, and it is then proved that it is also defined on the minimal ring containing this semiring and, moreover, that this measure can be extended to the sigma-algebra of subsets.
