Consider the standard linear regression problem, in which for $i = 1, \ldots, n$ we specify the mean of the conditional distribution of $y_i$ given a $k \times 1$ predictor vector $\mathbf{x}_i$:

- $y_i = \mathbf{x}_i^{\rm T} \boldsymbol{\beta} + \epsilon_i,$

where $\boldsymbol{\beta}$ is a $k \times 1$ vector, and the $\epsilon_i$ are independent and identically normally distributed random variables:

- $\epsilon_i \sim N(0, \sigma^2).$
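For concreteness, data from this generative model can be simulated in a few lines of Python; the sample size, dimension, and true parameter values below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 50, 3                        # illustrative sample size and dimension
beta_true = np.array([1.0, -2.0, 0.5])
sigma = 0.7

X = rng.normal(size=(n, k))         # rows are predictor vectors x_i^T
y = X @ beta_true + rng.normal(scale=sigma, size=n)   # y_i = x_i^T beta + eps_i
```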

This corresponds to the following likelihood function:
- $\rho(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) \propto (\sigma^2)^{-n/2} e^{-\frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\rm T}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})}.$

The ordinary least squares solution is to estimate the coefficient vector as
- $\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\rm T}\mathbf{X})^{-1}\mathbf{X}^{\rm T}\mathbf{y},$

where $\mathbf{X}$ is the $n \times k$ design matrix, each row of which is a predictor vector $\mathbf{x}_i^{\rm T}$, and $\mathbf{y}$ is the column $n$-vector $[y_1 \; \cdots \; y_n]^{\rm T}$.
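This estimate is a one-liner in Python; a minimal sketch, assuming the simulated `X` and `y` above (or any full-column-rank design matrix and response vector):

```python
def ols_estimate(X, y):
    """Least squares estimate: beta_hat = (X^T X)^{-1} X^T y."""
    # Solving the normal equations is numerically preferable to
    # forming the matrix inverse explicitly.
    return np.linalg.solve(X.T @ X, X.T @ y)
```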
This is a frequentist approach, and it assumes that there are enough measurements to say something meaningful about $\boldsymbol{\beta}$. In the Bayesian approach, the data are supplemented with additional information in the form of a prior probability distribution. The prior belief about the parameters is combined with the likelihood function of the data according to Bayes' theorem to yield the posterior belief about the parameters $\boldsymbol{\beta}$ and $\sigma$. The prior can take different functional forms depending on the domain of application and the information that is available a priori.
Conjugate prior distribution
For an arbitrary prior distribution, there may be no analytical solution for the posterior distribution. In this section, we consider a so-called conjugate prior, for which the posterior distribution can be derived analytically.
A prior distribution $\rho(\boldsymbol{\beta}, \sigma^2)$ is conjugate to this likelihood function if it has the same functional form with respect to $\boldsymbol{\beta}$ and $\sigma$. Since the log-likelihood is quadratic in $\boldsymbol{\beta}$, we rewrite it so that the likelihood becomes normal in $(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})$. We write
- $(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\rm T}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^{\rm T}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) + (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})^{\rm T}(\mathbf{X}^{\rm T}\mathbf{X})(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}).$
The likelihood is now rewritten as
- $\rho(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) \propto (\sigma^2)^{-v/2} e^{-\frac{v s^2}{2\sigma^2}} (\sigma^2)^{-(n-v)/2} e^{-\frac{1}{2\sigma^2}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})^{\rm T}(\mathbf{X}^{\rm T}\mathbf{X})(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})},$
where
- $v s^2 = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^{\rm T}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \quad$ and $\quad v = n - k,$

and $k$ is the number of regression coefficients.
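These two quantities follow directly from the residuals; a short sketch, reusing the `ols_estimate` helper defined earlier:

```python
def residual_stats(X, y):
    """Return (v, s2) with v = n - k and v * s2 = ||y - X beta_hat||^2."""
    n, k = X.shape
    r = y - X @ ols_estimate(X, y)   # residual vector
    v = n - k
    return v, (r @ r) / v
```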
This suggests a form for the prior distribution:
- $\rho(\boldsymbol{\beta}, \sigma^2) = \rho(\sigma^2)\,\rho(\boldsymbol{\beta} \mid \sigma^2),$
where $\rho(\sigma^2)$ is an inverse-gamma distribution:
- $\rho(\sigma^2) \propto (\sigma^2)^{-\frac{v_0}{2} - 1} e^{-\frac{v_0 s_0^2}{2\sigma^2}}.$
In the notation of the inverse-gamma distribution, this is the density of an $\text{Inv-Gamma}(a_0, b_0)$ distribution with $a_0 = \tfrac{v_0}{2}$ and $b_0 = \tfrac{1}{2} v_0 s_0^2$, where $v_0$ and $s_0^2$ are the prior values of $v$ and $s^2$, respectively. Equivalently, this density can be described as a scaled inverse chi-squared distribution, $\text{Scale-inv-}\chi^2(v_0, s_0^2)$.
Further, the conditional prior density $\rho(\boldsymbol{\beta} \mid \sigma^2)$ is a normal distribution:
- $\rho(\boldsymbol{\beta} \mid \sigma^2) \propto (\sigma^2)^{-\frac{k}{2}} e^{-\frac{1}{2\sigma^2}(\boldsymbol{\beta} - \boldsymbol{\mu}_0)^{\rm T} \mathbf{\Lambda}_0 (\boldsymbol{\beta} - \boldsymbol{\mu}_0)}.$
In the notation of the normal distribution, the conditional prior distribution is $\mathcal{N}\left(\boldsymbol{\mu}_0, \sigma^2 \mathbf{\Lambda}_0^{-1}\right)$.
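To make the prior concrete, one can draw from it hierarchically: first $\sigma^2$ from the inverse-gamma distribution, then $\boldsymbol{\beta}$ from the conditional normal. A sketch with purely illustrative hyperparameter values, reusing `k` from the simulation above:

```python
from scipy import stats

mu0 = np.zeros(k)          # prior mean of beta
Lambda0 = np.eye(k)        # prior precision matrix
a0, b0 = 1.0, 1.0          # inverse-gamma shape and scale (illustrative)

# One hierarchical draw: sigma^2 ~ Inv-Gamma(a0, b0),
# then beta | sigma^2 ~ N(mu0, sigma^2 * Lambda0^{-1}).
sigma2_draw = stats.invgamma(a0, scale=b0).rvs()
beta_draw = stats.multivariate_normal(mu0, sigma2_draw * np.linalg.inv(Lambda0)).rvs()
```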
Posterior distribution
Given the prior distribution, the posterior distribution can be expressed as
- $\rho(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) \propto \rho(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2)\,\rho(\boldsymbol{\beta} \mid \sigma^2)\,\rho(\sigma^2)$
- $\propto (\sigma^2)^{-n/2} e^{-\frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\rm T}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})}$
- $\times (\sigma^2)^{-k/2} e^{-\frac{1}{2\sigma^2}(\boldsymbol{\beta} - \boldsymbol{\mu}_0)^{\rm T} \boldsymbol{\Lambda}_0 (\boldsymbol{\beta} - \boldsymbol{\mu}_0)}$
- $\times (\sigma^2)^{-(a_0 + 1)} e^{-\frac{b_0}{\sigma^2}}.$
After some rearrangement [1], the posterior can be rewritten so that the posterior mean $\boldsymbol{\mu}_n$ of the parameter vector $\boldsymbol{\beta}$ can be expressed in terms of the least squares estimate $\hat{\boldsymbol{\beta}}$ and the prior mean $\boldsymbol{\mu}_0$, with the strength of the prior indicated by the prior precision matrix $\boldsymbol{\Lambda}_0$:
- $\boldsymbol{\mu}_n = (\mathbf{X}^{\rm T}\mathbf{X} + \boldsymbol{\Lambda}_0)^{-1}(\mathbf{X}^{\rm T}\mathbf{X}\hat{\boldsymbol{\beta}} + \boldsymbol{\Lambda}_0 \boldsymbol{\mu}_0).$
To confirm that $\boldsymbol{\mu}_n$ is indeed the posterior mean, the quadratic terms in the exponent can be rearranged into a quadratic form in $\boldsymbol{\beta} - \boldsymbol{\mu}_n$ [2]:
- $(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\rm T}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + (\boldsymbol{\beta} - \boldsymbol{\mu}_0)^{\rm T} \boldsymbol{\Lambda}_0 (\boldsymbol{\beta} - \boldsymbol{\mu}_0) =$
- $(\boldsymbol{\beta} - \boldsymbol{\mu}_n)^{\rm T}(\mathbf{X}^{\rm T}\mathbf{X} + \boldsymbol{\Lambda}_0)(\boldsymbol{\beta} - \boldsymbol{\mu}_n) + \mathbf{y}^{\rm T}\mathbf{y} - \boldsymbol{\mu}_n^{\rm T}(\mathbf{X}^{\rm T}\mathbf{X} + \boldsymbol{\Lambda}_0)\boldsymbol{\mu}_n + \boldsymbol{\mu}_0^{\rm T} \boldsymbol{\Lambda}_0 \boldsymbol{\mu}_0.$
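This identity is easy to verify numerically. The check below reuses the simulated `X`, `y`, the illustrative prior `mu0`, `Lambda0`, and the `ols_estimate` helper from the earlier sketches:

```python
Lambda_n = X.T @ X + Lambda0
beta_hat = ols_estimate(X, y)
mu_n = np.linalg.solve(Lambda_n, X.T @ X @ beta_hat + Lambda0 @ mu0)

beta = rng.normal(size=k)            # arbitrary test point
lhs = (y - X @ beta) @ (y - X @ beta) + (beta - mu0) @ Lambda0 @ (beta - mu0)
rhs = ((beta - mu_n) @ Lambda_n @ (beta - mu_n)
       + y @ y - mu_n @ Lambda_n @ mu_n + mu0 @ Lambda0 @ mu0)
assert np.isclose(lhs, rhs)          # both sides agree for any beta
```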
Now the posterior distribution can be expressed as a normal distribution times an inverse-gamma distribution:
- $\rho(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) \propto (\sigma^2)^{-\frac{k}{2}} e^{-\frac{1}{2\sigma^2}(\boldsymbol{\beta} - \boldsymbol{\mu}_n)^{\rm T}(\mathbf{X}^{\rm T}\mathbf{X} + \mathbf{\Lambda}_0)(\boldsymbol{\beta} - \boldsymbol{\mu}_n)}$
- $\times (\sigma^2)^{-\frac{n + 2a_0}{2} - 1} e^{-\frac{2b_0 + \mathbf{y}^{\rm T}\mathbf{y} - \boldsymbol{\mu}_n^{\rm T}(\mathbf{X}^{\rm T}\mathbf{X} + \boldsymbol{\Lambda}_0)\boldsymbol{\mu}_n + \boldsymbol{\mu}_0^{\rm T}\boldsymbol{\Lambda}_0\boldsymbol{\mu}_0}{2\sigma^2}}.$
Therefore, the posterior distribution can be parameterized as follows.
- $\rho(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) \propto \rho(\boldsymbol{\beta} \mid \sigma^2, \mathbf{y}, \mathbf{X})\,\rho(\sigma^2 \mid \mathbf{y}, \mathbf{X}),$
where the two factors correspond to the densities of the $\mathcal{N}\left(\boldsymbol{\mu}_n, \sigma^2 \boldsymbol{\Lambda}_n^{-1}\right)$ and $\text{Inv-Gamma}\left(a_n, b_n\right)$ distributions, with their parameters given by
- $\boldsymbol{\Lambda}_n = (\mathbf{X}^{\rm T}\mathbf{X} + \mathbf{\Lambda}_0), \quad \boldsymbol{\mu}_n = (\boldsymbol{\Lambda}_n)^{-1}(\mathbf{X}^{\rm T}\mathbf{X}\hat{\boldsymbol{\beta}} + \boldsymbol{\Lambda}_0 \boldsymbol{\mu}_0),$
- $a_n = a_0 + \frac{n}{2}, \qquad b_n = b_0 + \frac{1}{2}(\mathbf{y}^{\rm T}\mathbf{y} + \boldsymbol{\mu}_0^{\rm T}\boldsymbol{\Lambda}_0\boldsymbol{\mu}_0 - \boldsymbol{\mu}_n^{\rm T}\boldsymbol{\Lambda}_n\boldsymbol{\mu}_n).$
This can be interpreted as Bayesian learning, in which the parameters are updated according to the following equations (see the sketch after these equations):
- $\boldsymbol{\mu}_n = (\mathbf{X}^{\rm T}\mathbf{X} + \boldsymbol{\Lambda}_0)^{-1}(\boldsymbol{\Lambda}_0 \boldsymbol{\mu}_0 + \mathbf{X}^{\rm T}\mathbf{X}\hat{\boldsymbol{\beta}}) = (\mathbf{X}^{\rm T}\mathbf{X} + \boldsymbol{\Lambda}_0)^{-1}(\boldsymbol{\Lambda}_0 \boldsymbol{\mu}_0 + \mathbf{X}^{\rm T}\mathbf{y}),$
- $\boldsymbol{\Lambda}_n = (\mathbf{X}^{\rm T}\mathbf{X} + \boldsymbol{\Lambda}_0),$
- $a_n = a_0 + \frac{n}{2},$
- $b_n = b_0 + \frac{1}{2}(\mathbf{y}^{\rm T}\mathbf{y} + \boldsymbol{\mu}_0^{\rm T}\boldsymbol{\Lambda}_0\boldsymbol{\mu}_0 - \boldsymbol{\mu}_n^{\rm T}\boldsymbol{\Lambda}_n\boldsymbol{\mu}_n).$
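The four update equations translate directly into code; a minimal sketch, assuming the same hypothetical `X`, `y`, and prior hyperparameters as in the earlier snippets:

```python
def posterior_update(X, y, mu0, Lambda0, a0, b0):
    """Conjugate posterior update for Bayesian linear regression."""
    n = X.shape[0]
    Lambda_n = X.T @ X + Lambda0
    # mu_n = Lambda_n^{-1} (Lambda0 mu0 + X^T y); solve() avoids an explicit inverse.
    mu_n = np.linalg.solve(Lambda_n, Lambda0 @ mu0 + X.T @ y)
    a_n = a0 + n / 2
    b_n = b0 + 0.5 * (y @ y + mu0 @ Lambda0 @ mu0 - mu_n @ Lambda_n @ mu_n)
    return mu_n, Lambda_n, a_n, b_n
```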
Model evidence
The model evidence $p(\mathbf{y} \mid m)$ is the probability of the data given the model $m$. It is also known as the marginal likelihood and as the prior predictive density. Here the model is defined by the likelihood function $p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma)$ and the prior distribution of the parameters, i.e. $p(\boldsymbol{\beta}, \sigma)$. The model evidence captures in a single number how well the model explains the observations. The evidence of the Bayesian linear regression model presented in this section can be used to compare competing linear models by Bayesian model comparison. These models may differ in the number and values of the predictor variables, as well as in their priors on the model parameters. Model complexity is already taken into account by the evidence, since it marginalizes out the parameters by integrating $p(\mathbf{y}, \boldsymbol{\beta}, \sigma \mid \mathbf{X})$ over all possible values of $\boldsymbol{\beta}$ and $\sigma$:
- $p(\mathbf{y} \mid m) = \int p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma)\, p(\boldsymbol{\beta}, \sigma)\, d\boldsymbol{\beta}\, d\sigma$
This integral can be computed analytically, and the solution is given by the following equation [3]:
- $p(\mathbf{y} \mid m) = \frac{1}{(2\pi)^{n/2}} \sqrt{\frac{\det(\boldsymbol{\Lambda}_0)}{\det(\boldsymbol{\Lambda}_n)}} \cdot \frac{b_0^{a_0}}{b_n^{a_n}} \cdot \frac{\Gamma(a_n)}{\Gamma(a_0)}$
Here $\Gamma$ denotes the gamma function. Since we have chosen a conjugate prior, the marginal likelihood can also be easily computed by evaluating the following equality for arbitrary values of $\boldsymbol{\beta}$ and $\sigma$:
- $p(\mathbf{y} \mid m) = \frac{p(\boldsymbol{\beta}, \sigma \mid m)\, p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma, m)}{p(\boldsymbol{\beta}, \sigma \mid \mathbf{y}, \mathbf{X}, m)}$
Note that this equation is nothing more than a rearrangement of Bayes' theorem. Inserting the formulas for the prior, the likelihood, and the posterior and simplifying the resulting expression leads to the analytic expression given above.
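For numerical stability the evidence is best computed on the log scale, so that products of determinants, powers, and gamma functions become sums. A sketch that reuses the hypothetical `posterior_update` helper and prior hyperparameters from above:

```python
from scipy.special import gammaln

def log_evidence(n, Lambda0, Lambda_n, a0, b0, a_n, b_n):
    """log p(y|m) from the closed-form expression above."""
    _, logdet0 = np.linalg.slogdet(Lambda0)
    _, logdet_n = np.linalg.slogdet(Lambda_n)
    return (-0.5 * n * np.log(2 * np.pi)
            + 0.5 * (logdet0 - logdet_n)
            + a0 * np.log(b0) - a_n * np.log(b_n)
            + gammaln(a_n) - gammaln(a0))

mu_n, Lambda_n, a_n, b_n = posterior_update(X, y, mu0, Lambda0, a0, b0)
print(log_evidence(len(y), Lambda0, Lambda_n, a0, b0, a_n, b_n))
```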
In general, it may be impossible or impractical to derive the posterior distribution analytically. However, it is possible to approximate the posterior by an approximate Bayesian inference method, such as Monte Carlo sampling [4] or variational Bayes.
The special case $\boldsymbol{\mu}_0 = 0,\ \mathbf{\Lambda}_0 = c\mathbf{E}$, where $\mathbf{E}$ is the identity matrix, is called ridge regression.
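A quick numerical check of this correspondence, reusing the hypothetical `posterior_update` helper and the simulated `X`, `y` from the earlier sketches: with a zero prior mean and precision $c\mathbf{E}$, the posterior mean coincides with the ridge estimate.

```python
c = 0.1                                   # illustrative regularization strength
mu_n_ridge, _, _, _ = posterior_update(X, y, np.zeros(k), c * np.eye(k), a0, b0)
ridge = np.linalg.solve(X.T @ X + c * np.eye(k), X.T @ y)
assert np.allclose(mu_n_ridge, ridge)     # posterior mean equals the ridge solution
```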
A similar analysis can be performed for the general case of multivariate regression, part of which provides for Bayesian estimation of covariance matrices: see Bayesian multivariate linear regression.