
Bayesian linear regression

Bayesian linear regression is an approach to linear regression in which the statistical analysis is carried out within the framework of Bayesian inference. When the regression model has errors with a normal distribution, and a particular form of prior distribution is assumed, explicit results are available for the posterior probability distributions of the model parameters.

Content

  • 1 Model Configuration
  • 2 Regression with conjugate priors
    • 2.1 Conjugate prior distribution
    • 2.2 Posterior distribution
    • 2.3 Model validity
  • 3 Other cases
  • 4 See also
  • 5 Notes
  • 6 Literature
  • 7 Software

Model Configuration

Consider the standard linear regression problem, in which for $i = 1, \ldots, n$ we specify the mean of the conditional distribution of $y_i$ given a $k \times 1$ predictor vector $\mathbf{x}_i$:

$$y_i = \mathbf{x}_i^{\mathrm{T}} \boldsymbol{\beta} + \epsilon_i,$$

where $\boldsymbol{\beta}$ is a $k \times 1$ vector, and the $\epsilon_i$ are independent and identically distributed normal random variables:

$$\epsilon_i \sim N(0, \sigma^2).$$

This corresponds to the following likelihood function :

$$\rho(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) \propto (\sigma^2)^{-n/2} e^{-\frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\mathrm{T}}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})}.$$

The ordinary least squares solution is to estimate the coefficient vector as

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^{\mathrm{T}} \mathbf{X})^{-1} \mathbf{X}^{\mathrm{T}} \mathbf{y},$$

where $\mathbf{X}$ is an $n \times k$ design matrix, each row of which is a predictor vector $\mathbf{x}_i^{\mathrm{T}}$, and $\mathbf{y}$ is the column vector $[y_1 \; \cdots \; y_n]^{\mathrm{T}}$.

This is the frequentist approach, and it assumes that there are enough measurements to say something meaningful about $\boldsymbol{\beta}$. In the Bayesian approach, the data are supplemented with additional information in the form of a prior probability distribution. Prior beliefs about the parameters are combined with the likelihood function of the data according to Bayes' theorem to obtain posterior beliefs about the parameters $\boldsymbol{\beta}$ and $\sigma$. The prior can take various forms depending on the field of application and the information available a priori.
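As an illustration, the least squares estimate above can be computed directly. A minimal sketch, assuming NumPy; the data sizes and coefficient values below are made up for the example:

```python
import numpy as np

# Synthetic data: n = 50 observations, k = 3 predictors (hypothetical values).
rng = np.random.default_rng(0)
n, k = 50, 3
X = rng.normal(size=(n, k))
beta_true = np.array([1.5, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# Ordinary least squares estimate: beta_hat = (X^T X)^{-1} X^T y.
# np.linalg.solve is preferred over forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to beta_true, up to sampling error
```

Solving the normal equations with `solve` rather than `inv` is the usual numerically safer choice.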

Regression with conjugate priors

Conjugate prior distribution

For an arbitrary prior distribution, there may be no analytical solution for the posterior distribution. In this section we consider a so-called conjugate prior, for which the posterior distribution can be derived analytically.

A prior distribution $\rho(\boldsymbol{\beta}, \sigma^2)$ is conjugate to this likelihood function if it has the same functional form with respect to $\boldsymbol{\beta}$ and $\sigma$. Since the log-likelihood is quadratic in $\boldsymbol{\beta}$, we rewrite it so that the likelihood becomes normal in $(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})$. We write

$$(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\mathrm{T}}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^{\mathrm{T}}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) + (\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})^{\mathrm{T}}(\mathbf{X}^{\mathrm{T}}\mathbf{X})(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}).$$

The likelihood is now rewritten as

$$\rho(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2) \propto (\sigma^2)^{-v/2} e^{-\frac{v s^2}{2\sigma^2}} (\sigma^2)^{-(n-v)/2} e^{-\frac{1}{2\sigma^2}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})^{\mathrm{T}}(\mathbf{X}^{\mathrm{T}}\mathbf{X})(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})},$$

where

$$v s^2 = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})^{\mathrm{T}}(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \quad \text{and} \quad v = n - k,$$

where $k$ is the number of regression coefficients.

This suggests the form of the prior:

$$\rho(\boldsymbol{\beta}, \sigma^2) = \rho(\sigma^2)\, \rho(\boldsymbol{\beta} \mid \sigma^2),$$

where $\rho(\sigma^2)$ is an inverse-gamma distribution:

$$\rho(\sigma^2) \propto (\sigma^2)^{-\frac{v_0}{2} - 1} e^{-\frac{v_0 s_0^2}{2\sigma^2}}.$$

In the notation of the inverse-gamma distribution, this is the density of $\text{Inv-Gamma}(a_0, b_0)$ with $a_0 = \tfrac{v_0}{2}$ and $b_0 = \tfrac{1}{2} v_0 s_0^2$, where $v_0$ and $s_0^2$ are the prior values of $v$ and $s^2$, respectively. Equivalently, this density can be described as a scaled inverse chi-squared distribution, $\text{Scale-inv-}\chi^2(v_0, s_0^2)$.

Further, the conditional prior density $\rho(\boldsymbol{\beta} \mid \sigma^2)$ is a normal distribution,

$$\rho(\boldsymbol{\beta} \mid \sigma^2) \propto (\sigma^2)^{-\frac{k}{2}} e^{-\frac{1}{2\sigma^2}(\boldsymbol{\beta} - \boldsymbol{\mu}_0)^{\mathrm{T}} \boldsymbol{\Lambda}_0 (\boldsymbol{\beta} - \boldsymbol{\mu}_0)}.$$

In the notation of the normal distribution, the conditional prior is $\mathcal{N}\left(\boldsymbol{\mu}_0, \sigma^2 \boldsymbol{\Lambda}_0^{-1}\right)$.

Posterior distribution

Given this prior, the posterior distribution can be expressed as

$$\begin{aligned}
\rho(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) &\propto \rho(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma^2)\, \rho(\boldsymbol{\beta} \mid \sigma^2)\, \rho(\sigma^2) \\
&\propto (\sigma^2)^{-n/2} e^{-\frac{1}{2\sigma^2}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\mathrm{T}}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})} \\
&\times (\sigma^2)^{-k/2} e^{-\frac{1}{2\sigma^2}(\boldsymbol{\beta} - \boldsymbol{\mu}_0)^{\mathrm{T}} \boldsymbol{\Lambda}_0 (\boldsymbol{\beta} - \boldsymbol{\mu}_0)} \\
&\times (\sigma^2)^{-(a_0 + 1)} e^{-\frac{b_0}{\sigma^2}}.
\end{aligned}$$

After some transformations [1], the posterior can be rewritten so that the posterior mean $\boldsymbol{\mu}_n$ of the parameter vector $\boldsymbol{\beta}$ is expressed in terms of the least squares estimate $\hat{\boldsymbol{\beta}}$ and the prior mean $\boldsymbol{\mu}_0$, with the strength of the prior expressed by the prior precision matrix $\boldsymbol{\Lambda}_0$:

$$\boldsymbol{\mu}_n = (\mathbf{X}^{\mathrm{T}}\mathbf{X} + \boldsymbol{\Lambda}_0)^{-1} (\mathbf{X}^{\mathrm{T}}\mathbf{X} \hat{\boldsymbol{\beta}} + \boldsymbol{\Lambda}_0 \boldsymbol{\mu}_0).$$

To confirm that $\boldsymbol{\mu}_n$ is indeed the posterior mean, the quadratic terms in the exponent can be rearranged into a quadratic form in $\boldsymbol{\beta} - \boldsymbol{\mu}_n$ [2]:

$$(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^{\mathrm{T}}(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) + (\boldsymbol{\beta} - \boldsymbol{\mu}_0)^{\mathrm{T}} \boldsymbol{\Lambda}_0 (\boldsymbol{\beta} - \boldsymbol{\mu}_0) = (\boldsymbol{\beta} - \boldsymbol{\mu}_n)^{\mathrm{T}}(\mathbf{X}^{\mathrm{T}}\mathbf{X} + \boldsymbol{\Lambda}_0)(\boldsymbol{\beta} - \boldsymbol{\mu}_n) + \mathbf{y}^{\mathrm{T}}\mathbf{y} - \boldsymbol{\mu}_n^{\mathrm{T}}(\mathbf{X}^{\mathrm{T}}\mathbf{X} + \boldsymbol{\Lambda}_0)\boldsymbol{\mu}_n + \boldsymbol{\mu}_0^{\mathrm{T}} \boldsymbol{\Lambda}_0 \boldsymbol{\mu}_0.$$

Now the posterior distribution can be expressed as a normal distribution times an inverse-gamma distribution:

$$\begin{aligned}
\rho(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) &\propto (\sigma^2)^{-\frac{k}{2}} e^{-\frac{1}{2\sigma^2}(\boldsymbol{\beta} - \boldsymbol{\mu}_n)^{\mathrm{T}}(\mathbf{X}^{\mathrm{T}}\mathbf{X} + \boldsymbol{\Lambda}_0)(\boldsymbol{\beta} - \boldsymbol{\mu}_n)} \\
&\times (\sigma^2)^{-\frac{n + 2a_0}{2} - 1} e^{-\frac{2b_0 + \mathbf{y}^{\mathrm{T}}\mathbf{y} - \boldsymbol{\mu}_n^{\mathrm{T}}(\mathbf{X}^{\mathrm{T}}\mathbf{X} + \boldsymbol{\Lambda}_0)\boldsymbol{\mu}_n + \boldsymbol{\mu}_0^{\mathrm{T}}\boldsymbol{\Lambda}_0\boldsymbol{\mu}_0}{2\sigma^2}}.
\end{aligned}$$

Therefore, the posterior distribution can be parameterized as follows.

$$\rho(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}, \mathbf{X}) \propto \rho(\boldsymbol{\beta} \mid \sigma^2, \mathbf{y}, \mathbf{X})\, \rho(\sigma^2 \mid \mathbf{y}, \mathbf{X}),$$

where the two factors correspond to the densities of the $\mathcal{N}\left(\boldsymbol{\mu}_n, \sigma^2 \boldsymbol{\Lambda}_n^{-1}\right)$ and $\text{Inv-Gamma}(a_n, b_n)$ distributions, with parameters given by

$$\boldsymbol{\Lambda}_n = \mathbf{X}^{\mathrm{T}}\mathbf{X} + \boldsymbol{\Lambda}_0, \qquad \boldsymbol{\mu}_n = \boldsymbol{\Lambda}_n^{-1}(\mathbf{X}^{\mathrm{T}}\mathbf{X}\hat{\boldsymbol{\beta}} + \boldsymbol{\Lambda}_0 \boldsymbol{\mu}_0),$$
$$a_n = a_0 + \frac{n}{2}, \qquad b_n = b_0 + \frac{1}{2}\left(\mathbf{y}^{\mathrm{T}}\mathbf{y} + \boldsymbol{\mu}_0^{\mathrm{T}}\boldsymbol{\Lambda}_0\boldsymbol{\mu}_0 - \boldsymbol{\mu}_n^{\mathrm{T}}\boldsymbol{\Lambda}_n\boldsymbol{\mu}_n\right).$$

This can be interpreted as Bayesian learning, in which the parameters are updated according to the following equalities:

$$\boldsymbol{\mu}_n = (\mathbf{X}^{\mathrm{T}}\mathbf{X} + \boldsymbol{\Lambda}_0)^{-1}(\boldsymbol{\Lambda}_0 \boldsymbol{\mu}_0 + \mathbf{X}^{\mathrm{T}}\mathbf{X}\hat{\boldsymbol{\beta}}) = (\mathbf{X}^{\mathrm{T}}\mathbf{X} + \boldsymbol{\Lambda}_0)^{-1}(\boldsymbol{\Lambda}_0 \boldsymbol{\mu}_0 + \mathbf{X}^{\mathrm{T}}\mathbf{y}),$$
$$\boldsymbol{\Lambda}_n = \mathbf{X}^{\mathrm{T}}\mathbf{X} + \boldsymbol{\Lambda}_0,$$
$$a_n = a_0 + \frac{n}{2},$$
$$b_n = b_0 + \frac{1}{2}\left(\mathbf{y}^{\mathrm{T}}\mathbf{y} + \boldsymbol{\mu}_0^{\mathrm{T}}\boldsymbol{\Lambda}_0\boldsymbol{\mu}_0 - \boldsymbol{\mu}_n^{\mathrm{T}}\boldsymbol{\Lambda}_n\boldsymbol{\mu}_n\right).$$
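The update equalities above translate directly into code. A minimal sketch, assuming NumPy; the prior values and synthetic data below are hypothetical:

```python
import numpy as np

def bayes_linreg_update(X, y, mu0, Lambda0, a0, b0):
    """Conjugate normal/inverse-gamma update for Bayesian linear regression.

    Returns the posterior parameters (mu_n, Lambda_n, a_n, b_n) of
    N(mu_n, sigma^2 * Lambda_n^{-1}) x Inv-Gamma(a_n, b_n).
    """
    n = len(y)
    Lambda_n = X.T @ X + Lambda0
    mu_n = np.linalg.solve(Lambda_n, Lambda0 @ mu0 + X.T @ y)
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * (y @ y + mu0 @ Lambda0 @ mu0 - mu_n @ Lambda_n @ mu_n)
    return mu_n, Lambda_n, a_n, b_n

# Example with a weak prior (all values made up for illustration).
rng = np.random.default_rng(1)
n, k = 100, 2
X = rng.normal(size=(n, k))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=n)
mu0, Lambda0 = np.zeros(k), 0.01 * np.eye(k)
mu_n, Lambda_n, a_n, b_n = bayes_linreg_update(X, y, mu0, Lambda0, a0=1.0, b0=1.0)
print(mu_n)             # posterior mean, close to [2, -1]
print(b_n / (a_n - 1))  # posterior mean of sigma^2, close to 0.25
```

With a weak prior the posterior mean is pulled only slightly away from the least squares estimate; increasing $\boldsymbol{\Lambda}_0$ pulls it toward $\boldsymbol{\mu}_0$.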

Model validity

The model validity $p(\mathbf{y} \mid m)$ is the probability of the data given the model $m$. It is also known as the marginal likelihood, the model evidence, and the prior predictive density. Here the model is defined by the likelihood function $p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma)$ and the prior distribution of the parameters, $p(\boldsymbol{\beta}, \sigma)$. The model validity captures in a single number how well the model explains the observations. The validity of the Bayesian linear regression model presented in this section can be used to compare competing linear models by Bayesian model comparison. These models may differ in the number and choice of predictor variables, as well as in their priors on the model parameters. Model complexity is automatically taken into account, since the validity marginalizes out the parameters by integrating $p(\mathbf{y}, \boldsymbol{\beta}, \sigma \mid \mathbf{X})$ over all possible values of $\boldsymbol{\beta}$ and $\sigma$:

$$p(\mathbf{y} \mid m) = \int p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma)\, p(\boldsymbol{\beta}, \sigma)\, d\boldsymbol{\beta}\, d\sigma.$$

This integral can be computed analytically, and the solution is given by the following equality [3]:

$$p(\mathbf{y} \mid m) = \frac{1}{(2\pi)^{n/2}} \sqrt{\frac{\det(\boldsymbol{\Lambda}_0)}{\det(\boldsymbol{\Lambda}_n)}} \cdot \frac{b_0^{a_0}}{b_n^{a_n}} \cdot \frac{\Gamma(a_n)}{\Gamma(a_0)}.$$

Here $\Gamma$ denotes the gamma function. Since we have chosen a conjugate prior, the marginal likelihood can also be computed easily by evaluating the following equality at arbitrary values of $\boldsymbol{\beta}$ and $\sigma$:

$$p(\mathbf{y} \mid m) = \frac{p(\boldsymbol{\beta}, \sigma \mid m)\, p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\beta}, \sigma, m)}{p(\boldsymbol{\beta}, \sigma \mid \mathbf{y}, \mathbf{X}, m)}.$$

Note that this equality is nothing more than a rearrangement of Bayes' theorem. Substituting the formulas for the prior, the likelihood, and the posterior, and simplifying the resulting expression, leads to the analytical expression given above.
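In practice the analytical expression is best evaluated in log space, using log-determinants and the log-gamma function, to avoid overflow for large $n$. A sketch under the same conjugate model, with hypothetical priors and synthetic data in which only the first predictor matters:

```python
import numpy as np
from scipy.special import gammaln

def log_evidence(X, y, mu0, Lambda0, a0, b0):
    """Log marginal likelihood log p(y | m) for the conjugate
    normal/inverse-gamma Bayesian linear regression model."""
    n = len(y)
    Lambda_n = X.T @ X + Lambda0
    mu_n = np.linalg.solve(Lambda_n, Lambda0 @ mu0 + X.T @ y)
    a_n = a0 + n / 2.0
    b_n = b0 + 0.5 * (y @ y + mu0 @ Lambda0 @ mu0 - mu_n @ Lambda_n @ mu_n)
    _, logdet0 = np.linalg.slogdet(Lambda0)
    _, logdetn = np.linalg.slogdet(Lambda_n)
    return (-0.5 * n * np.log(2.0 * np.pi)
            + 0.5 * (logdet0 - logdetn)
            + a0 * np.log(b0) - a_n * np.log(b_n)
            + gammaln(a_n) - gammaln(a0))

# Compare two nested models: first predictor only vs. both predictors.
rng = np.random.default_rng(2)
n = 80
X_full = rng.normal(size=(n, 2))
y = X_full @ np.array([1.0, 0.0]) + rng.normal(scale=0.5, size=n)
evidences = [log_evidence(X_full[:, :k], y, np.zeros(k), np.eye(k), 1.0, 1.0)
             for k in (1, 2)]
print(evidences)
```

Comparing nested models on the same $\mathbf{y}$, the model with the higher log evidence is preferred; the Occam penalty for a superfluous predictor enters through the $\det(\boldsymbol{\Lambda}_0)/\det(\boldsymbol{\Lambda}_n)$ and $b_n^{a_n}$ factors.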

Other cases

In the general case, it may be impossible or impractical to derive the posterior distribution analytically. However, the posterior can be approximated by an approximate Bayesian inference method such as Monte Carlo sampling [4] or variational Bayes.
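For the conjugate model of the previous section, exact Monte Carlo samples of the posterior are available by composition: draw $\sigma^2$ from the inverse-gamma marginal, then $\boldsymbol{\beta}$ from the conditional normal. This provides a useful baseline for the approximate methods mentioned above; a sketch with hypothetical posterior parameters:

```python
import numpy as np

def sample_posterior(mu_n, Lambda_n, a_n, b_n, size, rng):
    """Joint samples (beta, sigma^2) from the conjugate posterior:
    sigma^2 ~ Inv-Gamma(a_n, b_n), then
    beta | sigma^2 ~ N(mu_n, sigma^2 * Lambda_n^{-1})."""
    # If G ~ Gamma(shape=a, scale=1/b), then 1/G ~ Inv-Gamma(a, b).
    sigma2 = 1.0 / rng.gamma(shape=a_n, scale=1.0 / b_n, size=size)
    # Cholesky factor of the covariance scale Lambda_n^{-1}.
    L = np.linalg.cholesky(np.linalg.inv(Lambda_n))
    z = rng.normal(size=(size, len(mu_n)))
    beta = mu_n + np.sqrt(sigma2)[:, None] * (z @ L.T)
    return beta, sigma2

# Hypothetical posterior parameters for a 2-coefficient model.
mu_n = np.array([1.0, -1.0])
Lambda_n = 100.0 * np.eye(2)
betas, sig2 = sample_posterior(mu_n, Lambda_n, a_n=50.0, b_n=12.5,
                               size=10_000, rng=np.random.default_rng(3))
print(betas.mean(axis=0))  # close to mu_n
print(sig2.mean())         # close to b_n / (a_n - 1) = 12.5 / 49
```

Such draws can be pushed through any function of the parameters (predictions, intervals) to approximate posterior expectations by sample averages.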

The special case $\boldsymbol{\mu}_0 = 0,\ \boldsymbol{\Lambda}_0 = c\mathbf{E}$ (where $\mathbf{E}$ is the identity matrix) is called ridge regression.
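A quick numerical check of this equivalence: with $\boldsymbol{\mu}_0 = 0$ and $\boldsymbol{\Lambda}_0 = c\mathbf{E}$, the posterior mean is $(\mathbf{X}^{\mathrm{T}}\mathbf{X} + c\mathbf{E})^{-1}\mathbf{X}^{\mathrm{T}}\mathbf{y}$, which satisfies the first-order condition of the ridge objective. The data sizes and penalty value below are made up:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, 0.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=30)
c = 2.0

# Posterior mean under mu_0 = 0, Lambda_0 = c*E:
# mu_n = (X^T X + c E)^{-1} X^T y, i.e. exactly the ridge estimator.
mu_n = np.linalg.solve(X.T @ X + c * np.eye(4), X.T @ y)

# Ridge regression minimizes ||y - X b||^2 + c ||b||^2, whose gradient
# 2 X^T (X b - y) + 2 c b must vanish at the minimizer.
grad = X.T @ (X @ mu_n - y) + c * mu_n
print(np.max(np.abs(grad)))  # zero up to floating-point error
```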

A similar analysis can be performed for the general case of multivariate regression, part of which provides for Bayesian estimation of covariance matrices: see Bayesian multivariate linear regression.

See also

  • Tikhonov regularization method

Notes

  1. ↑ Intermediate calculations can be found in O'Hagan (1994) at the beginning of the chapter on linear models.
  2. ↑ Intermediate calculations can be found in Fahrmeir et al. (2009) on p. 188.
  3. ↑ Intermediate calculations can be found in O'Hagan (1994) on p. 257.
  4. ↑ Carlin and Louis (2008) and Gelman et al. (2003) explain how to use sampling methods for Bayesian linear regression.

Literature

  • Box G. E. P., Tiao G. C. Bayesian Inference in Statistical Analysis. Wiley, 1973. ISBN 0-471-57428-7.
  • Carlin B. P., Louis T. A. Bayesian Methods for Data Analysis, 3rd ed. Boca Raton, FL: Chapman and Hall/CRC, 2008. ISBN 1-58488-697-8.
  • Fahrmeir L., Kneib T., Lang S. Regression. Modelle, Methoden und Anwendungen. 2nd ed. Heidelberg: Springer, 2009. ISBN 978-3-642-01836-7. DOI: 10.1007/978-3-642-01837-4.
  • Fornalski K. W., Parzych G., Pylak M., Satuła D., Dobrzyński L. Application of Bayesian reasoning and the Maximum Entropy Method to some reconstruction problems // Acta Physica Polonica A. 2010. Vol. 117, no. 6. P. 892-899. DOI: 10.12693/APhysPolA.117.892.
  • Fornalski K. W. Applications of the robust Bayesian regression analysis // International Journal of Society Systems Science. 2015. Vol. 7, no. 4. P. 314-333. DOI: 10.1504/IJSSS.2015.07.07233.
  • Gelman A., Carlin J. B., Stern H. S., Rubin D. B. Bayesian Data Analysis, 2nd ed. Boca Raton, FL: Chapman and Hall/CRC, 2003. ISBN 1-58488-388-X.
  • Goldstein M., Wooff D. Bayes Linear Statistics, Theory & Methods. Wiley, 2007. ISBN 978-0-470-01562-9.
  • Minka T. P. (2001) Bayesian Linear Regression, Microsoft research web page.
  • Rossi P. E., Allenby G. M., McCulloch R. Bayesian Statistics and Marketing. John Wiley & Sons, 2006. ISBN 0470863676.
  • O'Hagan A. Bayesian Inference. 1st ed. Halsted, 1994. Vol. 2B of Kendall's Advanced Theory of Statistics. ISBN 0-340-52922-9.
  • Sivia D. S., Skilling J. Data Analysis - A Bayesian Tutorial. 2nd ed. Oxford University Press, 2006.
  • Walter G., Augustin T. Bayesian Linear Regression — Different Conjugate Models and Their (In)Sensitivity to Prior-Data Conflict // Technical Report Number 069, Department of Statistics, University of Munich. 2009.

Software

  • Python
    • Bayesian Type-II Linear Regression code , tutorial
    • ARD Linear Regression code
    • ARD Linear Regression with kernelized features code , tutorial
Source - https://ru.wikipedia.org/w/index.php?title=Bayesian_linear_regression&oldid=97433012

