Least squares, method of
A method in the theory of errors (cf. [[Errors, theory of|Errors, theory of]]) for estimating unknown quantities on the basis of results of measurement involving random errors. It is also used for the approximate representation of a given function by other (simpler) functions and it often proves useful for the processing of observations. It was first proposed by C.F. Gauss (1794–1795) and A. Legendre (1805–1806). The establishment of a rigorous basis for the method of least squares and the delineation of the limits of meaningful applicability of it were provided by A.A. Markov and A.N. Kolmogorov. In the simplest case, that of linear relationships (see below) and observations containing no systematic errors but only random ones, the least-square estimators of the unknown quantities are linear functions of the observed values. These estimators involve no systematic errors, i.e. they are unbiased (see [[Unbiased estimator|Unbiased estimator]]). If the random errors in the observations are independent and follow a normal distribution law, the method yields estimators with minimum variance, i.e. they are efficient estimators (see [[Statistical estimation|Statistical estimation]]). It is in this sense that the method of least squares is optimal among all other methods for the determination of unbiased estimators. However, if the distribution of the random errors significantly departs from the normal one, the method is no longer necessarily the best.
 
In laying rigorous foundations for the method of least squares (in the Gaussian version), one assumes that the "loss" incurred by replacing the (unknown) exact value of some quantity $\mu$ by an approximate value $X$, calculated on the basis of observations, is proportional to the squared error $(X - \mu)^2$; as an optimal estimator one takes a value of $X$, free of systematic errors, for which the mean $ {\mathsf E} (X - \mu)^2 $ of the "loss" is minimal. This requirement is fundamental to the method of least squares. In the general case, the task of finding an optimal estimator $X$ — in the least-square sense — is extremely complex, and so in practice one confines oneself to a narrower goal: one takes $X$ to be a linear function of the observed results which involves no systematic errors, and such that the mean error is minimal relative to the class of all linear functions. If the random errors of the observations are normally distributed and the estimated quantity $\mu$ depends linearly on the mean values of the observed results (a case which arises very frequently in applications of the method of least squares), the solution of this restricted problem turns out to be also the solution of the general problem. Under these conditions, the optimal estimator $X$ also obeys a normal law with mean $\mu$, and so the probability density of $X$,

$$
p ( x ; \mu , \sigma ) = \frac{1}{\sigma \sqrt {2 \pi } } \mathop{\rm exp} \left \{ - \frac{( x - \mu )^2 }{2 \sigma^2 } \right \} ,
$$

attains a maximum at $x = \mu$ (this property is a rigorous expression of the statement, very common in the theory of errors, that "the estimator $X$ furnished by the method of least squares is the most probable value of the unknown parameter $\mu$").
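
The equivalence between the least-squares estimate and the most probable (maximum-likelihood) value under normal errors can be checked numerically. The sketch below is illustrative only and not part of the original article; the simulated data, the value of $\sigma$ and the search grid are arbitrary assumptions.

<pre>
# Numerical check (sketch): for normal errors, the value minimizing the sum of
# squared deviations also maximizes the likelihood; both equal the sample mean.
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma = 5.0, 0.3
Y = mu_true + sigma * rng.standard_normal(50)        # simulated observations

grid = np.linspace(4.0, 6.0, 2001)                   # candidate values of mu
sum_sq = ((Y[:, None] - grid[None, :]) ** 2).sum(axis=0)
log_lik = -sum_sq / (2 * sigma**2)                   # normal log-likelihood up to a constant

print(grid[sum_sq.argmin()], grid[log_lik.argmax()], Y.mean())
# the three values agree (up to the grid spacing)
</pre>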
  
 
==The case of one unknown.==
 
Assume that an unknown constant $\mu$ is to be estimated on the basis of $n$ independent observations, which have yielded results $Y_1, \dots, Y_n$, i.e. $Y_1 = \mu + \delta_1, \dots, Y_n = \mu + \delta_n$, where the $\delta_1, \dots, \delta_n$ are random errors (by the usual definition in the classical theory of errors, random errors are independent random variables with expectation zero: $ {\mathsf E} \delta_i = 0 $; if $ {\mathsf E} \delta_i \neq 0 $, then the $\delta_i$ are called systematic errors). According to the method of least squares, one adopts an estimator $X$ for $\mu$ that minimizes the sum of the squares (hence the name of the method)
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l05778024.png" /></td> </tr></table>
+
$$
 +
S ( X)  = \
 +
\sum _ {i = 1 } ^ { n }  p _ {i} ( X - Y _ {i} )  ^ {2} ,
 +
$$
  
 
where
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l05778025.png" /></td> </tr></table>
+
$$
 +
p _ {i}  =
 +
\frac{k}{\sigma _ {i}  ^ {2} }
 +
\ \
 +
\textrm{ and } \ \
 +
\sigma _ {i}  ^ {2}  = {\mathsf D} \delta _ {i}  = \
 +
{\mathsf E} \delta _ {i}  ^ {2}
 +
$$
  
(the coefficient $k > 0$ can be arbitrarily chosen). The number $p_i$ is known as the weight and $\sigma_i^2$ as the squared deviation of the $i$-th measurement. In particular, if all measurements are equally precise, then $\sigma_1 = \dots = \sigma_n$, and one can then put $p_1 = \dots = p_n = 1$; but if each $Y_i$ is the arithmetic mean of $n_i$ equally-precise measurements, one puts $p_i = n_i$.
  
The sum $S ( X)$ will be minimal if $X$ is chosen to be the weighted average
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l05778037.png" /></td> </tr></table>
+
$$
 +
= \overline{Y}\; = {
 +
\frac{1}{P}
 +
} \sum p _ {i} Y _ {i} ,\ \
 +
\textrm{ where } \  P  = \sum p _ {i} .
 +
$$
  
The estimator $\overline{Y}$ for $\mu$ is unbiased, has weight $P$ and variance $ {\mathsf D} \overline{Y} = k/P $. In particular, if all measurements are equally precise, then $\overline{Y}$ is the arithmetic mean of the results of measurement:
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l05778043.png" /></td> </tr></table>
+
$$
 +
\overline{Y}\; = {
 +
\frac{1}{n}
 +
} \sum Y _ {i} \ \
 +
\textrm{ and } \ \
 +
{\mathsf D} \overline{Y}\; =
 +
\frac{\sigma  ^ {2} }{n}
 +
.
 +
$$
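
For concreteness, the following sketch (not part of the original article; the measurements, their standard deviations and the choice $k = 1$ are arbitrary) computes the weighted average $\overline{Y}$ and its variance $k/P$ for observations of unequal precision.

<pre>
# Weighted least-squares estimate of a single constant (sketch; arbitrary data).
import numpy as np

Y = np.array([10.1, 10.3, 9.9, 10.2])      # observed values Y_i
sigma = np.array([0.2, 0.1, 0.1, 0.3])     # standard deviations of the random errors
k = 1.0                                    # arbitrarily chosen proportionality factor
p = k / sigma**2                           # weights p_i = k / sigma_i^2

P = p.sum()
Y_bar = (p * Y).sum() / P                  # weighted average, minimizes S(X)
var_Y_bar = k / P                          # variance D(Y_bar) = k / P
print(Y_bar, var_Y_bar)
</pre>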
  
Subject to certain general assumptions one can show that if $n$, the number of observations, is sufficiently large, then the distribution of the estimator $\overline{Y}$ differs only slightly from that of the normal law with expectation $\mu$ and variance $k/P$. In this case the absolute error of the approximate equality $\mu \approx \overline{Y}$ is less than $t \sqrt {k/P}$ with probability approximately equal to
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l05778050.png" /></td> <td valign="top" style="width:5%;text-align:right;">(1)</td></tr></table>
+
$$ \tag{1 }
 +
I ( t) = \
  
(e.g. <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l05778051.png" />; <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l05778052.png" />; <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l05778053.png" />).
+
\frac{2}{\sqrt {2 \pi } }
  
If the measurement weights <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l05778054.png" /> are given and the factor <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l05778055.png" /> remains undetermined before the observations are taken, then this factor and the variance of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l05778056.png" /> can be estimated by
+
\int\limits _ { 0 } ^ { t }
 +
e ^ {- u  ^ {2} /2 }  du
 +
$$
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l05778057.png" /></td> </tr></table>
+
(e.g.  $  I ( 1.96) = 0.950 $;  
 +
$  I ( 2.58) = 0.990 $;  
 +
$  I ( 3.00) = 0.997 $).
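
The integral in (1) can be expressed through the error function, $I ( t) = \mathop{\rm erf} ( t/\sqrt 2 )$, which makes the quoted values easy to verify; a minimal sketch (not part of the original article):

<pre>
# I(t) = erf(t / sqrt(2)); checking the values quoted above.
from math import erf, sqrt

for t in (1.96, 2.58, 3.00):
    print(t, round(erf(t / sqrt(2)), 3))   # 0.95, 0.99, 0.997
</pre>
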
If the measurement weights $p_i$ are given and the factor $k$ remains undetermined before the observations are taken, then this factor and the variance of $\overline{Y}$ can be estimated by

$$
k \approx \frac{S ( \overline{Y} ) }{n - 1 }
$$
  
 
and
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l05778058.png" /></td> </tr></table>
+
$$
 +
{\mathsf D} \overline{Y}\; = \
 +
 
 +
\frac{k}{P}
 +
  \approx  s  ^ {2}  = \
 +
 
 +
\frac{S ( \overline{Y}\; ) }{( n - 1) P }
 +
 
 +
$$
  
 
(both these estimators are also unbiased).
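
A small sketch of these two estimates (not part of the original article; the data are arbitrary), for equally precise measurements with $p_i = 1$, so that $P = n$:

<pre>
# Estimating k (the variance per unit weight) and D(Y_bar) from the data
# when the sigma_i are unknown (equally precise case, p_i = 1, P = n).
import numpy as np

Y = np.array([2.31, 2.28, 2.33, 2.30, 2.27])
n = Y.size
Y_bar = Y.mean()
S_min = ((Y - Y_bar) ** 2).sum()     # S(Y_bar) with p_i = 1
k_hat = S_min / (n - 1)              # k is approximately S(Y_bar) / (n - 1)
s2 = k_hat / n                       # D(Y_bar) is approximately S(Y_bar) / ((n - 1) P)
print(Y_bar, k_hat, s2)
</pre>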
  
In the practically important case in which the errors $\delta_i$ are normally distributed, one can determine the exact probability with which the absolute error of the equality $\mu \approx \overline{Y}$ will be less than $ts$ (where $t$ is an arbitrary positive number) as

$$ \tag{2 }
I_{n - 1} ( t) = C_{n - 1} \int\limits_{0}^{t} \left ( 1 + \frac{v^2 }{n - 1 } \right )^{- n/2} dv,
$$
  
where the constant $C_{n - 1}$ is so chosen that $I_{n - 1} ( \infty ) = 1$ (the [[Student distribution|Student distribution]] with $n - 1$ degrees of freedom). For large $n$, formula (2) can be replaced by (1). For relatively small $n$, however, the use of formula (1) may cause large errors. For example, according to (1), a value $I = 0.99$ corresponds to $t = 2.58$; the true values of $t$, defined for small $n$ as the solutions of the equations $I_{n - 1} ( t) = 0.99$, are shown in the following table:

<table border="1" cellpadding="4"> <tr> <td>$n$</td> <td>$2$</td> <td>$3$</td> <td>$4$</td> <td>$5$</td> <td>$10$</td> <td>$20$</td> <td>$30$</td> </tr> <tr> <td>$t$</td> <td>$63.66$</td> <td>$9.92$</td> <td>$5.84$</td> <td>$4.60$</td> <td>$3.25$</td> <td>$2.86$</td> <td>$2.76$</td> </tr> </table>
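
These values are the $0.995$-quantiles of the Student distribution with $n - 1$ degrees of freedom (so that $I_{n - 1} ( t) = 0.99$); a quick check with SciPy (sketch, not part of the original article):

<pre>
# t solving I_{n-1}(t) = 0.99 is the 0.995-quantile of Student's t
# distribution with n - 1 degrees of freedom.
from scipy.stats import t as student_t

for n in (2, 3, 4, 5, 10, 20, 30):
    print(n, round(student_t.ppf(0.995, df=n - 1), 2))
# 63.66  9.92  5.84  4.6  3.25  2.86  2.76
</pre>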
  
 
  
Example. To determine the mass of a body, it was weighed ten times in exactly the same way, assuming the different weighings to be independent; the results $Y_i$ (in grams) were:

<table border="1" cellpadding="4"> <tr> <td>$Y_i$</td> <td>$18.41$</td> <td>$18.42$</td> <td>$18.43$</td> <td>$18.44$</td> <td>$18.45$</td> <td>$18.46$</td> </tr> <tr> <td>$n_i$</td> <td>$1$</td> <td>$3$</td> <td>$3$</td> <td>$1$</td> <td>$1$</td> <td>$1$</td> </tr> </table>
  
 
  
($n_i$ denotes the number of cases in which the mass $Y_i$ was observed; $n = \sum n_i = 10$). Since the different weighings were assumed to be equally precise, one puts $p_i = n_i$ and the estimate for the unknown weight $\mu$ is chosen to be $\overline{Y} = \sum n_i Y_i / \sum n_i = 18.431$. Putting, say, $I_9 ( t) = 0.95$ and checking the tables of the Student distribution with nine degrees of freedom, one finds that $t = 2.262$, so that the maximum absolute error in the approximate equality $\mu \approx 18.431$ is

$$
ts = t \sqrt {\sum n_i \frac{( Y_i - \overline{Y} )^2 }{90} } = 2.262 \cdot 0.0048 = 0.011.
$$
  
Thus, $18.420 < \mu < 18.442$ with probability $0.95$.
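
The arithmetic of this example is easy to reproduce; the sketch below (not part of the original article) recomputes $\overline{Y}$, $t$ and $ts$ from the tabulated data.

<pre>
# Re-deriving the numbers of the weighing example (requires SciPy).
import numpy as np
from scipy.stats import t as student_t

Y = np.array([18.41, 18.42, 18.43, 18.44, 18.45, 18.46])
n_i = np.array([1, 3, 3, 1, 1, 1])           # weights p_i = n_i
n = n_i.sum()                                # 10 observations in total

Y_bar = (n_i * Y).sum() / n                                    # 18.431
s = np.sqrt((n_i * (Y - Y_bar) ** 2).sum() / ((n - 1) * n))    # about 0.0048
t_val = student_t.ppf(0.975, df=n - 1)                         # 2.262, from I_9(t) = 0.95
print(Y_bar, round(t_val, 3), round(t_val * s, 3))             # half-width about 0.011
</pre>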
  
 
==The case of several unknowns (linear relationships).==
Suppose that $n$ measurement results $Y_1, \dots, Y_n$ are related to $m$ unknown constants $x_1, \dots, x_m$ ($m < n$) by means of independent linear relationships
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780122.png" /></td> <td valign="top" style="width:5%;text-align:right;">(3)</td></tr></table>
+
$$ \tag{3 }
 +
Y _ {i}  = \
 +
\sum _ {j = 1 } ^ { m }  a _ {ij} x _ {j} + \delta _ {i} ,\ \
 +
i = 1 \dots n,
 +
$$
  
where the $a_{ij}$ are known coefficients and the $\delta_i$ are independent random errors in the measurements. One wishes to estimate the unknown quantities $x_j$ (this problem can be regarded as a generalization of the previous problem, with $\mu = x_1$ and $m = a_{i1} = 1$; $i = 1, \dots, n$).
  
Since $ {\mathsf E} \delta_i = 0 $, the mean values of the results, $y_i = {\mathsf E} Y_i$, are related to the unknowns $x_1, \dots, x_m$ by means of the following linear equations:
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780132.png" /></td> <td valign="top" style="width:5%;text-align:right;">(4)</td></tr></table>
+
$$ \tag{4 }
 +
y _ {i}  = \
 +
\sum _ {j = 1 } ^ { m }
 +
a _ {ij} x _ {j} ,\ \
 +
i = 1 \dots n.
 +
$$
  
Consequently, the desired values of $x_j$ are a solution of the system (4) (on the assumption that the equations are consistent). The precise values of the measured magnitudes $y_i$ and the random errors $\delta_i$ are usually unknown, and so one generally replaces the systems (3) and (4) by the conditional equations:
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780136.png" /></td> </tr></table>
+
$$
 +
Y _ {i}  = \
 +
\sum _ {j = 1 } ^ { m }
 +
a _ {ij} x _ {j} ,\ \
 +
i = 1 \dots n.
 +
$$
  
In the method of least squares, estimators for the unknowns $x_j$ are the quantities $X_j$ that minimize the sum of the squared deviations
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780139.png" /></td> </tr></table>
+
$$
 +
= \
 +
\sum _ {i = 1 } ^ { n }
 +
p _ {i} \left (
 +
Y _ {i} - \sum _ {j = 1 } ^ { m }  a _ {ij} X _ {j} \right )  ^ {2} .
 +
$$
  
(As in the previous case, $p_i$ is the weight of the measurement, a quantity inversely proportional to the variance of the random error $\delta_i$.) The conditional equations are as a rule inconsistent, i.e. for arbitrary values of $X_j$ the differences
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780144.png" /></td> </tr></table>
+
$$
 +
\Delta _ {i}  = \
 +
Y _ {i} - \sum _ {j = 1 } ^ { m }  a _ {ij} X _ {j} ,\ \
 +
i = 1 \dots n,
 +
$$
  
cannot all vanish, in general. The method of least squares prescribes taking as estimators those values of $X_j$ that minimize the sum $S$. In those exceptional cases in which the conditional equations are consistent, and therefore solvable, the solution consists precisely of the estimators furnished by the method of least squares.
  
The sum $S$ of the squares is a quadratic polynomial in the variables $X_j$; this polynomial attains a minimum at the values of $X_1, \dots, X_m$ for which all the first partial derivatives vanish:
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780150.png" /></td> </tr></table>
+
$$
  
Hence it follows that the <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780151.png" /> furnished by the method of least squares must satisfy the so-called normal equations. In the notation proposed by Gauss, these equations are:
+
\frac{\partial  S }{\partial  X _ {j} }
 +
  = \
 +
- 2 \sum _ {i = 1 } ^ { n }  p _ {i} \Delta _ {i}  = 0,\ \
 +
j = 1 \dots m.
 +
$$
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780152.png" /></td> <td valign="top" style="width:5%;text-align:right;">(5)</td></tr></table>
+
Hence it follows that the  $  X _ {j} $
 +
furnished by the method of least squares must satisfy the so-called normal equations. In the notation proposed by Gauss, these equations are:
 +
 
 +
$$ \tag{5 }
 +
\sum _ {j = 1 } ^ { m }  [ pa _ {j} a _ {l} ] X _ {j}  = [ pYa _ {l} ],\ \
 +
l = 1 \dots m,
 +
$$
  
 
where
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780153.png" /></td> </tr></table>
+
$$
 +
[ pa _ {j} a _ {l} ]  = \
 +
\sum _ {i = 1 } ^ { n }
 +
p _ {i} a _ {ij} a _ {il}  $$
  
 
and
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780154.png" /></td> </tr></table>
+
$$
 +
[ pYa _ {l} ]  = \
 +
\sum _ {i = 1 } ^ { n }
 +
p _ {i} Y _ {i} a _ {il} .
 +
$$
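
In matrix notation (a standard reformulation, not used in the article itself), with $A = ( a_{ij} )$ and $W = \mathop{\rm diag} ( p_1, \dots, p_n )$, the normal equations (5) read $A^{T} W A X = A^{T} W Y$. The sketch below (not part of the original article; the design matrix, data and weights are arbitrary) forms and solves these equations, and also anticipates the estimates of $k$ and $ {\mathsf D} X_j $ given further below.

<pre>
# Forming and solving the normal equations (5) for a small weighted
# least-squares problem (arbitrary data).
import numpy as np

A = np.array([[1.0, 0.0],                 # coefficients a_ij  (n = 4, m = 2)
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
Y = np.array([1.1, 1.9, 3.2, 3.9])        # observed results Y_i
p = np.array([1.0, 1.0, 2.0, 2.0])        # weights p_i

# Gauss' brackets: [p a_j a_l] = sum_i p_i a_ij a_il,  [p Y a_l] = sum_i p_i Y_i a_il
N = A.T @ (p[:, None] * A)                # matrix of the normal equations
b = A.T @ (p * Y)
X = np.linalg.solve(N, b)                 # least-squares estimators X_j

S_min = (p * (Y - A @ X) ** 2).sum()      # minimum of the sum of squared deviations
n, m = A.shape
k_hat = S_min / (n - m)                   # k is approximately S / (n - m)
var_X = k_hat * np.diag(np.linalg.inv(N)) # D(X_j) is approximately k d_jj / d
print(X, k_hat, var_X)
</pre>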
  
Estimators obtained by solving the system of normal equations are unbiased ($ {\mathsf E} X_j = x_j $); the variances $ {\mathsf D} X_j $ are $kd_{jj} /d$, where $d$ is the determinant of the system (5) and $d_{jj}$ is the minor corresponding to the diagonal entry $[ pa_j a_j ]$ (in other words, $d/d_{jj}$ is the weight of the estimator $X_j$). If the proportionality factor $k$ (known as the variance per unit weight) is not known in advance, it can be estimated, and with it the variances $ {\mathsf D} X_j $, by the formulas:
  
<table class="eq" style="width:100%;"> <tr><td valign="top" style="width:94%;text-align:center;"><img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780165.png" /></td> </tr></table>
+
$$
 +
k  \approx 
 +
\frac{S}{n - m }
 +
\ \
 +
\textrm{ and } \ \
 +
{\mathsf D} X _ {j}  \approx \
 +
s _ {j}  ^ {2}  = \
  
(<img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780166.png" /> is the minimum value of the original sum of the squares). Subject to certain general assumptions one can show that if the number of observations <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780167.png" /> is sufficiently large, the absolute error of the approximate equality <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780168.png" /> is less than <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780169.png" /> with probability approximately given by the integral (1). If the random errors <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780170.png" /> obey a normal law, then all quotients <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780171.png" /> are distributed according to the Student law with <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780172.png" /> degrees of freedom (a precise estimation of the absolute error of the approximate equality makes use of the integral (2), just as in the case of one unknown). In addition, the minimum value of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780173.png" /> is independent of <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780174.png" /> (in the probabilistic sense), and therefore the approximate values of the variances <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780175.png" /> are independent of the estimators <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780176.png" /> themselves.
+
\frac{Sd _ {jj} }{d ( n - m) }
  
One of the most typical applications of the method of least squares is in the  "smoothing"  of observed results <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780177.png" /> for which <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780178.png" /> in equations (3), where the <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780179.png" /> are known functions of some parameter <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780180.png" /> (if <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780181.png" /> is time, then <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780182.png" /> are the times at which the observations are made). A case encountered quite frequently in practical work is that known as parabolic interpolation, when the <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780183.png" /> are polynomials (e.g. <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780184.png" />, <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780185.png" />, <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780186.png" />); if <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780187.png" /> and if the observations are equally precise, then the estimators <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780188.png" /> can be calculated by using tables of orthogonal polynomials. Another case of practical importance is harmonic interpolation, when the <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780189.png" /> are trigonometric functions (e.g. <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780190.png" />, <img align="absmiddle" border="0" src="https://www.encyclopediaofmath.org/legacyimages/l/l057/l057780/l057780191.png" />).
+
$$
  
($S$ is the minimum value of the original sum of the squares). Subject to certain general assumptions one can show that if the number of observations $n$ is sufficiently large, the absolute error of the approximate equality $x_j \approx X_j$ is less than $t s_j$ with probability approximately given by the integral (1). If the random errors $\delta_i$ obey a normal law, then all quotients $( X_j - x_j )/s_j$ are distributed according to the Student law with $n - m$ degrees of freedom (a precise estimation of the absolute error of the approximate equality makes use of the integral (2), just as in the case of one unknown). In addition, the minimum value of $S$ is independent of $X_1, \dots, X_m$ (in the probabilistic sense), and therefore the approximate values of the variances ${\mathsf D} X_j \approx s_j^2$ are independent of the estimators $X_j$ themselves.

One of the most typical applications of the method of least squares is in the  "smoothing"  of observed results $Y_i$ for which $a_{ij} = a_j ( t_i )$ in equations (3), where the $a_j ( t)$ are known functions of some parameter $t$ (if $t$ is time, then $t_1, t_2, \dots$ are the times at which the observations are made). A case encountered quite frequently in practical work is that known as parabolic interpolation, when the $a_j ( t)$ are polynomials (e.g. $a_1 ( t) = 1$, $a_2 ( t) = t$, $a_3 ( t) = t^2, \dots$); if $t_2 - t_1 = \dots = t_n - t_{n-1}$ and if the observations are equally precise, then the estimators $X_j$ can be calculated by using tables of orthogonal polynomials. Another case of practical importance is harmonic interpolation, when the $a_j ( t)$ are trigonometric functions (e.g. $a_j ( t) = \cos ( j - 1) t$, $j = 1, \dots, m$).
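
The smoothing procedure just described amounts to forming and solving the normal equations (5) for the chosen basis functions. The following minimal sketch (in Python with NumPy; the function name, the sample data and the noise level are illustrative assumptions, not taken from the article) does this for parabolic interpolation with $a_1(t) = 1$, $a_2(t) = t$, $a_3(t) = t^2$:

<pre>
import numpy as np

def smooth_polynomial(t, Y, p=None, degree=2):
    """Least-squares smoothing with the polynomial basis a_j(t) = t**(j-1).

    Forms Gauss' normal equations  sum_j [p a_j a_l] X_j = [p Y a_l]
    and solves them for the estimators X_1, ..., X_m.
    """
    t = np.asarray(t, dtype=float)
    Y = np.asarray(Y, dtype=float)
    p = np.ones_like(Y) if p is None else np.asarray(p, dtype=float)
    A = np.vander(t, N=degree + 1, increasing=True)   # columns a_1(t_i), ..., a_m(t_i)
    lhs = A.T @ (p[:, None] * A)                      # the coefficients [p a_j a_l]
    rhs = A.T @ (p * Y)                               # the right-hand sides [p Y a_l]
    return np.linalg.solve(lhs, rhs)

# Hypothetical equally-precise observations of a noisy parabola
rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 11)
Y = 1.0 + 0.5 * t - 0.2 * t**2 + 0.05 * rng.standard_normal(t.size)
print(smooth_polynomial(t, Y))   # estimates of the three coefficients
</pre>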
Example. In order to estimate the precision of a certain method of chemical analysis, the method was used to determine the concentration of CaO in ten standard samples of known composition. The results are listed in the following table ($i$ is the number of the experiment, $t_i$ the true concentration of CaO, $T_i$ the concentration of CaO determined by the chemical analysis, and $Y_i = T_i - t_i$ the error of the chemical analysis):

<table border="1" cellspacing="0" cellpadding="4">
<tr> <td>$i$</td> <td>$1$</td> <td>$2$</td> <td>$3$</td> <td>$4$</td> <td>$5$</td> <td>$6$</td> <td>$7$</td> <td>$8$</td> <td>$9$</td> <td>$10$</td> </tr>
<tr> <td>$t_i$</td> <td>$4$</td> <td>$8$</td> <td>$12.5$</td> <td>$16$</td> <td>$20$</td> <td>$25$</td> <td>$31$</td> <td>$36$</td> <td>$40$</td> <td>$40$</td> </tr>
<tr> <td>$Y_i$</td> <td>$-0.3$</td> <td>$-0.2$</td> <td>$-0.4$</td> <td>$-0.4$</td> <td>$-0.2$</td> <td>$-0.5$</td> <td>$+0.1$</td> <td>$-0.5$</td> <td>$-0.6$</td> <td>$-0.5$</td> </tr>
</table>
  
If the results of the chemical analysis do not involve systematic errors, then ${\mathsf E} Y_i = 0$. But if such errors exist, they may be represented as a first approximation by ${\mathsf E} Y_i = \alpha + \beta t_i$ ($\alpha$ is known as the constant bias, $\beta t_i$ as the methodical bias) or, equivalently,

$$
{\mathsf E} Y_i  =  ( \alpha + \beta \overline{t} ) + \beta ( t_i - \overline{t} ),
$$
  
 
where

$$
\overline{t}  =  \frac{1}{10} \sum_{i = 1}^{10} t_i  =  23.25.
$$
  
To find estimates of $\alpha$ and $\beta$, it will suffice to estimate the quantities $x_1 = \alpha + \beta \overline{t}$ and $x_2 = \beta$. The conditional equations here are

$$
Y_i  =  x_1 + x_2 ( t_i - \overline{t} ), \qquad i = 1, \dots, 10,
$$
  
so that $a_{i1} = 1$, $a_{i2} = t_i - \overline{t}$ (since by assumption the observations are equally precise, or homoscedastic (cf. [[Homoscedasticity|Homoscedasticity]]), $p_i = 1$). Since $[ a_1 a_2 ] = [ a_2 a_1 ] = \sum ( t_i - \overline{t} ) = 0$, the system of normal equations assumes a particularly simple form:

$$
[ a_1 a_1 ] X_1  =  [ Y a_1 ]; \qquad [ a_2 a_2 ] X_2  =  [ Y a_2 ],
$$
  
 
where

$$
[ a_1 a_1 ]  =  10, \qquad [ a_2 a_2 ]  =  \sum ( t_i - \overline{t} )^2  =  1569,
$$

$$
[ Y a_1 ]  =  \sum Y_i  =  - 3.5,
$$

$$
[ Y a_2 ]  =  \sum Y_i ( t_i - \overline{t} )  =  - 8.225.
$$
  
 
The variances of the components of the solution are

$$
{\mathsf D} X_1  =  \frac{k}{[ a_1 a_1 ]}  =  \frac{k}{10} \quad \textrm{ and } \quad {\mathsf D} X_2  =  \frac{k}{[ a_2 a_2 ]}  =  \frac{k}{1569} ,
$$
  
where $k$ is the unknown variance per unit weight (in this case, $k$ is the variance of any of the quantities $Y_i$). Since in this example the components of the solution have the values $X_1 = - 0.35$ and $X_2 = - 0.00524$, one finds

$$
k  \approx  \frac{S}{n - m}  =  \frac{1}{8} \sum_{i = 1}^{10} [ Y_i - X_1 - X_2 ( t_i - \overline{t} ) ]^2  =  0.0427,
$$

$$
{\mathsf D} X_1  \approx  s_1^2  =  0.00427, \qquad {\mathsf D} X_2  \approx  s_2^2  =  0.0000272,
$$

$$
s_1  =  0.065, \qquad s_2  =  0.00522.
$$
  
If the random errors in the observations follow a normal law, then the quotients $| X_j - x_j |/s_j$, $j = 1, 2$, are distributed according to a Student $t$-distribution with $n - m = 8$ degrees of freedom. In particular, if the results of the observations are free of systematic errors, then $x_1 = x_2 = 0$, so that the quotients $| X_1 |/s_1$ and $| X_2 |/s_2$ themselves must follow this Student $t$-distribution. Using tables of the Student distribution with 8 degrees of freedom, one verifies that if indeed $x_1 = x_2 = 0$, then with probability 0.999 each of these quotients does not exceed 5.04, and with probability 0.95 it does not exceed 2.31. In this case $| X_1 |/s_1 = 5.38 > 5.04$, and therefore the hypothesis that there is no constant bias should be rejected. At the same time, the hypothesis that there is no methodical bias $( x_2 = 0 )$ is not contradicted by the observations, since $| X_2 |/s_2 = 1.004 < 2.31$. It follows that the approximate formula $t = T + 0.35$ is a reasonable choice for determining the true concentration $t$ from the observed value $T$. A short numerical check of these computations is given below.
  
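As a purely illustrative numerical check (Python with NumPy; not part of the original article), the following sketch reproduces the estimates $X_1$, $X_2$, the empirical standard deviations $s_1$, $s_2$ and the two Student quotients directly from the tabulated data:

<pre>
import numpy as np

# Data from the table above: true concentrations t_i and analysis errors Y_i
t = np.array([4, 8, 12.5, 16, 20, 25, 31, 36, 40, 40], dtype=float)
Y = np.array([-0.3, -0.2, -0.4, -0.4, -0.2, -0.5, 0.1, -0.5, -0.6, -0.5])

tbar = t.mean()                                      # 23.25
X1 = Y.sum() / 10.0                                  # [Ya_1]/[a_1 a_1]  -> -0.35
X2 = (Y * (t - tbar)).sum() / ((t - tbar)**2).sum()  # [Ya_2]/[a_2 a_2]  -> about -0.00524

resid = Y - X1 - X2 * (t - tbar)
k = (resid**2).sum() / (10 - 2)                      # variance per unit weight, about 0.0427
s1 = np.sqrt(k / 10.0)                               # about 0.065
s2 = np.sqrt(k / ((t - tbar)**2).sum())              # about 0.0052

print(X1, X2, s1, s2)
print(abs(X1) / s1, abs(X2) / s2)                    # about 5.4 and 1.0
</pre>

Comparing the first quotient with the 0.999-quantile 5.04 of the Student distribution with 8 degrees of freedom reproduces the rejection of the hypothesis $x_1 = 0$ discussed above.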
 
==The case of several unknowns (non-linear relationships).==
Suppose that the $n$ results $Y_i$ of measurements are related to the $m$ unknowns $x_j$ ($m < n$) by means of functions $Y_i = f_i ( x_1, \dots, x_m ) + \delta_i$, $i = 1, \dots, n$, where the $\delta_i$ are independent random errors and the functions $f_i$ (which need not be linear) are differentiable. According to the method of least squares, estimators $X_j$ for the $x_j$ are those values for which the sum of squares $S = \sum_i p_i [ Y_i - f_i ( X_1, \dots, X_m ) ]^2$ is smallest. Since the functions $f_i$ are non-linear, solving the normal equations $\partial S / \partial X_j = 0$ may present considerable difficulties. Sometimes non-linear relationships may be reduced to linear ones by a transformation.
  
For example, in the magnetization of iron, the magnetic field strength $H$ is related to the magnetic induction $B$ via the empirical formula $B = H/( x_1 + x_2 H )$ (the coefficients $x_1$ and $x_2$ being determined by measuring the values $B_i$ of the induction for various preassigned values $H_i$ of the field strength). The induction $B$ is a non-linear function of $x_1$ and $x_2$. However, the reciprocal of the induction, $1/B = x_1 / H + x_2$, is a linear function of $x_1$ and $x_2$. Application of the method of least squares to the original and to the transformed equations may well yield different estimates for the unknowns $x_1$ and $x_2$; but if the standard deviation of the random errors involved in measuring the induction is significantly smaller than the measured quantities $B_i$ themselves, then ${\mathsf D} ( 1/B ) \approx {\mathsf D} ( B ) / B^4$. Accordingly, the quantities $1/B_i$ should be assigned weights proportional to $B_i^4$; under these conditions it is natural to expect that the difference between the estimates obtained in the non-linear and in the linearized formulation is negligible for practical purposes.
  
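A possible implementation of this transformation-plus-weighting device is sketched below (Python with NumPy; the field strengths, the "true" coefficients and the noise level are invented for illustration). It fits the linearized equations $1/B_i = x_1 / H_i + x_2$ by weighted least squares with weights proportional to $B_i^4$:

<pre>
import numpy as np

# Hypothetical measurements: preassigned field strengths H_i and measured inductions B_i
rng = np.random.default_rng(1)
H = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
x1_true, x2_true = 0.8, 0.05
B = H / (x1_true + x2_true * H) * (1.0 + 0.002 * rng.standard_normal(H.size))

# Linearized conditional equations: 1/B_i = x1 * (1/H_i) + x2
A = np.column_stack([1.0 / H, np.ones_like(H)])
y = 1.0 / B
w = B**4                       # weights proportional to B_i^4 (inverse variances of 1/B_i)

lhs = A.T @ (w[:, None] * A)   # weighted normal equations
rhs = A.T @ (w * y)
x1, x2 = np.linalg.solve(lhs, rhs)
print(x1, x2)                  # should be close to 0.8 and 0.05
</pre>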
In cases when it is not possible to replace the non-linear equations by linear ones via identity transformations, one resorts to another linearization technique. Inspecting the original $n$ equations, one singles out $m$ of them whose solution $X_1^0, \dots, X_m^0$ will serve as a zero-th approximation to the unknowns $x_j$. Putting $\xi_j = X_j - X_j^0$, the conditional equations may be rewritten as

$$
Y_i  =  f_i ( X_1^0 + \xi_1, \dots, X_m^0 + \xi_m ), \qquad i = 1, \dots, n.
$$
  
Expanding the terms on the right in powers of $\xi_j$ and dropping all but the linear terms, one obtains

$$
Y_i - ( f_i )_0  =  \sum_{j = 1}^{m} \left( \frac{\partial f_i}{\partial x_j} \right)_0 \xi_j, \qquad i = 1, \dots, n,
$$

where $( f_i )_0$ and $( \partial f_i / \partial x_j )_0$ are the values of $f_i$ and of its derivative with respect to $x_j$ at the point $x_1 = X_1^0, \dots, x_m = X_m^0$. This is a linear system of equations, so the method of least squares is readily applicable to determine estimates for the unknowns $\xi_j$. Once this has been done, one obtains a first approximation for the unknowns: $X_j^1 = X_j^0 + \xi_j$. These quantities are now treated as the initial approximation, and the entire procedure is repeated until two consecutive approximations coincide to within the desired precision; the procedure converges provided the variance of the errors $\delta_i$ is sufficiently small.

Very commonly, if ${\mathsf D} \delta_i$ is small, the first approximation alone is quite sufficient; there is no point in requiring the $X_j$ to be determined with a precision significantly exceeding $\sqrt{ {\mathsf D} X_j }$. A sketch of this iterative procedure is given below.
  
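The iterative linearization just described (in modern terminology, a Gauss–Newton-type iteration) can be sketched as follows (Python with NumPy; the function names, the tolerance and the test problem are illustrative assumptions). At each step the residuals $Y_i - ( f_i )_0$ are fitted linearly in the partial derivatives $( \partial f_i / \partial x_j )_0$, and the corrections $\xi_j$ are added to the current approximation:

<pre>
import numpy as np

def iterate_linearization(f, jac, Y, x0, p=None, tol=1e-10, max_iter=50):
    """Successive linearization for the conditional equations Y_i = f_i(x) + delta_i.

    f(x)   returns the vector of model values f_i(x);
    jac(x) returns the matrix of partial derivatives df_i/dx_j at x.
    """
    x = np.asarray(x0, dtype=float)
    Y = np.asarray(Y, dtype=float)
    p = np.ones_like(Y) if p is None else np.asarray(p, dtype=float)
    for _ in range(max_iter):
        r = Y - f(x)                                  # Y_i - (f_i)_0
        J = jac(x)                                    # (df_i/dx_j)_0
        # Normal equations of the linearized conditional equations
        xi = np.linalg.solve(J.T @ (p[:, None] * J), J.T @ (p * r))
        x = x + xi                                    # X_j^1 = X_j^0 + xi_j
        if np.max(np.abs(xi)) < tol:                  # consecutive approximations coincide
            break
    return x

# Illustrative use on the magnetization formula B = H / (x_1 + x_2 H)
H = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])
B = H / (0.8 + 0.05 * H)
f = lambda x: H / (x[0] + x[1] * H)
jac = lambda x: np.column_stack([-H / (x[0] + x[1] * H)**2,
                                 -H**2 / (x[0] + x[1] * H)**2])
print(iterate_linearization(f, jac, B, x0=[1.0, 0.1]))   # converges to (0.8, 0.05)
</pre>

With noise-free data, as here, the iteration reproduces the coefficients used to generate the inductions; with noisy data it returns the least-squares estimators of the non-linear problem.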
 
In many cases of practical importance (in particular, in the estimation of complicated non-linear relationships), the number of unknown parameters becomes very large, and so implementation of the method of least squares becomes effective only if modern computing techniques are employed.

====References====
<table><TR><TD valign="top">[1]</TD> <TD valign="top">  A.A. Markov,  "Wahrscheinlichkeitsrechnung" , Teubner  (1912)  (Translated from Russian)</TD></TR><TR><TD valign="top">[2]</TD> <TD valign="top">  A.N. Kolmogorov,  "On substantiation of the method of least squares"  ''Uspekhi Mat. Nauk'' , '''1''' :  1  (1946)  pp. 57–70  (In Russian)</TD></TR><TR><TD valign="top">[3]</TD> <TD valign="top">  Yu.V. Linnik,  "Methode der kleinsten Quadrate in moderner Darstellung" , Deutsch. Verlag Wissenschaft.  (1961)  (Translated from Russian)</TD></TR><TR><TD valign="top">[4]</TD> <TD valign="top">  V.V. Nalimov,  "The application of mathematical statistics to chemical analysis" , Pergamon  (1963)  (Translated from Russian)</TD></TR><TR><TD valign="top">[5]</TD> <TD valign="top">  F.R. Helmert,  "Die Ausgleichungsrechnung nach der Methode der kleinsten Quadrate" , Teubner  (1907)</TD></TR></table>

====Comments====
In fact, Gauss proposed the [[Normal distribution|normal distribution]] as a way of justifying the method of least squares. Moreover, his original derivation of the least-square estimator as the  "best linear unbiased estimator"  (as it is nowadays interpreted; the so-called Gauss–Markov theorem) was actually a derivation of this estimator as the  "most likely value of $\mu$"  in a Bayesian sense (cf. [[Bayesian approach|Bayesian approach]]), with respect to a uniform prior on $\mu$; cf. [[#References|[a1]]].

For (the results of) the use of least squares in estimating the coefficients of (presumed) linear relationships $y = A x$ between repeatedly-measured vectors of variables $y$ and $x$ (linear regression), cf. [[Regression matrix|Regression matrix]] and [[Regression analysis|Regression analysis]].

====References====
<table><TR><TD valign="top">[a1]</TD> <TD valign="top">  S.M. Stigler,  "The history of statistics" , Harvard Univ. Press  (1986)</TD></TR><TR><TD valign="top">[a2]</TD> <TD valign="top">  N.R. Draper,  H. Smith,  "Applied regression analysis" , Wiley  (1981)</TD></TR><TR><TD valign="top">[a3]</TD> <TD valign="top">  C.R. Rao,  "Linear statistical inference and its applications" , Wiley  (1965)</TD></TR><TR><TD valign="top">[a4]</TD> <TD valign="top">  C. Daniel,  F.S. Wood,  "Fitting equations to data: computer analysis of multifactor data for scientists and engineers" , Wiley  (1971)</TD></TR></table>

Latest revision as of 22:16, 5 June 2020


A method in the theory of errors (cf. Errors, theory of) for estimating unknown quantities on the basis of results of measurement involving random errors. It is also used for the approximate representation of a given function by other (simpler) functions and it often proves useful for the processing of observations. It was first proposed by C.F. Gauss (1794–1795) and A. Legendre (1805–1806). The establishment of a rigorous basis for the method of least squares and the delineation of the limits of meaningful applicability of it were provided by A.A. Markov and A.N. Kolmogorov. In the simplest case, that of linear relationships (see below) and observations containing no systematic errors but only random ones, the least-square estimators of the unknown quantities are linear functions of the observed values. These estimators involve no systematic errors, i.e. they are unbiased (see Unbiased estimator). If the random errors in the observations are independent and follow a normal distribution law, the method yields estimators with minimum variance, i.e. they are efficient estimators (see Statistical estimation). It is in this sense that the method of least squares is optimal among all other methods for the determination of unbiased estimators. However, if the distribution of the random errors significantly departs from the normal one, the method is no longer necessarily the best.

In laying rigorous foundations for the method of least squares (in the Gaussian version), one assumes that the "loss" incurred by replacing the (unknown) exact value of some quantity $ \mu $ by an approximate value $ X $, calculated on the basis of observations, is proportional to the squared error $ ( X - \mu ) ^ {2} $; as an optimal estimator one takes a value of $ X $, free of systematic errors, for which the mean $ {\mathsf E} ( X - \mu ) ^ {2} $ of the "loss" is minimal. This requirement is fundamental to the method of least squares. In the general case, the task of finding an optimal estimator $ X $— in the least-square sense — is extremely complex, and so in practice one confines oneself to a narrower goal: One takes $ X $ to be a linear function of the observed results which involves no systematic errors, and such that the mean error is minimal relative to the class of all linear functions. If the random errors of the observations are normally distributed and the estimated quantity $ \mu $ depends linearly on the mean values of the observed results (a case which arises very frequently in applications of the method of least squares), the solution of this restricted problem turns out to be also the solution of the general problem. Under these conditions, the optimal estimator $ X $ also obeys a normal law with mean $ \mu $, and so the probability density of $ X $,

$$ p ( x; \mu , \sigma ) = \ \frac{1}{\sqrt {2 \pi \sigma } } \mathop{\rm exp} \left \{ - \frac{( x - \mu ) ^ {2} }{2 \sigma ^ {2} } \right \} , $$

attains a maximum at $ x = \mu $( this property is a rigorous expression of the statement, very common in the theory of errors, that "the estimator X furnished by the method of least squares is the most probable value of the unknown parameter m" ).

The case of one unknown.

Assume that an unknown constant $ \mu $ is to be estimated on the basis of $ n $ independent observations, which have yielded results $ Y _ {1} \dots Y _ {n} $, i.e. $ Y _ {1} = \mu + \delta _ {1} , \dots Y _ {n} = \mu + \delta _ {n} $, where the $ \delta _ {1} \dots \delta _ {n} $ are random errors (by the usual definition in the classical theory of errors, random errors are independent random variables with expectation zero: $ {\mathsf E} \delta _ {i} = 0 $; if $ {\mathsf E} \delta _ {i} \neq 0 $, then the $ \delta _ {i} $ are called systematic errors). According to the method of least squares, one adopts an estimator $ X $ for $ \mu $ that minimizes the sum of the squares (hence the name of the method)

$$ S ( X) = \ \sum _ {i = 1 } ^ { n } p _ {i} ( X - Y _ {i} ) ^ {2} , $$

where

$$ p _ {i} = \frac{k}{\sigma _ {i} ^ {2} } \ \ \textrm{ and } \ \ \sigma _ {i} ^ {2} = {\mathsf D} \delta _ {i} = \ {\mathsf E} \delta _ {i} ^ {2} $$

(the coefficient $ k > 0 $ can be arbitrarily chosen). The number $ p _ {i} $ is known as the weight and $ \sigma _ {i} ^ {2} $ as the squared deviation of the $ i $- th measurement. In particular, if all measurements are equally precise, then $ \sigma _ {1} = {} \dots = \sigma _ {n} $, and one can then put $ p _ {1} = \dots = p _ {n} = 1 $; but if each $ Y _ {i} $ is the arithmetic mean of $ n _ {i} $ equally-precise measurements, one puts $ p _ {i} = n _ {i} $.

The sum $ S ( X) $ will be minimal if $ X $ is chosen to be the weighted average

$$ X = \overline{Y}\; = { \frac{1}{P} } \sum p _ {i} Y _ {i} ,\ \ \textrm{ where } \ P = \sum p _ {i} . $$

The estimator $ \overline{Y}\; $ for $ \mu $ is unbiased, has weight $ P $ and variance $ {\mathsf D} \overline{Y}\; = k/P $. In particular, if all measurements are equally precise, then $ Y $ is the arithmetic mean of the results of measurement:

$$ \overline{Y}\; = { \frac{1}{n} } \sum Y _ {i} \ \ \textrm{ and } \ \ {\mathsf D} \overline{Y}\; = \frac{\sigma ^ {2} }{n} . $$

Subject to certain general assumptions one can show that if $ n $, the number of observations, is sufficiently large, then the distribution of the estimator $ \overline{Y}\; $ differs only slightly from that of the normal law with expectation $ \mu $ and variance $ k/P $. In this case the absolute error of the approximate equality $ \mu \approx \overline{Y}\; $ is less than $ t \sqrt {k/P } $ with probability approximately equal to

$$ \tag{1 } I ( t) = \ \frac{2}{\sqrt {2 \pi } } \int\limits _ { 0 } ^ { t } e ^ {- u ^ {2} /2 } du $$

(e.g. $ I ( 1.96) = 0.950 $; $ I ( 2.58) = 0.990 $; $ I ( 3.00) = 0.997 $).

If the measurement weights $ p _ {i} $ are given and the factor $ k $ remains undetermined before the observations are taken, then this factor and the variance of $ \overline{Y}\; $ can be estimated by

$$ k \approx \ \frac{S ( \overline{Y}\; ) }{n - 1 } $$

and

$$ {\mathsf D} \overline{Y}\; = \ \frac{k}{P} \approx s ^ {2} = \ \frac{S ( \overline{Y}\; ) }{( n - 1) P } $$

(both these estimators are also unbiased).

In the practically important case in which the errors $ \delta _ {i} $ are normally distributed, one can determine the exact probability with which the absolute error of the equality $ \mu \approx \overline{Y}\; $ will be less than $ ts $( where $ t $ is an arbitrary positive number) as

$$ \tag{2 } I _ {n - 1 } ( t) = \ C _ {n - 1 } \int\limits _ { 0 } ^ { t } \left ( 1 + \frac{v ^ {2} }{n - 1 } \right ) ^ {-} n/2 dv, $$

where the constant $ C _ {n - 1 } $ is so chosen that $ I _ {n - 1 } ( \infty ) = 1 $( the Student distribution with $ n - 1 $ degrees of freedom). For large $ n $, formula (2) can be replaced by (1). For relatively small $ n $, however, the use of formula (1) may cause large errors. For example, according to (1), a value $ I = 0.99 $ corresponds to $ t = 2.58 $; the true values of $ t $ defined for small $ n $ as the solutions of the equations $ I _ {n - 1 } ( t) = 0.99 $

are shown in the following table:

<tbody> </tbody>
$ n \dots $ $ 2 $ $ 3 $ $ 4 $ $ 5 $ $ 10 $ $ 20 $ $ 30 $
$ t \dots $ $ 63.66 $ $ 9.92 $ $ 5.84 $ $ 4.60 $ $ 3.25 $ $ 2.86 $ $ 2.76 $

Example. To determine the mass of a body, it was weighted ten times in exactly the same way assuming the different weighings to be independent; the results $ Y _ {i} $(

in grams) were:

<tbody> </tbody>
$ Y _ {i} \dots $ $ 18.41 $ $ 18.42 $ $ 18.43 $ $ 18.44 $ $ 18.45 $ $ 18.46 $
$ n _ {i} \dots $ $ 1 $ $ 3 $ $ 3 $ $ 1 $ $ 1 $ $ 1 $

( $ n _ {i} $ denotes the number of cases in which the mass $ Y _ {i} $ was observed; $ n = \sum n _ {i} = 10 $). Since the different weighings were assumed to be equally precise, one puts $ p _ {i} = n _ {i} $ and the estimate for the unknown weight $ \mu $ is chosen to be $ Y = \sum n _ {i} Y _ {i} / \sum n _ {i} = 18.431 $. Putting, say, $ I _ {9} = 0.95 $ and checking the tables of the Student distribution with nine degrees of freedom, one finds that $ t = 2.262 $, so that the maximum absolute error in the approximate equality $ \mu \approx 18.431 $ is

$$ ts = \ t \sqrt {\sum n _ {i} \frac{( Y _ {i} - \overline{Y}\; ) ^ {2} }{90} } = \ 2.262 \cdot 0.0048 = 0.011. $$

Thus, $ 18.420 < \mu < 18.442 $ with probability $ 0.05 $.

The case of several unknowns (linear relationships).

Suppose that $ n $ measurement results $ Y _ {1} \dots Y _ {n} $ are related to $ m $ unknown constants $ x _ {1} \dots x _ {m} $( $ m < n $) by means of independent linear relationships

$$ \tag{3 } Y _ {i} = \ \sum _ {j = 1 } ^ { m } a _ {ij} x _ {j} + \delta _ {i} ,\ \ i = 1 \dots n, $$

where the $ a _ {ij} $ are known coefficients and the $ \delta _ {i} $ are independent random errors in the measurements. One wishes to estimate the unknown quantities $ x _ {j} $( this problem can be regarded as a generalization of the previous problem, with $ \mu = x _ {i} $ and $ m = a _ {i1} = 1 $; $ i = 1 \dots n $).

Since $ {\mathsf E} \delta _ {i} = 0 $, the mean values of the results, $ y _ {i} = {\mathsf E} Y _ {i} $, are related to the unknowns $ x _ {1} \dots x _ {m} $ by means of the following linear equations:

$$ \tag{4 } y _ {i} = \ \sum _ {j = 1 } ^ { m } a _ {ij} x _ {j} ,\ \ i = 1 \dots n. $$

Consequently, the desired values of $ x _ {i} $ are a solution of the system (4) (on the assumption that the equations are consistent). The precise values of the measured magnitudes $ y _ {i} $ and the random errors $ \delta _ {i} $ are usually unknown, and so one generally replaces the systems (3) and (4) by the conditional equations:

$$ Y _ {i} = \ \sum _ {j = 1 } ^ { m } a _ {ij} x _ {j} ,\ \ i = 1 \dots n. $$

In the method of least squares, estimators for the unknowns $ x _ {i} $ are the quantities $ X _ {i} $ that minimize the sum of the squared deviations

$$ S = \ \sum _ {i = 1 } ^ { n } p _ {i} \left ( Y _ {i} - \sum _ {j = 1 } ^ { m } a _ {ij} X _ {j} \right ) ^ {2} . $$

(As in the previous case, $ p _ {i} $ is the weight of the measurement and $ Y _ {i} $ is the quantity inversely proportional to the variance of the random error $ \delta _ {i} $.) The conditional equations are as a rule inconsistent, i.e. for arbitrary values of $ X _ {j} $ the differences

$$ \Delta _ {i} = \ Y _ {i} - \sum _ {j = 1 } ^ { m } a _ {ij} X _ {j} ,\ \ i = 1 \dots n, $$

cannot all vanish, in general. The method of least squares prescribes taking as estimators those values of $ X _ {j} $ that minimize the sum $ S $. In those exceptional cases in which the conditional equations are consistent, and therefore solvable, the solution consists precisely of the estimators furnished by the method of least squares.

The sum $ S $ of the squares is a quadratic polynomial in the variables $ X _ {j} $; this polynomial attains a minimum at the values of $ X _ {1} \dots X _ {m} $ for which all the first partial derivatives vanish:

$$ \frac{\partial S }{\partial X _ {j} } = \ - 2 \sum _ {i = 1 } ^ { n } p _ {i} \Delta _ {i} = 0,\ \ j = 1 \dots m. $$

Hence it follows that the $ X _ {j} $ furnished by the method of least squares must satisfy the so-called normal equations. In the notation proposed by Gauss, these equations are:

$$ \tag{5 } \sum _ {j = 1 } ^ { m } [ pa _ {j} a _ {l} ] X _ {j} = [ pYa _ {l} ],\ \ l = 1 \dots m, $$

where

$$ [ pa _ {j} a _ {l} ] = \ \sum _ {i = 1 } ^ { n } p _ {i} a _ {ij} a _ {il} $$

and

$$ [ pYa _ {l} ] = \ \sum _ {i = 1 } ^ { n } p _ {i} Y _ {i} a _ {il} . $$

Estimators obtained by solving the system of normal equations are unbiased $ ( {\mathsf E} X _ {j} = x _ {j} ) $; the variances $ {\mathsf D} X _ {j} $ are $ kd _ {jj} /d $, where $ d $ is the determinant of the system (5) and $ d _ {jj} $ is the minor corresponding to the diagonal entry $ [ pa _ {j} a _ {j} ] $( in other words, $ d _ {jj} /d $ is the weight of the estimator $ X _ {j} $). If the proportionality factor $ k $( known as the variance per unit weight) is not known in advance, it can be estimated, and with it the variances $ {\mathsf D} X _ {j} $, by the formulas:

$$ k \approx \frac{S}{n - m } \ \ \textrm{ and } \ \ {\mathsf D} X _ {j} \approx \ s _ {j} ^ {2} = \ \frac{Sd _ {jj} }{d ( n - m) } $$

( $ S $ is the minimum value of the original sum of the squares). Subject to certain general assumptions one can show that if the number of observations $ n $ is sufficiently large, the absolute error of the approximate equality $ x _ {j} \approx X _ {j} $ is less than $ ts _ {j} $ with probability approximately given by the integral (1). If the random errors $ \delta _ {i} $ obey a normal law, then all quotients $ ( X _ {j} - x _ {j} )/s _ {j} $ are distributed according to the Student law with $ n - m $ degrees of freedom (a precise estimation of the absolute error of the approximate equality makes use of the integral (2), just as in the case of one unknown). In addition, the minimum value of $ S $ is independent of $ X _ {1} \dots X _ {m} $( in the probabilistic sense), and therefore the approximate values of the variances $ {\mathsf D} X _ {j} \approx s _ {j} ^ {2} $ are independent of the estimators $ X _ {j} $ themselves.

One of the most typical applications of the method of least squares is in the "smoothing" of observed results $ Y _ {j} $ for which $ a _ {ij} = a _ {j} ( t _ {i} ) $ in equations (3), where the $ a _ {j} ( t) $ are known functions of some parameter $ t $( if $ t $ is time, then $ t _ {1} , t _ {2} \dots $ are the times at which the observations are made). A case encountered quite frequently in practical work is that known as parabolic interpolation, when the $ a _ {j} ( t) $ are polynomials (e.g. $ a _ {1} ( t) = 1 $, $ a _ {2} ( t) = t $, $ a _ {3} ( t) = t ^ {2} ,\dots $); if $ t _ {2} - t _ {1} = \dots = t _ {n} - t _ {n - 1 } $ and if the observations are equally precise, then the estimators $ X _ {j} $ can be calculated by using tables of orthogonal polynomials. Another case of practical importance is harmonic interpolation, when the $ a _ {j} ( t) $ are trigonometric functions (e.g. $ a _ {j} ( t) = \cos ( j - 1) t $, $ j = 1 \dots m $).

Example. In order to estimate the precision of a certain method of chemical analysis, the method was used to determine the concentration of CaO in ten standard samples of known composition. The results are listed in the following table ( $ i $ is the number of the experiment, $ t _ {i} $ the true concentration of CaO, $ T _ {i} $ is the concentration of CaO determined by the chemical analysis, $ Y _ {i} = T _ {i} - t _ {i} $

the error in chemical analysis):

<tbody> </tbody>
$ 1 $ $ 2 $ $ 3 $ $ 4 $ $ 5 $ $ 6 $ $ 7 $ $ 8 $ $ 9 $ $ 10 $
$ 4 $ $ 8 $ $ 12.5 $ $ 16 $ $ 20 $ $ 25 $ $ 31 $ $ 36 $ $ 40 $ $ 40 $
$ - 0.3 $ $ - 0.2 $ $ - 0.4 $ $ - 0.4 $ $ - 0.2 $ $ - 0.5 $ $ + 0.1 $ $ - 0.5 $ $ - 0.6 $ $ - 0.5 $

If the results of the chemical analysis do not involve systematic errors, then $ {\mathsf E} Y _ {i} = 0 $. But if such errors exist, they may be represented as a first approximation by $ {\mathsf E} Y _ {i} = \alpha + \beta t _ {i} $( $ \alpha $ is known as the constant bias, $ \beta t _ {i} $ as the methodical bias) or, equivalently,

$$ {\mathsf E} Y _ {i} = ( \alpha + \beta \overline{t}\; ) + \beta ( t _ {i} - \overline{t}\; ), $$

where

$$ \overline{t}\; = { \frac{1}{10} } \sum _ {i = 1 } ^ { 10 } t _ {i} = 23.25. $$

To find estimates of $ \alpha $ and $ \beta $, it will suffice to estimate the quantities $ x _ {1} = \alpha + \beta \overline{t}\; $ and $ x _ {2} = \beta $. The conditional equations here are

$$ Y _ {i} = x _ {1} + x _ {2} ( t _ {i} - \overline{t}\; ),\ \ i = 1 \dots 10, $$

so that $ a _ {i1} = 1 $, $ a _ {i2} = t _ {i} - \overline{t}\; $( since by assumption the observations are equally precise or homoscedastic (cf. Homoscedasticity), $ p _ {i} = 1 $). Since $ [ a _ {1} a _ {2} ] = [ a _ {2} a _ {1} ] = \sum ( t _ {i} - \overline{t}\; ) = 0 $, the system of normal equations assumes a particularly simple form:

$$ [ a _ {1} a _ {1} ] X _ {1} = [ Ya _ {1} ]; \ \ [ a _ {2} a _ {2} ] X _ {2} = [ Ya _ {2} ], $$

where

$$ [ a _ {1} a _ {1} ] = 10,\ \ [ a _ {2} a _ {2} ] = \sum ( t _ {i} - \overline{t}\; ) ^ {2} = \ 1569, $$

$$ [ Ya _ {1} ] = \sum y _ {i} = - 3.5, $$

$$ [ Ya _ {2} ] = \sum Y _ {i} ( t _ {i} - \overline{t}\; ) = - 8.225. $$

The variances of the components of the solution are

$$ {\mathsf D} X _ {1} = \frac{k}{[ a _ {1} a _ {1} ] } = { \frac{k}{10} } \ \ \textrm{ and } \ \ {\mathsf D} X _ {2} = \frac{k}{[ a _ {2} a _ {2} ] } = { \frac{k}{1569} } , $$

where $ k $ is the unknown variance per unit weight (in this case, $ k $ is the variance of any of the quantities $ Y _ {i} $). Since in this example the components of the solution have values $ X _ {1} = - 0.35 $ and $ X _ {2} = - 0.00524 $, one finds

$$ k \approx \frac{S}{n - m } = $$

$$ = \ { \frac{1}{8} } \sum _ {i = 1 } ^ { 10 } [ Y _ {i} - X _ {1} - X _ {2} ( t _ {i} - \overline{t}\; )] ^ {2} = 0.0427, $$

$$ {\mathsf D} X _ {1} \approx s _ {1} ^ {2} = 0.00427,\ \ {\mathsf D} X _ {2} \approx s _ {2} ^ {2} = 0.000272, $$

$$ s _ {1} = 0.065,\ s _ {2} = 0.00522. $$

If the random errors in the observations follow a normal law, then the quotients $ | X _ {j} - x _ {j} | /s _ {j} $, $ j = 1, 2 $, are distributed according to a Student $ t $- distribution. In particular, if the results of the observations are free of systematic errors, then $ x _ {1} = x _ {2} = 0 $, so that the quotients $ | X _ {1} | /s _ {1} $ and $ | X _ {2} | /s _ {2} $ must follow a Student $ t $- distribution. Using tables of the Student distribution with $ n - m = 8 $ degrees of freedom, one verifies that if indeed $ x _ {1} = x _ {2} = 0 $, then with probability 0.999 each of these quotients cannot exceed 5.04 and with probability 0.95 they will not exceed 2.31. In this case $ | X _ {1} | /s _ {1} = 5.38 > 5.04 $, and therefore the hypothesis that there are no systematic errors should be rejected. At the same time, it should be realized that the hypothesis of absence of systematic errors $ ( x _ {2} = 0) $ is not contradicted by the results of the observations, since $ | X _ {2} | /s _ {2} = 1.004 < 2.31 $. Thus, it follows that the approximate formula $ t = T + 0.35 $ is a logical choice for the determination of $ t $ on the basis of the observed value $ T $.

The case of several unknowns (non-linear relationships).

Suppose that the $ n $ results $ Y _ {i} $ of measurements are related to the $ m $ unknowns $ x _ {j} $ ($ m < n $) by means of functions $ Y _ {i} = f _ {i} ( x _ {1} \dots x _ {m} ) + \delta _ {i} $, $ i = 1 \dots n $, where the $ \delta _ {i} $ are independent random errors and the functions $ f _ {i} $ (which need not be linear) are differentiable. According to the method of least squares, one takes as estimators $ X _ {j} $ of the $ x _ {j} $ those values for which the sum of squares $ S = \sum _ {i = 1 } ^ { n } [ Y _ {i} - f _ {i} ( X _ {1} \dots X _ {m} )] ^ {2} $ is smallest. Since the functions $ f _ {i} $ are non-linear, solving the normal equations $ \partial S/ \partial X _ {j} = 0 $ may present considerable difficulties. Sometimes non-linear relationships can be reduced to linear ones by a suitable transformation.

For example, in the magnetization of iron, the magnetic field strength $ H $ is related to the magnetic induction $ B $ via the empirical formula $ B = H/( x _ {1} + H x _ {2} ) $ (the coefficients $ x _ {1} $ and $ x _ {2} $ being determined by measuring the values $ B _ {i} $ for various preassigned values $ H _ {i} $). The induction $ B $ is a non-linear function of $ x _ {1} $ and $ x _ {2} $, but its reciprocal $ 1/B = x _ {1} /H + x _ {2} $ is linear in $ x _ {1} $ and $ x _ {2} $. Application of the method of least squares to the original and the transformed equations may well yield different estimates for the unknowns $ x _ {1} $ and $ x _ {2} $. However, if the random errors in measuring the induction are small compared with the measured values $ B _ {i} $ themselves, then $ {\mathsf D} ( 1/B) \approx B ^ {-4} {\mathsf D} B $, so the quantities $ 1/B _ {i} $ should be assigned weights proportional to $ B _ {i} ^ {4} $; under these conditions it is natural to expect that the difference between the estimates obtained from the non-linear and the linearized equations is negligible for practical purposes.
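
The following Python sketch illustrates the weighted fit of the linearized relation $ 1/B = x _ {1} /H + x _ {2} $; the measurement values are made up for illustration, and NumPy is assumed to be available. Each conditional equation is multiplied by the square root of its weight $ B _ {i} ^ {4} $ before the ordinary least-squares solve.

<pre>
import numpy as np

# Hypothetical measurements: preassigned field strengths and observed inductions.
H = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
B = np.array([0.52, 0.88, 1.28, 1.72, 2.02])

A = np.column_stack([1.0 / H, np.ones_like(H)])  # regressors for x1 and x2
w = np.sqrt(B ** 4)                              # square roots of the weights
x1, x2 = np.linalg.lstsq(A * w[:, None], (1.0 / B) * w, rcond=None)[0]
print(x1, x2)
</pre>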

In cases when it is not possible to replace the non-linear equations by linear ones via identity transformations, one resorts to another linearization technique. From the original $ n $ equations one singles out some $ m $ of them; their solution $ X _ {1} ^ {0} \dots X _ {m} ^ {0} $ serves as a zero-th approximation to the unknowns $ x _ {j} $. Writing $ x _ {j} = X _ {j} ^ {0} + \xi _ {j} $, the conditional equations may be rewritten as

$$ Y _ {i} = f _ {i} ( X _ {1} ^ {0} + \xi _ {1} \dots X _ {m} ^ {0} + \xi _ {m} ),\ \ i = 1 \dots n. $$

Expanding the terms on the right in powers of $ \xi _ {j} $ and dropping all but the linear terms, one obtains

$$ Y _ {i} - ( f _ {i} ) _ {0} = \ \sum _ {j = 1 } ^ { m } \left ( \frac{\partial f _ {i} }{\partial x _ {j} } \right ) _ {0} \xi _ {j} ,\ \ i = 1 \dots n, $$

where $ ( f _ {i} ) _ {0} $ and $ ( \partial f _ {i} / \partial x _ {j} ) _ {0} $ are the values of $ f _ {i} $ and of its derivative with respect to $ x _ {j} $ at the point $ x _ {1} = X _ {1} ^ {0} \dots x _ {m} = X _ {m} ^ {0} $. These equations are linear in the $ \xi _ {j} $, so the method of least squares can be applied directly to estimate them. Once this has been done, one obtains a first approximation to the unknowns: $ X _ {j} ^ {1} = X _ {j} ^ {0} + \xi _ {j} $. These values are then treated as the new initial approximation, and the entire procedure is repeated until two consecutive approximations coincide to within the desired precision. The procedure converges provided the variances of the errors $ \delta _ {i} $ are sufficiently small.
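
A minimal sketch of this iterative linearization (in effect, a Gauss–Newton iteration) is given below in Python with NumPy; the functions f and jac, the observations Y and the starting point X0 are placeholders to be supplied by the user.

<pre>
import numpy as np

def least_squares_iterate(f, jac, Y, X0, tol=1e-8, max_iter=50):
    """f(X) returns the vector (f_1(X), ..., f_n(X)); jac(X) its n-by-m Jacobian."""
    X = np.asarray(X0, dtype=float)
    Y = np.asarray(Y, dtype=float)
    for _ in range(max_iter):
        r = Y - f(X)                                # residuals Y_i - (f_i)_0
        J = jac(X)                                  # (partial f_i / partial x_j)_0
        xi = np.linalg.lstsq(J, r, rcond=None)[0]   # least-squares corrections xi_j
        X = X + xi                                  # next approximation
        if np.max(np.abs(xi)) < tol:                # consecutive approximations coincide
            break
    return X
</pre>

For instance, applied to the magnetization formula above with $ f _ {i} ( x _ {1} , x _ {2} ) = H _ {i} /( x _ {1} + H _ {i} x _ {2} ) $, this yields the non-linear least-squares estimates directly.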

Very commonly, if $ {\mathsf D} \delta _ {i} $ is small, the first approximation alone is quite sufficient; there is no point in requiring $ X _ {j} $ to be determined with precision significantly exceeding $ \sqrt { {\mathsf D} X _ {j} } $.

In many cases of practical importance (in particular, in the estimation of complicated non-linear relationships), the number of unknown parameters becomes very large, and so implementation of the method of least squares becomes effective only if modern computing techniques are employed.

References

[1] A.A. Markov, "Wahrscheinlichkeitsrechnung" , Teubner (1912) (Translated from Russian)
[2] A.N. Kolmogorov, "On substantiation of the method of least squares" Uspekhi Mat. Nauk , 1 : 1 (1946) pp. 57–70 (In Russian)
[3] Yu.V. Linnik, "Methode der kleinsten Quadrate in moderner Darstellung" , Deutsch. Verlag Wissenschaft. (1961) (Translated from Russian)
[4] V.V. Nalimov, "The application of mathematical statistics to chemical analysis" , Pergamon (1963) (Translated from Russian)
[5] F.R. Helmert, "Die Ausgleichungsrechnung nach der Methode der kleinsten Quadrate" , Teubner (1907)

Comments

In fact, Gauss proposed the normal distribution as a way of justifying the method of least squares. Moreover, his original derivation of the least-square estimator as the "best linear unbiased estimator" (as it is nowadays interpreted; the so-called Gauss–Markov theorem) was actually a derivation of this estimator as the "most likely value of $ \mu $" in a Bayesian sense (cf. Bayesian approach), with respect to a uniform prior on $ \mu $, cf. [a1].

For (the results of) the use of least squares in estimating the coefficients of (presumed) linear relationships $ y = A x $ between repeatedly-measured vectors of variables $ y $ and $ x $ (linear regression), cf. Regression matrix and Regression analysis.

References

[a1] S.M. Stigler, "The history of statistics" , Harvard Univ. Press (1986)
[a2] N.R. Draper, H. Smith, "Applied regression analysis" , Wiley (1981)
[a3] C.R. Rao, "Linear statistical inference and its applications" , Wiley (1965)
[a4] C. Daniel, F.S. Wood, "Fitting equations to data: computer analysis of multifactor data for scientists and engineers" , Wiley (1971)
This article was adapted from an original article by L.N. Bol'shev (originator), which appeared in Encyclopedia of Mathematics - ISBN 1402006098. See original article