Difference between revisions of "Maximum-likelihood method"

Revision as of 08:00, 6 June 2020

One of the fundamental general methods for constructing estimators of unknown parameters in statistical estimation theory.

Suppose that an observation $ X $ has distribution $ {\mathsf P} _ \theta $ depending on an unknown parameter $ \theta \in \Theta \subseteq \mathbf R ^ {k} $, and that the task is to estimate $ \theta $. Assuming that all measures $ {\mathsf P} _ \theta $ are absolutely continuous relative to a common measure $ \nu $, the likelihood function is defined by

$$ L ( \theta ) = \ \frac{d {\mathsf P} _ \theta }{d \nu } ( X ) . $$

The maximum-likelihood method recommends taking as an estimator for $ \theta $ the statistic $ \widehat \theta $ defined by

$$ L ( \widehat \theta ) = \ \max _ {\theta \in \Theta ^ {c} } \ L ( \theta ) . $$

The statistic $ \widehat \theta $ is called the maximum-likelihood estimator. In a broad class of cases the maximum-likelihood estimator is a solution of the likelihood equations

$$ \tag{1} \frac{\partial}{\partial \theta _ {i} } \mathop{\rm log} L ( \theta ) = 0 ,\ \ i = 1 , \dots , k ,\ \ \theta = ( \theta _ {1} , \dots , \theta _ {k} ) . $$

Example 1. Let $ X = ( X _ {1} , \dots , X _ {n} ) $ be a sequence of independent random variables (observations) with common distribution $ {\mathsf P} _ \theta $, $ \theta \in \Theta $. If there is a density

$$ f ( x , \theta ) = \ \frac{d {\mathsf P} _ \theta }{dm} ( x) $$

relative to some measure $ m $, then

$$ L ( \theta ) = \ \prod _ {j=1} ^ {n} f ( X _ {j} , \theta ) $$

and the equations (1) take the form

$$ \tag{2} \sum _ {j=1} ^ {n} \frac{\partial}{\partial \theta _ {i} } \mathop{\rm log} f ( X _ {j} , \theta ) = 0 ,\ \ i = 1 , \dots , k . $$
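
When equations (2) have no convenient closed-form solution, they are usually solved numerically. The following is a minimal sketch of Newton's method for a one-parameter family. The exponential density $ f ( x , \theta ) = \theta e ^ {- \theta x} $, the sample values and all function names are assumptions made for illustration and do not come from the text above; this family is chosen because the root $ 1 / \overline{X} $ is known in closed form, which gives a check on the iteration.

```python
def score(theta, xs):
    """Left-hand side of (2) for f(x, theta) = theta * exp(-theta * x)."""
    return sum(1.0 / theta - x for x in xs)


def score_derivative(theta, xs):
    """Derivative of the score in theta; negative for theta > 0, so log L is concave."""
    return -len(xs) / theta ** 2


def solve_likelihood_equation(xs, theta0, tol=1e-12, max_iter=100):
    """Newton iteration theta <- theta - score / score'.

    For this particular family the iteration converges when the
    starting point satisfies 0 < theta0 < 2 / mean(xs).
    """
    theta = theta0
    for _ in range(max_iter):
        step = score(theta, xs) / score_derivative(theta, xs)
        theta -= step
        if abs(step) < tol:
            break
    return theta


sample = [0.5, 1.5, 2.0, 4.0]  # hypothetical observations, sample mean 2.0
theta_hat = solve_likelihood_equation(sample, theta0=0.1)
closed_form = 1.0 / (sum(sample) / len(sample))
print(theta_hat, closed_form)
```

For families without a closed-form root the same loop applies, but the choice of starting point and the possibility of several roots of (2) need separate attention.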

Example 2. In Example 1, let $ {\mathsf P} _ \theta $ be the normal distribution with density

$$ \frac{1}{\sigma \sqrt {2 \pi } } \mathop{\rm exp} \left \{ - \frac{( x - a ) ^ {2} }{2 \sigma ^ {2} } \right \} , $$

where $ x \in \mathbf R ^ {1} $, $ \theta = ( a , \sigma ^ {2} ) $, $ - \infty < a < \infty $, $ \sigma ^ {2} > 0 $. Equations (2) become

$$ \frac{1}{\sigma ^ {2} } \sum _ {j=1} ^ {n} ( X _ {j} - a ) = 0 , $$

$$ \frac{1}{2 \sigma ^ {4} } \sum _ {j=1} ^ {n} ( X _ {j} - a ) ^ {2} - \frac{n}{2 \sigma ^ {2} } = 0 ; $$

and the maximum-likelihood estimator is given by

$$ \widehat{a} = \overline{X} = \frac{1}{n} \sum _ {j=1} ^ {n} X _ {j} ,\ \ \widehat \sigma {} ^ {2} = \frac{1}{n} \sum _ {j=1} ^ {n} ( X _ {j} - \overline{X} ) ^ {2} . $$
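
The closed-form estimator of Example 2 can be checked directly. The sketch below computes the sample mean and the variance with divisor $ n $ (not $ n - 1 $), then substitutes them into the two likelihood equations. The data values and function names are hypothetical.

```python
def normal_mle(xs):
    """Maximum-likelihood estimates (a, sigma^2) for the normal model."""
    n = len(xs)
    a_hat = sum(xs) / n
    var_hat = sum((x - a_hat) ** 2 for x in xs) / n  # divisor n, not n - 1
    return a_hat, var_hat


def normal_score(a, var, xs):
    """Left-hand sides of the two likelihood equations of Example 2."""
    n = len(xs)
    d_a = sum(x - a for x in xs) / var
    d_var = sum((x - a) ** 2 for x in xs) / (2 * var ** 2) - n / (2 * var)
    return d_a, d_var


data = [2.0, 4.0, 4.0, 6.0]  # hypothetical observations
a_hat, var_hat = normal_mle(data)
print(a_hat, var_hat)                       # 4.0 2.0
print(normal_score(a_hat, var_hat, data))   # (0.0, 0.0)
```

Because the divisor is $ n $, the estimator $ \widehat \sigma {} ^ {2} $ is biased for finite $ n $; the unbiased sample variance uses $ n - 1 $ instead.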

Example 3. In Example 1, let $ X _ {j} $ take the values $ 0 $ and $ 1 $ with probabilities $ 1 - \theta $, $ \theta $, respectively. Then

$$ L ( \theta ) = \ \prod _ {j=1} ^ {n} \theta ^ {X _ {j} } ( 1 - \theta ) ^ {1 - X _ {j} } , $$

and the maximum-likelihood estimator is $ \widehat \theta = \overline{X}\; $.
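
In Example 3, writing $ S = \sum X _ {j} $, the likelihood equation reads $ S / \theta - ( n - S ) / ( 1 - \theta ) = 0 $, whose root is $ S / n = \overline{X} $. As a rough numerical check, the sketch below compares the sample mean with the maximizer of the log-likelihood over a grid of interior values. The observations, the grid resolution and the function name are assumptions for illustration.

```python
import math


def bernoulli_log_likelihood(theta, xs):
    """log L(theta) = S*log(theta) + (n - S)*log(1 - theta), S = number of ones."""
    s = sum(xs)
    n = len(xs)
    return s * math.log(theta) + (n - s) * math.log(1.0 - theta)


xs = [1, 0, 0, 1, 1, 0, 1, 1]   # hypothetical 0/1 observations, five ones
theta_hat = sum(xs) / len(xs)   # sample mean

# Maximize over a grid of candidate values strictly inside (0, 1).
grid = [k / 1000 for k in range(1, 1000)]
best = max(grid, key=lambda t: bernoulli_log_likelihood(t, xs))
print(theta_hat, best)          # 0.625 0.625
```

If all observations are equal (all zeros or all ones), the maximum is attained on the boundary of $ [ 0 , 1 ] $, where the logarithm above is not defined, so that case would need separate handling.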

Example 4. Let the observation $ X = X _ {t} $ be a diffusion process with stochastic differential

$$ d X _ {t} = \theta a _ {t} ( X _ {t} ) \, d t + d W _ {t} ,\ \ X _ {0} = 0 ,\ 0 \leq t \leq T , $$

where $ W _ {t} $ is a Wiener process and $ \theta $ is an unknown one-dimensional parameter. Here (see [3]),

$$ \mathop{\rm log} L ( \theta ) = \ \theta \int\limits _ { 0 } ^ { T } a _ {t} ( X _ {t} ) \, d X _ {t} - \frac{\theta ^ {2} }{2} \int\limits _ { 0 } ^ { T } a _ {t} ^ {2} ( X _ {t} ) \, d t , $$

$$ \widehat \theta = \frac{\int\limits _ { 0 } ^ { T } a _ {t} ( X _ {t} ) d X _ {t} }{\int\limits _ { 0 } ^ { T } a _ {t} ^ {2} ( X _ {t} ) d t } . $$
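
In practice the path in Example 4 is observed on a time grid, and the two integrals are replaced by sums. The sketch below simulates $ d X _ {t} = \theta a _ {t} ( X _ {t} ) \, d t + d W _ {t} $ by the Euler-Maruyama scheme and evaluates the discretized estimator, using left-endpoint sums for the Itô integral. The discretization, the constant drift coefficient $ a _ {t} ( x ) = 1 $, the parameter values and the function name are all assumptions made for illustration; with this drift the estimator reduces to $ X _ {T} / T $, which serves as a check.

```python
import math
import random


def simulate_and_estimate(theta, drift, T, dt, seed):
    """Simulate dX = theta*drift(t, X) dt + dW and return (theta_hat, X_T)."""
    rng = random.Random(seed)
    steps = int(round(T / dt))
    x = 0.0            # X_0 = 0
    numerator = 0.0    # approximates the integral of a_t(X_t) dX_t
    denominator = 0.0  # approximates the integral of a_t(X_t)^2 dt
    for i in range(steps):
        a = drift(i * dt, x)
        dx = theta * a * dt + rng.gauss(0.0, math.sqrt(dt))
        numerator += a * dx        # left-endpoint (Ito) sum
        denominator += a * a * dt
        x += dx
    return numerator / denominator, x


theta_hat, x_T = simulate_and_estimate(
    theta=0.7, drift=lambda t, x: 1.0, T=400.0, dt=0.01, seed=1
)
print(theta_hat, x_T / 400.0)
```

For a state-dependent drift coefficient the same function applies unchanged, although the discretization then introduces an error that depends on the step `dt`.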

There is no definitive argument for the optimality of the maximum-likelihood method; the widespread belief in its efficiency rests partly on the great success with which it has been applied to numerous concrete problems, and partly on rigorously established asymptotic optimality properties. For example, in Example 1, under broad assumptions, $ \widehat \theta _ {n} \rightarrow \theta $ as $ n \rightarrow \infty $ with $ {\mathsf P} _ \theta $-probability $ 1 $. If the Fisher information

$$ I ( \theta ) = \ \int\limits \frac{| f _ \theta ^ { \prime } ( x , \theta ) | ^ {2} }{f ( x , \theta ) } m ( dx ) $$

exists, then the difference $ \sqrt n ( \widehat \theta _ {n} - \theta ) $ is asymptotically normal with parameters $ ( 0 , I ^ {-1} ( \theta ) ) $, and $ \widehat \theta _ {n} $, in a well-defined sense, has an asymptotically-minimal mean-square deviation from $ \theta $ (see [4], [5]).
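
The asymptotic normality statement can be illustrated by simulation in the setting of Example 3, where $ \widehat \theta _ {n} = \overline{X} $ and the Fisher information is $ I ( \theta ) = 1 / ( \theta ( 1 - \theta ) ) $. The sketch below draws repeated samples and compares the empirical variance of $ \sqrt n ( \widehat \theta _ {n} - \theta ) $ with $ I ^ {-1} ( \theta ) $. The sample size, number of replications, seed and function name are arbitrary choices for illustration, and the agreement is only approximate.

```python
import math
import random


def scaled_errors(theta, n, replications, seed):
    """Return sqrt(n) * (theta_hat_n - theta) for independent Bernoulli samples."""
    rng = random.Random(seed)
    errors = []
    for _ in range(replications):
        ones = sum(1 for _trial in range(n) if rng.random() < theta)
        errors.append(math.sqrt(n) * (ones / n - theta))
    return errors


theta = 0.3
errors = scaled_errors(theta, n=400, replications=2000, seed=0)
mean = sum(errors) / len(errors)
variance = sum((e - mean) ** 2 for e in errors) / len(errors)
inverse_fisher = theta * (1.0 - theta)  # 1 / I(theta) for the Bernoulli model
print(mean, variance, inverse_fisher)
```

The empirical mean should be close to $ 0 $ and the empirical variance close to $ \theta ( 1 - \theta ) = 0.21 $; a histogram of `errors` would show the approximately normal shape.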

References

[1] H. Cramér, "Mathematical methods of statistics", Princeton Univ. Press (1946)
[2] S. Zacks, "The theory of statistical inference", Wiley (1975)
[3] R.S. Liptser, A.N. Shiryaev, "Statistics of random processes", 1, Springer (1977) (Translated from Russian)
[4] I.A. Ibragimov, R.Z. Has'minskii, "Statistical estimation: asymptotic theory", Springer (1981) (Translated from Russian)
[5] E.L. Lehmann, "Theory of point estimation", Wiley (1983)

How to Cite This Entry:
Maximum-likelihood method. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Maximum-likelihood_method&oldid=39772

This article was adapted from an original article by I.A. Ibragimov (originator), which appeared in Encyclopedia of Mathematics - ISBN 1402006098.

