Adaptive algorithm

The use of adaptive algorithms is now (1998) very widespread across such varied applications as system identification, adaptive control, transmission systems, adaptive filtering for signal processing, and several aspects of pattern recognition and learning. Many examples can be found in the literature [a8], [a1], [a5]; see also, e.g., Adaptive sampling. The aim of an adaptive algorithm is to estimate an unknown time-invariant (or slowly varying) parameter vector, traditionally denoted by $\theta$.

The study of adaptive algorithms is generally carried out within the theory of stochastic approximation.

Assume that observations $X_n$ are related to the true parameter $\theta ^ { * }$ via an equation

\begin{equation*} \mathsf{E} _ { \theta } [ H ( \theta , X ) ] = 0 , \quad \text { if } \theta = \theta ^ { * }, \end{equation*}

where $H$ is known, but the distribution of $X$ is unknown and may or may not depend on $\theta$ (as indicated by the subscript on $\mathsf{E}$). In many situations, $H$ is the gradient of a functional to be minimized.

The general structure considered for stochastic algorithms is the following:

\begin{equation*} \theta _ { n } = \theta _ { n - 1 } - \gamma _ { n } H ( \theta _ { n - 1 } , X _ { n } ), \end{equation*}

where $\gamma _ { n }$ is a non-negative non-increasing sequence, typically $1 / n$ or a constant, $X_n$ is the "somehow stationary" observed sequence, and $\theta _ { n }$ is, at each $n$, an improved estimate of $\theta ^ { * }$.
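
As an illustration (not part of the original article), this recursion can be sketched in a few lines of Python; the map $H$, the gain sequence $\gamma_n$ and the observation stream are all supplied by the caller:

```python
import numpy as np

def stochastic_approximation(H, theta0, observations, gamma=lambda n: 1.0 / n):
    # Generic recursion: theta_n = theta_{n-1} - gamma_n * H(theta_{n-1}, X_n).
    theta = np.asarray(theta0, dtype=float)
    for n, x in enumerate(observations, start=1):
        theta = theta - gamma(n) * H(theta, x)
    return theta
```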

Note that when $H ( \theta , X ) = \theta - X$, and $\gamma _ { n } = 1 / n$, $\theta _ { n }$ is simply the arithmetic mean of the $X_n$.
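
Reusing the sketch above, this special case can be checked numerically: with $H(\theta, X) = \theta - X$ and $\gamma_n = 1/n$, the iterate reproduces the sample mean.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(loc=2.0, scale=1.0, size=1000)
theta = stochastic_approximation(lambda th, x: th - x, theta0=0.0, observations=xs)
print(theta, xs.mean())  # the two values coincide (up to floating-point rounding)
```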

The easiest situation is when $X_n$ is a given sequence of independent identically-distributed random variables. However, this is not always the case: the classical Robbins–Monro algorithms correspond to the situation where a parameter $\theta$ (e.g., a dosage of a chemical product) needs to be tuned so that the effect measured by $X$ is at an average given level $\alpha$. In this case $H ( \theta , X ) = X - \alpha$ and $X_n$ is the result of an experiment made with $\theta _ { n - 1} $. The study of stochastic algorithms is generally restricted to those which have the Markovian structure introduced in [a1], where $X_n$ is a random variable depending only on $( \theta _ { n - 1} , X _ { n - 1} )$; more precisely, the distribution of $X_n$ conditional on the whole past $( X _ { n - 1 } , \theta _ { n - 1 } , \ldots )$ is given by a transition kernel $P _ { \theta _ { n - 1 } } ( X _ { n - 1 } , d x )$. In this case, the expectation above has to be taken with respect to the stationary distribution of the Markov chain $X_n$ when $\theta$ is fixed.
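
A toy Robbins–Monro run, under an assumed linear response model (the model and all constants below are illustrative assumptions, not from the article):

```python
import numpy as np

rng = np.random.default_rng(1)

def experiment(theta):
    # Assumed response model: mean effect 3*theta, observed with unit noise.
    return 3.0 * theta + rng.normal()

alpha = 6.0            # target average level
theta = 0.0
for n in range(1, 5001):
    x = experiment(theta)             # X_n: outcome of a trial run at theta_{n-1}
    theta -= (1.0 / n) * (x - alpha)  # H(theta, X) = X - alpha
print(theta)           # tends to the root of 3*theta - alpha, i.e. 2.0
```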

Decreasing gain.

In the case of decreasing gain, the typical results proved are the almost-sure convergence of $\theta _ { n }$ to a solution of the equation $h ( \theta ) = 0$ (where $h ( \theta ) = \mathsf{E} _ { \theta } [ H ( \theta , X ) ]$) and, if $\gamma _ { n } = 1 / n$, the asymptotic normality of $\sqrt { n } ( \theta _ { n } - \theta ^ { * } )$ and the law of the iterated logarithm ([a3], [a6], [a1], [a4]). The asymptotic covariance matrix $V$ of $\sqrt { n } ( \theta _ { n } - \theta ^ { * } )$ is the solution of the Lyapunov equation

\begin{equation*} \left( h _ { \theta } ^ { * } - \frac { I } { 2 } \right) V + V \left( h _ { \theta } ^ { * } - \frac { I } { 2 } \right) ^ { T } = R ( \theta ^ { * } ), \end{equation*}

where

\begin{equation*} h _ { \theta } ^ { * } = \nabla h ( \theta ^ { * } ), \end{equation*}

\begin{equation*} R ( \theta ^ { * } ) = \sum _ { n = - \infty } ^ { \infty } \operatorname { cov } ( H ( \theta ^ { * } , X _ { n } ) , H ( \theta ^ { * } , X _ { 0 } ) ). \end{equation*}
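
Numerically, once $h _ { \theta } ^ { * }$ and $R ( \theta ^ { * } )$ are available, the Lyapunov equation above can be solved with a standard solver; the matrices below are assumed purely for illustration:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Assumed values for a 2-dimensional example (not from the article).
h_star = np.array([[2.0, 0.5],
                   [0.0, 1.5]])   # Jacobian of h at theta*; eigenvalues exceed 1/2
R_star = np.array([[1.0, 0.2],
                   [0.2, 0.8]])   # long-run covariance R(theta*)

A = h_star - 0.5 * np.eye(2)
V = solve_continuous_lyapunov(A, R_star)   # solves A V + V A^T = R(theta*)
print(V)
```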

The expectation is taken with respect to the distribution of the sequence $( X _ { n } )$ (in the general Markovian situation, one has to consider here the stationary distribution of the chain with transition kernel $P _ { \theta ^ *} ( X _ { n - 1 }, d x )$). In particular, asymptotic normality occurs only if $I / 2 - h _ { \theta } ^ { * }$ has only negative eigenvalues (otherwise the Lyapunov equation above has no solution). Improving this covariance may be done by insertion of a matrix gain (to obtain the smallest $V$),

\begin{equation*} \theta _ { n } = \theta _ { n - 1 } - \gamma _ { n } \Gamma H ( \theta _ { n - 1 } , X _ { n } ), \end{equation*}

and a simple computation shows that the optimal gain is

\begin{equation*} \Gamma ^ { * } = ( h _ { \theta } ^ { * } ) ^ { - 1 }. \end{equation*}

This matrix may be estimated during the algorithm, but convergence becomes quite difficult to prove. With this choice of the gain, the Cramér–Rao bound is attained. Another way is to use the Polyak–Ruppert averaging method [a7]:

\begin{equation*} \theta _ { n } = \theta _ { n - 1 } - \gamma _ { n } H ( \theta _ { n - 1 } , X _ { n } ), \end{equation*}

\begin{equation*} \overline { \theta } _ { n } = \overline { \theta } _ { n - 1 } + \frac { 1 } { n } ( \theta _ { n - 1 } - \overline { \theta } _ { n - 1 } ), \end{equation*}

with a "larger" $\gamma _ { n }$ (typically $\gamma _ { n } = n ^ { - 2 / 3 }$). One can prove the asymptotic optimality of this algorithm.

Constant gain.

Constant-gain algorithms are used for tracking a slowly varying optimal parameter $\theta _ { n } ^ { * }$. The asymptotic behaviour of the algorithm may be studied when the speed of $\theta _ { n } ^ { * }$ and the gain $\gamma$ tend to zero [a1]; this so-called method of diffusion approximation consists in proving that, if $\theta _ { n } ^ { * }$ is fixed, then the trajectory of $\theta _ { n }$, suitably centred and rescaled in space and time, is well approximated by a diffusion. One shows that in the case of a smooth variation of $\theta _ { n } ^ { * }$, the gain has to be tuned approximately proportional to $v ^ { 2 / 3 }$, where $v$ is the speed of variation of $\theta _ { n } ^ { * }$. On the other hand, if $\theta _ { n } ^ { * }$ moves along a random walk, the gain has to be chosen proportional to the average amplitude of $| \theta _ { n + 1 } ^ { * } - \theta _ { n } ^ { * } |$. Other authors study the stationary limit distribution of $\theta _ { n }$ when $\theta _ { n } ^ { * }$ has a given distribution ([a2] and references therein). The direct on-line estimation of a good gain $\gamma$ without a priori knowledge of the variation of $\theta ^ { * }$ remains an important open problem.
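
A toy tracking sketch (an assumed mean-estimation problem with $H(\theta, X) = \theta - X$ and a random-walk target; none of the constants come from the article), illustrating how a constant gain trades tracking speed against noise sensitivity:

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, sigma = 0.05, 0.01          # constant gain; random-walk amplitude of theta*
theta_star, theta, errs = 2.0, 0.0, []
for n in range(20000):
    theta_star += sigma * rng.normal()   # slowly drifting optimal parameter
    x = theta_star + rng.normal()        # noisy observation around theta*_n
    theta -= gamma * (theta - x)         # H(theta, X) = theta - X
    errs.append(theta - theta_star)
print(np.sqrt(np.mean(np.square(errs[5000:]))))  # steady-state tracking RMSE
```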

References

[a1] A. Benveniste, M. Métivier, P. Priouret, "Adaptive Algorithms and Stochastic Approximations", Springer (1990)
[a2] B. Delyon, A. Juditsky, "Asymptotical study of parameter tracking algorithms", SIAM J. Control and Optimization, 33 : 1 (1995) pp. 323–345
[a3] P. Hall, C.C. Heyde, "Martingale limit theory and its applications", Acad. Press (1980)
[a4] H.J. Kushner, D.S. Clark, "Stochastic Approximation Methods for Constrained and Unconstrained Systems", Springer (1978)
[a5] L. Ljung, T. Söderström, "Theory and practice of recursive identification", MIT (1983)
[a6] B.M. Nevel'son, R.Z. Khas'minskii, "Stochastic approximation and recursive estimation", Transl. Math. Monogr., 47, Amer. Math. Soc. (1976)
[a7] B.T. Polyak, "New stochastic approximation type procedures", Autom. Remote Contr., 51 : 7 (1990) pp. 937–946
[a8] G.N. Saridis, "Stochastic approximation methods for identification and control — a survey", IEEE Trans. Automat. Control, 19 : 6 (1974)
This article was adapted from an original article by B. Delyon (originator), which appeared in Encyclopedia of Mathematics - ISBN 1402006098.