Dirichlet process

 
The Dirichlet process provides one means of placing a [[Probability distribution|probability distribution]] on the space of distribution functions, as is done in Bayesian statistical analysis (cf. also [[Bayesian approach|Bayesian approach]]). The support of the Dirichlet process is large: For each distribution function there is a set of distributions nearby that receives positive probability. This contrasts with a typical probability distribution on the space of distribution functions where, for example, one might place a probability distribution on the mean and variance of a normal distribution. The support in this example would be contained in the collection of normal distributions. The large support of the Dirichlet process accounts for its use in non-parametric Bayesian analysis. General references are [[#References|[a4]]], [[#References|[a5]]].
 
The Dirichlet process is indexed by its parameter, a non-null, finite measure $\alpha$. Formally, consider a space $\mathcal{X}$ with a collection of Borel sets $\mathcal{B}$ on $\mathcal{X}$. The random probability distribution $P$ has a Dirichlet process prior distribution with parameter $\alpha$, denoted by $\mathcal{D}_\alpha$, if for every measurable partition $\{ A_1, \dots, A_m \}$ of $\mathcal{X}$ the random vector $( P(A_1), \dots, P(A_m) )$ has the Dirichlet distribution with parameter vector $( \alpha(A_1), \dots, \alpha(A_m) )$.
  
When a [[prior distribution]] is put on the space of probability distributions on $\mathcal{X}$, the quantity $P(A)$ is a random variable for every measurable subset $A$ of $\mathcal{X}$. The normalized measure $\alpha_0 = \alpha / \alpha(\mathcal{X})$ is a probability measure on $\mathcal{X}$. From the definition one sees that if $P \sim \mathcal{D}_\alpha$, then $\mathsf{E} P(A) = \alpha_0(A)$.
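
As a concrete illustration of the definition and the mean property, the following sketch (with illustrative choices throughout: $\mathcal{X} = [0,1]$, $\alpha = M \cdot \mathrm{Uniform}[0,1]$ with $M = 5$, and a three-cell interval partition) samples the random vector $(P(A_1), P(A_2), P(A_3))$ from the corresponding Dirichlet distribution and checks $\mathsf{E} P(A_j) = \alpha_0(A_j)$ by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup: X = [0, 1], alpha = M * Uniform[0, 1] with M = 5,
# and the measurable partition A_1 = [0, .2), A_2 = [.2, .5), A_3 = [.5, 1].
M = 5.0
lengths = np.array([0.2, 0.3, 0.5])   # Lebesgue measure of each cell
alpha_vec = M * lengths               # (alpha(A_1), alpha(A_2), alpha(A_3))
alpha0 = lengths                      # alpha_0 = alpha / alpha(X)

# By definition, (P(A_1), P(A_2), P(A_3)) ~ Dirichlet(alpha(A_1), ..., alpha(A_3)).
draws = rng.dirichlet(alpha_vec, size=100_000)

# Mean property: E P(A_j) = alpha_0(A_j).
print(draws.mean(axis=0))   # approximately [0.2, 0.3, 0.5]
print(alpha0)
```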
  
An alternative representation of the Dirichlet process is given in [[#References|[a6]]]: Let $B_1, B_2, \dots$ be independent and identically distributed $\mathop{\rm Beta}(1, \alpha(\mathcal{X}))$ random variables, and let $V_1, V_2, \dots$ be a sequence of independent and identically distributed random variables with distribution $\alpha_0$, independent of the $B_i$. Define $B_0 = 0$ and $P_i = B_i \prod_{j=0}^{i-1} (1 - B_j)$. The random distribution $\sum_{i=1}^\infty P_i \delta_{V_i}$ has the distribution $\mathcal{D}_\alpha$. Here, $\delta_a$ denotes the point mass at $a$. This representation makes clear that the Dirichlet process assigns probability one to the set of discrete distributions, and it emphasizes the role of the total mass of the measure $\alpha$. For example, as $\alpha(\mathcal{X}) \rightarrow \infty$, $\mathcal{D}_\alpha$ converges to the point mass at $\alpha_0$ (in the [[Weak topology|weak topology]] induced by $\mathcal{B}$); and as $\alpha(\mathcal{X}) \rightarrow 0$, $\mathcal{D}_\alpha$ converges to the random distribution which is degenerate at a single point $V$, whose location has distribution $\alpha_0$.
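
A truncated version of this stick-breaking construction is straightforward to implement. The sketch below is a minimal implementation under assumed illustrative choices (truncation at 500 atoms, total mass $\alpha(\mathcal{X}) = 2$, standard normal base measure $\alpha_0$); the truncation and renormalization are a practical approximation, not part of the construction in [a6].

```python
import numpy as np

def stick_breaking_dp(M, base_sampler, n_atoms=500, rng=None):
    """Truncated Sethuraman construction: B_i ~ Beta(1, M) iid,
    P_i = B_i * prod_{j < i} (1 - B_j), V_i ~ alpha_0 iid; returns the
    atoms V_i and weights P_i of one realization of P ~ D_alpha."""
    rng = rng or np.random.default_rng()
    B = rng.beta(1.0, M, size=n_atoms)
    P = B * np.concatenate(([1.0], np.cumprod(1.0 - B[:-1])))
    V = base_sampler(n_atoms, rng)
    return V, P / P.sum()   # renormalize the mass lost to truncation

rng = np.random.default_rng(1)
# Illustrative choices: total mass alpha(X) = 2, base measure alpha_0 = N(0, 1).
V, P = stick_breaking_dp(2.0, lambda k, r: r.standard_normal(k), rng=rng)
x = rng.choice(V, size=10, p=P)   # ten draws from this realization of P
```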
  
The Dirichlet process is conjugate: if $P \sim \mathcal{D}_\alpha$ and data points $X_1, \dots, X_n$ drawn independently and identically from $P$ are observed, then the conditional distribution of $P$ given $X_1, \dots, X_n$ is $\mathcal{D}_{\alpha + \sum_{i=1}^n \delta_{X_i}}$. This conjugation property is an extension of the conjugacy of the Dirichlet distribution for multinomial data. It ensures the existence of analytical results with a simple form for many problems. The combination of simplicity and usefulness has given the Dirichlet process its reputation as the standard non-parametric model for a probability distribution on the space of distribution functions.
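
One convenient consequence of conjugacy, combined with the mean property above, is the posterior mean $\mathsf{E}[P(A) \mid X_1, \dots, X_n] = ( \alpha(A) + \sum_i \delta_{X_i}(A) ) / ( \alpha(\mathcal{X}) + n )$, from which a new observation can be drawn. A sketch, assuming an illustrative prior $\alpha = M \cdot N(0,1)$ with $M = 3$ and hypothetical data:

```python
import numpy as np

rng = np.random.default_rng(2)

M = 3.0                              # alpha(X), with alpha = M * N(0, 1)
data = np.array([0.1, 0.15, 2.3])    # hypothetical observations X_1, ..., X_n
n = len(data)

def predictive_draw():
    """Draw X_{n+1} from the posterior mean of P, i.e. the normalized
    posterior parameter (alpha + sum_i delta_{X_i}) / (M + n): with
    probability M / (M + n) a fresh draw from alpha_0, otherwise a
    uniformly chosen previous observation."""
    if rng.random() < M / (M + n):
        return rng.standard_normal()
    return float(rng.choice(data))

print([predictive_draw() for _ in range(5)])
```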
  
An important extension of the class of Dirichlet processes is the class of mixtures of Dirichlet processes. A mixture of Dirichlet processes is a Dirichlet process in which the parameter measure is itself random. In applications, the parameter measure ranges over a finite-dimensional parametric family. Formally, one considers a parametric family of probability distributions $\{ \alpha_{\theta,0} : \theta \in \Theta \}$. Suppose that for every $\theta \in \Theta$ a positive constant $\alpha_\theta(\mathcal{X})$ is specified, and let $\alpha_\theta = \alpha_\theta(\mathcal{X}) \cdot \alpha_{\theta,0}$. If $\nu$ is a probability distribution on $\Theta$, and if, first, $\theta$ is chosen from $\nu$, and then $P$ is chosen from $\mathcal{D}_{\alpha_\theta}$, one says that the prior on $P$ is a mixture of Dirichlet processes (with parameter $( \{ \alpha_\theta \}_{\theta \in \Theta}, \nu )$). A reference for this is [[#References|[a1]]]. Often, $\alpha_\theta(\mathcal{X}) \equiv M$, i.e., the constants $\alpha_\theta(\mathcal{X})$ do not depend on $\theta$. In this case, large values of $M$ indicate that the prior on $P$ is concentrated around the parametric family $\{ \alpha_{\theta,0} : \theta \in \Theta \}$. More precisely, as $M \rightarrow \infty$, the distribution of $P$ converges to $\int \alpha_{\theta,0} \, \nu(d\theta)$, the standard Bayesian model for the parametric family $\{ \alpha_{\theta,0} : \theta \in \Theta \}$ in which $\theta$ has prior $\nu$.
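
Sampling from a mixture of Dirichlet processes is a two-stage procedure: first $\theta \sim \nu$, then $P \sim \mathcal{D}_{\alpha_\theta}$, here via the truncated stick-breaking sketch from above. The family $\alpha_{\theta,0} = N(\theta, 1)$ and the mixing distribution $\nu = N(0, 10^2)$ are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_mdp(M, nu_sampler, base_sampler, n_atoms=500):
    """Two-stage draw from a mixture of Dirichlet processes:
    theta ~ nu, then P ~ D_{M * alpha_{theta,0}} (truncated stick-breaking)."""
    theta = nu_sampler(rng)
    B = rng.beta(1.0, M, size=n_atoms)
    P = B * np.concatenate(([1.0], np.cumprod(1.0 - B[:-1])))
    V = base_sampler(theta, n_atoms, rng)
    return theta, V, P / P.sum()

# Hypothetical family alpha_{theta,0} = N(theta, 1), mixing distribution nu = N(0, 100).
theta, V, P = sample_mdp(
    M=20.0,
    nu_sampler=lambda r: 10.0 * r.standard_normal(),
    base_sampler=lambda th, k, r: th + r.standard_normal(k),
)
```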
  
The Dirichlet process has been used in many applications. A particularly interesting one is the Bayesian hierarchical model, which is the Bayesian version of the random effects model. A typical example is as follows. Suppose one is studying the success of a certain type of operation for patients from different hospitals, with $n_i$ patients in hospital $i$, $i = 1, \dots, I$. One might model the number of failures $X_i$ in hospital $i$ by a [[Binomial distribution|binomial distribution]], with success probability depending on the hospital, and view the $I$ binomial parameters as independent and identically distributed draws from a common distribution. The typical hierarchical model is then written as
  
$$ \tag{a1}
\textrm{given } \theta_i, \quad X_i \sim \mathop{\rm Bin}(n_i, \theta_i),
$$
  
$$
\theta_i \sim \mathop{\rm Beta}(a, b) \quad \textrm{iid},
$$
  
$$
(a, b) \sim G(\cdot, \cdot).
$$
  
Here, the $\theta_i$ are unobserved, or latent, variables. If the distribution $G$ were degenerate, then the $\theta_i$ would be independent, so that data from one hospital would not give any information about the success rate at any other hospital. On the other hand, when $G$ is not degenerate, data coming from the other hospitals provide some information about the success rate of hospital $i$.
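
A forward simulation of (a1) makes the generative structure explicit. The hyperprior $G$ below (independent Exponential distributions with mean 5 for $a$ and $b$) and the patient counts are hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)

I = 8
n = rng.integers(20, 100, size=I)     # n_i patients in hospital i
a, b = rng.exponential(5.0, size=2)   # (a, b) ~ G (hypothetical hyperprior)
theta = rng.beta(a, b, size=I)        # theta_i ~ Beta(a, b) iid
X = rng.binomial(n, theta)            # given theta_i, X_i ~ Bin(n_i, theta_i)
print(np.column_stack([n, X]))        # (patients, failures) per hospital
```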
  
Consider now the problem of predicting the number of successes for a new hospital, indexed $I + 1$. A disadvantage of the model (a1) is that if the $\theta_i$ are independent and identically drawn from a distribution which is not a Beta distribution, then even as $I \rightarrow \infty$, the predictive distribution of $X_{I+1}$ based on the (incorrect) model (a1) need not converge to the actual predictive distribution of $X_{I+1}$. An alternative model, using a mixture-of-Dirichlet-processes prior, would be written as
  
$$ \tag{a2}
\textrm{given } \theta_i, \quad X_i \sim \mathop{\rm Bin}(n_i, \theta_i),
$$
  
$$
\theta_i \sim P \quad \textrm{iid},
$$
  
$$
P \sim \mathcal{D}_{M \cdot \mathop{\rm Beta}(a, b)},
$$
  
$$
(a, b) \sim G(\cdot, \cdot).
$$
  
The model (a2) does not have the defect suffered by (a1), because the support of the distribution on $P$ is the set of all distributions concentrated on the interval $[0, 1]$.
  
 
It is not possible to obtain closed-form expressions for the posterior distributions in (a2). Computational schemes to obtain these have been developed by M. Escobar and M. West [[#References|[a3]]] and C.A. Bush and S.N. MacEachern [[#References|[a2]]].
 
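
Those references treat general Dirichlet-process mixture models; for (a2) specifically, a marginal (Pólya-urn) Gibbs sampler is particularly simple when $(a, b)$ is held fixed, because the $\mathop{\rm Beta}(a, b)$ base measure is conjugate to the binomial likelihood. The sketch below is a simplified illustration in that spirit, not the scheme of [a2] or [a3] (which also update $(a, b)$ under $G$); the data are hypothetical.

```python
import numpy as np
from scipy.stats import binom, betabinom

def gibbs_a2(X, n, M, a, b, n_sweeps=500, rng=None):
    """Simplified marginal (Polya-urn) Gibbs sampler for (a2) with (a, b)
    fixed.  Conjugacy of Beta(a, b) with the binomial gives the exact
    conditional for theta_i given theta_{-i} and X_i:
      - set theta_i = theta_j (j != i) with weight Bin(X_i | n_i, theta_j);
      - draw a fresh theta_i ~ Beta(a + X_i, b + n_i - X_i) with weight
        M * BetaBinom(X_i | n_i, a, b), the marginal likelihood of X_i."""
    rng = rng or np.random.default_rng()
    I = len(X)
    theta = rng.beta(a + X, b + n - X)            # initialize independently
    out = np.empty((n_sweeps, I))
    for t in range(n_sweeps):
        for i in range(I):
            w = binom.pmf(X[i], n[i], theta)      # weights for reusing theta_j
            w[i] = M * betabinom.pmf(X[i], n[i], a, b)   # weight for a new value
            j = rng.choice(I, p=w / w.sum())
            theta[i] = rng.beta(a + X[i], b + n[i] - X[i]) if j == i else theta[j]
        out[t] = theta
    return out

# Hypothetical data: X failures out of n patients in five hospitals.
X = np.array([2, 3, 1, 15, 14])
n = np.array([40, 50, 30, 45, 40])
draws = gibbs_a2(X, n, M=1.0, a=1.0, b=1.0, n_sweeps=200,
                 rng=np.random.default_rng(5))
print(draws[100:].mean(axis=0))   # posterior means of theta_i after burn-in
```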
  
The parameter $M$ plays an interesting role. When $M$ is small, then, with high probability, the $\theta_i$ are all equal, so that, in effect, one is working with the model in which the $X_i$ are independent binomial samples with the same success probability. On the other hand, when $M$ is large, the model (a2) is very close to (a1).
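
This behaviour can be seen in the prior itself: as a known consequence of the conjugacy property above, the $\theta_i$ marginally follow a Pólya urn in which $\theta_i$ is a fresh draw from the base measure with probability $M / (M + i - 1)$ and a copy of an earlier $\theta_j$ otherwise. A quick simulation (with illustrative values $a = b = 1$ and $I = 20$) counts the distinct values:

```python
import numpy as np

rng = np.random.default_rng(6)

def n_distinct(M, I):
    """Simulate theta_1, ..., theta_I from the (a2) prior via the Polya urn:
    theta_i is a fresh Beta(a, b) draw with probability M / (M + i - 1),
    otherwise a copy of a uniformly chosen earlier theta_j."""
    theta = []
    for i in range(I):
        if rng.random() < M / (M + i):
            theta.append(rng.beta(1.0, 1.0))    # fresh draw (a = b = 1 here)
        else:
            theta.append(theta[rng.integers(i)])
    return len(set(theta))

for M in (0.01, 1.0, 100.0):
    print(M, np.mean([n_distinct(M, 20) for _ in range(200)]))
# Small M: the theta_i are all equal with high probability (~1 distinct value);
# large M: essentially all distinct, so (a2) behaves like (a1).
```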
  
It is interesting to note that when $M$ is large and the distribution $G$ is degenerate, then the measure on $P$ is essentially degenerate, so that one is treating the data from the hospitals as independent. Thus, when the distribution $G$ is degenerate, the parameter $M$ determines the extent to which data from the other hospitals are used when making an inference about hospital $i$, and in that sense plays the role of a tuning parameter in the bias-variance tradeoff of frequentist analysis.
  
 
====References====
 
 
<table><TR><TD valign="top">[a1]</TD> <TD valign="top">  C. Antoniak,  "Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems"  ''Ann. Statist.'' , '''2'''  (1974)  pp. 1152–1174</TD></TR><TR><TD valign="top">[a2]</TD> <TD valign="top">  C.A. Bush,  S.N. MacEachern,  "A semi-parametric Bayesian model for randomized block designs"  ''Biometrika'' , '''83'''  (1996)  pp. 275–285</TD></TR><TR><TD valign="top">[a3]</TD> <TD valign="top">  M. Escobar,  M. West,  "Bayesian density estimation and inference using mixtures"  ''J. Amer. Statist. Assoc.'' , '''90'''  (1995)  pp. 577–588</TD></TR><TR><TD valign="top">[a4]</TD> <TD valign="top">  T.S. Ferguson,  "A Bayesian analysis of some nonparametric problems"  ''Ann. Statist.'' , '''1'''  (1973)  pp. 209–230</TD></TR><TR><TD valign="top">[a5]</TD> <TD valign="top">  T.S. Ferguson,  "Prior distributions on spaces of probability measures"  ''Ann. Statist.'' , '''2'''  (1974)  pp. 615–629</TD></TR><TR><TD valign="top">[a6]</TD> <TD valign="top">  J. Sethuraman,  "A constructive definition of Dirichlet priors"  ''Statistica Sinica'' , '''4'''  (1994)  pp. 639–650</TD></TR></table>
 

How to Cite This Entry:
Dirichlet process. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Dirichlet_process&oldid=13886
This article was adapted from an original article by H. Doss and S.N. MacEachern (originator), which appeared in Encyclopedia of Mathematics - ISBN 1402006098. See original article