
Dirichlet process



The Dirichlet process provides one means of placing a probability distribution on the space of distribution functions, as is done in Bayesian statistical analysis (cf. also Bayesian approach). The support of the Dirichlet process is large: For each distribution function there is a set of distributions nearby that receives positive probability. This contrasts with a typical probability distribution on the space of distribution functions where, for example, one might place a probability distribution on the mean and variance of a normal distribution. The support in this example would be contained in the collection of normal distributions. The large support of the Dirichlet process accounts for its use in non-parametric Bayesian analysis. General references are [a4], [a5].

The Dirichlet process is indexed by its parameter, a non-null, finite measure $ \alpha $. Formally, consider a space $ {\mathcal X} $ with a collection of Borel sets $ {\mathcal B} $ on $ {\mathcal X} $. The random probability distribution $ P $ has a Dirichlet process prior distribution with parameter $ \alpha $, denoted by $ {\mathcal D} _ \alpha $, if for every measurable partition $ \{ A _ {1}, \dots, A _ {m} \} $ of $ {\mathcal X} $ the random vector $ ( P ( A _ {1} ), \dots, P ( A _ {m} ) ) $ has the Dirichlet distribution with parameter vector $ ( \alpha ( A _ {1} ), \dots, \alpha ( A _ {m} ) ) $.

When a prior distribution is put on the space of probability measures on $ ( {\mathcal X}, {\mathcal B} ) $, then for every measurable subset $ A $ of $ {\mathcal X} $ the quantity $ P ( A ) $ is a random variable. The normalized measure $ \alpha _ {0} = {\alpha / {\alpha ( {\mathcal X} ) } } $ is a probability measure on $ {\mathcal X} $. From the definition one sees that if $ P \sim {\mathcal D} _ \alpha $, then $ {\mathsf E} P ( A ) = \alpha _ {0} ( A ) $.
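
This identity can be checked numerically from the definition. The following is a minimal sketch, under the illustrative assumptions that $ {\mathcal X} = ( 0, 1 ] $ is partitioned into $ m $ equal intervals and that $ \alpha $ is $ M $ times the uniform distribution, so that $ \alpha ( A _ {k} ) = M/m $ and $ \alpha _ {0} ( A _ {k} ) = 1/m $:

```python
import numpy as np

# Check E[P(A_k)] = alpha_0(A_k) on the partition of X = (0, 1] into
# m equal intervals, with alpha = M * Uniform(0, 1).
rng = np.random.default_rng(0)
M, m = 5.0, 4
alpha_vec = np.full(m, M / m)                 # (alpha(A_1), ..., alpha(A_m))
samples = rng.dirichlet(alpha_vec, size=100_000)
print(samples.mean(axis=0))                   # each close to alpha_0(A_k) = 1/m
```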

An alternative representation of the Dirichlet process is given in [a6]: Let $ B _ {1}, B _ {2}, \dots $ be independent and identically distributed $ { \mathop{\rm Beta} } ( 1, \alpha ( {\mathcal X} ) ) $ random variables, and let $ V _ {1}, V _ {2}, \dots $ be a sequence of independent and identically distributed random variables with distribution $ \alpha _ {0} $, independent of the $ B _ {i} $. Define $ B _ {0} = 0 $ and $ P _ {i} = B _ {i} \prod _ {j = 0 } ^ {i - 1 } ( 1 - B _ {j} ) $. The random distribution $ \sum _ {i = 1 } ^ \infty P _ {i} \delta _ {V _ {i} } $ has the distribution $ {\mathcal D} _ \alpha $. Here, $ \delta _ {a} $ represents the point mass at $ a $. This representation makes clear the fact that the Dirichlet process assigns probability one to the set of discrete distributions, and it emphasizes the role of the total mass $ \alpha ( {\mathcal X} ) $. For example, as $ \alpha ( {\mathcal X} ) \rightarrow \infty $, $ {\mathcal D} _ \alpha $ converges to the point mass at $ \alpha _ {0} $ (in the weak topology induced by $ {\mathcal B} $); and as $ \alpha ( {\mathcal X} ) \rightarrow 0 $, $ {\mathcal D} _ \alpha $ converges to the random distribution degenerate at a single point $ V $ whose location has distribution $ \alpha _ {0} $.
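
This "stick-breaking" construction translates directly into a truncated sampler. The sketch below truncates the infinite sum at a finite number of atoms; the function name stick_breaking and the use of NumPy are assumptions made for this example:

```python
import numpy as np

def stick_breaking(mass, base_sampler, n_atoms=1000, seed=None):
    """Truncated draw of P ~ D_alpha via the representation of [a6]:
    B_i ~ Beta(1, alpha(X)) i.i.d., weights P_i = B_i * prod_{j<i} (1 - B_j),
    and atom locations V_i ~ alpha_0 i.i.d."""
    rng = np.random.default_rng(seed)
    b = rng.beta(1.0, mass, size=n_atoms)                    # B_1, B_2, ...
    weights = b * np.concatenate(([1.0], np.cumprod(1.0 - b[:-1])))
    atoms = base_sampler(rng, n_atoms)                       # V_1, V_2, ...
    return atoms, weights

# Example: alpha_0 = standard normal, total mass alpha(X) = 5.
atoms, weights = stick_breaking(5.0, lambda rng, k: rng.standard_normal(k))
print(weights.sum())   # close to 1 when the truncation level is high enough
```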

The Dirichlet process is conjugate: if $ P \sim {\mathcal D} _ \alpha $ and data points $ X _ {1}, \dots, X _ {n} $ drawn independently from $ P $ are observed, then the conditional distribution of $ P $ given $ X _ {1}, \dots, X _ {n} $ is $ {\mathcal D} _ {\alpha + \sum _ {i = 1 } ^ {n} \delta _ {X _ {i} } } $. This conjugacy property extends the conjugacy of the Dirichlet distribution for multinomial data, and it ensures that many problems admit analytical results of simple form. The combination of simplicity and usefulness has given the Dirichlet process its reputation as the standard non-parametric model for a probability distribution on the space of distribution functions.
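
One consequence of conjugacy is a simple predictive rule (the Pólya urn): given $ X _ {1}, \dots, X _ {n} $, a new observation is distributed as $ ( \alpha + \sum _ {i} \delta _ {X _ {i} } ) / ( \alpha ( {\mathcal X} ) + n ) $. A minimal sketch, with dp_predict a hypothetical helper name:

```python
import numpy as np

def dp_predict(data, mass, base_sampler, seed=None):
    """One draw from the predictive distribution under P ~ D_alpha.
    By conjugacy the posterior is D_{alpha + sum_i delta_{X_i}}, so the
    next point is a fresh draw from alpha_0 with probability
    mass / (mass + n), and a uniformly chosen past observation otherwise."""
    rng = np.random.default_rng(seed)
    n = len(data)
    if rng.random() < mass / (mass + n):
        return base_sampler(rng)          # new value from alpha_0
    return data[rng.integers(n)]          # tie with an existing observation
```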

An important extension of the class of Dirichlet processes is the class of mixtures of Dirichlet processes. A mixture of Dirichlet processes is a Dirichlet process in which the parameter measure is itself random; in applications, the parameter measure ranges over a finite-dimensional parametric family. Formally, one considers a parametric family of probability distributions $ \{ {\alpha _ {\theta,0 } } : {\theta \in \Theta } \} $. Suppose that for every $ \theta \in \Theta $, $ \alpha _ \theta ( {\mathcal X} ) $ is a positive constant, and let $ \alpha _ \theta = \alpha _ \theta ( {\mathcal X} ) \cdot \alpha _ {\theta,0 } $. If $ \nu $ is a probability distribution on $ \Theta $, and if first $ \theta $ is chosen from $ \nu $ and then $ P $ is chosen from $ {\mathcal D} _ {\alpha _ \theta } $, one says that the prior on $ P $ is a mixture of Dirichlet processes (with parameter $ ( \{ \alpha _ \theta \} _ {\theta \in \Theta } , \nu ) $). A reference for this is [a1]. Often, $ \alpha _ \theta ( {\mathcal X} ) \equiv M $, i.e., the constants $ \alpha _ \theta ( {\mathcal X} ) $ do not depend on $ \theta $. In this case, large values of $ M $ indicate that the prior on $ P $ is concentrated around the parametric family $ \{ {\alpha _ {\theta,0 } } : {\theta \in \Theta } \} $. More precisely, as $ M \rightarrow \infty $, the distribution of $ P $ converges to that of the random measure $ \alpha _ {\theta,0 } $ with $ \theta \sim \nu $, i.e., to the standard Bayesian model for the parametric family $ \{ {\alpha _ {\theta,0 } } : {\theta \in \Theta } \} $ in which $ \theta $ has prior $ \nu $.
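
Simulating from a mixture of Dirichlet processes is a two-step procedure: draw $ \theta $ from $ \nu $, then draw $ P $ from $ {\mathcal D} _ {\alpha _ \theta } $. A small sketch reusing the stick_breaking function above, under the illustrative assumptions that $ \theta = \mu $, $ \nu $ is standard normal, $ \alpha _ {\theta,0 } = N ( \mu, 1 ) $ and $ \alpha _ \theta ( {\mathcal X} ) \equiv M $:

```python
import numpy as np

rng = np.random.default_rng(1)
M = 50.0                           # alpha_theta(X) = M for every theta
mu = rng.standard_normal()         # step 1: theta ~ nu
# Step 2: P ~ D_{alpha_theta}, with base measure alpha_{theta,0} = N(mu, 1).
atoms, weights = stick_breaking(M, lambda r, k: mu + r.standard_normal(k))
# For large M the atoms and weights of P resemble a sample from N(mu, 1).
```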

The Dirichlet process has been used in many applications. A particularly interesting one is the Bayesian hierarchical model, which is the Bayesian version of the random effects model. A typical example is as follows. Suppose one is studying the success of a certain type of operation for patients from different hospitals, with $ n _ {i} $ patients in hospital $ i $, $ i = 1, \dots, I $. One might model the number of failures $ X _ {i} $ in hospital $ i $ as binomially distributed, with success probability depending on the hospital, and one might wish to view the $ I $ binomial parameters as independent and identically distributed draws from a common distribution. The typical hierarchical model is then written as

$$ \tag{a1 } \textrm{ given } \theta _ {i} , X _ {i} \sim { \mathop{\rm Bin} } ( n _ {i} , \theta _ {i} ) , $$

$$ \theta _ {i} \sim { \mathop{\rm Beta} } ( a, b ) \textrm{ iid } , $$

$$ ( a, b ) \sim G ( \cdot, \cdot ) . $$

Here, the $ \theta _ {i} $ are unobserved, or latent, variables. If the distribution $ G $ were degenerate, the $ \theta _ {i} $ would be independent, so that data from one hospital would give no information about the success rate of any other hospital. On the other hand, when $ G $ is not degenerate, data coming from the other hospitals provide some information about the success rate of hospital $ i $.
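
A minimal prior-predictive simulation of (a1) makes the hierarchy concrete; here $ G $ is taken degenerate at an illustrative $ ( a, b ) = ( 2, 8 ) $, and all numerical values are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(2)
I = 10
n = rng.integers(20, 60, size=I)      # n_i patients in hospital i
theta = rng.beta(2.0, 8.0, size=I)    # theta_i ~ Beta(a, b), i.i.d.
x = rng.binomial(n, theta)            # X_i ~ Bin(n_i, theta_i)
print(np.column_stack((n, x)))        # counts per hospital
```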

Consider now the problem of predicting the number of failures $ X _ {I + 1 } $ for a new hospital, indexed $ I + 1 $. A disadvantage of the model (a1) is that if the $ \theta _ {i} $ are in fact independent and identically distributed draws from a distribution that is not a Beta distribution, then even as $ I \rightarrow \infty $, the predictive distribution of $ X _ {I + 1 } $ based on the (incorrect) model (a1) need not converge to the actual predictive distribution of $ X _ {I + 1 } $. An alternative model, using a mixture of Dirichlet processes prior, would be written as

$$ \tag{a2 } \textrm{ given } \theta _ {i} , X _ {i} \sim { \mathop{\rm Bin} } ( n _ {i} , \theta _ {i} ) , $$

$$ \theta _ {i} \sim P \textrm{ iid } , $$

$$ P \sim {\mathcal D} _ {M \cdot { \mathop{\rm Beta} } ( a,b ) } , $$

$$ ( a, b ) \sim G ( \cdot, \cdot ) . $$

The model (a2) does not have the defect suffered by (a1), because the support of the distribution on $ P $ is the set of all distributions concentrated on the interval $ [0,1] $.

It is not possible to obtain closed-form expressions for the posterior distributions in (a2). Computational schemes to obtain these have been developed by M. Escobar and M. West [a3] and C.A. Bush and S.N. MacEachern [a2].
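
The following sketch illustrates the flavor of such schemes for (a2) in the simplified case where $ G $ is degenerate at a fixed $ ( a, b ) $, so that $ P $ can be integrated out and the $ \theta _ {i} $ updated one at a time by a Pólya-urn step. It is a minimal illustration in the spirit of [a3], not the algorithm of either paper, and the function name and use of SciPy are assumptions:

```python
import numpy as np
from scipy.stats import binom, betabinom

def gibbs_theta(x, n, M, a, b, n_iter=500, seed=0):
    """Collapsed Gibbs sampler for theta_1, ..., theta_I in model (a2),
    holding (a, b) fixed. With P integrated out, theta_i given the rest
    either joins an existing theta_j (weight Bin(x_i; n_i, theta_j)) or
    takes a fresh value from Beta(a + x_i, b + n_i - x_i) (weight M times
    the beta-binomial marginal likelihood of x_i)."""
    rng = np.random.default_rng(seed)
    I = len(x)
    theta = rng.beta(a, b, size=I)            # initial values from Beta(a, b)
    draws = np.empty((n_iter, I))
    for t in range(n_iter):
        for i in range(I):
            w = binom.pmf(x[i], n[i], theta)  # weights for joining theta_j
            w[i] = M * betabinom.pmf(x[i], n[i], a, b)  # weight for a new value
            j = rng.choice(I, p=w / w.sum())
            if j == i:
                theta[i] = rng.beta(a + x[i], b + n[i] - x[i])
            else:
                theta[i] = theta[j]
        draws[t] = theta
    return draws
```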

The parameter $ M $ plays an interesting role. When $ M $ is small, then, with high probability, the $ \theta _ {i} $ are all equal, so that, in effect, one is working with the model in which the $ X _ {i} $ are independent binomial samples with the same success probability. On the other hand, when $ M $ is large, the model (a2) is very close to (a1).
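
This behavior is easy to see by simulating the $ \theta _ {i} $ directly from their prior, with $ P $ integrated out, via the Pólya-urn scheme. A small sketch with illustrative parameter values:

```python
import numpy as np

def draw_thetas(I, M, a, b, seed=None):
    """Draw theta_1, ..., theta_I from P ~ D_{M * Beta(a, b)} with P
    integrated out: the i-th value (0-indexed) is new from Beta(a, b)
    with probability M / (M + i), and a copy of an earlier one otherwise."""
    rng = np.random.default_rng(seed)
    theta = np.empty(I)
    for i in range(I):
        if rng.random() < M / (M + i):
            theta[i] = rng.beta(a, b)          # fresh value from the base measure
        else:
            theta[i] = theta[rng.integers(i)]  # tie with an earlier theta
    return theta

for M in (0.1, 100.0):
    theta = draw_thetas(50, M, 2.0, 8.0, seed=3)
    print(M, np.unique(theta).size)   # few distinct values when M is small
```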

It is interesting to note that when $ M $ is large and the distribution $ G $ is degenerate, the measure on $ P $ is essentially degenerate, so that one is treating the data from the hospitals as independent. Thus, when $ G $ is degenerate, the parameter $ M $ determines the extent to which data from the other hospitals are used when making an inference about hospital $ i $, and in that sense it plays the role of a tuning parameter in the bias-variance tradeoff of frequentist analysis.

References

[a1] C. Antoniak, "Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems" Ann. Statist. , 2 (1974) pp. 1152–1174
[a2] C.A. Bush, S.N. MacEachern, "A semi-parametric Bayesian model for randomized block designs" Biometrika , 83 (1996) pp. 275–285
[a3] M. Escobar, M. West, "Bayesian density estimation and inference using mixtures" J. Amer. Statist. Assoc. , 90 (1995) pp. 577–588
[a4] T.S. Ferguson, "A Bayesian analysis of some nonparametric problems" Ann. Statist. , 1 (1973) pp. 209–230
[a5] T.S. Ferguson, "Prior distributions on spaces of probability measures" Ann. Statist. , 2 (1974) pp. 615–629
[a6] J. Sethuraman, "A constructive definition of Dirichlet priors" Statistica Sinica , 4 (1994) pp. 639–650
How to Cite This Entry:
Dirichlet process. Encyclopedia of Mathematics. URL: http://encyclopediaofmath.org/index.php?title=Dirichlet_process&oldid=46722
This article was adapted from an original article by H. Doss and S.N. MacEachern (originators), which appeared in Encyclopedia of Mathematics - ISBN 1402006098.