Mixture models

This article Mixture Models was adapted from an original article by Wilfried Edwin Seidel, which appeared in StatProb: The Encyclopedia Sponsored by Statistics and Probability Societies. The original article ([http://statprob.com/encyclopedia/MixtureModels.html StatProb Source], Local Files: pdf | tex) is copyrighted by the author(s), the article has been donated to Encyclopedia of Mathematics, and its further issues are under Creative Commons Attribution Share-Alike License'. All pages from StatProb are contained in the Category StatProb.

2010 Mathematics Subject Classification: Primary: 62E10 Secondary: 62F1062G0562H30 [MSN][ZBL]

Wilfried Seidel

Helmut-Schmidt-Universität, D-22039

Hamburg, Germany
Mixture Models

Introduction

Mixture distributions are convex combinations of component distributions. In statistics, these are standard tools for modelling heterogeneity in the sense that different elements of a sample may belong to different components. However, they may also be used simply as flexible instruments for achieving a good fit to data when standard distributions fail. As good software for fitting mixtures is available, these play an increasingly important role in nearly every field of statistics.

It is convenient to explain finite mixtures (i.e. finite convex combinations) as theoretical models for cluster analysis, but of course the range of applicability is not at all restricted to the clustering context. Suppose that a feature vector $X$ is observed in a heterogeneous population, which consists of $k$ homogeneous subpopulations, the components. It is assumed that for $i = 1, \dots, k$, $X$ is distributed in the i-th component according to a (discrete or continuous) density $f(x, \, \theta_{i})$ (the component density), and all component densities belong to a common parametric family $\{f(x, \theta), \; \theta \in \Theta\}$, the component model. The relative proportion of the i-th component in the whole population is $p_{i}$, $p_{1} + \dots + p_{k} = 1$. Now suppose that an item is drawn randomly from the population. Then it belongs to the i-th component with probability $p_{i}$, and the conditional probability that $X$ falls in some set $A$ is $\mbox{Pr}\,(X \in A \;| \; \theta_{i})$, calculated from the density $f(x, \, \theta_{i})$. Consequently, the marginal probability is $$\mbox{Pr}\,(X \in A \; | \; P) = p_{1} \, \mbox{Pr}\,(X \in A \; | \; \theta_{1}) + \dots + p_{k} \, \mbox{Pr}\,(X \in A \; | \; \theta_{k})$$ with density $$\label{mixture density} f(x, \, P) = p_{1} f(x, \, \theta_{1}) + \dots + p_{k} f(x, \, \theta_{k}), \tag{1}$$

a simple finite mixture with parameter $P = ((p_{1}, \dots, p_{k}), (\theta_{1}, \dots , \theta_{k}))$. The components $p_{i}$ of $P$ are called mixing weights, the $\theta_{i}$ component paramaters. For fixed $k$, let ${\cal P}_{k}$ be the set of all vectors $P$ of this type, with $\theta_{i} \in \Theta$ and nonnegative mixing weights summing up to one. Then ${\cal P}_{k}$ parameterizes all mixtures with not more than $k$ components. If all mixing weights are positive and component densities are different, then $k$ is the exact number of components. The set of all simple finite mixtures is parameterized by ${\cal P}_{\mbox{fin}}$, the union of all ${\cal P}_{k}$.

This model can be extendes in various ways. For example, all component densities may contain additional common parameters (variance parameters, say), they may depend on covariables (mixtures of regression models), and also the mixing weights may depend on covariables. Mixtures on time series models are also considered. Here I shall concentrate on simple mixtures, as all relevant concepts can be explained very easily in this setting. These need not be finite convex combinations; there is an alternative and more general definition of simple mixtures: Observe that the parameter $P$ can be considered as a discrete probability distribution on $\Theta$ which assigns probability mass $p_{i}$ to the parameter $\theta_{i}$. Then equation (1) is an integral with respect to this distribution, and if $\xi$ is an arbitrary probability distribution $\Theta$, a mixture can be defined by $$\label{general mixture} f(x, \, \xi) = \int_{\Theta} f(x, \, \theta) \, d \xi(\theta) \;. \tag{2}$$

It can be considered as the distribution of a two-stage experiment: First, choose a parameter $\theta$ according to the distribution $\xi$, then choose $x$ according to $f(x, \, \theta)$. Here, $\xi$ is called a mixing distribution, and mixture models of this type can be parameterized over every set $\Xi$ of probability distributions on $\Theta$.

In statistical applications of mixture models, a nontrivial key issue is identifiability, meaning that different parameters describe different mixtures. In a trivial sense, models parametrized over vectors $P$ are never identifiable: All vectors that correspond to the same probability distribution on $\Theta$ describe the same mixture model. For example, any permutation of the sequence of components leaves the mixing distribution unchanged, or components may be added with zero mixing weights. Therefore identifiability can only mean that parameters that correspond to different mixing distributions describe different mixture models. However, also in this sense identifiability is often violated. For example, the mixture of two uniform distributions with supports $[0,\, 0,5]$ and $[0.5,\, 1]$ and equal mixing weights is the uniform distribution with support $[0, \, 1]$. On the other hand, finite mixtures of many standard families (normal, Poisson,...) are identifiable, see for example Titterington [Titterington]. Identifiability of mixtures of regression models has been treated among others by Hennig [Hennig 1]. A standard general reference for finite mixture models is McLachlan and Peel [McLachlan].

Statistical Problems

Consider a mixture model with parameter $\eta$ (vector or probalility measure). In the simplest case, one has i.i.d. data $x_{1}, \dots, x_{n}$ from $f(x, \, \eta)$, from which one wants to gain information about $\eta$. Typical questions are estimation of (parameters of) $\eta$, or mixture diagnostics: Is there strong evidence for a mixture (in contrast to homogeneity in the sense that $\eta$ is concentrated at some single parameter $\theta$)? What is the (minimum) number of mixture components?

A variety of techniques has been developed. The data provide at least implicitly an estimate of the mixture, and equations 1 and 2 show that mixture and mixing distribution are related by a linear (integral) equation. Approximate solution techniques have been applied for obtaining estimators, and moment estimators have been developed on basis of this structure. Distance estimators exhibit nice properties. Traditionally, mixture diagnostics has been handled by graphical methods. More recent approaches for estimation and diagnostics are based on Bayesian or likelihood techniques; likelihood methods will be addressed below. Although Bayesian methods have some advantages over likelihood methods, they are not straightforward (for example, usually no natural conjugate priors are available, therefore posteriors are simulated using MCMC. Choice of noninformative priors is not obvious, as improper priors usually lead to improper posteriors. Nonidentifiability of ${\cal P}_{k}$ causes the problem of label switching). A nice reference for Bayesian methods is Frühwirth-Schnatter [Fruehwirth].

Let me close this section with a short discussion of robustness. Robustness with respect to outliers is treated by Hennig [Hennig 2]. Another problem is that mixture models are extremely nonrobust with respect to misspecification of the component model. Estimating the component model in a fully nonparametric way is of course not possible, but manageable alternatives are for example mixtures of log-concave distributions. Let me point out, however, that issues like nonrobustness and nonidentifiability only cause problems if the task is to interpret the model parameters somehow. If the aim is only to obtain a better data fit, one need not worry about them.

Likelihood Methods

In the above setting, $l(\eta) = \log (f(x_{1}, \, \eta)) + \dots + \log (f(x_{n}, \, \eta))$ is the log likelihood function. It may have some undesirable properties: First, the log likelihood is often unbounded. For example, consider mixtures of normals. If the expectation of one component is fixed at some data point and the variance goes to zero, the likelihood goes to infinity. Singularities usually occur at the boundary of the parameter space. Second, the likelihood function is usually not unimodal, although this depends on the parametrization. For example, if the parameter is a probability distribution as in equation~2 and if the parameter space $\Xi$ is a convex set (with respect to the usual linear combination of measures), the log likelihood function is concave. If it is bounded, there is a nice theory of nonparametric likelihood estimation (Lindsay [Lindsay]), and the nonparametric maximum likelihood estimator is in some sense uniquely defined and can be calculated numerically (Böhning~[Boehning], Schlattmann [Schlattmann]).

Nonparametric methods, however, work only for low dimensional component models, whereas parametric estimation techniques like the Expectation-Maximization (EM) method work for nearly any dimension. The latter is a local maximizer for mixture likelihoods in ${\cal P}_{k}$. Here the mixture likelihood is usually multimodal; moreover, it can be very flat. Analytic expressions for likelihood maxima usually do not exist, they have to be calculated numerically. On the other hand, even for unbounded likelihoods, it is known from asymptotic theory, that the simple heuristics of searching for a large local maximum in the interior of the parameter space may lead to reasonable estimators. However, one must be aware that there exist spurious large local maxima that are statistically meaningless. Moreover, except from simple cases, there is no manageable asymptotics for likelihood ratio.

Some of the problems of pure likelihood approaches can be overcome by considering penalized likelihoods. However, here one has the problem of choosing a penalization parameter. Moreover, the EM algorithm is a basic tool for a number of estimation problems, and it has a very simple structure for simple finite mixtures. Therefore it will be outlined in the next sextion.

EM Algorithm

The EM algorithm is a local maximization technique for the log likelihood in ${\cal P}_{k}$. It starts from the complete-data log-likelihood. Suppose that for observation $x_{i}$ the (fictive) component membership is known. It is defined by a vector $z_{i} \in \Re^{k}$ with $z_{ij} = 1$, if $x_{i}$ belongs to j-th component, and zero elsewhere. As a random variable $Z_{i}$, it has a multinomial distribution with parameters $k, p_{1}, \dots, p_{k}$. Then the complete data likelihood and log likelihood of $P$, respectively, are $L_{c}(P) = \prod_{i=1}^{n} \prod_{j=1}^{k} (p_{j} \, f(x_{i}, \, \theta_{j}))^{z_{ij}}$ and $l_{c}(P) = \log(L_{c}(P)) = \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \, \log p_{j} + \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \, \log f(x_{i}, \, \theta_{j})$.

The EM needs a starting value $P_{0}$, and then proceeds as an iteration between an E-step and an M-step until convergence. The first E-step consists in calculating the conditional expectation $E_{P_{0}}(l_{c}(P) \, | \, x_{1}, , \dots, x_{n})$ of $l_{c}(P)$ for arbitrary $P$, given the data, under $P_{0}$. However, as the only randomness is in the $z_{ij}$, we obtain $$E_{P_{0}}(l_{c}(P) \, | \, x_{1}, , \dots, x_{n}) = \sum_{i=1}^{n} \sum_{j=1}^{k} \tau_{j}(x_{i}|P_{0}) \log p_{j} + \sum_{i=1}^{n} \sum_{j=1}^{k} \tau_{j}(x_{i}|P_{0}) \log f(x_{i}, \, \theta_{j}) \, ,$$ where $$\tau_{j}(x_{i}|P_{0}) = \mbox{Pr}_{P_{0}}(Z_{ij}=1 \, | \, x_{i}) = \frac{p_{j} f(x_{i}, \, \theta_{j})}{f(x_{i}, \, P_{0})}$$ is the conditional probability that the i-th observation belongs to component~j, given the data, with respect to $P_{0}$.

In the following M-step, $E_{P_{0}}(l_{c}(P) \, | \, x_{1}, , \dots, x_{n})$ is maximized with respect to $P$. As it is the sum of terms depending on the mixing weights and on the parameters only, respectively, both parts can be maximized separately. It is easily shown that the maximum in the $p_{j}$ is achieved for $p_{j}^{(1)} = (1/n) \sum_{i=1}^{n}\tau_{j}(x_{i}|P_{0}), j = 1, \dots, n$. For component densities from exponential families, similar simple solutions exist for the $\theta_{j}$, therefore both the E-step and te M-step can be carried out here analytically. It can be shown that (i) the log-likelihood is not decreasing during the iteration of the EM, and (ii) tat under some regularity conditions it converges to a stationary point of the likelihood function. However, this may also be a saddle point.

It remains to define the stopping rule and the starting point(s). Both are crucial, and the reader is referred to the literature. There are also techniques that prevent from convergence to singularities or spurious maxima. A final nice issue of the EM is that it yields a simple tool for classification of data points: If $\hat{P}$ is an estimator, then $\tau_{j}(x_{i}|\hat{P})$ is the posterior probability that $x_{i}$ belongs to class $j$ with respect to the prior $\hat{P}$. The Bayesian classification rule assigns observation $i$ to the class $j$ that maximizes $\tau_{j}(x_{i}|\hat{P})$, and the $\tau_{j}(x_{i}|\hat{P})$ measure the plausibility of such a clustering.

Let me finally remark that the Bayesian analysis of a mixture model via the Gibbs sampler takes advantage of exactly the same structure by simulating both the parameters and the missing data (see, e.g., Frühwirth-Schnatter, [Fruehwirth]).

Number of Components, Testing and Asymptotics

Even if one has an estimator in each ${\cal P}_{k}$ from the EM, the question is how to assess the number of components (i.e. how to choose $k$). Usually information criteria like AIC and BIC are recommended. An alternative is to perform a sequence of tests of $k$ against $k+1$ components, for $k = 1, 2 \dots$.

There are several tests for homogeneity, i.e. for the component model, as for example goodness of fit or dispersion score tests. For testing $k_{0}$ against $k_{1}$ components, a likelihood ratio test may be performed. However, the usual $\chi^{2}$-asymptotics fails, so critical values have to be simulated. Moreover, the distribution of the test statistic usually depeds on the specific parameter under the null hypothesis. Therefore some sort of bootstrap is needed, and as estimators have to be calculated numerically, likelihood ratio tests are computationally intensive.

Let me close with some remarks on asymptotics. Whereas asymptotic normality of estimators is guaranteed under some conditions, the usual asymptotics for the likelihood ratio test fails. The reason is that under the null hypothesis, the parameter $P_{0}$ is on the boundary of the parameter space, it is not identifiable and the Fisher information matrix in $P_{0}$ is singular. There is an asymptotic theory under certain restrictive assumptions, but it is usually hard to calculate critical values from it.