# ANOVA


analysis of variance

Here, ANOVA will be understood in the wide sense, i.e., equated to the univariate linear model whose model equation is

$$ y = X\theta + e, \tag{a1} $$

in which $y$ is an $n \times 1$ observable random vector, $X$ is a known $n \times m$-matrix (the "design matrix"), $\theta$ is an $m \times 1$-vector of unknown parameters, and $e$ is an $n \times 1$-vector of unobservable random variables $e_1,\dots,e_n$ (the "errors") that are assumed to be independent and to have a normal distribution with mean $0$ and unknown variance $\sigma^2$ (i.e., the $e_i$ are independent identically distributed $N(0,\sigma^2)$). It is assumed throughout that $n > m$. Inference is desired on $\theta$ and $\sigma^2$. The $e_i$ may represent measurement error and/or inherent variability in the experiment. The model equation (a1) can also be expressed in words by: $y$ has independent normal elements with common, unknown variance $\sigma^2$ and expectation $\mathsf{E}y = X\theta$, in which $X$ is known and $\theta$ is unknown. In most experimental situations the assumptions made on $e$ should be regarded as an approximation, though often a good one. Studies on some of the effects of deviations from these assumptions can be found in [a48], Chap. 10, and [a51] discusses diagnostics and remedies for lack of fit in linear regression models. To a certain extent the ANOVA ideas have been carried over to discrete data, where the model is called the log-linear model; see [a6] and [a10].

MANOVA (multivariate analysis of variance) is the multivariate generalization of ANOVA. Its model equation is obtained from (a1) by replacing the column vectors by matrices to obtain

$$ Y = X\Theta + E, \tag{a2} $$

where $Y$ and $E$ are $n \times p$, $\Theta$ is $m \times p$, and $X$ is as in (a1). The assumption on $E$ is that its rows are independent identically distributed $N(0,\Sigma)$, i.e., the common distribution of the independent rows is $p$-variate normal with mean $0$ and non-singular covariance matrix $\Sigma$.

GMANOVA (generalized multivariate analysis of variance) generalizes the model equation (a2) of MANOVA to

$$ Y = X_1 \Theta X_2 + E, \tag{a3} $$

in which $E$ is as in (a2), $X_1$ is as $X$ in (a2), $\Theta$ is $m \times s$, and $X_2$ is an $s \times p$ second design matrix.

Logically, it would seem that it suffices to deal only with (a3), since (a2) is a special case of (a3), and (a1) of (a2). This turns out to be impossible and it is necessary to treat the three topics in their own right. This will be done, below. For unexplained terms in the fields of estimation and testing hypotheses, see [a30], [a31] (and also Statistical hypotheses, verification of; Statistical estimation).

## ANOVA.

This field is very large, well-developed, and well-documented. Only a brief outline is given here; see the references for more detail. An excellent introduction to the essential elements of the field is [a48] and a short history is given in [a47], Sect. 2. Brief descriptions are also given in [a56], headings Anova; General Linear Model. Other references are [a49], [a50], [a43], [a26], and [a15]. A collection of survey articles on many aspects of ANOVA (and of MANOVA and GMANOVA) can be found in [a14].

In (a1) it is assumed that the parameter vector $\theta$ is fixed (even though unknown). This is called a fixed effects model, or Model I. In some experimental situations it is more appropriate to consider $\theta$ random, and inference is then about parameters in the distribution of $\theta$. This is called a random effects model, or Model II. It is called a mixed model if some elements of $\theta$ are fixed, others random. There are also various randomization models that are not described by (a1). For reasons of space limitation, only the fixed effects model will be treated here. For the other models see [a48], Chaps. 7, 8, 9.

The name "analysis of variance" was coined by R.A. Fisher, who developed statistical techniques for dealing with agricultural experiments; see [a48], Sect. 1.1: references to Fisher. As a typical example, consider the two-way layout for the simultaneous study of two different factors, for convenience denoted by $A$ and $B$, on the measurement of a certain quantity. Let $A$ have levels $i = 1,\dots,I$, and let $B$ have levels $j = 1,\dots,J$. For each $(i,j)$ combination, measurements $y_{ijk}$, $k = 1,\dots,K$, are made. For instance, in a study of the effects of different varieties and different fertilizers on the yield of tomatoes, let $y_{ijk}$ be the weight of ripe tomatoes from plant $k$ of variety $i$ using fertilizer $j$. The model equation is

$$ y_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + e_{ijk}, \tag{a4} $$

and it is assumed that the $e_{ijk}$ are independent identically distributed $N(0,\sigma^2)$. This is of the form (a1) after the $y_{ijk}$ and $e_{ijk}$ are strung out to form the column vectors $y$ and $e$ of (a1) with $n = IJK$; similarly, the parameters on the right-hand side of (a4) form an $m \times 1$-vector $\theta$, with $m = 1 + I + J + IJ$; finally, $X$ in (a1) has one column for each of the $m$ parameters, and in row $(i,j,k)$ of $X$ there is a $1$ in the columns for $\mu$, $\alpha_i$, $\beta_j$, and $\gamma_{ij}$, and $0$s elsewhere. Some of the customary terminology is as follows. Each $(i,j)$ combination is a cell. In the example (a4), each cell has the same number $K$ of observations (balanced design); in general, the cell numbers need not be equal. The parameters on the right-hand side of (a4) are called the effects: $\mu$ is the general mean, the $\alpha$s are the main effects for factor $A$, the $\beta$s for $B$, and the $\gamma$s are the interactions.
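The stringing-out of the two-way layout into a design matrix can be sketched as follows. This is a minimal illustration, assuming NumPy; the function name `two_way_design`, the column ordering ($\mu$, then the $\alpha$s, the $\beta$s, the $\gamma$s) and the row ordering are choices made here for the example, not part of the source.

```python
import numpy as np

def two_way_design(I, J, K):
    """Design matrix of a balanced two-way layout with interactions.

    Columns (hypothetical ordering): mu, alpha_1..alpha_I,
    beta_1..beta_J, gamma_11..gamma_IJ.
    Rows are ordered by (i, j, k) with k varying fastest.
    """
    m = 1 + I + J + I * J
    X = np.zeros((I * J * K, m))
    row = 0
    for i in range(I):
        for j in range(J):
            for _ in range(K):
                X[row, 0] = 1.0                      # general mean
                X[row, 1 + i] = 1.0                  # main effect of A
                X[row, 1 + I + j] = 1.0              # main effect of B
                X[row, 1 + I + J + i * J + j] = 1.0  # interaction
                row += 1
    return X

X = two_way_design(I=3, J=2, K=4)
n, m = X.shape
rank = np.linalg.matrix_rank(X)
print(n, m, rank)
```

Here $n = 24$ and $m = 12$, but the rank is only $IJ = 6$ (the number of cells), which illustrates that such a design matrix is typically not of full rank.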

The extension to more than two factors is immediate. There are then potentially more types of interactions; e.g., in a three-way layout there are three types of two-factor interactions and one type of three-factor interactions. Layouts of this type are called factorial, and completely crossed if there is at least one observation in each cell. The latter may not always be feasible for practical reasons if the number of cells is large. In that case it may be necessary to restrict observations to only a fraction of the cells and assume certain interactions to be $0$. The judicious choice of this fraction is the subject of design of experiments; see [a26], [a15].

A different type of experiment involves regression. In the simplest case the measurement $y$ of a certain quantity may be modelled as $y = \alpha + \beta t + \text{error}$, where $\alpha$ and $\beta$ are unknown real-valued parameters and $t$ is the value of some continuously measurable quantity such as time, temperature, distance, etc. This is called linear regression (i.e., linear in $t$). More generally, there could be an arbitrary polynomial in $t$ on the right-hand side. As an example, assume quadratic regression and suppose $t$ denotes time. Let $y_i$ be the measurement on $y$ at time $t_i$, $i = 1,\dots,n$. The model equation is $y_i = \alpha + \beta t_i + \gamma t_i^2 + e_i$, which is of the form (a1) with $(\alpha,\beta,\gamma)'$ playing the role of $\theta$ of (a1). The matrix $X$ of (a1) has three columns corresponding to $\alpha$, $\beta$, and $\gamma$; the $i$th row of $X$ is $(1, t_i, t_i^2)$. Functions of $t$ other than polynomials are sometimes appropriate. Frequently, $t$ is referred to as a regressor variable or independent variable, and $y$ as the dependent variable. Instead of one regressor variable there may be several (multiple regression).
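As a small illustration of quadratic regression written in the form (a1), the sketch below builds the three-column design matrix with rows $(1, t_i, t_i^2)$ and recovers the coefficients by least squares. It assumes NumPy; the time points and the noise-free responses are made-up values chosen so that the result can be checked by hand.

```python
import numpy as np

# Hypothetical time points and noise-free responses y = 1 + 2 t - 0.5 t^2.
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * t - 0.5 * t**2

# Design matrix: one column per parameter, i-th row is (1, t_i, t_i^2).
X = np.column_stack([np.ones_like(t), t, t**2])

# Least-squares fit; X has full column rank here, so the solution is unique.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)
```

With real data the responses would contain errors and the fitted coefficients would only approximate the true ones.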

Factors such as $t$ above whose values can be measured on a continuous scale are called quantitative. In contrast, categorical variables (e.g., variety of tomato) are called qualitative. A quantitative factor $t$ may be treated qualitatively if the experiment is conducted at several values, say $t_1, t_2, \dots$, but these are only regarded as levels of the factor whereas the actual values are ignored. The name analysis of variance is often reserved for models that have only factors that are qualitative or treated qualitatively. In contrast, regression analysis has only quantitative factors. Analysis of covariance covers models that have both kinds of factors. See [a48], Chap. 6, for more detail.

Another important distinction involving factors is between the notions of crossing and nesting. Two factors $A$ and $B$ are crossed if each level of $A$ can occur with each level of $B$ (completely crossed if there is at least one observation for each combination of levels, otherwise incompletely or partly crossed). For instance, in the tomato example of the two-way layout (a4), the two factors are crossed since each variety $i$ can be grown with any fertilizer $j$. In contrast, factor $B$ is said to be nested within factor $A$ if every level of $B$ can only occur with one level of $A$. For instance, suppose two different manufacturing processes (factor $A$) for the production of cords have to be compared. From each of the two processes several cords are chosen (factor $B$), each cord is cut into several pieces and the breaking strength of each piece is measured. Here each cord goes only with one of the processes so that $B$ is nested within $A$. Nested factors should be treated more realistically as random. However, for the analysis it is necessary to analyze the corresponding fixed effects model first. See [a48], Sect. 5.3, for more examples and detail.

### Estimation and testing hypotheses.

The main interest is in inference on linear functions of the parameter vector $\theta$ of (a1), called parametric functions, i.e., functions of the form $\psi = c'\theta$, with $c$ of order $m \times 1$. Usually one requires point estimators (cf. also Point estimator) of such $\psi$s to be unbiased (cf. also Unbiased estimator). Of particular interest are the elements of the vector $\theta$. However, there is a complication arising from the fact that the design matrix $X$ in (a1) may be of less than maximal rank (the columns can be linearly dependent). This happens typically in analysis of variance models (but not usually in regression models). For instance, in the two-way layout (a4) the sum of the columns for the $\alpha_i$ equals the column for $\mu$. If $X$ is of less than full rank, then the elements of $\theta$ are not identifiable in the sense that even if the error vector $e$ in (a1) were $0$, so that $X\theta$ is known, there is no unique solution for $\theta$. A fortiori the elements of $\theta$ do not possess unbiased estimators. Yet, there are parametric functions that do have an unbiased estimator; they are called estimable. It is easily shown that $\psi = c'\theta$ is estimable if and only if $c'$ is in the row space of $X$ (see [a48], Sect. 1.4). In particular, if one sets $\eta = \mathsf{E}y = X\theta$ and takes $c'$ to be the $i$th row of $X$, then $\eta_i$ is estimable. Thus, $\psi$ is estimable if and only if it is a linear combination of the elements of $\eta$.
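The row-space criterion for estimability can be checked numerically. The sketch below assumes NumPy and uses a small one-way layout (two groups, two observations each, columns for a general mean and two group effects) as a made-up example; the helper name `is_estimable` is hypothetical. It tests whether the coefficient vector is left unchanged by the orthogonal projection onto the row space of the design matrix.

```python
import numpy as np

def is_estimable(X, c, tol=1e-10):
    """Return True if c lies in the row space of X."""
    c = np.asarray(c, dtype=float)
    # pinv(X) @ X is the orthogonal projector onto the row space of X.
    row_projector = np.linalg.pinv(X) @ X
    return bool(np.allclose(row_projector @ c, c, atol=tol))

# One-way layout: columns are (general mean, effect of group 1, effect of group 2).
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0]])

difference = is_estimable(X, [0.0, 1.0, -1.0])   # difference of group effects
single = is_estimable(X, [0.0, 1.0, 0.0])        # a single group effect
cell_mean = is_estimable(X, [1.0, 1.0, 0.0])     # mean of group 1
print(difference, single, cell_mean)
```

In this example the difference of the two group effects and the group mean are estimable, whereas a single group effect is not.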

The complication presented by a design matrix that is not of full rank may be handled in several ways. First, a re-parametrization with fewer parameters and fewer columns of $X$ is possible. Second, a popular way is to impose side conditions on the parameters that make them unique. For instance, in the two-way layout (a4) often-used side conditions are: $\sum_i \alpha_i = 0$, or, equivalently, $\alpha_\cdot = 0$ (where dotting on a subscript means averaging over that subscript); similarly, $\beta_\cdot = 0$, and $\gamma_{i\cdot} = 0$ for all $i$, $\gamma_{\cdot j} = 0$ for all $j$. Then all parameters are estimable and (for instance) the hypothesis $H_A$ that all main effects of factor $A$ are $0$ can be expressed by: all $\alpha_i$ are equal to zero. A third way of dealing with an $X$ of less than full rank is to express all questions of inference in terms of estimable parametric functions. For instance, if in (a4) one writes $\eta_{ij} = \mu + \alpha_i + \beta_j + \gamma_{ij}$ ($= \mathsf{E}y_{ijk}$), then all $\eta_{ij}$ are estimable and $H_A$ can be expressed by stating that all $\eta_{i\cdot}$ are equal, or, equivalently, that all $\eta_{i\cdot} - \eta_{\cdot\cdot}$ are equal to zero.

Another type of estimator that always exists is a least-squares estimator (LSE; cf. also Least squares, method of). A least-squares estimator of $\theta$ is any vector $b$ minimizing $\|y - Xb\|^2$. A minimizing $b$ (unique if and only if $X$ is of full rank) is denoted by $\hat\theta$ and satisfies the normal equations

$$ X'X\hat\theta = X'y. \tag{a5} $$

If $\psi = c'\theta$ is estimable, then $\hat\psi = c'\hat\theta$ is unique (even when $\hat\theta$ is not) and is called the least-squares estimator of $\psi$. By the Gauss–Markov theorem (cf. also Least squares, method of), $\hat\psi$ is the minimum variance unbiased estimator of $\psi$. See [a48], Sect. 1.4.
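The following sketch illustrates, under the assumption of a rank-deficient one-way layout with made-up data, that two different solutions of the normal equations give the same value for an estimable parametric function. It assumes NumPy; the particular null-space vector and the multiplier 2.5 are arbitrary choices for the illustration.

```python
import numpy as np

# One-way layout with columns (general mean, group 1 effect, group 2 effect).
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0]])
y = np.array([1.0, 3.0, 5.0, 9.0])   # group means are 2 and 7

# One least-squares solution: the minimum-norm solution via the pseudo-inverse.
theta1 = np.linalg.pinv(X) @ y

# X @ null_vec = 0, so adding any multiple gives another least-squares solution.
null_vec = np.array([1.0, -1.0, -1.0])
theta2 = theta1 + 2.5 * null_vec

# Estimable function: difference of the two group effects.
c = np.array([0.0, 1.0, -1.0])
psi1, psi2 = c @ theta1, c @ theta2
print(theta1, theta2, psi1, psi2)
```

Both solutions satisfy the normal equations and both give the estimate $2 - 7 = -5$ for the difference of the group effects.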

A linear hypothesis consists of one or more linear restrictions on $\theta$:

$$ H\colon C\theta = 0, \tag{a6} $$

with $C$ of order $q \times m$ and rank $q$. Then $H$ is to be tested against the alternative $C\theta \neq 0$. Let $r = \operatorname{rank} X$. The model (a1) together with $H$ of (a6) can be expressed in geometric language as follows: The mean vector $\eta = \mathsf{E}y$ lies in a linear subspace $\Omega$ of $n$-dimensional space, spanned by the columns of $X$, and $H$ restricts $\eta$ to a further subspace $\omega$ of $\Omega$, where $\dim \Omega = r$ and $\dim \omega = r - q$. Further analysis is simplified by a transformation to the canonical system, below.

### Canonical form.

There is a transformation $z = \Gamma y$, with $\Gamma$ of order $n \times n$ and orthogonal, so that the model (a1) together with the hypothesis (a6) can be put in the following form (in which $z_1,\dots,z_n$ are the elements of $z$ and $\zeta_i = \mathsf{E}z_i$): $z_1,\dots,z_n$ are independent, normal, with common variance $\sigma^2$; $\zeta_{r+1} = \dots = \zeta_n = 0$, and, additionally, $H$ specifies $\zeta_1 = \dots = \zeta_q = 0$. Note that $\zeta_{q+1},\dots,\zeta_r$ are unrestricted throughout. Any estimable parametric function can be expressed in the form $\psi = \sum_{i=1}^r d_i\zeta_i$, with constants $d_i$, and the least-squares estimator of $\psi$ is $\hat\psi = \sum_{i=1}^r d_i z_i$. To estimate $\sigma^2$ one forms the sum of squares for error $SS_e = \sum_{i=r+1}^n z_i^2$, and divides by $n - r$ (the degrees of freedom for the error) to form the mean square $MS_e = SS_e/(n-r)$. Then $MS_e$ is an unbiased estimator of $\sigma^2$. A test of the hypothesis $H$ can be obtained by forming $SS_H = \sum_{i=1}^q z_i^2$, with degrees of freedom $q$, and $MS_H = SS_H/q$. Then, if $H$ is true, the test statistic $F = MS_H/MS_e$ has an $F$-distribution with degrees of freedom $(q, n-r)$. For a test of $H$ of level of significance $\alpha$ one rejects $H$ if $F > F_{\alpha;q,n-r}$ (the upper $\alpha$-point of the $F$-distribution with degrees of freedom $(q, n-r)$). This is "the" $F$-test; it can be derived as a likelihood-ratio test (LR test) or as a uniformly most powerful invariant test (UMP invariant test) and has several other optimum properties; see [a48], Sect. 2.10. For the power of the $F$-test, see [a48], Sect. 2.8.

### Simultaneous confidence intervals.

Let $L$ be the linear space of all parametric functions of the form $\psi = \sum_{i=1}^q d_i\zeta_i$, i.e., all $\psi$ that are $0$ if $H$ is true. The $F$-test provides a way to obtain simultaneous confidence intervals for all $\psi \in L$ with confidence level $1 - \alpha$ (cf. also Confidence interval). This is useful, for instance, in cases where $H$ is rejected. Then any $\psi \in L$ whose confidence interval does not include $0$ is said to be "significantly different from 0" and can be held responsible for the rejection of $H$. Observe that $\sum_{i=1}^q (z_i - \zeta_i)^2/(q\,MS_e)$ has an $F$-distribution with degrees of freedom $(q, n-r)$ (whether or not $H$ is true) so that this quantity is $\le F_{\alpha;q,n-r}$ with probability $1 - \alpha$. This inequality can be converted into a family of double inequalities and leads to the simultaneous confidence intervals

$$ \hat\psi - S\hat\sigma_{\hat\psi} \le \psi \le \hat\psi + S\hat\sigma_{\hat\psi} \quad \text{for all } \psi \in L, \tag{a7} $$

in which $S = (q F_{\alpha;q,n-r})^{1/2}$ and $\hat\sigma_{\hat\psi}$ is the square root of the unbiased estimator of the variance of $\hat\psi$. Thus, the confidence interval for $\psi$ has endpoints $\hat\psi \pm S\hat\sigma_{\hat\psi}$, and all $\psi \in L$ are covered by their confidence intervals simultaneously with probability $1 - \alpha$. Note that (a7) is stated without needing the canonical system so that the confidence intervals can be evaluated directly in the original system.

With the help of (a7) the $F$-test can also be expressed as follows: $H$ is accepted if and only if all confidence intervals with endpoints $\hat\psi \pm S\hat\sigma_{\hat\psi}$ given by (a7) cover the value $0$. More generally, it is convenient to make the following definition: a test of a hypothesis $H$ is exact with respect to a family of simultaneous confidence intervals for a family of parametric functions if $H$ is accepted if and only if the confidence interval of every $\psi$ in the family includes the value of $\psi$ specified by $H$; see [a52], [a53]. Thus, the $F$-test is exact with respect to the simultaneous confidence intervals (a7).

The confidence intervals obtained in (a7) are called Scheffé-type simultaneous confidence intervals. Shorter confidence intervals of Tukey-type within a smaller class of parametric functions are possible in some designs. This is applicable, for instance, in the two-way layout of (a4) with equal cell numbers if only differences between the $\alpha_i$ are considered important rather than all parametric functions that are $0$ under the hypothesis $H_A$ that all main effects of factor $A$ vanish (so-called contrasts). See [a48], Sect. 3.6.
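A Scheffé-type interval for a single contrast can be computed directly in the original system, as in the hedged sketch below. It assumes NumPy, a one-way layout with three groups in cell-means coding, and made-up data; the function name `scheffe_interval` is hypothetical, and the critical value 9.55 is the tabulated upper 5% point of the $F$-distribution with $(2, 3)$ degrees of freedom, supplied by hand rather than computed.

```python
import math
import numpy as np

def scheffe_interval(y, X, c, q, f_crit):
    """Scheffe-type interval for psi = c' theta.

    q is the number of restrictions in the hypothesis and f_crit the upper
    alpha-point of F with (q, n - r) degrees of freedom, supplied by the caller.
    """
    y, X, c = (np.asarray(a, dtype=float) for a in (y, X, c))
    n = y.size
    r = np.linalg.matrix_rank(X)
    xtx_pinv = np.linalg.pinv(X.T @ X)
    theta_hat = xtx_pinv @ X.T @ y               # a least-squares solution
    resid = y - X @ theta_hat
    ms_e = resid @ resid / (n - r)               # unbiased estimator of sigma^2
    psi_hat = c @ theta_hat
    se = math.sqrt(ms_e * (c @ xtx_pinv @ c))    # estimated std. dev. of psi_hat
    S = math.sqrt(q * f_crit)
    return psi_hat - S * se, psi_hat + S * se

# Three groups, two observations each; group means are 2, 7 and 5.
X = np.kron(np.eye(3), np.ones((2, 1)))
y = [1.0, 3.0, 5.0, 9.0, 4.0, 6.0]
# Contrast between groups 1 and 2; hypothesis "all means equal" has q = 2.
lo, hi = scheffe_interval(y, X, c=[1.0, -1.0, 0.0], q=2, f_crit=9.55)
print(lo, hi)
```

In this example the point estimate is $-5$, the mean square for error is $4$, and the estimated standard deviation of the contrast is $2$, so the interval is $-5 \pm 2\sqrt{2 \cdot 9.55}$.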

The canonical system is very useful to derive formulas and prove properties in a unified way, but it is usually not advisable in any given linear model to carry out the transformation explicitly. Instead, the necessary expressions can be derived in the original system. For instance, if $\hat y_\Omega$ and $\hat y_\omega$ are the orthogonal projections of $y$ on $\Omega$ and on $\omega$, respectively, then $SS_e = \|y - \hat y_\Omega\|^2$ and $SS_H = \|\hat y_\Omega - \hat y_\omega\|^2$. These projections can be found by solving the normal equations (a5) (and one gets, for instance, $\hat y_\Omega = X\hat\theta$), or by minimizing quadratic forms. As an example of the latter: In the two-way layout (a4), minimize $\sum_{i,j,k}(y_{ijk} - \eta_{ij})^2$ over the $\eta_{ij}$. This yields $\hat\eta_{ij} = y_{ij\cdot}$, so that $SS_e = \sum_{i,j,k}(y_{ijk} - y_{ij\cdot})^2$. If desired, formulas can be expressed in vector and matrix form. As an example, if $X$ is of maximal rank, then (a5) yields $\hat\theta = (X'X)^{-1}X'y$ and $\hat y_\Omega = X(X'X)^{-1}X'y$. Similar expressions hold under $H$ after replacing $X$ by a matrix whose columns span $\omega$. If $X$ is not of maximal rank, then a generalized inverse may be employed. See [a43], Sect. 4a.3, and [a45].
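The projection formulas can be turned into a short computation of the $F$ statistic without passing to the canonical system. The sketch below assumes NumPy; the function name `f_statistic` is hypothetical, the hypothesis is specified by a matrix `X0` whose columns span the restricted subspace, and the pseudo-inverse is used as one possible generalized inverse so that rank-deficient design matrices are handled as well. The data are made up.

```python
import numpy as np

def f_statistic(y, X, X0):
    """F statistic for testing that E y lies in span(X0), a subspace of span(X).

    Returns (F, q, n - r): the statistic and its degrees of freedom.
    """
    y = np.asarray(y, dtype=float)

    def projector(A):
        # Orthogonal projector onto the column space of A.
        return A @ np.linalg.pinv(A)

    n = y.size
    r = np.linalg.matrix_rank(X)
    q = r - np.linalg.matrix_rank(X0)
    y_model = projector(X) @ y       # projection on the model space
    y_hyp = projector(X0) @ y        # projection on the hypothesis space
    ss_e = np.sum((y - y_model) ** 2)
    ss_h = np.sum((y_model - y_hyp) ** 2)
    return (ss_h / q) / (ss_e / (n - r)), int(q), int(n - r)

# One-way layout, two groups of two observations; hypothesis: equal group means.
X = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 0.0, 1.0]])
X0 = np.ones((4, 1))
F, q, df_error = f_statistic([1.0, 3.0, 5.0, 9.0], X, X0)
print(F, q, df_error)
```

For these data the error sum of squares is $10$ on $2$ degrees of freedom and the hypothesis sum of squares is $25$ on $1$ degree of freedom, giving $F = 5$.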

## MANOVA.

There are several good textbooks on multivariate analysis that treat various aspects of MANOVA. Among the major ones are [a1], [a8], [a19], [a29], [a36], [a41], and [a43], Chap. 8. See also [a56], headings Multivariate Analysis; Multivariate Analysis Of Variance, and [a14]. The ideas involved in MANOVA are essentially the same as in ANOVA, but there is an added dimension in that the observations are now multivariate. For instance, if measurements are made on $p$ different features of the same individual, then this should be regarded as one observation on a $p$-variate distribution. The MANOVA model is given by (a2). A linear hypothesis on $\Theta$ analogous to (a6) is

$$ H\colon C\Theta = 0, \tag{a8} $$

with $C$ as in (a6). Any ANOVA testing problem defined by the choice of $X$ in (a1) and $C$ in (a6) carries over to the same kind of problem given by (a2) and (a8). However, since $\Theta$ is a matrix, there are other ways than (a8) of formulating a linear hypothesis. The most obvious extension of (a8) is

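$$ H\colon C\Theta D = 0, \tag{a9} $$

in which $D$ is a known $p \times p_1$-matrix of rank $p_1$. However, (a9) can be reduced to (a8) by making the transformation $Y^* = YD$, of order $n \times p_1$, $\Theta^* = \Theta D$, $E^* = ED$; then the model is $Y^* = X\Theta^* + E^*$, with the rows of $E^*$ independent identically distributed $N(0,\Sigma^*)$, $\Sigma^* = D'\Sigma D$, and $H\colon C\Theta^* = 0$. Thus, the transformed problem is as (a2), (a8), with $p_1$ replacing $p$. This can be applied, for instance, to profile analysis; see [a29], Sect. 5.4 (A5), [a36], Sects. 4.6, 5.6.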

There is a canonical form of the MANOVA testing problem (a2), (a8) analogous to the ANOVA problem (a1), (a6), the difference being that the real-valued random variables $z_i$ of ANOVA are replaced by $1 \times p$ random vectors. These vectors form the rows of three random matrices, $Z_1$ of order $q \times p$, $Z_2$ of order $(r-q) \times p$, and $Z_3$ of order $(n-r) \times p$, all of whose rows are assumed independent and $p$-variate normal with common non-singular covariance matrix $\Sigma$; furthermore, $\mathsf{E}Z_3 = 0$, $\mathsf{E}Z_2$ is unspecified, and $H$ specifies $\mathsf{E}Z_1 = 0$. It is assumed that $n - r \ge p$. Put $S_H = Z_1'Z_1$ and $S_E = Z_3'Z_3$, so that $S_E/(n-r)$ is an unbiased estimator of $\Sigma$. For testing $H$, $Z_2$ is ignored and the sums of squares $SS_H$ and $SS_e$ of ANOVA are replaced by the $p \times p$-matrices $S_H$ and $S_E$, respectively. An application of sufficiency plus the principle of invariance restricts tests of $H$ to those that depend only on the positive characteristic roots of $S_H S_E^{-1}$ (which are the positive characteristic roots of $Z_1 S_E^{-1} Z_1'$). The case $q = 1$, when $Z_1$ is a row vector, deserves special attention. It arises, for instance, when testing for zero mean in a single multivariate population or testing the equality of means in two such populations. Then $Z_1 S_E^{-1} Z_1'$ is the only positive characteristic root; $T^2 = (n-r) Z_1 S_E^{-1} Z_1'$ is called Hotelling's $T^2$, and $(n-r-p+1)p^{-1} Z_1 S_E^{-1} Z_1'$ has an $F$-distribution with degrees of freedom $(p, n-r-p+1)$, central or non-central according as $H$ is true or false. Rejecting $H$ for large values of $T^2$ is uniformly most powerful invariant. If $q \ge 2$ there is no best way of combining the characteristic roots, so that there is no uniformly most powerful invariant test (in contrast to ANOVA). The following tests have been proposed:

reject $H$ if $|S_E| / |S_H + S_E| < \text{const}$ (Wilks LR test);

reject $H$ if the largest characteristic root of $S_H S_E^{-1}$ exceeds a constant (Roy's test);

reject $H$ if $\operatorname{tr}(S_H S_E^{-1}) > \text{const}$ (Lawley–Hotelling test);

reject $H$ if $\operatorname{tr}(S_H (S_H + S_E)^{-1}) > \text{const}$ (Bartlett–Nanda–Pillai test).

For references, see [a1], Sects. 8.3, 8.6, or [a36], Chap. 5. For distribution theory, see [a1], Sects. 8.4, 8.6, [a41], Sects. 10.4–10.6, [a55], Sect. 10.3. Tables and charts can be found in [a1], Appendix, and [a36], Appendix.
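All four criteria are functions of the characteristic roots of the hypothesis matrix relative to the error matrix, which the following sketch makes explicit. It assumes NumPy and takes the two $p \times p$ matrices (called `S_H` and `S_E` here) as given; the function name and the small diagonal example are hypothetical, and the critical constants, which depend on the distribution theory cited above, are not computed.

```python
import numpy as np

def manova_statistics(S_H, S_E):
    """Wilks, Roy, Lawley-Hotelling and Bartlett-Nanda-Pillai statistics."""
    S_H = np.asarray(S_H, dtype=float)
    S_E = np.asarray(S_E, dtype=float)
    # Roots of S_E^{-1} S_H, which coincide with those of S_H S_E^{-1}.
    roots = np.linalg.eigvals(np.linalg.solve(S_E, S_H)).real
    roots = np.clip(roots, 0.0, None)   # remove tiny negative rounding errors
    return {
        "wilks": float(np.prod(1.0 / (1.0 + roots))),     # |S_E| / |S_H + S_E|
        "roy": float(roots.max()),                        # largest root
        "lawley_hotelling": float(roots.sum()),           # tr(S_H S_E^{-1})
        "pillai": float(np.sum(roots / (1.0 + roots))),   # tr(S_H (S_H + S_E)^{-1})
    }

S_E = np.eye(2)
S_H = np.diag([1.0, 3.0])
stats = manova_statistics(S_H, S_E)
print(stats)
```

In this example the roots are $1$ and $3$, so the Wilks statistic is $1/8$; small values of it, and large values of the other three statistics, speak against the hypothesis.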

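The problem of expressing the matrices $S_H$ and $S_E$ in terms of the original model given by (a2), (a8) is very similar to the situation in ANOVA. One way is to express $S_H$ and $S_E$ explicitly in terms of the data $Y$ and the matrices $X$ and $C$. Another is to consider the ANOVA problem with the same $X$ and $C$; if explicit formulas exist for $SS_H$ and $SS_e$, they can be converted to $S_H$ and $S_E$. For instance, in the ANOVA two-way layout (a4) $SS_e = \sum_{i,j,k}(y_{ijk} - y_{ij\cdot})^2$ converts to $S_E = \sum_{i,j,k}(y_{ijk} - y_{ij\cdot})(y_{ijk} - y_{ij\cdot})'$ in the corresponding MANOVA problem, where now the $y_{ijk}$ are $p \times 1$-vectors.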
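For the simplest MANOVA problem, testing for zero mean in a single multivariate population, the statistic can be written directly in terms of the sample mean and the sample covariance matrix. The sketch below assumes NumPy, the standard one-sample form of Hotelling's $T^2$, and a tiny made-up data set chosen so that the result can be verified by hand; the function name is hypothetical.

```python
import numpy as np

def hotelling_one_sample(data):
    """Hotelling's T^2 for H: mean = 0, with its F transform.

    data is an n x p array of observations with n > p.
    Returns (T2, F, (p, n - p)).
    """
    data = np.asarray(data, dtype=float)
    n, p = data.shape
    xbar = data.mean(axis=0)
    S = np.cov(data, rowvar=False)           # unbiased estimator of Sigma
    t2 = n * (xbar @ np.linalg.solve(S, xbar))
    f = (n - p) / ((n - 1) * p) * t2         # F with (p, n - p) d.f. under H
    return t2, f, (p, n - p)

data = [[2.0, 1.0], [0.0, 1.0], [1.0, 2.0], [1.0, 0.0]]
T2, F_value, dof = hotelling_one_sample(data)
print(T2, F_value, dof)
```

Here the sample mean is $(1, 1)$ and the sample covariance matrix is $\tfrac{2}{3}$ times the identity, so $T^2 = 12$ and the $F$ value is $4$ with $(2, 2)$ degrees of freedom.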

### Point estimation.

In the canonical system $Z_i$ is an unbiased estimator and the maximum-likelihood estimator of $\mathsf{E}Z_i$, $i = 1, 2$ (cf. also Maximum-likelihood method). If $f$ is a linear function of $(\mathsf{E}Z_1, \mathsf{E}Z_2)$, then $f(Z_1, Z_2)$ is both an unbiased estimator and a maximum-likelihood estimator of $f$. An unbiased estimator of $\Sigma$ is $S_E/(n-r)$, whereas its maximum-likelihood estimator is $S_E/n$.

### Confidence intervals and sets.

There are several kinds of linear functions of that are of interest. The direct analogue of a linear function of in ANOVA is a function of the form (with of order ), which is a -vector. This leads to a confidence set in -space for , rather than an interval. Simultaneous confidence sets for all can be derived from any of the proposed tests for , but it turns out that only Roy's maximum root test is exact with respect to these confidence sets (and not, for instance, the LR test of Wilks); see [a52], [a53]. The same is true for simultaneous confidence sets for all , and confidence intervals for all . Simultaneous confidence sets for all were given in [a18]. In [a46] simultaneous confidence intervals for all are derived (called "double linear compounds" ). These are special cases of all (possibly matrix-valued) functions of the form are treated in [a11]. The most general linear functions of are of the form . Simultaneous confidence intervals for all such functions as runs through all -matrices are given in [a37]. These are derived from a test defined in terms of a symmetric gauge function rather than from Roy's maximum root test. In [a52], [a53] a generalization of this is given if has its rank restricted; for this reproduces the confidence intervals of [a46].

### Step-down procedures.

Partition $\Theta$ into its columns $\theta_1,\dots,\theta_p$; then $H$ of (a8) is the intersection of the component hypotheses $H_j\colon C\theta_j = 0$, $j = 1,\dots,p$. Also partition $Y$ into its columns $y_1,\dots,y_p$. Then for each $j$, the hypothesis $H_j$ is tested with a univariate ANOVA $F$-test that depends only on $y_1,\dots,y_j$. If any $H_j$ is rejected, then $H$ is rejected. The tests are independent, which permits easy determination of the overall level of significance in terms of the individual ones. For details, history of the subject and references, see [a38] and [a39], Sect. 3. A variation, based on $P$-values, is presented in [a40]. Step-down procedures are convenient, but it is shown in [a34] that even in the simplest case when $q = 1$, a step-down test is not admissible. Furthermore, a step-down test is not exact with respect to simultaneous confidence intervals or confidence sets derived from the test for various linear functions of $\Theta$; see [a53], Sect. 4.4. A generalization of step-down procedures is proposed in [a38] by grouping the column vectors of $\Theta$ and $Y$ into blocks.

### Random effects models.

Some references on this topic in MANOVA are [a2] and [a35]; see also references quoted therein.

### Missing data.

Statistical experiments involving multivariate observations bring in an element that is not present with univariate observations, such as in ANOVA. Above, it has been taken for granted that for every individual in a sample all variates are observed. In practice this is not always true, for various reasons, in which case some of the observations have missing data. (This is not to be confused with the notion of empty cells in ANOVA.) If that happens, one can group all observations with complete data together as the complete sample and call the remaining observations an incomplete sample. From a slightly different point of view, the incomplete sample is sometimes considered extra data on some of the variates. The analysis of MANOVA problems is more complicated when there are missing data. In the simplest case, all missing data are on the same variates. This is a special case of nested missing data patterns. In the latter case explicit expressions of maximum-likelihood estimators are possible; see [a3] and the references therein. For more complicated missing data patterns explicit maximum-likelihood estimators are usually not available unless certain assumptions are made on the structure of the unknown covariance matrix $\Sigma$; see [a3], [a4] and [a5]. The situation is even worse for testing. For instance, even in the simplest case of testing the hypothesis that the mean of a multivariate population is $0$, if in addition to a complete sample there is an incomplete one taken on a subset of the variates, then there is no locally (let alone uniformly) most-powerful test; see [a9]. Several aspects of estimation and testing in the presence of various patterns of missing data can be found in [a25], which also contains many references to other papers in the field.

## GMANOVA.

This topic has not been recognized as a distinct entity within multivariate analysis until relatively recently. Consequently, most of today's (2000) knowledge of the subject is found in the research literature, rather than in textbooks. (There is an introduction to GMANOVA in [a41], Problem 10.18, and a little can be found in [a8], Sect. 9.6, second part.) A good exposition of testing aspects of GMANOVA, pointing to applications in various experimental settings, is given in [a21].

The general GMANOVA model was first stated in [a42], where the motivation was the modelling of experiments on the comparison of growth curves in different populations. Suppose such a growth curve can be represented by a polynomial in the time $t$, say $f(t) = \beta_0 + \beta_1 t + \dots + \beta_k t^k$. If measurements are made on an individual at times $t_1,\dots,t_p$, then these data are thought of as one observation on a $p$-variate population with population mean $(f(t_1),\dots,f(t_p))$ and covariance matrix $\Sigma$, where the $\beta$s and $\Sigma$ are unknown parameters. Suppose $m$ populations are to be compared and a sample of size $n_i$ is taken from the $i$th population, $i = 1,\dots,m$. In order to model this by (a3), let the $i$th column of $X_1$ (corresponding to the $i$th population) have $n_i$ $1$s, and $0$s otherwise. Specifically, the first column has a $1$ in positions $1,\dots,n_1$, the second in positions $n_1+1,\dots,n_1+n_2$, etc.; then $n = \sum_i n_i$. Let the growth curve in the $i$th population be $\theta_{i0} + \theta_{i1}t + \dots + \theta_{ik}t^k$; then the matrix $\Theta$ has $m$ rows, the $i$th row being $(\theta_{i0},\dots,\theta_{ik})$, so that $s = k+1$ in (a3); and $X_2$ has $p$ columns, the $j$th one being $(1, t_j, \dots, t_j^k)'$. (In the example given in [a42], measurements were taken at ages 8, 10, 12, and 14 in a group of girls and a group of boys; each measurement was of a certain distance between two points inside the head (with help of an X-ray picture) that is of interest in orthodontics to monitor growth.)

Linear hypotheses are in general of the form (a9). For instance, suppose two growth curves are to be compared, both assumed to be straight lines ($k = 1$) so that $m = 2$, $s = 2$. Suppose the hypothesis is $\theta_{11} = \theta_{21}$ (equal slope in the two populations). Then in (a9) one can take $C = (1, -1)$ and $D = (0, 1)'$. Other examples of GMANOVA may be found in [a21].
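The two design matrices of a growth-curve model and the equal-slope hypothesis can be sketched as follows. The sketch assumes NumPy; the function name `growth_curve_design`, the group sizes (3 and 2) and the coefficient values are hypothetical and chosen only for illustration, while the measurement ages 8, 10, 12, 14 are those mentioned for the example of [a42].

```python
import numpy as np

def growth_curve_design(group_sizes, times, degree):
    """Return the two design matrices of a polynomial growth-curve model.

    X1 is n x m (group indicators); X2 is (degree + 1) x p with
    j-th column (1, t_j, ..., t_j^degree)'.
    """
    n, m = sum(group_sizes), len(group_sizes)
    X1 = np.zeros((n, m))
    start = 0
    for i, n_i in enumerate(group_sizes):
        X1[start:start + n_i, i] = 1.0     # n_i ones in column i
        start += n_i
    X2 = np.vander(np.asarray(times, dtype=float), degree + 1, increasing=True).T
    return X1, X2

X1, X2 = growth_curve_design(group_sizes=[3, 2], times=[8, 10, 12, 14], degree=1)

# Hypothetical coefficients: row i holds (intercept, slope) of population i.
Theta = np.array([[20.0, 0.50],
                  [22.0, 0.75]])
mean_matrix = X1 @ Theta @ X2              # expected value of the n x p data matrix

# Equal-slope hypothesis written as C Theta D = 0.
C = np.array([[1.0, -1.0]])
D = np.array([[0.0], [1.0]])
slope_difference = (C @ Theta @ D).item()
print(mean_matrix, slope_difference)
```

Each row of the mean matrix is the straight line of the corresponding population evaluated at the four ages; with these made-up coefficients the slope difference is $-0.25$, so the hypothesis does not hold for them.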

A canonical form for the GMANOVA model was derived in [a13]; it can also be found in [a21], Sect. 3.2. It can be obtained from the canonical form of MANOVA by partitioning the matrices columnwise into three blocks, resulting in matrices , . Invariance reduction eliminates all except and (the latter is used for estimating the relevant portion of the unknown covariance matrix ). It is given that and ; inference is desired on , e.g., to test the hypothesis . Further sufficiency reduction leads to two matrix-valued statistics and ([a20], [a21]), of which is the most important and is built-up from the following statistic:

 (a10)

in which (with ) is the estimated regression of on , the true regression being . That inference on should be centred on can be understood intuitively by realizing that if were known, then minimizes the variances among all linear combinations of and whose mean is , and provides therefore better inference than using only . The unknown regression is then estimated by , leading to of (a10).

The essential difference between GMANOVA and MANOVA lies in the presence of , which is correlated with and has zero mean. Then is used as a covariate for ; see, e.g., [a33]. However, not all models that appear to be GMANOVA produce such a covariate. More precisely, if in (a3) , then it turns out that in the canonical form there are no matrices and the model reduces essentially to MANOVA. This situation was encountered previously when it was pointed out that the MANOVA model (a2) together with the GMANOVA-type hypothesis (a9) was immediately reducible to straight MANOVA. The same conclusion would have been reached after treating (a2), (a9) as a special case of GMANOVA and inspecting the canonical form. For a "true" GMANOVA the existence of is essential. A typical example of true GMANOVA, where the covariate data are built into the experiment, was given in [a7].

Inference on can proceed using only (e.g., [a27], and [a13]), but is not necessarily the best possible. For testing an essentially complete class of tests include those that also involve explicitly. One such test is the locally most-powerful test derived in [a20]. For the distribution theory of see [a21], Sect. 3.6, and [a54], Sect. 6.5. Admissibility and inadmissibility results were obtained in [a32]; comparison of various tests can also be found there. A natural estimator of is of (a10); it is an unbiased estimator and in [a22] it is shown to be best equivariant. Other kinds of estimators have also been considered, e.g., in [a24], in which several references to earlier work can be found. Simultaneous confidence intervals and sets have been treated in [a16], [a17], [a27], and [a28]. Special structures of the covariance matrix have been studied in [a44], where also references to earlier work on related topics can be found.

### Generalizations.

A natural generalization of the GMANOVA model is indicated in [a13] by having a further partitioning of the blocks of s in the canonical form. This is called extended GMANOVA in [a21] and examples are given there. Another generalization involves some relaxation of the usual assumptions of multivariate normality, etc. See [a23], [a12], [a17].