The label "analysis of variance" (ANOVA) covers a family of techniques
for identifying and measuring the sources of variation in data. Specifically,
ANOVA procedures partition the total variation in a data set into its component parts.
ANOVA models come in many shapes and sizes, called designs.
The Extreme Optimization Numerical Libraries for .NET
support the three most common designs: one-way, one-way with repeated measures, and two-way
analysis of variance. However, the infrastructure is in place to handle designs
of any size and complexity.
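To make the idea of partitioning the total variation concrete, the following is a minimal sketch in plain Python for a one-way design. It does not use the library's API; the data and all variable names are made up for illustration. It shows that the total sum of squares splits into a between-group part and a within-group part.

```python
from statistics import fmean

# Hypothetical data: one factor with three levels (A, B, C).
groups = {
    "A": [4.0, 5.0, 6.0],
    "B": [7.0, 8.0, 9.0],
    "C": [1.0, 2.0, 3.0],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = fmean(all_values)

# Total variation: squared deviations of every observation from the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

# Between-group variation: squared deviations of each group mean from the
# grand mean, weighted by the number of observations in the group.
ss_between = sum(
    len(values) * (fmean(values) - grand_mean) ** 2
    for values in groups.values()
)

# Within-group variation: squared deviations of each observation from
# the mean of its own group.
ss_within = sum(
    (x - fmean(values)) ** 2
    for values in groups.values()
    for x in values
)

print(ss_total, ss_between, ss_within)
```

For this data set the between-group and within-group sums of squares add up to the total sum of squares, which is the partition that ANOVA procedures rely on.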
All classes that implement ANOVA models inherit from a common base class,
AnovaModel, which in turn inherits from Model,
the base class of all statistical model classes.
In regression models, the dependent variable is a linear function
of the independent variables. In an ANOVA design, the independent variables
are categorical. The contribution of each individual combination of values
of the independent variables must be estimated separately. Some dependencies exist,
so the actual number of parameters is smaller than the number of combinations.
Depending on the design, some combinations may be excluded from the model,
further decreasing the number of parameters.
The set of all possible values of a categorical variable is called
a factor. The possible values are called the levels of the factor.
The purpose of an analysis of variance is to investigate the contribution
of each level of each factor, and of combinations of levels,
to the total variation of the data.
So even though the model is initially defined in terms of the dependent
and independent variables, the actual calculations are performed
using the factors rather than the independent variables they are associated with.
The GetFactor method of the AnovaModel class returns the IIndex
of the variable at the specified position. An overload allows you to
retrieve the factor associated with an independent variable through the variable's name.
The first step in performing an analysis of variance is to divide the data set
into groups of rows with the same values for the factors. The data that is
associated with a particular combination of factor levels is called a cell.
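The grouping step can be sketched in plain Python as follows. This is only an illustration of the concept, not the library's implementation; the data set, the factor names (fertilizer, watering), and the variable names are hypothetical.

```python
from collections import defaultdict

# Hypothetical observations: (fertilizer level, watering level, yield).
rows = [
    ("low", "daily", 4.1),
    ("low", "weekly", 3.2),
    ("high", "daily", 6.0),
    ("low", "daily", 4.5),
    ("high", "weekly", 5.1),
    ("high", "daily", 6.4),
]

# The levels of each factor are the distinct values of the categorical variable.
fertilizer_levels = sorted({fertilizer for fertilizer, _, _ in rows})
watering_levels = sorted({watering for _, watering, _ in rows})

# A cell collects the observations that share one combination of factor levels.
cells = defaultdict(list)
for fertilizer, watering, value in rows:
    cells[(fertilizer, watering)].append(value)

for key in sorted(cells):
    print(key, cells[key])
```

Each dictionary key is one combination of factor levels, and the list stored under it holds the observations that belong to that cell.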
Cells are implemented by the Cell class. This class has a number of
properties that return summary statistics for the data in the cell.
The most important ones are Count, which returns the number of observations
in the cell; Mean, which returns the cell mean; and Variance, which returns
the variance of the data in the cell only.
Cell objects can't be created directly. Instead, they are accessed through
various properties of the models that return single cells or arrays of cells.
To access a specific cell, use the factor levels as indices. Using the special
index Cell.All in place of a factor level indicates that the cell contains
the totals over all levels of that factor. Setting all indices to Cell.All
indicates that the cell represents summary data for the entire data set.
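The following plain-Python sketch illustrates how indexing by factor level with an "all levels" marker might behave. It is not the library's Cell class: the ALL sentinel, the cell_summary function, and the data are hypothetical stand-ins, and the use of the sample variance is an assumption.

```python
from statistics import fmean, variance

# Hypothetical stand-in for the "all levels" index; not the library's Cell.All.
ALL = object()

# Hypothetical observations: (level of factor A, level of factor B, value).
data = [
    ("low", "daily", 4.0),
    ("low", "daily", 6.0),
    ("low", "weekly", 3.0),
    ("low", "weekly", 5.0),
    ("high", "daily", 7.0),
    ("high", "daily", 9.0),
    ("high", "weekly", 6.0),
    ("high", "weekly", 8.0),
]

def cell_summary(level_a, level_b):
    """Return (count, mean, sample variance) for the selected cell.

    Passing ALL for a factor pools the observations over every level
    of that factor.
    """
    values = [
        y
        for a, b, y in data
        if (level_a is ALL or a == level_a) and (level_b is ALL or b == level_b)
    ]
    return len(values), fmean(values), variance(values)

print(cell_summary("low", "daily"))  # a single cell
print(cell_summary("low", ALL))      # totals over all levels of factor B
print(cell_summary(ALL, ALL))        # summary of the entire data set
```

The first call selects one combination of levels, the second pools over one factor, and the third covers every observation.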
The results of an analysis of variance are in the same format
as those of other linear models.
The AnovaTable property returns the AnovaTable
object that summarizes the results. The number of rows in the table varies
with the details of the design. The TotalRow property always returns the
AnovaRow for the complete data. The ErrorRow property returns the row for
the residuals. The CompleteModelRow property returns the row for all the
factors or interactions in the model combined.
Rows corresponding to the individual factors and interactions in the model
can be retrieved through the GetModelRow method.
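As a rough illustration of what the model, error, and total rows of such a table contain, here is a minimal plain-Python sketch for a one-way design. It does not call the library; the data, the dictionary layout, and the key names are hypothetical, and the row labels simply mirror the rows described above.

```python
from statistics import fmean

# Hypothetical data: one factor with three levels.
groups = {
    "A": [1.0, 3.0, 5.0],
    "B": [4.0, 6.0, 8.0],
    "C": [7.0, 9.0, 11.0],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = fmean(all_values)
n = len(all_values)   # total number of observations
k = len(groups)       # number of levels of the factor

# Sums of squares for the model (between groups) and the error (residuals).
ss_model = sum(len(v) * (fmean(v) - grand_mean) ** 2 for v in groups.values())
ss_error = sum((x - fmean(v)) ** 2 for v in groups.values() for x in v)

# Each row holds a sum of squares and its degrees of freedom.
table = {
    "model": {"ss": ss_model, "df": k - 1},
    "error": {"ss": ss_error, "df": n - k},
    "total": {"ss": ss_model + ss_error, "df": n - 1},
}

# Mean squares and the F statistic for the model row.
ms_model = table["model"]["ss"] / table["model"]["df"]
ms_error = table["error"]["ss"] / table["error"]["df"]
f_statistic = ms_model / ms_error

for name, row in table.items():
    print(name, row)
print("F =", f_statistic)
```

In a one-way design the model row coincides with the row for the single factor; designs with several factors or interactions would have one additional row per term.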