## Defining models using formulas

Formulas are a way to compactly specify the variables that appear in a model and their roles.

The syntax for formulas is very similar to that of R and its predecessor, S. For example, the formula for a linear regression of a variable y on a set of variables, x, a, and b is:

`y ~ x + a + b`

The left-hand side of this equation, before the ~ sign, specifies the dependent variable(s). The right-hand side specifies the independent variables.

The terms in a formula can be more complicated expressions, like product terms and interactions. It is helpful to think of terms as sets of variables, where a term like x is a set consisting of one variable. The operators in the formula are operations on sets of variables.

In all, the formula language includes 6 operators. They are, in order of lowest to highest precedence:

Parentheses can be used to change the order of operations. All operators are left-associative, so x - a - b is equivalent to (x - a) - b.
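The set view and left-associativity can be illustrated with a short Python sketch. This is not the library's parser: it assumes that `+` adds a variable to the set and `-` removes one, it handles only a flat chain of plain names without parentheses, and the helper name `evaluate_terms` is hypothetical.

```python
import re

def evaluate_terms(expression):
    """Evaluate a flat chain of + and - over sets of variable names.

    Illustrative only: plain names, '+' and '-', no parentheses.
    """
    tokens = re.findall(r"[A-Za-z_]\w*|[+-]", expression)
    result = {tokens[0]}
    # Walk the operator/operand pairs from left to right (left-associative).
    for op, name in zip(tokens[1::2], tokens[2::2]):
        if op == "+":
            result = result | {name}
        else:
            result = result - {name}
    return result

# Read as ((x + a) + b) - a, which leaves x and b.
print(sorted(evaluate_terms("x + a + b - a")))
```

The order of evaluation matters: `a - a + a` read from the left, as `(a - a) + a`, still contains `a`, whereas a right-to-left reading, `a - (a + a)`, would leave nothing.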

Variables are specified by name. If a name contains spaces or other reserved characters, it can be quoted using back quotes (`), for example:

`` Result ~ `Item 1` + `Item 2 + 3` ``
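As a rough illustration of how back-quoted names keep their spaces and operator characters, the following Python sketch pulls the variable names out of a formula string. The regular expression and the function name `variable_names` are assumptions made for this example and do not reflect how the library tokenizes formulas.

```python
import re

# A name is either a back-quoted string or a plain identifier.
NAME = re.compile(r"`([^`]+)`|([A-Za-z_][A-Za-z0-9_.]*)")

def variable_names(formula):
    """Return the variable names in a formula, in order of appearance."""
    return [quoted or plain for quoted, plain in NAME.findall(formula)]

# The '+' inside the second quoted name is part of the name, not an operator.
print(variable_names("Result ~ `Item 1` + `Item 2 + 3`"))
```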

In addition, two special terms, 1 and 0, indicate the presence or absence of an intercept term, as discussed in the next section.

Finally, the . term is a special value that represents all the variables in the dataset that have not been used up to that point. This is particularly useful for situations where the dataset contains many variables. If y is the dependent variable, and all other variables should be included in the model, then the formula is simply

`y ~ .`
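The effect of the `.` term can be sketched in a few lines of Python. The function `expand_dot` is hypothetical and makes simplifying assumptions: a single dependent variable, terms separated only by `+`, and no quoted names.

```python
def expand_dot(formula, columns):
    """Replace '.' with every column not used up to that point."""
    lhs, rhs = (side.strip() for side in formula.split("~"))
    used = [lhs]
    terms = []
    for term in (t.strip() for t in rhs.split("+")):
        if term == ".":
            new = [c for c in columns if c not in used]
        else:
            new = [term]
        terms.extend(new)
        used.extend(new)
    return f"{lhs} ~ {' + '.join(terms)}"

columns = ["y", "x", "a", "b"]
print(expand_dot("y ~ .", columns))      # y ~ x + a + b
print(expand_dot("y ~ x + .", columns))  # y ~ x + a + b
```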

Most linear models include an intercept or constant term. For convenience, formulas for such models include the term by default, even if it is not specified. So

`y ~ x + a + b`

is really equivalent to

`y ~ 1 + x + a + b`

There are two ways to exclude an intercept term from a model. The first is to explicitly remove it at the end:

`y ~ x + a + b - 1`

The second is to include the 'no intercept' term, `0`, *at the start* of the formula:

`y ~ 0 + x + a + b`

Only regression models, including logistic regression models, include the intercept by default. Models that don't generally include an intercept term, like clustering models or PCA, don't include an intercept term by default.
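The intercept rules above can be summarized in a small Python sketch. It is an illustration under stated assumptions rather than the library's logic: the function `has_intercept` and its `default` parameter are hypothetical, and the formula may contain only simple terms joined by `+` and `-`, with no quoted names.

```python
def has_intercept(formula, default=True):
    """Decide whether a formula of simple '+'/'-' terms keeps its intercept.

    `default` stands for the model's default: True for regression models,
    False for models such as clustering or PCA.
    """
    rhs = formula.split("~", 1)[1]
    # Normalize so that every removed term carries an explicit minus sign.
    tokens = rhs.replace("-", "+ -").split("+")
    intercept = default
    for token in (t.replace(" ", "") for t in tokens):
        if token == "1":
            intercept = True
        elif token in ("0", "-1"):
            intercept = False
    return intercept

print(has_intercept("y ~ x + a + b"))      # True
print(has_intercept("y ~ x + a + b - 1"))  # False
print(has_intercept("y ~ 0 + x + a + b"))  # False
```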

Categorical variables are special. First, an interaction between a categorical variable and itself does not add any information to the model. This is in contrast to numerical variables.

Second, most models require variables to be numerical, so in order to include categorical variables, they must be encoded into one or more indicator variables.

What complicates matters is that full encodings usually result in linear dependencies between the indicator variables. Put another way: adding the full set of indicator variables to a model would add redundant information.

For example, if a boolean variable is encoded using two indicator variables, one that has a 1 for true and zeros elsewhere, and one that has a 1 for false and zeros elsewhere, then the sum of the indicator variables will have 1 everywhere, which makes it exactly the same as the intercept term. If an intercept term is already present, then adding the second indicator variable does not add any new information, because it can be calculated from the intercept and the first indicator variable.

For this reason, in linear models not all indicator variables will end up being included in the model. Only indicator variables that add information to the model will be included.
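The redundancy described above is easy to check numerically. The following Python sketch uses made-up sample values and plain lists; it only demonstrates the linear dependency and does not involve the library.

```python
# Hypothetical sample of a boolean variable.
values = [True, False, True, True, False]

intercept = [1] * len(values)
is_true = [1 if v else 0 for v in values]
is_false = [0 if v else 1 for v in values]

# The two indicators always sum to the intercept column ...
assert [t + f for t, f in zip(is_true, is_false)] == intercept

# ... so the second indicator can be computed from the other two columns.
recovered = [i - t for i, t in zip(intercept, is_true)]
print(recovered == is_false)  # True
```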

Categorical variables can be encoded in a variety of ways. Each encoding will produce different values for the model parameters. The interpretation of parameter values is different as well. Each encoding scheme and each encoding within a scheme brings out a different aspect of the role of the variable in the model. For this reason, an encoding of a categorical variable is sometimes referred to as a contrast.

Encodings are specific to the levels of a categorical variable (its CategoryIndex property). Categorical encodings are implemented by the CategoricalEncoding class. This class has no public constructors. Instead, one of the static methods should be used to create the encoding:

| Name | Description |
|---|---|
| | Also called one hot encoding. Every level is compared against the reference level. Every level except the reference level is encoded using a binary variable. The first level is the default for the reference level. |
| | Each level is compared to the reference level. The grand mean serves as the intercept. The first level is the default for the reference level. |
| | Every level except the reference level is encoded using one of three values: 1 if the value equals the level, -1 if the value equals the reference level, and 0 otherwise. |
| | Only valid for ordinal variables where the levels are ordered. The levels are encoded as orthogonal polynomials which reflect linear, quadratic, cubic... trends in the categorical variable. |
| | Only valid for ordinal variables where the levels are ordered. Each level is compared to the next level. |
| | Only valid for ordinal variables where the levels are ordered. Each level is compared to the previous level. |
| | Only valid for ordinal variables where the levels are ordered. Each level is compared to the mean of subsequent levels. |
| | Only valid for ordinal variables where the levels are ordered. Each level is compared to the mean of previous levels. |
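To make two of these descriptions concrete, the Python sketch below builds the columns for the binary-variable encoding and for the 1 / -1 / 0 encoding, with the first level as the reference level. The function names `indicator_columns` and `signed_columns` and the sample data are hypothetical; the code does not use or reproduce the CategoricalEncoding class.

```python
def indicator_columns(values, levels, reference=None):
    """One 0/1 column for every level except the reference level."""
    reference = levels[0] if reference is None else reference
    return {
        level: [1 if v == level else 0 for v in values]
        for level in levels if level != reference
    }

def signed_columns(values, levels, reference=None):
    """One column per non-reference level: 1, -1 (reference level) or 0."""
    reference = levels[0] if reference is None else reference

    def code(v, level):
        if v == level:
            return 1
        return -1 if v == reference else 0

    return {
        level: [code(v, level) for v in values]
        for level in levels if level != reference
    }

values = ["low", "high", "mid", "low"]
levels = ["low", "mid", "high"]
print(indicator_columns(values, levels))
print(signed_columns(values, levels))
```

In both cases a variable with three levels yields two columns, because the reference level is represented implicitly.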

Each encoding has two variants: full rank and reduced rank. The reduced-rank encoding is used when the full-rank encoding would lead to redundancies.

The GetContrastMatrix(Boolean) method returns a matrix whose columns contain the encodings of the variable.

To set the encoding for a variable in a model, use the model's Data property to access the model's group of Features. You can then call the group's SetEncoding(String, Func

Copyright © 2004-2023, Extreme Optimization. All rights reserved.

*Extreme Optimization*, *Complexity made simple*, *M#*, and *M Sharp* are trademarks of ExoAnalytics Inc.

*Microsoft*, *Visual C#*, *Visual Basic*, *Visual Studio*, *Visual Studio.NET*, and the *Optimized for Visual Studio* logo are registered trademarks of Microsoft Corporation.