Formulas are a way to compactly specify the variables that appear
in a model and their roles.
The syntax for formulas is very similar to that of R and its predecessor, S.
For example, the formula for a linear regression of a variable y
on the variables x, a, and b is:
y ~ x + a + b
The left-hand side of this formula, before the ~ sign,
specifies the dependent variable(s). The right-hand side specifies the
independent variables.
The terms in a formula can be more complicated expressions, like product terms
and interactions. It is helpful to think of terms as sets of variables,
where a term like x is a set consisting of one variable.
The operations in the formula are operations on sets of variables.
In all, the formula language includes 6 operators.
They are, in order of lowest to highest precedence:
- ~
Separates the dependent variables or targets on the left from the
independent variables or features on the right. If not present,
the whole formula is taken to specify independent variables or features.
- +
Union operator. Combines the terms on the left and right and computes their union.
- -
Difference operator. Computes the set difference between two terms. It returns the set of
items that are in the left operand but not in the right operand.
This operator has the same precedence as +.
- *
Product operator. Like +, it computes the union of
the left and right terms, but also adds the interaction between
each term on the left and each term on the right.
In other words: a*b is equivalent to
a + b + a:b.
- :
Interaction operator. Computes the interaction between the left and right terms.
The result consists of the interactions of each term in the
left set with each term in the right set.
In numerical terms, the interaction between two variables corresponds
to their element-wise product.
- ^ or **
Power operator. Computes a polynomial. The right operand must be an integer
exponent, n. The result is the product operator *
applied to n copies of the left operand.
So, (a+b)**3 is equivalent to
(a+b)*(a+b)*(a+b).
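As a minimal numerical sketch of the interaction described above, the following Python fragment builds the columns that a*b would contribute for two numerical variables. The column values and the helper name interaction are illustrative assumptions, not part of the library.

```python
# Hypothetical numerical columns; the values are illustrative only.
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]

def interaction(u, v):
    """Element-wise product of two equally long numerical columns."""
    if len(u) != len(v):
        raise ValueError("columns must have the same length")
    return [ui * vi for ui, vi in zip(u, v)]

# Columns contributed by a*b, that is a + b + a:b (intercept not shown).
columns = {"a": a, "b": b, "a:b": interaction(a, b)}
```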
Parentheses can be used to change the order of operations.
All operators are left-associative, so x - a - b
is equivalent to (x - a) - b.
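The set view of the operators can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the library's implementation: every function name is hypothetical, and a term is modelled as a set of variable names, so an interaction of a variable with itself collapses to the variable (which matches categorical variables, not numerical ones).

```python
from itertools import product

def var(name):
    """A single variable: one term that contains one variable."""
    return {frozenset([name])}

def union(left, right):        # the + operator
    return left | right

def difference(left, right):   # the - operator
    return left - right

def interact(left, right):     # the : operator
    # Each term on the left is combined with each term on the right.
    return {l | r for l, r in product(left, right)}

def cross(left, right):        # the * operator
    return left | right | interact(left, right)

def power(terms, n):           # the ^ or ** operator
    # Applies * to n copies of the operand, left to right.
    result = terms
    for _ in range(n - 1):
        result = cross(result, terms)
    return result

def show(terms):
    """Render terms as sorted strings such as 'a:b' for easy comparison."""
    return sorted(":".join(sorted(t)) for t in terms)

a, b, x = var("a"), var("b"), var("x")
```

With these helpers, show(cross(a, b)) and show(union(union(a, b), interact(a, b))) give the same list of terms, and difference(difference(s, a), b) mirrors the left-associative reading of s - a - b.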
Variables are specified by name. If a name contains
spaces or other reserved characters, it can be quoted using
back quotes (`), for example:
Result ~ `Item 1` + `Item 2 + 3`
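To illustrate how back-quoted names keep spaces and operator characters out of the formula syntax, here is a small tokenizer sketch in Python. The token grammar is an assumption made for this example and may differ from the rules the library actually applies.

```python
import re

# Assumed token grammar: a back-quoted name, a plain identifier,
# an integer, the ** operator, or a single-character operator.
TOKEN = re.compile(r"`([^`]*)`|([A-Za-z_]\w*)|(\d+)|(\*\*|[~+\-*:^().])")

def tokenize(formula):
    """Split a formula into (kind, text) pairs, skipping whitespace."""
    tokens = []
    pos = 0
    while pos < len(formula):
        if formula[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(formula, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        quoted, name, number, op = m.groups()
        if quoted is not None:
            tokens.append(("name", quoted))   # back quotes are stripped
        elif name is not None:
            tokens.append(("name", name))
        elif number is not None:
            tokens.append(("number", number))
        else:
            tokens.append(("op", op))
        pos = m.end()
    return tokens
```

In this sketch the + inside `Item 2 + 3` is part of the variable name and is not reported as an operator.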
In addition, two special terms, 1 and
0, indicate the presence or absence of
an intercept term, as discussed in the next section.
Finally, the . term is a special value
that represents all the variables in the dataset that have not been
used up to that point. This is particularly useful for situations
where the dataset contains many variables. If y
is the dependent variable, and all other variables should be included
in the model, then the formula is simply:
y ~ .
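How the . term in a formula such as y ~ . might expand can be sketched as follows. The function name expand_dot and the left-to-right reading of "used up to that point" are assumptions for illustration; the library may resolve the term differently.

```python
def expand_dot(columns, dependent, rhs_terms):
    """Replace each "." in rhs_terms by the columns not used so far.

    columns   -- all variable names in the dataset, in order
    dependent -- names that appear on the left of ~
    rhs_terms -- simple right-hand-side terms, read left to right
    """
    used = set(dependent)
    expanded = []
    for term in rhs_terms:
        if term == ".":
            rest = [c for c in columns if c not in used]
            expanded.extend(rest)
            used.update(rest)
        else:
            expanded.append(term)
            used.add(term)
    return expanded
```

For a dataset with columns y, x, a, and b, expanding the right-hand side of y ~ . yields x, a, and b under these assumptions.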
Most linear models include an intercept or constant
term. For convenience, formulas for such models include the
term by default, even if it is not specified. So
y ~ x + a + b
is really equivalent to
y ~ 1 + x + a + b
There are two ways to exclude an intercept term from a model.
The first is to explicitly remove it at the end:
y ~ x + a + b - 1
The second is to include the 'no intercept' term 0 at the start
of the formula:
y ~ 0 + x + a + b
Only regression models, including logistic regression models,
include the intercept by default.
Models that don't generally include an intercept term, like
clustering models or PCA, don't include an intercept term by default.
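The intercept rules above can be summarized in a short Python sketch. The representation of the right-hand side as (sign, term) pairs and the function name has_intercept are assumptions for this example; the default argument stands in for the model type (True for regression models, False for models such as PCA).

```python
def has_intercept(rhs, default=True):
    """Decide whether a model gets an intercept.

    rhs     -- list of (sign, term) pairs, e.g. [("+", "x"), ("-", "1")]
    default -- whether the model type includes an intercept by default
    """
    include = default
    for sign, term in rhs:
        if term == "0" and sign == "+":
            include = False            # explicit 'no intercept' term
        elif term == "1":
            include = (sign == "+")    # "+ 1" adds it, "- 1" removes it
    return include
```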
Categorical variables are special.
First, an interaction between a categorical variable
and itself does not add any information to the model.
This is in contrast to numerical variables.
Second, most models require variables to be numerical,
so in order to include categorical variables, they must be
encoded into one or more indicator variables.
What complicates matters is that full encodings
usually result in linear dependencies between the indicator
variables. Put another way: adding the full set of
indicator variables to a model would add redundant information.
For example, if a boolean variable is encoded
using two indicator variables, one that has a 1
for true and zeros elsewhere, and one that
has a 1 for false and
zeros elsewhere, then the sum of the indicator variables
will have 1 everywhere, which makes it
exactly the same as the intercept term.
If an intercept term is already present, then
adding the second indicator variable does not add any new
information, because it can be calculated from
the intercept and the first indicator variable.
For this reason, in linear models not all indicator variables
will end up being included in the model. Only indicator
variables that add information to the model
will be included.
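The redundancy in the boolean example can be checked directly. The sketch below uses a made-up boolean column and plain Python lists; it only demonstrates the linear dependency and is not how the library decides which indicator variables to keep.

```python
values = [True, False, False, True, True]  # hypothetical boolean variable

intercept = [1] * len(values)
is_true = [1 if v else 0 for v in values]
is_false = [0 if v else 1 for v in values]

# The two indicator columns always add up to the intercept column...
assert [t + f for t, f in zip(is_true, is_false)] == intercept

# ...so the second indicator can be rebuilt from the other two columns.
reconstructed = [c - t for c, t in zip(intercept, is_true)]
```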
Encodings for Categorical Variables
Categorical variables can be encoded in a variety of ways.
Each encoding will produce different values
for the model parameters. The interpretation
of parameter values is different as well.
Each encoding scheme and each encoding within
a scheme brings out a different aspect of the role
of the variable in the model.
For this reason, an encoding of a categorical variable
is sometimes referred to as a contrast.
Encodings are specific to the levels
of a categorical variable (its
CategoryIndex
property). Categorical encodings are implemented by the
CategoricalEncoding
class. This class has no public constructors.
Instead, one of the static methods should be used
to create the encoding:
Name | Description |
---|---|
Dummy(IIndex, Int32) | Also called one-hot encoding. Every level is compared against the reference level. Every level except the reference level is encoded using a binary variable. The first level is the default for the reference level. |
Simple(IIndex, Int32) | Each level is compared to the reference level. The grand mean serves as the intercept. The first level is the default for the reference level. |
Deviation(IIndex, Int32) | Every level except the reference level is encoded using one of three values: 1 if the value equals the level, -1 if the value equals the reference level, and 0 otherwise. |
OrthogonalPolynomial(IIndex) | Only valid for ordinal variables where the levels are ordered. The levels are encoded as orthogonal polynomials which reflect linear, quadratic, cubic, and higher-order trends in the categorical variable. |
ForwardDifference(IIndex) | Only valid for ordinal variables where the levels are ordered. Each level is compared to the next level. |
BackwardDifference(IIndex) | Only valid for ordinal variables where the levels are ordered. Each level is compared to the previous level. |
Helmert(IIndex) | Only valid for ordinal variables where the levels are ordered. Each level is compared to the mean of subsequent levels. |
ReverseHelmert(IIndex) | Only valid for ordinal variables where the levels are ordered. Each level is compared to the mean of previous levels. |
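To make the dummy and deviation descriptions concrete, the following Python sketch builds reduced-rank contrast matrices with one row per level and one column per non-reference level. The function names are hypothetical and the code is independent of the library's CategoricalEncoding class; it only follows the textual descriptions in the table.

```python
def dummy_contrast(levels, reference=0):
    """Reference row is all zeros; every other level gets its own 1."""
    others = [i for i in range(len(levels)) if i != reference]
    return [[1 if i == j else 0 for j in others]
            for i in range(len(levels))]

def deviation_contrast(levels, reference=0):
    """Like dummy coding, but the reference row is all -1."""
    others = [i for i in range(len(levels)) if i != reference]
    return [[-1 if i == reference else (1 if i == j else 0) for j in others]
            for i in range(len(levels))]

levels = ["low", "mid", "high"]   # illustrative levels
dummy = dummy_contrast(levels)
deviation = deviation_contrast(levels)
```

With the first level as the reference, the rows of dummy are [0, 0], [1, 0], and [0, 1], and the rows of deviation are [-1, -1], [1, 0], and [0, 1]. Each deviation column sums to zero across levels.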
Each encoding has two variants: full rank and reduced rank.
The reduced-rank encoding is used when the full-rank encoding
would lead to redundancies.
The GetContrastMatrix(Boolean)
method returns a matrix whose columns contain the encodings of the variable.
To set the encoding for a variable in a model, use the model's
Data property
to access the model's group of
Features.
You can then call the group's
SetEncoding(String, Func<IIndex, Int32, CategoricalEncoding>, Int32)
method to select the encoding. The first argument is the key of the variable. The second
is a function that creates the encoding. This can be one of the static methods of the
CategoricalEncoding
class. The third argument is optional and specifies the reference level.
The GetEncoding(String)
method returns the current encoding.