Statistical models serve two main purposes. First, they may provide
insight into a phenomenon by quantifying the interactions between
variables and their contributions to outcomes.
Second, they may be used to predict outcomes for fresh data.
In either case, the process of creating and working with
a statistical model consists of several steps:
1. Gathering the data;
2. Choosing the type of model to fit, and which data to include
in the model;
3. Cleaning the data, including handling of missing values;
4. Fitting the model;
5. Validating the model using diagnostic information gathered
from the fitting process;
6. Using the model to make predictions.
The data frame
library is perfectly suited for the third step.
It also provides the means to get to step 4: fitting the model.
From there on, the statistical model classes provide all
the required functionality.
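The steps above can be sketched end to end. The following is a minimal Python illustration using pandas and numpy as stand-ins for the library's data frame and fitting classes; the data and variable names are hypothetical, and a real model class would expose far richer functionality.

```python
import numpy as np
import pandas as pd

# Step 1: gather the data (hypothetical observations; one x value is missing).
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, np.nan, 6.0],
    "y": [2.1, 3.9, 6.2, 8.1, 9.9, 12.2],
})

# Step 3: clean the data by dropping rows with missing values.
clean = df.dropna()

# Step 4: fit a simple linear model y ~ 1 + x by least squares.
X = np.column_stack([np.ones(len(clean)), clean["x"]])
beta, *_ = np.linalg.lstsq(X, clean["y"].to_numpy(), rcond=None)

# Step 5: validate with a basic diagnostic (R-squared).
resid = clean["y"].to_numpy() - X @ beta
r2 = 1 - resid @ resid / np.sum((clean["y"] - clean["y"].mean()) ** 2)

# Step 6: use the model to predict the outcome for fresh data.
y_new = beta[0] + beta[1] * 7.0
```

Steps 2, 5, and 6 are where the statistical model classes take over from the data frame library.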
Statistical models describe relationships between variables.
Specifying the variables that appear in a model is therefore
an essential step. Variables can play different roles in a model.
Features are variables that represent known properties
of the phenomenon under investigation. Depending on the context, they
may also be called independent, explanatory or exogenous variables,
regressors, or inputs. They may be numerical (continuous)
or categorical. Many models expect only numerical input, so categorical
variables must be encoded into numerical variables.
Targets are variables that describe the outcome.
The objective of a model is to describe how the features
affect the target. They are also called dependent, explained
or endogenous variables, outputs or (in classification) labels.
There may be 0, 1, or many targets.
Weights are numerical variables that indicate
the relative importance of each observation to the outcome.
In models that support them, weights are always optional.
Specific models may have variables that serve a specific purpose.
There is a great deal of flexibility in how the input to a model
is specified. Usually, the data is supplied in the constructor
as a data frame.
When the data is supplied as a data frame, the column keys
of the model variables must be specified. This can be done either
by supplying them directly or as an R-style formula.
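The two ways of specifying column keys can be contrasted with a small sketch. This is not the library's actual API: the data frame, column names, and the toy formula parser below are purely illustrative, and real formula implementations handle far more than the simplest `y ~ x1 + x2` case.

```python
import pandas as pd

df = pd.DataFrame({
    "y":  [1.0, 2.0, 3.0, 4.0],
    "x1": [0.5, 1.5, 2.5, 3.5],
    "x2": [2.0, 1.0, 2.0, 1.0],
})

# Option 1: supply the column keys of the model variables directly.
target, features = "y", ["x1", "x2"]

# Option 2: an R-style formula expressing the same model.
formula = "y ~ x1 + x2"

# A toy parser for the simplest formulas, to show the correspondence.
lhs, rhs = (side.strip() for side in formula.split("~"))
parsed_features = [term.strip() for term in rhs.split("+")]

assert (lhs, parsed_features) == (target, features)
```

Either way, the result is the same: a target column and a list of feature columns that the fitting algorithm can extract from the data frame.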
The supplied variables are then used to prepare the input for the
model fitting algorithm. The model may derive additional variables
from the input. For example, a constant (intercept) term is added
by default to most regression models.
Models that require numerical features automatically convert categorical
variables to a set of indicator variables using a suitable encoding.
In polynomial regression, a single input variable is expanded to
a set of powers of the variable.
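These three kinds of derived variables can be sketched with pandas and numpy. The column names and encoding choices below are illustrative assumptions, not the library's own conventions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x":     [1.0, 2.0, 3.0],
    "group": ["a", "b", "a"],
})

# A constant (intercept) column, added by default in most regressions:
intercept = np.ones(len(df))

# Categorical -> indicator (dummy) variables; dropping the first level
# avoids a linear dependency with the intercept column:
dummies = pd.get_dummies(df["group"], prefix="group", drop_first=True)

# Polynomial expansion: one input variable becomes a set of its powers.
powers = pd.DataFrame({f"x^{p}": df["x"] ** p for p in (1, 2, 3)})
```

In each case, one original variable gives rise to one or more input variables in the form the fitting algorithm expects.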
The classes that fit statistical models expect the input to be
in a specific format. The input consists of one or more groups
of variables, one group for each of the roles described above.
Once a model has been fitted, it may turn out that some input variables
are not used in the final model. This can happen, for example, when
some variables are constant, or when there is a linear dependency between
variables.
So, in summary, a model has three sets of variables:
The variables that are supplied as input to the model.
We call these the original variables.
The variables that are derived from the input variables
into a form suitable for use by the fitting algorithm.
We call these the input variables.
The variables that are present in the fitted model.
We call these the model variables.
Once the model has been fitted, it must be validated.
Models have a large number of properties that
give information about the quality of the fit,
including residuals, R² values, and so on. Often some form of goodness-of-fit
test is also available.
Validation is specific to the type of model being fitted,
and is discussed in detail in later sections.
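As a taste of what such diagnostic properties compute, here is a hedged numpy sketch of three common ones for a least-squares fit; a fitted model object would expose these as ready-made properties rather than requiring this arithmetic, and the data here is invented.

```python
import numpy as np

# A small illustrative data set and a simple linear fit y ~ 1 + x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Diagnostics commonly exposed as properties of a fitted model:
residuals = y - X @ beta
rss = residuals @ residuals                  # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)            # total sum of squares
n, p = X.shape
r_squared = 1 - rss / tss
adj_r_squared = 1 - (rss / (n - p)) / (tss / (n - 1))
residual_std_error = np.sqrt(rss / (n - p))
```

The adjusted R² penalizes the plain R² for the number of fitted parameters, which is why it is always the smaller of the two.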
While some models are created purely for exploratory purposes,
most models are used to make predictions based on new data.
Regression and classification models have an overloaded
prediction method that takes a vector or data frame and produces the model's
prediction for the supplied data. These methods take an
argument that specifies which set of variables is being passed.
The available options reflect the three sets of variables discussed earlier:
The data are the original variables, as passed to the model.
The data are the input variables: the variables in the format expected by
the model, derived from the original variables.
The data are the model variables: the variables that are present in the
final model.
The setting is inferred automatically from the number of variables,
giving preference to original variables.
Automatic selection is the default. However, this may lead to
unexpected results when there is insufficient information to distinguish
between two options.
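The distinction between passing original and input variables can be sketched as follows. The `predict` function and its `variables` argument are hypothetical names invented for this illustration; they only mimic the behavior described above.

```python
import numpy as np
import pandas as pd

# Fit y ~ 1 + x on a small data set (y = 2x + 1 exactly).
train = pd.DataFrame({"x": [0.0, 1.0, 2.0, 3.0], "y": [1.0, 3.0, 5.0, 7.0]})
X = np.column_stack([np.ones(len(train)), train["x"]])
beta, *_ = np.linalg.lstsq(X, train["y"].to_numpy(), rcond=None)

def predict(data, variables="automatic"):
    """Toy predict method with a variable-set argument (illustrative only)."""
    data = np.atleast_2d(np.asarray(data, dtype=float))
    if variables == "automatic":
        # Infer from the number of columns, preferring original variables.
        variables = "original" if data.shape[1] == 1 else "input"
    if variables == "original":
        # Derive the intercept column, as the model did during fitting.
        data = np.column_stack([np.ones(len(data)), data])
    return data @ beta

# Original variables: just x.  Input variables: intercept and x.
p1 = predict([[4.0]], variables="original")
p2 = predict([[1.0, 4.0]], variables="input")
```

Both calls yield the same prediction; the ambiguity arises only when automatic selection cannot tell from the column count which set was supplied.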
Models may be predictive in that they
model an outcome in terms of known inputs.
The inputs are known as independent variables, predictor variables
or features. The outputs are known as dependent variables or targets.
Regression models are predictive models
that express a continuous
variable in terms of one or more predictor variables,
which may be continuous or categorical.
ANOVA models are a special case where
the predictor variables are categorical.
Time series models are another special
case where the values of the dependent variable are correlated,
and so lagged versions of the dependent variable also appear
as independent variables.
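The construction of lagged variables can be illustrated briefly. The series below is invented, and the AR(1)-style regression is fitted with plain least squares rather than a dedicated time series class.

```python
import numpy as np
import pandas as pd

# A short, serially correlated series (illustrative values).
y = pd.Series([1.0, 1.5, 1.8, 2.0, 2.1, 2.2, 2.25])

# A lagged copy of the dependent variable becomes an independent variable;
# the first row has no lagged value and is dropped.
data = pd.DataFrame({"y": y, "y_lag1": y.shift(1)}).dropna()

# Fit the AR(1)-style regression y_t ~ 1 + y_{t-1} by least squares.
X = np.column_stack([np.ones(len(data)), data["y_lag1"]])
beta, *_ = np.linalg.lstsq(X, data["y"].to_numpy(), rcond=None)
```

The same variable thus appears on both sides of the model, once as the target and once, lagged, as a feature.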
Classification models are predictive models
that attempt to assign observations to one of two or more classes.
Clustering models attempt to group observations
purely based on some measure of similarity without reference to
predefined labels. Clustering models only have features.
There are no dependent variables.
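The absence of a target is easy to see in a sketch. The following runs a single assignment step of Lloyd's k-means algorithm with hand-picked initial centroids; a real clustering model would iterate until the assignments stabilize, and the data is invented.

```python
import numpy as np

# Two obvious groups of 2-D observations; note there is no target column,
# only features.
points = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                   [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])

# One assignment step of k-means with fixed initial centroids (a sketch).
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = dist.argmin(axis=1)
```

The cluster labels are produced by the model itself from similarity alone; they are an output, never an input.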
Transformation models attempt to bring out
the most relevant features. A common application is
dimensionality reduction, i.e. reducing
the total number of features that are included in a model.
Dimensionality reduction may be used as a preprocessing
step when building predictive models.
Transformation models only have features.
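Dimensionality reduction can be sketched with principal component analysis computed via the singular value decomposition. The synthetic data below deliberately makes the third feature nearly redundant, so two components capture almost all of the variance; this is an illustration, not the library's own PCA class.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three features, but the third is nearly a copy of the first:
a = rng.normal(size=50)
b = rng.normal(size=50)
X = np.column_stack([a, b, a + 1e-3 * rng.normal(size=50)])

# PCA via the SVD of the centered data; keep the top 2 components.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
reduced = Xc @ Vt[:2].T

# Fraction of total variance explained by each component.
explained = s**2 / np.sum(s**2)
```

The reduced two-column data can then serve as the feature set of a downstream predictive model, as described above.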