Partial Least Squares

Partial least squares is a technique that fits combinations of independent variables called factors to one or more dependent variables. The factors are chosen to maximize the covariance between the factors and the dependent variables.

Partial least squares is useful when the number of independent variables is large compared to the number of observations, or when variables are highly correlated.

Constructing Partial Least Squares Models

The PartialLeastSquaresModel class has four constructors. The first constructor takes three arguments. The first is a Vector<T> that represents the dependent variable. The second is a parameter array of vectors that represent the independent variables. The last argument is the number of factors that should be computed. This creates a Partial Least Squares model with one dependent variable, sometimes called PLS1. The second constructor takes a matrix instead of a vector as the first argument. This constructs a multivariate PLS model, sometimes called PLS2, where each column of the matrix represents a dependent variable.

In the example below, we create two Partial Least Squares models using random data. The first has one dependent variable, 10 independent variables and 20 observations. The second has 3 dependent variables. In both cases, we're asking for 5 factors:

var dependent = Vector.CreateRandom(20);
var independents = Matrix.CreateRandom(20, 10);
var model1 = new PartialLeastSquaresModel(dependent, independents, 5);
var dependents = Matrix.CreateRandom(20, 3);
var model2 = new PartialLeastSquaresModel(dependents, independents, 5);

Visual Basic

Dim dependent = Vector.CreateRandom(20)
Dim independents = Matrix.CreateRandom(20, 10)
Dim model1 = New PartialLeastSquaresModel(dependent, independents, 5)
Dim dependents = Matrix.CreateRandom(20, 3)
Dim model2 = New PartialLeastSquaresModel(dependents, independents, 5)

F#

let dependent = Vector.CreateRandom(20)
let independents = Matrix.CreateRandom(20, 10)
let model1 = new PartialLeastSquaresModel(dependent, independents, 5)
let dependents = Matrix.CreateRandom(20, 3)
let model2 = new PartialLeastSquaresModel(dependents, independents, 5)

The third constructor takes four arguments. The first argument is an IDataFrame (a DataFrame<R, C> or Matrix<T>) that contains the variables to be used in the regression. The second argument is an array of strings containing the names of the dependent variables. The third argument is an array of strings containing the names of the independent variables. All the names must exist in the column index of the data frame specified by the first argument. The last argument is once again the number of factors.

In the code that follows, we give the two matrices of dependent and independent variables a column index. We join these matrices to get a matrix that can act as a data frame. We then use this matrix, along with the arrays of column names, to construct the same PLS model:

var xNames = new string[] {
    "x1", "x2", "x3", "x4", "x5",
    "x6", "x7","x8", "x9", "x10" };
independents.ColumnIndex = Index.Create(xNames);
var yNames = new string[] { "y1", "y2", "y3" };
dependents.ColumnIndex = Index.Create(yNames);
// A matrix can act as a data frame:
var all = Matrix.JoinHorizontal(independents, dependents);
var model3 = new PartialLeastSquaresModel(all, yNames, xNames, 5);

Visual Basic

Dim xNames = { 
    "x1", "x2", "x3", "x4", "x5",
    "x6", "x7", "x8", "x9", "x10" }
independents.ColumnIndex = Index.Create(xNames)
Dim yNames = { "y1", "y2", "y3" }
dependents.ColumnIndex = Index.Create(yNames)
' A matrix can act as a data frame
Dim all = Matrix.JoinHorizontal(independents, dependents)
Dim model3 = New PartialLeastSquaresModel(all, yNames, xNames, 5)

F#

let xNames = [|
                "x1"; "x2"; "x3"; "x4"; "x5";
                "x6"; "x7";"x8"; "x9"; "x10"
             |]
independents.ColumnIndex <- Index.Create(xNames)
let yNames = [| "y1"; "y2"; "y3" |]
dependents.ColumnIndex <- Index.Create(yNames)
// A matrix can act as a data frame:
let all = Matrix.JoinHorizontal(independents, dependents)
let model3 = new PartialLeastSquaresModel(all, yNames, xNames, 5)

The fourth constructor takes three arguments. The first argument once again contains the data. The second is a string that contains a formula that describes the model. The third is the number of factors. See the section on formulas for details on the formula syntax. The same model as above can be defined using a formula as:

var model4 = new PartialLeastSquaresModel(all, "y1 + y2 + y3 ~ .", 5);

Visual Basic

Dim model4 = New PartialLeastSquaresModel(all, "y1 + y2 + y3 ~ .", 5)

F#

let model4 = new PartialLeastSquaresModel(all, "y1 + y2 + y3 ~ .", 5)

We used the special . term in the right-hand side to capture all remaining columns as independent variables.

Computing the Model

The Fit method performs the actual analysis. Most properties and methods throw an exception when they are accessed before the Fit method is called. You can verify that the model has been calculated by inspecting the Computed property.

Fitting the model is done with one of two standard algorithms: NIPALS (Nonlinear Iterative PArtial Least Squares) or SIMPLS (Statistically Inspired Modification of Partial Least Squares). The two algorithms give identical results when there is only one dependent variable.

By default, the NIPALS algorithm is used. You can change this by setting the Method property. This property is of type PartialLeastSquaresMethod and can take on the following values:

Method	Description
Nipals	Use the original Nonlinear Iterative PArtial Least Squares method (NIPALS).
Simpls	Use the Statistically Inspired Modification of Partial Least Squares method (SIMPLS) of de Jong.
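
As a rough illustration of what the NIPALS option computes, the following library-independent sketch runs the iteration for a single dependent variable on plain arrays. It assumes the data have already been centered. The class name NipalsSketch and its Fit signature are made up for this example and are not part of the PartialLeastSquaresModel API, whose internal implementation may differ:

```csharp
using System;

// Hypothetical helper for illustration only; not part of the library.
public static class NipalsSketch
{
    // x: centered independent variables (rows = observations).
    // y: centered dependent variable.
    // Returns the fitted (centered) values; scores and loadings receive T and P.
    public static double[] Fit(double[,] x, double[] y, int factors,
        out double[,] scores, out double[,] loadings)
    {
        int n = x.GetLength(0), m = x.GetLength(1);
        var e = (double[,])x.Clone();   // X residual, deflated after each factor
        var f = (double[])y.Clone();    // y residual
        var fitted = new double[n];
        scores = new double[n, factors];
        loadings = new double[m, factors];

        for (int a = 0; a < factors; a++)
        {
            // Weights w = E'f: the direction with maximal covariance with y.
            var w = new double[m];
            double norm = 0.0;
            for (int j = 0; j < m; j++)
            {
                for (int i = 0; i < n; i++) w[j] += e[i, j] * f[i];
                norm += w[j] * w[j];
            }
            norm = Math.Sqrt(norm);
            if (norm == 0.0) break;     // nothing left to explain

            // Scores t = E w / ||w||.
            var t = new double[n];
            double tt = 0.0;
            for (int i = 0; i < n; i++)
            {
                for (int j = 0; j < m; j++) t[i] += e[i, j] * w[j] / norm;
                tt += t[i] * t[i];
            }

            // Regress y on t, accumulate the fit, then deflate y.
            double q = 0.0;
            for (int i = 0; i < n; i++) q += f[i] * t[i];
            q /= tt;
            for (int i = 0; i < n; i++)
            {
                scores[i, a] = t[i];
                fitted[i] += q * t[i];
                f[i] -= q * t[i];
            }

            // Loadings p = E't / t't, then deflate X.
            for (int j = 0; j < m; j++)
            {
                double p = 0.0;
                for (int i = 0; i < n; i++) p += e[i, j] * t[i];
                p /= tt;
                loadings[j, a] = p;
                for (int i = 0; i < n; i++) e[i, j] -= t[i] * p;
            }
        }
        return fitted;
    }
}
```

Each pass extracts one factor and removes its contribution from both residuals, which is why requesting more factors never changes the factors already computed.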

The number of components to compute can be changed by setting the NumberOfComponents property. In the next example, we compute the first model we created earlier using default settings. For the second model, we change the number of requested components to 7 and compute the model using the SIMPLS algorithm:

model1.Fit();
model2.NumberOfComponents = 7;
model2.Method = PartialLeastSquaresMethod.Simpls;
model2.Fit();

Visual Basic

model1.Fit()
model2.NumberOfComponents = 7
model2.Method = PartialLeastSquaresMethod.Simpls
model2.Fit()

F#

model1.Fit()
model2.NumberOfComponents <- 7
model2.Method <- PartialLeastSquaresMethod.Simpls
model2.Fit()

Results

The PredictedValues property returns a Matrix<T> that contains the values of the dependent variables as predicted by the model. The YResiduals property returns the differences between the actual and the predicted values of the dependent variables. Both contain one entry per observation.

The Coefficients property returns the matrix of regression coefficients of the model. The Intercepts property returns the vector of corresponding intercepts. The StandardizedCoefficients property returns a matrix of the standardized coefficients, based on centered and normalized variables.

Several properties give information about the factors and how they relate to the dependent and independent variables. In PLS, the matrices of both the independent and the dependent variables are decomposed into components, and the same terminology is used for each.

The XScores property returns a matrix that contains the scores, and XLoadings returns a matrix that contains the loadings, of the independent variables. These are the factors T and P, respectively, in the decomposition of X into TP^T. Likewise, YScores returns the scores and YLoadings the loadings of the dependent variables. These are the factors U and Q, respectively, in the decomposition of Y into UQ^T. In addition, the WeightMatrix property returns a matrix containing the projection weights for the independent variables.
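
To make the decomposition concrete, the sketch below multiplies a scores matrix by the transpose of a loadings matrix using plain arrays. It assumes scores are stored with one row per observation and loadings with one row per variable, both with one column per factor; that layout, like the DecompositionSketch name, is an assumption for this example rather than something taken from the library. In practice the product approximates the centered (and possibly scaled) data, and is only exact when enough factors are retained:

```csharp
using System;

// Hypothetical helper for illustration only; not part of the library.
public static class DecompositionSketch
{
    // Returns T * P^T, where scores (T) is n-by-a and loadings (P) is m-by-a.
    public static double[,] Reconstruct(double[,] scores, double[,] loadings)
    {
        int n = scores.GetLength(0);
        int a = scores.GetLength(1);
        int m = loadings.GetLength(0);
        if (loadings.GetLength(1) != a)
            throw new ArgumentException("Scores and loadings must have the same number of factors.");

        var result = new double[n, m];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                for (int k = 0; k < a; k++)
                    result[i, j] += scores[i, k] * loadings[j, k];
        return result;
    }
}
```

Comparing the result with the original data, factor by factor, gives a feel for how much of X or Y each additional factor accounts for.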

Making Predictions

The Predict method can be used to predict the values of the dependent variables for new data. The method has three overloads, which all take two arguments. The first overload takes a vector as its first argument. The vector contains the values of the independent variables for which a prediction should be made. The second argument, which is always optional, specifies how the values in the vector relate to the variables in the model. This overload returns a vector that contains the predictions for each of the dependent variables.

The second and third overloads take a matrix and a data frame, respectively, as their first argument. Each row in the matrix or data frame corresponds to an observation. The methods return a matrix whose rows contain the corresponding predictions for the dependent variables.
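
Conceptually, a prediction from a fitted linear model of this kind is the intercept plus the regression coefficients applied to the new values. The sketch below shows that arithmetic on plain arrays. It is not the library's Predict implementation, and it assumes a coefficient layout with one row per independent variable and one column per dependent variable, which may not match the orientation of the Coefficients property:

```csharp
using System;

// Hypothetical helper for illustration only; not part of the library.
public static class PredictSketch
{
    // x: values of the independent variables for one observation.
    // coefficients[i, j]: effect of independent variable i on dependent variable j (assumed layout).
    // intercepts[j]: intercept for dependent variable j.
    public static double[] Predict(double[] x, double[,] coefficients, double[] intercepts)
    {
        int m = coefficients.GetLength(0);
        int k = coefficients.GetLength(1);
        if (x.Length != m || intercepts.Length != k)
            throw new ArgumentException("Dimensions of the inputs do not agree.");

        // Start from the intercepts and add each variable's contribution.
        var y = (double[])intercepts.Clone();
        for (int j = 0; j < k; j++)
            for (int i = 0; i < m; i++)
                y[j] += x[i] * coefficients[i, j];
        return y;
    }
}
```

Applying the same function to every row of a matrix yields one row of predictions per observation, which mirrors the shape returned by the matrix and data frame overloads described above.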

Verifying the Quality of the Model

One of the objectives of Partial Least Squares is to capture as much of the variance as possible in both the dependent and the independent variables. The XVarianceExplained and YVarianceExplained properties return vectors that contain the proportion of variance explained by each factor. The corresponding XCumulativeVarianceExplained and YCumulativeVarianceExplained properties return the cumulative proportions.

The quality of a PLS model is often assessed using a validation test set. The Press(Matrix<Double>, Matrix<Double>) method computes the PRESS (Predicted REsidual Sum of Squares) of the model for the supplied data. It takes two arguments. The first is a matrix that contains the values of the independent variables to be tested. The second argument is a matrix that contains the values of the dependent variables. The method returns a vector of the PRESS values for each dependent variable. The RootMeanPress(Matrix<Double>, Matrix<Double>) method returns a single value: the square root of the mean of these values.
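
The following sketch spells out the two quantities on plain arrays, reading the description above literally: one sum of squared prediction errors per dependent variable, and the square root of the mean of those sums. The PressSketch class is hypothetical, and the library may normalize differently (for example by the number of observations), so treat this as an illustration of the idea rather than a reproduction of the library's results:

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration only; not part of the library.
public static class PressSketch
{
    // actual and predicted: one row per observation, one column per dependent variable.
    // Returns the sum of squared prediction errors for each dependent variable.
    public static double[] Press(double[,] actual, double[,] predicted)
    {
        int n = actual.GetLength(0);
        int k = actual.GetLength(1);
        if (predicted.GetLength(0) != n || predicted.GetLength(1) != k)
            throw new ArgumentException("Actual and predicted values must have the same shape.");

        var press = new double[k];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < k; j++)
            {
                double error = actual[i, j] - predicted[i, j];
                press[j] += error * error;
            }
        return press;
    }

    // Square root of the mean of the per-variable PRESS values.
    public static double RootMeanPress(double[,] actual, double[,] predicted)
    {
        return Math.Sqrt(Press(actual, predicted).Average());
    }
}
```

A smaller value on held-out data indicates better predictive performance, which is what makes the measure useful for choosing the number of components.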

These methods can be used to determine the ideal number of components using cross validation. In the example below, we split the input into a training and a test dataset. We print out the PRESS value for the test set for a model based on a varying number of components, from 0 to 10:

// Create subsets (sets of indices) for train and test data:
var trainingSet = new Subset(all.RowCount, 0, 9);
var testSet = new Subset(all.RowCount, 10, 20);
// Generate the train and test data sets:
var XTrain = independents.GetRows(trainingSet);
var YTrain = dependents.GetRows(trainingSet);
// Set up the model:
var model = new PartialLeastSquaresModel(YTrain, XTrain, 0);
for (int k = 0; k <= 10; k++)
{
    model.NumberOfComponents = k;
    model.Fit();
    var XTest = independents.GetRows(testSet);
    var YTest = dependents.GetRows(testSet);
    double rmPress = model.RootMeanPress(YTest, XTest);
    Console.WriteLine("{0}: {1:F6}", k, rmPress);
}

Visual Basic

' Create subsets (sets of indices) for train and test data
Dim trainingSet = New Subset(all.RowCount, 0, 9)
Dim testSet = New Subset(all.RowCount, 10, 20)
' Generate the train and test data sets
Dim XTrain = independents.GetRows(trainingSet)
Dim YTrain = dependents.GetRows(trainingSet)
' Set up the model
Dim model = New PartialLeastSquaresModel(YTrain, XTrain, 0)
For k As Integer = 0 To 10
    model.NumberOfComponents = k
    model.Fit()
    Dim XTest = independents.GetRows(testSet)
    Dim YTest = dependents.GetRows(testSet)
    Dim rmPress = model.RootMeanPress(YTest, XTest)
    Console.WriteLine("{0}: {1:F6}", k, rmPress)
Next

F#

// Create subsets (sets of indices) for train and test data:
let trainingSet = new Subset(all.RowCount, 0, 9)
let testSet = new Subset(all.RowCount, 10, 20)
// Generate the train and test data sets:
let XTrain = independents.GetRows(trainingSet)
let YTrain = dependents.GetRows(trainingSet)
// Set up the model:
let model = new PartialLeastSquaresModel(YTrain, XTrain, 0)
for k in 0..10 do
    model.NumberOfComponents <- k
    model.Fit()
    let XTest = independents.GetRows(testSet)
    let YTest = dependents.GetRows(testSet)
    let rmPress = model.RootMeanPress(YTest, XTest)
    printfn "%d: %.6f" k rmPress