Principal component analysis (PCA) is a data reduction technique that expresses a data set
in terms of components or combinations of variables that contribute most to the variation
in the data. As a result, the total number of variables used to describe the data is reduced,
at the cost of losing some of the fine-grained information in the data.
Defining PCA models
All classes related to Principal Component Analysis reside in the Extreme.Statistics.Multivariate namespace.
The main type is PrincipalComponentAnalysis,
which represents a principal component analysis.
The PrincipalComponentAnalysis
class has four constructors. The first constructor takes one parameter: a
Matrix whose columns contain the data to be analyzed.
The second constructor also takes one argument: an array of
NumericalVariable objects.
The third and fourth constructors each take two parameters. The third constructor takes a
VariableCollection as its first argument. The second argument is an array of strings
that contains the names of the variables from the collection that should be included in the analysis.
The fourth constructor takes a System.Data.DataTable as its first argument. The second argument
is again an array of strings, this time containing the names of the columns to be included in the analysis.
Performing the analysis
When the variables in a principal component analysis use very different scales, the principal components will give more weight to
the variables with the larger values. To put all variables on an equal footing, the variables are often standardized to have
mean zero and unit standard deviation. The Standardize
property determines whether this transformation is performed. The default is true.
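To make the transformation concrete, here is a minimal, library-independent sketch of standardizing a single variable. The helper StandardizeColumn and the sample values are hypothetical, and the sketch assumes the sample standard deviation (dividing by n - 1); the library may use a different convention internally.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of the library: rescales one variable
// to mean zero and unit sample standard deviation.
static double[] StandardizeColumn(double[] values)
{
    double mean = values.Average();
    double sumOfSquares = values.Sum(v => (v - mean) * (v - mean));
    // Assumption: sample standard deviation (n - 1 in the denominator).
    double standardDeviation = Math.Sqrt(sumOfSquares / (values.Length - 1));
    return values.Select(v => (v - mean) / standardDeviation).ToArray();
}

// Two made-up variables measured on very different scales.
double[] incomeInDollars = { 32000, 45000, 51000, 78000 };
double[] ageInYears = { 25, 31, 38, 52 };

// After standardization both variables vary over a comparable range,
// so neither dominates the principal components because of its units.
Console.WriteLine(string.Join("  ",
    StandardizeColumn(incomeInDollars).Select(x => x.ToString("F3"))));
Console.WriteLine(string.Join("  ",
    StandardizeColumn(ageInYears).Select(x => x.ToString("F3"))));
```

Without this step, the income variable above would account for nearly all of the variance simply because its values are thousands of times larger.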
The Compute() method performs the actual calculations.
The following example reads data from a delimited text file into a matrix, and then sets up and computes a principal component
analysis of the columns of that matrix:
| C# |
|---|
// Read the space-delimited data file into a matrix.
DelimitedTextMatrixReader reader = new DelimitedTextMatrixReader(@"..\..\..\..\Data\Depress.txt");
reader.MergeConsecutiveDelimiters = true;
reader.SetColumnDelimiters(' ');
Matrix m = reader.ReadMatrix();
// Keep all rows, but only columns 8 through 27.
m = m.GetSubmatrix(0, m.RowCount - 1, 8, 27);
// Set up the analysis on the columns of m and run it.
PrincipalComponentAnalysis pca = new PrincipalComponentAnalysis(m);
pca.Compute();
|
| Visual Basic |
|---|
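' Read the space-delimited data file into a matrix.
Dim reader As New DelimitedTextMatrixReader("..\..\..\..\Data\Depress.txt")
reader.MergeConsecutiveDelimiters = True
reader.SetColumnDelimiters(" "c)
Dim m As Matrix = reader.ReadMatrix()
' Keep all rows, but only columns 8 through 27.
m = m.GetSubmatrix(0, m.RowCount - 1, 8, 27)
' Set up the analysis on the columns of m and run it.
Dim pca As New PrincipalComponentAnalysis(m)
pca.Compute()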
|
Once the computations are complete, a number of properties and methods give access to the results in detail.
Results of the analysis
The Components property provides access to
a collection of PrincipalComponent objects that provide details about
each of the principal components.
The components are sorted in descending order of their contribution to the variance in the data.
The VarianceProportions and
CumulativeVarianceProportions properties summarize
the contribution of the components. The
GetVarianceThreshold(Double)
method calculates how many components
are needed to explain a certain proportion of the total variation in the data.
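The relationship between eigenvalues, variance proportions, and the threshold count can be sketched without the library. The helpers VarianceProportions and ComponentsNeeded below are hypothetical illustrations, the eigenvalues are made up, and the sketch assumes that a cumulative proportion equal to the threshold counts as reaching it; the library's own rule at the boundary may differ.

```csharp
using System;
using System.Linq;

// Hypothetical helper: each eigenvalue's share of the total variance.
static double[] VarianceProportions(double[] eigenvalues)
{
    double total = eigenvalues.Sum();
    return eigenvalues.Select(e => e / total).ToArray();
}

// Hypothetical helper: the smallest number of leading components whose
// cumulative proportion reaches the requested threshold.
static int ComponentsNeeded(double[] eigenvalues, double threshold)
{
    double[] proportions = VarianceProportions(eigenvalues);
    double cumulative = 0.0;
    for (int k = 0; k < proportions.Length; k++)
    {
        cumulative += proportions[k];
        if (cumulative >= threshold) return k + 1;
    }
    return proportions.Length;
}

// Made-up eigenvalues, sorted in descending order.
double[] sampleEigenvalues = { 4.0, 2.0, 1.0, 0.5, 0.5 };

// Print the proportion and the running (cumulative) proportion per component.
double running = 0.0;
foreach (double proportion in VarianceProportions(sampleEigenvalues))
{
    running += proportion;
    Console.WriteLine("{0,8:F4}{1,10:F4}", proportion, running);
}
Console.WriteLine("Components needed for 80%: {0}",
    ComponentsNeeded(sampleEigenvalues, 0.8));
```

Here the first two components explain 75% of the variation and the first three explain 87.5%, so three components are needed to reach an 80% threshold.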
PrincipalComponent objects provide more detailed information.
The Eigenvalue property returns the eigenvalue corresponding to the
component. This is an absolute measure of the size of the contribution.
The EigenvalueDifference property returns the difference between the
eigenvalues of the component and the next most significant component. This gives another indication of the significance of a component.
The greater the difference, the more important the component is compared to the remaining components.
The ProportionOfVariance and
CumulativeProportionOfVariance properties
give the contribution of the component to the variation in the data in relative terms.
Finally, the Value property returns the component as
a Vector. The code below illustrates these properties:
| C# |
|---|
Console.WriteLine(" #  Eigenvalue Difference   Contribution Contrib. %");
// Print the details of the five most significant components.
for (int i = 0; i < 5; i++)
{
    PrincipalComponent component = pca.Components[i];
    Console.WriteLine("{0,2}{1,12:F4}{2,11:F4}{3,14:F3}%{4,10:F3}%",
        i, component.Eigenvalue, component.EigenvalueDifference,
        100 * component.ProportionOfVariance,
        100 * component.CumulativeProportionOfVariance);
}
|
| Visual Basic |
|---|
Console.WriteLine(" #  Eigenvalue Difference   Contribution Contrib. %")
' Print the details of the five most significant components.
For i As Integer = 0 To 4
    Dim component As PrincipalComponent = pca.Components(i)
    Console.WriteLine("{0,2}{1,12:F4}{2,11:F4}{3,14:F3}%{4,10:F3}%", _
        i, component.Eigenvalue, component.EigenvalueDifference, _
        100 * component.ProportionOfVariance, _
        100 * component.CumulativeProportionOfVariance)
Next
|
The ComponentMatrix property returns
the components as the columns of a matrix.
The ScoreMatrix property expresses the observations
in terms of the components.
The GetPredictions(Int32) method
returns the observations as they are reconstructed when only the specified number of components is taken into account.
The sample code below shows how to get the predictions for the components that explain 90% of the variation in the data:
| C# |
|---|
int count = pca.GetVarianceThreshold(0.9);
Console.WriteLine("Components needed to explain 90% of variation: {0}", count);
VariableCollection prediction = pca.GetPredictions(count);
Console.WriteLine("Predictions using {0} components:", count);
Console.WriteLine("   Pr. 1  Act. 1   Pr. 2  Act. 2   Pr. 3  Act. 3   Pr. 4  Act. 4");
// Compare predicted and actual values of the first four variables
// for the first ten observations.
for (int i = 0; i < 10; i++)
    Console.WriteLine("{0,8:F4}{1,8:F4}{2,8:F4}{3,8:F4}{4,8:F4}{5,8:F4}{6,8:F4}{7,8:F4}",
        prediction[0].GetValue(i), m[i, 0],
        prediction[1].GetValue(i), m[i, 1],
        prediction[2].GetValue(i), m[i, 2],
        prediction[3].GetValue(i), m[i, 3]);
|
| Visual Basic |
|---|
Dim count As Integer = pca.GetVarianceThreshold(0.9)
Console.WriteLine("Components needed to explain 90% of variation: {0}", count)
Dim prediction As VariableCollection = pca.GetPredictions(count)
Console.WriteLine("Predictions using {0} components:", count)
Console.WriteLine("   Pr. 1  Act. 1   Pr. 2  Act. 2   Pr. 3  Act. 3   Pr. 4  Act. 4")
' Compare predicted and actual values of the first four variables
' for the first ten observations.
For i As Integer = 0 To 9
    Console.WriteLine("{0,8:F4}{1,8:F4}{2,8:F4}{3,8:F4}{4,8:F4}{5,8:F4}{6,8:F4}{7,8:F4}", _
        prediction(0).GetValue(i), m(i, 0), _
        prediction(1).GetValue(i), m(i, 1), _
        prediction(2).GetValue(i), m(i, 2), _
        prediction(3).GetValue(i), m(i, 3))
Next
|
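To illustrate what a prediction from a limited number of components means, the following library-independent sketch rebuilds a single observation from its leading component scores. The helper Reconstruct, the two components, and the observation are all hypothetical, and the observation is assumed to be already centered; this shows the general idea only, not how GetPredictions is implemented, which presumably also reverses any centering and standardization.

```csharp
using System;

// Hypothetical helper: rebuilds one centered observation from its first
// `count` component scores. components[j] is the j-th unit-length component.
static double[] Reconstruct(double[] centered, double[][] components, int count)
{
    double[] result = new double[centered.Length];
    for (int j = 0; j < count; j++)
    {
        // The score is the projection of the observation onto component j.
        double score = 0.0;
        for (int v = 0; v < centered.Length; v++)
            score += centered[v] * components[j][v];
        // Add that component's share back into the approximation.
        for (int v = 0; v < centered.Length; v++)
            result[v] += score * components[j][v];
    }
    return result;
}

// Two made-up orthonormal components in two dimensions.
double r = Math.Sqrt(0.5);
double[][] axes =
{
    new[] { r, r },   // first component: the diagonal direction
    new[] { r, -r }   // second component: perpendicular to the first
};
double[] observation = { 3.0, 1.0 };

// One component gives an approximation; all components recover the observation.
double[] approx1 = Reconstruct(observation, axes, 1);
double[] approx2 = Reconstruct(observation, axes, 2);
Console.WriteLine("1 component:  ({0:F4}, {1:F4})", approx1[0], approx1[1]);
Console.WriteLine("2 components: ({0:F4}, {1:F4})", approx2[0], approx2[1]);
```

Using fewer components discards the part of each observation that lies along the less significant directions, which is why the predicted and actual columns in the output above differ.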