Extreme Optimization™: Complexity made simple.

Numerical Components
for .NET

Principal Component Analysis

Principal component analysis (PCA) is a data reduction technique that expresses a data set in terms of components or combinations of variables that contribute most to the variation in the data. As a result, the total number of variables used to describe the data is reduced, at the cost of losing some of the fine-grained information in the data.

Defining PCA models

All classes related to principal component analysis reside in the Extreme.Statistics.Multivariate namespace. The main type is PrincipalComponentAnalysis, which represents a single principal component analysis.

The PrincipalComponentAnalysis class has four constructors. The first constructor takes one parameter: a Matrix whose columns contain the data to be analyzed. The second constructor also takes one argument: an array of NumericalVariable objects.

The third and fourth constructors each take two parameters. The third constructor takes a VariableCollection as its first argument. The second argument is an array of strings that contains the names of the variables from the collection that should be included in the analysis. The fourth constructor takes a System.Data.DataTable as its first argument. The second argument is once again an array of strings that this time contains the names of the columns to be included in the analysis.

Performing the analysis

When the variables in a principal component analysis use very different scales, the principal components give more weight to the variables with the larger values. To put all variables on an equal footing, the variables are often standardized to have mean zero and unit standard deviation. The Standardize property determines whether this transformation is performed. The default is true.

The Compute() method performs the actual calculations. The following example reads data from a delimited text file into a matrix, then sets up and computes a principal component analysis of the columns of that matrix:

C#
DelimitedTextMatrixReader reader = new DelimitedTextMatrixReader(@"..\..\..\..\Data\Depress.txt");
reader.MergeConsecutiveDelimiters = true;
reader.SetColumnDelimiters(' ');
Matrix m = reader.ReadMatrix();
// The data we want is in columns 8 through 27:
m = m.GetSubmatrix(0, m.RowCount - 1, 8, 27);
PrincipalComponentAnalysis pca = new PrincipalComponentAnalysis(m);
pca.Compute();
Visual Basic
Dim reader As New DelimitedTextMatrixReader("..\..\..\..\Data\Depress.txt")
reader.MergeConsecutiveDelimiters = True
reader.SetColumnDelimiters(" "c)
Dim m As Matrix = reader.ReadMatrix()
' The data we want is in columns 8 through 27:
m = m.GetSubmatrix(0, m.RowCount - 1, 8, 27)
Dim pca As New PrincipalComponentAnalysis(m)
pca.Compute()
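
To make the effect of the Standardize property concrete, the sketch below standardizes the columns of a small array by hand. It is an illustration only: StandardizeSketch and StandardizeColumns are hypothetical names that are not part of the library, and the use of the sample standard deviation (divisor n - 1) is an assumption, since this page does not say which divisor the library uses.

```csharp
using System;

// Hypothetical helper for illustration; not part of the Extreme Optimization library.
public static class StandardizeSketch
{
    // Returns a copy of data (rows = observations, columns = variables)
    // in which every column has mean 0 and standard deviation 1.
    public static double[,] StandardizeColumns(double[,] data)
    {
        int rows = data.GetLength(0);
        int cols = data.GetLength(1);
        double[,] result = new double[rows, cols];
        for (int j = 0; j < cols; j++)
        {
            // Column mean.
            double mean = 0.0;
            for (int i = 0; i < rows; i++)
                mean += data[i, j];
            mean /= rows;

            // Assumption: sample standard deviation (divisor n - 1).
            // A constant column has zero deviation and would produce NaN values.
            double sumSquares = 0.0;
            for (int i = 0; i < rows; i++)
                sumSquares += (data[i, j] - mean) * (data[i, j] - mean);
            double deviation = Math.Sqrt(sumSquares / (rows - 1));

            for (int i = 0; i < rows; i++)
                result[i, j] = (data[i, j] - mean) / deviation;
        }
        return result;
    }

    public static void Main()
    {
        // Two variables on very different scales.
        double[,] data = { { 1.0, 1000.0 }, { 2.0, 3000.0 }, { 3.0, 2000.0 } };
        double[,] z = StandardizeColumns(data);
        for (int i = 0; i < z.GetLength(0); i++)
            Console.WriteLine("{0,8:F4}{1,8:F4}", z[i, 0], z[i, 1]);
    }
}
```

After the transformation both columns contain the values -1, 0 and 1, so neither variable dominates the analysis merely because of its units.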

Once the computations are complete, a number of properties and methods give access to the results in detail.

Results of the Analysis

The Components property provides access to a collection of PrincipalComponent objects that provide details about each of the principal components. The components are sorted in descending order of their contribution to the variance in the data.

The VarianceProportions and CumulativeVarianceProportions properties summarize the contribution of the components. The GetVarianceThreshold(Double) method calculates how many components are needed to explain a certain proportion of the total variation in the data.
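
The arithmetic behind these summaries can be sketched in a few lines. The code below is not the library's implementation: VarianceSketch and its methods are hypothetical names, the eigenvalues are made-up numbers, and the choice to count a component as soon as the cumulative proportion is greater than or equal to the threshold is an assumption.

```csharp
using System;

// Hypothetical helper for illustration; not part of the Extreme Optimization library.
public static class VarianceSketch
{
    // Fraction of the total variance carried by each component,
    // given eigenvalues sorted in descending order.
    public static double[] Proportions(double[] eigenvalues)
    {
        double total = 0.0;
        foreach (double eigenvalue in eigenvalues)
            total += eigenvalue;
        double[] proportions = new double[eigenvalues.Length];
        for (int i = 0; i < proportions.Length; i++)
            proportions[i] = eigenvalues[i] / total;
        return proportions;
    }

    // Running sum of the proportions.
    public static double[] Cumulative(double[] proportions)
    {
        double[] cumulative = new double[proportions.Length];
        double sum = 0.0;
        for (int i = 0; i < proportions.Length; i++)
        {
            sum += proportions[i];
            cumulative[i] = sum;
        }
        return cumulative;
    }

    // Smallest number of leading components whose cumulative
    // proportion reaches the threshold (assumption: >= comparison).
    public static int ComponentsNeeded(double[] eigenvalues, double threshold)
    {
        double[] cumulative = Cumulative(Proportions(eigenvalues));
        for (int i = 0; i < cumulative.Length; i++)
            if (cumulative[i] >= threshold)
                return i + 1;
        return cumulative.Length;
    }

    public static void Main()
    {
        // Made-up eigenvalues, sorted in descending order.
        double[] eigenvalues = { 4.0, 2.0, 1.0, 1.0 };
        double[] proportions = Proportions(eigenvalues);
        double[] cumulative = Cumulative(proportions);
        for (int i = 0; i < eigenvalues.Length; i++)
            Console.WriteLine("{0,2}{1,10:F4}{2,10:F4}", i, proportions[i], cumulative[i]);
        Console.WriteLine("Components needed for 90%: {0}", ComponentsNeeded(eigenvalues, 0.9));
    }
}
```

With these eigenvalues the cumulative proportions are 0.5, 0.75, 0.875 and 1, so all four components are needed to reach 90%, while two are enough to reach 75%.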

PrincipalComponent objects provide more detailed information. The Eigenvalue property returns the eigenvalue corresponding to the component. This is an absolute measure of the size of the contribution. The EigenvalueDifference property returns the difference between the eigenvalues of the component and the next most significant component. This gives another indication of the significance of a component: the greater the difference, the more important the component is compared to the remaining components. The ProportionOfVariance and CumulativeProportionOfVariance properties give the contribution of the component to the variation in the data in relative terms. Finally, the Value property returns the component as a Vector. The code below illustrates these properties:

C#
Console.WriteLine(" #    Eigenvalue Difference Contribution Contrib. %");
for (int i = 0; i < 5; i++)
{ 
    // We get the ith component from the model...
    PrincipalComponent component = pca.Components[i];
    // and write out its properties
    Console.WriteLine("{0,2}{1,12:F4}{2,11:F4}{3,14:F3}%{4,10:F3}%",
        i, component.Eigenvalue, component.EigenvalueDifference, 
        100 * component.ProportionOfVariance,
        100 * component.CumulativeProportionOfVariance);
}
Visual Basic
Console.WriteLine(" #    Eigenvalue Difference Contribution Contrib. %")
For i As Integer = 0 To 4
    ' We get the ith component from the model...
    Dim component As PrincipalComponent = pca.Components(i)
    ' and write out its properties
    Console.WriteLine("{0,2}{1,12:F4}{2,11:F4}{3,14:F3}%{4,10:F3}%", _
        i, component.Eigenvalue, component.EigenvalueDifference, _
        100 * component.ProportionOfVariance, _
        100 * component.CumulativeProportionOfVariance)
Next

The ComponentMatrix property returns the components as the columns of a matrix. The ScoreMatrix property expresses the observations in terms of the components. The GetPredictions(Int32) method returns the observations as they would appear if only the specified number of components were taken into account. The sample code below shows how to get the predictions for the components that explain 90% of the variation in the data:

C#
int count = pca.GetVarianceThreshold(0.9);
Console.WriteLine("Components needed to explain 90% of variation: {0}", count);
VariableCollection prediction = pca.GetPredictions(count);
Console.WriteLine("Predictions using {0} components:", count);
Console.WriteLine("   Pr. 1  Act. 1   Pr. 2  Act. 2   Pr. 3  Act. 3   Pr. 4  Act. 4");
for (int i = 0; i < 10; i++)
    Console.WriteLine("{0,8:F4}{1,8:F4}{2,8:F4}{3,8:F4}{4,8:F4}{5,8:F4}{6,8:F4}{7,8:F4}",
        prediction[0].GetValue(i), m[i, 0],
        prediction[1].GetValue(i), m[i, 1],
        prediction[2].GetValue(i), m[i, 2],
        prediction[3].GetValue(i), m[i, 3]);
Visual Basic
Dim count As Integer = pca.GetVarianceThreshold(0.9)
Console.WriteLine("Components needed to explain 90% of variation: {0}", count)
Dim prediction As VariableCollection = pca.GetPredictions(count)
Console.WriteLine("Predictions using {0} components:", count)
Console.WriteLine("   Pr. 1  Act. 1   Pr. 2  Act. 2   Pr. 3  Act. 3   Pr. 4  Act. 4")
For i As Integer = 0 To 9
    Console.WriteLine("{0,8:F4}{1,8:F4}{2,8:F4}{3,8:F4}{4,8:F4}{5,8:F4}{6,8:F4}{7,8:F4}", _
        prediction(0).GetValue(i), m(i, 0), _
        prediction(1).GetValue(i), m(i, 1), _
        prediction(2).GetValue(i), m(i, 2), _
        prediction(3).GetValue(i), m(i, 3))
Next
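
The relationship between the component matrix, the scores and the predictions can be written out using the standard PCA reconstruction. The symbols below are introduced here for illustration and do not appear in the library. It is an assumption that ComponentMatrix corresponds to V, that ScoreMatrix corresponds to Z, and that GetPredictions returns values in the original units, although the samples above compare the predictions directly with the entries of m.

```latex
% X_s : n x p data matrix after centering (and scaling, if Standardize is true)
% V   : p x p orthogonal matrix whose columns are the principal components
% V_k : the first k columns of V
\begin{align}
  Z &= X_s V, & Z_k &= X_s V_k \\
  \hat{X}_s &= Z_k V_k^{\mathsf{T}} = X_s V_k V_k^{\mathsf{T}} \\
  \hat{x}_{ij} &= \bar{x}_j + s_j \, \hat{x}^{(s)}_{ij}
\end{align}
```

Here \bar{x}_j and s_j are the mean and standard deviation of variable j, and the last line undoes the standardization. When k = p, V V^T is the identity matrix, so the predictions reproduce the data exactly. For smaller k the difference between prediction and observation is the information lost by dropping the remaining components.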

Send comments on this topic to support@extremeoptimization.com

Copyright © 2003-2010, Extreme Optimization. All rights reserved.
Extreme Optimization, Complexity made simple, M#, and M Sharp are trademarks of ExoAnalytics Inc.
Microsoft, Visual C#, Visual Basic, Visual Studio, Visual Studio.NET, and the Optimized for Visual Studio logo
are registered trademarks of Microsoft Corporation.