Extreme Optimization™: Complexity made simple.

Numerical Components
for .NET

K-Means Cluster Analysis

Cluster analysis is the collective name given to a number of algorithms for grouping similar objects into distinct categories. It is a form of exploratory data analysis aimed at grouping observations in a way that minimizes the difference within groups while maximizing the difference between groups.

In K-Means clustering, the number of clusters is fixed at the beginning. An initial center is chosen for each cluster, and each observation is assigned to the cluster whose center is closest. Each center is then recalculated as the mean of its members. This changes the distances between cluster centers and observations, so the observations are reassigned. The process is repeated until no observation changes cluster.

Note that the final partition depends on the initial location of the centers. Different applications may choose the initial centers differently, and so may return different results for the same dataset.
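
The iteration described above can be sketched in a few lines of plain C#. This is a minimal illustration, not the library's implementation: the class KMeansSketch and its Cluster method are hypothetical names, Euclidean distance is assumed, and the initial centers are passed in explicitly so that the effect of the starting point is visible. The data are the four corners of a 4-by-1 rectangle.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not part of the library.
static class KMeansSketch
{
    // Squared Euclidean distance between two points of equal dimension.
    static double SquaredDistance(double[] a, double[] b)
    {
        double sum = 0.0;
        for (int j = 0; j < a.Length; j++)
            sum += (a[j] - b[j]) * (a[j] - b[j]);
        return sum;
    }

    // Repeats the assign/recompute steps until no observation changes cluster.
    // Returns the cluster index of each observation.
    public static int[] Cluster(double[][] data, double[][] initialCenters)
    {
        int k = initialCenters.Length, dim = data[0].Length;
        // Copy the centers so the caller's arrays are not modified.
        double[][] centers = initialCenters.Select(c => (double[])c.Clone()).ToArray();
        int[] membership = Enumerable.Repeat(-1, data.Length).ToArray();
        bool changed = true;
        while (changed)
        {
            changed = false;
            // Step 1: assign each observation to the closest center.
            for (int i = 0; i < data.Length; i++)
            {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (SquaredDistance(data[i], centers[c]) < SquaredDistance(data[i], centers[best]))
                        best = c;
                if (best != membership[i]) { membership[i] = best; changed = true; }
            }
            // Step 2: recompute each center as the mean of its members.
            for (int c = 0; c < k; c++)
            {
                double[] sum = new double[dim];
                int count = 0;
                for (int i = 0; i < data.Length; i++)
                {
                    if (membership[i] != c) continue;
                    for (int j = 0; j < dim; j++) sum[j] += data[i][j];
                    count++;
                }
                if (count == 0) continue; // an empty cluster keeps its old center
                for (int j = 0; j < dim; j++) centers[c][j] = sum[j] / count;
            }
        }
        return membership;
    }

    static void Main()
    {
        double[][] data =
        {
            new[] { 0.0, 0.0 }, new[] { 0.0, 1.0 },
            new[] { 4.0, 0.0 }, new[] { 4.0, 1.0 }
        };
        // Start A: one center on each short side of the rectangle.
        int[] a = Cluster(data, new[] { data[0], data[2] });
        // Start B: both centers on the same short side.
        int[] b = Cluster(data, new[] { data[0], data[1] });
        Console.WriteLine("Start A: " + string.Join(", ", a)); // Start A: 0, 0, 1, 1
        Console.WriteLine("Start B: " + string.Join(", ", b)); // Start B: 0, 1, 0, 1
    }
}
```

With start A the left and right pairs are grouped together. With start B the loop settles on the bottom and top pairs, which is also stable but has a much larger total within-cluster sum of squares (16 rather than 1).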

Running a cluster analysis

K-Means clustering is implemented by the KMeansClusterAnalysis class. This class has four constructors. The first constructor takes two parameters. The first is a Matrix whose columns contain the data to be analyzed. The second is the number of clusters to find. The second constructor also takes two parameters: an array of NumericalVariable objects and the number of clusters to find.

The third and fourth constructors each take three parameters. The third constructor takes a VariableCollection as its first argument. The second argument is an array of strings that contains the names of the variables from the collection that should be included in the analysis. The last parameter is the number of clusters to find. The fourth constructor takes a System.Data.DataTable as its first argument. The second argument is again an array of strings, this time containing the names of the columns to include in the analysis. The last parameter is again the number of clusters to find.

The Compute() method performs the actual calculations. Once the computations are complete, a number of properties and methods give access to the results in detail. The following code sample sets up a K-Means cluster analysis with 3 clusters:

C#
VariableCollection variables = new VariableCollection(data); 
KMeansClusterAnalysis kmc = new KMeansClusterAnalysis(variables, 3);
kmc.Standardize = true;
kmc.Compute();

Visual Basic
Dim variables As New VariableCollection(data)
Dim kmc As New KMeansClusterAnalysis(variables, 3)
kmc.Standardize = True
kmc.Compute()
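
The samples above set the Standardize property without saying what it does. A common meaning of standardization in cluster analysis is to rescale each variable to mean 0 and standard deviation 1, so that variables measured in large units do not dominate the distances. The sketch below assumes that meaning, and it assumes the sample (n - 1) standard deviation; neither is confirmed by this page. StandardizeSketch is a hypothetical helper, not part of the library.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not part of the library.
static class StandardizeSketch
{
    // Returns a copy of one variable rescaled to mean 0 and
    // sample standard deviation 1 (assumed convention).
    public static double[] Standardize(double[] values)
    {
        double mean = values.Average();
        double variance = values.Sum(v => (v - mean) * (v - mean)) / (values.Length - 1);
        double sd = Math.Sqrt(variance);
        return values.Select(v => (v - mean) / sd).ToArray();
    }

    static void Main()
    {
        // Two variables on very different scales.
        double[] incomes = { 20000, 40000, 60000 };
        double[] ages = { 30, 40, 50 };
        Console.WriteLine(string.Join(", ", Standardize(incomes))); // -1, 0, 1
        Console.WriteLine(string.Join(", ", Standardize(ages)));    // -1, 0, 1
    }
}
```

After rescaling, both variables contribute equally to the distance between two observations.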

Results of the analysis

Use the GetClusters() method to get an object of type KMeansClusterCollection, which, as the name implies, is a collection of KMeansCluster objects. In addition to the usual collection properties and methods, this class has two more methods. The GetMemberships() method returns a CategoricalVariable that indicates, for each observation, the cluster to which it belongs. The GetDistancesToCenters() method returns a NumericalVariable that indicates, for each observation, its distance from the center of its cluster.

Each KMeansCluster describes one cluster in detail. The Center property returns the center of the cluster as a Vector. The Size property returns the number of observations in the cluster. The MemberFilter property returns a Filter that selects the members of the cluster from the original dataset. The SumOfSquares property returns the within-cluster sum of squares of the distances of its members to the center. The sample code below prints information about each cluster. In addition, it prints out for each observation the cluster it belongs to and the distance to its center:

C#
KMeansClusterCollection clusters = kmc.GetClusters();
foreach (KMeansCluster cluster in clusters)
{
    Console.WriteLine("Cluster {0} has {1} members. Sum of squares: {2:F4}", 
        cluster.Index, cluster.Size, cluster.SumOfSquares);
    Console.WriteLine("Center: {0:F4}", cluster.Center);
}
CategoricalVariable memberships = clusters.GetMemberships();
NumericalVariable distances = clusters.GetDistancesToCenters();
for (int i = 0; i < memberships.Length; i++)
    Console.WriteLine("Observation {0} belongs to cluster {1}, distance: {2:F4}.", 
        i, memberships.GetLevelIndex(i), distances[i]);

Visual Basic
Dim clusters As KMeansClusterCollection = kmc.GetClusters()
For Each cluster As KMeansCluster In clusters
    Console.WriteLine("Cluster {0} has {1} members. Sum of squares: {2:F4}", _
        cluster.Index, cluster.Size, cluster.SumOfSquares)
    Console.WriteLine("Center: {0:F4}", cluster.Center)
Next
Dim memberships As CategoricalVariable = clusters.GetMemberships()
Dim distances As NumericalVariable = clusters.GetDistancesToCenters()
For i As Integer = 0 To memberships.Length - 1
    Console.WriteLine("Observation {0} belongs to cluster {1}, distance: {2:F4}.", _
        i, memberships.GetLevelIndex(i), distances(i))
Next
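
To make the reported quantities concrete, the sketch below recomputes two of them by hand for a tiny dataset: the distance of an observation to its cluster center, and the within-cluster sum of squares. It assumes Euclidean distance. ClusterSummarySketch and its methods are hypothetical names used for illustration and are not part of the library.

```csharp
using System;

// Hypothetical helper for illustration; not part of the library.
static class ClusterSummarySketch
{
    // Euclidean distance between an observation and a cluster center.
    public static double Distance(double[] point, double[] center)
    {
        double sum = 0.0;
        for (int j = 0; j < point.Length; j++)
            sum += (point[j] - center[j]) * (point[j] - center[j]);
        return Math.Sqrt(sum);
    }

    // Within-cluster sum of squares: the squared distances of the
    // members of one cluster to its center, added up.
    public static double SumOfSquares(double[][] data, int[] membership,
        int cluster, double[] center)
    {
        double total = 0.0;
        for (int i = 0; i < data.Length; i++)
        {
            if (membership[i] != cluster) continue; // skip non-members
            double d = Distance(data[i], center);
            total += d * d;
        }
        return total;
    }

    static void Main()
    {
        double[][] data = { new[] { 0.0, 0.0 }, new[] { 0.0, 1.0 }, new[] { 4.0, 0.0 } };
        int[] membership = { 0, 0, 1 };
        double[] center0 = { 0.0, 0.5 }; // mean of the two members of cluster 0
        Console.WriteLine(Distance(data[0], center0));                 // 0.5
        Console.WriteLine(SumOfSquares(data, membership, 0, center0)); // 0.5
    }
}
```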


Copyright © 2003-2010, Extreme Optimization. All rights reserved.
Extreme Optimization, Complexity made simple, M#, and M Sharp are trademarks of ExoAnalytics Inc.
Microsoft, Visual C#, Visual Basic, Visual Studio, Visual Studio.NET, and the Optimized for Visual Studio logo
are registered trademarks of Microsoft Corporation.