K-Means Cluster Analysis

Extreme Optimization Numerical Libraries for .NET Professional

Cluster analysis is the collective name given to a number of algorithms for grouping similar objects into distinct categories. It is a form of exploratory data analysis aimed at grouping observations in a way that minimizes the difference within groups while maximizing the difference between groups.

In K-Means clustering, the number of clusters is fixed at the beginning. A cluster is defined by its cluster center or centroid. A number of initial cluster centers is chosen. The observations are assigned to the closest cluster. Each centroid is then recalculated as the mean of its members. This changes the distances between cluster centers and observations, so the observations are once again reassigned. This process is repeated until no more observations change cluster.

Note that the final partition depends on the initial location of the centers. Different applications may return different results for the same dataset.
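
The procedure described above can be sketched in a few lines. The following is a minimal, library-independent illustration in Python, not the implementation used by KMeansClusterAnalysis; the function names (assign, recompute, k_means) and the sample data are hypothetical. It alternates the assignment and update steps until no observation changes cluster, and the usage at the end shows two different sets of initial centers converging to two different final partitions of the same four points.

```python
import math

def assign(points, centers):
    """Return, for each point, the index of the nearest center."""
    return [min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points]

def recompute(points, labels, centers):
    """Return each cluster's mean; an empty cluster keeps its old center."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers

def k_means(points, centers, max_iter=100):
    """Alternate assignment and update steps until no label changes."""
    labels = None
    for _ in range(max_iter):
        new_labels = assign(points, centers)
        if new_labels == labels:
            break  # no observation changed cluster
        labels = new_labels
        centers = recompute(points, labels, centers)
    return labels, centers

# Four corners of a 4-by-1 rectangle (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (4.0, 0.0), (4.0, 1.0)]

# Start from one observation on each short side: left/right split.
labels_a, centers_a = k_means(points, [points[0], points[2]])
print(labels_a)   # [0, 0, 1, 1]

# Start from two observations on the same short side: bottom/top split.
labels_b, centers_b = k_means(points, [points[0], points[1]])
print(labels_b)   # [0, 1, 0, 1]
```

Both runs stop because the assignments no longer change, but the second partition has a much larger within-cluster spread. This is why the choice of initial centers matters.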

Running a cluster analysis

K-Means clustering is implemented by the KMeansClusterAnalysis class. This class has three constructors. The first constructor takes a Matrix<T> whose columns contain the data to be analyzed. The second constructor takes an array of Vector<T> objects. In the samples below, each constructor call also passes the number of clusters (3) as its final argument. Both these constructors are illustrated below:

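C#
// Random data: 100 rows by 10 columns.
var matrix = Matrix.CreateRandom(100, 10);
// First constructor: the data as a matrix.
var kc1 = new KMeansClusterAnalysis(matrix, 3);
// Second constructor: the data as an array of column vectors.
var vectors = matrix.Columns.ToArray();
var kc2 = new KMeansClusterAnalysis(vectors, 3);

VB
' Random data: 100 rows by 10 columns.
Dim mat = Matrix.CreateRandom(100, 10)
' First constructor: the data as a matrix.
Dim kc1 = New KMeansClusterAnalysis(mat, 3)
' Second constructor: the data as an array of column vectors.
Dim vectors = mat.Columns.ToArray()
Dim kc2 = New KMeansClusterAnalysis(vectors, 3)

F#
// Random data: 100 rows by 10 columns.
let matrix = Matrix.CreateRandom(100, 10)
// First constructor: the data as a matrix.
let kc1 = new KMeansClusterAnalysis(matrix, 3)
// Second constructor: the data as an array of column vectors.
let vectors = matrix.Columns.ToArray()
let kc2 = new KMeansClusterAnalysis(vectors, 3)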

The third constructor takes the data as a data frame together with a list of variable names. The first argument is an IDataFrame (a DataFrame<R, C> or Matrix<T>) that contains the variables that may be used in the analysis. The second argument is an array of strings that contains the names of the variables from the collection that should be included in the analysis. The sample below also passes the number of clusters (3) as a final argument.

C#
// Label the 10 columns x1 to x10 and wrap the matrix in a data frame.
var rowIndex = Index.Default(matrix.RowCount);
var names = new string[] { "x1", "x2", "x3",
    "x4", "x5", "x6", "x7", "x8", "x9", "x10" };
var columnIndex = Index.Create(names);
var dataFrame = matrix.ToDataFrame(rowIndex, columnIndex);
var kc3 = new KMeansClusterAnalysis(dataFrame, names, 3);

VB
' Label the 10 columns x1 to x10 and wrap the matrix in a data frame.
Dim rowIndex = Index.Default(mat.RowCount)
Dim names = {"x1", "x2", "x3",
    "x4", "x5", "x6", "x7", "x8", "x9", "x10"}
Dim columnIndex = Index.Create(names)
Dim frame = mat.ToDataFrame(rowIndex, columnIndex)
Dim kc3 = New KMeansClusterAnalysis(frame, names, 3)

F#
// Label the 10 columns x1 to x10 and wrap the matrix in a data frame.
let rowIndex = Index.Default(matrix.RowCount)
let names = [| "x1"; "x2"; "x3";
    "x4"; "x5"; "x6"; "x7"; "x8"; "x9"; "x10" |]
let columnIndex = Index.Create(names)
let dataFrame = matrix.ToDataFrame(rowIndex, columnIndex)
let kc3 = new KMeansClusterAnalysis(dataFrame, names, 3)

The outcome of the algorithm depends on how the initial cluster centers are chosen. Different methods have been devised to improve the chances of a good result. The method to use can be selected through the InitializationMethod property. It is of type KMeansInitializationMethod and can take on the following values:

Value               Description
KMeansPlusPlus      Use the K-means++ algorithm to compute the initial centroids. This is the default.
RandomCenters       Use randomly selected observations as the initial centroids.
Forgy               Same as RandomCenters: use randomly selected observations as the initial centroids.
RandomAssignments   Assign each observation randomly to one of the clusters and use the centroid of each cluster as its initial center.
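
The three strategies in the table can be sketched as follows. This is a library-independent illustration in Python using the commonly described form of each method, not the code behind KMeansInitializationMethod; the function names, the fallback for empty clusters, and the sample data are assumptions made for the example.

```python
import math
import random

def k_means_plus_plus(points, k, rng):
    """Pick the first center uniformly at random, then pick each further
    center with probability proportional to its squared distance to the
    nearest center chosen so far. Assumes at least k distinct points."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        weights = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers

def random_centers(points, k, rng):
    """Forgy-style initialization: k distinct observations chosen at random."""
    return rng.sample(points, k)

def random_assignments(points, k, rng):
    """Assign every observation to a random cluster and return the means."""
    labels = [rng.randrange(k) for _ in points]
    centers = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            # Assumption for this sketch: an empty cluster falls back
            # to a randomly chosen observation.
            members = [rng.choice(points)]
        centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
    return centers

rng = random.Random(42)  # fixed seed so repeated runs give the same centers
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

pp_init = k_means_plus_plus(points, 2, rng)
forgy_init = random_centers(points, 3, rng)
assigned_init = random_assignments(points, 2, rng)
print(pp_init)
```

Because an already chosen observation has distance zero to itself, K-means++ never selects the same observation twice and tends to spread the initial centers across the data.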

The initialization procedure always involves some randomization. The RandomNumberGenerator property lets you set the random number generator that is used to obtain any needed pseudo-random numbers.

Finally, the Standardize property lets you specify whether variables should be standardized before running the analysis. When variables are unequally scaled, variables with larger ranges make a larger contribution to the distance than others. This can distort the clustering. To avoid this problem, the variables can be standardized so they all contribute equally to the distance. The default is to standardize variables.
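
The effect of standardization can be illustrated with a small sketch. The Python code below is a hypothetical illustration, not the library's code: it assumes standardization means subtracting each column's mean and dividing by its sample standard deviation, which may differ from the convention the Standardize property uses. The column meanings (height and income) are invented for the example.

```python
import math
import statistics

def standardize(rows):
    """Rescale each column to mean 0 and sample standard deviation 1.
    A column with zero variance would cause a division by zero."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.stdev(col) for col in columns]
    return [tuple((x - m) / s for x, m, s in zip(row, means, stdevs))
            for row in rows]

# Hypothetical observations: (height in meters, income in dollars).
rows = [(1.6, 30000.0), (1.7, 50000.0), (1.8, 40000.0)]

# Unscaled, the distance between two observations is almost entirely income.
raw_distance = math.dist(rows[0], rows[1])

# After standardizing, both variables vary over a comparable range.
z = standardize(rows)
scaled_distance = math.dist(z[0], z[1])
print(raw_distance, scaled_distance)
```

In the raw data the height difference of 0.1 is invisible next to an income difference of 20000. After standardization both columns take values near -1, 0, and 1, so each contributes to the distance.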

The Fit method performs the actual calculations. Once the computations are complete, a number of properties and methods give access to the results in detail. The following code sample sets up some details of a K-means cluster analysis and runs it:

C#
kc1.InitializationMethod = KMeansInitializationMethod.KMeansPlusPlus;
kc1.RandomNumberGenerator = new MersenneTwister();
kc1.Standardize = true;
kc1.Fit();

VB
kc1.InitializationMethod = KMeansInitializationMethod.KMeansPlusPlus
kc1.RandomNumberGenerator = New MersenneTwister()
kc1.Standardize = True
kc1.Fit()

F#
kc1.InitializationMethod <- KMeansInitializationMethod.KMeansPlusPlus
kc1.RandomNumberGenerator <- MersenneTwister()
kc1.Standardize <- true
kc1.Fit()

Results of the analysis

The Centers property returns an array of vectors that contains the cluster centers. The Predictions property returns a CategoricalVector<T> that for each observation indicates the cluster to which it belongs. The GetDistancesToCenters method returns a Vector<T> that for each observation indicates the distance of the observation from the center of its cluster.

The Clusters property returns an array of KMeansCluster objects that describe each cluster in detail. For each cluster, the Center property returns the center of the cluster as a Vector<T>. The Size property returns the number of observations in the cluster. The MemberIndexes property returns a vector containing the indexes of its members in the original dataset. The SumOfSquares property returns the within-cluster sum of squares of the distances of its members to the center.

The sample code below prints some information about each cluster. In addition, it prints out for each observation the cluster it belongs to and the distance to its center:

C#
// Summarize each cluster.
foreach (var cluster in kc1.Clusters)
{
    Console.WriteLine("Cluster {0} has {1} members. Sum of squares: {2:F4}",
        cluster.Index, cluster.Size, cluster.SumOfSquares);
    Console.WriteLine("Center: {0:F4}", cluster.Center);
}
// List the cluster and distance to its center for every observation.
var memberships = kc1.Predictions;
var distances = kc1.GetDistancesToCenters();
for (int i = 0; i < memberships.Length; i++)
    Console.WriteLine("Observation {0} belongs to cluster {1}, distance: {2:F4}.",
        i, memberships.GetLevelIndex(i), distances[i]);

VB
' Summarize each cluster.
For Each cluster In kc1.Clusters
    Console.WriteLine("Cluster {0} has {1} members. Sum of squares: {2:F4}",
        cluster.Index, cluster.Size, cluster.SumOfSquares)
    Console.WriteLine("Center: {0:F4}", cluster.Center)
Next
' List the cluster and distance to its center for every observation.
Dim memberships = kc1.Predictions
Dim distances = kc1.GetDistancesToCenters()
For i = 0 To memberships.Length - 1
    Console.WriteLine("Observation {0} belongs to cluster {1}, distance: {2:F4}.",
        i, memberships.GetLevelIndex(i), distances(i))
Next

F#
// Summarize each cluster.
for cluster in kc1.Clusters do
    printfn "Cluster %d has %d members. Sum of squares: %.4f"
        cluster.Index cluster.Size cluster.SumOfSquares
    printfn "Center: %A" cluster.Center
// List the cluster and distance to its center for every observation.
let memberships = kc1.Predictions
let distances = kc1.GetDistancesToCenters()
for i = 0 to memberships.Length-1 do
    printfn "Observation %d belongs to cluster %d, distance: %.4f."
        i (memberships.GetLevelIndex(i)) distances.[i]
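
To make the reported quantities concrete, the following library-independent Python sketch computes per-observation distances, cluster sizes, and within-cluster sums of squares by hand for a tiny, made-up partition. The function names and data are hypothetical, and the sketch assumes plain Euclidean distance on unstandardized data, so the numbers are not meant to match what the library reports for a standardized analysis.

```python
import math

def distances_to_centers(points, labels, centers):
    """Distance of each observation to the center of its own cluster."""
    return [math.dist(p, centers[lab]) for p, lab in zip(points, labels)]

def cluster_sizes(labels, k):
    """Number of observations assigned to each of the k clusters."""
    return [labels.count(j) for j in range(k)]

def within_cluster_sums_of_squares(points, labels, centers):
    """Per cluster, the sum of squared distances of members to the center."""
    totals = [0.0] * len(centers)
    for p, lab in zip(points, labels):
        totals[lab] += math.dist(p, centers[lab]) ** 2
    return totals

# A made-up partition of four observations into two clusters.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (10.0, 4.0)]
labels = [0, 0, 1, 1]
centers = [(1.0, 0.0), (10.0, 2.0)]   # the mean of each cluster's members

dists = distances_to_centers(points, labels, centers)
sizes = cluster_sizes(labels, len(centers))
sums = within_cluster_sums_of_squares(points, labels, centers)
print(dists)   # [1.0, 1.0, 2.0, 2.0]
print(sizes)   # [2, 2]
print(sums)    # [2.0, 8.0]
```

A smaller sum of squares indicates a tighter cluster; here the first cluster is more compact than the second.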

Copyright (c) 2004-2021 ExoAnalytics Inc.
