K-Means Cluster Analysis | Extreme Optimization Numerical Libraries for .NET Professional

Cluster analysis is the collective name given to a number of algorithms for grouping similar objects into distinct categories. It is a form of exploratory data analysis aimed at grouping observations in a way that minimizes the difference within groups while maximizing the difference between groups.

In K-Means clustering, the number of clusters is fixed at the beginning. A cluster is defined by its cluster center or centroid. A number of initial cluster centers is chosen. The observations are assigned to the closest cluster. Each centroid is then recalculated as the mean of its members. This changes the distances between cluster centers and observations, so the observations are once again reassigned. This process is repeated until no more observations change cluster.
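
The assign-and-recompute loop described above can be sketched in plain C#. This is a minimal illustration only, not the library's implementation: the class name `KMeansSketch` and its methods are hypothetical, observations are assumed to be rows of a jagged array, distances are assumed to be Euclidean, and the initial centers are supplied by the caller.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of the library: a bare-bones K-means loop.
public static class KMeansSketch
{
    // Squared Euclidean distance between two points of equal dimension.
    public static double SquaredDistance(double[] a, double[] b)
    {
        double sum = 0.0;
        for (int j = 0; j < a.Length; j++)
        {
            double d = a[j] - b[j];
            sum += d * d;
        }
        return sum;
    }

    // Index of the center closest to the given point.
    public static int Closest(double[] point, double[][] centers)
    {
        int best = 0;
        for (int c = 1; c < centers.Length; c++)
            if (SquaredDistance(point, centers[c]) < SquaredDistance(point, centers[best]))
                best = c;
        return best;
    }

    // Repeats the assignment and update steps until no observation changes cluster.
    // Returns the cluster index of each observation; centers are updated in place.
    public static int[] Cluster(double[][] data, double[][] centers, int maxIterations = 100)
    {
        var assignment = Enumerable.Repeat(-1, data.Length).ToArray();
        for (int iter = 0; iter < maxIterations; iter++)
        {
            // Assignment step: move each observation to its closest center.
            bool changed = false;
            for (int i = 0; i < data.Length; i++)
            {
                int c = Closest(data[i], centers);
                if (c != assignment[i]) { assignment[i] = c; changed = true; }
            }
            if (!changed) break;

            // Update step: each centroid becomes the mean of its members.
            for (int c = 0; c < centers.Length; c++)
            {
                var members = data.Where((row, i) => assignment[i] == c).ToArray();
                if (members.Length == 0) continue; // leave an empty cluster's center unchanged
                for (int j = 0; j < centers[c].Length; j++)
                    centers[c][j] = members.Average(m => m[j]);
            }
        }
        return assignment;
    }
}
```

Calling `KMeansSketch.Cluster(data, centers)` with two well-separated groups of points and one starting center near each group assigns each group to its own cluster and moves each center to the mean of its group.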

Note that the final partition depends on the initial location of the centers. Different applications may return different results for the same dataset.

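K-Means clustering is implemented by the KMeansClusterAnalysis class. This class has three constructors. The first constructor takes a Matrix whose columns are the variables. The second takes an array of vectors instead. The example below also passes the number of clusters, 3, to each: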

    var matrix = Matrix.CreateRandom(100, 10);
    var kc1 = new KMeansClusterAnalysis(matrix, 3);
    var vectors = matrix.Columns.ToArray();
    var kc2 = new KMeansClusterAnalysis(vectors, 3);

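The third constructor takes an IDataFrame (such as a DataFrame) and an array with the names of the columns to include in the analysis. The example below also passes the number of clusters: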

    var rowIndex = Index.Default(matrix.RowCount);
    var names = new string[] { "x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10" };
    var columnIndex = Index.Create(names);
    var dataFrame = matrix.ToDataFrame(rowIndex, columnIndex);
    var kc3 = new KMeansClusterAnalysis(dataFrame, names, 3);

The outcome of the algorithm depends on how the initial cluster centers are chosen. Several initialization methods have been devised to improve the quality of the final clustering. The method is selected through the InitializationMethod property. It is of type KMeansInitializationMethod and can take on the following values:

Value | Description |
---|---|
KMeansPlusPlus | Use the K-means++ algorithm to compute initial centroids. This is the default. |
RandomCenters | Use randomly selected observations as the initial centroids. |
Forgy | Same as RandomCenters: use randomly selected observations as the initial centroids. |
RandomAssignments | Assign each observation randomly to one of the clusters and use the centroid of each cluster as its initial center. |
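
To illustrate what the KMeansPlusPlus option refers to, here is a rough sketch of the standard K-means++ seeding rule: the first center is a uniformly random observation, and each later center is an observation drawn with probability proportional to its squared distance from the nearest center already chosen. The class `KMeansPlusPlusSketch` and its method are hypothetical names for this illustration, and the library's own implementation may differ in its details.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of the library: standard K-means++ seeding.
public static class KMeansPlusPlusSketch
{
    static double SquaredDistance(double[] a, double[] b)
    {
        double sum = 0.0;
        for (int j = 0; j < a.Length; j++)
        {
            double d = a[j] - b[j];
            sum += d * d;
        }
        return sum;
    }

    // Chooses k initial centers from the rows of data.
    public static double[][] ChooseCenters(double[][] data, int k, Random rng)
    {
        var centers = new List<double[]>();

        // First center: an observation picked uniformly at random.
        centers.Add((double[])data[rng.Next(data.Length)].Clone());

        // minDist[i] holds the squared distance from observation i
        // to the nearest center chosen so far.
        var minDist = new double[data.Length];
        for (int i = 0; i < data.Length; i++) minDist[i] = double.PositiveInfinity;

        while (centers.Count < k)
        {
            // Only the most recently added center can shrink a minimum distance.
            var last = centers[centers.Count - 1];
            double total = 0.0;
            for (int i = 0; i < data.Length; i++)
            {
                minDist[i] = Math.Min(minDist[i], SquaredDistance(data[i], last));
                total += minDist[i];
            }

            // Draw an observation with probability proportional to minDist[i].
            double target = rng.NextDouble() * total;
            int chosen = data.Length - 1; // fallback if all distances are zero
            double cumulative = 0.0;
            for (int i = 0; i < data.Length; i++)
            {
                cumulative += minDist[i];
                if (cumulative > target) { chosen = i; break; }
            }
            centers.Add((double[])data[chosen].Clone());
        }
        return centers.ToArray();
    }
}
```

Because observations far from every existing center are more likely to be drawn, the initial centers tend to be spread out, which is the motivation for preferring this method over purely random centers.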

The initialization procedure always involves some randomization. The RandomNumberGenerator property lets you set the random number generator that supplies the pseudo-random numbers it needs.

Finally, the Standardize property lets you specify whether variables should be standardized before running the analysis. When variables are unequally scaled, some variables will make a larger contribution to the distance than others. This can distort the clustering. To avoid this problem, the variables can be standardized so they all contribute equally to the distance. The default is to standardize variables.
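
The effect of standardization can be sketched as follows. The example assumes that standardizing means subtracting each variable's mean and dividing by its sample standard deviation; the text above does not state the exact convention the library uses. `StandardizeSketch` is a hypothetical name, and observations are assumed to be rows, with at least two of them.

```csharp
using System;
using System.Linq;

// Hypothetical helper, not part of the library: z-score standardization by column.
public static class StandardizeSketch
{
    // Returns a copy of data (rows = observations, columns = variables) in which
    // each column has mean 0 and sample standard deviation 1.
    public static double[][] Standardize(double[][] data)
    {
        int n = data.Length, p = data[0].Length;
        var result = data.Select(row => (double[])row.Clone()).ToArray();
        for (int j = 0; j < p; j++)
        {
            double mean = data.Average(row => row[j]);
            double variance = data.Sum(row => (row[j] - mean) * (row[j] - mean)) / (n - 1);
            double sd = Math.Sqrt(variance);
            if (sd == 0.0) sd = 1.0; // constant column: center it but do not rescale
            for (int i = 0; i < n; i++)
                result[i][j] = (data[i][j] - mean) / sd;
        }
        return result;
    }
}
```

With columns (1, 2, 3) and (100, 200, 300), the second variable would dominate an unscaled Euclidean distance by a factor of 100. After this transformation both columns become (-1, 0, 1) and contribute equally.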

The Compute method performs the actual calculations. Once the computations are complete, a number of properties and methods give access to the results in detail. The following code sample sets up some details of a K-means cluster analysis and runs it:

    kc1.InitializationMethod = KMeansInitializationMethod.KMeansPlusPlus;
    kc1.RandomNumberGenerator = new MersenneTwister();
    kc1.Standardize = true;
    kc1.Compute();

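The Centers property returns an array of vectors that contains the cluster centers. The Predictions property returns a CategoricalVector that identifies the cluster each observation belongs to. The GetDistancesToCenters method returns the distance from each observation to the center of its cluster.

The Clusters property returns an array of KMeansCluster objects that describe each cluster in detail. The Center property returns the center of the cluster as a Vector. The Index, Size, and SumOfSquares properties give the cluster's index, its number of members, and its sum of squares. The following example prints these results: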

    foreach (var cluster in kc1.Clusters)
    {
        Console.WriteLine("Cluster {0} has {1} members. Sum of squares: {2:F4}",
            cluster.Index, cluster.Size, cluster.SumOfSquares);
        Console.WriteLine("Center: {0:F4}", cluster.Center);
    }
    var memberships = kc1.Predictions;
    var distances = kc1.GetDistancesToCenters();
    for (int i = 18; i < memberships.Length; i++)
        Console.WriteLine("Observation {0} belongs to cluster {1}, distance: {2:F4}.",
            i, memberships.GetLevelIndex(i), distances[i]);
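
For readers who want to see what such per-cluster figures typically mean, the sketch below computes a size and a within-cluster sum of squared Euclidean distances for each cluster from raw arrays. It assumes, without confirmation from the text above, that a cluster's sum of squares is defined this way. `ClusterSummarySketch` is a hypothetical name and does not use the library's types.

```csharp
using System;

// Hypothetical helper, not part of the library: per-cluster size and sum of squares.
public static class ClusterSummarySketch
{
    // For each cluster, returns its number of members and the sum of squared
    // Euclidean distances from those members to the cluster center.
    public static (int Size, double SumOfSquares)[] Summarize(
        double[][] data, int[] assignment, double[][] centers)
    {
        var summary = new (int Size, double SumOfSquares)[centers.Length];
        for (int i = 0; i < data.Length; i++)
        {
            int c = assignment[i];
            double ss = 0.0;
            for (int j = 0; j < data[i].Length; j++)
            {
                double d = data[i][j] - centers[c][j];
                ss += d * d;
            }
            summary[c].Size++;
            summary[c].SumOfSquares += ss;
        }
        return summary;
    }
}
```

A small sum of squares relative to the cluster's size indicates a tight cluster. Adding the values over all clusters gives the total quantity that the K-means iterations try to reduce.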

Copyright © 2004-20116, Extreme Optimization. All rights reserved.
