Extreme Optimization™: Complexity made simple.

Numerical Components
for .NET

  • Home
  • •
  • Features
    • Math Library
    • Vector and Matrix Library
    • Statistics Library
    • Performance
    • Usability
  • •
  • Documentation
    • Introduction
    • Math Library User's Guide
    • Vector and Matrix Library User's Guide
    • Statistics Library User's Guide
    • Reference
  • •
  • Support
    • Frequently Asked Questions
    • QuickStart Samples
    • Sample Applications
    • Downloads
  • •
  • Blog
  • •
  • Company
    • About us
    • Testimonials
    • Customers
    • Press Releases
    • Careers
    • Contact us
Introduction
Expand Mathematics Library User's GuideMathematics Library User's Guide
Expand Vector and Matrix Library User's GuideVector and Matrix Library User's Guide
Expand Statistics Library User's GuideStatistics Library User's Guide
Expand ReferenceReference
  • Home
  • Documentation
  • Statistics Library User's Guide
  • Multivariate Analysis
  • Hierarchical Cluster Analysis
Collapse imageExpand ImageCopy imageCopyHover image
       




Hierarchical Cluster Analysis

Cluster analysis is the collective name given to a number of algorithms for grouping similar objects into distinct categories. It is a form of exploratory data analysis aimed at grouping observations in a way that minimizes the difference within groups while maximizing the difference between groups.

Hierarchical clustering organizes observations in a tree structure based on similarity or dissimilarity between clusters. The algorithm starts with each observation as its own cluster, and successively combines the clusters that are most similar. Two important properties of the algorithm are the distance measure and the linkage method.

Distance measures

The distance measure determines how the similarity between clusters is measured. The most common distance measure is the squared Euclidean distance. The following distance measures have been predefined, and are available as properties of the DistanceMeasures class:

Value Description
EuclidianDistance The standard Euclidian distance.
SquaredEuclidianDistance The square of the Euclidean distance. This is the default.
ManhattanDistance The Manhattan or city block distance, computed as the sum of the absolute values of the difference between corresponding components.
MaximumDistance The distance computed as the largest absolute difference between corresponding components.
CorrelationDistance A measure based on the correlation between the observations.
CosineDistance The distance computed as the cosine of the angle between two observations.
MinkowskiDistance(power) The general Minkowski distance for the specified power. The power must be specified as a parameter to the method.
CanberraDistance The so-called Canberra distance, which gives more weight to components that add up to small values.

Linkage

The linkage method determines how clusters are combined and how the similarities between the merged cluster and the remaining clusters are computed. It can take on the following values:

Value Description
Single The distance between two clusters is the distance between their closest neighboring points.
Complete The distance between two clusters is the distance between their two furthest member points.
Average Also called unweighted pair-group method using averages (UPGMA). The distance between two clusters is the average distance between all inter-cluster pairs.
Ward The cluster to be merged is the one which will produce the least increase in the sum of squared Euclidean distances from each case in a cluster to the mean of all variables.
Median Also called unweighted pair-group method using centroid averages (UPGMC). The cluster to be merged is the one with the smallest sum of Euclidean distances between cluster means (centroids) for all variables.
McQuitty The distance between a cluster and a newly merged cluster is the mean of the distance between the cluster and each of the two component clusters.
Centroid The distance between two clusters is the distance between the centroids or the means of the clusters.

Running a cluster analysis

Hierarchical clustering is implemented by the HierarchicalClusterAnalysis class. This class has four constructors. The first constructor takes one parameter: a Matrix whose columns contain the data to be analyzed. The second constructor also takes one argument: an array of NumericalVariable objects.

The third and fourth constructors each take two parameters. The third constructor takes a VariableCollection as its first argument. The second argument is an array of strings that contains the names of the variables from the collection that should be included in the analysis. The fourth constructor takes a System.Data..::.DataTable as its first argument. The second argument is once again an array of strings that this time contains the names of the columns to be included in the analysis.

The distance measure can be set through the DistanceMeasure property. It defaults to squared Euclidean distance. To set or get the linkage method, use the DistanceMeasure property. The default here is the centroid method. The following example sets up a hierarchical cluster analysis using 7 variabels and standardizes the results:

C# Copy imageCopy
VariableCollection variables = new VariableCollection(data); 
HierarchicalClusterAnalysis hc = new HierarchicalClusterAnalysis(variables);
hc.Standardize = true;
hc.LinkageMethod = LinkageMethod.Centroid;
hc.DistanceMeasure = DistanceMeasures.SquaredEuclidianDistance;
hc.Compute();
Visual Basic Copy imageCopy
Dim variables As New VariableCollection(data)
Dim hc As New HierarchicalClusterAnalysis(variables)
hc.Standardize = True
hc.LinkageMethod = LinkageMethod.Centroid
hc.DistanceMeasure = DistanceMeasures.SquaredEuclidianDistance
hc.Compute()

Results of the analysis

The Compute()()() method performs the actual calculations. Once the computations are complete, a number of properties and methods give access to the results in detail.

Use the GetClusterPartition(Int32) method to divide the observations into the specified number of sets. This method returns a HierarchicalClusterCollection class, which - as the name implies - is a collection of HierarchicalCluster objects. In addition to the usual collection properties and methods, this class also has a GetMemberships()()() method, which returns a CategoricalVariable that for each observation indicates the cluster to which it belongs.

Each HierarchicalCluster describes one cluster in detail. The Size property returns the number of observations in the cluster. The MemberFilter property returns a Filter that selects the members of the cluster from the original dataset. The Root property returns the DendrogramNode node that is the root of the set. The following example creates a partition of 5 clusters, prints the size of each, and also prints out which cluster each observation belongs to:

C# Copy imageCopy
HierarchicalClusterCollection partition = hc.GetClusterPartition(5);
foreach(HierarchicalCluster cluster in partition)
    Console.WriteLine("Cluster {0} has {1} members.", cluster.Index, cluster.Size);
// Get a variable that shows memberships:
CategoricalVariable memberships = partition.GetMemberships();
for (int i = 15; i < memberships.Length; i++)
    Console.WriteLine("Observation {0} belongs to cluster {1}", i, memberships.GetLevelIndex(i));

Visual Basic Copy imageCopy
Dim partition As HierarchicalClusterCollection = hc.GetClusterPartition(5)
For Each cluster As HierarchicalCluster In partition
    Console.WriteLine("Cluster {0} has {1} members.", cluster.Index, cluster.Size)
Next
' Get a variable that shows memberships:
Dim memberships As CategoricalVariable = partition.GetMemberships()
For i As Integer = 15 To memberships.Length - 1
    Console.WriteLine("Observation {0} belongs to cluster {1}", i, memberships.GetLevelIndex(i))
Next i

Dendrograms

A dendrogram is a tree-like representation of the results of a hierarchical cluster analysis. The nodes of the tree are clusters that represent either original observations, or clusters that resulted from merging two clusters. The nodes of a dendrogram are represented by DendrogramNode objects. The dendrogram of a cluster analysis can be accessed through its root node, available through the a DendrogramRoot property.

The IsLeaf property indicates whether the cluster is a leaf node (an observation from the original data set), or a merged cluster. If it is a leaf node, then the ObservationIndex returns the index of this observation in the original data set. The LeftChild and RightChild properties specify the two clusters that were merged. DistanceMeasure returns the distance between the two merged clusters. This property is zero for leaf nodes. Level returns the number of nodes between the current node and the root node. The root node is at level 0. The children of the root are at level 1, and so on.

A dendrogram is also useful to give a graphical representation of the results. The Position property gives the horizontal position of the node. The leaf nodes (observations) have integer positions that correspond to their GetDendrogramOrder()()() and can be used as one coordinate. It is centered over the two clusters that were merged. The DistanceMeasure or Level properties can be used as the other coordinate.

The next example shows how to access the dendrogram nodes and their properties:

C# Copy imageCopy
DendrogramNode root = hc.DendrogramRoot;
// Position and DistanceMeasure give the x and y coordinates:
Console.WriteLine("Root position: ({0:F4}, {1:F4})", root.Position, root.DistanceMeasure);
// The left and right children:
Console.WriteLine("Position of left child: {0:F4}", root.LeftChild.Position);
Console.WriteLine("Position of right child: {0:F4}", root.RightChild.Position);

Visual Basic Copy imageCopy
Dim root As DendrogramNode = hc.DendrogramRoot
' Position and DistanceMeasure give the x and y coordinates:
Console.WriteLine("Root position: ({0:F4}, {1:F4})", root.Position, root.DistanceMeasure)
' The left and right children:
Console.WriteLine("Position of left child: {0:F4}", root.LeftChild.Position)
Console.WriteLine("Position of right child: {0:F4}", root.RightChild.Position)

Send comments on this topic to support@extremeoptimization.com

Copyright © 2003-2010, Extreme Optimization. All rights reserved.
Extreme Optimization, Complexity made simple, M#, and M Sharp are trademarks of ExoAnalytics Inc.
Microsoft, Visual C#, Visual Basic, Visual Studio, Visual Studio.NET, and the Optimized for Visual Studio logo
are registered trademarks of Microsoft Corporation.