Extreme Optimization Statistics Library User's Guide
# Hierarchical Cluster Analysis

Cluster analysis is the collective name given to a number of algorithms for grouping similar objects into distinct categories. It is a form of exploratory data analysis aimed at grouping observations in a way that minimizes the difference within groups while maximizing the difference between groups.

Hierarchical clustering organizes observations in a tree structure based on similarity or dissimilarity between clusters. The algorithm starts with each observation as its own cluster, and successively combines the clusters that are most similar. Two important properties of the algorithm are the distance measure and the linkage method.
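The agglomerative procedure described above can be sketched in plain C#, independent of the library. This is a minimal illustration for one-dimensional data with single linkage; all names in it are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class AgglomerationSketch
{
    static void Main()
    {
        double[] observations = { 0.0, 0.4, 3.0, 3.2, 9.0 };
        // Start with each observation as its own cluster.
        var clusters = observations.Select(x => new List<double> { x }).ToList();

        // Successively merge the two most similar clusters.
        while (clusters.Count > 1)
        {
            int bestI = 0, bestJ = 1;
            double bestDistance = double.MaxValue;
            for (int i = 0; i < clusters.Count; i++)
                for (int j = i + 1; j < clusters.Count; j++)
                {
                    // Single linkage: the distance between the closest members.
                    double d = clusters[i].SelectMany(
                        a => clusters[j].Select(b => Math.Abs(a - b))).Min();
                    if (d < bestDistance) { bestDistance = d; bestI = i; bestJ = j; }
                }
            Console.WriteLine("Merge {{{0}}} and {{{1}}} at distance {2}",
                string.Join(", ", clusters[bestI]),
                string.Join(", ", clusters[bestJ]), bestDistance);
            clusters[bestI].AddRange(clusters[bestJ]);
            clusters.RemoveAt(bestJ);
        }
    }
}
```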

# Distance measures

The *distance measure* determines how the similarity between clusters is measured.
The most common distance measure is the squared Euclidean distance.
The following distance measures have been predefined, and are available as properties of the
DistanceMeasures class:

Value | Description |
---|---|
EuclidianDistance | The standard Euclidean distance. |
SquaredEuclidianDistance | The square of the Euclidean distance. This is the default. |
ManhattanDistance | The Manhattan or city block distance, computed as the sum of the absolute differences between corresponding components. |
MaximumDistance | The distance computed as the largest absolute difference between corresponding components. |
CorrelationDistance | A measure based on the correlation between the observations. |
CosineDistance | A measure based on the cosine of the angle between two observations. |
MinkowskiDistance(power) | The general Minkowski distance for the specified power. The power must be specified as a parameter to the method. |
CanberraDistance | The Canberra distance, which gives more weight to components that add up to small values. |
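The definitions in the table can be written out directly. The following sketch (an illustration only, not the library's implementation) computes a few of these measures for two vectors:

```csharp
using System;
using System.Linq;

static class DistanceSketch
{
    static double SquaredEuclidean(double[] x, double[] y) =>
        x.Zip(y, (a, b) => (a - b) * (a - b)).Sum();

    static double Manhattan(double[] x, double[] y) =>
        x.Zip(y, (a, b) => Math.Abs(a - b)).Sum();

    static double Maximum(double[] x, double[] y) =>
        x.Zip(y, (a, b) => Math.Abs(a - b)).Max();

    // The Minkowski distance generalizes the others:
    // power = 1 gives Manhattan, power = 2 gives Euclidean.
    static double Minkowski(double[] x, double[] y, double power) =>
        Math.Pow(x.Zip(y, (a, b) => Math.Pow(Math.Abs(a - b), power)).Sum(),
            1.0 / power);

    static void Main()
    {
        double[] x = { 1.0, 2.0, 3.0 };
        double[] y = { 2.0, 0.0, 3.0 };
        Console.WriteLine(SquaredEuclidean(x, y)); // 5
        Console.WriteLine(Manhattan(x, y));        // 3
        Console.WriteLine(Maximum(x, y));          // 2
        Console.WriteLine(Minkowski(x, y, 2.0));   // sqrt(5)
    }
}
```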

# Linkage

The *linkage method* determines how clusters are combined and how the similarities
between the merged cluster and the remaining clusters are computed. It can take on the following values:

Value | Description |
---|---|
Single | The distance between two clusters is the distance between their closest neighboring points. |
Complete | The distance between two clusters is the distance between their two furthest member points. |
Average | Also called the unweighted pair-group method using averages (UPGMA). The distance between two clusters is the average distance over all inter-cluster pairs. |
Ward | The clusters to be merged are those that produce the smallest increase in the total sum of squared Euclidean distances from each observation to its cluster mean. |
Median | Also called the weighted pair-group method using centroids (WPGMC). The clusters to be merged are those with the smallest Euclidean distance between cluster means (centroids), computed over all variables. |
McQuitty | The distance between a cluster and a newly merged cluster is the mean of the distances between that cluster and each of the two component clusters. |
Centroid | The distance between two clusters is the distance between their centroids, the means of the clusters over all variables. |
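To make the difference between linkage methods concrete, the following sketch computes the single, complete, and average linkage distances between two small one-dimensional clusters from their pairwise distances (an illustration only, not the library's implementation):

```csharp
using System;
using System.Linq;

static class LinkageSketch
{
    static void Main()
    {
        double[] clusterA = { 0.0, 1.0 };
        double[] clusterB = { 4.0, 6.0 };

        // All pairwise distances between members of A and B.
        var pairwise = clusterA.SelectMany(
            a => clusterB.Select(b => Math.Abs(a - b))).ToArray();

        Console.WriteLine("Single (closest pair):    {0}", pairwise.Min());     // 3
        Console.WriteLine("Complete (furthest pair): {0}", pairwise.Max());     // 6
        Console.WriteLine("Average (UPGMA):          {0}", pairwise.Average()); // 4.5
    }
}
```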

# Running a cluster analysis

Hierarchical clustering is implemented by the HierarchicalClusterAnalysis class. This class has four constructors. The first constructor takes one parameter: a Matrix whose columns contain the data to be analyzed. The second constructor also takes one argument: an array of NumericalVariable objects.

The third and fourth constructors each take two parameters. The third constructor takes a VariableCollection as its first argument; the second argument is an array of strings containing the names of the variables from the collection to include in the analysis. The fourth constructor takes a System.Data.DataTable as its first argument; the second argument is again an array of strings, this time containing the names of the columns to include in the analysis.
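The four constructor forms can be summarized as follows. This is a sketch: the variable names and the column names "x1" and "x2" are placeholders, and it assumes the data objects have already been created:

```csharp
// From a Matrix whose columns contain the data:
var hc1 = new HierarchicalClusterAnalysis(matrix);

// From an array of NumericalVariable objects:
var hc2 = new HierarchicalClusterAnalysis(numericalVariables);

// From a VariableCollection, selecting variables by name:
var hc3 = new HierarchicalClusterAnalysis(collection, new string[] { "x1", "x2" });

// From a System.Data.DataTable, selecting columns by name:
var hc4 = new HierarchicalClusterAnalysis(table, new string[] { "x1", "x2" });
```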

The distance measure can be set through the DistanceMeasure property. It defaults to squared Euclidean distance. To set or get the linkage method, use the LinkageMethod property. The default here is the centroid method. The following example sets up a hierarchical cluster analysis using the variables in a collection and standardizes the variables:

C#

```csharp
VariableCollection variables = new VariableCollection(data);
HierarchicalClusterAnalysis hc = new HierarchicalClusterAnalysis(variables);
hc.Standardize = true;
hc.LinkageMethod = LinkageMethod.Centroid;
hc.DistanceMeasure = DistanceMeasures.SquaredEuclidianDistance;
hc.Compute();
```

Visual Basic

```vb
Dim variables As New VariableCollection(data)
Dim hc As New HierarchicalClusterAnalysis(variables)
hc.Standardize = True
hc.LinkageMethod = LinkageMethod.Centroid
hc.DistanceMeasure = DistanceMeasures.SquaredEuclidianDistance
hc.Compute()
```

# Results of the analysis

The Compute() method performs the actual calculations. Once the computations are complete, a number of properties and methods give access to the results in detail.

Use the GetClusterPartition(Int32) method to divide the observations into the specified number of sets. This method returns a HierarchicalClusterCollection, which, as the name implies, is a collection of HierarchicalCluster objects. In addition to the usual collection properties and methods, this class also has a GetMemberships() method, which returns a CategoricalVariable that indicates, for each observation, the cluster to which it belongs.

Each HierarchicalCluster describes one cluster in detail. The Size property returns the number of observations in the cluster. The MemberFilter property returns a Filter that selects the members of the cluster from the original data set. The Root property returns the DendrogramNode that is the root of the cluster's subtree in the dendrogram. The following example creates a partition of 5 clusters, prints the size of each, and prints the cluster each observation belongs to:

C#

```csharp
HierarchicalClusterCollection partition = hc.GetClusterPartition(5);
foreach (HierarchicalCluster cluster in partition)
    Console.WriteLine("Cluster {0} has {1} members.",
        cluster.Index, cluster.Size);

// Get a variable that shows memberships:
CategoricalVariable memberships = partition.GetMemberships();
for (int i = 0; i < memberships.Length; i++)
    Console.WriteLine("Observation {0} belongs to cluster {1}",
        i, memberships.GetLevelIndex(i));
```

Visual Basic

```vb
Dim partition As HierarchicalClusterCollection = hc.GetClusterPartition(5)
For Each cluster As HierarchicalCluster In partition
    Console.WriteLine("Cluster {0} has {1} members.", cluster.Index, cluster.Size)
Next

' Get a variable that shows memberships:
Dim memberships As CategoricalVariable = partition.GetMemberships()
For i As Integer = 0 To memberships.Length - 1
    Console.WriteLine("Observation {0} belongs to cluster {1}", i, memberships.GetLevelIndex(i))
Next i
```

# Dendrograms

A dendrogram is a tree-like representation of the results of a hierarchical cluster analysis. The nodes of the tree are clusters that represent either original observations or clusters that resulted from merging two clusters. The nodes of a dendrogram are represented by DendrogramNode objects. The dendrogram of a cluster analysis can be accessed through its root node, available through the DendrogramRoot property.

The IsLeaf property indicates whether the cluster is a leaf node (an observation from the original data set) or a merged cluster. If it is a leaf node, the ObservationIndex property returns the index of the observation in the original data set. For merged clusters, the LeftChild and RightChild properties return the two clusters that were merged, and the DistanceMeasure property returns the distance between them; this property is zero for leaf nodes. The Level property returns the number of nodes between the current node and the root node. The root node is at level 0, the children of the root are at level 1, and so on.
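The properties above make it possible to walk the entire tree. The following sketch recursively prints a dendrogram starting from the root; it uses only the members documented in this section:

```csharp
static void PrintNode(DendrogramNode node, int indent)
{
    Console.Write(new string(' ', 2 * indent));
    if (node.IsLeaf)
        // A leaf corresponds to one observation in the original data set.
        Console.WriteLine("Observation {0}", node.ObservationIndex);
    else
    {
        // An internal node records the distance at which its children merged.
        Console.WriteLine("Merged at distance {0:F4} (level {1})",
            node.DistanceMeasure, node.Level);
        PrintNode(node.LeftChild, indent + 1);
        PrintNode(node.RightChild, indent + 1);
    }
}

// Usage, after hc.Compute():
// PrintNode(hc.DendrogramRoot, 0);
```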

A dendrogram is also useful as a graphical representation of the results. The Position property gives the horizontal position of a node. Leaf nodes (observations) have integer positions that correspond to their dendrogram order, as returned by GetDendrogramOrder(); a merged node is centered over the two clusters that were merged. The Position can be used as one coordinate, and the DistanceMeasure or Level property as the other.

The next example shows how to access the dendrogram nodes and their properties:

C#

```csharp
DendrogramNode root = hc.DendrogramRoot;
// Position and DistanceMeasure give the x and y coordinates:
Console.WriteLine("Root position: ({0:F4}, {1:F4})",
    root.Position, root.DistanceMeasure);
// The left and right children:
Console.WriteLine("Position of left child: {0:F4}", root.LeftChild.Position);
Console.WriteLine("Position of right child: {0:F4}", root.RightChild.Position);
```

Visual Basic

```vb
Dim root As DendrogramNode = hc.DendrogramRoot
' Position and DistanceMeasure give the x and y coordinates:
Console.WriteLine("Root position: ({0:F4}, {1:F4})", root.Position, root.DistanceMeasure)
' The left and right children:
Console.WriteLine("Position of left child: {0:F4}", root.LeftChild.Position)
Console.WriteLine("Position of right child: {0:F4}", root.RightChild.Position)
```

Copyright © 2003-2013, Extreme Optimization. All rights reserved.

*Extreme Optimization*, *Complexity made simple*, *M#*, and *M Sharp* are trademarks of ExoAnalytics Inc.

*Microsoft*, *Visual C#*, *Visual Basic*, *Visual Studio*, *Visual Studio.NET*, and the *Optimized for Visual Studio* logo are registered trademarks of Microsoft Corporation.