Cluster analysis is the collective name given to a number of algorithms for grouping similar
objects into distinct categories. It is a form of exploratory data analysis aimed at grouping
observations in a way that minimizes the difference within groups while maximizing the difference
between groups.
Hierarchical clustering organizes observations in a tree structure based on similarity or
dissimilarity between clusters. The algorithm starts with each observation as its own cluster,
and successively combines the clusters that are most similar.
Two important properties of the algorithm are the distance measure and the linkage method.
The distance measure determines how the dissimilarity between clusters is computed.
The most common distance measure is the squared Euclidean distance.
The following distance measures have been predefined, and are available as properties of the
DistanceMeasures class:
Value | Description
---|---
EuclideanDistance | The standard Euclidean distance.
SquaredEuclideanDistance | The square of the Euclidean distance. This is the default.
ManhattanDistance | The Manhattan or city block distance, computed as the sum of the absolute values of the differences between corresponding elements.
MaximumDistance | The distance computed as the largest absolute difference between corresponding elements.
CorrelationDistance | A measure based on the correlation between the observations.
CosineDistance | The distance computed as the cosine of the angle between two observations.
MinkowskiDistance(Double) | The general Minkowski distance for the specified power. The power must be specified as a parameter to the method.
CanberraDistance | The so-called Canberra distance, which gives more weight to elements that add up to small values.
The linkage method determines how clusters are combined
and how the similarities between the merged cluster and the remaining clusters
are computed. It can take on the following values:
Value | Description
---|---
Single | The distance between two clusters is the distance between their closest neighboring points.
Complete | The distance between two clusters is the distance between their two furthest member points.
Average | Also called unweighted pair-group method using averages (UPGMA). The distance between two clusters is the average distance between all inter-cluster pairs.
Ward | The cluster to be merged is the one that produces the smallest increase in the sum of squared Euclidean distances from each case in a cluster to the mean of all variables.
Median | Also called unweighted pair-group method using centroid averages (UPGMC). The cluster to be merged is the one with the smallest sum of Euclidean distances between cluster means (centroids) for all variables.
McQuitty | The distance between a cluster and a newly merged cluster is the mean of the distances between the cluster and each of the two component clusters.
Centroid | The distance between two clusters is the distance between the centroids, or means, of the clusters.
Running a cluster analysis
Hierarchical clustering is implemented by the
HierarchicalClusterAnalysis class.
This class has three constructors. The first constructor takes one argument: a
Matrix&lt;T&gt; whose columns contain the data to be analyzed.
The second constructor also takes one argument: an array of
Vector&lt;T&gt; objects.
Both these constructors are illustrated below:
var matrix = Matrix.CreateRandom(100, 10);
var hc1 = new HierarchicalClusterAnalysis(matrix);
var vectors = matrix.Columns.ToArray();
var hc2 = new HierarchicalClusterAnalysis(vectors);
Dim mat = Matrix.CreateRandom(100, 10)
Dim hc1 = New HierarchicalClusterAnalysis(mat)
Dim vectors = mat.Columns.ToArray()
Dim hc2 = New HierarchicalClusterAnalysis(vectors)
let matrix = Matrix.CreateRandom(100, 10)
let hc1 = new HierarchicalClusterAnalysis(matrix)
let vectors = matrix.Columns.ToArray()
let hc2 = new HierarchicalClusterAnalysis(vectors)
The third constructor takes two arguments. The first is an
IDataFrame (a DataFrame&lt;R, C&gt; or Matrix&lt;T&gt;)
that contains the variables that may be used in the analysis.
The second argument is an array of strings
that contains the names of the variables from the collection that should be included
in the analysis.
The distance measure can be set through the
DistanceMeasure property.
It defaults to squared Euclidean distance. To set or get the linkage method, use the
LinkageMethod property.
The default here is the centroid method.
One last option that can be set is whether the variables should be standardized,
using the Standardize
property. When variables are unequally scaled, some variables will make a larger contribution
to the distance than others. This can distort the clustering. To avoid this problem,
the variables can be standardized so they all contribute equally to the distance.
The default is to standardize variables. The following example sets up
a hierarchical cluster analysis using a data frame and sets some options:
var rowIndex = Index.Default(matrix.RowCount);
var names = new string[] { "x1", "x2", "x3",
"x4", "x5", "x6", "x7", "x8", "x9", "x10" };
var columnIndex = Index.Create(names);
var dataFrame = matrix.ToDataFrame(rowIndex, columnIndex);
var hc3 = new HierarchicalClusterAnalysis(dataFrame, names);
hc3.Standardize = true;
hc3.LinkageMethod = LinkageMethod.Centroid;
hc3.DistanceMeasure = DistanceMeasures.SquaredEuclideanDistance;
Dim rowIndex = Index.Default(mat.RowCount)
Dim names = {"x1", "x2", "x3",
"x4", "x5", "x6", "x7", "x8", "x9", "x10"}
Dim columnIndex = Index.Create(names)
Dim frame = mat.ToDataFrame(rowIndex, columnIndex)
Dim hc3 = New HierarchicalClusterAnalysis(frame, names)
hc3.Standardize = True
hc3.LinkageMethod = LinkageMethod.Centroid
hc3.DistanceMeasure = DistanceMeasures.SquaredEuclideanDistance
let rowIndex = Index.Default(matrix.RowCount)
let names = [| "x1"; "x2"; "x3";
"x4"; "x5"; "x6"; "x7"; "x8"; "x9"; "x10" |]
let columnIndex = Index.Create(names)
let dataFrame = matrix.ToDataFrame(rowIndex, columnIndex)
let hc3 = new HierarchicalClusterAnalysis(dataFrame, names)
hc3.Standardize <- true
hc3.LinkageMethod <- LinkageMethod.Centroid
hc3.DistanceMeasure <- DistanceMeasures.SquaredEuclideanDistance
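Any of the values listed in the tables above can be substituted for these defaults. As a brief sketch, the analysis constructed above could instead use the Manhattan distance with Ward's linkage, or a general Minkowski distance with a specific power:
// Use the city block distance and Ward's minimum-variance linkage:
hc3.DistanceMeasure = DistanceMeasures.ManhattanDistance;
hc3.LinkageMethod = LinkageMethod.Ward;
// Or a general Minkowski distance, here with power 3:
hc3.DistanceMeasure = DistanceMeasures.MinkowskiDistance(3.0);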
The Fit
method performs the actual calculations. Once the computations are complete,
a number of properties and methods give access to the results in detail.
Use the GetClusterPartition
method to divide the observations into the specified number of sets. This method returns a
HierarchicalClusterCollection
object which, as the name implies, is a collection of
HierarchicalCluster objects.
In addition to the usual collection properties and methods, this class also has a
GetMemberships method,
which returns a CategoricalVector&lt;T&gt; that indicates,
for each observation, the cluster to which it belongs.
Each HierarchicalCluster
describes one cluster in detail.
The Size
property returns the number of observations in the cluster.
The MemberIndexes
property returns a vector containing the indexes of its members in the original dataset.
The Root property returns
the DendrogramNode
that is the root of the cluster in the dendrogram. The following example creates a partition of 5 clusters,
prints the size of each, and prints the cluster to which each observation
from index 15 onwards belongs:
hc1.Fit();
// Partition the observations into 5 clusters:
var partition = hc1.GetClusterPartition(5);
foreach (HierarchicalCluster cluster in partition)
    Console.WriteLine("Cluster {0} has {1} members.", cluster.Index, cluster.Size);
// Get the cluster assignment for each observation:
var memberships = partition.GetMemberships();
for (int i = 15; i < memberships.Length; i++)
    Console.WriteLine("Observation {0} belongs to cluster {1}", i,
        memberships.GetLevelIndex(i));
hc1.Fit()
' Partition the observations into 5 clusters:
Dim partition = hc1.GetClusterPartition(5)
For Each cluster In partition
    Console.WriteLine("Cluster {0} has {1} members.", cluster.Index, cluster.Size)
Next
' Get the cluster assignment for each observation:
Dim memberships = partition.GetMemberships()
For i = 15 To memberships.Length - 1
    Console.WriteLine("Observation {0} belongs to cluster {1}", i,
        memberships.GetLevelIndex(i))
Next
hc1.Fit()
// Partition the observations into 5 clusters:
let partition = hc1.GetClusterPartition(5)
for cluster in partition do
    printfn "Cluster %d has %d members." (cluster.Index) (cluster.Size)
// Get the cluster assignment for each observation:
let memberships = partition.GetMemberships()
for i = 15 to (memberships.Length - 1) do
    printfn "Observation %d belongs to cluster %d" i
        (memberships.GetLevelIndex(i))
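Individual clusters in the partition can be inspected further. The following is a minimal sketch, continuing from the example above and assuming the partition collection exposes an integer indexer, that prints the size and the member indexes of the first cluster:
var firstCluster = partition[0];
Console.WriteLine("The first cluster has {0} members.", firstCluster.Size);
// MemberIndexes contains the positions of the cluster's members in the original data set:
Console.WriteLine("Member indexes: {0}", firstCluster.MemberIndexes);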
A dendrogram is a tree-like representation of the results of a hierarchical cluster analysis.
The nodes of the tree are clusters that represent either original observations,
or clusters that resulted from merging two clusters. The nodes of a dendrogram
are represented by
DendrogramNode
objects. The dendrogram of a cluster analysis
can be accessed through its root node, which is available through the
DendrogramRoot
property.
The IsLeaf
property indicates whether the cluster is a leaf node (an observation
from the original data set), or a merged cluster. If it is a leaf node, then the
ObservationIndex property
returns the index of the observation in the original data set. The
LeftChild and
RightChild
properties specify the two clusters that were merged.
DistanceMeasure
returns the distance between the two merged clusters. This property is zero for leaf nodes.
Level
returns the number of nodes between the current node and the root node.
The root node is at level 0. The children of the root are at level 1, and so on.
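These properties make it straightforward to walk the tree. The helper below is only an illustrative sketch, not part of the library: it recursively prints the observation index for each leaf node, and the merge distance for each merged node:
void PrintNode(DendrogramNode node, int indent)
{
    string prefix = new string(' ', indent);
    if (node.IsLeaf)
        // Leaf nodes correspond to a single observation in the original data set.
        Console.WriteLine("{0}Observation {1}", prefix, node.ObservationIndex);
    else
    {
        // Merged nodes record the distance between the two clusters that were joined.
        Console.WriteLine("{0}Merged at distance {1:F4}", prefix, node.DistanceMeasure);
        PrintNode(node.LeftChild, indent + 2);
        PrintNode(node.RightChild, indent + 2);
    }
}
PrintNode(hc1.DendrogramRoot, 0);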
A dendrogram also gives a useful graphical representation of the results.
The Position property gives the horizontal position of a node.
Leaf nodes (observations) have integer positions that correspond to their order
in the dendrogram, as returned by GetDendrogramOrder;
the position of a merged node is centered over the two clusters that were merged.
The Position can serve as one coordinate, and the DistanceMeasure
or Level properties can be used as the other coordinate.
The next example shows how to access the dendrogram nodes and their properties:
var root = hc1.DendrogramRoot;
Console.WriteLine("Root position: ({0:F4}, {1:F4})", root.Position, root.DistanceMeasure);
Console.WriteLine("Position of left child: {0:F4}", root.LeftChild.Position);
Console.WriteLine("Position of right child: {0:F4}", root.RightChild.Position);
Dim root = hc1.DendrogramRoot
Console.WriteLine("Root position: ({0:F4}, {1:F4})", root.Position, root.DistanceMeasure)
Console.WriteLine("Position of left child: {0:F4}", root.LeftChild.Position)
Console.WriteLine("Position of right child: {0:F4}", root.RightChild.Position)
let root = hc1.DendrogramRoot
printfn "Root position: (%.4f, %.4f)" root.Position root.DistanceMeasure
printfn "Position of left child: %.4f" root.LeftChild.Position
printfn "Position of right child: %.4f" root.RightChild.Position