A comparative study of existing and novel methods for estimating the number of clusters in a data set

Nilsen, Gro

Master thesis

Year

2009

Abstract

Cluster analysis is a field of study in which the aim is to discover distinct groups, or clusters, in a data set. Objects in the same group should be similar to each other in some respect and, at the same time, dissimilar from objects in the other groups. In 2- or 3-dimensional data sets, this task is simplified by plotting the data. In high-dimensional data, on the other hand, the challenge is much greater. One particular discipline in which cluster analysis is commonly used is genomics. In cancer research, for example, the expression of thousands of genes is measured simultaneously, and one may seek to find groups of co-regulated genes, or groups of patients that have similar genetic expression profiles and clinical outcomes.

An intrinsic part of cluster analysis is determining how many clusters are present in the data set. In this thesis, several methods for estimating the number of clusters are presented. These include Gap, Recursive Gap, Silhouette, Prediction strength and In-group proportion. In addition, two novel approaches are introduced, namely Reference Gap and ERA. The methods are first applied to two breast tumour data sets for which earlier studies have indicated the presence of five, possibly six, distinct clusters. ERA and Recursive Gap give the results that are most consistent with the previous findings. The methods are then applied to data sets simulated under various scenarios. The advantage of using simulated data sets is that the true number of clusters is known beforehand, so the effectiveness of the methods can be compared directly. The conclusion of these trials is that ERA stands out as the most versatile and successful method, with Recursive Gap not far behind. The other methods perform more variably and are overall less successful than ERA.

The results found for both real and simulated data sets in this thesis hence indicate that the novel method ERA provides a valuable approach to the challenging task of estimating the number of clusters in a data set.
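
As a rough illustration of the simulation-based comparison described above, the following Python sketch simulates a data set with a known number of clusters and estimates that number with the average silhouette width, one of the existing methods named in the abstract. It is not the thesis's implementation: the function names (`simulate`, `kmeans`, `mean_silhouette`, `estimate_k`), the choice of k-means as the clustering algorithm, the two-dimensional Gaussian scenario and all parameter values are assumptions made for this example only.

```python
import math
import random


def simulate(centers, n_per_cluster, sd, rng):
    """Draw Gaussian points around each centre; the true k is len(centers)."""
    return [(rng.gauss(cx, sd), rng.gauss(cy, sd))
            for cx, cy in centers for _ in range(n_per_cluster)]


def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with farthest-point initialisation; returns labels."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from its nearest chosen centroid.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Average silhouette width over all points (singletons contribute 0)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        total += (b - a) / max(a, b)
    return total / len(points)


def estimate_k(points, k_range=range(2, 7)):
    """Return the k with the largest average silhouette width, plus all scores."""
    scores = {k: mean_silhouette(points, kmeans(points, k)) for k in k_range}
    return max(scores, key=scores.get), scores


# Hypothetical scenario: three well-separated clusters in two dimensions.
rng = random.Random(0)
data = simulate([(0, 0), (10, 0), (5, 9)], n_per_cluster=30, sd=1.0, rng=rng)
k_hat, scores = estimate_k(data)
print("estimated number of clusters:", k_hat)
```

Because the true number of clusters is fixed by the simulation, the estimate `k_hat` can be checked directly against it. Note that the silhouette width is only defined for two or more clusters, so this sketch cannot return an estimate of one.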