  • Det matematisk-naturvitenskapelige fakultet
  • Matematisk institutt
  • Modellering og dataanalyse
A comparative study of existing and novel methods for estimating the number of clusters in a data set

Nilsen, Gro
Master thesis
Masteroppgave.pdf (3.394Mb)
Year
2009
Permanent link
http://urn.nb.no/URN:NBN:no-23169

Appears in the following Collection
  • Modellering og dataanalyse [183]
Abstract
Cluster analysis is a field of study where the aim is to discover distinct groups, or clusters, in a data set. Objects in the same group should be similar to each other in some respect and, at the same time, dissimilar from objects in other groups. In 2- or 3-dimensional data sets, this task is simplified by plotting the data. In high-dimensional data, on the other hand, the challenge is much greater. One discipline in which cluster analysis is commonly used is genomics. In cancer research, for example, the expression of thousands of genes is measured simultaneously, and one may seek to find groups of co-regulated genes, or groups of patients that have similar gene expression profiles and clinical outcomes.

An intrinsic part of cluster analysis is determining how many clusters are present in the data set. In this thesis, several methods for estimating the number of clusters are presented. These include Gap, Recursive Gap, Silhouette, Prediction strength and In-group proportion. In addition, two novel approaches are introduced, namely Reference Gap and ERA. The methods are first applied to two breast tumour data sets for which earlier studies have indicated the presence of five, possibly six, distinct clusters. ERA and Recursive Gap give the results most consistent with the previous findings. The methods are then applied to data sets simulated under various scenarios. The advantage of using simulated data sets is that the true number of clusters is known beforehand, so the methods' effectiveness can be compared directly. The conclusion of these trials is that ERA stands out as the most versatile and successful method, with Recursive Gap not too far behind. The other methods perform more variably and are overall less successful than ERA.

The results for both real and simulated data sets in this thesis thus indicate that the novel method ERA provides a valuable approach to the challenging task of estimating the number of clusters in a data set.
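
As a rough illustration of the simulation idea described in the abstract, the sketch below generates a data set with a known number of clusters and checks whether a Silhouette-based estimate recovers it. It is not taken from the thesis: the k-means routine with farthest-first initialisation, the three simulated cluster centres, the noise level and all function names are assumptions chosen for the example. The Silhouette criterion needs at least two clusters, so the candidate values start at k = 2.

```python
import math
import random


def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, iterations=100):
    """Plain k-means with deterministic farthest-first initialisation (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from all centres chosen so far.
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments no longer change
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Average silhouette width; points in singleton clusters contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for key, other in clusters.items() if key != lab)
        total += (b - a) / max(a, b)
    return total / len(points)


def estimate_k(points, k_values=range(2, 7)):
    """Return the k with the largest mean silhouette, plus all scores."""
    scores = {k: mean_silhouette(points, kmeans(points, k)) for k in k_values}
    return max(scores, key=scores.get), scores


# Simulated data: three well-separated clusters, so the true k is known to be 3.
rng = random.Random(0)
true_centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
data = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
        for cx, cy in true_centers for _ in range(20)]

best_k, scores = estimate_k(data)
print("estimated number of clusters:", best_k)
```

Because the true number of clusters is fixed by the simulation, the estimate can be compared against it directly. Repeating this over many simulated data sets and scenarios is one way such methods could be compared.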
 
Responsible for this website 
University of Oslo Library


Contact Us 
duo-hjelp@ub.uio.no


Privacy policy
 

 
