The amount of genomic data produced by genome sequencing projects is growing at an accelerating pace. This vast quantity of data must be analyzed in order to extract valuable information about the genome. The Genomic HyperBrowser is a system that contributes to the analysis of genomic data. It provides many comparative analyses at the sequence level.
The genomic data collected in the HyperBrowser system is represented as mathematical objects called genome annotation tracks. The biological hypotheses of interest are translated into studies of mathematical relations between tracks. Both the biological data and the investigations are thus represented and executed mathematically. As a contribution to the data analysis process in the HyperBrowser system, this master thesis adds a flexible, customizable clustering system that supports many different ways of clustering genome annotation tracks. The tracks to be clustered can be any of the tracks already available within the HyperBrowser, or tracks obtained by running track annotation tools in the HyperBrowser system.
Starting from the requirements, the development of the clustering system was divided into two parts: the theoretical development of the clustering cases, and the implementation of a clustering system that supports these cases.
It was found that there are at least three fundamentally different ways to cluster a set of tracks, and one way to cluster regions on a single track. The clustering cases were constructed by first examining the possibilities for clustering a concrete dataset, based on different biological investigations the user might be interested in, and then generalizing these possibilities to all the track formats available in the HyperBrowser that can be clustered using this system. The theoretical properties of the cases and the distinctions between them were investigated.
The implementation of the system is further divided into two parts: a user interface (front-end) and a set of functions that carry out the clustering (back-end). The front-end is a simple webpage that communicates interactively with the user. The clustering cases are listed on the webpage, and the user decides which case should be used. The options specific to the selected case are then displayed. All the information selected by the user is then used as input data for the back-end functions. The front-end webpage was implemented using Mako templates and HTML.
The back-end functions that carry out the clustering were implemented in Python and R. Based on the input data from the front-end, appropriate statistical functions already implemented in the HyperBrowser are used to construct the data matrix representing the tracks to be clustered, and clustering methods in R are used to carry out the clustering. All four clustering cases were implemented in the system.
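The two back-end steps described above (build a data matrix with one row per track, then cluster the rows) can be illustrated with a minimal sketch. This is not the thesis implementation: the real system computes the matrix with HyperBrowser statistics and clusters in R, whereas the sketch below uses a hand-written average-linkage agglomerative clustering in plain Python. The track names, the feature values, and the function names `euclidean` and `average_linkage` are all hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(matrix, names, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    The distance between two clusters is the mean pairwise Euclidean
    distance between their member rows (average linkage).
    """
    # Start with every track (row index) in its own cluster.
    clusters = [[i] for i in range(len(matrix))]

    def cluster_distance(c1, c2):
        dists = [euclidean(matrix[i], matrix[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest average distance.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)

    return [sorted(names[i] for i in c) for c in clusters]


# Hypothetical data matrix: one row per track, one column per feature.
track_names = ["track_A", "track_B", "track_C", "track_D"]
feature_matrix = [
    [0.10, 0.20],
    [0.12, 0.18],
    [0.90, 0.85],
    [0.88, 0.80],
]

result = sorted(average_linkage(feature_matrix, track_names, 2))
print(result)
```

In this toy input the first two tracks and the last two tracks have nearly identical feature rows, so they end up in separate two-member clusters. The choice of distance measure and linkage here is an assumption made for the sketch; the abstract does not state which ones the system uses.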
The clustering system was then tested by clustering two separate datasets (a virus dataset and a gene dataset), using all three clustering cases for tracks. For one of the test cases, a sample result from an earlier study was available and was used as a reference to check the credibility of the newly implemented tool. The clustering result produced by the tool matches the sample result, which supports the reliability of the tool.