Parallelisation of Hierarchical Clustering Algorithms for Metagenomics

Tantono, Mimi

Master thesis

Year

2015

Abstract

Metagenomics is the study of genetic material obtained directly from environmental samples. Driven by the rapid development of DNA sequencing technology and continuing reductions in sequencing costs, metagenomic studies have become popular over the past few years, with the potential to yield new knowledge in many fields by analysing the diversity of microbial communities. The availability of large-scale datasets makes data analysis more challenging, especially for hierarchical clustering, which has quadratic time complexity. This thesis presents the design and implementation of a parallelisation method for single-linkage hierarchical clustering of metagenomics data. Using 16 parallel threads, the implementation, p-swarm, achieved a measured speedup of 11 times. This result is a significant improvement in execution time that preserves the quality of exact, unsupervised clustering, and it makes it possible to hierarchically cluster larger datasets, for example the TARA dataset, which consists of nearly 10 million amplicons, in just a few hours. Moreover, our method may be extended to a distributed computing model, which could further increase scalability and the capacity to cluster larger volumes of data.
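
To illustrate where the quadratic cost mentioned in the abstract comes from, the following is a minimal sequential sketch of single-linkage clustering at a fixed distance threshold. It is not the algorithm used in the thesis or in p-swarm. The function names `hamming` and `single_linkage`, the threshold parameter `d`, the use of Hamming distance on equal-length sequences, and the toy amplicons are all assumptions made for illustration.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def single_linkage(seqs: list[str], d: int = 1) -> list[list[str]]:
    """Group sequences so that any pair within distance d shares a cluster.

    Illustrative only: hypothetical helper, not taken from the thesis.
    """
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All-pairs comparison: n * (n - 1) / 2 distance computations,
    # which is the source of the quadratic time complexity.
    for i, j in combinations(range(len(seqs)), 2):
        if hamming(seqs[i], seqs[j]) <= d:
            parent[find(i)] = find(j)

    clusters: dict[int, list[str]] = {}
    for i, s in enumerate(seqs):
        clusters.setdefault(find(i), []).append(s)
    return sorted(clusters.values())


# Toy input (made up for this example).
amplicons = ["ACGT", "ACGA", "ACCA", "TTTT", "TTTA"]
clusters = single_linkage(amplicons, d=1)
```

In this toy input, "ACGT" and "ACCA" differ at two positions but still end up in the same cluster, because each is within one mismatch of "ACGA". This chaining is the defining property of single linkage. The pairwise loop is the part a parallel method would need to divide among threads.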