Rapid and Sensitive Alignment-free DNA Sequence Comparison

Mjelva, Kristoffer

Master thesis

View/Open

master_thesis.pdf (1.475Mb)

Year

2018

Abstract

When it comes to the analysis of biological sequences, alignment based methods have long been in the forefront for finding similarities between sequences, and in the classification of genetic sequences. Many of these methods date back to the 1970s and 80s, and while they are precise in their assessment of biological data, they suffer from high execution times as the amount of data grows. To counter this, new alignment-free methods have been developed, and a common practice in these new methods is to count the number of shared words of length k, or k-mers, in the sequences that are to be analysed. In this project, we use empirical analysis of randomly generated DNA sequences, the Rfam 11 database and the 16S RDP training set to investigate what is the optimal k-mer length, if we should allow mismatches in k-mers and whether the positions and number of occurrences of each k-mer might influence our classification methods one way or the other. Our research indicates that for short sequences of varying similarity, such as those in the Rfam database, a k-mer length of 8 yields high values for sensitivity and precision, while a length of 11 performs slightly better, if one mismatch is allowed in each k-mer. For the longer and more similar sequences in the RDP training set, k-mers of length 6 or more perform equally good, and the introduction of mismatches in k- mers for these sequences do not improve the sensitivity or precision of our analysis. We conclude that certain variations on the implementation of k- mers, such as mismatches and counting the number of occurrences of each word, might improve sensitivity and precision in certain instances, but that no k-mer implementation is universally best for the different data sets we have tested.