
Rapid and Sensitive Alignment-free DNA Sequence Comparison

Mjelva, Kristoffer
Master thesis
View/Open
master_thesis.pdf (1.475Mb)
Year
2018
Permanent link
http://urn.nb.no/URN:NBN:no-65383

Appears in the following Collection
  • Institutt for informatikk [3604]
Abstract
When it comes to the analysis of biological sequences, alignment-based methods have long been at the forefront of finding similarities between sequences and of classifying genetic sequences. Many of these methods date back to the 1970s and 1980s, and while they assess biological data precisely, their execution times become high as the amount of data grows. To counter this, new alignment-free methods have been developed, and a common practice in these methods is to count the number of shared words of length k, or k-mers, in the sequences to be analysed. In this project, we use empirical analysis of randomly generated DNA sequences, the Rfam 11 database and the 16S RDP training set to investigate the optimal k-mer length, whether mismatches should be allowed in k-mers, and whether the positions and number of occurrences of each k-mer influence our classification methods. Our research indicates that for short sequences of varying similarity, such as those in the Rfam database, a k-mer length of 8 yields high sensitivity and precision, while a length of 11 performs slightly better if one mismatch is allowed in each k-mer. For the longer and more similar sequences in the RDP training set, k-mers of length 6 or more perform equally well, and introducing mismatches in k-mers for these sequences does not improve the sensitivity or precision of our analysis. We conclude that certain variations on the implementation of k-mers, such as mismatches and counting the number of occurrences of each word, might improve sensitivity and precision in certain instances, but that no k-mer implementation is universally best across the data sets we have tested.
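The k-mer variations named in the abstract (shared words, occurrence counts, and one allowed mismatch) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the DNA alphabet "ACGT", and the specific scoring choices (distinct shared words, sum of minimum counts, Hamming distance of at most 1) are assumptions made here for illustration only.

```python
from collections import Counter


def kmers(seq: str, k: int) -> Counter:
    """Count every overlapping word of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmers(a: str, b: str, k: int) -> int:
    """Number of distinct k-mers that occur in both sequences."""
    return len(kmers(a, k).keys() & kmers(b, k).keys())


def shared_kmer_occurrences(a: str, b: str, k: int) -> int:
    """Shared k-mers weighted by occurrence: sum of the minimum counts."""
    # Counter & Counter keeps, per word, the smaller of the two counts.
    return sum((kmers(a, k) & kmers(b, k)).values())


def neighbours(word: str, alphabet: str = "ACGT") -> set:
    """The word itself plus every word at Hamming distance 1 (assumed alphabet)."""
    out = {word}
    for i, ch in enumerate(word):
        for sub in alphabet:
            if sub != ch:
                out.add(word[:i] + sub + word[i + 1:])
    return out


def shared_kmers_one_mismatch(a: str, b: str, k: int) -> int:
    """Distinct k-mers of a that match some k-mer of b with at most one mismatch."""
    words_b = set(kmers(b, k))
    return sum(1 for w in kmers(a, k) if neighbours(w) & words_b)
```

For example, with a = "ACGTACGT", b = "ACGTTCGT" and k = 4, the only exactly shared word is ACGT, whereas every distinct 4-mer of a has a counterpart in b once a single mismatch is tolerated. This shows how allowing mismatches can raise the number of matches between related sequences; whether that helps classification depends on the data set, as the abstract notes.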
 
Responsible for this website 
University of Oslo Library


Contact Us 
duo-hjelp@ub.uio.no


Privacy policy
 

 
