Abstract
Next-generation sequencing produces large amounts of data in the form of short reads. In this thesis we present ways of compressing these data by means of differencing. We consider two cases: first, compression of raw data using internal differencing, and then compression of data that has been mapped to a reference genome. We show results in which significant compression is achieved in most cases.
With the rapid progression in the field of DNA sequencing, an appreciable
part of total costs has moved from the wet lab to the server room. New
platforms are generating data that grows faster than storage
capacity. Compression more effective than generic techniques is possible due
to the characteristics inherent in the sequencing output. In this thesis we look
at using differential compression both for raw, unmapped reads and for reads
that have been mapped to a reference genome. For the aligned reads this means
using a reference-based solution where the information in the reads is stored
as differences against a reference library. For the unaligned reads we
use intra-frame differencing similar to the frame-to-frame approach in video
compression. With this approach we are able to show a proof of concept
in which significant compression rates are attainable in both cases.