Abstract
DNA sequencing has brought many important improvements to medicine, and is a field
of active development. Through the study of genetic information, we gain knowledge
of hereditary diseases and traits, in both humans and other species.
However, the process of sequencing is difficult, owing both to the vast amounts of data
and to sequencing errors. Results are often verified through expensive re-sequencing
and analysis.
In this thesis we study the use of simulated reads, for which the underlying genome is
known, in order to obtain exact results.
First, we suggest a method for creating a known artificial genome, using dbSNP to
provide variation. Next, several existing programs for variant calling are evaluated through
detailed analysis of variant files. Finally, we suggest methods for improving verification
of results.
The results show that the GATK variant callers performed well, but Dindel also
provided some advantages. Furthermore, the results suggest that some issues are caused by
erroneous mapping and realignment.
We hope that others can use these results to improve the development of sequencing
algorithms through simulations.