Representation and integrated analysis of heterogeneous genomic datasets
Abstract
The technological developments in molecular biology over the last 50 years have brought with them a gradual shift of focus from genes and proteins towards the biological activity in non-coding parts of the genome. High-throughput sequencing techniques have opened a floodgate of published whole-genome datasets of experimental nature, such as ChIP-seq or variation data, which has accelerated this development. The shift towards non-coding regions of DNA has increased the need for representing data as genomic tracks, i.e. with coordinates along a reference genome. Consequently, a demand for user-friendly tools for analyzing such tracks has arisen.
This thesis presents a conceptual differentiation of genomic tracks into fifteen track types, such as points or segments, and it argues that the track types under study determine which questions are meaningful to ask. Furthermore, the thesis presents “The Genomic HyperBrowser” (http://hyperbrowser.uio.no), a general web-based system for statistical analysis of genomic tracks. The system incorporates a range of hypothesis tests for answering questions about particular relations between datasets. The calculation of p-values is mostly based upon Monte Carlo simulation under a user-selected null model. In addition, a number of descriptive statistics and data manipulation tools have been developed.
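As an illustration of how a Monte Carlo p-value for a relation between two tracks can be obtained, the following is a minimal sketch and not the implementation used by the Genomic HyperBrowser. It assumes one track of points and one track of sorted, non-overlapping segments on a single sequence, uses the number of points inside segments as test statistic, and assumes a null model in which point positions are drawn uniformly at random along the sequence. The function names, the half-open interval convention and the choice of null model are all assumptions made for this example.

```python
import random
from bisect import bisect_right


def count_points_in_segments(points, segments):
    """Count points that fall inside any half-open segment [start, end).

    Assumption for this sketch: segments are sorted by start position
    and do not overlap.
    """
    starts = [start for start, _ in segments]
    count = 0
    for position in points:
        # Index of the last segment starting at or before the point.
        i = bisect_right(starts, position) - 1
        if i >= 0 and position < segments[i][1]:
            count += 1
    return count


def monte_carlo_p_value(points, segments, sequence_length,
                        n_samples=999, seed=0):
    """One-sided Monte Carlo p-value for 'points fall inside segments
    more often than expected'.

    Assumed null model: segments stay fixed, and each point is placed
    uniformly at random on [0, sequence_length).
    """
    rng = random.Random(seed)
    observed = count_points_in_segments(points, segments)
    at_least_as_extreme = 0
    for _ in range(n_samples):
        simulated = [rng.randrange(sequence_length) for _ in points]
        if count_points_in_segments(simulated, segments) >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return (at_least_as_extreme + 1) / (n_samples + 1)


if __name__ == "__main__":
    segments = [(100, 200), (500, 600)]
    points = [110, 150, 190, 510, 550, 590]
    print(monte_carlo_p_value(points, segments, sequence_length=10_000))
```

A different null model, for example one that preserves the distances between neighbouring points, would change only the line that generates the simulated track. This is why the choice of null model can be left to the user while the rest of the test stays the same.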
The thesis also introduces “GTrack”, a new file format for most types of genomic data, supporting all of the fifteen track types. The file format and its binary variant are fully supported by the Genomic HyperBrowser, providing the backbone for flexible high-speed analysis within the system.
Lastly, the thesis presents “The differential disease regulome”, a hypothesis-generating tool providing a powerful way to visualize relations between two classes of datasets. In the main case, transcription factors are related to disease genes. Because of its large size, the resulting heatmap of ~500,000 relations is made browsable via the Google Maps engine.
List of papers.
Paper I: Sandve GK, Gundersen S, Rydbeck H, Glad IK, Holden L, Holden M, Liestøl K, Clancy T, Ferkingstad E, Johansen M, Nygaard V, Tøstesen E, Frigessi A, Hovig E. The Genomic HyperBrowser: inferential genomics at the sequence level. Genome Biol. 2010;11(12):R121. doi:10.1186/gb-2010-11-12-r121 Copyright 2010 Sandve et al.; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License.
Paper II: Gundersen S, Kalaš M, Abul O, Frigessi A, Hovig E, Sandve GK. Identifying elemental genomic track types and representing them uniformly. BMC Bioinformatics. 2011 Dec 30;12:494. doi:10.1186/1471-2105-12-494 Copyright 2011 Gundersen et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License.
Paper III: Sandve GK, Gundersen S, Rydbeck H, Glad IK, Holden L, Holden M, Liestøl K, Clancy T, Drabløs F, Ferkingstad E, Johansen M, Nygaard V, Tøstesen E, Frigessi A, Hovig E. The differential disease regulome. BMC Genomics. 2011 Jul 7;12:353. doi:10.1186/1471-2164-12-353 Copyright 2011 Sandve et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License.