Identification of non-coding RNA using covariance models (Identifisering av ikkje-kodande RNA ved hjelp av samvariansmodellar)

Thingnes, Josef

Master thesis

Year

2004

Abstract

Goal

This master's thesis is part of an RNA project at Rikshospitalet, Oslo, Norway. The overall goal of the project is to build a general search tool for finding non-coding RNA genes in genomic sequences, a task recognized as harder and more complex than searching for protein-coding genes. The goal of this particular study is to find the best ways to search for homologs of known non-coding RNA genes.

Background

Functional RNAs are important but not yet well understood regulators, catalysts and even enzymes in living cells (Eddy 2001; Mattick 2001; Storz 2002; Hershberg et al. 2003). These molecules are the products of so-called non-coding RNA genes (ncRNA genes).

One well-known way to look for new genes in a genomic sequence is to look for homologs of already known genes, for example from another organism. Two genes are defined as homologs if they have evolved from a common ancestor. A program that checks for homology looks for parts that are conserved between the two genes. In ncRNA genes, the secondary structure of the RNA molecule is often more conserved than the sequence itself. The covariance model (CM) (Eddy and Durbin 1994) captures this conserved structure for a family of ncRNA genes. The CM also comes with a set of algorithms that allow users to search for new members of the family.
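To illustrate what "structure more conserved than sequence" can look like in practice, the following is a minimal Python sketch that measures covariation between two columns of a multiple alignment using mutual information. It is not the thesis implementation and not a covariance model; it only shows the kind of pairwise signal that structure-aware models exploit. The function name, the toy alignment and the choice of mutual information as the measure are assumptions made for this illustration.

```python
from collections import Counter
from math import log2


def column_mutual_information(alignment, i, j):
    """Mutual information (in bits) between alignment columns i and j.

    `alignment` is a list of equal-length, already aligned sequences.
    High values suggest that the two positions vary together, as
    base-paired positions in a conserved RNA helix tend to do.
    """
    pairs = [(seq[i], seq[j]) for seq in alignment]
    n = len(pairs)
    pair_counts = Counter(pairs)
    left_counts = Counter(a for a, _ in pairs)
    right_counts = Counter(b for _, b in pairs)

    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = left_counts[a] / n
        p_b = right_counts[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: the first and last columns always form a
# complementary pair (G-C, C-G, A-U, U-A) although neither column is
# conserved in sequence; the middle columns are fully conserved.
toy_alignment = [
    "GAAAC",
    "CAAAG",
    "AAAAU",
    "UAAAA",
]

paired = column_mutual_information(toy_alignment, 0, 4)
conserved = column_mutual_information(toy_alignment, 1, 2)
```

In this toy case the first and last columns share no conserved nucleotide, yet they covary perfectly, whereas the fully conserved middle columns carry no covariation signal at all. A purely sequence-based comparison would rank these positions the other way around.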

Methods

In this thesis I have studied different kinds of homology search in general and the use of CMs in particular. Four distinct but related tests were carried out. First, tRNAscan-SE (Lowe and Eddy 1997), a search program that looks for tRNA genes by employing a CM, is tested on the complete genome of Escherichia coli. Next, I have made CM training sets for two other genes, spf and ssrA, by doing homology searches with the program ParAlign (Rognes 2001). A part of the covariance model is then implemented and tested on these training sets as well as on handmade test data. Finally, a change is made to the model to improve its ability to predict secondary structure.

Results

Through studies of the theory in the field and the four tests, I have obtained knowledge that will be a good guide for developing a general search tool. Even though the model has its drawbacks, it is clear that a general search tool has to employ a CM.

References

Eddy, SR (2001). Non-coding RNA genes and the modern RNA world. Nature Reviews Genetics 2 (12): 919-929.

Eddy, SR and Durbin, R (1994). RNA sequence analysis using covariance models. Nucleic Acids Research 22 (11): 2079-2088.

Hershberg, R, Altuvia, S and Margalit, H (2003). A survey of small RNA-encoding genes in Escherichia coli. Nucleic Acids Research 31 (7): 1813-1820.

Lowe, T and Eddy, SR (1997). tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Research 25 (5): 955-964.

Mattick, JS (2001). Non-coding RNAs: the architects of eukaryotic complexity. European Molecular Biology Organization Reports 2 (11): 986-991.

Rognes, T (2001). ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches. Nucleic Acids Research 29 (7): 1647-1652.

Storz, G (2002). An expanding universe of noncoding RNAs. Science 296: 1260-1263.