Detection of non-coding RNA genes by searching for transcription signals in intergenic regions.
Background
Non-coding RNA (ncRNA) genes produce transcripts that function directly as RNA, without being translated into protein. Unlike protein-coding genes, ncRNA genes do not have strong transcription signals. This study investigates a variant of a previously proposed and tested method for detecting ncRNA genes. It is part of a larger project in which many such methods are to be combined into a general-purpose ncRNA-finding program.
There are many possible ways to locate ncRNA genes. These genes must be transcribed to produce ncRNA, and must therefore be flanked by sequence regions that regulate transcription. Good candidates for new ncRNA genes are therefore the parts of intergenic sequences where transcription signals are present. Searching for transcription signals has previously been applied with success to find ncRNA genes in the bacterium Escherichia coli (E. coli) and in yeast. The strategy was later applied once more to the E. coli genome, with some success, by Chen et al.
Methods
The method chosen in this study is a version of the above-mentioned search for transcription signals. Eight promoter consensus sequences have been suggested using data from earlier studies; together they cover the promoter sequences of five of the seven known sigma factors in E. coli. A novel promoter sequence score function has been created and implemented in a new promoter search algorithm. This promoter search has been combined with an implementation of a previously developed terminator search and scoring algorithm.
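As an illustration of what a consensus-based promoter search can look like, the following Python sketch scans a sequence for a two-box promoter (a -35 and a -10 hexamer separated by a variable spacer) and scores each window by the fraction of positions that match the consensus. The hexamers TTGACA and TATAAT are the textbook sigma 70 consensus; the scoring rule, the spacer range, the threshold and all function names are assumptions made for this example, and the sketch is not the score function developed in this study.

```python
def box_matches(window, consensus):
    """Count positions where the window agrees with the consensus box."""
    return sum(1 for a, b in zip(window, consensus) if a == b)


def scan_promoters(seq, box35="TTGACA", box10="TATAAT",
                   spacers=range(15, 20), threshold=0.75):
    """Return (start, spacer, score) for every window scoring >= threshold.

    Assumed scoring rule: matching positions in both boxes divided by
    the total box length. Only the given strand is scanned.
    """
    seq = seq.upper()
    n35, n10 = len(box35), len(box10)
    hits = []
    for start in range(len(seq)):
        for spacer in spacers:
            start10 = start + n35 + spacer
            if start10 + n10 > len(seq):
                continue  # the -10 box would run past the sequence end
            matches = (box_matches(seq[start:start + n35], box35)
                       + box_matches(seq[start10:start10 + n10], box10))
            score = matches / (n35 + n10)
            if score >= threshold:
                hits.append((start, spacer, score))
    return hits


# Hypothetical test sequence: perfect boxes separated by a 17-base spacer.
example = "GG" + "TTGACA" + "A" * 17 + "TATAAT" + "GG"
print(scan_promoters(example, threshold=1.0))
```

A real search would also scan the reverse complement and would probably weight positions by how strongly they are conserved, rather than counting every match equally.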
The output data have been analyzed by comparing the candidates to 52 verified and 1056 suggested ncRNAs. The number of located promoters has been compared with the estimated number of promoter hits that would occur in a random sequence that maintains the basic features of the original input string. Some of the output data have also been included in multiple alignments with intergenic regions of genomes from bacteria closely related to E. coli.
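The comparison against a random sequence can be sketched as follows. This Python example counts approximate motif hits in a sequence and estimates the expected count by shuffling the same sequence, which preserves only its base composition. Which "basic features" the study actually preserved is not stated here, so the shuffling model, the mismatch-based hit definition and all names below are assumptions for illustration.

```python
import random


def count_hits(seq, motif, max_mismatches=1):
    """Count windows differing from motif in at most max_mismatches positions."""
    k = len(motif)
    count = 0
    for i in range(len(seq) - k + 1):
        mismatches = sum(1 for a, b in zip(seq[i:i + k], motif) if a != b)
        if mismatches <= max_mismatches:
            count += 1
    return count


def shuffled_baseline(seq, motif, trials=100, max_mismatches=1, seed=0):
    """Mean hit count over shuffled copies of seq (base composition preserved)."""
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    letters = list(seq)
    total = 0
    for _ in range(trials):
        rng.shuffle(letters)
        total += count_hits("".join(letters), motif, max_mismatches)
    return total / trials
```

An observed count from `count_hits(intergenic, motif)` that is clearly higher than `shuffled_baseline(intergenic, motif)` suggests that the motif occurs more often than base composition alone would explain. Here `intergenic` stands for a hypothetical input string.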
Results
At least three novel promoter consensus sequences for the E. coli RNA polymerase have been suggested during this study. A novel promoter sequence scoring algorithm has been implemented together with a previously used method to locate rho-independent terminators in E. coli. The implemented program can search for eight different promoter sequences using user-defined thresholds.
The program's candidates have been compared against the suggested and verified ncRNAs. This comparison shows a very low hit ratio. The program's hit ratio has also been compared with the random case to verify the significance of the search criteria.
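One simple way to compute such a hit ratio is to treat candidates and known ncRNAs as genomic intervals and count how many known ncRNAs are overlapped by at least one candidate. The Python sketch below does this. The definition of a hit as any overlap, the half-open coordinates, the decision to ignore strand and the function names are all assumptions for this example, since the exact criterion used by the program is not described here.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]


def hit_ratio(candidates, known):
    """Fraction of known ncRNA intervals overlapped by at least one candidate."""
    if not known:
        return 0.0
    hits = sum(1 for k in known if any(overlaps(c, k) for c in candidates))
    return hits / len(known)


# Hypothetical coordinates: one of the two known intervals is hit.
print(hit_ratio([(10, 50)], [(40, 60), (100, 120)]))  # prints 0.5
```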
Using about 850 ncRNA candidates from the program, multiple alignments have been made to intergenic regions in related bacteria. This has resulted in a suggestion of 20 novel ncRNAs that have a high level of conservation and high scores on promoter and terminator regions. Of the 20 suggested ncRNA candidates, two were inside already known ncRNA genes, which leaves 18 novel ncRNA candidates.
The search program developed in this study can be downloaded from http://folk.uio.no/gardt/Hovedfag/index.html, along with the BioJava packages it needs. The site also provides the Java code, the JavaDoc for the program, and the file containing the intergenic regions of E. coli that were used in this study.
Conclusion
This study concludes with a suggestion of 18 novel ncRNA candidates. The search algorithm and criteria used in this study represent a slightly new approach to the problem of detecting ncRNAs, especially by including searches for promoters recognized by sigma factors other than the widely used sigma 70. Analyses have shown that the program has a low hit ratio on already known or suggested ncRNAs; other analyses, however, have shown that the promoter consensus sequences used in this search are significant in the promoter sequences of protein-coding genes. The difficulty of detecting ncRNAs therefore lies rather in their weak transcription signals.
Of the 18 candidates, none has structural similarities with known ncRNA families. This is not remarkable: had they shown such similarities, they would already have been known. Consequently, the 18 candidates either represent novel families of ncRNAs or are false positives. Whether they are real ncRNA genes will be settled when the 18 candidates are tested in the laboratory.
As an independent program for ncRNA detection, this program is not well suited as of today but, as indicated above, it might be a useful tool when combined with other analyses.