Improved Splice Site Detection in Genie

Martin G. Reese, Frank H. Eeckman, David Kulp, David Haussler

Reviewer

Alistair G. Robertson (alistair{dot}robertson{at}gmail{dot}com)

Reference

Reese MG, Eeckman FH, Kulp D, Haussler D, 1997. "Improved Splice Site Detection in Genie". J Comp Biol 4(3), 311-23.

Keywords

Artificial Neural Network, Computational Biology, Generalized Hidden Markov Model, Splice

Related Papers

Reese, M.G., and Eeckman, F.H. (1996). "Splice Sites: A detailed neural network study", Proceedings of the 1996 Genome Mapping & Sequencing Meeting, Cold Spring Harbor, New York, arranged by D. Bentley, E. Green and P. Hieter.
FO Desmet, D Hamroun, M Lalande, G Collod-Beroud, M Claustres, C Beroud. Human Splicing Finder: an online bioinformatics tool to predict splicing signals. Nucleic Acids Research, 2009.
M Pertea, X Lin, SL Salzberg. GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Res, 29:1185-90, 2001.
S.M. Hebsgaard, P.G. Korning, N. Tolstrup, J. Engelbrecht, P. Rouze, S. Brunak: Splice site prediction in Arabidopsis thaliana DNA by combining local and global sequence information, Nucleic Acids Research, 1996, Vol. 24, No. 17, 3439-3452.
Brunak, S., Engelbrecht, J., and Knudsen, S.: Prediction of Human mRNA Donor and Acceptor Sites from the DNA Sequence, Journal of Molecular Biology, 1991, 220, 49-65.
J. W. Fickett and C.-S. Tung. Assessment of protein coding measures. Nucl. Acids Res., 20:6441-6450, 1992.
V. Solovyev, A. Salamov, and C. Lawrence. Predicting internal exons by oligonucleotide composition and discriminant analysis of splicable open reading frames. Nucl. Acids Res., 22:5156-5163, 1994.
S. Dong and D. B. Searls. Gene structure prediction by linguistic methods. Genomics, 162:705-708, 1994.
L. R. Rabiner and B. H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4-16, January 1986.
D. Kulp, D. Haussler, M. Reese, and F. Eeckman. A generalized hidden Markov model for the recognition of human genes in DNA. In ISMB-96, St. Louis, June 1996. AAAI Press.

Summary

In this paper, researchers from the University of California, Berkeley Drosophila Genome Project (BDGP) present improved methods for splice site detection in DNA sequences. The methods utilise a Generalized Hidden Markov Model (GHMM) to describe a legal parse of a multi-exon DNA sequence. In such a model, hidden states represent sections of DNA that are classified as coding, non-coding or untranslated regions.

Artificial Neural Networks (ANNs) are used to sense candidate locations for donor and acceptor splice sites. As with standard Hidden Markov Models, the GHMM uses a variant of the Viterbi dynamic programming algorithm to calculate the most likely sequence of hidden states given a sequence of output tokens. Because the ANNs narrow down the candidate locations for splice sites, an optimal parse can be found much faster.
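To make the interplay between candidate splice sites and the Viterbi-style search concrete, the following is a minimal sketch of segment-based (generalized HMM) decoding in which segments may only begin or end at candidate positions. It is not Genie's implementation: the function `ghmm_viterbi`, the two-state exon/intron model, the toy scoring function and the hand-picked candidate set are all assumptions made for illustration. In the paper the candidates come from the neural networks; here they are simply supplied as a set of positions.

```python
import math

NEG_INF = float("-inf")

def ghmm_viterbi(n, states, log_init, log_trans, segment_score, boundaries):
    """Best-scoring parse of a length-n sequence into labelled segments.

    Segments may only start or end at 0, n, or a position in `boundaries`,
    so the search cost grows with the number of candidates, not with n**2.
    """
    cuts = sorted({0, n} | {b for b in boundaries if 0 < b < n})
    best = {}  # (end, state) -> (score, segment start, previous state)
    for j in cuts[1:]:
        for s in states:
            top = (NEG_INF, None, None)
            for i in cuts:
                if i >= j:
                    break
                seg = segment_score(s, i, j)
                if i == 0:
                    # First segment of the parse: use the initial probability.
                    score = log_init.get(s, NEG_INF) + seg
                    if score > top[0]:
                        top = (score, 0, None)
                else:
                    for p in states:
                        score = (best[(i, p)][0]
                                 + log_trans.get((p, s), NEG_INF) + seg)
                        if score > top[0]:
                            top = (score, i, p)
            best[(j, s)] = top

    end_state = max(states, key=lambda s: best[(n, s)][0])
    score = best[(n, end_state)][0]
    if score == NEG_INF:
        raise ValueError("no legal parse under the given model")

    # Follow back-pointers from the end of the sequence to position 0.
    parse, j, s = [], n, end_state
    while j > 0:
        _, i, p = best[(j, s)]
        parse.append((s, i, j))
        j, s = i, p
    parse.reverse()
    return score, parse

# Toy data (hypothetical): 'E' marks exon-like and 'I' intron-like positions.
TOY_SEQ = "EEEEIIIIIIEEE"

def toy_segment_score(state, start, end):
    """Log-score of labelling TOY_SEQ[start:end] with `state`."""
    want = "E" if state == "exon" else "I"
    return sum(math.log(0.9 if c == want else 0.1) for c in TOY_SEQ[start:end])

score, parse = ghmm_viterbi(
    n=len(TOY_SEQ),
    states=["exon", "intron"],
    log_init={"exon": 0.0},                # a parse must begin with an exon
    log_trans={("exon", "intron"): 0.0,    # states must alternate
               ("intron", "exon"): 0.0},
    segment_score=toy_segment_score,
    boundaries={4, 7, 10},                 # 7 plays the role of a false candidate
)
print(parse)
```

In this toy run the false candidate at position 7 is ignored, because any parse that uses it must mislabel three positions. A real segment score would also include a length-distribution term for each state, which the sketch leaves out.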

Besides the ANNs used to identify donor and acceptor splice sites, machine learning algorithms are used to model the probability distributions of lengths of DNA for GHMM hidden states (which are themselves submodels). The authors also present improvements in this area.

Results are presented comparing the sensitivity and specificity of the new and old sets of algorithms. Additionally, results are produced comparing the presented methodologies against a set of seven competing algorithms from other research labs. Seven-fold cross-validation results are also presented, showing an impressive specificity of 85% and a sensitivity of 86% on average.
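For readers unfamiliar with these two measures, a small sketch of how they might be computed for predicted splice site positions follows. It assumes the convention common in gene-finding evaluations, where specificity is the fraction of predictions that are correct, TP / (TP + FP); whether the paper uses exactly this definition is an assumption here. The function name and the example positions are hypothetical.

```python
def sensitivity_specificity(predicted, annotated):
    """Return (sensitivity, specificity) for two collections of site positions.

    sensitivity = TP / (TP + FN): fraction of true sites that were found.
    specificity = TP / (TP + FP): fraction of predicted sites that are true
    (the gene-finding convention, assumed here).
    """
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)   # predicted and annotated
    fp = len(predicted - annotated)   # predicted but not annotated
    fn = len(annotated - predicted)   # annotated but missed
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp

# Made-up positions: three of four true sites found, two false predictions.
sn, sp = sensitivity_specificity(
    predicted=[100, 250, 410, 620, 700],
    annotated=[100, 250, 400, 620],
)
print(sn, sp)  # 0.75 0.6
```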

Finally, the authors discuss some unexpected aspects of the findings, such as the unusually low scores of false splice site motifs that lie close to actual splice sites.

Evaluation

The quality of the researchers is very high. The main researcher, Martin G. Reese, holds a PhD from the University of California, Berkeley, and the University of Hohenheim. The others include a Fulbright Scholar and PhD (UC Santa Cruz) in Bioinformatics, another Santa Cruz PhD, and David Haussler, professor of Biomedical Engineering at UC Santa Cruz. The research team has expertise in both Biology and Computer Science, two core disciplines of Bioinformatics, and they hold high-level qualifications from well-recognised institutions.

The title is informative and meaningful to those who are familiar with the research group's earlier work. However, it fails to convey to outsiders the main purpose of the paper, which is to accurately annotate DNA sequences computationally.

The introduction sets the context well, covering the speed at which genetic data is being published and the resulting need for automatic sequence annotation methods. The authors show a good understanding of the wider literature and discuss some similarities and differences with existing, analogous methods.

The methods presume the reader is familiar with Hidden Markov Models and Artificial Neural Networks. The methods are well presented, including a canonical introduction to Generalized Hidden Markov Models and a robust explanation of how these two techniques and other machine learning algorithms work together as an ensemble. The researchers also declare their training datasets and provide a link to them on the accompanying website. This would allow replication of results by others and further enhances the legitimacy of this scientific publication.

Seven-fold cross-validation results are published showing a significant improvement over the results in an earlier publication by the same research group. Reassuringly, the subsamples used were evaluated for homology against the training sets to ensure that evolutionarily closely related sequences, of little material difference to the training set, were not used as independent subsamples.
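One common way to keep closely related sequences from leaking between training and test data is to assign whole homology clusters to a single fold. The sketch below illustrates that idea only; it is not the authors' procedure, which the paper describes as a homology check of the subsamples against the training sets. The function `grouped_folds`, the cluster labels and the sequence identifiers are hypothetical, and the clustering itself is assumed to have been done beforehand.

```python
from collections import defaultdict

def grouped_folds(cluster_of, k=7):
    """Split sequence ids into k folds without splitting any cluster.

    cluster_of maps each sequence id to a homology cluster label.
    Clusters are placed largest first into the currently smallest fold.
    """
    clusters = defaultdict(list)
    for seq_id, cluster in cluster_of.items():
        clusters[cluster].append(seq_id)

    folds = [[] for _ in range(k)]
    for cluster in sorted(clusters, key=lambda c: (-len(clusters[c]), c)):
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(sorted(clusters[cluster]))
    return folds

# Toy assignment: s1/s2 and s4/s5 are treated as homologous pairs.
toy_clusters = {
    "s1": "A", "s2": "A", "s3": "B", "s4": "C", "s5": "C",
    "s6": "D", "s7": "E", "s8": "F", "s9": "G", "s10": "H",
}
folds = grouped_folds(toy_clusters, k=7)
for index, fold in enumerate(folds):
    print(index, fold)
```

Each fold then serves once as the held-out test set while the remaining six are used for training, so no test sequence has a close relative in its training data.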

A set of comparative results with similar research projects was also published as well as the results of an already established testing scheme (Burset and Guigo 1996). These efforts ensure that the results are comparable with the literature and facilitate the adoption and relevance of the research.

Finally, the discussion highlights some interesting aspects of the results, but is careful not to draw hasty conclusions and avoids speculation. The authors conclude with suggestions for future research, alerting the reader to the likely direction the subsequent research will take. This paper is well written and breaks new ground in the computational annotation of DNA sequences, an important area of Bioinformatics.