SSpro, a web server for protein secondary structure prediction based on recurrent neural networks

Gianluca Pollastri, Pierre Baldi

Dept. of Information and Computer Science

University of California, Irvine

Irvine, CA 92697-3425, U.S.A.

{gpollast,pfbaldi}@ics.uci.edu

 

SSpro is a fully automated system for the prediction of protein secondary structure. The system is based on an ensemble of bidirectional recurrent neural networks (BRNNs) [1, 2]. BRNNs are graphical models that learn from data the mapping between an input sequence and an output sequence of variable length. The model is based on two hidden Markov chains, a forward chain and a backward chain, that transmit information in both directions along the sequence, between the input and the output sequences. Three neural networks model, respectively, the forward state update, the backward state update, and the transition from the input and hidden states to the output. BRNNs are trained in a supervised fashion by gradient descent. The error signal is propagated through the model with the backpropagation through structure (BPTS) algorithm [3], an extension of the backpropagation through time (BPTT) algorithm used for unidirectional recurrent neural networks.
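The data flow described above can be illustrated with a minimal forward-pass sketch. It is not the SSpro implementation: the single-layer tanh networks, the layer sizes, the random initialisation and all function names are assumptions chosen for brevity, and the architecture in [1, 2] may differ.

```python
import math
import random

def dense(weights, bias, x, activation):
    """One fully connected layer: activation(W x + b)."""
    return [activation(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]

def softmax(z):
    """Normalised exponentials, shifted by the maximum for numerical stability."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (an arbitrary illustrative choice)."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return w, [0.0] * n_out

def brnn_forward(inputs, params, n_hidden):
    """Return one class distribution per sequence position."""
    T = len(inputs)
    zero = [0.0] * n_hidden

    # Forward chain: the state at t depends on the state at t-1 and the input at t.
    fwd, state = [], zero
    for t in range(T):
        state = dense(*params["forward"], state + inputs[t], math.tanh)
        fwd.append(state)

    # Backward chain: the state at t depends on the state at t+1 and the input at t.
    bwd, state = [None] * T, zero
    for t in reversed(range(T)):
        state = dense(*params["backward"], state + inputs[t], math.tanh)
        bwd[t] = state

    # Output network: combines both hidden states with the input at t.
    outputs = []
    for t in range(T):
        logits = dense(*params["output"], fwd[t] + bwd[t] + inputs[t], lambda v: v)
        outputs.append(softmax(logits))
    return outputs

# Hypothetical sizes: 20 input features per residue, 8 hidden units, 3 classes.
rng = random.Random(0)
n_in, n_hidden, n_classes = 20, 8, 3
params = {
    "forward": init_layer(n_hidden, n_hidden + n_in, rng),
    "backward": init_layer(n_hidden, n_hidden + n_in, rng),
    "output": init_layer(n_classes, 2 * n_hidden + n_in, rng),
}
sequence = [[rng.random() for _ in range(n_in)] for _ in range(15)]
predictions = brnn_forward(sequence, params, n_hidden)
```

Because the backward chain is computed from the end of the sequence, the prediction at each position can depend on residues on both sides of it, which is the point of the bidirectional design.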

The system is trained on a set of 1180 structures and tested on a set of 126 structures. The test set is the same one on which the first version of the PHD server [4] was trained. The training set is extracted from the release of the Protein Data Bank that was online in April 1999. Structures determined by NMR, or with a resolution worse than 2.0 Ångstroms, are removed first. An all-against-all redundancy reduction procedure is then run, using a rigorous Smith-Waterman algorithm with the PAM120 matrix for pairwise alignments, and a sequence is discarded if it shows more than 25% identity to any sequence in the test set. The same threshold holds for each pair of sequences in the test set. A second all-against-all redundancy reduction procedure is then run on the resulting set, using a threshold of 50% sequence identity.
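The two filtering stages can be sketched as follows. This is only an illustration under stated assumptions: `naive_identity` is a hypothetical placeholder that compares positions directly and is not the Smith-Waterman/PAM120 alignment used for SSpro, and the greedy order in which redundant sequences are dropped is our assumption, since the text does not specify it.

```python
def naive_identity(a, b):
    """Placeholder identity: fraction of matching positions over the shorter
    sequence. NOT Smith-Waterman; a real pipeline would align the pair first."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n

def remove_similar_to(candidates, reference, threshold, identity=naive_identity):
    """Stage 1: drop candidates that exceed the threshold against any reference."""
    return [s for s in candidates
            if all(identity(s, r) <= threshold for r in reference)]

def reduce_redundancy(sequences, threshold, identity=naive_identity):
    """Stage 2: greedy all-against-all filter; keep a sequence only if it does
    not exceed the threshold against every sequence kept so far."""
    kept = []
    for s in sequences:
        if all(identity(s, k) <= threshold for k in kept):
            kept.append(s)
    return kept

# Toy sequences, invented for this example.
test_set = ["ACDEFGHIKL"]
candidates = ["ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY", "MNPQRSTVWA", "LKIHGFEDCA"]

step1 = remove_similar_to(candidates, test_set, 0.25)  # 25% against the test set
training_set = reduce_redundancy(step1, 0.50)          # 50% within the training set
print(training_set)  # ['MNPQRSTVWY', 'LKIHGFEDCA']
```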

The target secondary structure assignments are compiled with the program DSSP [5]. We assign to the class Helix the alpha-helix (H) and 3₁₀-helix (G) DSSP classes, to Strand the classes extended strand (E) and beta bridge (B), and to Coil the other four classes, consistent with the CASP classification.
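The class reduction described above amounts to a small lookup table. In the sketch below, the three-letter output alphabet (H, E, C) and the function name are our labelling choices; every DSSP symbol other than H, G, E and B is mapped to Coil.

```python
# DSSP symbols that map to Helix (H) or Strand (E); everything else is Coil (C).
DSSP_TO_THREE = {"H": "H", "G": "H", "E": "E", "B": "E"}

def reduce_dssp(dssp_string):
    """Convert a DSSP assignment string to the three-class alphabet."""
    return "".join(DSSP_TO_THREE.get(symbol, "C") for symbol in dssp_string)

print(reduce_dssp("HHGGIEEBTS "))  # HHHHCEEECCC
```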

The system takes as input a profile obtained from a multiple alignment of protein sequences. The multiple alignments are compiled with the program BLAST [6], using default parameters. The sequence database is the NR database that was online in October 1999 (roughly 420,000 sequences). No further checks or filters are applied to the database. Every sequence in the alignment is assigned a weight proportional to the information the sequence carries with respect to the unweighted profile. A weighted profile is then compiled and used as input for the system.
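The weighting formula is not spelled out here, so the following sketch shows only one possible reading: the information of a sequence is taken to be its total surprisal (the sum of -log2 of the unweighted column frequency of each of its residues). This interpretation, the gap symbol "-", and all function names are assumptions, not the documented SSpro procedure.

```python
import math
from collections import Counter

def column_profile(alignment, weights):
    """Per-column residue frequencies, with each sequence counted by its weight."""
    profile = []
    for j in range(len(alignment[0])):
        counts = Counter()
        for seq, w in zip(alignment, weights):
            if seq[j] != "-":  # gaps do not contribute to the profile
                counts[seq[j]] += w
        total = sum(counts.values())
        profile.append({aa: c / total for aa, c in counts.items()} if total else {})
    return profile

def information_weights(alignment):
    """Assumed reading: weight proportional to the surprisal of the sequence
    under the unweighted profile, normalised so that the weights sum to 1."""
    unweighted = column_profile(alignment, [1.0] * len(alignment))
    raw = [sum(-math.log2(unweighted[j][aa])
               for j, aa in enumerate(seq) if aa != "-")
           for seq in alignment]
    total = sum(raw)
    if total == 0:  # all sequences identical: fall back to uniform weights
        return [1.0 / len(alignment)] * len(alignment)
    return [r / total for r in raw]

# Toy alignment, invented for this example.
alignment = ["ACD", "ACD", "ACE", "GCE"]
weights = information_weights(alignment)
weighted_profile = column_profile(alignment, weights)
```

Under this reading, the sequence "GCE" receives the largest weight because its first residue is rare in the unweighted profile, so divergent sequences count more than near-duplicates.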

A set of 11 bidirectional recurrent neural networks is trained on the dataset. For details on the implementation, see [1, 2]. The networks contain roughly 70,000 adjustable weights, have normalised exponentials (softmax) on the outputs, and are trained by minimising the relative entropy between the target and output distributions. The final predictions are obtained by averaging the outputs of the networks for each residue. A performance of approximately 76.5% correct residue classification is observed on our independent test set (roughly 80% on the training set).
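The training criterion, the ensemble averaging and the per-residue accuracy measure can be sketched as below. The function names, the H/E/C labels and the toy numbers are illustrative assumptions; the example uses two networks instead of 11 to keep it short.

```python
import math

def relative_entropy(target, output, eps=1e-12):
    """Relative entropy (KL divergence) between target and output distributions."""
    return sum(t * math.log(t / max(o, eps))
               for t, o in zip(target, output) if t > 0)

def ensemble_average(per_network_outputs):
    """Average class distributions over networks, position by position.
    Input shape: [network][position][class]."""
    n = len(per_network_outputs)
    return [[sum(class_values) / n for class_values in zip(*position)]
            for position in zip(*per_network_outputs)]

def predict_classes(distributions, labels="HEC"):
    """Pick the most probable class at each position."""
    return "".join(labels[max(range(len(p)), key=p.__getitem__)]
                   for p in distributions)

def residue_accuracy(predicted, observed):
    """Fraction of residues assigned to the correct class."""
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)

# Toy outputs of two networks for a two-residue sequence (invented numbers).
net1 = [[0.7, 0.2, 0.1], [0.2, 0.3, 0.5]]
net2 = [[0.5, 0.4, 0.1], [0.1, 0.6, 0.3]]
averaged = ensemble_average([net1, net2])
print(predict_classes(averaged))  # HE
```

With a one-hot target, the relative entropy reduces to the negative log-probability assigned to the correct class, so `relative_entropy([1, 0, 0], [0.5, 0.25, 0.25])` equals log 2.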

SSpro is available as a web server at the following address:

http://promoter.ics.uci.edu/BRNN-PRED/

 

[1] P. Baldi, S. Brunak, P. Frasconi, G. Pollastri, and G. Soda. Bidirectional Dynamics for Protein Secondary Structure Prediction, Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI99), Stockholm, Sweden (1999).

[2] P. Baldi, S. Brunak, P. Frasconi, G. Pollastri and G. Soda. Exploiting the Past and the Future in Protein Secondary Structure Prediction. Bioinformatics, 15:937-946, (1999).

[3] P. Frasconi, M. Gori, A. Sperduti. A General Framework for Adaptive Processing of Data Structures. IEEE Trans. on Neural Networks, 9(5):768-786, (1998).

[4] B. Rost, C. Sander. PHD - An automatic mail server for protein secondary structure prediction. Comput. Appl. Biosci., 10(1):53-60, (1994).

[5] W. Kabsch, C. Sander. Dictionary of protein secondary structure: pattern recognition of hydrogen bonded and geometrical features. Biopolymers, 22:2577-2637, (1983).

[6] S. F. Altschul, T. L. Madden, A. A. Schaffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25:3389-3402, (1997).