E1397

COMPARISON OF SEQUENCES AND STRUCTURES. W. Taylor, National Inst. For Medical Res., London NW7 1AA,UK

The construction of a model for a protein of unknown structure based on a similar sequence of known structure requires accurate alignment of the sequences. When the sequences are highly divergent it becomes necessary to use constraints derived from the known structure.

Structural constraints can be imposed at two levels: the simpler occurs locally in the sequence and can take the form of a bias to match known secondary structure with predicted secondary structure of like type, or to match regions of sequence variation with regions known to be exposed to solvent in the structure. These constraints can all be imposed during the alignment of the two sequences. The more complex constraints involve interactions that are non-local in the sequence - such as the preference to have cysteins sufficiently close to form a disulphide bridge, or hydrophobic residues packed in the core. Following the work of Eisenberg, simple constraints will be referred to as "1D/3D comparison" while the establishment of an alignment using true spatial interaction will be referred to as `threading' (Jones, Taylor and Thornton, 1992, Nature, 358:86-89).

A dynamic programming algorithm can find the optimal alignment of A and B given a score relating all pairs of positions. To align a sequence with a structure, these scores would need to be a measure of how well, say, residue ai fits in location bj. This quantity cannot be obtained until the full environment around position bj is determined. The problem is circular since the final alignment (or a good part of it) must be known in order to calculate it. The problem can be overcome by inverting the approach and asking not "how `happy' is ai on bj?" but "if ai is placed on bj, then how `happy' can it be made?". Consequently, the dynamic programming algorithm is applied to the matrix (ijR) in which each element (ijRmn) is a measure of the interaction of ai (on bj) with am on bn. The best path through this matrix has an associated score, which provides a measure by which the validity of the original assumption (of placing ai on bj) can be assessed. These scores can be taken to form the higher-level matrix (S), from which a set of equivalences can be extracted by a dynamic programming calculation finding the highest scoring path compatible with a sequence alignment. Following the use of the same algorithm in protein structure comparison (Taylor and Orengo, 1989, JMB, 208:1-22), the individual scores (ijRmn) along the optimal path traced in each low-level matrix (ijR) were summed to produce the high-level matrix.