E1397

COMPARISON OF SEQUENCES AND STRUCTURES. W. Taylor, National Inst. For Medical Res., London NW7 1AA,UK

The construction of a model for a protein of unknown structure based on a similar sequence of known structure requires accurate alignment of the sequences. When the sequences are highly divergent it becomes necessary to use constraints derived from the known structure.

Structural constraints can be imposed at two levels: the simpler occurs locally in the sequence and can take the form of a bias to match known secondary structure with predicted secondary structure of like type, or to match regions of sequence variation with regions known to be exposed to solvent in the structure. These constraints can all be imposed during the alignment of the two sequences. The more complex constraints involve interactions that are non-local in the sequence - such as the preference to have cysteins sufficiently close to form a disulphide bridge, or hydrophobic residues packed in the core. Following the work of Eisenberg, simple constraints will be referred to as "1D/3D comparison" while the establishment of an alignment using true spatial interaction will be referred to as `threading' (Jones, Taylor and Thornton, 1992, Nature, 358:86-89).

A dynamic programming algorithm can find the optimal alignment of A and B given a score relating all pairs of positions. To align a sequence with a structure, these scores would need to be a measure of how well, say, residue a_i fits in location b_j. This quantity cannot be obtained until the full environment around position b_j is determined. The problem is circular since the final alignment (or a good part of it) must be known in order to calculate it. The problem can be overcome by inverting the approach and asking not "how `happy' is a_i on b_j?" but "if a_i is placed on b_j, then how `happy' can it be made?". Consequently, the dynamic programming algorithm is applied to the matrix (^ijR) in which each element (^ijR_mn) is a measure of the interaction of a_i (on b_j) with a_m on b_n. The best path through this matrix has an associated score, which provides a measure by which the validity of the original assumption (of placing a_i on b_j) can be assessed. These scores can be taken to form the higher-level matrix (S), from which a set of equivalences can be extracted by a dynamic programming calculation finding the highest scoring path compatible with a sequence alignment. Following the use of the same algorithm in protein structure comparison (Taylor and Orengo, 1989, JMB, 208:1-22), the individual scores (^ijR_mn) along the optimal path traced in each low-level matrix (^ijR) were summed to produce the high-level matrix.