Feature article
DeepMind and CASP14
What do I as a protein crystallographer think of the use of AI by DeepMind to predict protein folds?
The Critical Assessment of Techniques for Protein Structure Prediction (CASP) challenge is a serious and very well run effort, carried out over recent decades by a global team of organisers: Dr John Moult (chair), University of Maryland, USA; Dr Krzysztof Fidelis, UC Davis, USA; Dr Andriy Kryshtafovych, UC Davis, USA; Dr Torsten Schwede, University of Basel and SIB Swiss Institute of Bioinformatics, Switzerland; and Dr Maya Topf, Birkbeck, University of London, UK and CSSB (HPI and UKE) Hamburg, Germany. The CASP14 press release of 30 November 2020 (https://predictioncenter.org/casp14/doc/CASP14_press_release.html) stated that:
'Today (Monday), researchers at the 14th Community Wide Experiment on the CASP14 will announce that an artificial intelligence (AI) solution to the challenge has been found.' And 'During the latest round of the challenge, DeepMind's AlphaFold program has determined the shape of around two-thirds of the proteins with accuracy comparable to laboratory experiments*. AlphaFold's accuracy with most of the other proteins was also high, though not quite at that level.'
The asterisk above is all-important to exploring precisely what has been achieved in this undoubted breakthrough:
'Notes to editors: *AlphaFold produced models for about two-thirds of the CASP14 target proteins (JRH: 84 targets; full details are here: https://predictioncenter.org/casp14/index.cgi) with global distance test scores above 90 out of 100. Above the 90-score threshold, remaining differences between the models and the experimental structures are small and of the size expected for experimental artefacts and errors and alternative low energy local conformations. Note that these CASP targets are single proteins or domains, not protein complexes, which are a next frontier. The global distance test is a measure of how closely the shape of the protein model matches the shape from lab experiments [1, 2].'
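To make the global distance test (GDT) score concrete, here is a minimal Python sketch of the widely used GDT_TS variant: the average, over distance cutoffs of 1, 2, 4 and 8 Å, of the percentage of alpha-carbon positions in the model lying within that cutoff of the experimental position. It assumes the two structures have already been superposed and have matching residues. The full procedure [1, 2] searches over many superpositions to maximise each count, so this simplified version can only give a lower bound on the true score. The function name and the toy coordinates are illustrative assumptions, not taken from CASP.

```python
import math


def gdt_ts(model_ca, reference_ca, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS (0-100) for two ALREADY SUPERPOSED C-alpha traces.

    Assumption: no superposition search is done here, unlike the real
    procedure, so the result is a lower bound on the true GDT_TS.
    """
    if len(model_ca) != len(reference_ca) or not reference_ca:
        raise ValueError("traces must be non-empty and of equal length")
    # Per-residue deviation between model and experiment, in angstroms.
    deviations = [math.dist(m, r) for m, r in zip(model_ca, reference_ca)]
    # Fraction of residues within each cutoff, then averaged over cutoffs.
    fractions = [
        sum(d <= cutoff for d in deviations) / len(deviations)
        for cutoff in cutoffs
    ]
    return 100.0 * sum(fractions) / len(cutoffs)


# Toy example (hypothetical coordinates): four residues whose model
# positions deviate from experiment by 0.5, 1.5, 3.0 and 9.0 angstroms.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
model = [(0.0, 0.5, 0.0), (3.8, 1.5, 0.0), (7.6, 3.0, 0.0), (11.4, 9.0, 0.0)]
print(gdt_ts(model, reference))  # 56.25
```

A score above 90 therefore requires nearly all alpha carbons to be within 1 to 2 Å of the experimental positions. Because only alpha carbons enter the score, it says nothing directly about side-chain atoms.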
A dramatic advance has happened then, but what, exactly?
This is clearly a breakthrough in the field of prediction of protein folds; all the expert judges of the CASP series of challenges agree on that. To understand it better, we need to clarify what a fold is and what a structure is. The polypeptide chain fold is described by simply linking the alpha carbon of each amino acid residue in the protein's sequence, one after the other. This leads to the well known ribbon representation of a protein. The structure goes further: where the individual atoms sit is a critical aspect and, as W. L. Bragg most famously said, with crystallography we can see atoms.
The experimental probes of X-rays, electrons and neutrons in crystallography, and electrons in microscopy, as well as NMR, strive for precise protein models derived as the best fit to their measurements. Accuracy, meaning as close as we can get to the truth, requires more than one method; combining methods in this way is termed integrated structural biology. Our experimental methods have weaknesses though. There is radiation damage to the protein structure, if not usually to the fold, by X-rays or electrons according to the dose absorbed by the crystal sample. Then there are changes to the details of the structure and its structural dynamics according to whether cryo-temperatures are used. In recent years experimentalists have explored more and more of these changes, termed 'artefacts', and strive for damage-free structures at physiologically relevant temperatures. Cryo-structures can be substantiated by accompanying the structure with a functional assay and/or by determining at least one of a group of structures at room temperature, including comparing structures in solution with those in the crystal, for example by solution X-ray or neutron scattering. Characterising the precision achieved for a protein structure is a big topic, but in 1999 Durward Cruickshank formulated the Diffraction Precision Index [3] to assist with this assessment. The CASP14 press release's sentence, "Above the 90-score threshold, remaining differences between the models and the experimental structures are small and of the size expected for experimental artefacts and errors, and alternative low energy local conformations", has, I think, to be assessed carefully, case by case, with the precision of placement examined atom by atom. I suspect that a new terminology has arrived: prediction artefacts, to add to our own list of experimental artefacts.
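For orientation, the Diffraction Precision Index (DPI) of [3] is commonly quoted in the form below. The symbol names follow common usage rather than this article, and should be checked against [3] before being relied upon.

```latex
\sigma(x, B_{\mathrm{avg}}) = 1.0 \left( \frac{N_i}{p} \right)^{1/2} C^{-1/3} \, R \, d_{\min},
\qquad p = n_{\mathrm{obs}} - n_{\mathrm{params}}
```

Here N_i is the number of fully occupied atoms, n_obs the number of unique reflections, n_params the number of refined parameters, C the fractional completeness of the data, R the crystallographic R factor and d_min the resolution limit in Å. The result is an estimate, in Å, of the coordinate uncertainty of an atom with the average B factor. It is one number for a whole experimental structure, which is why atom-by-atom comparison with a predicted model needs more care.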
Some history of the protein folding problem: key steps and how the folding problem has held great scientific minds over the last 60 years
In 1969 Cyrus Levinthal [4] noted that, because of the vast number of degrees of freedom in an unfolded polypeptide chain, the molecule has an astronomical number of possible conformations, so that a random search through all of them would take far longer than the age of the universe. This is referred to as Levinthal's paradox. Based upon the observation that proteins fold much faster than this, Levinthal then proposed that a random conformational search does not occur, and the protein must, therefore, fold through a series of metastable intermediate states.
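The scale of the paradox can be shown with a back-of-envelope calculation, sketched here in Python. The inputs are illustrative assumptions of the kind often used in textbook versions of the argument, and are not figures from [4]: a chain of 100 residues, three accessible conformations per residue, and a sampling rate of 10^13 conformations per second.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def exhaustive_search_years(n_residues, states_per_residue=3,
                            samples_per_second=1e13):
    """Return (number of conformations, years to visit each one once).

    All default values are illustrative assumptions, not measured data.
    """
    conformations = states_per_residue ** n_residues  # exact integer
    years = conformations / samples_per_second / SECONDS_PER_YEAR
    return conformations, years


conformations, years = exhaustive_search_years(100)
print(f"{conformations:.2e} conformations, about {years:.1e} years")
```

With these assumptions there are roughly 5 x 10^47 conformations and the search takes of the order of 10^27 years, against an age of the universe of roughly 10^10 years. Real proteins fold in milliseconds to minutes, hence the paradox.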
A famous article in this field is by Christian Anfinsen [5], which stated in conclusion that 'empirical considerations of the large amount of data now available on correlations between sequence and three-dimensional structure, together with an increasing sophistication in the theoretical treatment of the energetics of polypeptide chain folding are beginning to make more realistic the idea of the a priori prediction of protein conformation. It is certain that major advances in the understanding of cellular organization, and of the causes and control of abnormalities in such organization, will occur when we can predict, in advance, the three-dimensional phenotypic consequences of a genetic message.'
Indeed the Nobel Prize in Chemistry in 1972 (https://www.nobelprize.org/prizes/chemistry/1972/press-release/) was awarded 'one half to Christian B. Anfinsen for his work on ribonuclease, especially concerning the connection between the amino acid sequence and the biologically active conformation, the other half jointly to Stanford Moore and William H. Stein for their contribution to the understanding of the connection between chemical structure and catalytic activity of the active centre of the ribonuclease molecule.'
The great protein crystallographer David Phillips, who with his team determined the first enzyme crystal structure [6], and later was Sir David Phillips and then Lord Phillips of Ellesmere, wrote already in 1966 in Scientific American [7] of the folding of lysozyme: 'It seems a reasonable assumption that, as the synthesis proceeds, the amino end of the chain becomes separated by an increasing distance from the point of attachment to the ribosome, and that the folding of the protein chain to its native conformation begins at this end even before the synthesis is complete. According to our present ideas, parts of the polypeptide chain, particularly those near the terminal amino end, may fold into stable conformations that can still be recognized in the finished molecule and that act as 'internal templates', or centres, around which the rest of the chain is folded.'
At this point in the history, one should note that [8] '(while) Anfinsen (1973) showed that all information needed for a polypeptide to fold into basic secondary structure elements such as α-helices and β-strands and their subsequent collapse into a compact structure is entirely contained within the amino acid sequence. (and) Although small single-domain proteins can fold spontaneously and in a reversible way, most multi-domain proteins need the help of chaperones to adopt the ultimate native and active conformation.... These proteins thus violate Anfinsen's rule.'
How did DeepMind make this advance [9, 10]?
From the DeepMind blog https://deepmind.com/blog/article/AlphaFold-Using-AI-for-scientific-discovery the public at large can learn that DeepMind's methods 'relied on deep neural networks that are trained to predict properties of the protein from its genetic sequence. The properties our networks predict are: (a) the distances between pairs of amino acids and (b) the angles between chemical bonds that connect those amino acids. The first development is an advance on commonly used techniques that estimate whether pairs of amino acids are near each other.'
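To illustrate what property (a) means in practice, the Python sketch below computes the quantity that such a network is trained to predict, starting from known coordinates: the distance between every pair of residues, with each distance assigned to a bin. The network itself outputs a probability for each bin; only the target side is shown here. The choice of one representative atom per residue, and the bin range and number of bins, are illustrative assumptions and are not taken from the blog.

```python
import math


def pairwise_distances(coords):
    """Matrix of distances (angstroms) between one atom per residue."""
    n = len(coords)
    return [[math.dist(coords[i], coords[j]) for j in range(n)]
            for i in range(n)]


def distance_to_bin(distance, d_min=2.0, d_max=22.0, n_bins=10):
    """Index of the equal-width bin holding `distance`.

    Distances below d_min go to the first bin and distances above d_max
    to the last. The bin limits are illustrative assumptions.
    """
    if distance <= d_min:
        return 0
    if distance >= d_max:
        return n_bins - 1
    width = (d_max - d_min) / n_bins
    return min(int((distance - d_min) / width), n_bins - 1)


def binned_distance_map(coords, **bin_options):
    """Residue-by-residue matrix of distance-bin indices."""
    return [[distance_to_bin(d, **bin_options) for d in row]
            for row in pairwise_distances(coords)]


# Hypothetical toy chain: three residues spaced 3.8 angstroms apart along
# x, plus one residue far away from the others.
toy_chain = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
for row in binned_distance_map(toy_chain):
    print(row)
```

A map like this is symmetric and independent of how the molecule is oriented, which is one reason inter-residue distances are a convenient prediction target. Turning predicted distances back into three-dimensional coordinates is a separate step.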
They had already entered the CASP13 competition in 2018, as described in [10]. In the associated Nature paper [9] they make clear that:
'Source code for the distogram, reference distogram, and torsion prediction neural networks, together with the neural network weights and input data for the CASP13 targets, are available for research and non-commercial use at https://github.com/deepmind/deepmind-research/tree/master/alphafold_casp13. We make use of several open-source libraries to conduct our experiments, particularly HHblits, PSI-BLAST, and the machine-learning framework TensorFlow (https://github.com/tensorflow/tensorflow) along with the TensorFlow library Sonnet (https://github.com/deepmind/sonnet), which provides implementations of individual model components. We also used Rosetta under license.
Data availability: Our training, validation, and test data splits (CATH domain codes) are also available from https://github.com/deepmind/deepmind-research/tree/master/alphafold_casp13.
The following versions of public datasets were used in this study: PDB 2018-03-15; CATH 2018-03-16; Uniclust30 2017-10; and PSI-BLAST nr dataset (as of 15 December 2017).'
For CASP14 DeepMind rewrote their software, giving accuracy improvements over the version of AlphaFold they used in CASP13 which 'lead to more accurate interpretations of function; better interface prediction for protein-protein interactions; better binding pocket prediction and improved molecular replacement in crystallography.'
Most importantly for us experimentalists, we read: 'All the neural network models are trained on structures extracted from the PDB. We extract non-redundant domains by utilizing the CATH 35% sequence similarity cluster representatives (CATH version: 2018-03-16). This gives 31 247 domains, which are split into train and test sets (29 427 and 1820 proteins, respectively), keeping all domains from the same homologous superfamily (H-level in the CATH classification) in the same partition.'
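The point of keeping each homologous superfamily in one partition is that the test set then contains no close evolutionary relatives of anything the network was trained on. A minimal Python sketch of such a group-aware split follows. It is not DeepMind's procedure, whose details are not given in the quotation; the function name, the domain and superfamily identifiers, and the way superfamilies are assigned to the test set are all assumptions made for illustration. The default test fraction of about 6% simply mirrors the quoted 1820 out of 31 247 domains.

```python
import random
from collections import defaultdict


def split_by_superfamily(domains, test_fraction=0.06, seed=0):
    """Split (domain_id, superfamily_id) pairs into train and test lists.

    Whole superfamilies are assigned to one side, so no superfamily is
    shared between the two sets. Assumed scheme: shuffle superfamilies,
    then fill the test set until it reaches the requested size.
    """
    groups = defaultdict(list)
    for domain_id, superfamily_id in domains:
        groups[superfamily_id].append(domain_id)

    superfamilies = sorted(groups)  # sorted first so the seed is reproducible
    random.Random(seed).shuffle(superfamilies)

    target_test_size = test_fraction * sum(len(g) for g in groups.values())
    train, test = [], []
    for superfamily_id in superfamilies:
        destination = test if len(test) < target_test_size else train
        destination.extend(groups[superfamily_id])
    return train, test


# Hypothetical identifiers: 70 domains spread over 7 superfamilies.
example = [(f"domain{i}", f"superfamily{i % 7}") for i in range(70)]
train, test = split_by_superfamily(example, test_fraction=0.2, seed=1)
print(len(train), len(test))
```

Because whole groups are moved, the test set can overshoot the requested fraction when superfamilies are large. A random split of individual domains would hit the fraction exactly but would leak relatives of training proteins into the test set and so flatter the measured accuracy.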
Getting into the mind of DeepMind: its other projects
A major strength of DeepMind is the breadth of its other areas of research involving AI. Further details of these are available at their website and include:
Traffic prediction with advanced graph neural networks
Using AI to predict retinal disease progression
Using WaveNet technology to reunite speech-impaired users with their original voices
Advanced machine learning helps Play Store users discover personalised apps
AlphaStar: Grandmaster level in StarCraft II using multi-agent reinforcement learning
Using machine learning to accelerate ecological research
Using AI to give doctors a 48-hour head start on life-threatening illness
DeepMind AI reduces Google data centre cooling bill by 40%
Some competing interests: an AI company and its use of an open-access data archive of experimental structures, the PDB
In the Acknowledgements of their Nature article [9], the authors state, 'We thank ...the CASP13 organisers and the experimentalists whose structures enabled the assessment.' So, that is good. In the DeepMind blog (https://deepmind.com/blog/article/AlphaFold-Using-AI-for-scientific-discovery), however, their view of experimental methods includes the words 'trial and error', which I find wide of the mark:
'Over the past five decades, researchers have been able to determine shapes of proteins in labs using experimental techniques like cryo-electron microscopy, nuclear magnetic resonance, and X-ray crystallography, but each method depends on a lot of trial and error, which can take years of work, and cost tens or hundreds of thousands of dollars per protein structure. This is why biologists are turning to AI methods as an alternative to this long and laborious process for difficult proteins. The ability to predict a protein's shape computationally from its genetic code alone - rather than determining it through costly experimentation - could help accelerate research.'
This also brings us back to being clear about fold versus structure and the all-important precision of atom placements, whether by experiment [3] or by AI prediction, for which the precision has as yet not been parameterised.
There is also the question of openness. Reproducing the DeepMind Nature paper is feasible in that the authors have placed the workflows on GitHub, yet the competing interests statement in that paper states: 'A.W.S., J.K., T.G., J.J., L.S., R.E., H.P., C.Q., K.S., A.Ž. and A.B. have filed provisional patent applications relating to machine learning for predicting protein structures.'
And regarding reproducibility, the repository (https://github.com/deepmind/deepmind-research/tree/master/alphafold_casp13) states: 'This code can't be used to predict the structure of an arbitrary protein sequence. It can be used to predict structure only on the CASP13 dataset (links below). The feature generation code is tightly coupled to our internal infrastructure as well as external tools. Hence we are unable to open-source it. We give guide as to the features used for those accustomed to computing them below.....'
This, to me, seems like a culture clash of the fully open PDB versus the private enterprise and profit of DeepMind.
Future outlook
As I have explored in this article, it is the detailed protein structure that determines function, not the protein fold per se. DeepMind are clearly entering the area of positioning atoms in detail, although not yet with a clearly stated level of precision atom by atom. Secondly, multi-domain proteins are currently outside of this achievement. Thirdly, once a structure is known, structural dynamics studies start; in effect, while there may be one fold, there isn't one structure in functional terms! In our experimental arena, kinetic crystallography is set to thrive with an expansion of X-ray lasers, the ESRF EBS, Diamond II etc. building on many earlier developments [11]. Diffuse scattering is all there for the measuring and interpreting, i.e. structural dynamics again.
Onto more specific topics: a third of all proteins are metalloproteins. Predicting where a metal might bind, and which metal element it is, is not solved. Nor is it only metal ions: the binding of all sorts of other ligands will still have to be determined experimentally, although see DeepMind's Nature paper Extended Data Fig. 8 and its caption statement on ligands, and thereby drug design/discovery:
'Ligand pocket visualizations for T1011. T1011 (PDB 6M9T) is the EP3 receptor bound to misoprostol-FA. a, The native structure showing the ligand in a pocket. b, c, Submission 5 (78.0 GDT TS) by AlphaFold (b), made without knowledge of the ligand, shows a pocket more similar to the true pocket than that of the best other submission (322, model 3, 68.7 GDT TS) (c). Both submissions are aligned to the native protein using the same subset of residues from the helices close to the ligand pocket and visualized with the interior pocket together with the native ligand position.'
The impact on the AI field as a whole in its efforts with predicting protein structure from amino acid sequence has already been major. The DeepMind publication [9] has been cited 248 times since January 2020. A wide-ranging very recent review of the whole field, with 252 references, is in press in the journal Patterns [12].
I am keen to get started with DeepMind, who are owned by Google; I imagine it will proceed as follows:
JRH: Google, I am starting research on protein 427 in human chromosome 14. What is DeepMind's prediction of the 3D structure?
Google: I will submit the job to our computer cluster...(2 minutes later)...The nearest 3D structure is PDB entry 8XYZ but has a GDT score of only 55%. So, DeepMind offers a fold with an 89% GDT score. NB. it is likely a single-domain protein, so DeepMind is very confident in this GDT score estimate. This 3D structure prediction is in your mailbox.
JRH: Thank you, Google.
Google: You're welcome. Good luck with your research.
[NB How DeepMind could provide a GDT estimate when no experimental structure is available is not yet solved!]
Envoi
I am going to leave the last words to DeepMind out of respect for their achievement:
'Some rare diseases involve mutations in a single gene, resulting in a malformed protein which can have profound effects on the health of an entire organism. A tool like AlphaFold might help rare disease researchers predict the shape of a protein of interest rapidly and economically.'
Also, here is their figure animation, kindly provided by the DeepMind Press Office.
The more complete 8-minute video from DeepMind on their work, which I also found highly interesting, is to be found here: https://www.youtube.com/watch?v=gg7WjuFs8F4&feature=youtu.be
Acknowledgements
I am very grateful to Mike Glazer for helpful discussions during the preparation of this article, as well as to Sarah Froggatt and Brian McMahon of IUCr Chester for expert technical help. I thank the DeepMind Press Office and the Royal Institution of Great Britain for prompt responses to my requests for figure, gif and movie materials.
References
[1] Zemla, A., Venclovas, Č., Moult, J. & Fidelis, K. (2001). Processing and evaluation of predictions in CASP4. Proteins, 45(Suppl 5), 13-21.
[2] Zemla, A. (2003). LGA: A method for finding 3D similarities in protein structures. Nucleic Acids Res. 31, 3370-3374.
[3] Cruickshank, D. W. J. (1999). Remarks about protein structure precision. Acta Cryst. D55, 583-601.
[4] Levinthal, C. (1968). Are there pathways for protein folding? Journal de Chimie Physique et de Physico-Chimie Biologique, 65, 44-45.
[5] Anfinsen, C. B. (1973). Principles that govern the folding of protein chains. Science, 181, 223-230.
[6] Blake, C. C. F. et al. (1965). The structure of hen egg white lysozyme: a three dimensional Fourier synthesis at 2 Å resolution. Nature, 206, 757-761.
[7] Phillips, D. C. (1966). The three-dimensional structure of an enzyme molecule. Scientific American, 215(5), 78-93.
[8] Pauwels, K., Van Molle, I., Tommassen, J. & Van Gelder, P. (2007). Chaperoning Anfinsen: the steric foldases. Molecular Microbiology, 64, 917-922.
[9] Senior, A. W. et al. (2020). Improved protein structure prediction using potentials from deep learning. Nature, 577, 706-710. https://doi.org/10.1038/s41586-019-1923-7
[10] Senior, A. W. et al. (2019). Protein structure prediction using multiple deep neural networks in the 13th Critical Assessment of Protein Structure Prediction (CASP13). Proteins, 87, 1141-1148. https://doi.org/10.1002/prot.25834
[11] Cruickshank, D. W. J., Helliwell, J. R. & Johnson, L. N. (1992). Editors. Time-Resolved Macromolecular Crystallography. Proceedings of a Royal Society Discussion Meeting.
[12] Gao, W., Mahajan, S. P., Sulam, J. & Gray, J. J. (2020). Deep Learning in Protein Structural Modeling and Design. Patterns. In the press. https://doi.org/10.1016/j.patter.2020.100142
Copyright © - All Rights Reserved - International Union of Crystallography