Reaction to announcement of AlphaFold Database

John R. Helliwell

At 4 p.m. on 22 July 2021 the bulletin board of the CCP4 (Collaborative Computational Project, Number 4, for macromolecular crystallography) conveyed what I would call a dramatic announcement from EMBL–EBI:

"DeepMind and EMBL's European Bioinformatics Institute (EMBL-EBI) have partnered, initially for a 2-year period, to make hundreds of thousands (and eventually many millions) of AlphaFold structure predictions freely available to the community through a new data resource, AlphaFold DataBase (AlphaFold DB). AlphaFold is an Artificial Intelligence (AI) system developed by DeepMind that predicts a protein's three-dimensional (3D) structure from its amino-acid sequence. The initial release of the resource provides structure predictions for most of the proteins in the human proteome as well as for the proteomes of 20 other species of significant biological or medical interest."

[The relevant academic references are Jumper et al. (2021) and Tunyasuvunakool et al. (2021).] 

The day before, the Protein Data Bank (PDB), in its Wednesday release of new experimentally derived structures, had made 214 new ones available, adding to the ~180,000 already in the archive. It was from all these that a core set of a few tens of thousands of experimental structures provided the learning set for DeepMind's AlphaFold2 to embark on its predictions of protein folds. That startling success was the basis for my article in the IUCr Newsletter last year (Helliwell, 2020), including a historical look back on the protein folding problem over many decades, as well as offering my congratulations to the DeepMind team on their research breakthrough.

I immediately alerted the IUCr Newsletter Editor, Mike Glazer, about the new EMBL–EBI announcement. Both he and I thought that this needed a description, inevitably brief, given that the deadline for this article was four days later. Obviously, to report on precisely what all this meant, I should make some direct incursions into the new AlphaFold Database (AlphaFold DB). I had already taken to heart this short extract from the EMBL scientists (Cusack et al., 2021) that came with the announcement:

"While AlphaFold DB will, in general, accelerate structural biology research, it will likely also induce a shift in emphasis from initial structural determination to the study of the more mechanistic and functional aspects of protein structures. Although this in turn may lead to an objective re-evaluation of the large-scale structural biology infrastructures devoted to structure determination (e.g. synchrotron X-ray crystallography beamlines), it is likely that for the foreseeable future they will be essential to validate and thus fully harness the potential of structure prediction, and to enable structural investigations for which no reliable predictions can be made at this time (structure of nucleic acids and large complexes, ligand and fragment screens, investigations of dynamics, etc.)."

The above paragraph's perspective was quite similar to my own (Helliwell, 2020), apart from the bit about the need for an "objective re-evaluation of the large-scale structural biology infrastructures devoted to structure determination (e.g. synchrotron X-ray crystallography beamlines)". As a pioneer of the development of synchrotron radiation beamlines and their use in crystallography, I obviously have some affection for such infrastructures, and so I declare a conflict of interest in scrutinising such a statement. Whether these beamlines are to be eventually replaced, under the auspices of the AlphaFold DB [...] I was in turn rather struck by what I presume is the Google lawyers' view of an AlphaFold DB structure:

"REMARK   1 DISCLAIMERS                                                         
REMARK   1 ALPHAFOLD DATA, COPYRIGHT (2021) DEEPMIND TECHNOLOGIES LIMITED. THE
REMARK   1 INFORMATION PROVIDED IS THEORETICAL MODELLING ONLY AND CAUTION SHOULD
REMARK   1 BE EXERCISED IN ITS USE. IT IS PROVIDED "AS-IS" WITHOUT ANY WARRANTY
REMARK   1 OF ANY KIND, WHETHER EXPRESSED OR IMPLIED. NO WARRANTY IS GIVEN THAT
REMARK   1 USE OF THE INFORMATION SHALL NOT INFRINGE THE RIGHTS OF ANY THIRD
REMARK   1 PARTY. THE INFORMATION IS NOT INTENDED TO BE A SUBSTITUTE FOR
REMARK   1 PROFESSIONAL MEDICAL ADVICE, DIAGNOSIS, OR TREATMENT, AND DOES NOT
REMARK   1 CONSTITUTE MEDICAL OR OTHER PROFESSIONAL ADVICE. IT IS AVAILABLE FOR
REMARK   1 ACADEMIC AND COMMERCIAL PURPOSES, UNDER CC-BY 4.0 LICENCE."

(The prefacing of each line of the above with "REMARK" is the style of a PDB coordinate file.)
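For readers who wish to pull such a header out of a downloaded coordinate file, the following is a minimal sketch, assuming the fixed-column PDB layout in which the record name occupies columns 1-6, the remark number columns 8-10 and the free text starts at column 12. The function name `remark_text` is hypothetical and is introduced here only for illustration.

```python
def remark_text(pdb_text, remark_number=1):
    """Join the free text of every 'REMARK <n>' line into one string.

    Assumes fixed-column PDB format: record name in columns 1-6,
    remark number in columns 8-10, free text from column 12 onwards.
    """
    pieces = []
    for line in pdb_text.splitlines():
        is_remark = line[:6] == "REMARK"
        if is_remark and line[7:10].strip() == str(remark_number):
            pieces.append(line[11:].strip())
    # Drop empty continuation lines and rebuild the running sentence.
    return " ".join(piece for piece in pieces if piece)
```

Called on the text of the file with `remark_number=1`, this would return the disclaimer quoted above as a single string, with the "REMARK   1" prefixes removed.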

Anyway, whatever the lawyers might imagine, I set about doing some tests. I chose my Oxford University DPhil project (1974 to 1977): "X-ray studies concerning the structure of sheep liver 6-phosphogluconate dehydrogenase (6PGDH)", available on request from the Bodleian Library, University of Oxford. My research and that of a few further PhD students, under the expert supervision of Dr Margaret Adams, led to the PDB deposition 2pgd (Adams, Helliwell & Bugg, 1977; Adams et al., 1991). AlphaFold DB offered several predicted 6PGDHs, but none for the sheep (Ovis aries) liver enzyme. I could, however, choose the predicted structure of the human enzyme, with its closely similar amino-acid sequence (Fig. 1).

Figure 1. The AlphaFold DB human 6PGDH. This shows the ribbon diagram for the predicted 3D structure. Note the "Model Confidence" colour coding at left, i.e. "Very high" throughout apart from a short stretch of amino acids at the bottom of this screenshot. The "Model Confidence" is described as follows: "AlphaFold produces a per-residue confidence score (pLDDT) between 0 and 100. Some regions below 50 pLDDT may be unstructured in isolation."
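The per-residue score behind this colour coding can also be inspected numerically. The following is a minimal sketch, assuming that the model has been downloaded in PDB format and that the pLDDT is stored in the B-factor field (columns 61-66) of each ATOM record, which is how AlphaFold DB files are understood to be written. Only the "Very high" label and the threshold of 50 come from the text above; the other band names and boundaries follow the colour legend of the AlphaFold DB web page as recalled and should be treated as assumptions, as should the hypothetical function names.

```python
from collections import Counter


def plddt_per_residue(pdb_text):
    """Return {(chain, residue_number): pLDDT}, read from the CA atoms.

    Assumes fixed-column PDB ATOM records with the pLDDT written in
    the B-factor field (columns 61-66).
    """
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            chain = line[21]
            residue_number = int(line[22:26])
            scores[(chain, residue_number)] = float(line[60:66])
    return scores


def confidence_band(plddt):
    """Map a pLDDT value (0 to 100) to a confidence label.

    Boundaries (90, 70, 50) are an assumption based on the web-page
    legend; a score of exactly 50 is not counted as 'below 50'.
    """
    if plddt >= 90:
        return "Very high"
    if plddt >= 70:
        return "Confident"
    if plddt >= 50:
        return "Low"
    return "Very low"


def band_counts(scores):
    """Count how many residues fall into each confidence band."""
    return dict(Counter(confidence_band(value) for value in scores.values()))
```

Applied to the human 6PGDH model, `band_counts(plddt_per_residue(text))` would give the number of residues in each band, which is a quick check on the impression given by the ribbon colouring.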

I used this AlphaFold DB human 6PGDH 3D structure as the search model for molecular replacement (MR) with the Phaser MR program (McCoy et al., 2007) to solve the structure from the sheep liver 6PGDH X-ray diffraction data. The program ran swiftly, with the concluding message "EXIT STATUS: SUCCESS CPU Time: 0 days 0 hrs 2 mins 39.18 secs (159.18 secs)". As Fig. 2 shows, the placement of the AlphaFold DB model by Phaser MR into the sheep liver 6PGDH unit cell gave (Fobs – Fcalc) electron density with clear evidence for one of the sulfate ions arising from our crystallization procedure. That is, we had used high-concentration ammonium sulfate for crystallization, and the diffraction signal of the bound sulfate was in the Fobs (observed structure factor) list. The AlphaFold DB model was, of course, an in silico model, in effect in a vacuum, with no ligands and no bound waters. In the Fobs – Fcalc difference map there were also clear indications of the amino-acid substitutions of histidine for glutamine at positions 56 and 213 of the sheep versus the human sequence (not shown). So, in terms of my DPhil, the time I had taken searching for isomorphous replacement heavy-atom derivatives for phasing would have been saved; it is difficult to estimate exactly, but let me say one year. But note that I also spent a considerable time on the functional crystallographic studies (substrate binding experiments and so on), which formed a significant part of my thesis.
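For readers less familiar with difference maps, the textbook form of the two Fourier syntheses referred to here can be written as below. This is a sketch of the standard unweighted expressions, not a statement of exactly what Phaser or Coot computed in this run; programs commonly apply likelihood-based weights to the coefficients, which are omitted. The symbols V (unit-cell volume), h (reflection index), x (position in the cell) and φcalc (phase calculated from the placed model) are introduced here and do not appear in the text above.

```latex
\rho_{\mathrm{diff}}(\mathbf{x}) =
  \frac{1}{V} \sum_{\mathbf{h}}
  \bigl( |F_{\mathrm{obs}}(\mathbf{h})| - |F_{\mathrm{calc}}(\mathbf{h})| \bigr)\,
  e^{\,i\varphi_{\mathrm{calc}}(\mathbf{h})}\,
  e^{-2\pi i\,\mathbf{h}\cdot\mathbf{x}}

\rho_{2F_{\mathrm{obs}}-F_{\mathrm{calc}}}(\mathbf{x}) =
  \frac{1}{V} \sum_{\mathbf{h}}
  \bigl( 2|F_{\mathrm{obs}}(\mathbf{h})| - |F_{\mathrm{calc}}(\mathbf{h})| \bigr)\,
  e^{\,i\varphi_{\mathrm{calc}}(\mathbf{h})}\,
  e^{-2\pi i\,\mathbf{h}\cdot\mathbf{x}}
```

Because the phases come from the placed model while the amplitudes come from the crystal, features present in the crystal but absent from the model, such as a bound sulfate ion or a substituted side chain, appear as positive peaks in the first map.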

Figure 2. The (Fobs – Fcalc) (contoured at 3σ, in green) and the (2Fobs – Fcalc) (contoured at 1.2 r.m.s., in magenta) electron-density maps from the Phaser MR run show the sulfate ion present in the sheep liver 6PGDH crystal, PDB code 2pgd. The Fobs are the sheep X-ray diffraction structure-factor data for 2pgd (green model for protein, yellow for the sulfur in the sulfate ion, oxygens in red); the Fcalc are based on the correctly placed human 6PGDH AlphaFold DB model (in blue, oxygens in red). This figure was prepared using Coot (Emsley & Cowtan, 2004).

I was alerted by my Twitter friend @aemiele, a "structural biophysicist based in Lyon", that the web server in Seattle (https://robetta.bakerlab.org/), also published a few days ago (Baek et al., 2021), had much to commend it as an alternative to AlphaFold DB. I duly registered for Robetta and was able to submit the sheep liver 6PGDH amino-acid sequence. I also noted that the University of Washington lawyers take a similar view to that of the Google lawyers:

"Disclaimer

THE INFORMATION, DATA, PROTOCOLS, AND SOFTWARE AVAILABLE ON THIS WEB SITE ARE PROVIDED ON AN "AS IS" BASIS WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO WARRANTIES OF TITLE, NONINFRINGEMENT OF INTELLECTUAL PROPERTY, AND IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.

Limitation of Liability

The entire risk for use of the Web site lies with the user. The University of Washington reserves the right to modify the Web site or reduce or discontinue service at any time.

The Web site is provided for educational and informational purposes only and is not engaged in providing professional services. The Website is experimental in nature, and has been developed as part of research conducted at the University of Washington."

At the time of submitting this article to the IUCr Newsletter, I have not yet received an email with Robetta's predicted sheep liver 6PGDH 3D structure. I would note though that there was a sizeable queue of submitted jobs awaiting their turn.

I would conclude this short article by saying that I think this is a game-changer in protein structural science; it will speed up many experimental studies. Again I congratulate all concerned. Besides speeding things up, it may even ease a project past a complete impasse in crystal structure determination, caused by the unavailability of a suitable MR model or of isomorphous or anomalous dispersion phasing tools, by providing a predicted structure as the phasing start. It will, I imagine, also stimulate new project directions.

There are some further points for interesting discussion. Firstly, the vast majority of the training set of experimental structures are cryostructures, not room-temperature, let alone 37°C (physiological temperature for mammals), structures. Secondly, the predicted structures have no bound waters. For substrate or inhibitor binding these waters are either displaced or are the hydrogen bonders through which a ligand molecule attaches itself to a protein. Thirdly, there is an elaborate range of validation processes that the crystallographic community has evolved over many years (see e.g. https://ecm2019.org/satellites/data-science-skills-in-publishing/), including the work of the PDB Validation Task Forces. The discussions for and against the need for rigorous evaluation of predicted structures versus their internal 'accuracy estimates' will probably continue. My view (Helliwell, 2020) remains that, although AlphaFold does give its own uncertainty estimate (Fig. 1), rigorous assessment requires that it be measured against an experimentally determined protein structure. The new aspect, with AlphaFold DB, is that where there is a predicted structure alone, I suggest that one should proceed with the lawyers' views, quoted above, in one's mind.

Meanwhile, the last word I think should be given to the PDBe in its tweet:

“@PDBeurope The structure predictions on the AlphaFold DB website include those that already have experimentally determined structures in the PDB. In these cases, the AlphaFold pages display links to our #PDBeKB protein pages to compare existing data.”

References

Adams, M. J., Gover, S., Leaback, R., Phillips, C. & Somers, D. O'N. (1991). Acta Cryst. B47, 817–820.

Adams, M. J., Helliwell, J. R. & Bugg, C. E. (1977). J. Mol. Biol. 112, 183–197.

Baek, M. et al. (2021). Science, https://doi.org/10.1126/science.abj8754.

Cusack, S., Eustermann, S., Kleywegt, G., Kosinski, J., Mahamid, J., Marquez, J. A., Müller, C., Schneider, T., Thornton, J., Vamathevan, J., Velankar, S. & Wilmanns, M. (2021). https://www.embl.org/news/science/alphafold-potential-impacts/

Emsley, P. & Cowtan, K. (2004). Acta Cryst. D60, 2126–2132.

Helliwell, J. R. (2020). IUCr Newsl. 28(4), 6.

Jumper, J. et al. (2021). Nature, https://doi.org/10.1038/s41586-021-03819-2.

McCoy, A. J., Grosse-Kunstleve, R. W., Adams, P. D., Winn, M. D., Storoni, L. C. & Read, R. J. (2007). J. Appl. Cryst. 40, 658–674.

Tunyasuvunakool, K., Adler, J., Wu, Z. et al. (2021). Nature, https://doi.org/10.1038/s41586-021-03828-1.

 

John R. Helliwell is Emeritus Professor, Department of Chemistry, University of Manchester, UK, and is a DSc in physics from the University of York, UK; john.helliwell@manchester.ac.uk.

 

Editor's note: See the Acta Cryst. D article on AlphaFold2.

25 July 2021

Copyright © - All Rights Reserved - International Union of Crystallography