Reaction to announcement of AlphaFold Database
At 4 p.m. on 22 July 2021 the bulletin board of CCP4 (Collaborative Computational Project, Number 4, for macromolecular crystallography) carried what I would call a dramatic announcement from EMBL–EBI:
"DeepMind and EMBL's European Bioinformatics Institute (EMBL-EBI) have partnered, initially for a 2-year period, to make hundreds of thousands (and eventually many millions) of AlphaFold structure predictions freely available to the community through a new data resource, AlphaFold DataBase (AlphaFold DB). AlphaFold is an Artificial Intelligence (AI) system developed by DeepMind that predicts a protein's three-dimensional (3D) structure from its amino-acid sequence. The initial release of the resource provides structure predictions for most of the proteins in the human proteome as well as for the proteomes of 20 other species of significant biological or medical interest."
[The relevant academic references are Jumper et al. (2021) and Tunyasuvunakool et al. (2021).]
The day before, in its Wednesday release of new experimentally derived structures, the Protein Data Bank (PDB) had made 214 new entries available, adding to the ~180,000 already in the archive. It was from this archive that a core set of a few tens of thousands of experimental structures provided the training set for DeepMind's AlphaFold2 to embark on its predictions of protein folds. That startling success was the basis for my article in the IUCr Newsletter last year (Helliwell, 2020), which included a historical look back on the protein folding problem over many decades and offered my congratulations to the DeepMind team on their research breakthrough.
I immediately alerted the IUCr Newsletter Editor, Mike Glazer, to the new EMBL–EBI announcement. Both he and I thought that it needed a description, inevitably brief given a deadline for this article of four days later. Obviously, to report on precisely what all this meant, I should make some direct forays into the new AlphaFold Database (AlphaFold DB). I had already taken to heart this short extract from the EMBL scientists (Cusack et al., 2021) that came with the announcement:
"While AlphaFold DB will, in general, accelerate structural biology research, it will likely also induce a shift in emphasis from initial structural determination to the study of the more mechanistic and functional aspects of protein structures. Although this in turn may lead to an objective re-evaluation of the large-scale structural biology infrastructures devoted to structure determination (e.g. synchrotron X-ray crystallography beamlines), it is likely that for the foreseeable future they will be essential to validate and thus fully harness the potential of structure prediction, and to enable structural investigations for which no reliable predictions can be made at this time (structure of nucleic acids and large complexes, ligand and fragment screens, investigations of dynamics, etc.)."
The perspective of the above paragraph is quite similar to my own (Helliwell, 2020), apart from the part about the need for an "objective re-evaluation of the large-scale structural biology infrastructures devoted to structure determination (e.g. synchrotron X-ray crystallography beamlines)". As a pioneer of the development of synchrotron radiation beamlines and their use in crystallography, I obviously have some affection for such infrastructures, and so I declare a conflict of interest in scrutinising such a statement. Whether or not these beamlines are eventually to be replaced under the auspices of AlphaFold DB, I was in turn rather struck by what I presume is the Google lawyers' view of an AlphaFold DB structure:
"REMARK 1 DISCLAIMERS
REMARK 1 ALPHAFOLD DATA, COPYRIGHT (2021) DEEPMIND TECHNOLOGIES LIMITED. THE
REMARK 1 INFORMATION PROVIDED IS THEORETICAL MODELLING ONLY AND CAUTION SHOULD
REMARK 1 BE EXERCISED IN ITS USE. IT IS PROVIDED "AS-IS" WITHOUT ANY WARRANTY
REMARK 1 OF ANY KIND, WHETHER EXPRESSED OR IMPLIED. NO WARRANTY IS GIVEN THAT
REMARK 1 USE OF THE INFORMATION SHALL NOT INFRINGE THE RIGHTS OF ANY THIRD
REMARK 1 PARTY. THE INFORMATION IS NOT INTENDED TO BE A SUBSTITUTE FOR
REMARK 1 PROFESSIONAL MEDICAL ADVICE, DIAGNOSIS, OR TREATMENT, AND DOES NOT
REMARK 1 CONSTITUTE MEDICAL OR OTHER PROFESSIONAL ADVICE. IT IS AVAILABLE FOR
REMARK 1 ACADEMIC AND COMMERCIAL PURPOSES, UNDER CC-BY 4.0 LICENCE."
(the prefacing of each line of the above with "REMARK" is the style of a PDB coordinate file).
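As a minimal sketch of that file convention: in the legacy PDB text format every record starts with a record name, and a REMARK record carries a remark number before its free text, so the disclaimer can be recovered by collecting the lines tagged "REMARK 1". The function name `extract_remark` and the shortened sample text (including the ATOM line with made-up coordinates) are illustrative assumptions, not part of any AlphaFold DB tooling.

```python
def extract_remark(pdb_text, number=1):
    """Return the joined text of all REMARK records with the given number."""
    parts = []
    for line in pdb_text.splitlines():
        # Split into: record name, remark number, free text (if any).
        fields = line.split(None, 2)
        if len(fields) >= 2 and fields[0] == "REMARK" and fields[1] == str(number):
            parts.append(fields[2].strip() if len(fields) == 3 else "")
    return " ".join(p for p in parts if p)


# Shortened, illustrative sample; the ATOM line holds made-up values.
sample = """\
REMARK   1 DISCLAIMERS
REMARK   1 ALPHAFOLD DATA, COPYRIGHT (2021) DEEPMIND TECHNOLOGIES LIMITED. THE
REMARK   1 INFORMATION PROVIDED IS THEORETICAL MODELLING ONLY AND CAUTION SHOULD
REMARK   1 BE EXERCISED IN ITS USE.
ATOM      1  N   MET A   1      -5.000   2.000   1.000  1.00 45.10           N
"""

print(extract_remark(sample))
```

Splitting on whitespace, rather than on fixed columns, lets the same function read both a real coordinate file and the single-spaced form quoted above.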
Anyway, whatever the lawyers might imagine, I set about doing some tests. I chose my Oxford University DPhil project (1974 to 1977): "X-ray studies concerning the structure of sheep liver 6-phosphogluconate dehydrogenase (6PGDH)", the thesis being available on request from the Bodleian Library, University of Oxford. My research and that of a few further PhD students, under the expert supervision of Dr Margaret Adams, led to the PDB deposition 2pgd (Adams, Helliwell & Bugg, 1977; Adams et al., 1991). AlphaFold DB offered several predicted 6PGDHs, but none for the sheep (Ovis aries) liver enzyme. I could, however, choose the predicted structure of the human enzyme, with its closely similar amino-acid sequence (Fig. 1).
I used this AlphaFold DB human 6PGDH 3D structure as the search model for molecular replacement (MR) with the Phaser MR program (McCoy et al., 2007) to solve the structure from the sheep liver 6PGDH X-ray diffraction data. The program ran swiftly with the concluding message "EXIT STATUS: SUCCESS CPU Time: 0 days 0 hrs 2 mins 39.18 secs (159.18 secs)". As Fig. 2 shows, the placement of the AlphaFold DB model by Phaser MR into the sheep liver 6PGDH unit cell gave (Fobs – Fcalc) electron density with clear evidence for one of the sulfate ions arising from our crystallization procedure. That is, we had used high-concentration ammonium sulfate for crystallization, and the diffraction signal of the bound sulfate was therefore in the Fobs (observed structure factor) list. The AlphaFold DB model was, of course, an in silico model, in effect in a vacuum, with no ligands and no bound waters. In the Fobs – Fcalc difference map there were also clear indications of the amino-acid substitutions of histidine for glutamine at positions 56 and 213 of the sheep versus the human sequence (not shown). So, in terms of my DPhil, the time I had taken searching for isomorphous replacement heavy-atom derivatives for phasing would have been saved; it is difficult to estimate exactly, but let me say one year. But note that I also spent considerable time on the functional crystallographic studies (substrate binding experiments and so on), which formed a significant part of my thesis.
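How close two sequences are, and where they differ, can be quantified once they are aligned. The following is a minimal sketch, assuming two sequences that are already aligned without gaps and are of equal length. The function name `compare_aligned` and the ten-residue fragments are made up for illustration and are not the real sheep or human 6PGDH sequences.

```python
def compare_aligned(seq_a, seq_b):
    """Percent identity and substitutions for two pre-aligned sequences."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and of equal length")
    substitutions = [
        (pos, a, b)  # 1-based position, residue in seq_a, residue in seq_b
        for pos, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]
    identity = 100.0 * (len(seq_a) - len(substitutions)) / len(seq_a)
    return identity, substitutions


# Hypothetical fragments, NOT taken from any real 6PGDH sequence.
toy_a = "MKTAYIAKQR"
toy_b = "MKTSYIAKQR"

identity, subs = compare_aligned(toy_a, toy_b)
print(f"{identity:.1f}% identical; substitutions: {subs}")
```

Each reported position is a place where a search model built from one sequence would be expected to disagree with the electron density of the other.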
I was alerted by my Twitter friend @aemiele, a "structural biophysicist based in Lyon", that the Robetta web server in Seattle (https://robetta.bakerlab.org/) had much to commend it as an alternative to AlphaFold DB; the underlying work was also published a few days ago (Baek et al., 2021). I duly registered for Robetta and was able to submit the sheep liver 6PGDH amino-acid sequence. I also noted that the University of Washington lawyers take a similar view to the Google lawyers:
"THE INFORMATION, DATA, PROTOCOLS, AND SOFTWARE AVAILABLE ON THIS WEB SITE ARE PROVIDED ON AN "AS IS" BASIS WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO WARRANTIES OF TITLE, NONINFRINGEMENT OF INTELLECTUAL PROPERTY, AND IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
The entire risk for use of the Web site lies with the user. The University of Washington reserves the right to modify the Web site or reduce or discontinue service at any time.
The Web site is provided for educational and informational purposes only and is not engaged in providing professional services. The Website is experimental in nature, and has been developed as part of research conducted at the University of Washington."
At the time of submitting this article to the IUCr Newsletter, I have not yet received an email with Robetta's predicted sheep liver 6PGDH 3D structure. I would note though that there was a sizeable queue of submitted jobs awaiting their turn.
I would conclude this short article by saying that I think AlphaFold DB is a game-changer in protein structural science; it will speed up many experimental studies. Again I congratulate all concerned. Besides speeding things up, it may even ease a project past a complete impasse in crystal structure determination, caused by the unavailability of a suitable MR model or of isomorphous or anomalous dispersion phasing tools, by providing a predicted structure as the phasing start. It will, I imagine, also stimulate new project directions.
There are some further points for interesting discussion. Firstly, the vast majority of the training set of experimental structures are cryostructures, not room-temperature, let alone 37 °C (physiological temperature for mammals), structures. Secondly, the predicted structures have no bound waters. For substrate or inhibitor binding these waters are either displaced or are the hydrogen bonders through which a ligand molecule attaches itself to a protein. Thirdly, there is an elaborate range of validation processes that the crystallographic community has evolved over many years (see e.g. https://ecm2019.org/satellites/data-science-skills-in-publishing/), including the work of the PDB Validation Task Forces. The discussions for and against the need for rigorous evaluation of predicted structures versus their internal 'accuracy estimates' will probably continue. My view (Helliwell, 2020) remains that, although AlphaFold does give its own uncertainty estimate (Fig. 1), rigorous assessment requires that a prediction be measured against an experimentally determined protein structure. The new aspect with AlphaFold DB is that a predicted structure may be all there is; in that case, I suggest that one should proceed with the lawyers’ views, quoted above, in mind.
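AlphaFold's own per-residue estimate (pLDDT, on a scale of 0 to 100) is written into the B-factor field of each ATOM record in AlphaFold DB's PDB-format files. The following is a minimal sketch of reading it, assuming the fixed-column PDB layout. The helper names, the made-up coordinates and scores, and the threshold of 70 are illustrative assumptions rather than a recommended protocol, and such a readout does not replace comparison with experiment.

```python
def atom_line(serial, name, resname, chain, resseq, x, y, z, b):
    """Format one fixed-column PDB ATOM record (occupancy set to 1.00)."""
    return (f"ATOM  {serial:5d} {name:<4s} {resname:3s} {chain}{resseq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{b:6.2f}")


def plddt_per_residue(pdb_text):
    """Map residue number -> pLDDT, read from the B-factor field of CA atoms."""
    scores = {}
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name, 23-26 the residue number and
        # 61-66 the B-factor field (pLDDT in AlphaFold DB files).
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores[int(line[22:26])] = float(line[60:66])
    return scores


def low_confidence(scores, threshold=70.0):
    """Residue numbers whose pLDDT falls below the (assumed) threshold."""
    return sorted(res for res, value in scores.items() if value < threshold)


# Made-up coordinates and scores, for illustration only.
sample_pdb = "\n".join([
    atom_line(1, " N  ", "MET", "A", 1, -5.0, 2.0, 1.0, 45.10),
    atom_line(2, " CA ", "MET", "A", 1, -4.0, 2.5, 1.2, 45.10),
    atom_line(3, " CA ", "ALA", "A", 2, -1.0, 3.0, 1.5, 88.25),
    atom_line(4, " CA ", "GLY", "A", 3, 2.0, 3.5, 1.9, 96.50),
])

scores = plddt_per_residue(sample_pdb)
mean_plddt = sum(scores.values()) / len(scores)
print(f"mean pLDDT = {mean_plddt:.1f}; below 70: {low_confidence(scores)}")
```

Reading one value per residue from the CA atom keeps the sketch short and assumes that every atom of a residue carries the same score.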
Meanwhile, the last word I think should be given to the PDBe in its tweet:
“@PDBeurope The structure predictions on the AlphaFold DB website include those that already have experimentally determined structures in the PDB. In these cases, the AlphaFold pages display links to our #PDBeKB protein pages to compare existing data.”
Adams, M. J., Helliwell, J. R. & Bugg, C. E. (1977). J. Mol. Biol. 112, 183–197.
Baek, M. et al. (2021). Science, https://doi.org/10.1126/science.abj8754
Cusack, S., Eustermann, S., Kleywegt, G., Kosinski, J., Mahamid, J., Marquez, J. A., Müller, C., Schneider, T., Thornton, J., Vamathevan, J., Velankar, S. & Wilmanns, M. (2021). https://www.embl.org/news/science/alphafold-potential-impacts/
Tunyasuvunakool, K., Adler, J., Wu, Z. et al. (2021). Nature, https://doi.org/10.1038/s41586-021-03828-1
John R. Helliwell is Emeritus Professor, Department of Chemistry, University of Manchester, UK, and is a DSc in physics from the University of York, UK; email@example.com.
Editor's note: See the Acta Cryst. D article on AlphaFold2.
Copyright © - All Rights Reserved - International Union of Crystallography