Data science skills in publishing: for authors, editors and referees
An ECM32 satellite Workshop
August 18th 2019, TU Wien, Austria
This day-long workshop was the first major initiative of the IUCr Committee on Data (CommDat), established at the 2017 IUCr Congress in Hyderabad to survey and report to the IUCr Executive Committee on all relevant aspects of data management. In line with the Committee’s broad remit, the Workshop included contributions from the fields of chemical and macromolecular crystallography and powder diffraction, and was concerned with the quality of reporting of structural models through the medium of publication as well as database deposition.
The day commenced with an introduction by John R. Helliwell, CommDat Chair. He explained that the inspiration for the day’s topic was the refereeing process of the IUCr chemical journals, in which the article narrative, the checkCIF report, and the underpinning processed diffraction data and derived coordinate data are assessed together in order to arrive at the version of record for each of these aspects of a study. A running theme throughout the day was how far this exemplary procedure might be extended into other areas, e.g. biological crystallography.
This Workshop followed earlier workshops organised by the Diffraction Data Deposition Working Group on raw diffraction data preservation and reuse [1, 2]. Several speakers took up the theme of raw data deposition in some detail, reflecting a significant community shift in favour of raw data availability, especially for biological (i.e. macromolecular) crystallography and also powder diffraction. There were also reflective remarks from structural chemists on the usefulness and desirability of access to raw data; this aspect is intended to be the theme of the Day Zero CommDat Workshop at the next IUCr Congress, IUCr2020. Enquiries were made about similar workshops from the X-ray laser serial crystallography community, and there was significant interest in artificial intelligence analyses of data – a growing trend in other science fields that should be looked into in detail for crystallography.
Session I: the checkCIF paradigm
The existing practice in the area of chemical crystallography was reviewed in the first session. Anthony Linden, a former Acta C Section Editor, described the details of the procedures adopted by IUCr chemical journals, including a historical introduction.
Brian McMahon, who had been involved from the beginning with the development of CIF as a publication and data exchange standard, teased apart the components of checkCIF that were facilitated by the fundamental CIF design principles. He identified these as completeness (the specification of sufficient metadata to reproduce fully the experimental and modelling steps of the study), correctness or internal self-consistency (the validity of the description of the model in terms of the available data), and context (the comparison, using existing knowledge bases, with similar structures to highlight unusual or outlier behaviour). These components are common to both chemical and biological structure validation. He advertised publBio, an authoring tool developed for IUCr journals to make it easier for authors to supply the necessary experimental metadata (especially in the area of biological crystal sample preparation).
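The completeness component lends itself to a small illustration. The Python sketch below is not checkCIF; it only shows, under simplifying assumptions, how a file might be screened for required metadata. The list REQUIRED_ITEMS is an illustrative selection of CIF core data names chosen for this example, the helper names are hypothetical, and the parser handles only single-line "name value" pairs (loops and multi-line text fields are ignored).

```python
import re

# Illustrative subset of CIF core data names; NOT the list checkCIF actually uses.
REQUIRED_ITEMS = (
    "_cell_length_a",
    "_cell_length_b",
    "_cell_length_c",
    "_diffrn_ambient_temperature",
    "_refine_ls_R_factor_gt",
)


def read_simple_items(cif_text):
    """Collect single-line 'name value' pairs; loops and multi-line values are skipped."""
    items = {}
    for line in cif_text.splitlines():
        match = re.match(r"^(_\S+)\s+(.+?)\s*$", line)
        if match:
            items[match.group(1)] = match.group(2)
    return items


def missing_items(cif_text, required=REQUIRED_ITEMS):
    """Return the required data names that are absent or given as '?' (unknown)."""
    items = read_simple_items(cif_text)
    return [name for name in required if items.get(name, "?") == "?"]


example = """\
data_example
_cell_length_a 10.234(2)
_cell_length_b 11.876(3)
_cell_length_c 7.452(1)
_diffrn_ambient_temperature ?
"""

print(missing_items(example))
```

In this example the temperature is present but flagged as unknown and the R factor is absent, so both are reported. The correctness and context components need the diffraction data and external knowledge bases, and are beyond a sketch of this size.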
Ton Spek completed the chemistry overview with detailed case studies of very poor practice in some submissions. He documented where specific raw data availability would resolve some cases.
Session II: beyond chemical crystallography
Miguel Aranda emphasised the potential of raw data preservation and reuse in powder diffraction, illustrating it with his own practice of using Zenodo to share the raw data underpinning a recent article for Cement and Concrete Research. He reported a very positive experience and a fruitful outcome from sharing his raw data at the review stage.
In a late programme change, Kay Diederichs stepped in for Manfred Weiss, a former Acta F Section Editor, and they worked together to prepare the delivered talk on diffraction data deposition and publication in biology. Kay used highlighted sentences from the Workshop prospectus to describe enthusiastically the important roles that MX raw diffraction data could play in challenging cases.
Loes Kroon-Batenburg delivered a remote presentation further developing the theme of raw data opportunities for biological crystallography publishing. Loes structured her talk around assessing the FAIR principles of data use (Findable, Accessible, Interoperable and Reusable) and their current practical implementation. She also unveiled a marvellous new tool in which the diffraction pattern calculated from the authors' macromolecular model is subtracted from the measured diffraction images in order to see what remains to be interpreted. [The Workshop participants were informed that Loes had been selected by CommDat and approved by the IUCr 2020 Congress Programme Committee to present a Keynote Lecture in Prague, which she had accepted.]
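The subtraction idea can be sketched in a few lines of Python. This is not the tool presented at the Workshop, whose implementation is not described here; it is a toy illustration that assumes the measured and calculated images are available as equally sized grids of pixel values and that a single least-squares scale factor brings them onto a common scale. The function names are hypothetical.

```python
def scale_factor(measured, calculated):
    """Least-squares scale k minimising the sum over pixels of (measured - k * calculated)**2."""
    num = sum(m * c
              for m_row, c_row in zip(measured, calculated)
              for m, c in zip(m_row, c_row))
    den = sum(c * c for c_row in calculated for c in c_row)
    return num / den if den else 0.0


def residual_image(measured, calculated):
    """Subtract the scaled calculated pattern from the measured image, pixel by pixel."""
    k = scale_factor(measured, calculated)
    return [[m - k * c for m, c in zip(m_row, c_row)]
            for m_row, c_row in zip(measured, calculated)]


# Toy 2x2 'images': the model accounts for two pixels; the measured image
# has extra intensity in a pixel where the model predicts none.
calculated = [[0.0, 1.0], [2.0, 0.0]]
measured = [[3.0, 2.0], [4.0, 0.0]]

print(residual_image(measured, calculated))
```

In this toy case the modelled pixels cancel exactly and only the unexplained intensity survives in the residual. Real images would also need background, detector and geometry corrections, none of which is attempted here.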
Session III: enhancing the chemical record
The third session reviewed aspects of identifying and remediating problems that had managed to slip past validation efforts, and helping to ensure that authors could continue to work in developing best practices, based on their own experiences and those of others.
Simon Coles and Suzanna Ward undertook to present the work and ideas of Carl Schwalbe, who died shortly before the Workshop, on the challenges of correct tautomer determination. This research undertaken by Carl as a CCDC Honorary Fellow led to questions of how best to remediate these cases held in the CSD.
Mariusz Jaskolski considered the theme of post-publication peer review and the need for remediation in the biological fields. He documented a wide range of cases of problematic and even non-existent ligands described in publications and PDB depositions, and reported that 12% of the ligands in the PDB were 'blatantly wrong'. His solution was a continued, even expanded, emphasis on training in macromolecular crystallography and validation, as well as 'less enthusiastic ligand placement by authors'.
Petra Bombicz, Editor of the review journal Crystallography Reviews, described the roles that review articles can play in such training. She described a range of published articles that addressed topics such as assessing data quality, estimations of processed diffraction intensity variances and weak signals in SAD phasing. She outlined a relevant review article on data science skills for referees and advertised similar articles aimed at referee training for electron crystallography, powder diffraction, chemical crystallography and so on.
Session IV: future prospects
Simon Billinge, Acta A Section Editor, opened the last, forward-looking, session by describing machine learning possibilities and tests as pilot studies, such as for space group determination in pair distribution function raw data.
Gillian Holmes, from IUCr Journals, described the history of IUCrData and its current and future development. Currently focussed on short reports on chemical crystal structures, it was hoped to expand into the area of Biodata reports. This autumn would see the start of planning for a new category involving raw data, with Loes Kroon-Batenburg visiting Chester.
In the concluding lecture John Helliwell described his 20-year-long efforts to encourage biology research to emulate the refereeing and editorial processes of the IUCr chemical journals. He documented that he was now making headway by always insisting, when accepting an invitation to review an article, that the journal provide him with the underpinning data sets (processed structure factors and coordinates) as well as the article narrative and PDB Validation report. He described in detail the typical layout of his report, as recorded in his cited Crystallography Reviews article. He was able to present one of his referee reports in detail, which Nature Communications had published with his and the authors' consent. He mentioned that he had not so far asked for the DOI of any raw data set, but – given the enthusiasm expressed through the day for raw data checking and reuse, as well as a recent IUCr Journals editorial encouraging raw data sharing in general and requiring it in specific types of article – he would implement this in his refereeing from now on. If adopted as formal policy by journals, this improved refereeing practice in biology should considerably reduce the number of post-publication critique articles.
The day concluded with an open and wide-ranging discussion on topics such as open peer review, artificial intelligence and machine learning for extracting information and knowledge from crystallographic data, and the expanding data flows arising from ever-improving synchrotron radiation and X-ray laser facilities and ongoing detector enhancements.
Helliwell, J. R. (2018). Data science skills for referees: I biological X-ray crystallography. Crystallography Reviews, 263-272. DOI: 10.1080/0889311X.2018.1510878
Helliwell, J. R., Minor, W., Weiss, M. S., Garman, E. F., Read, R. J., Newman, J., van Raaij, M. J., Hajdu, J. & Baker, E. N. (2019). Findable Accessible Interoperable Re-usable (FAIR) diffraction data are coming to protein crystallography. IUCrJ, 341-343. DOI: 10.1107/S2052252519005918