Crystallographic data

Final report of the IUCr Diffraction Data Deposition Working Group

John R Helliwell, Brian McMahon, Steve Androulakis, Marian Szebenyi, Loes Kroon-Batenburg, Tom Terwilliger, John Westbrook and Edgar Weckert

August 8th 2017

john.helliwell@manchester.ac.uk and bm@iucr.org

Introduction

The Diffraction Data Deposition Working Group (DDDWG) was established as an initiative of Professor Sine Larsen, then the President of the IUCr, to address the growing calls within the crystallographic community for the deposition of primary diffraction images, with some mechanism that allows their retrieval by other scientists for such purposes as reanalysis, software and methods development, validation and review. The terms of reference were as follows:

It is becoming increasingly important to deposit the raw data from scattering experiments; a lot of valuable information gets lost when only structure factors are deposited. A number of research centres, e.g. synchrotron and neutron facilities, are fully aware of the need and have established detector working groups addressing this issue. The IUCr is the natural organization to lead the development of standards for the representation of data and associated metadata that can lead to the routine deposition of raw data. A Working Group on these matters has thereby been launched by the IUCr Executive Committee, to which the Working Group will report, to be Chaired by Professor John R. Helliwell. Its provisional title is 'Diffraction Data Deposition Working Group of the IUCr'.

The Group convened for the first time at the Madrid IUCr Congress in August 2011, and launched a consultation programme that surveyed all the IUCr Commissions on their perceived requirements for deposition and retrieval of raw (also referred to as primary) experimental data. While the main focus of the Working Group remained on diffraction images, it was useful to know what other categories of experimental data needed to be treated in a parallel fashion; such knowledge helps to orchestrate work flows, metadata standards and other mechanisms to improve the management of experimental data across all of crystallography. The minutes of the inaugural meeting were posted at the IUCr’s Forum on Data to allow for wide and open public consultation on the topic (viewtopic.php?f=21&t=54). An active DDDWG email list helped to structure the Group's internal discussions and workings, and facilitated communication both among members of the Working Group and with the wider formal consultation group, which consisted largely of Commission Chairs and alternates together with the IUCr Executive Committee.

This report spans the whole period of the DDDWG, i.e. 2011-2017.

Summary of Workshops held

After the launch meeting at the IUCr Congress in Madrid, the IUCr DDDWG held several workshops:

The Bergen Workshop surveyed the scene, defined the challenges of archiving raw diffraction data, confirmed terminology and highlighted open issues. From this Workshop DDDWG Member Tom Terwilliger brought together a group of articles published, after full refereeing, in Acta Cryst. D (see references list below).

The COMCIFS Workshop in Warwick brought the DDDWG deliberations to the direct attention of the IUCr's COMCIFS and raised the concept of a 'checkcif for raw data'. This in turn emphasised the need for greater clarity about the all-important metadata that should accompany any raw diffraction data set deposition.

The Warwick COMCIFS Workshop then naturally led to the Rovinj Workshop 'Metadata for Raw Data', whose speakers included the Chair of COMCIFS, James Hester from Australia. By this time major independent initiatives to establish large-capacity data archives, both general and specific, were in place. These, and many other aspects of the DDDWG findings, were summarised in an overview article by DDDWG Members Loes Kroon-Batenburg, Tom Terwilliger, Brian McMahon and John Helliwell published in IUCrJ in early 2017.

The final DDDWG workshop was held in New Orleans in May 2017. It focussed on bringing together reports from as many as possible of the IUCr Commissions on their progress in defining core metadata for the raw data from their experimental activities. This work largely arose from the direct call made to the Commissions by the DDDWG at its very well attended launch meeting at the Madrid IUCr Congress in August 2011.

Other activities

The IUCr DDDWG Members John Helliwell and Brian McMahon, with the IUCr President Marvin Hackert and the IUCr Secretary Treasurer Luc van Meervelt, led the writing of a Response by the IUCr (http://www.iucr.org/iucr/open-data) to the Accord published by the International Council for Science (ICSU), the InterAcademy Partnership (IAP), The World Academy of Sciences (TWAS) and the International Social Science Council (ISSC), Open Data in a Big Data World (https://www.icsu.org/publications/open-data-in-a-big-data-world). In the Response the IUCr acknowledged the importance of this Accord, and endorsed the analysis of the values of open data and the Principles of Open Data set out in the document, which is published in short and long forms on the ICSU website. The IUCr Response noted that the Accord is very general, and has applicability across the entire panorama of science, which it defines as embracing 'all domains, including humanities and social sciences as well as the STEM (science, technology, engineering, medicine) disciplines'. Because the specific values, significance and implementation of Open Data principles will vary in detail between disciplines, the IUCr considered it useful to contribute a detailed response to the Accord as a case study of best practice emerging in one particular field. Specifically, the IUCr holds that the essential component of openness is that the data supporting any scientific assertion should be:

  • complete (i.e. all data collected for a particular purpose should be available for subsequent re-use);
    and
  • precise (the meaning of each datum is fully defined, processing parameters are fully specified and quantified, statistical uncertainties are evaluated and declared).

In addition, DDDWG Members John Helliwell and Brian McMahon proposed a session on Crystallographic Databases for International Data Week, held in Denver in September 2016; the proposal was accepted. The session opened with an introductory talk by John Helliwell and Brian McMahon, after which the following databases presented summaries of their activities: the CSD, COD, ICDD and PDB. A jointly authored article was submitted to the Data Science Journal conference proceedings, and published in summer 2017.

Summary of publications made

The publications arising directly from the DDDWG Workshops and the other activities mentioned above were as follows:

  • Terwilliger, T. C. (2014). Archiving raw crystallographic data. Acta Cryst. D70, 2500–2501.
  • Kroon-Batenburg, L. M. J. & Helliwell, J. R. (2014). Experiences with making diffraction image data available: what metadata do we need to archive? Acta Cryst. D70, 2502–2509.
  • Meyer, G. R., Aragao, D., Mudie, N. J., Caradoc-Davies, T. T., McGowan, S., Bertling, P. J., Groenewegen, D., Quenette, S. M., Bond, C. S., Buckle, A. M. & Androulakis, S. (2014). Operation of the Australian Store.Synchrotron for macromolecular crystallography. Acta Cryst. D70, 2510–2519.
  • Guss, J. M. & McMahon, B. (2014). How to make deposition of images a reality. Acta Cryst. D70, 2520–2532.
  • Terwilliger, T. C. & Bricogne, G. (2014). Continuous mutual improvement of macromolecular structure models in the PDB and of X-ray crystallographic software: the dual role of deposited experimental data. Acta Cryst. D70, 2533–2543.
  • Kroon-Batenburg, L. M. J., Helliwell, J. R., McMahon, B. & Terwilliger, T. C. (2017). Raw diffraction data preservation and reuse: overview, update on practicalities and metadata requirements. IUCrJ, 4, 87–99.
  • Bruno, I., Gražulis, S., Helliwell, J. R., Kabekkodu, S. N., McMahon, B. & Westbrook, J. (2017). Crystallography and Databases. Data Sci. J. 16, 38.

Technical trends

Since 2011 there have been very significant developments in two areas: the provision of data archives, and the capabilities of central facilities and home laboratories.

These are described in detail in Kroon-Batenburg et al. (2017), IUCrJ, 4, 87–99. This article led to email correspondence between John Helliwell, Brian McMahon and Andreas Forster of Dectris Ltd. He emphasised that improved knowledge of digital data compression algorithms helps to contain, to some extent, the considerable growth in diffraction image data volumes made possible by the Dectris pixel detectors, and that these compression algorithms can be applied without significant loss of measurement detail or precision (Forster, pers. commun.). Andreas Forster has published a Phenix newsletter article on the topic, and John Helliwell and Brian McMahon have encouraged a more detailed article.
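The point about compression can be illustrated with a minimal sketch. It is not a description of any algorithm used with Dectris detectors: it assumes a synthetic, sparsely populated photon-count frame and uses the general-purpose zlib (DEFLATE) codec from the Python standard library, and the function names and frame dimensions are hypothetical. It shows only that a frame consisting mostly of zero counts can shrink considerably while every pixel value is recovered exactly on decompression.

```python
import random
import zlib
from array import array


def make_synthetic_frame(width=256, height=256, seed=0):
    """Build a hypothetical sparse photon-count frame (illustration only)."""
    rng = random.Random(seed)
    # Unsigned 16-bit counts, initialised to zero (empty background).
    counts = array("H", [0] * (width * height))
    # Place small counts on roughly 2% of the pixels.
    for _ in range(width * height // 50):
        counts[rng.randrange(width * height)] = rng.randint(1, 500)
    return counts


def compress_frame(counts):
    """Compress the raw pixel bytes with zlib (a lossless codec)."""
    return zlib.compress(counts.tobytes(), 6)


def decompress_frame(blob):
    """Recover the pixel array from the compressed bytes."""
    restored = array("H")
    restored.frombytes(zlib.decompress(blob))
    return restored


frame = make_synthetic_frame()
blob = compress_frame(frame)
restored = decompress_frame(blob)

# Ratio of uncompressed to compressed size for this synthetic frame.
ratio = len(frame.tobytes()) / len(blob)
print(f"compression ratio: {ratio:.1f}, lossless: {restored == frame}")
```

The ratio obtained depends entirely on the synthetic data assumed here and says nothing quantitative about real diffraction images.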

IUCr DDDWG Recommendations

  • Authors should provide a permanent and prominent link from their article to the raw data sets that underpin their journal publication and the associated database deposition of processed diffraction data (e.g. structure factor amplitudes and intensities) and coordinates. These raw diffraction data sets should obey the 'FAIR' principles, i.e. they should be Findable, Accessible, Interoperable and Re-usable (https://www.force11.org/group/fairgroup/fairprinciples).
  • A registered Digital Object Identifier (DOI) should be the persistent identifier of choice (rather than a Uniform Resource Locator, URL) as the most sustainable way to identify and locate a raw diffraction data set.
  • An archive of raw diffraction data sets for currently unsolved crystal structures should be pursued.
  • An archive of raw diffraction data sets showing significant diffuse scattering should be pursued.
  • Workshops for research data management training for the community should continue and be sponsored and organised by the IUCr.
  • The IUCr Executive Committee should continue to check regularly the IUCr Commissions' progress in logging the metadata for their raw diffraction data.
  • Archived raw diffraction data should be automatically validated wherever possible via a 'checkcif for raw data' approach, and be peer reviewed where necessary. At a minimum, validation should cover the core metadata: beam centre of the diffraction image, wavelength, wavelength bandpass (pink beam case), orientation of all axes, pixel sizes, and detector position and orientation.
  • Jointly with the IUCr Commission on Crystallographic Computing, the IUCr should pursue reproducibility-of-science objectives, which require open-source software and accurate versioning.
  • IUCr should engage with vendors and the World Data System to promote the certification of raw diffraction data standards.
  • IUCr’s CommDat, whose first meeting is scheduled for August 27th 2017 in Hyderabad, should maintain the directory of data archives, adding any new data archives that are established in future. [These are currently listed and described in Kroon-Batenburg et al. (2017), IUCrJ, 4, 87–99.]
  • IUCr should invite the community to alert CommDat to further case studies that document the value of archiving raw diffraction data. [Current case study examples are included in a manuscript submitted for publication: Helliwell, McMahon, Guss & Kroon-Batenburg, The Science is in the Data.]
  • IUCr recognises that metadata for the sample are clearly vital for all the IUCr Commissions (and are especially diverse in small angle scattering); standardised descriptions of these metadata should be actively pursued by the Commissions.
  • CommDat should regularly monitor the evolution of technology, as the pace of change in data measurement rates and in metadata logging, with new detectors, computer hardware, networks and electronic laboratory notebooks, is especially notable.
  • IUCr should actively support the neutron, synchrotron and X-ray laser facilities in their raw data archiving activities.
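Two of the recommendations above, the use of a DOI as the persistent identifier and the automatic checking of core metadata, lend themselves to a short illustration. The following is a minimal sketch only, not an IUCr tool or an existing 'checkcif for raw data' implementation. The dictionary key names, the function names, the example values and the example DOI are all hypothetical; the DOI check is a loose syntactic test, and the only external fact relied upon is that DOIs resolve through https://doi.org/.

```python
from urllib.parse import quote

# Hypothetical key names standing in for the core metadata items
# listed in the recommendation; they are not a published standard.
CORE_METADATA_KEYS = (
    "beam_centre",
    "wavelength",
    "axis_orientations",
    "pixel_size",
    "detector_position",
    "detector_orientation",
)
PINK_BEAM_KEY = "wavelength_bandpass"  # only required in the pink beam case


def is_blank(value):
    """True if a metadata value is absent or an empty string/container."""
    if value is None:
        return True
    return isinstance(value, (str, list, tuple, dict)) and len(value) == 0


def missing_core_metadata(record, pink_beam=False):
    """Return the core metadata keys that the record does not supply."""
    required = list(CORE_METADATA_KEYS)
    if pink_beam:
        required.append(PINK_BEAM_KEY)
    return [key for key in required if is_blank(record.get(key))]


def looks_like_doi(identifier):
    """Loose syntactic check: '10.<registrant>/<suffix>' (assumption)."""
    prefix, sep, suffix = identifier.partition("/")
    return prefix.startswith("10.") and len(prefix) > 3 and sep == "/" and bool(suffix)


def doi_to_url(doi):
    """Build a resolver link for a DOI via the doi.org proxy."""
    if not looks_like_doi(doi):
        raise ValueError(f"not a DOI: {doi!r}")
    return "https://doi.org/" + quote(doi, safe="/")


# Hypothetical record for one raw data set; all values are invented.
example_record = {
    "identifier": "10.1234/example-raw-data",  # hypothetical DOI
    "beam_centre": (1024.5, 1030.2),
    "wavelength": 0.9795,
    "axis_orientations": {"omega": (1, 0, 0)},
    "pixel_size": (0.172, 0.172),
    "detector_position": 250.0,
    # "detector_orientation" deliberately omitted
}

missing = missing_core_metadata(example_record)
pink_missing = missing_core_metadata(example_record, pink_beam=True)
landing_url = doi_to_url(example_record["identifier"])
print(missing, pink_missing, landing_url)
```

A real validation service would go beyond presence checks, for example by testing units and value ranges, and would follow whatever metadata definitions the Commissions and COMCIFS agree.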

Appendix: Membership

  • John R Helliwell and Brian McMahon (UK), Chair and Co-Chair
  • Steve Androulakis (Australia)
  • Sol Gruner (2011-2014)/D. Marian Szebenyi (2014-2017) (USA)
  • Loes Kroon-Batenburg (Netherlands)
  • Tom Terwilliger (USA)
  • John Westbrook (USA)
  • Heinz-Josef Weyer (Switzerland) †
  • Edgar Weckert (Germany)