Metadata for raw data from X-ray diffraction and other structural techniques
Rovinj, Croatia, August 2015
'Thank you for organizing such an inspiring workshop on what was nominally a rather dry subject.' Andreas Förster, Dectris
Indeed, the term 'metadata' - often described as 'data about data' or 'information to help you understand the data' - is generally held to be a dry topic, of importance to digital librarians and data analysts, but irrelevant or even an obstacle to the real business of science. This two-day satellite workshop of the 2015 European Crystallography Meeting demonstrated emphatically that this is far from the truth. Some 20 expert speakers from Europe, Australia and the USA (two presenting remotely over the Internet) surveyed the central importance of detailed and high-quality metadata to the interpretation, validation and use of experimental data.
The workshop was organized by the IUCr Diffraction Data Deposition Working Group (DDDWG) in association with the Croatian Association of Crystallographers. J. Helliwell, Chair of the DDDWG, explained how it had been working over the past four years to analyse the prospects for routine deposition of raw experimental data, and had become aware that storage capacity for the vast amounts of raw data being generated at modern synchrotron and neutron facilities is almost the least of our worries. For these data to be re-used, it is essential that all details of the experimental arrangement are documented and retrievable - this is where 'metadata' comes into play.
L. Kroon-Batenburg and W. Minor, among others, highlighted the very low level of standardisation in storing basic information about orientation, exposure, oscillation axis etc. in the header of each image. The workshop renewed calls for agreement on a minimum set of such metadata that should be recorded in every image. H. Bernstein and A. Förster illustrated how the necessary definitions already existed in the imgCIF dictionary, and could effectively be carried over to the HDF5/NeXus files that are becoming the norm in high-volume imaging.
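As a purely illustrative sketch of what a 'minimum set' check might look like in practice, the Python fragment below reports which required items are absent from an image header. The field names and the choice of required items are assumptions made for this example only; they are not taken from the imgCIF dictionary or from NeXus, and the workshop did not agree on any such list.

```python
# Hypothetical minimum metadata set for a single diffraction image.
# These names are illustrative only; they are NOT imgCIF or NeXus item names.
REQUIRED_FIELDS = (
    "wavelength_angstrom",
    "exposure_time_s",
    "oscillation_axis",
    "oscillation_start_deg",
    "oscillation_range_deg",
    "detector_distance_mm",
)


def missing_metadata(header):
    """Return the required field names that are absent (or None) in a header dict."""
    return [name for name in REQUIRED_FIELDS if header.get(name) is None]


# An incomplete header of the kind the speakers described (values invented).
example_header = {
    "wavelength_angstrom": 0.9795,
    "exposure_time_s": 0.1,
    "oscillation_axis": "omega",
}

print(missing_metadata(example_header))
```

A real implementation would take its item names and definitions from an agreed dictionary rather than from a hard-coded list like this one.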
J. Hester and B. McMahon, both active in the Committee to maintain the IUCr CIF data exchange standard (COMCIFS), discussed the importance of identifying concepts that needed to be recorded, and the relative lack of importance of the chosen storage format. While multiple formats do in practice hinder interoperability, there is no fundamental barrier to creating concordances and translation tools to build seamless data management systems in which crystallography is but one of many contributing disciplines.
Current and evolving practice in data capture and management was described across a range of large-scale facilities accommodating a variety of techniques and sciences: the European Synchrotron ESRF (A. Götz, G. Leonard), the Institut Laue-Langevin (M. Blakeley), and the UK STFC and Diamond Light Source at the Rutherford Laboratories (B. Matthews and P. Aller). S. Coles spoke about the challenges of data management in home laboratories and medium-scale service providers such as the UK National Crystallography Service. In all these locations, the data from an experiment must be handled in the context of resource management, provenance, validation and bulk storage, all of which require ever greater volumes of metadata that should conform to widely accepted standards.
The importance to databases of carrying extensive metadata throughout the scientific process was described by S. Ward (CCDC) and J. Westbrook (PDB), while T. Terwilliger developed the theme of 'The Living PDB', where deposited structures could be revised, improved and continuously updated in the light of new scientific developments. M. Wall emphasised that exciting new science potentially lay in the diffuse scattering in the images that is largely ignored when deriving structures solely from the Bragg peaks. K. Dziubek outlined the additional metadata that were needed to perform a complete analysis of structures collected under high pressure and other non-ambient conditions.
In an intriguing presentation, N. Johnson demonstrated that plausible diffraction images could be manufactured. In principle, such artificial images could be produced to support fraudulent experimental results. Here, again, rich metadata describing the full provenance of the images and the context in which they were collected could help in forensic analysis of suspect data. Indeed, quite apart from worries about fraud, the more metadata that are available for cross-comparison, the more the data can be analysed (or reanalysed) for consistency, and the more trust can be placed in the scientific deductions that use the data.
The same considerations had encouraged the development by the IUCr of checkCIF as a validation tool in the publication of crystal and molecular structures. There was a strong feeling in this workshop that the time was rapidly approaching for the crystallographic community to work on a similar 'checkCIF' mechanism for the validation and evaluation of experimental data - perhaps a topic for the next DDDWG Workshop?
Perhaps most noteworthy is that the work of the DDDWG has become so much more urgent as raw data sets become increasingly available in the scientific environment. When this workshop was first planned, rather few images were being stored on publicly accessible platforms. Now, one may find raw data sets in repositories such as Australia's Store.Synchrotron, on the NIH BD2K website www.proteindiffraction.org/ run by W. Minor's group, on the shared resource site Zenodo, and in the powder pattern database maintained by the International Centre for Diffraction Data. Whether this growth will turn into a deluge of diffraction data sets is still unclear; what is certain is that the best use of such data sets will depend on metadata developments such as those explored during those two sunny days in Rovinj.
Videos of all the presentations are available at the Workshop website http://tinyurl.com/diffraction-metadata. We are grateful to all our speakers for their outstanding presentations and contributions to the discussion, to the Croatian Association of Crystallographers for hosting the event, and to the IUCr and industrial sponsors for providing the necessary funding.
Brian McMahon and John R. Helliwell