International Union of Crystallography

The importance of metadata

The Diffraction Data Deposition Working Group (DDDWG) of the IUCr has been in operation since 2011; its activities can be followed on its home page and public forum (http://forums.iucr.org?f=21).

A meeting of the Working Group at the 2014 IUCr Congress in Montreal concluded that there were promising movements towards widespread deposition of raw (otherwise known as 'primary') data, but that a number of limiting factors remained. (1) With no obvious single institution to archive all crystallographic raw data, the initial strategy should be to encourage voluntary deposition in the locations most convenient for authors (e.g. synchrotron and other instrument facilities, university and institutional repositories, and domain repositories such as the Australian Synchrotron.Store). (2) Search and discovery functions across diverse locations depend on common metadata identifying and describing data sets. The obvious candidate for an identifier is the Digital Object Identifier (DOI), because machinery already exists to register and share DOI information. (3) Because molecular/atomic structural studies increasingly rely on a range of technologies and techniques, it is desirable to harmonise metadata descriptions across as many such technologies as possible. Studying the 'arrangement of atoms' in its most general sense - as well as diffraction, spectroscopy and microscopy - has long been recognized as fitting within the remit of the IUCr.
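As a rough illustration of point (2), the sketch below merges records held at different locations into a single index keyed by DOI and then searches it by technique. It is only a sketch under stated assumptions: the record fields, the function names and the DOIs are all made up for this example and do not represent any schema agreed by the Working Group.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RawDataRecord:
    """Hypothetical minimal description of one deposited raw data set."""
    doi: str         # persistent identifier of the data set
    repository: str  # where the data are physically held
    technique: str   # e.g. "single-crystal diffraction"


# Common ways a DOI is written; stripped so that the same DOI always
# produces the same index key.
DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalise_doi(doi):
    """Return the bare, lower-case DOI (DOI names are case-insensitive)."""
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


def build_index(*repositories):
    """Merge records from several repositories into one DOI-keyed dict.

    If the same DOI appears more than once, the later record wins.
    """
    index = {}
    for records in repositories:
        for record in records:
            index[normalise_doi(record.doi)] = record
    return index


def find_by_technique(index, technique):
    """Return all records for one technique, ordered by DOI."""
    return [index[doi] for doi in sorted(index)
            if index[doi].technique == technique]


# Made-up holdings of two different locations.
facility = [
    RawDataRecord("10.5555/EXAMPLE.A", "instrument facility",
                  "single-crystal diffraction"),
]
university = [
    RawDataRecord("doi:10.5555/example.b", "university repository",
                  "powder diffraction"),
    RawDataRecord("https://doi.org/10.5555/example.c",
                  "university repository", "single-crystal diffraction"),
]

index = build_index(facility, university)
hits = find_by_technique(index, "single-crystal diffraction")
```

The point of the design is that neither location needs to know about the other: as long as each describes its holdings with the same few fields and a registered identifier, a third party can combine and search them.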

While 'metadata' enters the discussion in the context of building distributed systems for search and discovery, identification and retrieval of data sets, it rapidly becomes apparent that there is much more to metadata than that. 'Metadata' is variously defined, but the general sense is that it is the information that is needed to make sense of data, to allow its reuse, validation and critical analysis. Yet such 'information' is itself data - data that collectively open doors to further avenues of study, and even new scientific insight. Standard uncertainties on atomic positions modify the weights that should be given to structural models collected in databases, and so subtly affect our understanding of chemical bonding or biological function (e.g. in knowledge-based research using the Cambridge Structural Database or Protein Data Bank). The raw intensities that are ignored in models based solely on Bragg peaks (i.e. the diffuse scattering) can now be reanalysed to provide insights into correlated disorder. Comparison of structural models derived from X-ray crystallography and from NMR can deepen understanding of protein structure and dynamics. Analysis of diffraction intensities from different experiments can reveal examples of systematic bias (or, in extreme cases, dishonest practice).
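To make the remark about standard uncertainties concrete, the sketch below combines several determinations of the same quantity (say, a bond length taken from different database entries) using inverse-variance weights, w = 1/s.u.^2. This weighting scheme is a common statistical choice assumed here for illustration; it is not prescribed by the text above, and the function name and numbers are hypothetical.

```python
import math


def weighted_mean(values, sus):
    """Inverse-variance weighted mean and its standard uncertainty.

    values: individual determinations of the same quantity
    sus:    their standard uncertainties (same units, all > 0)
    """
    if len(values) != len(sus) or not values:
        raise ValueError("need equally many values and uncertainties")
    if any(su <= 0 for su in sus):
        raise ValueError("standard uncertainties must be positive")
    # A determination with half the uncertainty gets four times the weight.
    weights = [1.0 / (su * su) for su in sus]
    total = sum(weights)
    mean = sum(w * v for w, v in zip(weights, values)) / total
    return mean, math.sqrt(1.0 / total)


# Made-up bond lengths (in angstroms) from two hypothetical entries.
mean, su = weighted_mean([1.50, 1.54], [0.01, 0.02])
```

With these made-up inputs the more precise determination dominates, so the combined value lies much closer to 1.50 than to 1.54. Without the standard uncertainties recorded as metadata, the two entries could only be averaged with equal weight.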

Overall, the richer the metadata available to the scientist, the greater the potential for new discoveries. Crystallography is exceptional in the richness and granularity of metadata descriptors already available, mostly in diffraction-based research, and largely owing to the data dictionaries developed within the Crystallographic Information Framework (CIF), as shown in a Satellite Symposium to ECM28. (That said, the achievements of other research communities, such as astronomers, in making their data available should also be recognized. Our enthusiastic participation in organisations such as the International Council for Science (ICSU) and its Committee on Data (CODATA) is vital, both to represent crystallography and to learn of best practice from other research communities.)

A two-day Satellite Workshop at the forthcoming European Crystallographic Meeting (ECM29) will survey the many uses already being made of crystallographic metadata, especially where associated with raw data capture, analysis and reuse. We shall identify areas where better metadata descriptors are required, and we shall begin to look at the challenges of defining new metadata, especially in studies which do not have the clean, well-defined parameters of classical single-crystal or powder diffraction experiments. Some of the biggest challenges are being faced at the centralised synchrotron (and X-ray laser) and neutron facilities, where colossal quantities of diffraction, spectroscopy and especially microscopy raw data are being generated, and also in the databases which must organise and protect access to the fruits of all our researches in perpetuity. Attendees at ECM29 are encouraged to register and participate in this important Workshop. Those unable to attend may watch the proceedings streamed live on the Web (http://ecm29.ecanews.org/ecm29-live/). We warmly encourage the community to join us in this Workshop and in follow-up activities.

John Helliwell, University of Manchester
Brian McMahon, IUCr