Reasons for raw data archiving and reuse in chemical crystallography
Crystallographers have a long-standing tradition of linking the underpinning data to their publications. Chemical crystallography has led the way, harnessing new technologies for data storage, carefully defining its metadata via the world-renowned crystallographic information file (CIF) and developing an extensive checkCIF procedure for vetting these data. Acta Cryst. C has also led the way with its exemplary article submission procedure, comprising the authors’ narrative, their underpinning coordinates and structure factors, and the checkCIF report. Thus an editor and referees can undertake direct calculations and scrutinise the outcomes (alerts) highlighted in the standardised checkCIF process. Readers thereby enjoy articles with data vetted to the highest degree possible, and the databases can likewise harvest these perfect fruit.
In recent years there has been a major increase in digital storage capability, along with an expansion of generic data archives such as those provided by individual universities or administered centrally, such as the EU’s Zenodo. Naturally the question has now arisen as to the need for the preservation of raw, i.e. experimental, diffraction data sets across all of crystallography. That this is feasible and should be pursued was the top recommendation of the IUCr’s Diffraction Data Deposition Working Group (DDDWG) in its final report, presented to the IUCr in 2017 in Hyderabad. This would also keep the IUCr in line with the general exhortation for scientific data to be FAIR, i.e. Findable, Accessible, Interoperable and Reusable.
The individual IUCr Commissions are now digesting the IUCr DDDWG Final Report. The DDDWG has been incorporated into the IUCr’s new Committee on Data (CommDat). Two members of CommDat, Amy Sarjeant and Simon Coles, are now consulting the structural chemistry community via a questionnaire.
I encourage you to take part by answering the questionnaire.
John R. Helliwell
Chair of IUCr CommDat and Chair of the IUCr DDDWG (2011-2017).
It is now common to deposit structure factors when publishing journal articles, and this caters very well for routine small-molecule structures. However, this is only the case if everything in a raw image is fully and/or properly accounted for and the model is correct or appropriate. In some cases raw data may no longer be required, but in others it may be needed to validate a result or to 'do better' in the future. Now is the time to explore raw data archival practice and to gather opinions on whether and how raw data could or should be used if it were made more widely available.
'Data' can generally be considered to be raw data, processed data and derived data; in the crystallographic context these are diffraction images, structure factors and crystal structures, respectively. Recently some progress has been made, in that software will include processed data (structure factors) in the CIF result and that validation processes and the Cambridge Structural Database (CSD) will make use of and curate this data. But we can go further still: not only could raw data improve validation processes and provide valuable training sets for software developers to improve algorithms, but there is a more interesting issue. A diffraction experiment records the average signal from the whole sample, which includes defects, impurities, etc., yet often only the data needed to obtain a perfect result is extracted. For materials engineering it can be crucial to understand these additional effects, yet the fact that they have been observed is rarely made public!
Raw data availability can therefore be very important; however, there are often counterarguments about the overhead of providing it. The diffraction experiment is relatively quick and cheap, so why not just do it again? The real cost of determining a structure again was assessed by the UK National Crystallography Service as part of the ‘Keeping Research Data Safe Project’. There are many nuances to such a costing, but if one has to factor in that the research expertise/group/laboratory that originally generated the material may no longer exist, or may no longer be set up (people, apparatus, etc.) to make such materials, then the cost rapidly escalates. The replacement cost of the CSD is therefore almost unmeasurable!
Barriers to raw data archival include file size, file format interoperability, and a lack of perceived need. The macromolecular community has recognized the need for raw data archival and various workflows and deposition standards have arisen to meet this need. However, the small-molecule community lags behind.
Data transfer and storage problems are now being overcome, and for around 15 years there has been an ‘extension’ to CIF (imgCIF) that can cater for raw data, yet its uptake by the small-molecule community has been very slow indeed. So why aren’t we amassing more of our valuable raw data for the community to exploit widely? For the last five years a group known as the Diffraction Data Deposition Working Group has been looking into the issues surrounding this topic. The outcome of this group's activity is that the IUCr has recently convened a Committee on Data, ‘CommDat’, as an advisory committee to the Executive.
It is generally assumed that small-molecule crystallographers do archive raw data, but not in the ‘best’ way, i.e. easily searchable and in a ‘structured’ environment, and that this community does not think about making raw data visible beyond its own use. We are therefore looking to find out the following:
- The extent to which small-molecule crystallographers archive raw data (in a sharable way?) and what stands in their way.
- How much 'educating' of crystallographers is required to illustrate the benefits of archiving (both for oneself and for others).
- How could raw data archives be used in validation, e.g. would it be more justifiable to publish a ‘poorer’ result if raw data were made available?
- What are the driver(s) for the community in terms of using the contents of a raw data archive?
We have created a survey to canvass the small-molecule community to determine the answers to these questions. Whether you currently actively archive your raw data or not, we encourage you to take part in order to help to better define the problems and barriers to this important endeavour. The survey explores the following two themes:
1. Your archived raw data: why? where? managed? searchable? available? ever revisited?
2. Different people in different roles might have different drivers/reasons for archiving and revisiting raw data. We suggest the following as some food for thought and want to canvass views around these and any other related matters:
- Validation: a result provides a contribution to chemical knowledge, but is of poor quality
- Validation: to support a 'grand' claim
- To back up modelling of disorder, twinning, incommensurate, modulated structures
- To back up modelling of diffuse scattering
- To make available e.g. disorder, twinning, incommensurate, modulated, diffuse scattering datasets so others can attempt to resolve them
- To support ‘Advanced Experiments’ e.g. charge density, high pressure, phase transition, gas environment, excited states
- When it is clear that future improvement may be possible through developments in software and modelling
- Training sets/benchmarking for software/methods developers
The survey takes just a few minutes to complete and consists of the following multiple-choice/option format questions:
- If you don't archive raw data, what are the main reasons?
- Where do you keep your raw data?
- Do you manage your archive?
- Is your raw data searchable?
- Is your raw data externally available?
- If it were to become policy to make raw diffraction data from funded research accessible three years after measurement, would you endeavour to comply?
- Are there any organisations that might have a controlling hand in your research that have a policy mandating you to manage and/or share your raw data?
- Would you anticipate being able to pay for external archive/repository services and facilities e.g. through research grants or institutional funding?
- How likely is it that you will need to revisit raw data in your own work?
- How likely is it that you will need to examine raw data when reviewing someone else’s work?
- If you had access to a repository containing raw data, what would you want to use it for?
- In what situations do you feel publishing raw data may be necessary?
- The description that best fits my role is ...
- I come from ...
The survey can be accessed here and will close on 1 March 2019.
We would like to thank you in advance for helping us understand this complex topic and we aim to communicate the results by mid-2019.
Simon Coles (National Crystallography Service, University of Southampton, UK)
Amy Sarjeant (Cambridge Crystallographic Data Centre, USA)