Raw data availability: the small-molecule crystallography perspective
In a research climate that encourages the application of ‘FAIR’ data principles (that scientific data be Findable, Accessible, Interoperable and Reusable), crystallography has been able to hold its head high. Development of the Crystallographic Information Framework has led to the standardisation of datafile formats (and, more importantly, the precise codification of machine-readable terms for describing all useful attributes of data and associated metadata). Journals have required deposition of derived data in the form of atomic positional coordinates and displacement parameters, and, in many cases, the structure factors or Rietveld profiles from which models have been derived. The value of aggregating these results into searchable collections of structural models in databases and journals has been well demonstrated over the last half-century.
However, in recent years, in line with concerns about research reproducibility (a 2018 Special Collection of Nature articles illustrates concerns in a variety of disciplines), focus in crystallography has shifted towards the desirability – or need – to retain and make available the primary experimental data (‘raw’ data) coming from the instruments. The IUCr commissioned a Diffraction Data Deposition Working Group (DDDWG) to consider the motivation and value of routinely storing and making available raw data sets.
The DDDWG’s final report (2017) spanned all the IUCr Commissions and the opportunities for these communities to harness the massive increases in archiving capabilities, even for raw data. The first of 14 Recommendations was:
Authors should provide a permanent and prominent link from their article to the raw data sets which underpin their journal publication and associated database deposition of processed diffraction data (e.g. structure factor amplitudes and intensities) and coordinates, and which should obey the 'FAIR' principles.
Several case studies across biological and chemical crystallography and powder diffraction were published to demonstrate the value of preserving raw data (Helliwell et al., 2017).
At the inaugural meeting of the IUCr Committee on Data (CommDat) during the August 2017 IUCr Congress in Hyderabad, India, a start was made on discussing the Commissions’ reactions to the DDDWG's final report. Subsequent detailed discussion in the Commission on Biological Macromolecules led to an article published in IUCr Journals with a specific implementation plan encouraging the systematic archival of experimental data with trusted repositories, such as the PDB, and linking to raw data from publications where possible (such linking is available from IUCr Journals, for example) (Helliwell et al., 2019). The powder diffraction community has also begun to react to the DDDWG's final report; see, for example, Aranda (2018).
However, for structural chemistry, a general hypothesis was aired during the CommDat meeting that it is not necessary for small-molecule crystallographers to make all raw data available, although there are many clear cases where this is desirable. In most cases, small-molecule data are clean and simple. If, in an ideal situation, it is possible to demonstrate that all Bragg diffraction has been accounted for, that there is nothing of interest remaining in the images and that they have been processed into structure factors appropriately, then why would it be necessary to retain the raw data?
Surveying the field
Here follows a summary of our exploration of this hypothesis, the first step of which was a survey (announced in the IUCr Newsletter and sent to a number of individuals) questioning current raw data management practices, as these underpin the ability to make the data available in the first place. Following our summary of the survey results presented here, we will hold a chemical crystallographers’ workshop at the 2020 IUCr Congress and General Assembly from which we intend to develop initial guidance on best practices for archiving small-molecule crystallographic raw data and making it publicly available.
While our personal experiences led us to believe that the small-molecule community lagged behind other disciplines in its readiness to store and share raw data, we were curious to know if this was truly the case and, more importantly, why. We designed a short online survey in consultation with CommDat colleagues and disseminated it through various crystallographic and social networks. A total of 193 responses were received from around the world, representing a good cross-section of academia, industry and government laboratories, and of researchers, professors and staff crystallographers. The questions, responses and raw survey data have been deposited in the Chemical Crystallography Community grouping of the Zenodo repository here.
Current raw data archiving practice
While a majority of survey respondents, nearly 80%, answered a binary yes/no question by claiming that they do archive their raw data, more probing questions reveal that our initial suspicions were largely correct. Few people archive their raw diffraction data in a systematic, searchable, robust and secure way. Data archiving is predominantly performed 'in-house' – that is, on laboratory or office computers and with little backup, either off-site or to a secondary hard drive. Only a very few respondents store their raw data on facility/institutional archives, and almost nobody uses an independent or commercial cloud-based approach.
For those who do archive their raw data, more than a quarter have no management of their archives. The rest claim some management practices, though these are often 'dark', i.e. not accessible by others and to be used only for disaster recovery, or they are intermittent and incomplete (Fig. 1). The majority of archives are not only inaccessible to all but the facility manager but also unsearchable. By unsearchable, we mean that there is no structuring of the information or indexing based on a number of key identifiers/fields. Any search capability that does exist relies on operating-system-based tools and pivots on a single identifier, such as folder name (which invariably relates to a sample identifier). Sometimes, there is additional grouping, but this is in arbitrary categories such as year of data collection or which diffractometer was used (although such categorisation may suggest suitable metadata for structuring an efficient distributed search/indexing system). This behaviour leads to a situation where only those 'in the know' are able to find data for a particular experiment. Though it is admirable that the majority of respondents to the survey do currently take some steps to back up their raw data, clearly more can be done and there are likely to be a number of 'quick wins' that can be achieved at little or no cost (a considerable contributing factor) simply by raising awareness.
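As an illustration of one such 'quick win', the categories already used informally (sample identifier, year of data collection, diffractometer) could serve as fields in a small searchable index. The following is only a minimal sketch under stated assumptions: the field names, the example entries and the helper functions build_index and find are all hypothetical, and nothing here reflects a practice reported by survey respondents or an agreed community standard. It uses only the SQLite module from the Python standard library.

```python
import sqlite3


def build_index(records):
    """Create an in-memory index of raw data sets.

    Each record is (sample_id, year, diffractometer, path). These four
    field names are illustrative assumptions, not a community standard.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE datasets ("
        "sample_id TEXT, year INTEGER, diffractometer TEXT, path TEXT)"
    )
    conn.executemany("INSERT INTO datasets VALUES (?, ?, ?, ?)", records)
    conn.commit()
    return conn


def find(conn, **criteria):
    """Return the paths of data sets matching every field=value pair."""
    allowed = {"sample_id", "year", "diffractometer"}
    unknown = set(criteria) - allowed
    if unknown:
        raise ValueError(f"unsupported search fields: {sorted(unknown)}")
    # Field names are checked against the allowed set above; the values
    # themselves are passed as bound parameters.
    where = " AND ".join(f"{field} = ?" for field in criteria) or "1"
    rows = conn.execute(
        f"SELECT path FROM datasets WHERE {where} ORDER BY path",
        tuple(criteria.values()),
    )
    return [row[0] for row in rows]


# Hypothetical example entries; real archives would be scanned from disk.
EXAMPLE_RECORDS = [
    ("abc001", 2018, "diffractometer-A", "/archive/2018/abc001"),
    ("abc002", 2019, "diffractometer-A", "/archive/2019/abc002"),
    ("xyz107", 2019, "diffractometer-B", "/archive/2019/xyz107"),
]

if __name__ == "__main__":
    index = build_index(EXAMPLE_RECORDS)
    print(find(index, year=2019, diffractometer="diffractometer-B"))
```

Searching on any combination of fields, rather than on folder name alone, is what would let someone other than the facility manager locate the data for a particular experiment.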
The availability of raw data
The predominant culture in small-molecule crystallography appears to be somewhat protective: 152 out of 186 respondents declare that data are not made readily available to collaborators. This is likely to reflect the fact that many of these facilities are relatively small-scale enterprises run by, at most, a handful of people, many of whom have to charge service fees in order to continue operating. This is generally in sharp contrast to macromolecular crystallography facilities. Macromolecular crystallography has taken the lead in making experimental data freely available, but while many great initiatives from this community have been followed by other crystallography subdisciplines in the past, this may be an approach that is less attractive in small-molecule crystallography. Another aspect of raw data availability is compliance with mandates imposed by funding institutions to make all experimental data openly available. Of our survey respondents, more than half claimed to be unaware of funding mandates, yet suspected that they must exist. Most respondents would also be willing to follow policies to make raw data sets for funded research accessible at some time after measurement, roughly evenly split between those who would do so for all their raw data sets and those who would do so only where mandatory. There were 17 respondents who claimed that they would not comply with such a policy, even if such compliance were mandatory. Generally, the aspirations of funders in adopting these approaches are well founded, in that they want data (generally funded by taxpayers) to be more widely exploited. However, the state of policies varies significantly around the world, particularly in terms of implementation and policing. These mandates can, therefore, be very polarising in the research community and often grudgingly followed at minimum compliance levels only – they are rarely embraced for their aspirations or intentions.
To understand why some may be resistant to or unable to comply with data archiving, it is revealing to look at the reasons why facilities might not currently archive or make their data available. Forty percent of those who do not archive their raw data (14 out of 35) believe this is an unnecessary measure. However, the most common reason people don’t back up their data appears to be a lack of appropriate infrastructure. Some respondents declare a lack of ability as a reason, and we suspect, therefore, that 'infrastructure' refers not only to supplying space for storage but also to the tools required to manipulate, transform, annotate, search and validate raw data. A lack of finances is also a clearly stated factor, with more than half of respondents stating that they are unwilling to absorb the costs associated with archiving and managing raw data. The current funding and operating models of small-molecule crystallography facilities are quite simply unable to cater to these aspects – this is unlikely to change and points to the requirement for community-level and centralised solutions and approaches.
The value of making raw data available properly
A key factor often overlooked in data management activities is incentivisation. Initially, there is often resistance to changes in routine practice, and when these changes are imposed by external institutions or funding agencies, they are all too often viewed as punitive. It is, therefore, important to understand what our community sees as the value of good data management and to articulate this clearly. Responses to the survey in this regard were generally positive, with many of the opinion that making raw data available would enable new scientific insights (Fig. 2). Validation of results is a clear driver for making data available, while training and methods development were also considered worthy incentives.
It is also worth noting that this discussion has centred around service crystallography – the predominant mode of operation for most facilities. However, there is a large and growing element of crystallography that we will group here as 'advanced techniques', including, but by no means limited to, quantum crystallography, dynamic crystallography (covering numerous methods), neutron scattering, electron diffraction, nuclear magnetic resonance crystallography and diffuse-scattering analysis. These advanced techniques are becoming foundational methods for many areas of research – and generally produce diffraction data of a lower standard (or requiring a deeper level of analysis) than that commonly accepted in service crystallography. Arguably, it is these advanced techniques that will have the strongest need for robust raw data management, validation and sharing – although there will of course be commonalities with some of the tougher service crystallography examples such as disordered and incommensurate structures. In all of these cases, the drivers for sharing will be to generate further insight – or that future methods might be able to extract more information without the need to repeat the experiment.
Centralised neutron facilities have had an established approach to raw data preservation for some time and this has more recently led to a greater degree of availability as the ability to 'publish' has developed, i.e. repository capability of assigning DOIs and opening up to the internet – see, for example, ILL and ISIS data management and DOI policies. In the macromolecular community, raw data archiving is becoming just another part of the process of publication – largely in the face of fraud and out of a sense of being able to do a better job of modelling the structure oneself. Most small-molecule crystallographers would agree that routine well-behaved structures don’t need re-refining just to measure bond distances or angles, let alone re-integrating the entire dataset. However, the respondents to this survey do seem to agree that in cases of difficult refinements, disorder, twinning and modulation, having access to the raw diffraction images would benefit the community. Additionally, more than half of all respondents felt access to raw data was essential in examining pathological samples or for validating scientific claims and quality (Fig. 3).
Clearly, while the community does seem to agree that raw data archiving is a worthwhile practice, most lack the funds, capability, infrastructure and motivation to do so in a structured way (particularly one that would then readily enable wider sharing). Yet, as already proven by the macromolecular community, there is a real benefit to having raw data available. When we look back over the evolution of crystallographic data sharing, more crystallographers made more of their experimental data available once a standard format for data became available – the CIF (Crystallographic Information File). Though standard formats for sharing experimental data, e.g. FCF, existed around the same time as the CIF, it did not become routine for crystallographers to share their structure factors until relatively recently when this became an automated output of the refinement and embedded in the CIF. In the same vein, we should now consider how sharing of raw data can be made an automatic part of publishing crystallographic data. What steps can we, as a community, take to ensure this valuable practice becomes part of everyday operations? Who should bear the costs involved and the responsibility for maintaining such an archive?
In conclusion, we focus on two factors in this article – data management practice and the sharing of data to support scientific assertions or findings. While we use the term archiving to refer to data management practices in general, we understand that this includes the aspect of locking data away to keep it safe, as opposed to sharing it for the greater good. With a modest amount of culture change and relatively little extra money or effort, it should be possible for small-molecule crystallography facilities to manage their raw data so that it is easy to make it more widely available. However, without clear guidance on when to make data available and without centralised tools and infrastructure to support the process, widespread data sharing will continue to be elusive in small-molecule crystallography. Where is that guidance to come from, and who will create and maintain the necessary tools and infrastructure?
Should you wish to be part of the conversation, please join us in Prague on 22 August for a community-driven workshop addressing issues of raw data management and availability – see here for the schedule.