Crystallographic data

Workshop on Raw diffraction data reuse: the good, the bad and the challenging

Tuesday August 22 2023

Melbourne, Australia

A major effort of the IUCr Diffraction Data Deposition Working Group (2011 to 2017), and of the IUCr Committee on Data since 2017, has been to explore the practicalities, the costs and benefits, and the opportunities for new crystallographic science arising from the large-capacity data archives that have become available. We think it timely to hold a full-day workshop aimed at: (1) discussing current practices in raw data archival and sharing; (2) educating those who generate and deal in crystallographic data on best practices in data reuse in various categories of crystallographic science, with presentations by leading experts; and (3) offering a summing up, including the role of IUCrData’s new Raw Data Letters. We expect attendees to learn about the opportunities for raw data reuse, including the use of raw data as test data sets for machine learning, and to achieve an understanding of how to archive their own raw data effectively so as to maximise the potential for data sharing and reuse in the future. This workshop will explore in detail the successes and challenges of raw data sharing and reuse in practice. Being a full day, it will complement the proposed microsymposium on "Raw diffraction data reuse: warts and all" in the Congress itself; most importantly, the microsymposium will allow for the usual submission of up to four abstracts from anywhere in the world, whereas the workshop is principally made up of invited speakers. Furthermore, the microsymposium will highlight the importance of databases. The major highlight of both the workshop and the microsymposium is the proposed keynote on the "European Photon and Neutron Open Science Cloud" (by Andy Götz of ESRF), which represents the world-leading effort of a consortium of more than ten European synchrotron and X-ray laser radiation sources in raw data management and sharing.

   

Reminder: The CommDat user forum is at https://forums.iucr.org

 

Programme

Times are given in Australian Eastern Standard Time (AEST) (Melbourne) and Central European Summer Time (CEST) (Paris, Berlin) for the benefit of participants in Eastern Australia and Central Europe.


Other locations    Time zone    Start time                 End time
                   UTC          22:20 Monday 21 August     05:55 Tuesday 22 August
UK                 BST          23:20 Monday 21 August     06:55 Tuesday 22 August
USA East Coast     EDT          18:20 Monday 21 August     01:55 Tuesday 22 August
USA West Coast     PDT          15:20 Monday 21 August     22:55 Monday 21 August
Japan              JST          07:20 Tuesday 22 August    14:55 Tuesday 22 August
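Participants in other time zones can check any programme time with a short script. The sketch below is illustrative only: it assumes fixed UTC offsets that hold on the workshop date (AEST = UTC+10, CEST = UTC+2, and so on) rather than a full time zone database, and the helper name `convert` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed UTC offsets assumed valid on 22 August 2023 (no daylight-saving lookup is done).
OFFSETS = {"AEST": 10, "CEST": 2, "UTC": 0, "BST": 1, "EDT": -4, "PDT": -7, "JST": 9}

def convert(hhmm, src="AEST", dst="CEST", day=22):
    """Convert a wall-clock time on `day` August 2023 from zone `src` to zone `dst`."""
    hour, minute = map(int, hhmm.split(":"))
    src_tz = timezone(timedelta(hours=OFFSETS[src]))
    dst_tz = timezone(timedelta(hours=OFFSETS[dst]))
    local = datetime(2023, 8, day, hour, minute, tzinfo=src_tz)
    return local.astimezone(dst_tz)

# Workshop start (08:20 in Melbourne) expressed in two other zones.
print(convert("08:20").strftime("%H:%M on day %d"))             # 00:20 on day 22
print(convert("08:20", dst="PDT").strftime("%H:%M on day %d"))  # 15:20 on day 21
```

Passing a different `dst` key reproduces the other rows of the table above.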

Tuesday 22 August

Session 1: Facility and raw data archive providers Part I

08:20-08:30 (00:20-00:30) Chair Opening remarks
08:30-08:50 (00:30-00:50) Andreas Moll (Australia) Scientific computing and data management at the Australian Synchrotron


A. Moll
ANSTO – Australian Synchrotron, 800 Blackburn Road, 3168, Clayton, Victoria, Australia

The Australian Synchrotron is a division within ANSTO and one of Australia’s premier research facilities. It produces powerful beams of light that are used to conduct research in many important areas including health and medical, food, environment, biotechnology, nanotechnology, energy, mining, agriculture, advanced materials and cultural heritage.

After 15 years of uninterrupted operation with the original ten experimental end stations, called beamlines, the Australian Synchrotron is currently entering an exciting new phase with the addition of eight new beamlines, including a new high-throughput Crystallography beamline. This created an opportunity for the Scientific Computing team to redesign the whole software stack from the ground up.

This presentation will take you on a journey of Scientific Computing at the Australian Synchrotron. You will learn how we employ modern, industry standard tools and architectures in a research environment in order to handle the large data throughput of modern detectors and provide the robustness our users expect from us. A particular focus will be on our use of cloud technologies, running on-premises, across our whole stack from hardware control to data processing on GPUs.

08:50-09:10 (00:50-01:10) Anton Barty (Germany) Scientific computing and data flows at PETRA IV

09:10-09:30 (01:10-01:30) Bridget Murphy (Germany) The German research data initiative DAPHNE

09:30-09:50 (01:30-01:50) Andy Götz (France) The European Photon and Neutron Open Science Cloud (PaNOSC) and/or ESRF data sharing and re-use

09:50-10:10 (01:50-02:10) Coffee break

Session 2: Facility and raw data archive providers Part II

10:10-10:40 (02:10-02:40) Alun Ashton (Switzerland) Scientific computing and data sharing and reuse at PSI

10:40-11:00 (02:40-03:00) Genji Kurisu (Japan) X-tal Raw Data Archive (XRDa): A crystallographic raw diffraction image archive in Asia


G-J. Bekker[a] and G. Kurisu[a,b]
[a] Institute for Protein Research, Osaka University, Suita, Osaka 565-0871, Japan, [b] Protein Research Foundation, Minoh, Osaka 565-8686, Japan

The Protein Data Bank (PDB) is a public archive of atomic coordinates and crystallographic structure factors, including professionally curated metadata. The PDB archive is maintained by the worldwide PDB (wwPDB), a global organization founded in 2003 by RCSB PDB in the USA, PDBe in Europe, and PDBj in Japan, and later jointly managed with the Biological Magnetic Resonance Data Bank (BMRB) and the Electron Microscopy Data Bank (EMDB) as the wwPDB core members [1]. Quite recently, the wwPDB organization welcomed Protein Data Bank China (PDBc) as an Associate member of the wwPDB, and PDBc has started remote processing of some of the structures deposited to and allocated by PDBj. In addition to the PDB core archive, we, PDBj, collaborate with other wwPDB members to maintain the BMRB for experimental data from NMR experiments, and the EMDB for Coulomb potential maps from single-particle or sub-tomogram averaging in cryo-electron microscopy. PDBj is the only wwPDB member that engages in the processing of data for all three of these wwPDB core archives [2]. Although the structural data in the core archives (PDB, BMRB and EMDB) are actively collected and curated by the wwPDB, the raw image data that were the direct result of the primary experiments, and that were used to determine the structures by macromolecular crystallography or cryo-EM, are not collected by the wwPDB. For cryo-EM data, sub-members of the EMDB collect experimental raw micrographs or movies, and archive these in the Electron Microscopy Public Image Archive (EMPIAR). PDBj has been functioning as a local distributor of the EMPIAR archive since 2018, based on a bilateral agreement between EMBL-EBI and the Institute for Protein Research, Osaka University.
EMPIAR at PDBj (EMPIAR-PDBj) holds exactly the same entries as EMPIAR at EMBL-EBI, and we have helped local depositors to transfer their large images/movies, thereby providing our own services (including deposition) through our original website for EMPIAR-PDBj (empiar.pdbj.org/).

[XRDa home page]

Figure 1. The front page of XRDa as operated by PDBj (xrda.pdbj.org).

For macromolecular crystallography (MX) raw images, major archives currently exist in two regions: Diamond Light Source in the UK, and the SBDB (SBGrid Data Bank), CXIDB (Coherent X-ray Imaging Data Bank) and IRRMC (Integrated Resource for Reproducibility in Macromolecular Crystallography) in the USA. However, until now no such archive for depositions from Asia has existed, and none of the existing ones in the UK and the USA is a wwPDB member. In 2020, PDBj started its own diffraction archive, named “X-tal Raw Data Archive” (XRDa, xrda.pdbj.org), which securely stores the experimental diffraction images from Asian depositors. As a member of the wwPDB, we have streamlined deposition with the wwPDB’s OneDep system. For depositors from Asia, after depositing their structural data to PDBj via the wwPDB’s OneDep system, their entry will be automatically linked to their ORCID iD in XRDa.

Depositors to XRDa can log in using their ORCID iD, after which any PDB IDs that have been registered in OneDep by them or their co-authors will be available. In addition, depositors can submit their raw data before submitting their structures to the PDB (and link these afterwards), or submit raw data for structures that will not be submitted to the PDB, e.g. micro electron diffraction data of small molecules. Following login, users can easily deposit diffraction images via the “My entries” page. Once submitted, PDB-linked entries will enter a holding status and will be automatically co-released, while independent entries will be released immediately. Please feel free to deposit your diffraction images to PDBj.

XRDa is operated by PDBj and supported by the Platform Project for Supporting Drug Discovery and Life Science Research (BINDS) from AMED under Grant Number JP21am0101066.

[1] wwPDB consortium. (2019). Nucleic Acids Research, 47, D520-D528.
[2] Bekker, G. J., Yokochi, M., Suzuki, H., Ikegawa, Y., Iwata, T., Kudou, T., Yura, K., Fujiwara, T., Kawabata, T. and Kurisu, G. (2022). Protein Science 31, 173-186.

11:00-11:20 (03:00-03:20) Fabio Dall'Antonia (Germany) Handling of big data at the European XFEL


Fabio Dall’Antonia, Janusz Malka, Egor Sobolev, Philipp Schmidt, Krzysztof Wrona and Luca Gelisio
European X-ray Free-electron Laser Facility GmbH, Holzkoppel 4, 22869 Schenefeld, Germany

The European XFEL (EuXFEL) is a unique photon-source facility producing free-electron laser (FEL) pulses in the soft and hard X-ray regime, of extreme brightness and ultra-short duration. These are delivered at MHz repetition rate, enabling various experimental techniques and time-resolved setups. The seven scientific instruments mostly employ pixelized area detectors that can record up to 8,000 1-Mpx images per second.

These opportunities for research come at the cost of huge data volumes, which can reach a few PiB per beam-time, posing challenges for data storage and retention, as well as for data re-use purposes.
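A back-of-envelope estimate shows how quickly such volumes accumulate. The sketch below is not a EuXFEL figure: it assumes 2 bytes per pixel, takes "1 Mpx" as 1024 x 1024 pixels, and assumes uninterrupted recording at the quoted peak rate, all of which are simplifications introduced here for illustration.

```python
BYTES_PER_PIXEL = 2        # assumption: 16-bit raw pixel values
PIXELS_PER_IMAGE = 2**20   # assumption: "1 Mpx" taken as 1024 x 1024 pixels
IMAGES_PER_SECOND = 8_000  # peak image rate quoted in the abstract

def raw_rate_bytes_per_s(images_per_s=IMAGES_PER_SECOND,
                         pixels=PIXELS_PER_IMAGE,
                         bytes_per_pixel=BYTES_PER_PIXEL):
    """Uncompressed detector output in bytes per second."""
    return images_per_s * pixels * bytes_per_pixel

def hours_to_fill(volume_bytes, rate_bytes_per_s):
    """Hours of continuous recording needed to produce `volume_bytes`."""
    return volume_bytes / rate_bytes_per_s / 3600

rate = raw_rate_bytes_per_s()
print(f"raw rate: {rate / 2**30:.3f} GiB/s")                  # raw rate: 15.625 GiB/s
print(f"time per PiB: {hours_to_fill(2**50, rate):.1f} h")    # time per PiB: 18.6 h
```

Under these assumptions a single detector running at peak rate would produce a PiB in well under a day of recording, which is consistent with volumes of a few PiB per beam-time.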

EuXFEL data collected with imagers requires facility services for the correction of pixel intensities. First steps of technique-specific data reduction such as azimuthal integration or crystallographic indexing are done by users remotely on facility resources as well, since download to local computers is not feasible. Currently we are in the process of updating the scientific data policy so as to account for data reduction prior to the long-term storage on disks, as well as for FAIR principles [1] of data management.

We are also developing facility services to apply specific data reduction techniques. For example, in the case of serial femtosecond crystallography (SFX) [2], data can typically be reduced to only a few percent, sometimes even below 1%, of recorded frames, since many FEL shots miss the sample crystals delivered by a liquid jet, leading to images without Bragg diffraction. In this case we are working on facility services that implement Bragg peak detection for automatic filtering procedures, either before data acquisition or at early stages of the offline correction and processing pipeline.
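The idea of filtering frames by Bragg peak detection can be illustrated with a toy example. This is a minimal sketch and not the EuXFEL implementation: it counts pixels that exceed a fixed threshold and are strict local maxima of their 3 x 3 neighbourhood, and keeps a frame only if enough such peaks are found. The function names and the default values of `threshold` and `min_peaks` are hypothetical.

```python
def count_peaks(frame, threshold):
    """Count interior pixels above `threshold` that are strict 3x3 local maxima."""
    rows, cols = len(frame), len(frame[0])
    peaks = 0
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            value = frame[r][c]
            if value <= threshold:
                continue
            neighbours = [frame[r + dr][c + dc]
                          for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                          if (dr, dc) != (0, 0)]
            if all(value > n for n in neighbours):
                peaks += 1
    return peaks

def is_hit(frame, threshold=50, min_peaks=3):
    """A frame counts as a 'hit' if it contains at least `min_peaks` peaks."""
    return count_peaks(frame, threshold) >= min_peaks

def filter_hits(frames, threshold=50, min_peaks=3):
    """Keep only the frames classified as hits."""
    return [f for f in frames if is_hit(f, threshold, min_peaks)]

# Two synthetic 8x8 frames: one empty shot and one with three isolated peaks.
blank = [[0] * 8 for _ in range(8)]
hit = [[0] * 8 for _ in range(8)]
for r, c in [(2, 2), (4, 5), (6, 3)]:
    hit[r][c] = 100

kept = filter_hits([blank, hit])
print(len(kept))  # 1
```

Real peak finders additionally model the local background and detector masks; the point here is only that a cheap per-frame test can discard empty shots before they reach long-term storage.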

Concerning the re-use of open data, EuXFEL has cloud-based services in the testing stage, which are initially employed for educational purposes but shall become a means for remote data analysis of selected and filtered data sets of each proposal after the embargo period.

[1] Wilkinson, M. D., Dumontier, M., Aalbersberg, I. J. et al. (2016). Sci. Data 3, 160018.
[2] Wiedorn, M. O., Oberthür, D., Bean, R. et al. (2018). Nature Commun. 9, 4025.

11:20-11:40 (03:20-03:40) Wladek Minor (USA) A subject specific repository for MX (proteindiffraction.org)


W. Minor, M. Cymborowski and D. R. Cooper
Department of Molecular Physiology and Biological Physics, University of Virginia, Charlottesville, VA 22903, USA

Preservation and public accessibility of primary experimental data are cornerstones necessary for the reproducibility of empirical sciences. We present the Integrated Resource for Reproducibility in Macromolecular Crystallography (IRRMC). In its first six years, several hundred crystallographers have deposited thousands of datasets representing more than 5,800 indexed diffraction experiments. We will present several examples of the crucial role that original diffraction data played in improving previously determined protein structures.

11:40-12:00 (03:40-04:00) Alexandra Tolstikova (Germany) Processing data in serial crystallography on-the-fly: what kind of raw data do we want to store?


A. Tolstikova [a], T. A. White [a], T. Schoof [a], S. Yakubov [a], V. Mariani [b], A. Henkel [c], B. Klopprogge [c], A. Prester [c], S. De Graaf [c], M. Galchenkova [c], O. Yefanov [c], J. Meyer [a], G. Pompidor [a], J. Hannappel [a], D. Oberthuer [c], J. Hakanpää [a], M. Gasthuber [a] and A. Barty [a]
[a] Deutsches Elektronen-Synchrotron DESY, Hamburg, Germany, [b] Linac Coherent Light Source, SLAC National Accelerator Laboratory, Menlo Park, USA, [c] Center for Free-Electron Laser Science, Deutsches Elektronen-Synchrotron DESY, Hamburg, Germany

Serial crystallography experiments involve collecting large numbers of diffraction patterns from individual crystals, resulting in terabytes or even petabytes of data. Storing all of this data has already become unsustainable, and data rates continue to increase as facilities move to new detectors and faster acquisition rates. One potential solution is to process data on-the-fly without writing it to disk. Recently, we have implemented a system for real-time data processing during serial crystallography experiments at the P11 beamline at PETRA III. Our pipeline, which uses the CrystFEL software [1] and the ASAP::O data framework, can process frames from a 16-megapixel Dectris EIGER2 X detector at its maximum full-frame readout speed of 133 frames per second. The pipeline produces unmerged Bragg reflection intensities that can be directly scaled and merged for structure determination.

Processing serial crystallography data on-the-fly offers numerous advantages. It allows for real-time data quality control during the experiment, decreases the time spent between data collection and obtaining a final structure and can significantly reduce the amount of data that needs to be stored and managed. However, there are potential disadvantages and risks to not storing all raw data, such as losing the ability to revisit the data for reanalysis or to reproduce the results. Therefore, careful consideration is needed when deciding which data to store and which to discard. In this talk, we will discuss the challenges and opportunities of real-time data processing in serial crystallography and explore possible strategies for deciding which data to store.

[1] White T. A., Mariani V., Brehm W., Yefanov O., Barty A., Beyerlein K. R., Chervinskii F., Galli L., Gati C., Nakane T., Tolstikova A., Yamashita K., Yoon C. H., Diederichs K. and Chapman H. N. (2016). J. Appl. Crystallogr. 49, 680-689.

12:00-12:40 (04:00-04:40) Lunch Break

Session 3: Raw data reusers

3.1: Macromolecular crystallography

12:40-13:00 (04:40-05:00) Gerard Bricogne (UK) The raw, the cooked and the medium-rare: unmerged diffraction data as a rich source of opportunities for data re-use and improvements in both methods and results


Gerard Bricogne, C. Flensburg, R. H. Fogh, P. A. Keller, I. J. Tickle and C. Vonrhein
Global Phasing Limited, Sheraton House, Castle Park, Cambridge CB3 0AX, UK

Deposition into the PDB of experimental diffraction data, in the form of merged intensities or of ‘structure factor[ amplitude]s’, to accompany atomic models determined and/or refined from them, was made mandatory in 2008. This brought benefits that went well beyond the intended purpose of making the deposited models verifiable and correctable, in the form of an unanticipated ‘virtuous circle’ whereby deposited data fuelled improvements in refinement software that in turn enabled improvements to be made in the initially deposited models, from the same data. The need to manage the outcome of this continuous improvement process led to the introduction of a versioning mechanism into the archiving of atomic models by the PDB in 2017.

The creators of the Electron Density Server had already noted in 2004 that ‘perhaps we should consider deposition of unmerged intensities or even raw diffraction images in the future’ [1], without however anticipating the potential for a similar auto-catalytic cycle of simultaneous improvements in data reduction methods and in final structural results that could follow. This potential was later articulated in e.g. [2] in the following terms: ‘those [merged] deposited X-ray data are only the best summary of sets of diffraction images according to the data-reduction programs and practices available at the time they were processed. Just like refinement software, those programs and practices are subject to continuing developments and improvements, especially in view of the current interest and efforts towards better understanding radiation damage during data collection and in taking it into account in the subsequent processing steps.’

Strong general support for the idea of archiving raw diffraction images, together with the recognition that this task was beyond the remit and resources of the PDB, has led over the past decade to the emergence of a delocalised infrastructure (whereby raw data storage and curation takes place at synchrotrons and other dedicated repositories while the PDB provides a capability to annotate entries with a DOI that points to the raw data storage location) that is a major topic in this Workshop.

Our own interest has been to document the scientific case for depositing and archiving suitably annotated unmerged diffraction data into the PDB, a goal achievable with modest storage requirements while already creating a standardised resource capable of feeding improvements in scaling and merging methods resulting in better refined models than those originally deposited. This goal is the focus of the current activities of the Subgroup on Data Collection and Processing of the PDBx/mmCIF Working Group of the wwPDB, in which we participate, to expand the mmCIF dictionary to support such extended deposition and archiving.

Crucially, unmerged data collected by the rotation method can preserve instrumental metadata about the image number and the detector position at which each diffraction spot was located and integrated, providing a broader decision-making scope over the way it is incorporated into the scaling and merging process. This opens a wide range of possibilities for improving any initially performed scaling/merging steps and for extraction of further data. We will present examples touching upon the following areas:

  1. production of full validated data quality metrics that are often incomplete or inconsistent in deposited merged data;
  2. detection of problematic images and image ranges, and remediation by their selective exclusion from scaling/merging;
  3. anisotropic diffraction limit analysis (or re-analysis) with STARANISO, if not already performed;
  4. extraction of previously unexploited anomalous signal and computation of anomalous difference Fourier maps;
  5. ‘reflection auditing’ by tracing outliers detected at the refinement stage back to their unmerged contributors in terms of specific image numbers and detector positions, thus diagnosing ice rings, poor beamstop masks, angular overlaps, etc.;
  6. detection of radiation damage via F_early - F_late maps; adapting parametrisation to patterns of structural radiation damage.
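To make the value of per-observation image numbers concrete, the sketch below shows how a problematic image range can be excluded before merging (item 2 above). It is a generic illustration and not the method of any particular program: it assumes the observations are already scaled and indexed in the asymmetric unit, and it uses a plain inverse-variance weighted mean. The function name `merge` and the tuple layout are hypothetical.

```python
from collections import defaultdict

def merge(observations, excluded_images=()):
    """Inverse-variance weighted merge of unmerged observations.

    Each observation is a tuple (hkl, image_number, intensity, sigma).
    Observations recorded on any image in `excluded_images` are skipped.
    Returns {hkl: (merged_intensity, merged_sigma, n_used)}.
    """
    excluded = set(excluded_images)
    groups = defaultdict(list)
    for hkl, image, intensity, sigma in observations:
        if image not in excluded:
            groups[hkl].append((intensity, sigma))

    merged = {}
    for hkl, obs in groups.items():
        weights = [1.0 / (sigma * sigma) for _, sigma in obs]
        total = sum(weights)
        mean = sum(w * intensity for w, (intensity, _) in zip(weights, obs)) / total
        merged[hkl] = (mean, total ** -0.5, len(obs))
    return merged

# Three observations of one reflection; the one on image 50 is clearly discrepant.
obs = [
    ((1, 0, 0), 1, 100.0, 2.0),
    ((1, 0, 0), 2, 102.0, 2.0),
    ((1, 0, 0), 50, 40.0, 2.0),
]
print(merge(obs)[(1, 0, 0)][2])                        # 3 observations used
print(merge(obs, excluded_images=[50])[(1, 0, 0)][0])  # 101.0
```

With merged data only, the discrepant contribution from image 50 would be irrecoverably averaged into the result; with unmerged data it can be identified and removed after deposition.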

We are grateful to the PDBx/mmCIF Subgroup on Data Collection and Processing, especially Aaron Brewster, Ezra Peisach, Stephen Burley and David Waterman, for a stimulating collaboration that provided a context for presenting these investigations.

[1] Kleywegt, G. J., Harris, M. R., Zou, J. Y., Taylor, T. C., Wählby, A. & Jones, T. A. (2004). Acta Cryst. D60, 2240-2249.
[2] Joosten, R. P., Womack, T., Vriend, G. & Bricogne, G. (2009). Acta Cryst. D65, 176-185.

13:00-13:20 (05:00-05:20) David Aragao (UK) Experiences with data reuse in MX at Diamond

[G. Winter]

Graeme Winter
13:20-13:40 (05:20-05:40) Eugene Krissinel (UK) Raw data reuse: what it means for CCP4


Eugene Krissinel
CCP4, Rutherford Appleton Laboratory, UKRI STFC, Harwell Campus, Didcot, Oxfordshire, OX11 0QX UK

Collaborative Computational Project Number 4 in Macromolecular Crystallography (CCP4 UK) has a mission to distribute, develop and facilitate development of crystallographic software for all stages of the structure determination pipeline, from raw image processing to phasing, refinement, completion, validation and deposition. Over the 44-year history of the Project, crystallographic software underwent a series of evolutionary changes, caused by advances in theory, sample preparation techniques, quality and properties of raw data.

MX is often regarded as a technique with limited reproducibility, which suggests that data retention is highly important in the field. For many years, the Protein Data Bank (PDB) collected only the end results of the interpretation of raw data, the atomic coordinates, leaving no scope for revisiting structure determination in future. The situation improved in 1999 and further in 2020 when, respectively, merged and unmerged data became available for deposition. Deposition of unprocessed, raw data is a natural next step in this direction, which is rather demanding on the storage side and is actively discussed in value-for-cost terms. We would like to bring the software effect into consideration.

Notably, there is no single established format for raw data in MX, and in addition, raw data should be processed with instrument/detector specifics (see Reference [1] as an example). Data processing programs such as XDS [2], HKL [3], Mosflm [4], d*TREK [5] and DIALS [6] include extendable sets of routines or plugins for dealing with the variety of formats. Commitment to raw data retention and reusability effectively means commitment to maintaining data processing software, format plugins and associated beamline metadata forever. This is a significant challenge, as software ages faster than data and usually gets retired precisely for maintainability reasons. For example, Mosflm and d*TREK are effectively in sunset mode, and the newest development in the field, DIALS, is not supposed to work with all older formats. Most probably, even if raw data had been kept for all PDB entries from day zero, we would not be able to use the oldest datasets today. A possible solution to this problem may be to introduce a “storage” format, but in any case, reuse and storage of raw data cannot be detached from running the corresponding software project.

STFC and the Diamond synchrotron set an example in maintaining raw data. In 40 days after collection, data are pushed from beamlines to long-term storage, from where they can be downloaded years later. CCP4 is in a good position to facilitate reuse of such data by setting up links between data facilities at Diamond, STFC/SCD and CCP4 Cloud [7]. This matches well with the introduction of the CCP4 Cloud Archive facility in January 2023, where completed structure determination projects can be deposited, so that not only the project data and metadata but also the way the structure was solved can be retained; archived projects can be revisited and revised in future.

Linking this facility with raw data storage and the PDB entry would provide a fully accountable data line for MX. This has obvious benefits for researchers and lays a foundation for the efficient reuse of collected data, also helping to maximise the longevity and robustness of data-handling software. Work has begun in this direction, and much will depend on community take-up and feedback.

CCP4 is funded by BBSRC UK (Grant BB/S006974/1) and industrial licensing.

[1] http://www.globalphasing.com/autoproc/wiki/index.cgi?BeamlineSettings
[2] Kabsch, W. (2010). Acta Cryst. D66, 125-132.
[3] Minor, W., Cymborowski, M., Otwinowski, Z. and Chruszcz, M. (2006). Acta Cryst. D62, 859-866.
[4] Battye, T.G.G., Kontogiannis, L., Johnson, O., Powell, H.R. and Leslie, A.G.W. (2011). Acta Cryst. D67, 271-281.
[5] Pflugrath, J.W. (1999). Acta Cryst. D55, 1718-1725.
[6] Winter, G., Waterman, D.G., Parkhurst, J.M. et al. (2018). Acta Cryst. D74, 85-97.
[7] Krissinel, E., Lebedev, A., Uski, V., Ballard, C. et al. (2022). Acta Cryst. D78, 1079-1089.

13:40-14:00 (05:40-06:00) Melanie Vollmar (UK) Reusing raw data for machine learning in MX


M. Vollmar [a] and G. Evans [b,c]
[a] EMBL-EBI, Hinxton, United Kingdom, [b] Rosalind Franklin Institute, Harwell, United Kingdom, [c] Diamond Light Source Ltd, Harwell, United Kingdom

Large quantities of raw diffraction data from protein crystals are collected at synchrotron facilities and in-house X-ray sources every day. The vast majority of this data never yields a protein structure and never leaves the local data storage. Over the last five years there has been a steady increase in interest in the development of machine learning and artificial intelligence models in structural biology. To train any predictive model for high-quality predictions, large quantities of data are required, preferably standardised, curated and labelled.

However, the closed state of data storage, i.e. the data is only found on local storage, makes accessing raw diffraction data challenging for anyone who wants to use such data for developing machine learning and artificial intelligence models. Additionally, if raw diffraction data has been made publicly available it usually only represents a certain type of data, namely the one that resulted in successful structure solution, while any diffraction data that did not yield an atomic model remains hidden. Contacting data holders to gain access also often brings the challenge of tracing and finding the raw data on local storage depending on how well data and file management are handled within a facility or research group.

Here, we provide a retrospective analysis and share our experiences when developing a machine learning model using raw diffraction data [1]. We describe the challenges in finding suitable data, difficulties accessing that data, efforts needed to trace data locally and, finally, how a well-defined set of raw diffraction data was used to train a machine learning model.

We thank Arnaud Baslé, Dominic Jaques, Garib Murshudov, James Parkhurst and David G. Waterman for their contributions and lively discussions during the development of the machine learning model described in [1].

[1] Vollmar, M., Parkhurst, J., Jaques, D., Baslé, A., Murshudov, G., Waterman, D. and Evans, G. (2020). IUCrJ, 7, 342-354.

14:00-14:20 (06:00-06:20) Tea break

3.2: Chemical crystallography

14:20-14:40 (06:20-06:40) Jim Britten (Canada) The special cases of chemical crystallography for raw data reuse: there are various categories

14:40-15:00 (06:40-07:00) Simon Coles (UK) The increasing diversity of small molecule data: can one size fit all?


Simon J. Coles
School of Chemistry, Faculty of Engineering and Physical Sciences, University of Southampton, Highfield, Southampton SO17 1BJ, UK

Today's accepted approaches to handling chemical crystallography data have largely been established in the ‘boom period’ of crystal structure analysis, that is the late 1990s and early 2000s as CCD area detectors took hold and data volumes increased significantly. Chemical crystallography is facing another change, with a range of alternative structure determination methods becoming viable and dynamic crystallography seeing more widespread use. Some examples of initiatives from our laboratory illustrate the nature of this imminent expansion in our field.

3D-Electron Diffraction (3D-ED) is set to significantly impact on small-molecule crystal structure analysis with the introduction of new dedicated instrumentation that will dramatically increase the volume of results generated and lift the technique from research in itself, to being generally applicable and providing a widespread service. 3D-ED is generating some truly amazing results, producing structures from nano crystallites traditionally considered as powders, on materials that normally would never have been applicable to single-crystal analysis. However, the nature of the experiment is challenging and invariably datasets from several crystallites must be merged to maximise the completeness of data, with the result that structures do not meet the same quality standards as we have come to expect from X-ray single-crystal analysis. Other similarly emergent structure determination techniques applied to particular problems, such as NMR Crystallography and XFEL studies, present wonderful opportunities but come with the same data quality problem.

The Crystal Sponge technique enables the uncrystallisable to be crystallised. Compounds that do not crystallise well, or at all, or that can only be synthesised in minute amounts can be soaked into a crystalline porous material and the composite host+guest structure determined. This provides a molecular structure that can have great value for synthetic chemists for characterisation or confirmation of product. However, the experimental technique is variable in that molecules can arrange in the sponge in different ways significantly affected by soaking conditions and this leads to diffuse diffraction, disorder and lower quality results. Similarly dynamic crystallography, that is structures under change mediated by e.g. temperature, pressure, gas adsorption, electric current, also suffers from these effects.

These exciting advances are set against the backdrop of traditional X-ray crystal structure analysis, with >100 years of enhancing instrumentation, >50 years of collecting results into databases, 40+ years of trusted common refinement processes, 30 years of standards and 20+ years of validation tools. So, the established processes, metrics, etc. for small-molecule crystallography provide a well-established and trusted ‘quality framework’ for our results. This means that the small-molecule crystallography community now caters very well for the validation and quality control of relatively routine structures as part of the checking and publication process. However, this quality framework doesn’t cater well for these exciting new frontiers of chemical crystallography in the sense that results are deemed to be of a lower quality. But the results of these experiments drive and underpin investigations in ways that could never have happened before and with a comparable accuracy to the gold standard of single-crystal X-ray analysis.

Being able to answer questions such as ‘what is the compound I have made?’, ‘what is this reaction by-product?’, ‘how has my structure changed?’, ‘how does this material manifest these properties?’, particularly for materials that are not ideally crystalline, can be crucial to further the progress of research.

Clearly these are strong examples that extend the community discussions around making raw chemical crystallography data available [1]. But how can we balance this current contrast between well established and emergent techniques? Firstly, it is necessary to consider extending the current quality framework and secondly it is imperative to make raw data available alongside the results from these new techniques. This talk will present the concept of ‘structure grading’ as an indicator of what claims can be made based on a particular result. These claims, and therefore the structure grading, should be backed up by the raw data – particularly in the case of emergent techniques where it is highly likely that methods will improve and so a better result can be derived in the future from the original data. The talk will therefore also consider how it can be shown that the best possible result has been obtained from the raw data, or indications can be provided that declare that there is room for future improvement.

[1] When should small molecule crystallographers publish raw diffraction data (2021). Twenty-Fifth Congress and General Assembly of the International Union of Crystallography, https://www.iucr.org/resources/data/commdat/prague-workshop-cx.


3.3: Powder diffraction

15:00-15:20 (07:00-07:20) Elena Boldyreva (Russia) Powder diffraction raw data

[E. Boldyreva]

Elena Boldyreva
affiliation

abstract

15:20-15:40 (07:20-07:40) Miguel A. G. Aranda (Spain) Powder diffraction data sharing and reuse: advantages and possible practical obstacles

[M. Aranda]

Miguel A. G. Aranda
Universidad de Málaga, 29071-Málaga, Spain

Scientific data in our crystallographic community can be classified, in broad terms, into three large categories: raw, reduced and derived data. On the one hand, the IUCr has for decades been very active in promoting the sharing of reduced and derived data in independently verified databases. The final results, in narrative style, are also shared in the scientific journals. On the other hand, the need for raw data sharing is clearly increasing, and such sharing is nowadays technically feasible and likely cost-effective.

Within the crystallography field, the powder diffraction (PD) community is a subgroup pursuing several goals, mainly (1) average crystal structure determination; (2) quantitative phase analyses; (3) microstructural analyses; and (4) local structure determination and quantitative analyses of nanocrystalline materials. It should be noted that many PD users are not directly associated with crystallography but with materials science, solid-state chemistry and physics, etc., and that some practices differ between fields. For PD, derived data for objectives (2) and (3), and to a large extent (4), cannot be incorporated in current ‘standard’ (independently verified) databases. Therefore, in my opinion, the need for sharing raw PD data is even more compelling than that of sharing raw single-crystal diffraction data.

To ensure that raw powder diffraction data sharing is useful, the methodology has to be robust. From the computing point of view, the shared data must be findable, accessible, interoperable and reusable, i.e. comply with the FAIR principles. This is necessary but not sufficient: from the point of view of the scientific community involved, the shared data must also be of sufficient quality, and their quantitative reuse should be relatively easy.

Some possible benefits of sharing powder diffraction raw data were discussed in a previous publication [J. Appl. Cryst. (2018), 51, 1739-1744]. In this communication, I will elaborate further on the benefits, but mainly on some practical obstacles that need to be addressed. For powder diffraction data from point detectors, sharing seems to be straightforward. However, for powder diffraction data taken with 2D detectors, this is not the case: both the correction and the integration steps involve choices that need to be unified. This is a challenging task that needs to be undertaken.

15:40-15:55 (07:40-07:55) Loes Kroon-Batenburg (Netherlands) / Selina Storm (Germany) Summing up: the role of IUCrData’s new Raw Data Letters in serving all the above
15:55 (07:55) Close

The Congress Opening Ceremony is at 6 pm