Using the Cambridge Structural Database in research and teaching: a personal view

William Clegg
A favourite structure. W. Clegg, W. R. Gill, J. A. H. MacBride and K. Wade (1993). Angew. Chem. Intl Ed. 32, 1328, doi: 10.1002/anie.199313281; CSD REFCODE PETZOD.

In 2019 the Cambridge Structural Database (CSD; Groom et al., 2016) reached the significant milestone of a million entries (Fig. 1; Taylor & Wood, 2019): crystal structures of organic and metal–organic compounds reported in formal publications, patents, some conference proceedings and personal depositions of otherwise unpublished results as CSD Communications. These 1,000,000+ structures, along with those in other structural databases for inorganic, metallic and biological materials, provide a convenient and comprehensive resource for scientists involved in crystallographic and related research, training, education and scholarship (Bruno et al., 2017).

 
Figure 1. Growth of the CSD since 1972. The darker blue bars show the number of structures added annually. Date: 6 June 2019.

The ‘CSD one million’ achievement was celebrated at various events, including major national and international conferences. An evening session at the 32nd European Crystallographic Meeting (ECM32) in Vienna, Austria, in August 2019, organised and hosted by the Cambridge Crystallographic Data Centre (CCDC, the curators and developers of the CSD), included presentations from some users representing a range of applications of the CSD. I was invited, as an author of over 3000 CSD entries, to speak about the impact of the CSD on my life in crystallography. In view of the interest shown in this presentation, a version of it is provided here for a wider audience. The topics covered will relate to the research and teaching activities of many readers.

My career in crystallography, which spans 50 years, reaches back to the relatively new and modest-scale operation of the CCDC. This was started by Olga Kennard in 1965, and occupied rooms adjacent to my own PhD research laboratory in the Lensfield Road chemistry building in Cambridge University, UK, in the early 1970s (Groom & Allen, 2014). The impact of the CSD over this period can conveniently be summarised under three headings, each illustrated by a photograph taken in my shared Newcastle University, UK, office, where I have a part-time continuing engagement following formal retirement in 2009.

1.  Research and publications

I first began to make significant use of the CSD in connection with crystallographic research in the 1980s, in a version maintained at Daresbury Laboratory as part of a broader provision of chemical database resources, the UK Chemical Database Service (CDS; Fletcher et al., 1996). This version, supplied with a convenient independently developed software search system, was known as Crystal Structure Search and Retrieval (CSSR) and provided faster interactive search protocols than the initial Quest software available from CCDC. I later chaired the CDS management panel for a few years before taking up a part-time secondment to a Joint Appointment at Daresbury, at which point continuing as chair might have been seen as a conflict of interest. We subsequently adopted CCDC’s much-improved ConQuest software (Bruno et al., 2002) as part of a locally installed CSD annual major release and periodic updates, an approach that continues now alongside the internet WebCSD access (Thomas et al., 2010).

The CSD is an indispensable tool at all stages of a research project having any significant crystallographic component, whether personal or collaborative, and whether this is in specific chemical topics or a more general structure determination service operation. It has enormous value before, during and after results are submitted and published.

In the formulation and development of a new project, one key question is the extent to which the research area has already been explored and reported by others. Coverage of the extant literature is greatly assisted and simplified by appropriately constructed CSD substructure and text searches to reveal relevant crystal structure results. Access to an exhaustive compendium of published structures helps avoid unnecessary duplication of effort, particularly in cases where a supposedly new compound proves to be a starting material or unexpected by-product with an already established crystal structure. Some diffractometer control systems link to databases of known unit cells derived from the CSD.
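The unit-cell check mentioned above can be illustrated with a minimal sketch. It is not the method used by any particular diffractometer system or by the CSD software. It assumes that both the measured and the known cells have already been reduced to the same conventional setting, and the labels, cell values and tolerances are invented placeholders, not real CSD entries.

```python
from typing import Dict, List, NamedTuple


class Cell(NamedTuple):
    """Unit cell: lengths in angstroms, angles in degrees."""
    a: float
    b: float
    c: float
    alpha: float
    beta: float
    gamma: float


def cells_match(measured: Cell, known: Cell,
                length_tol: float = 0.02, angle_tol: float = 1.0) -> bool:
    """Return True if every parameter agrees within tolerance.

    length_tol is a relative tolerance on a, b, c (an assumed 2%);
    angle_tol is an absolute tolerance in degrees (an assumed 1 degree).
    """
    lengths_ok = all(abs(m - k) <= length_tol * k
                     for m, k in zip(measured[:3], known[:3]))
    angles_ok = all(abs(m - k) <= angle_tol
                    for m, k in zip(measured[3:], known[3:]))
    return lengths_ok and angles_ok


def find_matches(measured: Cell, known_cells: Dict[str, Cell]) -> List[str]:
    """List the labels of all known cells that match the measured cell."""
    return [label for label, cell in known_cells.items()
            if cells_match(measured, cell)]


# Hypothetical lookup table; the labels are placeholders, not CSD REFCODEs.
KNOWN_CELLS = {
    "EXAMPLE01": Cell(10.00, 12.05, 15.00, 90.0, 95.0, 90.0),
    "EXAMPLE02": Cell(7.50, 9.10, 20.30, 90.0, 90.0, 90.0),
}

# A freshly measured cell (also invented) checked before full data collection.
measured_cell = Cell(10.05, 12.00, 15.10, 90.0, 95.2, 90.0)
print(find_matches(measured_cell, KNOWN_CELLS))
```

A match found at this stage would suggest that the crystal may be a known starting material or by-product, which is worth checking before committing to a full data collection. A real check would also need to allow for different cell settings, which this sketch ignores.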

For structure solution or refinement that proves problematic, perhaps because of disorder or challenging data quality, CSD entries can provide useful and reliable structural geometries for techniques such as fragment searches, rigid group refinement and appropriate constraints and restraints. During and after refinement, the CSD also serves as a library of results for validation, especially if unusual geometrical features are found; the CSD program Mogul finds particular application here (Bruno et al., 2004; Cottrell et al., 2012). Following the completion of a structure determination, at the stage of writing up results for publication, the CSD is a comprehensive compilation of relevant known structures that can be used for comparison and discussion, highlighting similarities, trends and possibly genuine novelties. Much time and effort can be saved in research literature surveys, particularly for crystallographers working across many diverse fields of chemistry. This gives confidence that relevant published structures have not been overlooked, and contributes to the high volume of publications that crystallographers can often achieve (Fig. 2).

Figure 2. A personal collection of reprints from almost 50 years of publications.

After the publication of any results that include new organic or metal–organic crystal structures, these are added to the CSD in due course. Usually, this occurs relatively quickly, the structures having already been deposited with the CCDC as part of the manuscript submission process in fulfilment of journal policy requirements. A deposition number supplied by CCDC appears in the paper and is part of the new CSD entry, which is subject to an embargo by CCDC until the journal formally publishes it. In this way, the CSD serves as a permanent curated public record of the published structures.

2.  Teaching and training

The CSD has contributed significantly to undergraduate chemistry teaching and postgraduate/postdoctoral research training programmes and events in which I have been involved over many years, including associated published material (Fig. 3).

 
Figure 3. Crystallography textbooks written/edited by the author.

In research training, the CSD serves the same purposes as outlined above in general research terms. It provides a catalogue of known structures and their geometry for information, comparison and validation. It is a valuable resource, not only for those who are mainly engaged in crystallographic research, but also for those in synthetic chemistry research groups who make use of crystallography as one of their characterisation techniques, either by direct hands-on involvement or through a provided structure determination service. It aids their interpretation of their own results and the assessment of their significance in the context of previously reported research.

At the chemistry undergraduate level, in Newcastle, we have used CSD searches and the CSD molecular graphics program Mercury (Macrae et al., 2020) in lectures, demonstrations and practical classes. We use them as tools for teaching various aspects of molecular geometry, including topics such as torsion angles, ring conformation, isomers, chirality and coordination geometry. It is important to make this a hands-on experience and not just a demonstration ‘from the front’, and to develop small-scale investigative projects rather than just a prepared script to follow so that students explore structural features for themselves. As the contents of the CSD are crystal structures and not just molecular structures, topics covered also include intermolecular aspects such as hydrogen bonding and other interactions, together with polymorphism and solvates. Lectures also briefly cover the topic of structural databases, including the CSD. One of my colleagues at Newcastle, Peter Hoare, has also developed a subset of CSD structures for use in secondary school teaching of chemistry (Hoare & Henderson, 2014; Hoare, 2016).
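For students who want to see what lies behind a torsion angle reported by a graphics program, a short calculation can help. The following is a minimal sketch, not taken from Mercury or any other CSD software. It assumes the four atomic positions are given in Cartesian coordinates, so fractional coordinates from a CIF would first need converting with the unit-cell parameters. The function name and the test coordinates are illustrative only.

```python
import math


def _sub(p, q):
    """Vector from q to p."""
    return (p[0] - q[0], p[1] - q[1], p[2] - q[2])


def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def torsion_angle(p1, p2, p3, p4):
    """Signed torsion angle p1-p2-p3-p4 in degrees, between -180 and +180.

    Positions are Cartesian (x, y, z) tuples. The angle is positive when,
    looking along the central bond from p2 to p3, the far bond is rotated
    clockwise relative to the near bond.
    """
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    b3 = _sub(p4, p3)
    n1 = _cross(b1, b2)   # normal to the plane of p1, p2, p3
    n2 = _cross(b2, b3)   # normal to the plane of p2, p3, p4
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))


# Idealised test geometries with the central bond along z.
cis = torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
quarter = torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
print(cis, trans, quarter)
```

Comparing values calculated in this way with those shown by a graphics program for the same four atoms is one possible small investigative exercise.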

The Oxford Chemistry Primer text X-Ray Crystallography (second edition), shown in Fig. 3 (Clegg, 2015), includes several detailed case studies and numerous other illustrative examples of structures taken from the CSD. In each case, the CSD REFCODE is given, and there are online resources, including the measured diffraction data and the complete crystal structure in CIF format, so that students can follow the crystallographic processes and analyses themselves if they have access to appropriate software.

3.  Unpublished results

It would be inaccurate to describe the CSD as a collection of all known organic and metal–organic crystal structures; it encompasses only those that are known in the public domain, and this is a major distinction! Estimates vary widely for the number of unpublished structures, but there is general agreement that they would inflate the CSD to more than double its current size – how much more than double is debated.

There are many reasons, good and bad, why solved and refined crystal structures do not get published; perhaps this is a topic for another IUCr Newsletter article. Much of the research has been carried out using public funding, and there are increasing pressures for open access to public-funded research results and data. There is growing awareness, at least in some parts of the world, that the failure to publish such results represents a problem having ethical dimensions that may be regarded as anything from a pity to a scandal.

The availability of such unpublished results would add hugely to the value of what is already in the public domain, giving better coverage and more reliable data for analyses of structural trends and patterns in what is sometimes known as ‘data-mining’ research. Even so-called ‘negative results’ are worth making known, for completeness, if only to help others avoid wasting time and resources in unnecessarily repeating essentially the same work; this includes results such as unexpected structures, by-products and unreacted starting materials.

My own experience here is probably typical of many other prolific crystallographers working in a wide range of chemical areas. Fig. 4 shows just a selection of unpublished structures (each in its own filing cabinet entry) from several decades of collaborative research projects. It is an under-representation because more recent results have been stored only electronically with no paper record for retention in this way.

Figure 4. Filing cabinets of unpublished crystal structures.

Some of these will yet be published (a few are already submitted), but most, realistically, will not. The practical approach, now being actively pursued, and far preferable to consigning everything to eternal non-publication as a total waste of time and resources, is to refresh each structure with a further refinement using current software to generate a CIF output in accordance with modern standards, resolve any issues that now become apparent and deposit it as a CSD Communication (Rogers, 2019). This approach generates a CSD entry without a journal publication, with an associated DOI and on a par, in CSD terms, with published entries. Although the work is not subject to formal peer review for publication, it is validated through standard checkCIF procedures (International Union of Crystallography, 2020; Spek, 2020), with a facility for authors to respond to significant alerts. These responses are retained as part of the complete entry in the CSD. CSD Communications are also manually curated by a team of CCDC editors in the same way as structures published in journals. A comparative survey has shown that, in general, structures contributed to the CSD from these two sources are of similar overall quality and reliability (Tovee, 2018).

The CSD Communication approach is strongly supported and encouraged by CCDC as a facility to help crystallographers make their unpublished results available to others. Its use is growing, so CSD Communications are likely to represent an increasing proportion of the CSD contents in future years. At the time of writing, CSD Communications comprise close to 3% of the total database entries. Adding to this total is now one of the main activities of my part-time research.

Conclusion

The CSD is well established as a critical resource in modern chemical crystallography. Its uses and applications are broader than initially envisaged. These include a comprehensive compilation of crystal structures in the public domain, data for structural comparisons and validation, a library of geometrical fragments for a range of crystallographic and theoretical calculations, a rich vein of treasures for data mining and attractive material and software for training and teaching purposes. Recent developments have made it more easily and widely accessible, simple to use and attractive for the public deposition of structures having no formal publication.

This is not the place for a discussion on the relative merits of, and access to, the CSD and the Crystallography Open Database, COD (Gražulis et al., 2012). The CSD has approximately twice as many entries as the COD, which is not restricted to organic and metal–organic structures. In this article, I have exclusively addressed the use of the CSD as it currently exists, comprising the compiled database itself, its facilities for deposition, validation, curation, maintenance and access, and its powerful associated software. The question of free open access can be addressed elsewhere; perhaps someone will do so in response to this article.

References

Bruno, I. J., Cole, J. C., Edgington, P. R., Kessler, M., Macrae, C. F., McCabe, P., Pearson, J. & Taylor, R. (2002). Acta Cryst. B58, 389–397.

Bruno, I. J., Cole, J. C., Kessler, M., Luo, J., Motherwell, W. D. S., Purkis, L. H., Smith, B. R., Taylor, R., Cooper, R. I., Harris, S. E. & Orpen, A. G. (2004). J. Chem. Inf. Comput. Sci. 44, 2133–2144.

Bruno, I. J., Gražulis, S., Helliwell, J. R., Kabekkodu, S. N., McMahon, B. & Westbrook, J. (2017). Data Science J. 16, 1–17.

Clegg, W. (2015). X-Ray Crystallography, 2nd edition. Oxford: Oxford University Press.

Cottrell, S. J., Olsson, T. S. G., Taylor, R., Cole, J. C. & Liebeschuetz, J. W. (2012). J. Chem. Inf. Model. 52, 956–962.

Fletcher, D. A., McMeeking, R. F. & Parkin, D. (1996). J. Chem. Inf. Comput. Sci. 36, 746–749.

Gražulis, S., Daškevič, A., Merkys, A., Chateigner, D., Lutterotti, L., Quirós, M., Serebryanaya, N. R., Moeck, P., Downs, R. T. & Le Bail, A. (2012). Nucleic Acids Res. 40, D420–D427.

Groom, C. R. & Allen, F. H. (2014). Angew. Chem. Intl Ed. 53, 662–671.

Groom, C. R., Bruno, I. J., Lightfoot, M. P. & Ward, S. C. (2016). Acta Cryst. B72, 171–179.

Hoare, P. & Henderson, S. (2014). Educ. Chem. 51, 14–17.

Hoare, P. (2016). Abstracts of the 66th Annual Meeting of the American Crystallographic Association, Denver, CO. Abstract 01.10.07.

International Union of Crystallography (2020). https://checkcif.iucr.org/

Macrae, C. F., Sovago, I., Cottrell, S. J., Galek, P. T. A., McCabe, P., Pidcock, E., Platings, M., Shields, G. P., Stevens, J. S., Towler, M. & Wood, P. A. (2020). J. Appl. Cryst. 53, 226–235.

Rogers, E. (2019). https://www.ccdc.cam.ac.uk/Community/blog/share-data-as-a-csd-communication/

Spek, A. L. (2020). Acta Cryst. E76, 1–11.

Taylor, R. & Wood, P. A. (2019). Chem. Rev. 119, 9427–9477.

Thomas, I. R., Bruno, I. J., Cole, J. C., Macrae, C. F., Pidcock, E. & Wood, P. A. (2010). J. Appl. Cryst. 43, 362–366.

Tovee, C. (2018). https://www.ccdc.cam.ac.uk/Community/blog/2018-11-12-structure-quality/

 

William Clegg is one of the founding Editors of Acta Cryst. E; please see his latest article in that journal, "Some reflections on symmetry: pitfalls of automation and some illustrative examples".

18 February 2020
