Feature article

Duplicate Dilemma – when should experimental crystallographic databases be concerned about near-duplicate crystal structures?

Natalie Johnson, Ian Bruno, Seth Wiggin, Matt Lightfoot

Figure 1. Image showing the overlay of different structure determinations of the compound ROY, famous for its ability to form multiple polymorphs. The CSD contains over 90 determinations of ROY; the image from CCDC’s Mercury visualiser shows an overlay of eight data collections that were determined by a Crystal Packing Similarity search to have a cluster of 15 molecules (the central molecule plus 14 others) in common, compared to the original report published in 2000.

“Duplicates of crystal structures are flooding databases, implicating repositories hosting organic, inorganic, and computer-generated crystals” was a claim made in a recent C&EN article expressing concerns that “duplicate structures haunt crystallography databases” and raising questions about “curation practices at databases” [1].

If true, this would of course be of concern to science, crystallography and those databases that set out to provide trusted access to quality research data. There are perhaps two aspects to this – how many duplicates are there and are they necessarily a bad thing?

In this article we explore these aspects from the perspective of an experimental crystal structure database and in particular the Cambridge Structural Database (CSD), a database of over 1.4 million experimentally measured organic and metal-organic crystal structures curated by the Cambridge Crystallographic Data Centre (CCDC).

While highlighting the issues researchers had found with structures predicted by GNoME – a deep learning tool that claimed to have predicted 2.2 million new crystal structures – the C&EN article also commented on work by Professor Vitaliy Kurlin from the University of Liverpool identifying near-duplicate structures in crystallographic databases made up of experimentally produced data [2]. The CSD was reported as having 8343 near-duplicate structures (0.92% of the CSD structures used in the analysis or 0.62% of the total CSD).

The CSD, like other experimental databases, aims to be a record of all published crystal structure models. Reproducibility has been a cornerstone of the scientific method for centuries. Scientific experiments are rarely performed once and may be repeated to ensure the same result is achieved. In structural chemistry, the same compound may be studied in a variety of different ways, under different conditions or by different research groups looking at the same or different aspects of a structure. If these structures are characterised by X-ray crystallography, this can result in the publication of multiple measurements of the same compound, sometimes even within one publication.

Unlike in a computationally generated database of new crystal structures, near-duplicate structures in experimental data are expected as part of the scientific process. Equally, the same crystal structure may be used by authors in more than one article to support different scientific narratives. As Dr. Saulius Gražulis, who maintains the Crystallography Open Database (which was also surveyed by Professor Kurlin), noted in the C&EN article, “The same structure published in two different journals or two different books is not a duplicate. It is a historical fact, which we preserve” [1].

The CSD has included multiple experimental determinations of the same compound almost since its creation over 60 years ago. This is reflected in the identifier scheme adopted early in the CSD’s history. Each structure in the CSD is assigned a CSD Refcode, comprising six characters optionally followed by two digits. Different determinations of the same structure share the same six characters and vary in their digits, resulting in Refcode “families” that group together studies of the same compound. The earliest instance of the CSD – a book series called Molecular Structures & Dimensions, published from 1970 to 1984, that includes bibliographic records and structural information about published X-ray structures – contains several instances of multiple determinations of the same structure [3]. One example is the entry for hexamethylenetetramine (Figure 2), where a single publication reported six different data collections for the compound at various temperatures and with different X-ray wavelengths. These entries were assigned Refcodes HXMTAM–HXMTAM05. To date, around 5% (approximately 67,000 of 1,259,000) of the Refcode families in the CSD contain more than one entry.
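
The Refcode scheme described above lends itself to a simple grouping step. The following is a minimal Python sketch, not CCDC code: it assumes the six characters are uppercase letters, and the function names (`split_refcode`, `group_by_family`, `multi_entry_fraction`) are made up for illustration. It splits each Refcode into family and suffix, groups Refcodes by family, and reports what fraction of families contain more than one entry.

```python
import re
from collections import defaultdict

# Assumption: a Refcode is six uppercase letters, optionally followed by two digits.
REFCODE_PATTERN = re.compile(r"([A-Z]{6})(\d{2})?")


def split_refcode(refcode):
    """Return (family, suffix) for a Refcode; suffix is None when absent."""
    match = REFCODE_PATTERN.fullmatch(refcode.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised Refcode: {refcode!r}")
    return match.group(1), match.group(2)


def group_by_family(refcodes):
    """Map each six-letter family to the sorted list of its Refcodes."""
    families = defaultdict(list)
    for refcode in refcodes:
        family, _ = split_refcode(refcode)
        families[family].append(refcode.strip().upper())
    return {family: sorted(members) for family, members in families.items()}


def multi_entry_fraction(families):
    """Fraction of families that contain more than one entry."""
    if not families:
        return 0.0
    repeated = sum(1 for members in families.values() if len(members) > 1)
    return repeated / len(families)


# Refcodes taken from the examples mentioned in this article.
example = ["HXMTAM", "HXMTAM01", "HXMTAM05", "CBMZPN22"]
families = group_by_family(example)
print(families)
print(f"{multi_entry_fraction(families):.0%} of families have more than one entry")
```

Applied to a full list of Refcodes, the same fraction is the quantity quoted above as around 5%.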

Figure 2. An excerpt from volume A1 of 'Molecular Structures and Dimensions' published in 1972 by the CCDC and IUCr showing the entry for hexamethylenetetramine with six different X-ray collections of the structure.

Every new data collection for the same compound can add a valuable piece of information to the story of the structure and its properties. Multiple data collections of the same chemical compound can provide insight into how the structure of a compound changes with variations in temperature or pressure – discovering new polymorphs, identifying phase transitions in the structure or phenomena such as negative thermal expansion. The field of crystallography is also constantly developing and new instrumentation may lead to the determination of better-quality crystal structures of existing compounds.

More reliable information about hydrogen positions can come from measurements using neutron diffraction [4], or from measuring compounds with emerging techniques such as electron diffraction, which is being used for a growing number of structures [5]. Higher resolution collections can focus on the modelling of aspherical electron density in multipole refinements [6], while existing structures can be re-refined using techniques that go beyond the independent atom model, like Hirshfeld Atom Refinement [7] or Transferable Aspherical Atom models.

Not all of these repeated collections will be near-duplicates. Different experimental conditions (such as changes in temperature) would, in many cases, result in a large enough difference in the atomic positions within the structural model that they would not be flagged by methods such as Professor Kurlin’s Pointwise Distance Distributions (PDD) technique, which uses an Earth Mover’s Distance of less than 0.01 Å to identify pairs of structures that are near-duplicates. Near-duplicate structures are thus not necessarily structures with identical atomic positions, but those with very small differences in atomic coordinates. When differences are small it is not unreasonable to question why: are they indicative of a dataset that has been misappropriated or manipulated, an oversight on the part of the database, or a legitimate scientific result?
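
To make the idea of a distance threshold concrete, here is a deliberately simplified Python sketch. It is not the PDD implementation from [2]: it works on small finite sets of points rather than periodic crystals, gives every point the same weight, and assumes an L-infinity (Chebyshev) distance between rows of neighbour distances. The function names and toy coordinates are made up for illustration; only the 0.01 Å cut-off comes from the text above.

```python
import itertools
import math

THRESHOLD = 0.01  # Å, the cut-off quoted in the text above


def pdd_rows(points, k):
    """For each point, the sorted distances to its k nearest neighbours."""
    rows = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        rows.append(tuple(dists[:k]))
    return sorted(rows)


def chebyshev(row_a, row_b):
    """L-infinity distance between two rows (an assumed ground metric)."""
    return max(abs(a - b) for a, b in zip(row_a, row_b))


def emd_uniform(rows_a, rows_b):
    """Earth Mover's Distance for equally sized sets with uniform weights.

    With equal uniform weights an optimal transport plan can always be chosen
    as a one-to-one matching, so a brute-force search over permutations is
    exact. This is only practical for a handful of points.
    """
    if len(rows_a) != len(rows_b):
        raise ValueError("this sketch needs the same number of rows in both sets")
    n = len(rows_a)
    best = min(
        sum(chebyshev(rows_a[i], rows_b[j]) for i, j in enumerate(perm))
        for perm in itertools.permutations(range(n))
    )
    return best / n


def is_near_duplicate(points_a, points_b, k=3):
    """Flag two point sets whose distance falls below the threshold."""
    return emd_uniform(pdd_rows(points_a, k), pdd_rows(points_b, k)) < THRESHOLD


# Toy coordinates in Å: a square, a copy with one point moved by 0.005 Å,
# and a uniformly stretched copy.
square = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0), (0.0, 1.5, 0.0)]
shifted = [(0.005, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0), (0.0, 1.5, 0.0)]
stretched = [(0.0, 0.0, 0.0), (1.6, 0.0, 0.0), (1.6, 1.6, 0.0), (0.0, 1.6, 0.0)]

print(is_near_duplicate(square, shifted))
print(is_near_duplicate(square, stretched))
```

Because the comparison uses only interatomic distances, it does not depend on how either structure is positioned or oriented, which is why a small shift of one point stays under the threshold while a uniform stretch of the whole arrangement does not.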

One reason for a near-duplicate pair of structures could be multiple models of the same experimental data [6], [8]. In these cases a cross-reference is added in the CSD between structures, indicating that one structure is a re-refinement of another. Cross-references are links between structures; they can be found in the desktop CSD software ConQuest (Figure 3) and are also accessible from the cross-references field [9] in the Entry module of the CSD Python API. Cross-references in the CSD are added for re-refinements of the same data or re-interpretations of existing structural models. Duplicate data checks take place during the deposition and curation process with the aim of assigning these links between structures. Cross-references are also added between related structures in other Refcode families, highlighting structures of tautomers, stereoisomers and racemates in the CSD.
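
As a rough illustration of how such links might be collected programmatically, the Python sketch below walks over entry-like objects and gathers (source, target) identifier pairs. It is duck-typed and runs without the CSD: it assumes each entry exposes `identifier` and `cross_references`, and that each cross-reference exposes `identifiers`, which should be checked against the API documentation [9]. The function name and the stand-in Refcodes `ABCDEF` and `ABCDEF01` are invented for the example.

```python
from types import SimpleNamespace


def cross_reference_links(entries):
    """Return the set of (identifier, linked_identifier) pairs.

    Assumes each entry has `identifier` and `cross_references`, and each
    cross-reference has `identifiers` (attribute names to verify in [9]).
    """
    links = set()
    for entry in entries:
        for xref in entry.cross_references:
            for target in xref.identifiers:
                links.add((entry.identifier, target))
    return links


# Stand-in objects with made-up Refcodes, so the sketch runs without the CSD.
original = SimpleNamespace(
    identifier="ABCDEF",
    cross_references=[SimpleNamespace(identifiers=["ABCDEF01"])],
)
rerefined = SimpleNamespace(identifier="ABCDEF01", cross_references=[])

print(cross_reference_links([original, rerefined]))

# With a licensed installation of the CSD Python API, real entries could be
# read instead, for example:
#   from ccdc import io
#   entry = io.EntryReader("CSD").entry("CBMZPN22")
```

A pair flagged as near-duplicate that already appears among these links has a documented explanation; a flagged pair with no link is a more natural candidate for closer inspection.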

Figure 3. Screenshot of CCDC ConQuest software showing the Author/Journal information for CBMZPN22, illustrating cross-references with other models of the same experimental data in the CSD.

There may also be times when researchers deposit very similar models of the same structure in the CSD. This occurrence was acknowledged by Professor Kurlin and others in a recent paper discussing near-duplicate structures in the PDB [10]. These structures may have undergone a round or two of re-refinement, resulting in a small change in the atomic positions or atomic displacement parameters, or there may be different metadata present in the CIF. Often the reflection data are the same in both CIFs. As the CCDC’s policy is to be a record of published data, if there has been a separate deposition, both structural models are kept (and a cross-reference is added between structures).

However, there may be instances where a near-duplicate structure indicates a potential issue with the published crystal structures. Previous work by Professor Kurlin identified five pairs of structures in the CSD where the atomic coordinates were exact duplicates for different chemical compounds [11], [12]. These pairs were subsequently investigated by the CCDC and the journals, resulting in three structures being either updated or retracted. In the remaining cases, a remark has been added to the structures to highlight Professor Kurlin’s findings to users.

The CCDC’s retraction policy is that when a published journal article is retracted by the journal – for any reason, not necessarily due to a problem with the crystallographic data – the associated data is labelled as retracted in the CSD [13]. If a structure is retracted then the publication record is maintained, but all atomic information (2D chemical diagram and 3D atomic coordinates) is removed along with other experimental data. A remark is also added to the entry indicating that the article has been retracted, along with a link to the retraction notice.

Data integrity is at the heart of the CCDC. We aim to be a trusted resource of crystallographic data, so as well as ensuring data is retracted promptly, we have been developing dedicated processes aimed at identifying potential cases of misconduct. Every structure in the CSD is curated by an expert curator, to ensure structures are findable and the information describing the entry is standardised. In recent years the CCDC has also developed workflows that are run prior to each data release to identify structures that may require further investigation. We work closely with publishers and contact the relevant journals if we have data integrity concerns involving any of the structures in the CSD, and we also investigate issues highlighted to us by our users [14]. We welcome discussion, collaboration with researchers and any ideas as to how we can continue improving to ensure the CSD remains a trusted resource for decades to come.

Whilst multiple collections of the same structure can be useful in some areas of research, this may not always be the case. To minimise biases that may be introduced by multiple determinations of the same structure or form when performing large-scale structural analysis, the CCDC maintains four ‘best representative’ CSD subsets [15]. Each subset collects, for structures that meet certain data quality criteria, the best representative structure of a compound according to one of four criteria: i) at high temperature, ii) at low temperature, iii) with all hydrogen positions modelled or iv) with the lowest R factor. Searching or assessing data using one of these subsets will return only one record for each Refcode family (unless multiple polymorphs are present). Other search functionality, like filters, can also be applied to narrow down searches to help select the most relevant structures for an individual’s research.
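
The general idea of keeping one representative per family can be sketched in a few lines of Python. This is an illustration only and not the procedure used to build the CCDC subsets [15]: it ignores polymorphs and data quality criteria, the `Determination` record and the function name are hypothetical, and the R factors are invented values. It keeps the determination with the lowest R factor in each Refcode family.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Determination:
    """Hypothetical record for one CSD entry."""
    refcode: str
    r_factor: float  # percent; values below are invented for the example


def lowest_r_factor_per_family(determinations):
    """Keep, for each six-character family, the entry with the lowest R factor."""
    best = {}
    for det in determinations:
        family = det.refcode[:6]
        current = best.get(family)
        if current is None or det.r_factor < current.r_factor:
            best[family] = det
    return best


records = [
    Determination("HXMTAM", 6.1),
    Determination("HXMTAM01", 3.4),
    Determination("HXMTAM05", 4.8),
    Determination("CBMZPN22", 2.9),
]

for family, det in sorted(lowest_r_factor_per_family(records).items()):
    print(family, det.refcode, det.r_factor)
```

Swapping the comparison for a different key (for example measurement temperature) would give the analogous selection for the other criteria listed above.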

So when should experimental databases be concerned about near-duplicate structures? Many of the structures identified as near-duplicates in experimental databases probably result from repeat collections of data for the same compounds, and it is unlikely that every pair identified is a cause for concern. Near-duplicate structures should be assessed individually, to ensure that the similarities are rationalised, scientifically valid, and represented correctly in the databases. In the rare case that a potential concern with the data is identified, the publisher will be contacted and asked to investigate any similarities with other structures. An understanding of the aims of experimental crystallographic databases – as records of all published structures – along with good curation of the data will enable users to get the most out of a very valuable resource. As stated by the Executive Director of the CCDC, Suzanna Ward, in C&EN, “It’s trying to make the data available and then allow users to pick fit-for-purpose data” [1].

Note: The CCDC has contacted Professor Kurlin to request a list of near-duplicate structures identified in his research for further analysis.

All authors work for CCDC, an organisation that curates and maintains the Cambridge Structural Database.

References

[1] D. S. Chawla, “Duplicate structures haunt crystallography databases,” Chemical & Engineering News, Dec. 2025, doi: https://doi.org/10.1021/cen-259818-Feature. Available: https://cen.acs.org/research-integrity/Duplicate-structures-haunt-crystallography-databases/103/web/2025/12. [Accessed: Jan. 15, 2026]

[2] D. Widdowson and V. Kurlin, “Pointwise Distance Distributions for detecting near-duplicates in large materials databases,” arXiv.org, Aug. 2021, doi: https://doi.org/10.1137/25M1736657. Available: https://arxiv.org/abs/2108.04798v4. [Accessed: Jan. 16, 2026]

[3] O. Kennard, Molecular Structures and Dimensions Volume A1: Interatomic Distances 1960-1965. Oosthoek Publishing Company, 1972.

[4] F. H. Allen and I. J. Bruno, “Bond lengths in organic and metal-organic compounds revisited: X—H bond lengths from neutron diffraction data,” Acta Crystallographica Section B Structural Science, vol. 66, no. 3, pp. 380–386, May 2010, doi: https://doi.org/10.1107/s0108768110012048.

[5] C. G. Jones et al., “The CryoEM Method MicroED as a Powerful Tool for Small Molecule Structure Determination,” ACS Central Science, vol. 4, no. 11, pp. 1587–1592, Nov. 2018, doi: https://doi.org/10.1021/acscentsci.8b00760.

[6] R. Kamiński et al., “Statistical analysis of multipole-model-derived structural parameters and charge-density properties from high-resolution X-ray diffraction experiments,” Acta Crystallographica Section A Foundations and Advances, vol. 70, no. 1, pp. 72–91, Dec. 2013, doi: https://doi.org/10.1107/s2053273313028313.

[7] S. C. Capelli, H.-B. Bürgi, B. Dittrich, S. Grabowsky, and D. Jayatilaka, “Hirshfeld atom refinement,” IUCrJ, vol. 1, no. 5, pp. 361–379, Aug. 2014, doi: https://doi.org/10.1107/s2052252514014845.

[8] M. Chodkiewicz and K. Woźniak, “Towards improved accuracy of Hirshfeld atom refinement with an alternative electron density partition,” IUCrJ, vol. 12, no. 1, pp. 74–87, Jan. 2025, doi: https://doi.org/10.1107/s2052252524011242.

[9] CCDC, “Entry API — CSD Python API 3.6.1 documentation,” 2019. Available: https://downloads.ccdc.cam.ac.uk/documentation/API/modules/entry_api.html#ccdc.entry.Entry.cross_references. [Accessed: Jan. 18, 2026]

[10] A. Wlodawer et al., “Duplicate entries in the Protein Data Bank: how to detect and handle them,” Acta Crystallographica Section D Structural Biology, vol. 81, no. 4, Mar. 2025, doi: https://doi.org/10.1107/s2059798325001883.

[11] O. Anosova, V. Kurlin, and M. Senechal, “The importance of definitions in crystallography,” IUCrJ, vol. 11, no. 4, pp. 453–463, May 2024, doi: https://doi.org/10.1107/s2052252524004056.

[12] D. Widdowson, M. M. Mosca, A. Pulido, A. I. Cooper, and V. Kurlin, “Average minimum distances of periodic point sets – foundational invariants for mapping periodic crystals,” MATCH Communications in Mathematical and in Computer Chemistry, vol. 87, no. 3, pp. 529–559, Dec. 2021, doi: https://doi.org/10.46793/match.87-3.529w.

[13] CCDC, “CCDC Support: Retractions in the Cambridge Structural Database,” 2025. Available: https://support.ccdc.cam.ac.uk/support/solutions/articles/103000353401-retractions-in-the-cambridge-structural-database. [Accessed: Jan. 15, 2026]

[14] CCDC, “I’ve found an entry in the CSD which either contains an error or does not correctly reflect the original publication. Who should I notify?,” CCDC, Aug. 01, 2024. Available: https://support.ccdc.cam.ac.uk/support/solutions/articles/103000306314-i-ve-found-an-entry-in-the-csd-which-either-contains-an-error-or-does-not-correctly-reflect-the-origi. [Accessed: Jan. 20, 2026]

[15] J. van de Streek, “Searching the Cambridge Structural Database for the ‘best’ representative of each unique polymorph,” Acta Crystallographica Section B Structural Science, vol. 62, no. 4, pp. 567–579, Jul. 2006, doi: https://doi.org/10.1107/s0108768106019677.

4 February 2026

IUCr Newsletter