The Cambridge Crystallographic Data Centre

40 years of database development, software and research


The CCDC was created to record crystal structures, and the Cambridge Structural Database (CSD) was one of the first numerical databases created anywhere in the world. The CCDC dates from January 1, 1965, when David Watson joined Olga Kennard’s research group in the Department of Chemistry at the University of Cambridge, UK. They began collecting references to organic and metal-organic crystal structures and encoding bibliographic and numerical data in ‘machine-readable’ form. Subsequent CSD growth statistics suggest that, had this work not started then, it might never have started at all. But it did, and 40 years later the fully retrospective CSD contained 335,276 structures.

The CSD – then and now

Early progress was slow: computer technology was in its infancy and hardware was temperamental! Nevertheless, other group members were increasingly attracted to the new world of informatics, since crystallographer-programmers were needed to turn the vision into a reality. They created systems for logging, encoding and validating data, and today our scientific editors and their assistants are fully responsible for CSD creation and maintenance. Structure validation software was an early requirement, to detect local data entry mistakes and the typographical errors that occurred in some 10% of printed tables. Data acquisition itself has now completed its own ‘electronification’ from printed journal tables to electronic CIF depositions. The IUCr has itself played a key role in this essential transformation.

[Photo: Former directors David Hartley and Olga Kennard join current director Frank Allen. Photo courtesy of Sarah Houlton, Chem@Cam Newsletter.]
An electronic bibliographic file was being regularly updated by 1970, and was published via the Molecular Structures and Dimensions book series, again with IUCr involvement. Meanwhile, the first 5,000 crystal structures were being validated and entered into a CSD data file. Finally, a chemical connectivity file was created, thus enabling 2D and 3D substructure searching and adding tremendous value to the underlying crystal structure data. These three separate files were soon amalgamated into the present-day CSD.

The CSD now includes 335,276 crystal structures, and grew by nearly 29,000 structures in 2004. The size and complexity of structures have also increased steadily with time. The CCDC has excellent relationships with journals, and 84 titles now require electronic data deposition to the CCDC when a paper is submitted. These data enter the CSD when the paper is published, and the CCDC maintains a growing parallel archive of more than 160,000 of these initial ‘raw’ CIFs. Individual CIFs can be freely downloaded from this archive using a simple web form. The CCDC’s enCIFer software, available for free download, is designed to help depositors create format-compliant CIFs for submission to journals and databases. Current CSD statistics, available on the website, refer primarily to published data, although the CSD now contains more than 3,000 Private Communications. Finally, the very large number of unpublished structures that languish in laboratory archives is surely a matter that must be addressed in the future.

Millions of lines of code

Software development has always been at the heart of CCDC activities, and we have run the gamut from FORTRAN II to our current object-oriented C++ environment. The CCDC develops three types of code: that which underpins CSD creation, that which forms part of the distributed CSD System, and applications software that uses crystal structure data to solve problems in structural chemistry and biology. CCDC software developers have been at the forefront in creating novel systems for 3D substructure searching, including searches for intermolecular interactions, and the statistical analysis of parameter distributions retrieved from the CSD.

Two new components of the distributed CSD System have been added since 1997. These are knowledge-based libraries of intramolecular geometry (Mogul) and intermolecular interactions (IsoStar). They provide click-of-a-button access to millions of individual pieces of information that can be derived from the CSD (and from PDB protein-ligand complexes in the case of IsoStar). Further development, and integration of structural knowledge with other software, is ongoing in both cases.

Recent years have also seen the CCDC diversify into developing specific software applications for rational drug design (GOLD, SuperStar, Relibase+) and structure solution from powder data (DASH). All of these products use crystal structure data from the CSD or PDB, and most are being developed through collaborations with industry and academia. The life sciences products, concentrating on protein-ligand interactions and protein-ligand docking, help solve difficult problems, and promote the value of small-molecule crystal structures to a wide audience. The CCDC continues to broaden its own horizons, by seeking new areas of science where structural data adds value to research and development activities.

CSD System releases

By the mid-1970s, the first version of the CSD System had been released to academics in the UK, USA, Japan and Italy. Many other countries formed academic National Affiliated Centres and access extended rapidly. The pharmaceutical and agrochemicals industries began to experiment with computational chemistry and modelling tools for rational molecular design, and the number of industrial subscribers began to rise during the 1980s. Early releases were on magnetic tape, and software was released as source code to be compiled under the user’s local operating system. Today all that has changed, with just a few universal operating systems, CDs and internet downloads, click-of-a-button installers, and e-mail support desks.

1,200 applications papers

The first papers that made use of the CSD for fundamental research began to appear in the late 1970s. Inspired by the work of Hans-Beat Buergi and Jack Dunitz on structure correlation, there was a rapid acceleration in this type of research from about 1980. A key issue was to improve database searching and develop a proper statistical basis for data analysis, so that improvements in CSD distributed software were often driven by current research needs. The CCDC itself has been heavily involved in this research effort, and has published applications papers covering both intramolecular and intermolecular topics. Tables of mean bond lengths [1,2] have jointly received more than 10,000 citations, while in the study of intermolecular interactions the CSD has provided tools for studying protein-ligand interactions, and played its part in the emergence of crystal engineering as a sub-discipline. One CCDC paper, which categorised short C-H…O interactions as true H-bonds [3], has received more than 1,000 citations and has re-shaped the global view of weaker interactions. The CCDC maintains a web-accessible database of published product applications, and the 1,200 current entries chart the many and varied uses of the CSD. The CCDC is well represented with over 150 papers, but more than 1,000 other references indicate the truly international impact of CSD-based research.

The CCDC as an independent institution

The CCDC was grant-funded from 1965 until 1989, when it became an independent institution: a non-profit charitable Company Limited by Guarantee under English law. Thus, the CCDC must now be financially self-sufficient, and any surplus income must be used internally (e.g. for new equipment) or for specific charitable activities. Significant contributions are made to ensure CSD System access in developing countries, together with support for students and professional organisations. The CCDC’s affairs are overseen by an international Board of Governors, eight eminent scientists who, in their turn, are responsible to UK Companies House and to the Charity Commissioners for England and Wales.

Our most valuable assets: staff and customers

The CCDC has expanded steadily, and now has 50 employees divided between database creation, product development, research, scientific and technical support, and administration. More than 250 people have worked at the CCDC, and each has left their mark on the organisation. The CCDC now has a worldwide customer base in academia and industry, and its 2,000 CSD System licences were distributed across 56 countries in 2004. Customers and data depositors also leave their mark, through their constructive feedback on our efforts. The CCDC, our products, and ultimately all of our customers, have benefited enormously from these interactions.

We look forward to the next 40 years.

[1] J. Chem. Soc. Perkin Trans., pp S1-S19, 1987.
[2] J. Chem. Soc. Dalton Trans., pp S1-S83, 1989.
[3] J. Amer. Chem. Soc., 104, 5063-70, 1982.