Report of IUCr Representative to CODATA 2002-2004

  • Date: Fri, 17 Sep 2004 14:44:35 +0100
CODATA requires a report on activities from each member organization
for presentation to its General Assembly (this will take place in Berlin in
November). For your interest I append a copy of the report I have just sent
to CODATA, only slightly beyond the deadline.

Best wishes

                           Report to CODATA
                         of Activities of the
               International Union of Crystallography (IUCr)

The International Union of Crystallography (IUCr) is a scientific union
adhering to the International Council for Science (ICSU). Its objectives
are to promote international cooperation in crystallography and to
contribute to all aspects of crystallography, to promote international
publication of crystallographic research, to facilitate standardization
of methods, units, nomenclatures and symbols, and to form a focus for
the relations of crystallography to other sciences.

Crystallographic Databases
Several independent databases exist that store and manage the results of
crystal structure determinations. Among the most important are 
 - the Cambridge Structural Database for organic and metal-organic
   small-molecule structures and oligonucleotides (CSD);
 - the Protein Data Bank for protein and nucleic acid structures (PDB);
 - the Inorganic Crystal Structure Database for inorganic materials (ICSD);
 - the Metals Crystallographic Data File for metals (CRYSTMET).

Other crystallographic databases store non-structural data, including:
 - the NIST Biological Macromolecule Crystallization Database and the
   NASA Archive for Protein Crystal Growth Data;
 - the Powder Diffraction File.

These databases are curated by independent organisations, but the IUCr
monitors their development through a standing Database Committee (CCD) that
reports directly to the Union's Executive Committee. Among the activities
noted during the period by the CCD are:

 - Publication of a special issue of the journal Acta Crystallographica
   devoted to the crystallographic databases. The issue contained current
   descriptions of the major databases and their access software systems,
   together with a number of papers that reviewed their research
   applications across a very broad range of science.
 - The number of macromolecular structures deposited with the Protein Data
   Bank is growing at a high rate (18% during 2002); new deposits are
   being ably assimilated under the management of the Research
   Collaboratory for Structural Biology. By mid-July 2003 the total number
   of holdings was over 21500. An extensive program has been undertaken to
   standardize the archival holdings and export them in CIF and XML formats.
 - In 2003, the Cambridge Crystallographic Data Centre released Version 1.0
   of the enCIFer program for CIF validation and editing, which is available
   for free download from their web site for bona fide research use.
 - The International Centre for Diffraction Data released PDF-4/Organics,
   a set of powder patterns calculated from the contents of the Cambridge
   Structural Database.

Data Exchange
Development continues on the Crystallographic Information File (CIF),
the standard file format for archiving and exchanging crystallographic data.
A new dictionary of data items for reporting accurate electron densities in
crystals (rhoCIF) was released in August 2003, and revised versions of the
dictionaries of core data items in small-molecule structural crystallography
(coreCIF) and of image-plate data, annotation and analysis (imgCIF) were
released during 2003. These complement the stable
dictionaries of data items for use with powder diffraction (pdCIF),
modulated structures (msCIF), macromolecular structures (mmCIF) and the
description of crystallographic symmetry (symCIF).

As part of its data standardization program, and in preparation for
harvesting and annotating the expected large number of structural genomics
results, the Protein Data Bank has developed an extensive dictionary of CIF
data items complementing the mmCIF dictionary. These items will be tested
and refined in a number of forthcoming procedures, including the development
with the IUCr of a structure reports section in a new journal of the Acta
Crystallographica family, and are likely to lead in due course to an
expansion of the standard mmCIF data dictionary.

The IUCr representative to CODATA has been privileged to collaborate with
Professor S. R. Hall, the inventor of CIF, as co-editor of a volume in the
reference series International Tables for Crystallography that will provide
complete and authoritative documentation of the CIF standard. This volume
will be published in early 2005.

CIF remains the data exchange format of choice within crystallography, but
interest is growing in format translation into and out of XML. The Protein
Data Bank now makes available macromolecular structure descriptions in XML
(using a schema derived from the mmCIF data model). Collaboration continues
with IUPAC on such topics as the development of a chemical identifier (the
IUPAC-NIST Chemical Identifier INChI), and an IUCr working party on phase
identifiers is producing a recommendation for the incorporation of crystal
structure phases within INChI.
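
As an illustration of what translating CIF into XML involves, the following
is a minimal sketch only. It assumes a CIF that contains nothing but data
block headers and single-line name/value pairs (no loops, multi-line text
fields or save frames), and the XML element names "cif", "data_block" and
"item" are hypothetical choices made for this example; they are not the
schema used by the Protein Data Bank.

```python
import xml.etree.ElementTree as ET


def parse_simple_cif(text):
    """Parse 'data_' headers and single-line '_name value' pairs only.

    Loops, multi-line text fields and save frames are deliberately not
    handled; this is an illustrative subset of CIF, not a full parser.
    """
    blocks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.lower().startswith("data_"):
            current = blocks.setdefault(line[5:], {})
        elif line.startswith("_") and current is not None:
            parts = line.split(None, 1)
            if len(parts) != 2:
                continue  # value on a following line: outside this sketch
            name, value = parts[0], parts[1].strip()
            # Remove matching quote delimiters around the value, if present.
            if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
                value = value[1:-1]
            current[name] = value
    return blocks


def cif_to_xml(blocks):
    """Serialize parsed blocks using hypothetical element names."""
    root = ET.Element("cif")
    for block_name, items in blocks.items():
        block = ET.SubElement(root, "data_block", name=block_name)
        for name, value in items.items():
            item = ET.SubElement(block, "item", name=name)
            item.text = value
    return ET.tostring(root, encoding="unicode")


EXAMPLE = """\
data_example
_cell_length_a   10.000
_cell_length_b   12.500
_symmetry_space_group_name_H-M  'P 21/c'
"""

xml_text = cif_to_xml(parse_simple_cif(EXAMPLE))
print(xml_text)
```

Keeping the CIF data name as an attribute rather than as an element name is
one possible design choice: it leaves the names exactly as they appear in the
dictionaries, so translation back into CIF needs no renaming step.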

Data Validation
All structural data sets published in IUCr journals have since 1990 been
checked for internal consistency by software capable of reading CIF
submission or deposit files directly. The checking procedures are published
on the web, and a public service to return a standard report on structures
subjected to these checks has been established at http://checkcif.iucr.org.
This service has begun to attract sponsorship from scientific publishers and
databases, and has become accepted as a community standard for reviewing and
assessing the consistency and quality of small-molecule and inorganic
structure determinations.
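
To give a flavour of what an internal-consistency check can look like, the
sketch below recomputes the unit-cell volume from the six cell parameters and
compares it with a reported volume. It is an illustrative example only and
is not taken from the checkCIF implementation; the function names and the
relative tolerance are assumptions made for this example.

```python
import math


def cell_volume(a, b, c, alpha, beta, gamma):
    """Unit-cell volume from cell lengths and angles (angles in degrees).

    Uses the general triclinic formula
    V = a*b*c*sqrt(1 - cos^2(alpha) - cos^2(beta) - cos^2(gamma)
                   + 2*cos(alpha)*cos(beta)*cos(gamma)).
    """
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(
        1.0 - ca * ca - cb * cb - cg * cg + 2.0 * ca * cb * cg
    )


def volume_is_consistent(reported, a, b, c, alpha, beta, gamma, rel_tol=1e-3):
    """Return True if the reported volume matches the recomputed one.

    rel_tol is a hypothetical threshold chosen for this sketch.
    """
    computed = cell_volume(a, b, c, alpha, beta, gamma)
    return math.isclose(reported, computed, rel_tol=rel_tol)


# A cubic cell of edge 10 has a volume of 1000 (in cubed length units).
print(volume_is_consistent(1000.0, 10.0, 10.0, 10.0, 90.0, 90.0, 90.0))
print(volume_is_consistent(1100.0, 10.0, 10.0, 10.0, 90.0, 90.0, 90.0))
```

A check of this kind needs only values already present in the submitted
file, which is what makes it suitable for automated reporting.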

Electronic Publishing
The IUCr continues to publish six primary research journals in
crystallography, and a seventh covering the technology, instrumentation
and uses of synchrotron radiation. A new online-only journal, Structural
Biology and Crystallization Communications, will be launched in 2005 and
will include macromolecular structure determinations. The experimental data
sets discussed in these publications are freely available for download as
supplementary files. Access to these files is not restricted to journal
subscribers.

The IUCr collaborates with the Cambridge Crystallographic Data Centre and
the Inorganic Crystal Structure Database to check new submissions for prior
publication, and is continuing to pursue its goal of establishing
bidirectional hyperlinks between published literature articles and
associated database records.

Open Access to Crystallographic Data
Although the IUCr has always provided open access to the crystallographic
data associated with its publications, and has championed the open
availability of structural data sets (e.g. in the recommendations of the
InterUnion Bioinformatics Group co-sponsored by CODATA), it has become
aware of community concerns that more is needed to secure this goal. Two
community-led initiatives are currently taking shape. In one, voluntary
submission by researchers of their individual data sets to a crystallography
open database is being encouraged. In the other, service crystallography
facilities are collaborating with national scientific computing grid
funding agencies to collect and store data sets. The motivations behind the
two approaches are different; one seeks to make visible data sets that 
accompany published articles but are either not held as supplementary
material at all or are released only to journal subscribers; the other aims
to provide early access for the existing databases to collected data. Both
approaches make provision for the deposit of data that does not accompany
published literature, and both make use of open-source, open-access
software for data harvesting, storage and searching.

The IUCr has certain concerns regarding the critical evaluation of data
collected in these ways, and the longevity of community-sponsored data
repositories; it intends to work with the parties concerned to increase the
value of these initiatives. Among aspects to be considered are
the provision of critical analyses of data sets (for which checkCIF reports
may be suitable), the establishment of portals or common search engines
covering a multiplicity of such sites, and mirroring to safeguard
long-term access.

Long-Term Preservation of Digital Content
The IUCr's interest in long-term preservation of digital publications
and data continues with a project undertaken for ICSTI to design and 
propagate a questionnaire to assess current practice within crystallography.
The IUCr representatives to ICSTI and CODATA have worked together on the
project, and are currently analysing data from over 600 individual and 20
institutional respondents. The results will be published in 2005.

Brian McMahon

17 September 2004
