(9) Review of the CIFtools workshop

To: [email protected]
Subject: (9) Review of the CIFtools workshop
From: [email protected] (Brian McMahon)
Date: Tue, 2 Nov 93 17:05:23 GMT
Dear Colleagues

Attached are my impressions of last month's workshop. This isn't intended
as a complete report of the meeting - I've made no reference to several
interesting talks that weren't directly relevant to our concerns. I believe
Phil Bourne will make the proceedings available over the network, and I'll
supply details of how to obtain them when they're distributed.

I'm also circulating copies of this to Nick Spadaccini and Phil Bourne, who
are invited to correct any errors of fact and debate any matters of
interpretation (mail me at [email protected], guys). 


            CIFtools Workshop Tarrytown NY 15-18 October 1993
            -------------------------------------------------

This workshop was funded by the National Science Foundation to promote the
development of tools for handling macromolecular data in CIF format. It was a
sequel to a workshop in York in April 1993, on the theme of data validation
in macromolecular crystallography. Several participants at both meetings
characterised the York workshop as an event of some significance in the
acceptance and development of the CIF format for data archive and transfer in
this field. It was felt that many participants had come to York with no
particular enthusiasm for CIF. However, the (often animated) discussions at
York had persuaded many of the participants whose particular interests lay in
the sphere of database operations that CIF could be used as a valuable data
transfer mechanism if it were enhanced to indicate explicit relationships
between data items. The recent Workshop was characterised by broad acceptance
of the CIF modifications brought about since York, and an eagerness to
promote applications using the extended CIF format.

The major changes to the CIF format following the York meeting were extensions
to the dictionary definition language (DDL) that is used to define terms in
CIF dictionaries. In particular, the term _category was introduced to
associate data items possessing some common relationship [such relationships
were previously implicit in the hierarchical name elements of a CIF data
name]; and _list_link_child and _list_link_parent terms were introduced to
describe relationships between data appearing in different lists [thus, the
atoms partaking in bonds are labelled by _geom_bond_atom_site_label_1 and
*_2; these must take the values of atom site labels as listed in the
coordinates loop, and so they have a _list_link_parent of _atom_site_label].
Other DDL terms were also introduced to constrain or relate data that should
appear together in a list: _list_reference is the data item present in a list
which allows references to that list; _list_mandatory signals whether the
data item must be present in a list of items of the current _category; and
_list_uniqueness describes those data items which must (singly or in
combination) appear uniquely in a valid list.
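
As a rough illustration of what a _list_link_parent relationship lets a
program check, here is a minimal sketch in Python. The in-memory layout (each
loop as a mapping from data name to a column of values) and the function name
unresolved_links are assumptions made for this example only; they are not part
of the DDL or of any existing CIF software.

```python
# Hypothetical dictionary fragment: child data name -> parent data name,
# as would be declared through _list_link_parent.
LIST_LINK_PARENT = {
    "_geom_bond_atom_site_label_1": "_atom_site_label",
    "_geom_bond_atom_site_label_2": "_atom_site_label",
}


def unresolved_links(loops, link_parent=LIST_LINK_PARENT):
    """Return (child_name, value) pairs whose value is absent from the parent column.

    `loops` is a list of dicts, each mapping a data name to its column of values.
    """
    # Gather every column by data name, across all loops in the data block.
    columns = {}
    for loop in loops:
        for name, values in loop.items():
            columns.setdefault(name, []).extend(values)

    missing = []
    for child, parent in link_parent.items():
        allowed = set(columns.get(parent, []))
        for value in columns.get(child, []):
            if value not in allowed:
                missing.append((child, value))
    return missing


atom_sites = {"_atom_site_label": ["C1", "C2", "O1"]}
bonds = {
    "_geom_bond_atom_site_label_1": ["C1", "C2"],
    "_geom_bond_atom_site_label_2": ["C2", "O2"],
}
# "O2" is not a listed atom site label, so the second bond is flagged.
print(unresolved_links([atom_sites, bonds]))
# prints [('_geom_bond_atom_site_label_2', 'O2')]
```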

These changes taken together permit a detailed description of list
structures that may appear in a CIF, and fulfil most of the requirements of
relating data access across tables in a relational database representation.

However, there was some discussion over the way in which _category should be
defined or understood. In the current definition, if the data item belongs in
a looped list then it may only be grouped with items from the same category.
Therefore, data items from different categories *may not* appear together in
the same list. This was felt by Nick Spadaccini to be unduly restrictive, as
there might reasonably be cause for mixing items of separate categories in
the same list. It was also opposed strongly by Brian Toby, whose powder
extension dictionary currently *requires* lists containing data items of
different _category.
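
The restriction under discussion can be stated as a very small check. The
sketch below is in Python and only illustrates the rule as described above;
the CATEGORY table and both function names are invented for the example.

```python
# Hypothetical mapping from data name to its _category value.
CATEGORY = {
    "_atom_site_label": "atom_site",
    "_atom_site_fract_x": "atom_site",
    "_atom_site_aniso_label": "atom_site_aniso",
    "_atom_site_aniso_U_11": "atom_site_aniso",
}


def loop_categories(loop_names, category=CATEGORY):
    """Return the set of categories represented among the data names of one loop."""
    return {category[name] for name in loop_names if name in category}


def obeys_single_category_rule(loop_names, category=CATEGORY):
    """True if every (known) data name in the loop belongs to the same category."""
    return len(loop_categories(loop_names, category)) <= 1
```

Under this reading, a loop holding both _atom_site_fract_x and
_atom_site_aniso_U_11 is rejected, which is exactly the case that a mixed-list
dictionary would need to allow.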

There is also on occasion a requirement to represent data belonging to the
same category across more than one list. Currently, this is impossible to do
if there is a _list_reference, because all looped data in the category must
contain the data name which acts as the _list_reference, and this cannot
appear in more than one loop.

Nick Spadaccini suggested a more flexible approach to list building. Consider
the _atom_site_ and _atom_site_aniso_ loops, which currently have different
categories (atom_site and atom_site_aniso, respectively). The
_atom_site_fract_ items have a _list_reference of _atom_site_label, but
_atom_site_aniso_U items have a _list_reference of _atom_site_aniso_label.
This implies that the two sets of data items will normally be given in two
separate lists. However, the anisotropic U's *could* be placed in the same
list if the _list_reference value may be derived transitively through the
_list_link_parent pointer. That is, a parser finds an anisotropic U in a list
of atom site labels and fractional coordinates. The parser looks for the
_list_reference value for the U's, which should be _atom_site_aniso_label,
and fails to find it. However, _atom_site_label is the parent of
_atom_site_aniso_label, and *is* found in the list; so the list (containing
x, y, z and U) is valid.
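
The lookup described above might be sketched as follows. This is Python used
purely for illustration, assuming the dictionary has already been read into
two plain tables; the table contents follow the example in the text, and the
function names are hypothetical.

```python
# Hypothetical dictionary fragments, following the example in the text.
LIST_REFERENCE = {
    "_atom_site_fract_x": "_atom_site_label",
    "_atom_site_fract_y": "_atom_site_label",
    "_atom_site_fract_z": "_atom_site_label",
    "_atom_site_aniso_U_11": "_atom_site_aniso_label",
}
LIST_LINK_PARENT = {"_atom_site_aniso_label": "_atom_site_label"}


def reference_satisfied(item, loop_names):
    """True if the item's _list_reference, or an ancestor of it, is in the loop."""
    ref = LIST_REFERENCE.get(item)
    if ref is None:
        return True  # this item imposes no reference requirement
    seen = set()  # guards against a cyclic parent chain
    while ref is not None and ref not in seen:
        if ref in loop_names:
            return True
        seen.add(ref)
        ref = LIST_LINK_PARENT.get(ref)  # climb to the parent, if any
    return False


def loop_is_valid(loop_names):
    """Apply the transitive _list_reference test to every data name in a loop."""
    names = set(loop_names)
    return all(reference_satisfied(name, names) for name in names)
```

With these tables, a loop of _atom_site_label, the three fractional
coordinates and _atom_site_aniso_U_11 passes, whereas a loop containing
_atom_site_aniso_U_11 on its own does not.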

Michael Scharf argued that the changes to the DDL still did not go far enough
towards supplying data in a fully machine-parsable form. For instance, dates
in CIF format are given as strings with verbal instructions for parsing the
strings ('in yy-mm-dd format'), where a date could sensibly be split into
atomic elements (year, month, day) and each such element could be given a
separate data name. There was some discussion over this issue, and it was
pointed out that it was not always reasonable to split an item into its
smallest possible constituent elements: a crystallographer seeking a
bibliographic reference is very unlikely to be interested in, say,
'_citation_date_day'. Shoshana Wodak emphasised that the primary requirement
was to guarantee that the maximum amount of data was present in the file.
Subsequent post-processing strategies could always be devised to extract the
information, if required.

However, it was considered desirable to make the file as fully machine
readable as possible, and one approach to identifying structure within a
string might be to constrain the form of the string to match a regular
expression. This might help in extracting data from the string, and it
would certainly help in validating the contents of the string. For instance,
if the 'type' of a date were defined by a regular expression (such as
[0-9][0-9]-[0-9][0-9]-[0-9][0-9]) this would mean that a supposed date of
93-AB-X@ would be detected as invalid. The idea of such validation types was
further developed by Peter Murray-Rust and others (see below).
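
A minimal sketch of such a check, in Python with its standard re module, is
given here. The pattern is the one quoted above; the names DATE_TYPE and
matches_date_type are invented for the example.

```python
import re

# The regular expression quoted in the text for a yy-mm-dd date.
DATE_TYPE = re.compile(r"[0-9][0-9]-[0-9][0-9]-[0-9][0-9]")


def matches_date_type(value):
    """True if the whole string has the shape dd-dd-dd, where d is a digit."""
    return DATE_TYPE.fullmatch(value) is not None
```

Note that this validates the shape of the string only: 93-AB-X@ is rejected,
but 93-99-99 would be accepted, so a calendar check would still be a separate
step.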

Another suggestion of Michael Scharf was the development of a formalism
allowing data items to be referenced *from within free-text fields*. This
would be a style of hypertext, and Scharf suggested that such links could be
established by use of the SGML dialect already employed in the World-Wide Web
hypertext system.

The suggestion of a hypertext link generated no particular enthusiasm
(although it was generally regarded as a good idea in principle); but there
appears to be a pressing need for standard means of referring to data outside
the current data block, and outside the current file. Brian Toby described
the powder diffractionist's desire to have globally unique identifiers for
data sets referenced in a file; ideally such global identifiers could
actually be used as *locators* of the relevant file. The macromolecular
dictionary developers were also keen to develop a method of providing
external references to files containing lists of allowed keywords or standard
dictionaries of chemical properties [these latter requirements might be met
to some extent by enumerations within the dictionary or by introducing the
concept of 'save frames' into CIF. However, it is preferred to have a
relatively volatile set of keywords maintained outside of the dictionary
itself, and 'save frames' might constitute a very mixed blessing, as
evidenced by the experience to date with MIF].

The 'external reference file pointer' discussed above is one aspect of what
can be called the 'name space' problem. In its original format, the CIF is
seen essentially as a self-contained and local data file. There is now a
requirement for locating CIF data in the crystallographic information
universe. How can one reliably refer to data in other data blocks in the
file, or in other files? Often it is sensible to impose some structure on those
parts of a CIF that could be modified to hold location information. Thus the
data block name, which is a purely arbitrary string, is likely to be
constructed locally in such a way that it labels an individual data set
collected, or an independent refinement result. Should there be an external
specification of ways to do this? If data blocks were ordered so as to
archive a sequence of refinements (or experiments), would it not be useful to
introduce a global_ block (as permitted in the full STAR syntax, and used in
the Dictionaries)? This then would restrict the arbitrariness with which data
can be placed in a CIF; it would not be permissible to concatenate two CIFs
(since the scope of a global_ declaration is from the point of declaration
onwards). Many of these points were raised in general discussion, but without
any resolution.

Peter Murray-Rust described a possible extension to the DDL where a specific
field could contain an algorithm relating the data item currently defined to
other data items. One application of this could be a precise description of
the arithmetic relationship between, say, U and B factors (thus replacing the
rather vague "_related_function constant" entry in the CIF dictionary). The
idea is that the algorithm would be stated either in a machine-parsable
pseudo-code, or, at least for prototyping purposes, in an expressive language
such as the freely-available tcl. So there might be an entry for a data item,
say "_construction", which lists the generating algorithm in a free-text
field. This idea might also be extended to allow for validation of the
contents of a data field - e.g. _symmetry_Int_Tables_number might be related
algorithmically to _symmetry_space_group_H-M.
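
To make the U and B example concrete: for isotropic displacement parameters
the standard relation is B = 8 * pi^2 * U. The sketch below states it in
Python rather than in tcl, purely as an illustration of the kind of algorithm
such a field might carry; the function names are hypothetical, and nothing
here is a proposed DDL syntax.

```python
import math

# Conversion factor between isotropic U and B: B = 8 * pi**2 * U.
EIGHT_PI_SQUARED = 8.0 * math.pi ** 2


def b_from_u(u_iso):
    """Convert an isotropic U (in square angstroms) to the equivalent B."""
    return EIGHT_PI_SQUARED * u_iso


def u_from_b(b_iso):
    """Convert an isotropic B (in square angstroms) to the equivalent U."""
    return b_iso / EIGHT_PI_SQUARED
```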

A DDL working group, including Murray-Rust, Scharf and others interested in
extending the scope of the DDL, discussed a strategy for seeking further
extensions to the DDL. Given that data dictionaries are themselves STAR
files, they should benefit from the CIF principle of *extensibility*. It is
therefore suggested that further modifications to the DDL be undertaken as
local extension dictionaries that will be blended into the official
dictionaries for the purpose of prototyping further developments. The
principles to be followed by the working group are:
 (-1)  All existing CIFs must remain valid.
 (0)   This is considered a local exercise in prototyping dictionaries.
 (1)   Changes will be introduced by merging in extensions from external files.
 (2)   Dictionary validation will be separated operationally from other
       evaluation of data (i.e. consistency between bonds and coordinates is
       not - at least at present - determinable from the dictionary
       relationships alone).
 (3)   More than one group will work *independently* on extending the DDL.
 (4)   A "_validation_type" will be developed. Possible values for this will
       include Boolean, 'multi-valued Boolean' ("yes", "no", "maybe"), enum,
       cif_name, date, regexp, and a "smart" type which contains an
       executable procedure.
 (5)   A file is to be considered a dictionary iff it begins with global_ and
       _dictionary_name.
 (6)   Algorithms will be stated in tcl.
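
As a sketch of how principle (4) might look to a program, the Python fragment
below builds a checking function for a few of the listed types. Everything
here is an assumption made for illustration: the type keywords, the choice of
"yes" and "no" as the Boolean values, and the function name make_validator.
Python also stands in for tcl, which is what the working group actually
proposed for algorithms.

```python
import re

YES_NO = ("yes", "no")  # assumed spelling of the two Boolean values


def make_validator(validation_type, spec=None):
    """Return a function (str -> bool) for one hypothetical _validation_type."""
    if validation_type == "boolean":
        return lambda value: value in YES_NO
    if validation_type == "multi_boolean":
        return lambda value: value in YES_NO + ("maybe",)
    if validation_type == "enum":
        allowed = frozenset(spec)  # spec: iterable of permitted strings
        return lambda value: value in allowed
    if validation_type == "regexp":
        pattern = re.compile(spec)  # spec: regular expression source text
        return lambda value: pattern.fullmatch(value) is not None
    if validation_type == "smart":
        if not callable(spec):
            raise TypeError("a 'smart' type needs an executable procedure")
        return spec  # the procedure itself performs the check
    raise ValueError(f"unknown validation type: {validation_type!r}")
```

For example, make_validator("multi_boolean") accepts "maybe", while
make_validator("boolean") rejects it.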

Some other contributions to the meeting illustrated the extent to which the
macromolecular community is eagerly awaiting CIF. Dave Stampf explained how
the Protein Data Bank is preparing to change its operation from PDB file
format to CIF all at once, with no transition period. This will require staff
retraining, but is considered a better option than trying to introduce a
sliding transition. John Westbrook described how the Nucleic Acid Data Bank
has already for some time used CIF as its internal data exchange format. The
NDB has developed several tools for merging and editing CIFs, although these
depend largely on the syntactic characteristics of the file, and do not
validate fully against _list_link_ style relationships.

Phil Bourne described an approach to a high-order visualisation tool, to
which he proposed to give the name CIFbook. Apparently there has been some
interest from Macromolecular Structures in investigating this approach as a
companion to their current hard-copy data compendia.

Following Nick Spadaccini's description of the MIF project, involving data
exchange files with the full STAR syntax, there was some discussion over the
possibility of introducing multiple-level loops into CIF. By and large,
those present seemed convinced by Nick's argument that the mmCIF project
would not be so far advanced if it had to handle the complexities of the full
STAR syntax. However, it was felt that, as the community developed tools of
increasing sophistication and power, future extensions to include save frames
and nested loops could well be envisaged.


-----
Brian