Approval of the CIF methods Dictionary Definition Language

  • To: coreCIFchem@iucr.org, Distribution list of the IUCr COMCIFS Core Dictionary Maintenance Group <coredmg@iucr.org>
  • Subject: Approval of the CIF methods Dictionary Definition Language
  • From: David Brown <idbrown@mcmaster.ca>
  • Date: Wed, 05 Nov 2008 14:54:23 -0500
I am sending this email to a couple of COMCIFS discussion lists, and I apologize if you get it twice. Its purpose is to update you on developments in two areas and to give notice that I will be looking for advice on the matters mentioned below:

(1) the COMCIFS approval of the alpha version of the methods Dictionary Definition Language (DDLm)

(2) the request by COMCIFS to revive the discussion on providing a CIF description of the molecular, as opposed to the crystallographic, structure of a crystal, a topic that was discussed earlier but put on hold in 2005 pending the adoption of DDLm. 

In the first part of this email I describe the problems of defining a molecular structure in CIF. In the second part I analyze some of the issues that DDLm raises for the structure of CIF dictionaries. This email is for information only; you need take no action.

David Brown

++++++++++++++++++++++++++++++++++++++++++++++++++++++

A Chemical (Molecular) Description of a Crystal Structure
------------------------------------------------------------------------
A simple innocent question sometimes raises profound issues.  A number of years ago COMCIFS was requested by CCDC to provide a dataname and definition for Z', the number of crystallographically independent molecules in a crystal.  Before we could provide this we had to have a definition of a molecule.  Diffraction experiments tell us about electron density, but not about molecules.  We may see the shape of a molecule in the electron density but not everyone sees the same shape.  Although we can identify atoms in the electron density and we can measure the distances between them, we don't all agree on which are bonds, and even less do we agree on which atoms constitute a molecule.  As a result, a number of years ago COMCIFS established a special group (CoreCIFchem) to recommend how we should define a molecule in the CIF dictionary and how best to link the atoms in the molecule with the corresponding atoms in the asymmetric unit.

We quickly discovered that the problem was far from trivial, and in 2004 we turned for help to Peter Murray-Rust who has spent much of his career thinking about this problem.  I quote here his response to our trial set of CIF definitions.

I think you are addressing important points. There are several of them and they are complex. They have not all been fully or even partially solved elsewhere. The following points are taken from my 15 years' writing software for CIF and related systems and I hope they will be taken positively.

1. IMO a system has to be implementable. I adhere to the IETF motto "rough consensus and running code". I tried hard in the mid-90's and later at Syd's invite to implement the DDL system. It is far more difficult to implement than it appears on reading. In any such system there are lots of nuances which can be surprisingly difficult. I have now formally implemented a complete CIF DOM using the SAX/DOM/Infoset approach, but without dictionary control. Similarly I spent some time looking at mmCIF and again found that the task was large. mmCIF has been implemented, but has a considerable amount of dedicated resource. So do you have similar resource?

So if you wish software to *read and understand* this design it will need a lot of effort. Without prototyping the system you don't know whether the design is complete or self-consistent. Note the questions I have asked recently on COMCIFs - there are still several areas that I have no definitive answers to and I am sure this is likely to happen here. So I'm simply asking you to be aware of the size of the problem. IMO this was the problem with MIF [the Molecular Information File described in International Tables G] - it was a reasonable spec but no-one really implemented it.

Also who or what will generate [the files giving the descriptions of molecules and their relationship to the crystal structure]? I think the analysis [of the crystal structure] will require quite a bit of heuristic software, so presumably an author has to edit [the molecular description]. At  the least they will have to have an editing tool to keep the referential integrity for the pointers [linking the molecular atoms to the crystal atoms].

2. There are several concepts here that we are tackling with in CML [chemical mark-up language] and have not fully solved. I distinguish "design" where there is a formal spec and "implement" where it is actually shown to work or not. They include:
  - multiple conformations (designed but not implemented in software)
  - unique atom ids (designed and implemented, but possibly fragile)
  - levels of indirection (pointers) (designed but not widely implemented)
  - role of atomSets [collections of atoms forming all or a distinctive part of a molecule] (designed and partially implemented)
  - polymeric systems (not designed; far too difficult at present)

It is possible to come up with reasonable solutions but it is often unclear how well these will stand up to the variety of examples that will be exposed to it when the system is released. It almost certainly will require a redesign. That has happened for CML and I predict it will occur here.

As an example I am on an IUPAC group tackling stereochemical representation. It is a tough problem. If/when some consensus is reached all of that would ideally have to be implemented in CIF solely to help describe what the substance actually is.

3. My suggestion would be to be somewhat less ambitious and to engage the community in a structured program:
   - concentrate on molecular crystals (they are easier and the informatics/representation is better understood)
   - get the authors actually to submit the chemical structures (if this doesn't happen then the design won't be tested)
   - anticipate the problems of mapping crystallographic atoms onto chemical connection tables. These are primarily:
           - symmetry
           - disorder
           - unreported atoms

Around the time when we received this advice from Peter we decided to defer any further discussion pending the adoption of DDLm as we thought that dictionaries written in DDLm might simplify our task.  That was in January 2005.  This project has since been dormant, but as an alpha version of DDLm has now been approved by COMCIFS the time has come to dust off this project.  In the light of Peter's comments we might wish to consider whether the definition of a molecule is even an idea worth pursuing.

Development of DDLm (Dictionary Definition Language (methods))
-----------------------------------------------------------------------------------
DDLm is a language for writing CIF dictionaries. It introduces a number of new features relevant to this discussion. The first and most important is that DDLm is backward compatible with all existing CIFs. Programs using CIF dictionaries written in DDLm will be able to read all existing CIFs (small cell and macromolecular). Further, they will be able to interpret them using the advanced DDLm features. These include 'methods', i.e., executable algorithms embedded in the CIF dictionary that define how the value of an item can be calculated from other items that may be present in a CIF. DDLm also makes it easy to generate a virtual run-time dictionary by combining all or parts of a number of existing dictionaries.

DDLm was prepared by Syd Hall and Nick Spadaccini, who demonstrated a proof-of-concept version of a core dictionary at the Glasgow Congress in 1999. At the Florence Congress in 2005, COMCIFS agreed to evaluate DDLm and see if it would be suitable for the crystallographic community. James Hester has carried out this evaluation and at the Osaka Congress in 2008 he recommended that, with some minor modifications, it should be adopted. COMCIFS duly approved this as the alpha release of DDLm. Details are available on the IUCr web site.

As part of Nick and Syd's proof-of-concept demonstration in 1999, they prepared a small CIF dictionary which is serving as the basis of the DDLm coreCIF dictionary that will be submitted to COMCIFS for adoption. In the course of this development a number of philosophical questions have arisen. These were discussed by COMCIFS in Osaka and the following principles were agreed on:

1. Items will be classified as basic or derived.  Basic items are those that are experimentally measured (e.g., measured density) or assigned (e.g., space group).  These are items that cannot be derived from other CIF items.  Derived items are those that can be calculated if the appropriate basic items are present.

2. For derived items, the method (i.e., the equation used to calculate the value of the item) given in the dictionary will be the primary definition and will take precedence over the text definition. Therefore only one method may be specified and this takes precedence over any alternative, but equivalent, ways of calculating the value of the item. This requires that there be a consensus on the best method of calculation to use, for example: which coordinate system should be used in the calculation?

3. A consequence of (2) is that the derived items must follow a hierarchy, with the basic items forming the foundation and the derived items defined from various basic items following a unique route. Requesting an item may initiate a cascade of calculations, e.g., calculated density <--- cell volume <--- cell constants. Thus the request for a particular item may populate the CIF with various intermediate items. These intermediate items may, for computational convenience, duplicate information already appearing in the CIF. In previous CIF dictionaries this has been discouraged by providing only one way of presenting a given piece of information. What are the implications of this development?
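
The cascade described in (3) can be illustrated with a minimal Python sketch. The data names are taken from the existing core CIF dictionary, but everything else is an assumption made for illustration: the METHODS table and the request() function are hypothetical, the CIF is modelled as a plain dict, and the Python functions merely stand in for dictionary methods (they are not written in the DDLm method language). The cell values are illustrative numbers for a cubic cell.

```python
import math

# Avogadro's number times 1e-24 cm^3 per cubic angstrom, so that
# Z * M / (factor * V) gives a density in g/cm^3 when V is in A^3.
AVOGADRO_PER_A3 = 0.602214076


def cell_volume(get):
    """Stand-in method: cell volume from the six cell constants."""
    a, b, c = (get("_cell_length_" + axis) for axis in "abc")
    ca, cb, cg = (
        math.cos(math.radians(get("_cell_angle_" + angle)))
        for angle in ("alpha", "beta", "gamma")
    )
    return a * b * c * math.sqrt(
        1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg
    )


def density(get):
    """Stand-in method: calculated density from Z, formula weight, volume."""
    z = get("_cell_formula_units_Z")
    weight = get("_chemical_formula_weight")
    return z * weight / (AVOGADRO_PER_A3 * get("_cell_volume"))


# Hypothetical method table: exactly one method per derived item.
METHODS = {
    "_cell_volume": cell_volume,
    "_exptl_crystal_density_diffrn": density,
}


def request(cif, name):
    """Return an item, deriving and storing it (and its inputs) if absent."""
    if name in cif:
        return cif[name]  # an existing value is trusted as it stands
    if name not in METHODS:
        raise KeyError(name + " is a basic item and is missing from the CIF")
    value = METHODS[name](lambda dependency: request(cif, dependency))
    cif[name] = value  # the CIF grows as a side effect of the request
    return value


# Only basic items are present to begin with (illustrative values).
cif = {
    "_cell_length_a": 5.64, "_cell_length_b": 5.64, "_cell_length_c": 5.64,
    "_cell_angle_alpha": 90.0, "_cell_angle_beta": 90.0,
    "_cell_angle_gamma": 90.0,
    "_cell_formula_units_Z": 4,
    "_chemical_formula_weight": 58.44,
}

rho = request(cif, "_exptl_crystal_density_diffrn")
```

After the single request for the density, the dict also contains _cell_volume, which is the intermediate item populated by the cascade.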

The current versions of CIF are like a language that contains only nouns and adjectives.  It is great for describing a static object like a crystal structure.  It gives a description that can be published, archived, retrieved and examined.  Adding Methods is like adding verbs to this language.  With them, a CIF can grow, triggered into adding derived items by a simple request.  The relation between user and CIF becomes interactive.  In this scenario how does one ensure that the derived items are all updated when a basic item is changed?  Changing the lattice parameters does not automatically update the cell volume, and calculating the density will not necessarily force an update of the cell volume if the CIF already contains an earlier (obsolete) value.  Murphy's Law (in its correct original version) states that 'if something can happen, sooner or later it will happen'.  We need to anticipate the different ways in which a DDLm CIF might be used.
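
The stale-value problem can be made concrete with a short sketch. It is only one conceivable safeguard, not something specified by DDLm or agreed by COMCIFS: the DEPENDS_ON table, dependents() and set_basic() are hypothetical names, and the CIF is again modelled as a plain Python dict with illustrative values. The idea is that changing a basic item discards every derived item that was calculated from it, so that the next request has to recalculate it.

```python
# Hypothetical dependency table: derived item -> items its method reads.
DEPENDS_ON = {
    "_cell_volume": [
        "_cell_length_a", "_cell_length_b", "_cell_length_c",
        "_cell_angle_alpha", "_cell_angle_beta", "_cell_angle_gamma",
    ],
    "_exptl_crystal_density_diffrn": [
        "_cell_volume", "_cell_formula_units_Z", "_chemical_formula_weight",
    ],
}


def dependents(name):
    """All derived items that depend, directly or indirectly, on name."""
    found = set()
    stack = [name]
    while stack:
        current = stack.pop()
        for derived, inputs in DEPENDS_ON.items():
            if current in inputs and derived not in found:
                found.add(derived)
                stack.append(derived)
    return found


def set_basic(cif, name, value):
    """Change a basic item and drop every derived item that is now stale."""
    cif[name] = value
    for derived in dependents(name):
        cif.pop(derived, None)


# A CIF that already holds derived values from an earlier calculation.
cif = {
    "_cell_length_a": 5.64,
    "_cell_volume": 179.41,
    "_exptl_crystal_density_diffrn": 2.16,
    "_cell_formula_units_Z": 4,
}

# Writing cif["_cell_length_a"] = 5.70 directly would leave the old volume
# and density in place; set_basic() removes them instead.
set_basic(cif, "_cell_length_a", 5.70)
```

With a plain assignment the obsolete volume would survive and a later density request would silently reuse it, which is exactly the situation described above.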

This email is intended to alert you to some of the problems that lie ahead. I intend to use these discussion lists as a sounding board for specific problems as I work on the development of the DDLm coreCIF dictionary. I look forward to receiving your comments on the points I raise, particularly warnings if we seem to be straying in dangerous directions.

Watch this space.

David Brown
begin:vcard
fn:I.David Brown
n:Brown;I.David
org:McMaster University;Brockhouse Institute for Materials Research
adr:;;King St. W;Hamilton;Ontario;L8S 4M1;Canada
email;internet:idbrown@mcmaster.ca
title:Professor Emeritus
tel;work:+905 525 9140 x 24710
tel;fax:+905 521 2773
version:2.1
end:vcard
