Restraints; *_intro; etc.

To: [email protected]
Subject: Restraints; *_intro; etc.
From: [email protected] (Brian McMahon)
Date: Thu, 30 Sep 93 11:43:31 BST
Dear Colleagues

Thanks for your various comments on the matters previously raised. I'll
summarise the discussion so far (in order to keep our communications
rolling); the correspondence is not closed on these issues if you have
any further points to make! In the summaries below, remarks from David
are flagged D>, from Syd as S> and from Paula as P>. Unattributed comments
are "from a source close to the Coordinating Secretary" (!).

(1) Treatment of restraints and constraints
-------------------------------------------
So far, two responses to George's remarks. Obviously, part of the difficulty
lies in reconciling specific techniques within individual programs with the
broader physical concepts that are admitted by the community as "universal".
It is possible within CIF to introduce local data names which, for example,
store the SHELXL-specific restraint conditions (though I seem to remember
George telling me that this was very difficult with the single loop level
permitted in CIF); but this is clearly undesirable if it thereby hides valuable
information away from other applications.

S> George, as usual, has raised a very thorny issue. It is this because, accord-
S> ing to my reading of the computational situation, there is not yet agreement
S> among the experts on how these processes should be handled or described.
S> Howard, in particular, should comment on this. 
S> 
S> This does not mean, however, that an attempt should not be made to try and 
S> reach a consensus on defining restraint and constraint data items. But all
S> of us must be concerned that, with a technique in its formative years, 
S> premature and inadequate definitions can (in a very short time) become the
S> bane of our and many people's lives! Paula and I have already had several 
S> long sessions on such matters as there are a number of such issues in mm 
S> work -- and there is great danger of one adopting data items which are 
S> specific to a piece of currently popular software rather than trying to
S> identify global and independent quantities which will be understood in
S> years to come.
S> 
S> One cannot emphasise too strongly that, because CIF's are used for archival 
S> purposes, definitions must be able to withstand the test of time, and have
S> a quite independent physical (or chemical or whatever) foundation. If this is
S> possible for constraints and restraints then an attempt should be made to
S> quantify them in a general way. Otherwise it would be better to stick to 
S> descriptive text for the time being. An ad hoc solution would certainly be
S> a disaster in the long run as it will force future developments in this field
S> to adhere to inadequate definitions.

D>    I do not have much experience in this field, but the queries illustrate
D> the difficulty that numeric electronic databases have with emerging
D> fields.  The problem appears to be: Do we write our dictionary to ensure
D> that any program can read these restraints and immediately repeat the
D> calculation, or do we write the dictionary so that a crystallographer can
D> figure out what was done and set the appropriate knobs by hand if s/he
D> wants to repeat the calculation.  My answer is firmly that at this stage
D> we should do the latter.  When the application of restraints has become
D> sufficiently routine, we can introduce fields that could be used by
D> another refinement program.  While I would not agree with George that this
D> is light years away (George may feel that he is being pushed around at the
D> speed of light), caution dictates that we do not start defining a range of
D> fields that will drop out of use within a few years (or light years?).

(2) Introductory sections within the dictionary
-----------------------------------------------
D>    If the problem is just one of the length of the field name, then we
D> could use _dict which is shorter than either _appendix or _intro.  It
D> specifies exactly that this is something to do with the dictionary but to
D> most users the name will be sufficiently opaque that they will not be
D> tempted to use this field for something else.

P> I am still settled on the idea of using *_mm_intro (and *_intro or whatever
P> elsewhere). I think you are correct that David's suggestions about potential
P> user misuse are dealt with in the dictionary by the _type declaration of 
P> null. We might think of adding a disclaimer to each *_intro definition,
P> though, explicitly stating that this is an informational item for the
P> dictionary, and that it is not to appear in data CIFs.  But users will not be
P> tempted into these mistakes because files using *_intro as data items will
P> not pass validation.

(Depends on the stringency of the validation process. CYCLOPS, for example,
would say that _blah_intro is a data name that appears in the dictionary,
and so is OK. Testing on _type is necessary here.)
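
A minimal sketch of the distinction, in Python. The in-memory dictionary
layout, the function name and the return strings are assumptions made for
illustration; this is not how CYCLOPS or any other existing tool works.

```python
# Hypothetical in-memory form of a dictionary: data name -> attributes.
DICTIONARY = {
    "_blah_intro": {"type": "null"},     # informational entry, not a data item
    "_cell_length_a": {"type": "numb"},  # ordinary numeric data item
}


def check_name(name, dictionary):
    """Classify a data name found in a data CIF.

    A lookup-only check would stop after the first test; the second test
    is the extra step of looking at _type.
    """
    entry = dictionary.get(name)
    if entry is None:
        return "unknown"
    if entry["type"] == "null":
        return "not-a-data-item"
    return "ok"


print(check_name("_blah_intro", DICTIONARY))
print(check_name("_cell_length_a", DICTIONARY))
```

A checker that only asks "is the name in the dictionary?" accepts both
names; only the _type test separates them.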

S> You will remember that none of us were really comfortable about _appendix
S> when it went into the original core definitions but this seemed like 
S> the only way to insert a summary about the chemical formulae. We may live
S> to regret this...on the other hand Paula and her gang may have refined the
S> idea into something more useful. Paula suggested to me two refinements
S> that should be considered: earmarking these summaries as _intro, _mm_intro,
S> _pd_intro according to the dictionary extension; or separating the summaries
S> into a quite separate file which would contain ONLY this information. I
S> believe both ideas are worthy of careful consideration. By the way I do
S> not agree that _intro vs _dict_def naming is a major issue -- such matters
S> can be clearly spelt out in the definition, and WE MUST ALWAYS REMEMBER 
S> that virtually all CIF's will be generated by software and programmers will
S> know about this!

OK, so there's not much dispute about this, though still quibbles over the best
terminology. [Who has the casting vote on this? The Chairman, whose view
differs from everyone else? :-) ]

Let me float a different idea here. One reason for me why '_appendix' is more
convenient than '_intro' is that it comes earlier in the alphabet, making it
easier to do a lexicographic sort [production dictionaries somehow are NEVER
quite in alphabetic order]. I was looking at ways of retaining this capability,
when it occurred to me that sorting is done on the data BLOCK name. So how
about this (the actual example need not be taken seriously):

data_atom_sites_[]                 # empty brackets for the Core dictionary
    _name      '_atom_sites_'      # note the trailing underscore
    _category  dictionary_definition
    _type      null
    _definition
; Data items in the _atom_sites_ category record details about the
  crystallographic cell and cell transformations common to all sites.
;

data_atom_sites_[mm]               # extension dictionary in brackets
    _name      '_atom_sites_'      # note the trailing underscore
    _category  dictionary_definition
    _type      null
    _definition
; In macromolecular work, a fractional transformation matrix is often
  employed in practice, and is added to the list of Core definitions.
;

Note that this convention ensures unique data block names if the dictionaries
are merged, can sort lexicographically before the individual data names,
presents identical _name's which indicate the category naming mechanism but
retain the trailing underscore (I can see that this may give rise to some
problems - comments on this?), has category "dictionary_definition" and _type
null, and allows concatenated definitions. And we don't have to fight over
intro/dict/appendix...
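
The sorting claim can be checked with a small Python sketch. The block names
other than the two bracketed ones are illustrative only, and the result
assumes a plain ASCII (code-point) sort of lower-case names: '[' sorts before
every lower-case letter and before '_', but after upper-case letters, and a
locale-aware sort may order things differently.

```python
# Block names with the leading "data_" removed; the two matrix names are
# illustrative stand-ins for individual data-name blocks in the category.
block_names = [
    "atom_sites_fract_tran_matrix_11",
    "atom_sites_[mm]",
    "atom_sites_cartn_tran_matrix_11",
    "atom_sites_[]",
]

# Python's default string sort compares code points, i.e. ASCII order here.
ordered = sorted(block_names)

for name in ordered:
    print("data_" + name)
```

With this ordering the Core introduction comes first, the mm introduction
second, and the individual data names follow.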

Here are Syd's thoughts on the possibility of having separate files containing
the introductory material. I believe that (b)Con_1 [the need to maintain yet
another separate file] may be the determining factor on this...

S> (a) _intro's imbedded in each dictionary
S>     Pro's: Useful summaries if one is scanning the dictionaries either
S>            manually or with a software browser.
S>     Con's: Because I think that there will always be multiple dictionaries
S>            updating cross-referenced material will always be a problem.
S>            Cross referenced summaries must have a tag "mm", "pd", etc.
S>            Summaries clutter up a definition dictionary.
S> 
S> (b) _intro's in their own summary dictionary
S>     Pro's: One file for all summaries avoids the need to cross-reference.
S>            Updating summaries MAY also be simpler for the same reason.
S>            Summaries will not get confused with actual definitions and do not
S>            require special avoidance actions by validation software.
S>     Con's: One more file to look after, and another file for the "browser"
S>            software to read.
S>            Perhaps the summaries will get overlooked when definitions are
S>            constructed.
S> 
S> OK, I just wanted to focus on what I see as the central issue here. With
S> use of "_type null" (not needed at all in option (b)) and with clear 
S> reminders in the _definition, I don't see the extension _intro etc. as 
S> being critical for most applications. Perhaps I am wrong. What do other
S> programmers think?

(3) Relationship of Core to extension dictionaries: one/many files?
-------------------------------------------------------------------
No major difference of opinion here:

P> I could be persuaded either way on this subject.  Working with a single dict-
P> ionary would make many things simpler, but it lays a trap for things becoming
P> unwieldy in the future.  Who knows how many extension dictionaries there will
P> end up being.  I guess if forced into a straight up-or-down vote, I will go
P> for the present core/extension model.

S> I believe that the way we are currently proceeding is correct. Extension
S> dictionaries will always be needed as CIF's are applied to more specialised
S> fields. Occasionally data items defined in these dictionaries will be
S> identified as being better suited for general use and therefore should 
S> reside in the core. This avoids proliferation of very similar definitions.
S> 
S> The mm and pd groups unearthed a number of extra general data items. This
S> was pretty much expected with the initial extensions -- the core group was
S> well aware that the initial definitions had gaps. But these gaps will get
S> fewer with future extensions and this won't be such an issue then.
S> 
S> Concomitantly I believe that future software using the dictionaries, for
S> validation or whatever, must be able to read and store a succession of 
S> nominated dictionaries (eg. cifdic.c93 + cifdic.m93 + etc.). The CIFtbx
S> software already has this facility, but cyclops as yet does not (easily
S> achieved though by cat'g them together as a super dictionary).

D>    I am inclined to the view that we should keep specialty applications
D> out of the core dictionary at the moment, but with the understanding that
D> any combinations of dictionaries could be combined.  This leaves open the
D> option of eventually producing a single dictionary.  Splitting a
D> single dictionary into specialist dictionaries at a later stage would be
D> more traumatic.  At the rate that dictionaries are growing, a full
D> dictionary is likely to become daunting for anyone trying to program
D> reading or writing a cif; most will be only too happy to have the subset
D> of the dictionary that suits their need.  In any case we will be
D> processing the dictionary in sections, so it is an easier matter to deal
D> with specialized sections.
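
As a rough illustration of reading "a succession of nominated dictionaries",
here is a Python sketch. It assumes each dictionary has already been parsed
into a mapping of data name to attributes; the function name, the mm data
name and the keep-the-first-definition policy are all assumptions, not a
description of CIFtbx or cyclops.

```python
def merge_dictionaries(*dictionaries):
    """Merge (source, definitions) pairs in the order given.

    Returns the merged mapping and a list of clashes, where a clash is a
    data name defined in more than one dictionary. The earlier definition
    is kept (an assumed policy, chosen only to make the sketch concrete).
    """
    merged = {}
    clashes = []
    for source, definitions in dictionaries:
        for name, attributes in definitions.items():
            if name in merged:
                clashes.append((name, merged[name]["source"], source))
                continue
            merged[name] = {"source": source, **attributes}
    return merged, clashes


# Hypothetical parsed contents of the two files named in Syd's example.
core = ("cifdic.c93", {"_cell_length_a": {"type": "numb"}})
mm = ("cifdic.m93", {
    "_mm_example_item": {"type": "char"},   # made-up extension item
    "_cell_length_a": {"type": "numb"},     # duplicate of a Core item
})

merged, clashes = merge_dictionaries(core, mm)
print(sorted(merged))
print(clashes)
```

Simply cat'ing the files together gives the same union of names, but without
the record of which names were defined twice.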

(4) _enumeration values and abbreviations
-----------------------------------------
Syd felt got at by my blaming him for establishing a bad precedent in allowing
'c' for 'calc', but claimed that he could not now be held responsible for
decisions he took as long ago as November 1991 - the statute of limitations in
Australia covers a very brief timescale!!

Everyone who replied so far is happy to permit 'y' for 'yes' and 'n' for 'no',
but with the caveat that this shouldn't become general practice. 

D> To B or not to B
D> ----------------
D>    Whether 'tis nobler to respect our language by requiring a full 'yes'
D> or 'no' should not detain us for long.  Brian M has given us an excellent
D> rationale for allowing truncation of key words to the minimum that gives an
D> unambiguous answer.  Y and N should present little problem, but there is a
D> possible conflict when many different key words are allowed since we may
D> want to add to the list of allowed keywords at some future date. Using 'c'
D> for 'calc' might be a problem if later we introduced a keyword 'coupled'. 
D> In the spirit of Brian's analysis, I would not prohibit this, but I would
D> discourage it for any case where the list of keywords could be extended.

P> The whole point of enumerating items is so identical forms are used.  If you
P> are going to allow y instead of yes, then y must be present in the enumer-
P> ation list.  The formalism that you suggest for listing c as 'synonym for
P> calc' sounds just right.
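
Both points can be sketched in Python. The keyword list ["calc", "meas"] is a
made-up enumeration, and the function names and the synonym table are
assumptions for illustration only.

```python
def prefix_matches(abbreviation, keywords):
    """Return every keyword that a truncated value could stand for."""
    return [k for k in keywords if k.startswith(abbreviation)]


# David's point: truncation is safe only while the keyword list is fixed.
keywords = ["calc", "meas"]                      # hypothetical list
today = prefix_matches("c", keywords)            # one match: unambiguous
later = prefix_matches("c", keywords + ["coupled"])  # two matches: ambiguous

# Paula's point: list the short forms explicitly as synonyms instead.
YES_NO = {"yes": "yes", "y": "yes", "no": "no", "n": "no"}


def normalise(value, enumeration):
    """Map a listed value to its full form; reject anything not listed."""
    if value not in enumeration:
        raise ValueError(f"{value!r} is not in the enumeration list")
    return enumeration[value]


print(today, later)
print(normalise("y", YES_NO))
```

With explicit synonyms, adding 'coupled' later cannot change the meaning of
'c', because 'c' is tied to 'calc' in the list itself rather than resolved by
prefix matching.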

(5) Good practice in handling 'invalid' CIF data
------------------------------------------------
As a separate point, I append Paula's response to my remarks on how to handle
invalid data items. Probably everyone will be in agreement with the general
sentiments here. However, 'validation' is to some degree an application-
dependent concept, and there are going to be clashes between the way we
try to implement a 'universal archive file' and the need to validate for
specific programs or databases.

There is no specific bone of contention here, and I don't think it's worth
developing this discussion thread at the moment; but these early comments
may be worth remembering at a later stage.

P> On the general point of what a CIF-processor should do with invalid values
P> - I agree that aborting and core dumping is extreme, but neither should
P> the problem be ignored.  There is no point in putting enumeration values
P> (numerical or text) in the dictionary if we don't mean to use them.  A file
P> should not pass validation if the file does not conform to the dictionary.
P> Of course, some conformance problems are because there are problems in the
P> dictionary, but the correct thing to do is to fix the problems, not to allow
P> the transgressions.
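
One way to sit between "abort" and "ignore" is to collect every problem and
fail validation at the end. A minimal Python sketch follows; the data name
_example_flag, its enumeration and the function name are hypothetical.

```python
def validate(items, enumerations):
    """Check each value against its enumeration list, if it has one.

    Returns a list of problem descriptions; an empty list means the file
    passes. Nothing is raised, so one bad value does not hide the others.
    """
    problems = []
    for name, value in items.items():
        allowed = enumerations.get(name)
        if allowed is not None and value not in allowed:
            problems.append(
                f"{name}: value {value!r} not in enumeration {sorted(allowed)}"
            )
    return problems


enumerations = {"_example_flag": {"yes", "no"}}          # hypothetical
data = {"_example_flag": "maybe", "_cell_length_a": "10.2"}

problems = validate(data, enumerations)
for line in problems:
    print(line)
print("PASS" if not problems else "FAIL")
```

Items without an enumeration list are passed through here; whether they
should be checked in some other way is left to the application.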

(6) Matters arising
-------------------
On topic (1): Howard, we would like to hear from you!
On topic (2): responses to my suggestion for handling the introductory sections.
              If this becomes too technical, we can move it off the general
              discussion list; but I'd be interested in all your reactions.
On topic (4): 'y' and 'n' to be ADDED to the enumeration lists where appropriate
              (and also 'c' for 'calc') [Syd: will you do this as the current
              Core dictionary maintainer-in-chief?]

Regards
Brian