(10) STAR changes, DDL, dataname character sets

To: [email protected]
Subject: (10) STAR changes, DDL, dataname character sets
From: [email protected] (Brian McMahon)
Date: Thu, 4 Nov 93 15:57:02 GMT

Dear Colleagues

First, a few follow-on points from earlier topics.

D4.1 Restraints
---------------
B> My purpose for use of _local_ would be that if one uses such entries
B> one can expect that they will be ignored (or even stripped) when the
B> file is "exported." Otherwise it will not be clear if one sees an
B> entry beginning _tnt_ if that entry contains comments from 
B> T. N. Thompson for his/her own use, in advance or in ignorance of the
B> "allocation" of _tnt_ to Dale. I see strong reasons for allocating
B> prefixes to commonly distributed programs. This is a bridge until, with
B> better vision, we can see how information in these items is better
B> included in the dictionary. There is a need for a register so that one
B> can know that a prefix is in use to avoid misappropriating it (I say
B> this after devoting a day to renaming a computer program and all menu
B> and documentation references). However, I see no reason for allocating
B> a laboratory or group a "private" prefix. If the information will have
B> a meaning and value to others, it should be a candidate for inclusion
B> in the dictionary. If not it will be of only _local_ value.

D4.2. Introductory sections
---------------------------
B> Couple quick points: How about vertical lines or even
B> bullets at the margins instead of horizontal lines for _.intro
B> material? (If I use it enough perhaps it will become custom). It should be
B> fine for text that spans columns and I think that TeX will accommodate
B> it fairly easily. No objection to leather or parchment either.

D8.1 Comments
-------------
B> I agree with Paula's comment that anything important that belongs in
B> the file should not be in comments. (Though I don't think this is
B> applicable for files that are prepared as templates -- for human
B> reading). It is better if software retains comments, but this is not a
B> requirement in the standard and thus a file should not lose validity
B> through loss of comments. One of the additions that I plan to make to
B> Powder CIF, when the chance comes, is several _comments entries
B> for information that just does not fit anywhere else.

S> From the outset I have been against "machine parsing" # comments. A comment
S> is a "visual cue" and has little to do with the data. If information about
S> the data needs to be retained and accessed with query tools, it must be a 
S> data item. I am VERY concerned about ad hoc comments replacing real data.

=============================================================================
Now I wish to introduce a few new topics for discussion. Some of these arise
directly from the Tarrytown Workshop meeting, and I hope the previous
circular gave you some of the background to why they are relevant. Some may
be easily resolvable, but may still merit a public airing on the record.
Others touch on fundamental changes to the CIF standard - arguably there
shouldn't be any (for a standard!), but the formalising of the DDL and some
other matters are of this nature. It is best to work through the implications
of these changes before the next major round of publications, which will set
in granite such changes as have been made, good or bad.

D10.1. Modifications of STAR
----------------------------
Since the publication of the original CIF paper, Syd (and Nick Spadaccini)
have made some modifications to the basic STAR syntax on which CIF is based.
Mostly (such as changes to the termination of nested loops and save
frames) these have no bearing on the subset of syntactic rules to which CIF
conforms. However, one change has a direct effect, and one a potential impact
on CIF. I'll discuss these below as two separate items, to facilitate
reference to the points under discussion.

Meanwhile, Brian Toby has made the following suggestion:
B> Over time I can foresee greater demand to relax the CIF restrictions on
B> the STAR syntax (nested loops for example). Rather than fight this on every
B> front, I suggest we plan and announce a date (at least a few years off) when 
B> different restrictions will be dropped. This will give software designers a
B> chance to prepare for the future, and will end the debate for now.


D10.2. Privileged constructs in STAR
------------------------------------
In the version of the STAR paper now in press, special meanings are assigned
to the characters '?' and '.' if they appear as data values. The '?' means
that the value of a particular data item IS NOT KNOWN. The '.' means that the
value of the data item IS INAPPLICABLE. A couple of quick examples (from MIF)
should illustrate this:

loop_ _atom_node _atom_identity   1  C  2  C  3  ?  4  C  5  ?  6  N
  (we don't know the identities of atoms 3 and 5)

loop_ _atom_site_label _attached_atom_label  C22 H22 C23 C11 O24 . C25 H25
  (nothing is attached to O24)

If this is a valid STAR extension, the same meanings should carry over to the
CIF case, and this must be stated in publication. 
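
A minimal sketch in Python of how software might act on these two privileged
values, using the MIF examples above. The function names (interpret,
read_loop) are hypothetical, and the sketch assumes the loop has already been
split into unquoted tokens; it does not address whether a quoted '?' or '.'
should be treated differently.

```python
from typing import Dict, List, Optional, Tuple

UNKNOWN = "?"       # value of the data item IS NOT KNOWN
INAPPLICABLE = "."  # value of the data item IS INAPPLICABLE


def interpret(token: str) -> Tuple[str, Optional[str]]:
    """Classify one raw data value token (hypothetical helper)."""
    if token == UNKNOWN:
        return ("unknown", None)
    if token == INAPPLICABLE:
        return ("inapplicable", None)
    return ("value", token)


def read_loop(names: List[str], tokens: List[str]) -> List[Dict[str, Tuple[str, Optional[str]]]]:
    """Pair a flat list of loop tokens with the loop's data names, row by row."""
    width = len(names)
    if width == 0 or len(tokens) % width != 0:
        raise ValueError("token count is not a multiple of the number of names")
    return [
        {name: interpret(tok) for name, tok in zip(names, tokens[i:i + width])}
        for i in range(0, len(tokens), width)
    ]


# First example: which atom identities are not known?
nodes = read_loop(["_atom_node", "_atom_identity"],
                  "1 C 2 C 3 ? 4 C 5 ? 6 N".split())
unknown_nodes = [row["_atom_node"][1] for row in nodes
                 if row["_atom_identity"][0] == "unknown"]

# Second example: which atoms have nothing attached?
sites = read_loop(["_atom_site_label", "_attached_atom_label"],
                  "C22 H22 C23 C11 O24 . C25 H25".split())
unattached = [row["_atom_site_label"][1] for row in sites
              if row["_attached_atom_label"][0] == "inapplicable"]
```

Here unknown_nodes picks out atoms 3 and 5, and unattached picks out O24.
Note that the classification lives entirely in interpret(): the tokenising
step treats ? and . like any other string.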

However, I have a quibble. The STAR definition describes the syntax of an
archive file structure. The special meanings attributed to '?' and '.' are
*semantic* (that is, STAR permits any character string at this location:
syntactically ? and abcdef are equivalent). Should not the special meaning
be taken out of the STAR specification and included in the MIF definition
(and, if appropriate, in CIF)? I'm aware that the question runs deeper than
it may seem, for this change arose from development of the star_base tool for
handling arbitrary STAR files, and the special semantics of these privileged
characters is important in determining what can be extracted from the data,
and how it may be presented.

So: Syd (and Nick) - *must* these privileged values be enshrined in STAR?
Or could they be extensions to the MIF and CIF instantiations of STAR? And,
if the latter, should they in fact be extended to CIF?

D10.3 Global data assignments
-----------------------------
The other significant change to STAR was the introduction of a global_ block,
which allows the assignment of global values to data items. These will be
inherited in subsequent data blocks to the end of the file. global_
statements may occur anywhere in the file that a datablock may be declared,
and have, as it were, a cumulative effect. This could have obvious use in a
CIF which reported several refinement results of the same structure, or
several structures handled in the same laboratory environment, or such like.
global_ has already been introduced into the current draft dictionaries to
specify default values.
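
The cumulative inheritance described above can be sketched in Python as
follows. This is an illustration only: the file is assumed to be already
parsed into (kind, name, items) tuples, resolve_blocks is a hypothetical name,
and the rule that a value given inside a data block overrides an inherited
global value is an assumption, not something stated in the STAR paper as
quoted here.

```python
from typing import Dict, List, Optional, Tuple

# (kind, name, items): kind is "global" or "data"; name is None for global_
Block = Tuple[str, Optional[str], Dict[str, str]]


def resolve_blocks(blocks: List[Block]) -> Dict[str, Dict[str, str]]:
    """Return the effective items of each data block, globals included."""
    active_globals: Dict[str, str] = {}
    resolved: Dict[str, Dict[str, str]] = {}
    for kind, name, items in blocks:
        if kind == "global":
            # Each global_ block adds to what has accumulated so far.
            active_globals.update(items)
        elif kind == "data":
            effective = dict(active_globals)
            effective.update(items)  # ASSUMPTION: local values win
            resolved[str(name)] = effective
        else:
            raise ValueError(f"unknown block kind: {kind!r}")
    return resolved


example: List[Block] = [
    ("global", None, {"_diffrn_ambient_temperature": "293"}),
    ("data", "refinement_1", {"_refine_ls_R_factor_obs": "0.041"}),
    ("global", None, {"_diffrn_radiation_type": "MoKa"}),
    ("data", "refinement_2", {"_diffrn_ambient_temperature": "120"}),
]
result = resolve_blocks(example)
```

In this sketch refinement_1 inherits only the temperature, while refinement_2
inherits the radiation type and supplies its own temperature. The values shown
are invented for illustration.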

There are objections that can be made. First, it makes the processing of the
CIF data more complex: a parser needs to store all global values and be able
to supply them if the relevant data items are requested from data blocks
later in the file. Second, it is not in general legitimate to concatenate
CIFs that may contain global_'s, as defaults are carried over into totally
unrelated data sets. Heretofore, CIF data has been considered local to its
containing data block (though Chester has agreed some local conventions with
the Cambridge Data Centre in data block naming to group related data blocks,
and Syd, Paul Edgington and I have looked at other aspects of this problem
before). One may argue that CIF's are fundamental objects, and should not be
concatenated - but Cambridge prefers that we send them a large multi-block
file rather than many smaller files; while we have previously discussed the
merging of data dictionaries in terms of a simple concatenation.

One possible escape is to introduce (to STAR - for "global_", as a keyword
token, is certainly bound to STAR at the syntactic level) another special
keyword ("unglobal_" or "global_null_" ?) that frees all previously bound
data values. This unglobal_ block could then be slipped in as the filling in
the sandwich when CIFs are concatenated.
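
To make the concatenation problem and the proposed escape concrete, here is a
rough Python sketch. The "unglobal" block kind stands in for the suggested
keyword, which does not exist in STAR; the tuple representation, the function
names, and the data values are likewise assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Block = Tuple[str, Optional[str], Dict[str, str]]  # (kind, name, items)


def resolve(blocks: List[Block]) -> Dict[str, Dict[str, str]]:
    """Apply accumulated global values to each data block in file order."""
    active: Dict[str, str] = {}
    out: Dict[str, Dict[str, str]] = {}
    for kind, name, items in blocks:
        if kind == "global":
            active.update(items)
        elif kind == "unglobal":  # HYPOTHETICAL keyword: free all bound values
            active.clear()
        elif kind == "data":
            out[str(name)] = {**active, **items}
        else:
            raise ValueError(f"unknown block kind: {kind!r}")
    return out


def concatenate(first: List[Block], second: List[Block],
                separator: bool = True) -> List[Block]:
    """Join two parsed files, optionally with an unglobal block in between."""
    middle: List[Block] = [("unglobal", None, {})] if separator else []
    return first + middle + second


file_a: List[Block] = [
    ("global", None, {"_diffrn_ambient_temperature": "293"}),
    ("data", "a1", {"_cell_length_a": "5.0"}),
]
file_b: List[Block] = [("data", "b1", {"_cell_length_a": "7.1"})]

leaky = resolve(concatenate(file_a, file_b, separator=False))
clean = resolve(concatenate(file_a, file_b))
```

Without the separator, block b1 silently acquires the temperature declared in
the unrelated file; with it, b1 contains only its own items.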

Syd has already had a brief foretaste of this point, and gave a quick
response:

S> After discussing this with Nick and reading the relevant communiques, I think
S> that _global should not be present in CIF's except for the CIF dictionaries.
S> It is a concept that is not inherent in the CIF approach (whereas it is a 
S> primary concept for the development of MIF). It will certainly complicate
S> merging CIF's (and it MAY complicate merging CIF dictionaries!).

Trouble with global_'s in dictionaries only is that you still need to build a
parser to handle the concept; and if you have that, why not use it on the
CIFs? [Recall that the mmCIF applications gurus are looking at reading in a
DDL dictionary, parsing it and validating it *against itself*, then loading a
data dictionary and validating against the DDL, then loading the CIF and
validating against the data dictionary.] 

However, I suspect that the mmCIF folk would accept this solution, so long as
it is explicitly declared that global_'s occur only in dictionaries, never in
CIFs. Just seems inelegant. And it still leaves the problem of how to combine
different dictionaries, each with global_ blocks.

Another possibility, perhaps, would be to remove this from the syntactic
level, and introduce (in CIF or MIF) the semantic convention that
"data_global_n" is a data block whose contents act as would those in a global_
block (the _n is an arbitrary string to ensure uniqueness of multiple global
data block names). I realise at once that this would rob star_base of its
ability to handle global data. Comments?


D10.4. The DDL
--------------
Here I want to discuss the Dictionary Definition Language in a fairly general
way. The historical background to this is that the original CIF Dictionary
was submitted for publication as an English-language document. While in
press, it was modified to use the DDL concept that had been devised by Syd
and Tony Cook, who is closely involved in MIF. This had immediate value, even
in typesetting the dictionary (ciftex handled it as though it were a CIF),
and in allowing software to validate data against the dictionary definitions
(Paul Edgington supplied us with a nice Fortran program to do this, and
star_base takes it in its stride). At this stage the DDL terms used in the
CIF core dictionary were listed (in comments!) at the end of the dictionary.
The York meeting sparked off an extension to the existing DDL types to
describe data interrelationships, and Syd formalised this by defining the DDL
in a DDL dictionary (written, of course, in DDL). To bring matters up to
date, the recent CIFtools meeting is fairly happy with the status quo, but is
now seeking to explore DDL extensions that may be implemented through
subsidiary DDL dictionaries (in the way that CIF data may be defined through
extension dictionaries).

Syd's recent comment to me on this last point was:

S> In reality I can't see us stopping local DDL "extensions" if applications
S> people want to do it. They will not be recognised by "std" STAR software
S> such as Star_Base etc.

This seems fine - it's OK for extensions to be ignored by software not in
the know, just as it's OK to skip data names not found in the dictionary.

Now, my question is: should the "core" DDL dictionary be the version that Syd
is working on for his defining paper with Tony Cook, that covers both CIF and
MIF requirements, or should there be separate DDL dictionaries for MIF and
for CIF? And, if the latter, should COMCIFS preside over the CIF core DDL?

[I'm not suggesting we try to steal any of the credit for the DDL that is
properly Tony's and Syd's: it is essential that they publish their work
describing the theoretical basis of DDL, and certainly appropriate that they
should be free to present a working language in their paper. But even in the
current draft, there are entries that are inappropriate to CIF (in CIF,
_list_level is meaningless (or invalid) if greater than 1), while _type might
usefully have a different enumeration list in other disciplines, and likewise
the conventions in _esd may not be applied universally across all fields.]

D10.5 Change in the definition of _category
-------------------------------------------
Whether this topic continues to be debated in this forum depends somewhat on
whether the Committee does undertake a closer involvement with DDL; but I
think these comments are germane to an issue which is very important to the
mmCIF'ers in particular. [By the way, if anyone doesn't have a copy of the DDL
dictionary of 5 August and would like one, please let me know.] I think the
description of this point given in my previous mailing should have set the
scene adequately.

From Brian T., regarding the powder extensions:
B> I have a big problem with the latest DDL -- the requirement that items
B> in different categories be forced to be in different loops creates
B> real havoc for the pd and the core dictionaries. This means, for
B> example, that anisotropic temperature factors CANNOT be in the same
B> loop with atom coordinates. I would doubt that many of the CIFs at the
B> IUCr comply with this restriction. This comes up in several places in
B> the PD dictionary. For example, I would prefer to see the
B> experimental and processed data included in a single loop, unless the
B> number of data points is changed. I understand that I am at odds with
B> how the relational database folks feel but it is a trivial job for
B> them to break up a loop and introduce pointers between them. It is
B> very important for crystallographers to see the connection maintained
B> between items that are often, but not always, linked on a one-to-one
B> basis. I really cannot live with this restriction.

And from Syd:
S> Another approach to _category which Nick now agrees with and would keep many
S> (and especially BT) happy, is that the _category string MUST be identical for
S> all items in a list (the original concept) BUT there may be more than one
S> list with the same _category value (new concept) PROVIDED all of the other 
S> dependency requirements are satisfied. For example the _atom_site_ data and 
S> the _atom_site_aniso_ data would be in _category "atom_site". This would
S> allow this data to be combined into one list with _atom_site_label as the
S> reference, or entered as two lists with _atom_site_label and ..._aniso_label
S> as the respective reference values. I think that this is a much simpler and
S> practical approach; it retains the original intention of _category but allows
S> for those special cases where you may need to split a list into separate
S> parts. I believe the related _list_ dependency values can be assigned to
S> allow this to happen. 
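
The rule Syd proposes could be checked along the following lines. This Python
sketch uses a toy category table whose assignments follow his example (the
aniso items placed in category "atom_site"); it is not taken from any
published dictionary, loop_category is a hypothetical name, and the "other
dependency requirements" he mentions are not checked at all.

```python
from typing import Dict, List

# Toy excerpt: data name -> _category, following the example in Syd's note.
CATEGORY: Dict[str, str] = {
    "_atom_site_label": "atom_site",
    "_atom_site_fract_x": "atom_site",
    "_atom_site_aniso_label": "atom_site",
    "_atom_site_aniso_U_11": "atom_site",
    "_cell_length_a": "cell",
}


def loop_category(names: List[str], category: Dict[str, str] = CATEGORY) -> str:
    """Return the single category shared by a list; raise if they differ."""
    found = {category[name] for name in names}  # KeyError if name is undefined
    if len(found) != 1:
        raise ValueError(f"list mixes categories: {sorted(found)}")
    return found.pop()


# One combined list, referenced by _atom_site_label: allowed.
combined = loop_category(
    ["_atom_site_label", "_atom_site_fract_x", "_atom_site_aniso_U_11"])

# Two separate lists with the same category: also allowed under the new concept.
split = [
    loop_category(["_atom_site_label", "_atom_site_fract_x"]),
    loop_category(["_atom_site_aniso_label", "_atom_site_aniso_U_11"]),
]

# A list mixing atom_site and cell items is rejected.
try:
    loop_category(["_atom_site_label", "_cell_length_a"])
    mixed_rejected = False
except ValueError:
    mixed_rejected = True
```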

D10.6 Restrictions on character sets in data names?
---------------------------------------------------
Here are some edited comments from a correspondence with Peter Murray-Rust.
The cry is heartfelt! Note that there are various prejudices in his 
discussion - Unix filesystems, tcl syntax. Different operating systems and
languages will all have their own special characters - what to do? Stick with
alphanumeric (and _) datanames? But other symbols can be so useful...

> ...if a CIF name could contain any character it was going to be
> impossible to parse it (or at least far too difficult for me and the users). 
> Thus we already have a name (ugh) of the form:
> 	_something_sint/lambda
> This will cause all sorts of parsers to break since they will assume the / is
> a divide or other delimiter.  OK, we quote it:
> 	set "_something_sint/lambda" 0.3
> That's OK until someone calls a variable something like:
> 	_my_"funny"_variable
> when the parser breaks again.  I can envisage someone devising a variable for
> sterling/dollar exchange rate as:
> 	_$/#
> (We have to use the # for pounds).  That will break almost anything! I'm not
> sure whether it's legal CIF as the # would also be a comment.
> 	In the long run, it's CERTAIN that someone will do something like
> this!  Just when the parser writers have disappeared.  IMO the only way to
> proceed is to limit the characters allowed in CIF NAMES to [a-z0-9_].  
> 	We have an existing problem.  There are a very few names in CIF with /
> and %.  (If we allow % as the modulus operation this will break a parser).  I
> would suggest we proceed as follows:
> 	Disallow any future horrors.  Before it's too late perhaps we can get
> the / and % out of MMCIF?  There are only two.  And (perish the thought) we
> can then stick the sint/lambda and the something_% as special cases and BEG
> the IUCr not to allow any others.  
> 	I could work with the following compromise:  All nasties are converted
> to a not-quite-so-nasty (e.g. @ for UNIX systems where it doesn't have many
> meanings).  Then the parser continues.  At the end it looks up in a table what
> horror to stick back in the name.
> 	The sint/lambda is particularly unfortunate since it is so nice to be
> able to auto-parse the name into subdirectories.  Everything works except this.
> I know the CIF dic will never change, but couldn't we suggest it be changed
> to something like _stol (which is used elsewhere)?

> 	...generally we're going to *have* to work communally on the namespace.
> There are going to have to be allowed and disallowed characters in it.
> 	I have managed (with a little difficulty) to get [] parsed in tcl
> (since [] represent the "command" option in that language!).
> 	I *really* hate the use of / in names!  It's so convenient to be able
> to run the whole dic through a parser and create files for each block - and
> it's only sint/lambda that crashes it!  I *think* I can cope, but it's a pity.
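
Peter's two suggestions (restricting names to [a-z0-9_], and the compromise of
swapping each awkward character for a placeholder and restoring it from a
table afterwards) might look roughly like this in Python. The function names
are hypothetical, letters are compared without regard to case (an assumption
of this sketch), and the table here records character positions, which is one
possible reading of "looks up in a table".

```python
import re
from typing import List, Tuple

SAFE_CHARS = set("abcdefghijklmnopqrstuvwxyz0123456789_")
SAFE_NAME = re.compile(r"_[a-z0-9_]+")  # leading underscore, then safe chars


def is_safe(name: str) -> bool:
    """True if the data name uses only [a-z0-9_] (case ignored)."""
    return SAFE_NAME.fullmatch(name.lower()) is not None


def escape(name: str, placeholder: str = "@") -> Tuple[str, List[Tuple[int, str]]]:
    """Replace each unsafe character; return the new name and a restore table."""
    table: List[Tuple[int, str]] = []
    chars: List[str] = []
    for position, ch in enumerate(name):
        if ch.lower() in SAFE_CHARS:
            chars.append(ch)
        else:
            chars.append(placeholder)
            table.append((position, ch))  # remember what to put back, and where
    return "".join(chars), table


def restore(escaped: str, table: List[Tuple[int, str]]) -> str:
    """Put the original characters back at their recorded positions."""
    chars = list(escaped)
    for position, ch in table:
        chars[position] = ch
    return "".join(chars)


original = "_something_sint/lambda"
escaped, table = escape(original)
round_trip = restore(escaped, table)
```

The escaped form "_something_sint@lambda" can be handed to a parser that
chokes on /, and restore() recovers the original name afterwards. Because the
table stores positions, the round trip still works if the placeholder itself
occurs in a name.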


---------------
Hope this isn't too indigestible at one sitting!

Regards
Brian