(12) Schedule; STAR extensions and naming conventions, etc.

To: [email protected]
Subject: (12) Schedule; STAR extensions and naming conventions, etc.
From: [email protected] (Brian McMahon)
Date: Fri, 12 Nov 93 15:25:41 GMT
Dear Colleagues

Agreements
----------
According to the procedural rules suggested by the Chairman in circular 5,
we can now formally accept the following agreements:

(5)A4.3  CIF dictionaries should be maintained as separate files, but in a
         manner compatible with merging for appropriate applications.

(5)A4.4  Enumeration lists may contain synonym terms, appropriately labelled.

(7)A5.1  Internal cross-references in this series should take the form (r)Xm.n

Obviously, this last convention is already in use!

Call for Agreement
------------------
David has suggested that some more discussion threads have come to their
natural end, and agreement is sought on the following matters:

A4.1    Major software suppliers should be allocated reserved data name
        prefixes (the set of such prefixes to be managed by COMCIFS). Users
        may additionally employ _local_ as a prefix in their own data names.
        (There is more discussion of this below).

A4.2    Introductory sections of the Dictionary should follow the same file
        syntax as data name definition sections, with the following
        conventions: the data block name takes the form data_xxxx_[], where
        the square brackets may contain an identifier of the dictionary, if
        it is not the Core; _name is likewise '_xxxx_[]'; _type is "null";
        and _category is "dictionary_definition". The _definition is a free
        text field describing the general characteristics of category xxxx.
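
As an illustration only, the A4.2 convention can be generated mechanically. This is a minimal Python sketch, not an agreed tool: the function name intro_block, its arguments, the indentation and the semicolon-delimited layout of the free-text field are all assumptions made for the example.

```python
def intro_block(category, definition, dict_id=""):
    """Build an introductory dictionary section for one category (per A4.2).

    dict_id goes inside the square brackets and stays empty for the Core.
    The layout (indentation, quoting, text-field delimiters) is an assumption.
    """
    tag = f"{category}_[{dict_id}]"
    return "\n".join([
        f"data_{tag}",
        f"    _name        '_{tag}'",
        "    _type        null",
        "    _category    dictionary_definition",
        "    _definition",
        ";",  # a semicolon in column 1 opens the free-text field
        definition.strip(),
        ";",  # and a second one closes it
    ]) + "\n"


print(intro_block("atom_site", "Data items describing atom sites."))
```

The sketch does not guard against definition text that itself contains a line beginning with a semicolon, which would end the text field early.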

A8.1    While comments may be of use within the CIF, applications software is
        not required to retain the comments.

If no counter views are expressed within a month, these proposals will be
considered agreed.

========== New topic:

D12.1  Schedule
---------------
D>    We are now under pressure to approve the various dictionaries that are
D> at or nearing completion and we have been asked for an estimate of the
D> time needed to approve the powder cifdic.
D> 
D>    You will recall that I proposed various stages in the approval process
D> to ensure that we get it right first time, knowing that mistakes cannot
D> afterwards be corrected.  These stages were:
D>  
D>       1. On receipt of a draft dictionary, the chair and secretary will
D> check it in conjunction with the chair of the drafting committee to ensure
D> that any obvious problems are corrected.
D> 
D>       2. The document is then circulated to comcif for comment, revision
D> and approval as a 'draft dictionary'.
D> 
D>       3. The 'draft dictionary' is then made available to any member of
D> the community wishing to see a copy and comment on it.  The availability of
D> the draft will be widely announced and sufficient time allowed for comment.
D> 
D>       4. At the expiry of the time for comment, the secretary and chair of
D> comcif, in consultation with appropriate parties, shall make such changes
D> as seem appropriate and submit the final version to comcif for approval
D> as the definitive document.  I do not, generally, expect much change
D> between the draft and definitive document, and changes to the data names
D> would be kept to the absolute minimum.  Those who cannot contain
D> themselves could start developing software on the basis of the draft
D> dictionary but should be aware that changes could occur.
D> 
D>    I propose the following timetable for the core extension, powder and mm
D> dictionaries:
D> 
D>                core extension      powder         mm
D> 
D> Phase 1        93.11               93.10          94.1
D> Phase 2        93.12               93.12          94.3
D> Phase 3        94.1-94.6           94.1-94.6      94.5-94.11
D> Phase 4        94.7                94.7           94.12
D> 
D>   It is not clear how much work has to be done on each file; ideally they
D> should be submitted ready to go, but I do not want to underestimate either
D> this committee's vulture-like capacity to pick the bones of a dictionary
D> or the amount of time that the committee members have to devote to this
D> exercise.  In the two months since Beijing we have raised more questions
D> than we have been able to solve.  Please let me have your comments on the
D> above timetable.

========= Earlier threads:

D4.1 Restraints
---------------
Here is David's summary and call for agreement on this topic:

D>    The question about how to handle restraints opened the whole question
D> of how to handle concepts which have not yet become widely accepted.  The
D> solution on which there appears to be a consensus is that the major
D> software suppliers should be allocated, by comcif, reserved data name
D> prefixes.  Any dataname starting with this prefix will be defined and
D> managed by the manager of the software system and so will contain
D> definitions of restraints etc. that are appropriate to that software.
D> 
D>    We did not come to a consensus on the use of a _local_ prefix.  The
D> alternatives are:
D>     1. to allocate prefixes to anyone who asks for one
D>     2. to allow anyone to define their own datanames provided that they
D>           begin with _local_.  It is then the responsibility of the user
D>           to ignore these data items or to make sure they have the correct
D>           dictionary.
D>     3. not to make any provision for private datanames.
D> 
D> Of these, 3 would be a disaster, since it is clear that people will define
D> their own data names, that these names will often duplicate either
D> existing data names or ones that will be defined by comcif later, and that
D> the user definition will be different from (though possibly confusingly
D> similar to) the comcif definition.  1 is also a nightmare - the least
D> objectionable is 2.  Does anyone have a better solution?

(I've taken the liberty of including 2. in the proposed Agreement on this topic - BM)

If we do agree on this, we shall need to establish a mechanism for registering
reserved prefixes.

Presumably individuals who want to use such registered prefixes could also be
encouraged to supply a dictionary of their local terms in DDL format.

Ideally, even users employing "_local_" could construct dictionary definitions
that might be included in the CIF itself, allowing a sufficiently "intelligent"
parser of the type envisaged by the mm people to validate the data against
the local dictionary entry, and resolving the difficulty of determining which
local definition is meant. In practice, of course, few individuals would
want to do this, but it seems potentially a useful facility.
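
To make the proposed division of the name space concrete, here is a minimal Python sketch of how software might sort data names into the three classes under discussion. The registered prefixes shown ("shelx", "xtal") are placeholders invented for the example, since no registration mechanism exists yet, and the function name classify is likewise hypothetical.

```python
# Placeholder registrations; a real list would be managed by COMCIFS.
RESERVED_PREFIXES = {"shelx", "xtal"}


def classify(data_name, reserved=RESERVED_PREFIXES):
    """Return 'local', 'reserved:<prefix>' or 'public' for a data name."""
    if not data_name.startswith("_"):
        raise ValueError("a data name must begin with an underscore")
    body = data_name[1:]
    if body.startswith("local_"):
        # Private item: the reader must ignore it or hold the right dictionary.
        return "local"
    for prefix in sorted(reserved):
        if body.startswith(prefix + "_"):
            # Defined and managed by the supplier that owns the prefix.
            return "reserved:" + prefix
    return "public"


for name in ("_local_my_item", "_shelx_restraint_weight", "_cell_length_a"):
    print(name, "->", classify(name))
```

A parser could use the result to decide which dictionary, if any, to validate an item against.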

D10.1 Extension to use full STAR
--------------------------------
Recall that Brian T. suggested (mischievously, no doubt!) that a formal 
commitment might be made to drop the STAR syntax restrictions of CIF at some
future date. Syd considers this not to be an appropriate step at present,
and suggests postponing any consideration of this until applications using
STAR syntax (especially MIF) come on-stream. This seems to be a fair
suggestion - better to watch and benefit from the suggestions of our
chemistry colleagues before we commit ourselves to an approach which will
certainly have problems, whatever its benefits. This Committee could
profitably monitor such developments as they occur.

D10.2 Privileged constructs
---------------------------
In introducing this topic, I asked Syd whether he intended to keep the
semantic meanings assigned to '?' and '.' within the STAR file specification.
On this he has given ground:

S> Nick and I discussed short and hard. We are on the verge of sending back the
S> corrected ms's to JCICS. We have agreed to leave the . and ? specifications
S> out of STAR. It was a line ball decision as this is not a new issue or
S> debate. What swung it in the end was the possibility of DDL definition of
S> these constructs for each application. So that is now set in concrete.

David Brown also had a comment on this topic:

D> I do not see a problem in defining the function of '?' and "." when they
D> appear in a data field, in the DDL. ...  Presumably we could define:
D>      global_
D>        _derestrict '?'
D> if there proved to be some application where it was essential to use these
D> characters in some other way.  

(not if you don't have global_ in CIF!)

D>                                It is convenient to have well established
D> conventions of this kind rather than find that cifs have one convention
D> and mifs another.

OK, with the syntax clarified, we can now ask "Should the special meanings of
'.' and '?' (as previously described) be extended to CIF?" If yes, this would
need to be formally published.
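
For anyone weighing the practical effect, a minimal Python sketch of how a reader might treat the two characters follows. It assumes the meanings are '?' for a value that is unknown and '.' for a value that is not applicable, and it further assumes that a quoted '?' or '.' is ordinary text; neither point is restated in this circular, and the names Special and interpret are invented for the example.

```python
from enum import Enum


class Special(Enum):
    UNKNOWN = "?"        # assumed meaning: value not known
    INAPPLICABLE = "."   # assumed meaning: value not applicable


def interpret(raw, quoted=False):
    """Map a raw data value to a Special marker or return it unchanged.

    Treating quoted values as literal text is an assumption of this sketch.
    """
    if not quoted:
        for marker in Special:
            if raw == marker.value:
                return marker
    return raw


print(interpret("?"))               # Special.UNKNOWN
print(interpret("."))               # Special.INAPPLICABLE
print(interpret("?", quoted=True))  # ?
```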

D10.3 Global_ data
------------------
Syd has already indicated his intention to use global_'s only in data
dictionaries (thus this will not become a CIF extension). A corollary of
using global_ at all is that dictionaries cannot simply be concatenated.
 
S> Nick argues strongly that a CIF or a STAR File is a discrete
S> entity -- and as such they are not intended to be merged and retain all of
S> their original properties. In other words if you want to use a succession
S> of DDL dictionaries, you must open them as separate files and access the data
S> accordingly. Similarly for archived CIF's -- they should be retained as
S> discrete files and search software must allow for this (either by preloading
S> the total set or by some other process). He argues that we are locked into
S> old concepts of how to handle large archives, and that there are more       
S> advantages to retaining discrete CIF's than to merging them. Based on this
S> the global_ data is no longer a problem because it is file delimited.

This theme of a CIF as a discrete entity is also developed by Brian Toby
below, and has of course much weight. [One cannot, however, forbid people
from concatenating files, though the result may be different from the sum of
the parts!]
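
The bracketed point can be shown with a toy model. This Python sketch assumes that items in a global_ block are inherited by every data block that follows it in the same file, and it represents a file as a list of (header, items) pairs rather than parsing real STAR syntax. The item names _list and _name are used purely as examples.

```python
def resolve(blocks):
    """Return {block_name: items} with global_ values applied.

    blocks is a list of (header, items) pairs in file order; the header is
    either 'global_' or a data block name such as 'data_a'.
    """
    inherited = {}
    resolved = {}
    for header, items in blocks:
        if header == "global_":
            inherited.update(items)          # affects all later blocks
        else:
            resolved[header] = {**inherited, **items}  # own values win
    return resolved


dict_a = [("global_", {"_list": "no"}), ("data_a", {"_name": "_a"})]
dict_b = [("data_b", {"_name": "_b"})]

print(resolve(dict_b)["data_b"])           # {'_name': '_b'}
print(resolve(dict_a + dict_b)["data_b"])  # {'_list': 'no', '_name': '_b'}
```

Read on its own, data_b has no _list value; after concatenation it silently acquires the one declared for the first dictionary.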

D10.5 Category
--------------
D> Syd seems to have come up with a workable solution.  Does this satisfy
D> Brian T?  I assume that the additions of _categories to the core cif is
D> part of the cif extension that we have to approve.

D10.6 Restricted character sets for datanames
---------------------------------------------
D> I am puzzled to know why one needs to parse datanames as if they were
D> written in tcl.  Having said that, I can see some advantages in
D> restricting the character set that is used in the names, if only to avoid
D> some potential (and as yet unforeseeable) problems down the road.

Of course, one doesn't need to parse datanames for the benefit of tcl!
However, the point of the complaint was that many different applications use
(many different) special characters, and that there can be a severe practical
overhead in programming around all of these. I have some sympathy with this
viewpoint, but it's purely a matter of making life easier for computer
programmers [and the consensus of this Committee might be that this is not a
desirable objective!]. Syd, I know, takes a robust line on this - ANY ASCII
character (except white space) may follow the leading underscore in a data name.
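
The difference between the two positions is easy to state as a pair of patterns. In this minimal Python sketch the permissive pattern follows Syd's rule, read here as printable ASCII only (that reading is an assumption), while the restricted pattern of letters, digits and underscores is a hypothetical example and not a proposal made in this series.

```python
import re

# Leading underscore, then any printable non-blank ASCII (0x21 to 0x7E).
PERMISSIVE = re.compile(r"_[\x21-\x7e]+")
# Hypothetical restricted set: letters, digits and underscore only.
RESTRICTED = re.compile(r"_[A-Za-z0-9_]+")


def check(name):
    """Report whether a data name passes each of the two rules."""
    return {
        "permissive": bool(PERMISSIVE.fullmatch(name)),
        "restricted": bool(RESTRICTED.fullmatch(name)),
    }


for name in ("_cell_length_a", "_refine_ls_shift/esd_max", "_bad name"):
    print(name, check(name))
```

A name containing a character such as '/' passes the permissive rule but fails the restricted one, which is the kind of case a programmer would otherwise have to code around.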

D11.1/2 Naming of data blocks and files
---------------------------------------
D> D11.1 and 11.2 deal with the name space, again not directly comcif stuff. 
D> In general I do not think it possible or worthwhile to try to devise a
D> syntax for constructing names of datablocks.  There is no way we could
D> satisfy everyone's needs without producing an impossible monster.  The
D> same is true of file names.  Using cross references within the file
D> (_dataset_id) is preferable and is already present to some degree in the
D> _database_codes.  How these _ids are to be associated with file names has
D> to be left up to the user, unless we provide for some syntax outside the
D> data set for associating a particular _dataset_id with a particular file
D> or dataset name.  Since the file name would be locally defined, this
D> association would have to be made explicitly by the user.

And from Brian Toby:

B> The problem of renaming data_ blocks had not been an issue for me
B> until now. I am not too happy with tracing through a loop of _audit_
B> codes to see if one can satisfy an interblock pointer. If renaming of
B> blocks is allowed, it is probably best to treat them as arbitrary. In
B> that case we probably need a _block_id item that can be assigned  by
B> the originator software and then used as the "pointee" for the
B> _pd_phase_id "pointer." 
B> 
B> However, the idea of concatenating CIF together raises an issue for
B> me: is a "CI *File*" (CIF) a logical entity of some sort? Suppose an
B> author has included
B>        data_HA543_manuscript_only
B>        data_HA543_structure_1_of_2
B>        data_HA543_structure_2_of_2
B> in a single file. Is it appropriate to create three files with one
B> block per file? If so, would this not potentially result in a loss of
B> information? What if we concatenate three different structures of
B> "HA543_structure_1" (from three different groups, of course) into a
B> single file, along with another 500 structures for good measure, and
B> to increase our fun, we rename all the data_ blocks. How then do we
B> decide which structure the author of "data_HA543_manuscript_only" is
B> describing, when making an excuse for obtaining a negative thermal
B> factor?
B> 
B> I can see two ways to proceed. Assuming there is nothing sacred
B> about the file where the CIF was originally placed, then when one
B> wishes for a logical connection between blocks, the blocks must
B> share a pointer. This would *require* a locally-derived,
B> uniquely-defined pointer along the lines of that messy _pd_dataset_id
B> construction. (Otherwise how do you sort out 499 references to
B> "structure_1"?). I would suggest that in this scenario the blocks must
B> share the *same* _dataset_id.
B> 
B> The second option is to consider the file where CIF is placed to be a
B> logical construction. Then if one concatenates CIF for
B> transmission or archive, one should include an end-of-file marker
B> so that the original logical structure can be retrieved. The CIF
B> standard does not address local storage -- only the exchange format.
B> Mechanisms for local storage or private exchange mechanisms that might
B> involve temporary concatenation of CIFs become a local issue that
B> does not need to be addressed in the standard, provided it is clear to
B> all parties involved and there is no general distribution. Note that,
B> in this case, renaming blocks is no longer needed.

In many ways this seems the cleanest solution. Alas, however, there isn't
a reserved word in STAR syntax that will denote "end-of-file". (There was,
inadvertently, when we drew up our first template CIF form: we put in an
"_eof" character string, before realising it was not legal. Perhaps we missed
an opportunity there!) We could invent a local data name (which needs to have
a value): _end_of_file 'here'. Or we could indicate the file boundaries some
other way (in a comment!!!). These would all be local conventions...
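
A minimal Python sketch of the comment-based variant follows, to show that the original file boundaries can be recovered after concatenation. The marker text and the function names concatenate and split are invented for the example and carry no standing.

```python
MARKER = "#=== end of file ==="  # hypothetical local convention


def concatenate(files):
    """Join CIF texts, writing the marker comment after each one."""
    return "".join(text.rstrip("\n") + "\n" + MARKER + "\n" for text in files)


def split(combined):
    """Recover the original texts by cutting at each marker line."""
    parts, current = [], []
    for line in combined.splitlines():
        if line.strip() == MARKER:
            parts.append("\n".join(current) + "\n")
            current = []
        else:
            current.append(line)
    if current:  # trailing text with no closing marker
        parts.append("\n".join(current) + "\n")
    return parts


files = ["data_a\n_x 1\n", "data_b\n_y 2\ndata_c\n_z 3\n"]
assert split(concatenate(files)) == files
```

Two weaknesses are worth noting: under A8.1 software is not required to retain comments, so the marker may be lost in transit, and a marker line occurring inside a text field would cause a false split.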

I feel that there is no great enthusiasm for the inter-block pointer, at
least as a standard construction. We have no abiding interest in this matter,
and would have no difficulty in continuing our current arrangement with
Cambridge, where we change the datablock names and concatenate 70 or 80 CIFs
at a time (because we're not really concatenating, we're writing a new CIF
which has data in CIF format extracted from other files!). We don't yet
implement the _audit_data_block_ idea, but could do so as a local extension,
if ever CCDC wish this. We would need, however, to look out for things like
Brian T.'s _dataset_id which do point internally to block names.

B> Finally, I do have a problem with your "include" idea as pointing
B> directly to files by name. What happens to you when some DOS user names a
B> file "~my/cif.c94" or he/she tries to read in DOS a CIF that points to
B> cifdic.C93b, when DOS allows only three-letter extensions? (I am no
B> DOS fan, but recognize that PC's are the dominant platform in second and
B> third world countries). For that matter which of the three different
B> generations of cifdic.c92 files that I have should be used when I see
B> a reference to cifdic.c92? This problem is what gave rise to the
B> _dataset_id for the Powder dictionary. My principle is that if one
B> wants to refer between files one *must* use pointers, not file names.
B> Otherwise, you will sooner or later have operating system dependencies
B> and conflicts between different, identically named, files.

Well, yes... The "<>" notation (meaning "look in the standard directory") is
meant to resolve problems of local directory hierarchies. The file name
within the brackets should have a restricted character set that applies
across all platforms. (We can impose restrictions on the names of "standard"
files that would go in such a directory.) However, one day someone will bring
out an operating system that doesn't allow dots in file names...
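
What a cross-platform restriction might look like can be sketched in a few lines of Python. The rule used here (a DOS-style name of up to eight characters plus an optional extension of up to three, drawn from letters, digits and underscore) is only one hypothetical choice, and the name is_portable is invented for the example.

```python
import re

# Hypothetical rule: 1-8 characters, then optionally a dot and 1-3 characters.
PORTABLE = re.compile(r"[A-Za-z0-9_]{1,8}(\.[A-Za-z0-9_]{1,3})?")


def is_portable(file_name):
    """True if the name satisfies the assumed cross-platform rule."""
    return bool(PORTABLE.fullmatch(file_name))


for name in ("cifdic.c92", "cifdic.C93b", "~my/cif.c94"):
    print(name, is_portable(name))
```

Under this rule the first name passes, while the four-character extension and the path-like name from Brian T.'s examples are rejected.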

For dictionary files in particular, there is a more general problem of
defining the naming convention (whether this is understood as a filename or a
pointer isn't really relevant). If the core dictionary is cifdic.C91 (and the
revised core cifdic.C94?) what should one call the 1994 Crystallization
dictionary? - i.e. the only free variable in the naming convention we
currently have is the single character immediately after the dot. How should
we try to approach this?

I have returned to this well-trodden furrow largely because the mm people are
very concerned about "name space management", rather than because I see it as
a crucial issue myself. My own feeling is that cooperating communities can
easily establish their own conventions for cross-referencing data. I had
gained the impression that the large-molecule database folk wanted there to
be an "official" way of answering the question 'How do I locate the data for
structure ABCDE?' if those data were stored in a CIF somewhere. Perhaps Paula
can comment on whether I've misunderstood this requirement, and whether she
considers it a real problem, or just a perceived one that will go away when
everyone starts using and archiving CIFs at PDB and in other places.

Brian