(14) Continuing discussions on (10)-(13)

To: [email protected]
Subject: (14) Continuing discussions on (10)-(13)
From: [email protected] (Brian McMahon)
Date: Fri, 19 Nov 93 13:24:04 GMT
Dear Colleagues

This is the usual Friday mailing. Paula's report on PDB matters was a bumper
bonus issue!

(Dis)Agreements
---------------

(12)D4.1 Restraints
-------------------

B> I support [this] proposal.

P> I am sorry that I have to disagree with the solution that the way around the 
P> restraints problem is giving each of the software developers a set of data 
P> names. I agree with George that the whole point of this exercise is allowing 
P> a CIF to store the information that is routinely published with a structure 
P> report or summarized in the PDB entry - in the case of Protin/Prolsq a table 
P> of Parameter being Restrained, Target Value for RMS Error for this Parameter 
P> from Ideal Values/Value of RMS Error for this Parameter in the Refinement 
P> Model, etc.
P> 
P> The way we have implemented this in the mmCIF draft is to make 
P> _refine_ls_restr_type the data name, and the issue that we are trying to
P> deal with is how to restrain the *values* that that data name can take in a 
P> parsable and manageable manner. Being a Prolsq user, when I put together the 
P> first draft of that data item, I just listed things like 'bond_d' and 
P> 'angle_d' as example values.  But these are of course Prolsq specific 
P> restraint concepts (i.e. the fact that Prolsq restrains angles by
P> restraining the distance between the 1 and 3 atoms, not by restraining
P> the angle itself).
P> 
P> Perhaps the compromise position is to give each software package (or the 
P> developer of that package) a prefix, not for data names, but for data 
P> *values*.  For instance, my prolsq enumeration list would now look something 
P> like 'p_bond_d' and 'p_angle_d', the X-plor entries to the list would look 
P> like 'x_bond_d' etc, and so on for each package.
P> 
P> In the current draft, I put the Prolsq enumeration list into an example, not 
P> an enumeration list, because I didn't think the science was solid enough to 
P> carve in stone.  I still feel that way, but various forces have been beating 
P> on me until I am willing to admit that it does no good to specify values at 
P> all if you can't use the same software that validates other enumeration
P> lists to validate these values.
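
A minimal sketch (in Python) of how package-prefixed values could be checked
by the same code that validates any other enumeration list. Only the values
'p_bond_d', 'p_angle_d' and 'x_bond_d' come from Paula's message; the prefix
table and the function name are assumptions made for illustration, not part
of any draft dictionary.

```python
# Hypothetical enumeration list for _refine_ls_restr_type. Only these three
# values appear in the discussion, and none is an agreed standard.
RESTR_TYPE_ENUM = {"p_bond_d", "p_angle_d", "x_bond_d"}

# Assumed mapping from value prefix to the package that owns it.
PACKAGE_PREFIXES = {"p": "Prolsq", "x": "X-plor"}


def check_restr_type(value):
    """Return the owning package if value is in the enumeration list.

    Raises ValueError for anything not listed, as a generic enumeration
    validator would.
    """
    if value not in RESTR_TYPE_ENUM:
        raise ValueError("not an allowed _refine_ls_restr_type: %r" % value)
    prefix = value.split("_", 1)[0]
    return PACKAGE_PREFIXES[prefix]
```

An unprefixed value such as 'bond_d' is rejected here, which is the point of
the compromise: the package-specific meaning is carried by the value itself.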

(12)D4.2 Dictionary introductions
---------------------------------
P> I vote yes.

B> I am against this proposal. As I stated before, the _name_[]
B> nomenclature is even more confusing than _name_appendix. So it
B> fixes nothing, and makes reading a CIF dictionary even more confusing.
B> I do support _name_.intro or anything else where the ``tag'' means
B> something to someone who has not seen the COMCIFS dialog.

There are two facets of the introductory sections which we haven't treated
separately, but perhaps we should. First, the general structure of such items
(with _type null, _category dictionary_definition and a standard _definition
field) needs to be established. I am not convinced in my bones that this is
the most elegant solution from a philosophical viewpoint, but I think it can
be made to work by parsers which are fully aware of the convention. So far,
there has been no dispute over this.

The second issue, what we call the sections, is more contentious but probably
more amenable to imposing an arbitrary convention. The printed dictionary can
use any of a number of typographic tricks to highlight the special nature of
these sections. The cifdic file itself needs to be explained in more detail
to novice users - we have recently had a helpful but unschooled author adding
data names to our standard CIF template as "data_blah" instead of "_blah".
Such explanation as is supplied can identify the nature and purpose of
"_[]" (or ".intro") blocks. There seem to be three points of relevance here.
We wanted a sorting aid - more or less any non-alphabetic character will do,
so ".intro" and "_[]" tie on this (and presumably ".intro_mm" and "_[mm]" etc).
We need something short. Ab initio, I would have considered this no real
problem, but I note that Paula does have some long category names (the
longest I could find was 25 characters), so one is pushing things to have
".intro" tagged on to this (but one MIGHT make exceptions in the CIF
Dictionary). The third factor is individual taste. 
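
The first two points can be checked mechanically. The following is only a
sketch in Python: it assumes plain ASCII ordering, takes the 32-character
limit on data names mentioned elsewhere in this mailing, and uses function
names invented for the illustration.

```python
MAX_NAME_LENGTH = 32  # data-name length limit referred to in this discussion


def sorts_first(intro, data_names):
    """True if intro precedes every data name in plain ASCII order."""
    return all(intro < name for name in data_names)


def fits(name):
    """True if name is within the 32-character limit."""
    return len(name) <= MAX_NAME_LENGTH


# Example category and names; the 25-character category is made up.
names = ["_atom_site_label", "_atom_site_fract_x"]
long_category = "_" + "a" * 24

print(sorts_first("_atom_site_[]", names))     # "[" sorts before lower case
print(sorts_first("_atom_site.intro", names))  # "." sorts before "_"
print(fits(long_category + ".intro"))          # 31 characters
print(fits(long_category + ".intro_mm"))       # 34 characters
```

One caveat under these assumptions: upper-case letters precede "[" in ASCII,
so the "_[]" form sorts first only if names are compared in lower case.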

Perhaps we can leave this open for interested parties to concede to another's
point of view, then - failing gentlemanly agreement - devise and test voting
procedures on this.

(12)A8.1  Comments
------------------
B> I support [this] proposal.

P> I also vote yes here, but I would like to comment on Brian's comment that he 
P> sometime plans to add _comments entries for information that doesn't belong 
P> anywhere else. My addition here is that in order to be consistent with style
P> in the core, these ought to be _special_details extensions;  we used this 
P> convention through the mmCIF dictionary, although we did have to shorten 
P> _special_details to _details in many cases to stay inside the 32-character 
P> limit.

There's already a precedent for "_details" in the Core (e.g.
_refine_ls_abs_structure_details).


D10.2 Privileged constructs
---------------------------
P> I am happy to hear that ? and . have not gotten enshrined in STAR. 
P> I do think that some such concepts (and perhaps others) are needed in CIF,
P> but I am not at all sure of the best mechanism for doing this.
P> Undoubtedly the answer lies in some form of DDL, but a useful suggestion
P> eludes me just now.

D10.3 global_
-------------

P> Brian makes important points here about the problems of using global_
P> declarations, even if they are only used in dictionaries, and proposes
P> various solutions to the concatenation problem. My response is: why use
P> them at all, even in dictionaries?  I was perfectly happy with the mmCIF
P> dictionary when every data name block said either _list yes or _list no and 
P> wondered what in the world had happened when all of the _list no items went 
P> away.  Took me a while to find that _list no had been moved to global_.  The 
P> important thing about the dictionary is that it be instantly clear to the 
P> (relatively) naive user what is going on with each data name definition.  
P> True, a clever dictionary browser will know to add the value declared in 
P> global_ to each data name it displays (if the global_ value has not been 
P> superseded locally).  But this disenfranchises the dictionary browser who 
P> happens to be a human being and who is on page 63 and has to continually 
P> remember that something was declared globally on page 2.
P> 
P> Of course, using the global declarations does save space in the dictionary, 
P> but I would argue that the roughly 100 lines saved by taking _list no out 
P> (most of the 600 data names in the dictionary are _list yes) are meager 
P> compared with the complications (especially programming ones) that are 
P> introduced by using globals.  This is going to be a tough sell anyway - let's
P> not make it any more sophisticated than it has to be.  To say it again, the 
P> only benefit that I see coming from the use of globals in dictionaries is 
P> saving space, and I think that that is too small a gain for the problems
P> that are introduced when you start using globals.

Yes, well, I would go with that, I think. It has been pointed out by our
old friend Peter Murray-Rust that the present draft dictionaries have
"_dictionary_version" in a global_ statement, which means that his dictionary
browser prints the entire dictionary history for every data name accessed
- arguably not what was wanted!
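
A minimal sketch (Python) of what a browser has to do when global_ is used,
and of why "_dictionary_version" turns up against every data name. The
dictionaries are represented as plain Python dicts, the version string is
invented, and the function names are assumptions for the illustration.

```python
def resolve(global_items, local_items):
    """Attributes of one definition as a browser would display them.

    Values declared in global_ apply unless the definition's own block
    supersedes them.
    """
    resolved = dict(global_items)
    resolved.update(local_items)
    return resolved


def inherited(global_items, local_items):
    """Attributes a human reader must remember from the global_ block."""
    return sorted(key for key in global_items if key not in local_items)


# Hypothetical global_ block and one definition that overrides _list.
global_block = {"_list": "no", "_dictionary_version": "x.y"}
definition = {"_name": "_atom_site_label", "_list": "yes"}

print(resolve(global_block, definition))
print(inherited(global_block, definition))
```

Every definition picks up "_dictionary_version" unless it is filtered out,
which is the behaviour Peter observed.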

D10.4 DDL
----------

P> I would not like to see the official DDL diverge for CIF and MIF. True, some
P> current DDL is not applicable to CIF given that CIF is only a subset of STAR,
P> but I would hope that anything that was important enough to be worth adding
P> to the "CIF" DDL would be equally applicable to STAR at large.
P> 
P> However, I take Brian's point that something like _type could easily have a 
P> larger enumeration list.  A new value for type of "date" has already been 
P> suggested, and I think that this is a very good mechanism for imposing rigor 
P> on this class of data values.

D10.5 Categories
----------------

P> I also agree that Syd's suggestion is workable.  And I also agree that it is 
P> within our mandate to oversee all of the changes in the core - implementation
P> of the new DDL included (not that I see any problem with what has been done).

B> Syd's idea for categories (if I read it correctly) seems a step 
B> backward: Require that each loop contain only one category, but a
B> category can be split into several loops. The result from this is that
B> all entries that one ever *might* want to include in a loop_ *must*
B> now have the same category. This largely defeats the advantage of
B> having categories in the first place. (It is a step toward my
B> tongue-in-cheek suggestion at the mmCIF meeting that we put all CIF
B> items into a single category.)
B> 
B> Have any of the relational database gurus seen this idea? I would
B> guess that they would like it less than allowing a loop to contain
B> multiple categories. The purpose of defining a category is to
B> establish relationships between sets of entries. By splitting a
B> category into two loops with different numbers of entries in each
B> loop, a category no longer establishes a relationship, and thus has no
B> value! By placing two categories into the same loop, one is establishing 
B> an implicit 1:1 relationship between two sets of related entries
B> (where in other sets of data there may be a many:1 relationship).
B> The view from the RDB side (as I understand it) is that this is lazy -
B> better to put the categories in different lists and then establish
B> pointers. My guess is that under the current proposal, categories will
B> be useless to the RDB folks. So why even have them?
B> 
B> I am afraid that I would prefer to see the opposite here. All entries
B> in a category *must* be in the same loop (or may be scalars defined
B> outside the loop). It is valid to combine categories into a single
B> loop. If this is done, it will be possible for a smart CIF parser to
B> generate loops separated by category and even generate the _child _parent 
B> pointers, where these relationships are present in the dictionary.
B> However, with my understanding of the current proposal, this will never be
B> possible.
B> 
B> This discussion does raise two DDL questions: 
B> (1) Would having hierarchical levels for categories help?
B> 	'Atom_site.xyz' and 'Atom_site.tfactor',...  
B> and
B> (2) I think that each category should have its own entry in the
B> dictionary to establish the parent-child relationship. Currently, to
B> find the relationship between categories atom_site_aniso_label and 
B> atom_site_label one must search through all of the entries that are in
B> each category to find the matching parent-child pointers (hoping that
B> they are valid).
B> 
B> I know that this is taking the discussion here a step backwards, and
B> into the depths of DDL definitions, which are not properly in our
B> domain, but we need to make a choice. Either have dictionaries that 
B> enforce parent-child and category relationships in a flexible and
B> complete manner or simply not enforce them at all.
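
To make the alternative concrete, here is a sketch in Python of a parser
separating a combined loop into one loop per category. It assumes the loop
is already held as a list of column names and a list of row tuples, and that
the dictionary supplies a name-to-category table; the category assignments
in the example are illustrative, and generation of the parent-child pointers
is left out.

```python
def split_by_category(names, rows, category_of):
    """Split one loop into one loop per category.

    names       -- column data names of the combined loop
    rows        -- list of tuples, one per loop packet
    category_of -- dict mapping data name to its category
    Returns {category: (names, rows)} with duplicate rows removed,
    preserving first-seen order.
    """
    columns = {}
    for index, name in enumerate(names):
        columns.setdefault(category_of[name], []).append(index)

    result = {}
    for category, indexes in columns.items():
        seen = []
        for row in rows:
            packet = tuple(row[i] for i in indexes)
            if packet not in seen:
                seen.append(packet)
        result[category] = ([names[i] for i in indexes], seen)
    return result


# Assumed category table for the example.
category_of = {
    "_atom_site_label": "atom_site",
    "_atom_site_fract_x": "atom_site",
    "_atom_site_aniso_label": "atom_site_aniso",
    "_atom_site_aniso_U_11": "atom_site_aniso",
}
names = list(category_of)
rows = [("C1", "0.1", "C1", "0.02"), ("O1", "0.2", "O1", "0.03")]
for category, loop in split_by_category(names, rows, category_of).items():
    print(category, loop)
```

The reverse operation is not available under the other proposal: once a
category has been split over loops of different lengths, nothing in the file
says how the packets line up.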

D10.6 Restricted character set for datanames
--------------------------------------------
P> I can see the virtue of not setting limits on CIF beyond the agreed upon 
P> general rules.  However, just because we have license to inflict "horrors" 
P> upon data names doesn't mean that we ought to do it. I can easily change the
P> two offenders in mmCIF.

(The powder dictionary has two with '%' and at least 42 with '/' !)

P> But this discussion raises (for me) the issue of style. And Peter's point of
P> "I know the CIF dic will never change, but couldn't we suggest it never 
P> changed to something like _stol (which is used elsewhere)."  I think that we,
P> as COMCIFS, should take a hard look at the issue of style with respect to
P> new dictionaries.  If _stol is going to be used for sin(theta)/lambda in one 
P> place, then it ought to be used for all such instances of that concept.
P> 
P> The rules of CIF are that data names and their definitions are stable over 
P> time. But my understanding is that CIF can evolve in the sense that a genuine
P> mistake can be rectified by abandoning a data name and creating a new name 
P> that solves the problem.  This would be the solution here.  Abandon the 
P> something_sint/lambda data and add a new data name _something_stol.  I really
P> think we ought to look at this issue rigorously before inflicting any more 
P> inconsistencies on an unsuspecting (and, in my case, wary) public.
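
Finding the offenders is easy to automate. The sketch below (Python) assumes
a "restricted" set of letters, digits and underscore after the leading
underscore; that set is an assumption for illustration, not anything COMCIFS
has agreed, and the "%" name in the example is invented.

```python
import re

# Assumed restricted form: leading underscore, then letters, digits and
# underscores only. This is an illustration, not an agreed rule.
RESTRICTED = re.compile(r"_[A-Za-z0-9_]+")


def offenders(data_names):
    """Return the data names that fall outside the assumed restricted form."""
    return [name for name in data_names if not RESTRICTED.fullmatch(name)]


print(offenders(["_something_sint/lambda", "_something_stol", "_example_%"]))
```

Note that under this assumed rule the proposed "_[]" and ".intro" section
names would also be flagged, so any real rule would have to make room for
whichever convention is chosen.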


D11.1/2 Block and file names
----------------------------
B> I think we are confusing two issues here. One is "the 'official' way
B> of answering the question 'How do I locate the data for structure
B> ABCDE?'" The other is "how do I refer from a CIF block to something
B> that is not defined within the current block?" The former is a
B> database question related to how one catalogs a library of CIFs and is
B> not in the domain of COMCIFS. The second is a question of how one
B> structures information within CIF and is very much within our domain. 
B> By all means we should define CIF items if they will assist
B> cataloging, but I would prefer to wait until the various database
B> organizations make proposals.
B> 
B> Unless one believes that these inter-block and inter-file references
B> are  only of local value, we have the problem of translating a local
B> reference  into something that survives when the information is
B> exchanged (which is the goal of CIF). For powder diffraction and for
B> dictionaries (and probably elsewhere, too) these references are of
B> much more than local value. I would prefer that a single method be
B> adopted here rather than leaving it to each community (Paula & Syd --
B> comments?)
B> 
B> I agree with David that there should not be rules for generating data_
B> block names. But, I note that database_code_* entries are unlikely to be
B> useful for referencing external information. For new materials these
B> codes will not exist. For standards, there will be thousands of CIFs
B> with the same codes. 
B> 
B> If file names will be used to refer to dictionaries, why bother
B> including a <> prefix? The meaning will need to be defined. One may as
B> well just say that file names used in CIF will need to be resolved
B> by looking them up in the CIF directory. Putting a directory reference
B> into the brackets would be attractive, but opens far more problems. 
B> I personally think that devising standardized file names is a poor
B> choice, as one must resort to something that incorporates *all* of the
B> weakness found in every significant platform. 
 
My suggestion of <> has the meaning "look in the standard directory", where
the location of that standard directory is implementation-dependent, but at
least other applications running on that platform can be taught the location.
It just distinguishes between official and local files by location. In a C
program, "#include <stdlib.h>" means "get the standard header file stdlib.h
from the place where all the standard files are kept"; but '#include "stdlib.h"'
includes a file with the name stdlib.h in an implementation-dependent way
(but usually from the current directory). Maybe we shouldn't concern
ourselves with such suggestions at all - these are conventions which users
might employ legally within the current standard. However, I have been
reading the reference work on the Standard C library, and one comes across
many details that would be useful to know, but are declared "implementation
dependent" - sometimes, legitimately, because the C Standard Committee cannot
know how data is stored on individual machine architectures; but sometimes
because the "wrong" choice has already been made in different implementations,
and the Committee cannot or will not reverse the situation. It would be nice
not to hit the second problem too often with CIFs!
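
As a sketch only (Python), the "<>" convention amounts to the lookup below.
The directory locations and the file name in the example are hypothetical;
where the standard directory lives is exactly the part left to the
implementation.

```python
import os


def resolve_reference(ref, standard_dir, local_dir="."):
    """Map a file reference to a path.

    "<name>" is looked up in standard_dir, whose location is left to the
    implementation; a bare "name" is looked up in local_dir.
    """
    if ref.startswith("<") and ref.endswith(">") and len(ref) > 2:
        return os.path.join(standard_dir, ref[1:-1])
    return os.path.join(local_dir, ref)


# Hypothetical standard directory and dictionary file name.
print(resolve_reference("<core.dic>", "/usr/local/cif"))
print(resolve_reference("core.dic", "/usr/local/cif"))
```

The reference itself stays the same when a file moves to another site; only
the value passed as standard_dir differs from one installation to the next.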

B> Also, I don't see why there must be a CIF- or STAR-defined end-of-file 
B> character for the usage that IUCr makes in concatenating CIFs into a
B> single file for transmission to Cambridge. Any method that you agree 
B> with Cambridge (or anyone else) for exchanging CIF's need not be part
B> of any standard. If you and they agree to use LZ compression, this
B> does not require that LZ compression need be part of the CIF
B> standard. It will simplify life if one treats the information that an
B> author has chosen to include in a file as having some meaning. If one
B> chooses to abstract and compile information (electronic or printed) it
B> should be done with references to the original sources.
B> 
B> Powder Notes: _pd_dataset_id points from one CIF *file* to another. 
B> Concatenation of files could make life confusing, but not impossible. 
B> Dividing a multi-block CIF into separate files might cause problems
B> with _pd_dataset_id, but I am not sure.  However, _pd_phase_id does
B> point to blocks by block name and assumes that the contents of a
B> multi-block CIF will never be divided into separate files and that
B> block names will not be changed. If you do insist upon renaming
B> blocks, I will require that an _id be defined for every block
B> in a CIF (probably _pd_block_id) and that this name be used
B> for all inter-block references.
B> 
B> I hope I have made the following points here:
B> (1) There is a need to exchange information that includes both inter-CIF
B>     (file to file references) and intra-CIF references (block to block).
B> (2) The powder dictionary *must* define these relationships.
B> (3) File names are purely local constructs and have no value upon
B>     export.
B> (4) I would prefer to see a common method used for dictionaries and
B>     for powder files than see divergent methods develop as other
B>     usages arise.

D12.1 Proposed schedule
-----------------------
B> Looks OK to me

D13.1 External reference files
------------------------------
B> I sympathize with Paula's desire to manage the enumeration lists
B> separately from the rest of the dictionary. As I see it, however we
B> implement defining allowed values for a CIF data item we are still
B> defining the CIF standard -- a function currently fulfilled by the CIF
B> dictionaries. Assuming the following two statements are true:
B> 
B> (1) We can expand enumeration lists for existing CIF definitions as needed.
B>   -- this was certainly my expectation for some Powder entries and we 
B>     are doing this in the main dictionary.
B> 
B> (2) We cannot remove enumeration values from lists for existing CIF 
B>     definitions.
B>   -- as I see it, this would invalidate old CIF's and thus break Syd's 
B>     first law of CIF.
B> 
B> Thus if we have external enumeration lists, the lists are really still
B> part of the dictionary. We have just increased the complexity of the
B> dictionary structure through incorporation of a syntax for file
B> inclusion. It would be better to put this time into development of
B> better dictionary editing tools that would ease the task of management
B> of enumeration lists.

I would agree that we could handle keywords as enumeration lists within the
Dictionary. We would need to issue a new edition of the Dictionary every time
a new keyword was added. However, equally we would need to issue a new
edition of the keywords "external reference file" in such circumstances -
and, ideally, update the Dictionary to indicate which version of the keywords
ERF was now current. There are attractions to externalising the keywords file
(in the WDC project, we have 1750 keywords, which would make for a long
listing in a printed Dictionary!).
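
Whichever home the lists have, Brian Toby's two statements translate into a
simple check on each new edition: values may be added but none removed. A
sketch in Python, with invented keyword values and function names:

```python
def removed_values(old_enum, new_enum):
    """Values present in the old enumeration list but missing from the new."""
    return sorted(set(old_enum) - set(new_enum))


def is_valid_revision(old_enum, new_enum):
    """A revision may add values but must not remove any."""
    return not removed_values(old_enum, new_enum)


# Hypothetical keyword lists for two editions.
edition_1 = ["alloy", "mineral"]
edition_2 = ["alloy", "mineral", "zeolite"]

print(is_valid_revision(edition_1, edition_2))
print(removed_values(edition_2, edition_1))
```

The same check applies unchanged to a list kept inside the Dictionary or in
an external reference file.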

Brian