(15) Restraints, name space, schedule ...

To: [email protected]
Subject: (15) Restraints, name space, schedule ...
From: [email protected] (Brian McMahon)
Date: Fri, 26 Nov 93 14:02:00 GMT
Dear Colleagues

Matters awaiting consensus
--------------------------

A15.1 Standard prefixes for local extensions - formerly (12)D4.1 Restraints
---------------------------------------------------------------------------

D> Paula's comments are appropriate to the matter of this discussion, but
D> A4.1 does not directly address the problem of restraints.  I think A4.1
D> can be approved, as it addresses the problem of local cif extensions
D> regardless as to whether this is an appropriate way to handle restraints.

Hence the obscure heading for this section! Let us take the view that this
particular resolution [that COMCIFS supply and manage reserved data name
prefixes for major software suppliers] is now half-way through the acceptance
stage (despite its being given a reference number appropriate to this mailing).
This throws the more general debate on how to handle *restraints* back into
the arena (see David's comments further to this below).
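
As a rough illustration of what reserved prefixes would mean for software,
here is a minimal sketch of a data-name classifier. The prefix registry, the
vendor names and the function name are all hypothetical (no such list has
been issued), and the case-insensitive comparison is an assumption of the
sketch.

```python
# Hypothetical registry of reserved prefixes -> owner; NOT an actual COMCIFS list.
RESERVED_PREFIXES = {
    "_vendorA_": "Vendor A software",
    "_vendorB_": "Vendor B software",
}

def classify_data_name(name, dictionary_names):
    """Return 'standard', 'local:<owner>' or 'unknown' for a data name."""
    lowered = name.lower()  # assumption: names are compared case-insensitively
    if lowered in {n.lower() for n in dictionary_names}:
        return "standard"
    for prefix, owner in RESERVED_PREFIXES.items():
        if lowered.startswith(prefix.lower()):
            return "local:" + owner
    return "unknown"
```

With this, a name such as _vendorA_refine_flag would be attributed to its
registered owner instead of being rejected as undefined.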

A4.2 dictionary introductions
-----------------------------
D> Although I do not feel strongly, I concede that Brian T has a point.  In
D> looking through the printed dictionary, _name_[] is very uninformative.  I
D> would prefer to see _name_.intro if it is possible to find the space.  It
D> would be unwise though to start making exceptions to the 32 character rule.

There was some support for the square brackets notation from Paula in an
informal message:

P> I keep hearing the debate about .intro vs. _[]. I really don't like the idea
P> of introducing a new name delimiter (. instead of _). The thing I really
P> like about the [] construction is that it is so short - we fight against 32
P> characters all of the time, and this would make our life easier.

It may be argued that the '.' is not *formally* a new separator character (or
that '[' and ']' make *two* new delimiters!). May we leave this on the table
a little longer - I hope to get a chance to experiment with some new
typography in the Dictionary next week, and we can see whether we can make
the intention more obvious to the neophyte browser.


Back now to the active discussion topics:

D4.1 Restraints
---------------
D> Paula expresses unhappiness about this method of addressing restraints,
D> but I am not sure that I think her solution any better.  It would result
D> in extensive and specialised enumeration lists which would be expected to
D> become obsolete as the field progresses, and we have not yet decided
D> whether enumeration lists can be changed by decree, or whether they are
D> as immutable as the rest of the cifdic. Furthermore, with Paula's
D> solution special parsers would be needed to extract the information.  We
D> would be better off defining the terms directly as cif datanames rather
D> than doing it second hand via enumeration lists.
D> 
D> Can we hear any further arguments on this point?  If no alternatives 
D> acceptable to Paula are forthcoming, I will call for a vote to resolve
D> this issue.

D10.3 global
------------
D> This construct is, apparently, only allowed in a cif dictionary, which
D> makes the discussion a lot simpler.  It currently contains two kinds of
D> information, 1) an introduction to the file including the file history and
D> 2) a set of default settings.  When this is printed in the dictionary,
D> only 1) appears; all the default settings are hidden from the user (the
D> computer would, of course, recognise them, but this is of little use to
D> the person working from the hard copy who would not even know that
D> anything had been defined).

It is, of course, possible to modify ciftex to recognise global_'s and
print the default information in each entry: but the argument is surely that
all CIF software that might need to access dictionaries would need to be
modified in just such a way to interpret global_'s; and is the pay-off
worth the effort?
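
To make the cost concrete, the following is a minimal sketch of the extra
step every dictionary-reading program would need: copying each global default
into every definition that does not set it explicitly. Definitions are
modelled as plain Python dicts, the attribute names in the usage note are
illustrative only, and the real global_ semantics may differ.

```python
def apply_global_defaults(global_defaults, definitions):
    """Return new definitions with the global defaults filled in.

    An attribute set explicitly in a definition overrides the default.
    The input mappings are left unmodified.
    """
    resolved = {}
    for block_name, attributes in definitions.items():
        merged = dict(global_defaults)  # start from the defaults
        merged.update(attributes)       # explicit values take precedence
        resolved[block_name] = merged
    return resolved
```

For example, with defaults {"_type": "char", "_list": "no"}, a definition
that gives only {"_type": "numb"} resolves to "numb" for _type and inherits
"no" for _list.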

D>                              The history of the file is clearly useful
D> information, and one should have a file history that is part of the cifdic. 
D> This feature could even be usefully incorporated into regular cifs where
D> it could be expanded to explain the relationships of various datablocks
D> within the cif.  I am inclined to agree with Paula that 2) serves little
D> purpose except to make everyone's life more complex.  In any case these
D> are two separate functions that are best handled differently.  2) could
D> best be handled by means of an 'include' construction, if we ever figure out
D> how to make this work.

(14)10.4 DDL
------------
D> This is not really related to DDL, but seeing the suggestion that _type
D> might include 'date' in its enumeration list reminded me that in defining
D> dates, we should be wary of using 93.11.22, since the centenary of x-ray
D> diffraction is approaching and crystallography is a lot older than that. 
D> We ought to allow, at least optionally, for 1993.11.22.

Well, 1993-11-22, actually! This touches on a couple of points: we should
choose sensible standards for quantities that permit this (and the mm parsing
gurus would argue that they should be verifiable electronically - i.e. as
separate types or with explicit validation rules in the DDL); and we should
maintain style across the different dictionaries (a point already made by
Paula in different contexts).
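
As an illustration of "verifiable electronically", here is a minimal sketch
of a check for the yyyy-mm-dd form. The function name is hypothetical, and
the sketch assumes that a four-digit year is mandatory.

```python
import re
from datetime import date

# Four-digit year, two-digit month and day, separated by hyphens.
DATE_PATTERN = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")

def parse_cif_date(text):
    """Return a datetime.date for a 'yyyy-mm-dd' string, or raise ValueError."""
    match = DATE_PATTERN.fullmatch(text)
    if match is None:
        raise ValueError("expected yyyy-mm-dd, got %r" % text)
    year, month, day = (int(part) for part in match.groups())
    return date(year, month, day)  # raises ValueError for impossible dates
```

Under these assumptions "1993-11-22" is accepted, while "93.11.22",
"1993.11.22" and "1993-02-30" are all rejected.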

Since the issue of data typing has arisen off and on through our discussions,
this may be an appropriate point to start a new discussion thread on the
specific point:

D15.1 Extension of data types
-----------------------------
Does the Committee wish to introduce new data types? Recall that at present
only 'numb' and 'char' types are supported. The down side to this proposal is
that parsers need to extend the types of data they can recognise; the up side
is that validation is facilitated.
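
The trade-off can be sketched as a table of validators keyed by type code.
This is only a sketch: 'numb' and 'char' are the existing types, 'date' is a
hypothetical addition, the number syntax is deliberately simplified, and the
date check tests the shape of the string only, not the calendar.

```python
import re

# Simplified number syntax: optional sign, digits with an optional decimal
# point, optional exponent, optional standard uncertainty in parentheses,
# e.g. 1.234(5).
NUMB_PATTERN = re.compile(
    r"[+-]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][+-]?[0-9]+)?(\([0-9]+\))?"
)
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

VALIDATORS = {
    "numb": lambda value: NUMB_PATTERN.fullmatch(value) is not None,
    "char": lambda value: True,  # any string is acceptable as 'char'
    # Hypothetical new type, shown only to illustrate the proposal.
    "date": lambda value: DATE_PATTERN.fullmatch(value) is not None,
}

def is_valid(value, type_code):
    """Check a raw string value against a declared type code."""
    try:
        validator = VALIDATORS[type_code]
    except KeyError:
        raise ValueError("unknown type code: %r" % type_code)
    return validator(value)
```

Each new type costs one more entry in the table (the parser's burden), and in
return a value such as 93.11.22 declared as 'date' can be rejected
mechanically.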


D11.1/2 Block and file names (and D13.1 External reference files)
-----------------------------------------------------------------

P> I agree with Brian T. that we are talking about too many things at the same 
P> time.  From my perspective this discussion might benefit from being thought 
P> about in terms of the actual problems that we are trying to solve with the 
P> mmCIF effort.  There are three things that we would like to do:
P> 
P> a) Develop a naming convention that will allow all of the different databases
P> to play nice together with respect to the universe of archived data. The same
P> convention could also allow locally developed dictionaries to function 
P> properly within the world of other local dictionaries and with the sanctioned
P> dictionaries.
P> 
P> b) Develop a way for a CIF to point to another CIF in much the way that code 
P> can point to include files.  Our primary goal here is to provide a mechanism 
P> for referencing external files that contain information that will be common
P> to all (or at least many) CIFs (for example, full chemical descriptions of
P> the amino acids).
P> 
P> c)  Develop a way for a dictionary, within the definition of a data item, to 
P> point to a list of "pseudo"-enumerated values.
P> 
P> Right now, c) is the most pressing problem that my committee (in consultation
P> with the PDB) is confronting, but b) is right behind it in urgency.  a) is a 
P> real issue, but not something that I (at least) think needs to be worked out 
P> just now.
P> 
P> So let me expand my thoughts on c).  Phil Bourne has proposed a DDL extension
P> that would manage this, but in fairness to Phil I don't want to present his 
P> proposal to COMCIFS until it has a chance to evolve a bit more.  My concern
P> is with the content of these "pseudo"-enumerated values, and I would like to
P> use structure keywords to illustrate my thinking.
P> 
P> My concern with carving keywords in stone is that science changes.  When one 
P> of us makes a statement like this, they are usually talking about the list of
P> keywords getting larger, but I would argue that we ought to provide also for 
P> keywords going away.  As an example, when the structure of entry (entirely 
P> hypothetical) 1xyz is submitted to the PDB in 1995, the best descriptor for 
P> the protein is "oxidoreductase", and so a keyword "oxidoreductase" is used.  
P> But by 1999, there are 253 oxidoreductase entries, and so a simple
P> designation of oxidoreductase is no longer sufficiently atomized. 
P> Oxidoreductase is thus subdivided into Oxidoreductase (heme), Oxidoreductase
P> (iron-sulfur), Oxidoreductase (other), and Oxidoreductase (unspecified). 
P> The point of all of this is that at that point Oxidoreductase is *removed*
P> from the list of valid keywords - it would no longer be valid to keyword a
P> compound in this class without specifying a subclass (even if that subclass
P> is other or unspecified).
P> 
P> From a software perspective, I don't see this being a problem.  When a user 
P> prepares a CIF, he/she does so with a tool that is reading the current list
P> of valid keywords, and the user simply selects from that list. In those cases
P> where the file had been prepared by hand and the author had specified 
P> oxidoreductase, the validating software would say, Oops, oxidoreductase is
P> not in the current list of valid keywords, here are eight closely related
P> keywords to choose from as replacements (with of course the option of going
P> back to the whole list if necessary).
P> 
P> With respect to breaking old CIFs by removing oxidoreductase from the list, I
P> am sure that we can come up with some mechanism like flagging a keyword 
P> that has been retired as retired - old CIFs would still validate, but with a 
P> warning that they were using retired values for keywords.  The data archive 
P> would be expected to update entries containing retired values at the 
P> retirement date - this sounds like a big job, but with a properly keyworded 
P> entry it could be done relatively automatically.
P> 
P> Of course, this is only one example, but a major point of keeping these 
P> lists outside of the main dictionaries is that they can evolve outside of 
P> the constraints of CIF dictionary approval (and, I would argue, outside the 
P> carved in stone assumptions of CIF).  The second major point is that these 
P> lists are going to be long - Brian M. has already made this point with
P> respect to the World Directory project, but it deserves being made again
P> and again. 
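
A minimal sketch of the validation behaviour Paula describes, in which a
retired keyword still validates but draws a warning and a list of
replacements. The keyword table, the status labels and the function name are
hypothetical; the entries are taken from her oxidoreductase example.

```python
import difflib

# Hypothetical keyword list; retired entries stay known but are flagged.
KEYWORDS = {
    "oxidoreductase": "retired",
    "oxidoreductase (heme)": "current",
    "oxidoreductase (iron-sulfur)": "current",
    "oxidoreductase (other)": "current",
    "oxidoreductase (unspecified)": "current",
}

def check_keyword(keyword, keywords=KEYWORDS):
    """Return (status, suggestions); status is 'ok', 'retired' or 'invalid'."""
    key = keyword.strip().lower()
    current = sorted(k for k, state in keywords.items() if state == "current")
    status = keywords.get(key)
    if status == "current":
        return "ok", []
    # Prefer current keywords that refine the given one, else close spellings.
    suggestions = [k for k in current if k.startswith(key)] if key else []
    if not suggestions:
        suggestions = difflib.get_close_matches(key, current, n=8)
    return ("retired" if status == "retired" else "invalid"), suggestions
```

Here check_keyword("Oxidoreductase") reports 'retired' and offers the four
subclasses, so an old CIF can be accepted with a warning while a new one is
steered towards a current value.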

In a way, there are too many ingredients mixed into this discussion to allow
any useful progress to be made on specific points, but I think it has been
useful to allow open debate on a variety of topics that are linked by
underlying needs of data addressing outside the CIF. 

D12.1 Proposed schedule
-----------------------
D> We do not need a formal agreement on this, but I have to provide an
D> estimate as to when the cifdic.p will be approved.  If I do not hear any
D> objections to the schedule I proposed in the next week or so,  I shall
D> assume that we are agreed that it represents a reasonable timetable that
D> we should attempt to adhere to.

Paula had some remarks on this suggestion:

P> I find the proposed schedule realistic, but profoundly depressing. The mmCIF 
P> effort is under extraordinary pressure to get something to the Protein Data 
P> Bank that can be put into production - right now. I hear rumblings that make 
P> me very unhappy that we have gone so slowly that we will be overtaken by a
P> more aggressive effort coming out of Europe, associated with the
P> establishment of the new EBI center in Cambridge.
P> 
P> I've also been under a lot of pressure to start disseminating the dictionary 
P> widely to my community, and I have been extremely reluctant to do this until 
P> we have something more polished than what we have now.  However, the mmCIF 
P> effort really needs to be accepted by the community, and I don't think it is 
P> realistic to share Dave's opinion that "I do not, generally, expect much 
P> change between the draft and the definition documents, and changes to the 
P> data names would be kept to the absolute minimum."  On the contrary, we have 
P> already made a number of sweeping changes based on input throughout this year
P> from the small subset of the community that has seen the draft, and I expect 
P> more changes when the draft is circulated widely.  I think that we will only 
P> succeed in the marketing part of our job if we remain open minded about 
P> responding to the suggestions and recommendations of the community at 
P> large.
P> 
P> Ergo, I would rather see a more free-form process, where the product of phase
P> I is circulated to the community and that COMCIFS oversees the revision 
P> process.  When things stop changing, and there would be a time limit here, 
P> COMCIFS then steps in and makes sure that everything looks exactly right.  
P> That document is then recirculated, now with the expectation that little
P> will change.

Paula and I have talked a little about a possible accelerated timescale if
this procedure is approved by the Committee. She will be in Chester early in
January, and hopes to be in a position to hand over at that time a finished
draft for distribution to the community. There could be a call for responses
from the community with a cut-off date of the ACA meeting (which is in Atlanta
June 26 - July 1 next year). This does bypass phase 2 of David's scheme, but
in this case it's worth recalling that Syd has already done much work with
Paula on this over the years, and I shall be working on it too in the near
future; and that there is little point in the Committee being excessively
attentive to a draft that may well be significantly changed by input from the
community. There is of course no reason why we should not be reviewing the
draft document ourselves during its public release.

Regards
Brian