(18) New Consultants, _include_file, matters arising

To: [email protected]
Subject: (18) New Consultants, _include_file, matters arising
From: [email protected] (Brian McMahon)
Date: Wed, 19 Jan 94 12:42:53 GMT
Dear Colleagues

Welcome to two new Consultants, who have already made many indirect
contributions to our discussions: Peter Murray-Rust of Glaxo
([email protected]) and Phil Bourne of Columbia University
([email protected]).

Agreement
---------
A16.1   All dates in a CIF should be expressed as yyyy-mm-dd. Passed nem. con.

==================== Current discussion topics

D4.1 Restraints
---------------
D>    Paula's arguments make sense.  I assume that we now have a mechanism 
D> for 'including' the enumeration lists but these lists will be as cast in 
D> stone as if they were part of the dictionary, so we must be careful how 
D> we define new terms.

A4.2 Intro sections
-------------------
B> My last try: how about _.doc? Same # of characters as _[pd]. Otherwise,
B> I will pout, but will use _[pd]. 

Trouble comes when you have _chemical_conn_.doc in the core dictionary, but
_chemical_conn_.doc.mm (or _mm) in an extension dictionary (in which Paula
writes "we don't use this stuff"). Let's stick with [], irritating as it is
(it turns out not to look very good in print, but ciftex can make it go away
altogether - just as it can with .doc, _intro, _apx or whatever we choose!).

A10.6 Restricted character sets
-------------------------------
B> We need a rule either way. Since CIF has not made any character set
B> restrictions so far, why not continue likewise? I can accept a restriction
B> on characters, as long as it is a formal part of the CIF standard.

Perhaps a brief summary is in order here, for the benefit of Peter and Phil
(who sparked off this chain of discussion!). We have been considering whether
to disallow any characters from CIF data names, on the grounds that parsing
software may have problems with "special characters". The proposed agreement
is that any printable ASCII character (except "white space") be permitted in
a data name. The parser, recognising a data name by a leading underscore,
will collect all characters until white space is encountered, and not seek to
interpret any characters in that stream. If an application does wish to
interpret the name (one of Peter's applications tries to write files with the
same names as the CIF data name) it must guard against conflict with the
operating system or environment.
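
The parsing rule described above can be sketched in a few lines of Python.
This is only an illustration under assumptions: "printable ASCII except white
space" is taken to mean character codes 33 to 126, and the names
read_data_name and safe_file_name are hypothetical, not part of any existing
CIF software.

```python
# Printable ASCII excluding white space: codes 33..126 (an assumption drawn
# from the proposed agreement, not a quotation from the CIF standard).
ALLOWED = {chr(code) for code in range(33, 127)}


def read_data_name(text, start=0):
    """Return (name, next_index) if a data name begins at text[start], else None.

    The name is recognised by its leading underscore; every following
    character is collected, uninterpreted, until white space is met.
    """
    if start >= len(text) or text[start] != "_":
        return None
    end = start
    while end < len(text) and not text[end].isspace():
        if text[end] not in ALLOWED:
            raise ValueError(f"character outside printable ASCII at offset {end}")
        end += 1
    return text[start:end], end


def safe_file_name(data_name):
    """Hypothetical guard for an application that names files after data names.

    Anything other than letters, digits, '.', '_' and '-' is replaced by '_',
    so that the result cannot collide with operating-system syntax.
    """
    return "".join(ch if ch.isalnum() or ch in "._-" else "_" for ch in data_name)
```

The parser itself never inspects the characters it collects; only the
application-level helper has to worry about the host environment.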

A15.1 Standard prefixes
-----------------------
B> I am in support of standard item name prefixes. 

D>    Paula rightly points out that a real cif reader will ignore all names 
D> that it does not recognise.  She implies that all local names will thus 
D> be ignored.  But this is not correct.  It is quite possible for a local 
D> name to duplicate an official name, either through the oversight of the 
D> local programmer or because a likely name adopted by the local programmer 
D> has been subsequently adopted (with a subtly different meaning) by 
D> comcifs.  This is a very likely scenario.  The use of local_ as a prefix 
D> would ensure that a real cif reader would automatically ignore local 
D> names because it would clearly not accept any name that started with 
D> local_ from now to the end of time.  The problem of two different 
D> laboratories defining say local_bond_length with two different meanings 
D> is not a problem, because it is the responsibility of the user to make 
D> sure that s/he only switches on the local_ read when the program is 
D> reading files created and maintained in the local laboratory.  I tend to 
D> agree that other prefixes (e.g. shelx_) should be treated as a special 
D> case of local_, not for use outside of the shelx system of programs.  
D> This does mean that we have to be prepared to define names for any data 
D> item that will be needed by more than one laboratory, which is a little 
D> scary.  We should, however, agree that certain words (such as shelx_ and 
D> local_) are reserved prefixes.
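
A minimal sketch of how a reader might honour reserved prefixes, assuming
that data items arrive as a name-to-value mapping and that local_ and shelx_
are the only reserved prefixes (both taken from the examples above). The
function name and the accept_local switch are hypothetical.

```python
# Reserved prefixes, taken from the two examples in the discussion above.
RESERVED_PREFIXES = ("local_", "shelx_")


def split_items(items, accept_local=False):
    """Separate data items into (kept, ignored) dictionaries.

    Names whose text after the leading underscore starts with a reserved
    prefix are ignored unless the user has switched on the local read.
    """
    kept, ignored = {}, {}
    for name, value in items.items():
        bare = name[1:] if name.startswith("_") else name
        if bare.lower().startswith(RESERVED_PREFIXES) and not accept_local:
            ignored[name] = value
        else:
            kept[name] = value
    return kept, ignored
```

With accept_local left off, a file from another laboratory cannot inject its
own meaning for local_bond_length into the reader's results.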

D15.1 New types
---------------
D>    The combination of Paula's and Brian M's comments suggests that we need 
D> a new type_condition for date.  It would be useful to adopt this soon 
D> since dates are used in many places - cifdic.pd being particularly rich 
D> in dates.  As far as I can see they are all compatible with international 
D> usage and should optionally allow the time to be given.  I cannot 
D> immediately lay hands on the international convention but it is in the 
D> form yyyy.mm.dd-hh:mm:ss where the string can be truncated from either 
D> end.  The adoption of this as a type would make its introduction in 
D> various places quite standard.  However, the enumeration of 
D> type_condition is in the DDL so perhaps this is outside the jurisdiction 
D> of comcifs.  Perhaps Syd can take note and give his comments.

This IS slightly different from the point we have just agreed (that dates
should take the form yyyy-mm-dd) in that it formalises the convention into a
data type. Brian Toby raises the query:

B> We really need Date & Boolean types. How can this be done with the new
B> DDL _type_conditions?

I presume that data items conforming to these new types would be assigned
_type char and _type_conditions bool and date (i.e. the enumeration list for
_type_conditions needs to be extended). Now, the interesting thing is that in
current practice the definition field for _type_conditions would then be
expanded to contain sentences of the form "'bool' signals that the truth
value of the data item may be tested; acceptable values are 'true' and
'false'". A better approach in the long run is to have a field which describes
the constraints on the data types in a machine-readable form, using regular
expressions or some robust metasyntax, such as:
   data_type_conditions
     loop_ _enumeration  _enumeration_constraint _enumeration_detail
       none    *                           'only types in _type apply'
       su      [0-9]*.*[0-9]*([0-9])       'numbers may include su's'
       seq     [a-Z0-9.]*[,:][a-Z0-9.]*    'data may be sequence of values'
       incl    <*[a-Z0-9._]*>*             'name of "include" STAR file'
       xdat    <*[a-Z0-9._]*>*             'name of external data file'
       bool    (true|false)                'bivariate Boolean truth value'
       date    nnnn-nn-nn[:nn-mm-ss]       'calendar date'

This is only a sketch, and a non-workable one at that, for it demonstrates
that 'standard' regular expressions are inadequate. But I believe that this
is one of the things Peter is interested in exploring, and may profitably be
grafted on to the basic DDL at a later stage.
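
To make the limitation concrete, here is a rough Python sketch covering only
the two simplest rows of the table, bool and date. It assumes Python's re
syntax rather than any metasyntax proposed for the DDL, uses the agreed
yyyy-mm-dd form without the optional time, and the names CONSTRAINTS and
conforms are hypothetical.

```python
import re
from datetime import date

# Machine-readable constraints for two type conditions (illustrative only).
CONSTRAINTS = {
    "bool": re.compile(r"true|false"),
    "date": re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})"),
}


def conforms(type_condition, value):
    """Return True if value satisfies the constraint for type_condition."""
    pattern = CONSTRAINTS.get(type_condition)
    if pattern is None:
        raise KeyError(f"no constraint recorded for {type_condition!r}")
    match = pattern.fullmatch(value)
    if match is None:
        return False
    if type_condition == "date":
        # The pattern only checks the shape; the calendar check needs code.
        try:
            date(*(int(field) for field in match.groups()))
        except ValueError:
            return False
    return True
```

The date case shows where a pattern alone falls short: 1994-02-30 has the
right shape, and only the extra calendar check rejects it.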

D16.1 esd/su
------------
For information, I append edited highlights of some correspondence between
David and Howard on this matter. How do the current dictionary authors feel
about implementing the change of name?

D>      I am trying to persuade Syd that the terminology of ddl should 
D> substitute 'su' for 'esd'.  He is naturally concerned that this 
D> recommendation may be a passing fad and that ddl and cifdictionaries 
D> would look stupid in the future using a symbol that no one has ever heard 
D> of.  I, however, am convinced that su will supersede esd and that we are 
D> in danger of building an archaic terminology into the files.  Syd 
D> seems to be willing to make the substitution if he can be convinced that 
D> this is really the way of the future and that IUCr will adopt this 
D> terminology. ...  Would it be possible to give Syd reassurance on the
D> matter? You did raise the issue in a comcif circular ... It may be that
D> we should come to some formal decision to adopt this terminology as a
D> committee.  What is the position of the Union?

H>   The original report [Statistical Descriptors in Crystallography, Acta Cryst.
H> (1989) A45, 63-75 - Schwarzenbach et al.] was the work of a Subcommittee of the
H> IUCr Commission on Crystallographic Nomenclature appointed 27 Feb. 1985. The
H> report was accepted on 9 May 1988 by the Commission and on 2 September 1988
H> by the Executive Committee.
H> 
H>   On 4 March 1993 a Working Group of the Commission on Crystallographic
H> Nomenclature was appointed.  ... the working group was called
H> upon to consider an International Standards Organisation (approved) document
H> called "Guide to the Expression of Uncertainty in Measurement". We are now up
H> to draft 3 of the report of our considerations on the ISO document. Dieter
H> [Schwarzenbach] is at work on draft 4. ...  Once the working group has put
H> its report in its "final" form, it has to be approved first by the IUCr
H> Commission on Crystallographic Nomenclature and then by the Executive 
H> [Committee] itself.  Here is the current form of the abstract of our report:
H> 
H>  <<The Working group has examined recent recommendations for evaluating 
H> uncertainty in measurement issued by ISO. This new report presents the
H> concepts of standard uncertainty, and of Type A and Type B evaluations of
H> standard uncertainties. It expands the earlier dictionary of statistical
H> terms. The recommendations propose replacing the term "estimated standard
H> deviation" (e.s.d.) by standard uncertainty (s.u.) and request a complete
H> description of the experimental and computational procedure used to obtain
H> all results submitted to IUCr publications.>>
H> 
H> I have made a combined report in pseudo-Acta-Cryst format from the 1989 Acta
H> Report and the current Working Group's report where (amongst other things) 
H> references to e.s.d. have been changed into s.u. ... Soon the other members
H> of the working group will be able to give their opinion. 
H> 
H> All of this sounds very formal but it is intended to be useful.

D17.2 Revised DDL
-----------------
Responses to Syd's enhancements to the DDL seem generally favourable. I can
commend to any interested parties an "essay" on DDL by Jan Zelinka which has
been deposited in the gopher hole at Columbia (cuhhca.hhmi.columbia.edu, port
70). This is a very informative description of the requirements for structuring
data to be loaded into relational (or other) database applications. Jan suggests
some further modifications to the DDL (which I suspect Syd won't be prepared
to buy) that align the DDL terms more closely with RDB record keys. What
seems to me of most interest is that this maps more closely onto RDB
terminology, but does not, I think, add to the information content of a
file described by existing DDL terms. On the other hand, it reduces the
generality of terms such as _list_reference, which can be applied to
hierarchical data structures (i.e. STAR files with nested loops). Whether or
not this potentially greater generality in Syd's formulation will allow the
same terminology to be used in loading up both relational and hierarchical
data structures is a very interesting question, and one that might be worth
investigating in another arena.

B> The category issue is resolved (provided that categories in the previous
B> main CIF dictionary are combined so that it is possible to combine
B> appropriate items in a loop or spread them across multiple loops).

Paula and I have been working together to align the category assignment
across the core and mm dictionaries, and I really hope that we can soon
unleash these revised versions on the world. When this happens, I invite Brian
to look carefully at the categories assigned to check that they meet this
requirement.

B> I am at an impasse with Syd on _include_file & data types xdat 
B> and incl. I feel that for the syntax to be appropriate for inclusion
B> into an archive and interchange format, several problems need be
B> addressed.
B>   1) the usage must not be operating system dependent
B>   2) it must be possible to determine the name of an include file
B>      from information inside the file (comments don't count -- 12A8.1)
B>   3) we need some provision for uniqueness in naming include files.
B> 
B> Points 1 & 2 are easy to resolve (except perhaps for xdat files, which
B> may not be portable, anyway.) Point 3 is the hard part, but is also
B> very important. Imagine that you have an archive of 10,000 CIF files
B> and 50 files from 3 different labs refer to include files called
B> 'local_defaults.cif'. How will you identify the correct include file
B> to use for each CIF? What if 'local_defaults.cif' is revised each time 
B> the lab submits a CIF to Acta? Guess those CIFs are not very useable!
B> 
B> I realize that Paula needs the _include_file syntax now for the mm
B> dictionary and there does not seem to be any hope of resolving point 3
B> at this time. So my suggested compromise is to restrict use of
B> _include_file and incl to inside dictionaries (as we have for use of 
B> _global). Also, not allow use of xdat at all in CIF. When we resolve 
B> the uniqueness problem, we can allow _include_files in "normal" CIF
B> files. Since there are a limited number of dictionaries, we should be
B> able to handle the file name issue the old-fashioned way.

To an extent, this is retreading old ground. But I have a couple of comments.

One is that 'include' files can indeed be physically included where there is
danger of ambiguity. Hence, a laboratory might have an _include_file pointing
to details of its (unvarying) experimental setup, to prevent incessant
retyping of the same information. The file they send to Chester should suck
in that information and have it explicit. It might be that we could make a
special arrangement if that laboratory was for some reason unable to do this
- they would send their CIF + include file, and our local software would do
the 'including', so that the file we sent to Cambridge would be complete and
entire of itself.

The other is that 'official' include (and maybe xdat) files could be
identified by a string looking something like <iucr/mm/restr.lst>, where
the <> mean "official - follow the IUCr rules for retrieving the file",
"iucr" is the flag for the body with authority over the file (e.g. a list
of standard data for amino acids might be retrieved from a PDB repository by
accessing <pdb/amino/alanine.cif>), "mm" is a hierarchical identifier (there
could be more if needed: mm/protein, mm/nucleic_acid) and "restr.lst" is
a file pointer. It is of course possible that this string might point to a
file whose physical location within an ftp directory is "iucr/mm/restr.lst",
but not required. (And if anyone complains, I'll as happily go for
[iucr|mm|restr.lst] !) Without the surrounding <> or [], the string is not
portable, but might be used as a local file pointer (this is analogous to the
situation with "local" data names). But here's the interesting part. The
restraints in X-plor may be described by a dictionary written local to that
program, which is deposited as an include file with Chester (or PDB, or
wherever), as <private/xplor/restr.lst>. The IUCr claims no authority over
the contents of the file (hence "private"), but has assigned a reserved
identifier to the particular program author ("xplor"). The mechanisms set up
to allow a conformant CIF reader to retrieve "official" include files will
work to retrieve these other files.

I would expect this to work in practice thus: an intelligent CIF reader
sees an include_file pointer like "<iucr/mm/restr.lst>", recognises (from the
delimiters) that it is supposed to be able to do something with this, looks
for the file in its standard library area as defined when the application
was compiled and installed (if it's Unix, it might search for
"/usr/local/cif/iucr/mm/restr.lst", if DOS for "C:CIFLIB\IUCR\MM\RESTR.LST",
on a VAX for "user:[CIF.LIB.IUCR.MM]restr.lst;*", on a Mac for
"Useful stuff for CIFs:iucr:mm:restr.lst" and so on - it's even
possible to map the subfields (iucr, mm etc) to different names if that's
considered desirable). If it can't find the file there, it can retrieve it,
again following well defined rules - anon-ftp to ftp.iucr.ac.uk, "get
mm/restr.lst"; or "send mm/restr.lst" in an e-mail message to
[email protected].
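
The lookup step might be sketched as follows in Python. The pointer syntax
and the example library roots come from the paragraphs above; the function
names, the choice of pathlib, and the validation rules are assumptions, and
the retrieval step (ftp or e-mail) is left out.

```python
from pathlib import PurePosixPath, PureWindowsPath


def parse_pointer(pointer):
    """Split '<authority/sub/.../file>' into (authority, subfields, filename).

    Returns None when the delimiters are absent, i.e. the string is only a
    local file pointer and carries no portable meaning.
    """
    if not (pointer.startswith("<") and pointer.endswith(">")):
        return None
    parts = pointer[1:-1].split("/")
    if len(parts) < 2 or not all(parts):
        raise ValueError(f"malformed include pointer: {pointer!r}")
    return parts[0], parts[1:-1], parts[-1]


def library_path(pointer, root, flavour=PurePosixPath):
    """Map an official pointer onto a path below the local library root."""
    parsed = parse_pointer(pointer)
    if parsed is None:
        return None
    authority, subfields, filename = parsed
    return flavour(root, authority, *subfields, filename)
```

Passing a different flavour (for example PureWindowsPath) changes only the
path syntax, which is the point of keeping the pointer itself independent of
any operating system.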

This scheme doesn't address 'versioning' - maintenance of successive versions
of an include_file. Provided the file is genuinely only ever extended, this
isn't a problem - the latest version will always encompass all earlier
versions. Suppose the CIF reader has a 'caching' philosophy: if the file in
the standard local library area is older than x months, it will retrieve the
current version from the official depository. A command-line switch when
starting up the application will force it to download the current version
anyway.
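
A sketch of that caching decision in Python, assuming the age of the local
copy is judged by its file modification time and that "x months" is supplied
as a number of days. The function name and parameters are hypothetical; the
download itself is not shown.

```python
import os
import time


def needs_refresh(path, max_age_days, force=False, now=None):
    """Decide whether the local copy of an include file should be re-fetched.

    force models the command-line switch; a missing file always needs
    fetching; otherwise the file's modification time is compared with the
    permitted age.
    """
    if force or not os.path.exists(path):
        return True
    current = time.time() if now is None else now
    age_seconds = current - os.path.getmtime(path)
    return age_seconds > max_age_days * 86400
```

The optional now argument exists only so that the rule can be exercised
without waiting for a file to age.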

If the CIF reader doesn't find the angle brackets around the include_file
pointer, it does whatever is appropriate (which may include ignoring the
directive).

Any mileage in these ideas?

I've used my privileged position as moderator to outline my own ideas in
some depth here. They should be read in conjunction with Syd's comments on
this matter in a message to me of a few days ago:

S> The latest DDL change, which allows for included data, was in response
S> to Paula's request in the first instance, but also because of earlier
S> concerns about the hierarchy of dictionary definitions and how these should
S> be loaded into applications. The clear solution to the dictionary problem
S> was to define a special DDL command _include_file which would signal that 
S> a named file be inserted to replace the _include_file tuple. This file
S> MUST conform to STAR syntax and satisfy the uniqueness requirements of
S> the parent file. The proposed solution to the CIFmm needs was to enable
S> any data item to be classified (via the _type_conditions DDL command) as
S> specifying an external data file. This file need not conform to the STAR
S> syntax but would be accessed as part of the data structure [I am a bit 
S> uneasy how this will look or how it will work -- and I hope Paula and 
S> Phil have set up some actual applications!].
S> 
S> Brian Toby thought the include files to be a good idea but wanted to
S> impose very strict rules on the structure of the filenames. He suggested
S> for example that the table of element symbols in the MIF definitions be
S> named as 'IUCr|1994-01-10|09:51|elemsymb.c93' rather than 'elemsymb.c93'.
S> You will remember that we discussed this naming in Chester and 
S> agreed that "generic" naming of files (within dictionaries) was more 
S> desirable than highly specific version/site/chrono filenames that would 
S> require continual updating of the parent files.  I do agree with BT,
S> however, that the include files themselves need to be clearly
S> identified in terms of the current status and version. What I have 
S> counter-proposed is that the 'incl' type files (as opposed to the 'xdat'
S> files) contain an additional DDL item "_file_version_id" which would 
S> specify this type of data (perhaps with the structure that BT suggests).
S> 
S> While I am strongly opposed to replacing the generic naming of 'incl'
S> files (which can contain _file_version_id info) I am really unable to make
S> that judgement for the 'xdat' files (which may or may not support the STAR
S> file definition of _file_version_id). Because of their much more application
S> specific nature, "xdat" files may need to use the type of naming that BT 
S> suggests. My suspicion is that they will not but I will be really guided by
S> the comments of Paula and Phil -- if possible wrt a specific file. 
S> 
S> A cautionary note. The concept of a highly specific filename does have some
S> immediate attractions in terms of clearly identifiable uniqueness, but please
S> do not overlook the software implications of a complex name structure and 
S> the name dependency wrt the parent file (this will provide a large element
S> of uniqueness anyway).

Regards
Brian