(11) Restraints; naming data blocks and external files

To: COMCIFS@uk.ac.iucr
Subject: (11) Restraints; naming data blocks and external files
From: bm@uk.ac.iucr (Brian McMahon)
Date: Mon, 8 Nov 93 11:10:47 GMT
Dear Colleagues

Please forgive the recent flood of circulars from here. The CIFtools workshop
seemed a useful spur for placing these items on the discussion table, but
there is no particular urgency implied in forwarding them so thick and fast.

D4.1 Restraints
---------------
In response to my query to George Sheldrick in (8)D4.1 ("would it be possible
in principle to devise ... descriptors ... [to] enable a user to repeat the
refinement"), George has answered thus:

G> The answer is "I'm not sure", but the result would be much
G> less intuitive and elegant than the solution
G> based on a text field called _shelx_input_deck.  Quite frankly, I fail
G> to see the point of this torturous exercise; it would not enable the
G> refinement to be repeated with any other program, and it would involve
G> an enormous amount of effort to no useful purpose.  You have to remember
G> that the SHELX input file has been specifically designed for the purpose,
G> and is subject to none of the constraints of CIF, in particular those
G> concerned with lists.  The reason why I would like to be able to define
G> a few descriptors starting with _shelx_ is NOT for use by my own programs,
G> but so that information of general interest to databases and other
G> users of the resulting CIF files can be included.  In particular I have
G> in mind the summary of the restraints, which is present (as text) in most
G> recent PDB entries, but is to some extent specific to the refinement
G> program which has been used.  This implies that there needs to be a
G> procedure for registering descriptors such as '_shelx_', '_tnt_', 
G> '_xplor_' etc.; '_local_' would be of no use to me.

The question was prompted by the increasing number of suggestions that CIFs
may be used within databases, molecular modelling programs, and so forth -
an extreme interpretation of its "universality". I was interested in the
feasibility of using it as a program-specific worksheet, but readily agree
that this is not in general a useful path to follow. 

The consensus seems to be that Paula's approach to handling this problem
in the macromolecular dictionary is appropriate and useful.

======

In my review of the Tarrytown workshop, and in other places, I have already
alluded to the "name space" problem. The problem is to enable the retrieval
of data in CIF format from the crystallographic universe of knowledge. There
are two aspects to this - providing pointers to other data blocks within the
same file, and locating other files.

There have already been extensive discussions between several of us on this
matter (Syd, for instance, need read no further!). In one sense, there is no 
real problem: naming conventions may be established to suit local requirements.
The question is whether an "official" convention should be established. If
so, this Committee should, presumably, preside over such conventions. Brian
Toby has asked me to raise this matter (see his remarks below), and the
mmCIF'ers are quite agitated over it. The intra-file problem might already be
resolved: please read section D11.1 carefully, and contribute your thoughts on
the suggestions contained therein. The inter-file case hasn't been discussed
so thoroughly, but obviously has many points of similarity.

I let Brian Toby introduce the discussion:
B> Also under the category of DDL work. I mentioned at the MM meeting
B> that the current PD dictionary has two types of pointers that may have
B> general applicability:
B>   _pd_phase_id   is used to point from one data_ block to another.
B>   _dataset_id    is used to point from one CIF (file) to another and
B>                  has a standardized definition to avoid "name space"
B>                  issues.
B> 
B> The MM folks I spoke with felt that this will be generally useful and 
B> should be defined in the DDL.


D11.1 Naming of data blocks
---------------------------
The discussions on this were well summarised by a lengthy message from Syd to
Phil Bourne (architect of the Tarrytown meeting and a member of Paula's
mmCIF working group) earlier this year, which I reproduce below with just a few
additional comments.

SRH] Discussions on the naming of data blocks go back to 1989! Much much water
SRH] has flowed under this bridge. The final conclusions were that this is an
SRH] archival or application consideration. PDB will handle this naming in the
SRH] way that best suits the PDB; the same applies to the CCDC or the IUCr (even
SRH] though the same data blocks may be involved). Provided that the application
SRH] knows whence the data came, then the rules of data block naming will be
SRH] understood.

Our interest here is whether we should (if we can) define conventions that
IUCr, PDB etc will all adopt, so that you can guarantee that the
application will understand the data block naming.

SRH] Of course, it is quite possible that IUCr, PDB, ICDD and CCDC will agree on
SRH] a data block naming construction -- but I wouldn't hold my breath! My
SRH] general inclination is to keep the data block name brief with a clearly
SRH] recognisable code (e.g. PDB_93_A_83) and embed the detailed information
SRH] within the data block.
SRH] Here is an example of how the powder people want to handle the internal id.
SRH] 
SRH] data_pd_dataset_id
SRH]     _name                       '_pd_dataset_id'
SRH]     _type                        char
SRH]     _list                        both
SRH]     loop_ _example
SRH]                             Si-std|D500#1234-987|B.Toby|91-15-09|16:54
SRH]                             SEPD7234|IPNS-SEPD|B.Toby|91-15-09|16:54
SRH]     _definition
SRH] ;           Used to assign a unique character string to a set of data. 
SRH]             This code is assigned by the originator of the dataset and
SRH]             may be used to catalogue the dataset. 
SRH]
SRH]             Since CIF's will be modified, additional dataset ID's
SRH]             should be assigned to differentiate the revised CIF from
SRH]             the original. The previous and new ID's can then be looped.
SRH]             The original dataset ID should always be retained, but
SRH]             if there are multiple revisions, it is not necessary that
SRH]             ID's for all intermediate revisions be retained.
SRH]
SRH]             The format for the ID code is:
SRH]                 <sample_id>|<instr._name>|<creator_name>|<date>|<time>
SRH]
SRH]              <sample_id>    is an arbitrary name assigned by the
SRH]                             originator of the initial dataset.
SRH]              <instr._name>  is a unique name [so far as possible] for
SRH]                             the data collection instrument, preferably
SRH]                             containing the instrument serial number for
SRH]                             commercial instruments.
SRH]              <creator_name> is the name of the person who collected the
SRH]                             diffraction data or who prepared the CIF.
SRH]              <date>         is the date the CIF was created or modified.
SRH]              <time>         is the time the CIF was created or modified.
SRH] 
SRH]             As new names are assigned to the CIF, the original <sample_id>
SRH]             and <instr._name> should be retained, but the <creator_name>
SRH]             may be changed and the <date> and <time> will always change.
SRH]             The <date> and <time> will usually match either the 
SRH]             _audit_creation_date or _audit_update_record entries.
SRH] 
SRH]             Within each section of the code, the following characters 
SRH]             may be used: 
SRH]                           A-Z a-z 0-9 # & * . : , - _ + / ( )
SRH] 
SRH]             The sections are separated with vertical bars '|' which are
SRH]             not allowed within the sections. Blank spaces may also 
SRH]             not be used.  Capitalization may be used within the ID code
SRH]             but should not be considered significant -- searches for 
SRH]             dataset ID names should be case insensitive.
SRH] 
SRH]             Dates must follow the format 'yy-mm-dd' ...
SRH]             Times must have the format 'hh:mm' ...
SRH] ;
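
The format rules in this definition are concrete enough to check by
program. The following is a minimal Python sketch, not part of the powder
dictionary: the function and field names are hypothetical, and rejecting an
empty section is an assumption, since the definition does not say. The date
and time sections are not checked beyond their character set, because their
rules are elided ('...') in the text quoted above.

```python
import re

# Characters permitted within a section, per the definition quoted above.
_SECTION = re.compile(r"[A-Za-z0-9#&*.:,\-_+/()]+")

# Hypothetical field names for the five sections of the ID code.
_FIELDS = ("sample_id", "instr_name", "creator_name", "date", "time")


def parse_dataset_id(code):
    """Split a dataset ID into its five sections, or raise ValueError."""
    parts = code.split("|")
    if len(parts) != len(_FIELDS):
        raise ValueError("expected 5 sections separated by '|'")
    for name, part in zip(_FIELDS, parts):
        # Assumption: an empty section is treated as an error.
        if not _SECTION.fullmatch(part):
            raise ValueError(f"illegal or empty section: {name}")
    return dict(zip(_FIELDS, parts))


def same_dataset_id(a, b):
    """Compare two IDs without regard to case, as the definition asks."""
    return a.casefold() == b.casefold()
```

Both example codes given in the definition pass this check; a code
containing a blank space or a sixth section is rejected.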

Note that this convention has been suggested to provide a mechanism for
pointing to files, and perhaps properly belongs to the discussion below.
However, I leave it here because it does illustrate the aversion to loading
the data block name with too much information. Compare this with Brian's
other novel mechanism, the use of _pd_phase_id to point to specific data
block names:

data_pd_phase_id
    _name                      '_pd_phase_id'
    _category                    pd_phase   
    _type                        char
    _list                        yes
    _list_mandatory              yes
    _list_uniqueness           '_pd_phase_id'
    _list_link_child           '_pd_refln_phase_id'
    _definition
;              A code identifying each crystal phase contributing to the
               diffraction data. Crystallographic data for this phase, if
               stored in the current file, will be stored in a data_ block
               with a block code identical to this code.
;

So here is an example where a CIF Dictionary author is seeking to impose an
internal cross-referencing convention. It seems simple and sensible enough.
Is everyone happy with this?

I can see one problem, which is elaborated below in a different context. In
our debate on block code names, we allowed for the renaming of data blocks,
so long as this was accompanied by an audit trail of the past and current
names. It is difficult to allow this AND adopt Brian's convention, unless
the _pd_phase_id value is also changed to match the currently assigned block
code. Could be done, of course, but seems messy.
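
The difficulty can be made concrete with a small Python sketch. It assumes
(hypothetically) that a file has been read into a dictionary keyed by block
code, and that the audit trail has been reduced to a table mapping each old
block code to its successor. None of these names come from any dictionary.

```python
def resolve_phase_block(blocks, phase_id, renames=None):
    """Return the data block that a _pd_phase_id value points to.

    blocks  : dict mapping current block codes to block contents
    renames : dict mapping an old block code to the code that replaced it
              (hypothetical summary of the audit trail)
    Returns None if no block can be found.
    """
    renames = renames or {}
    code = phase_id
    seen = set()
    # Follow the chain of renames until a current block code is reached;
    # 'seen' guards against a circular audit trail.
    while code not in blocks and code in renames and code not in seen:
        seen.add(code)
        code = renames[code]
    return blocks.get(code)
```

Without the rename table, a _pd_phase_id of 'cc3' finds nothing once the
block has been renamed; with it, the pointer still resolves, at the cost of
every reader having to consult the audit trail.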

Back to Syd's review:

SRH] ---------------------------------------------------------------------------
SRH] From syd Sat Oct  5 16:16:20 1991
SRH] To: pre10@phoenix.cambridge.ac.uk
SRH] Subject: data block names
SRH] 
SRH] I will try to lay out as simply and as briefly as possible what the
SRH] block name requirements are. I hope to convince you that "the only
SRH] purpose of a data block name is to identify the data contained therein".
SRH] 
SRH] Local block names
SRH] 
SRH] At the *user* level (laboratory) the data name will probably (almost
SRH] certainly) be a simple structure code allocated by the CIF generating
SRH] software. For example, if the structure code is 'cc3' the program CIFIO
SRH] in Xtal will generate a CIF 'cc3.cif' containing a single data block headed
SRH] by 'data_cc3'. In the laboratory it is likely that other cif-type archive 
SRH] files, such as 'cc3.arc', will be generated at the same time with much
SRH] more information than in 'cc3.cif' (which contains only those data items
SRH] required by Acta C). So the code 'cc3' is a very important identifier to
SRH] the user and for the laboratory archiving system -- it has been allocated 
SRH] by the laboratory for their own special local reasons!

So the creator of a CIF is not bound to any rules for data block naming.

SRH] If the single structure 'cc3' is to be submitted to a journal as a CIF, it 
SRH] will probably be sent by email as a single file headed by 'data_cc3'.
SRH] This data block can contain the data AND the manuscript (most test CIF's
SRH] sent to Acta so far have been in this form). It IS possible that the user
SRH] may separate the data from manuscript -- in which case there may be two
SRH] data blocks in the email headed by 'data_cc3' and 'data_manuscript'.

We deliberately chose not to impose this structure on authors (acting in the
experimental spirit of "accept the problems raised by the great generality of
CIF"). It means we need to use some neat tricks to generate our papers, since
we don't know in advance the structure of the file. This shows that you don't
always need to have strict rules on such things - yet admittedly it would
make things easier if we knew that the text of the paper would ALWAYS be in
"data_manuscript".

SRH] If the structure 'cc3' is submitted to Acta for joint publication with 
SRH] another related structure 'cc7', then the most common practice will be
SRH] to submit this as three data blocks 'data_manuscript', 'data_cc3' and
SRH] 'data_cc7'. The information in the first block will tie together the two
SRH] structures but it is quite possible that data in 'data_cc3' and 'data_cc7'
SRH] will NOT reference each other (why should they?).
SRH] 
SRH] If you asked the user to systematize the block names, what would he/she do?
SRH] In the above case one might suggest 'data_cc3/cc7_manuscript' and
SRH] 'data_cc3/cc7_structure_1_of_2' etc. but this isn't very sensible  -- and
SRH] what happens when there are twenty related structures? The user is not
SRH] interested in this; he doesn't want to do it; he doesn't know how to do it
SRH] and he won't do it (the she's won't either!). The user is only interested
SRH] in the fact that these structures are archived as 'cc3' and 'cc7'. 
SRH] 
SRH] Journal block names
SRH] 
SRH] The journal receives the above manuscript/cc3/cc7 email submission. It 
SRH] will be stored initially as a single file. The journal must assign its own 
SRH] internal filename. The Co-editor has not seen it yet so there is not a 
SRH] Co-editor code at this point in time. 
SRH] 
SRH] Clearly it is at the journal stage that systematization of the CIF
SRH] identification is needed. It is not crystal clear to me that this need
SRH] be at the 'block name' as well as the 'filename' level. Probably both, but
SRH] my guess is that only experience will tell. Remember also that in the
SRH] above scenario that hardcopy and CIF(s) are yet to go to the Co-editor.
SRH] The Co-editor will probably store the CIF rec'd by email as a single file
SRH] named according to the Co-editor code (e.g. 'hl0012.25may91'). It is 
SRH] quite probable that the author(s) will be requested to make changes to 
SRH] the submitted material. For example there are errors in the primary
SRH] data of 'cc7' which necessitate the regeneration and resubmission of the
SRH] 'data_cc7' data block. This must go to Chester directly for checking and
SRH] be merged with their existing files.
SRH] 
SRH] So far there have been at least three levels of processing a set of data
SRH] blocks -- the user, the journal staff and the Co-editor. At some stage
SRH] it is likely that back references of these files will need to be accessed.
SRH] It is also obvious that inter-dependencies of data blocks, such as that
SRH] of 'data_manuscript' on the data blocks 'data_cc3' and 'data_cc7', must be
SRH] recorded within the linking data block (e.g. 'data_manuscript'). 
SRH] 
SRH] Database block names
SRH] 
SRH] Finally we get to the archiving of this data. This again will happen at 
SRH] several levels. The journals will archive the submitted CIF's, and the 
SRH] databases will archive and distribute subsets of this data. I am 
SRH] certain that this will involve different file and data naming procedures.
SRH] This is primarily because it is at this stage that data blocks will need
SRH] to be concatenated into multi-block files. It is here that the systemat-
SRH] ization of the block codes will be crucial -- and it is here that most of 
SRH] my suggestions refer. Quite frankly I think these naming conventions are 
SRH] strictly the business of the archiver -- but with the proviso that each
SRH] data block contains an up-to-date record of this and past block names
SRH] and file names.
SRH] 
SRH] 
SRH] This is my brief synopsis of the naming requirements. I DO understand that
SRH] the matter of systematization is absolutely crucial to CCDC, as it will 
SRH] be to other databases and to the journals. I am again certain that each of 
SRH] these organisations will adopt their own naming conventions and we are
SRH] wasting our time if we think we can impose conventions at this level.
SRH] 
SRH] What we can do is supply mechanisms that encourage each level of CIF
SRH] processing to record the current archiving procedure. In this way EACH
SRH] data block will retain a record of what it was called (i.e. the block
SRH] code) and where it was stored (i.e. the filename) for the purpose of back
SRH] referencing.
SRH] 
SRH] So I suggest that the following are essential:
SRH] 
SRH] (1) That data blocks that have dependencies on other data blocks (e.g.
SRH]     'data_manuscript' blocks) contain data items that explicitly
SRH]     specify this dependency. E.g.
SRH] 
SRH]     loop_
SRH]         _audit_linked_data_block
SRH]         _audit_linked_data_block_comments   
SRH]   
SRH]        cc3  'structure xyz related to pqr referred to manuscript'
SRH]        cc7  'structure pqr related to xyz referred to manuscript'
SRH] 
SRH]     Note that neither 'data_cc3' nor 'data_cc7' data blocks need contain
SRH]     these data items.
SRH] 
SRH] (2) That each data block should record its current block code and filename.
SRH]     Note that this is strictly for 'back-reference' purposes. This could
SRH]     be done within the _audit_update_record but it is more likely to be
SRH]     used if there are specific data items for this purpose. More important,
SRH]     specific data items will make the automatic appending of the data much
SRH]     easier for generating software, such as CIFIO and CIFER. This is how I
SRH]     would like to see it done.
SRH] 
SRH]     loop_
SRH]         _audit_data_block_entry
SRH]         _audit_data_block_name
SRH]         _audit_data_block_file
SRH]         _audit_data_block_locality
SRH]         _audit_data_block_comments
SRH] 
SRH] 91:03:17  cc3       cc3.cif         RSC,ANU,Canb. 'generated by CIFIO'
SRH] 91:03:26  cc3       AC5533          Acta,Chester  'with manuscript and cc7'
SRH] 91:03:28  cc3       AC5533.cc3      Acta,Chester  'data check completed'   
SRH] 91:04:02  cc3       HL0012.02apr91  Hall,Perth    'with manuscript & cc7'  
SRH] 91:06:12  cc3       AC5533.pub      Acta,Chester  'paper accepted'   
SRH] 91:06:28  C910477   ACsep91         Acta,Chester  'structure archived'
SRH] 91:09:03  C910477   acta.oct91      CCDC,Cambr.   'acta data rec'd'    
SRH] 91:09:23  B77-8854  CCDC91-6.4      CCDC,Cambr.   'data checked & archived'
SRH] 
SRH] OK, I hope that this helps us converge quickly on a decision about this. We
SRH] all need to get on with the business of actually generating and processing
SRH] these files. 
SRH] 
SRH] Cheers, Syd.
SRH] ----------------------------------------------------------------------
...
SRH] The idea of referencing data blocks from within data blocks was raised 
SRH] before by Brian McMahon for the IUCr archives. 
SRH] 
SRH] There appear to be two ways to go about 'linking' related data blocks.
SRH] 
SRH] 1) The first is to define a hierarchical block code that identifies the
SRH] relationship between data blocks. Examples of these are:
SRH] 
SRH]       data_HA543_manuscript_only
SRH]       data_HA543_structure_1_of_2
SRH]       data_HA543_structure_2_of_2
SRH] 
SRH] There may also be some good reasons to systematise the structure code (e.g.
SRH] 'HA543') so that it relates to the publication, or to the order of
SRH] archiving. For example Chester may take the incoming CIF's and
SRH] systematise the data block codes to a scheme of its own. E.g.
SRH] 
SRH]       data_ACC_91_7_manuscript     #<< Acta C 1991 paper 7
SRH]       data_ACC_91_7_data_1_of_2    #<< Acta C 1991 paper 7
SRH]       data_ACC_91_7_data_2_of_2    #<< Acta C 1991 paper 7
SRH] or
SRH]       data_IUCr_91_785_manuscript     #<< IUCr archived data no. 785 1991
SRH]       data_IUCr_91_785_data_1_of_2    #<< IUCr archived data no. 785 1991
SRH]       data_IUCr_91_785_data_2_of_2    #<< IUCr archived data no. 785 1991
SRH] 
SRH] 2) The second approach is to specify the links between data blocks within
SRH] the data. The most obvious place to do this is in the audit section. E.g.
SRH] 
SRH]       loop_
SRH]           _audit_linked_data_block
SRH]           _audit_linked_data_block_description
SRH] 
SRH]       HA543_structure_1_of_2    'structure HA543 Ag derivative'
SRH]       HA543_structure_2_of_2    'structure HA543 Au derivative'
SRH] 
SRH] This information could appear in each data block or, possibly, only in the 
SRH] leading one (thus signaling that further related data blocks follow).

There is a bit of repetition in this, but I think it demonstrates the lines
of approach. Phil Bourne's response to this was fairly positive:

PEB] There are 3 immediate problems:
PEB]    (i) Associating data blocks either of related data or of structure 
PEB]        data to data on standard and non-standard groups
PEB]   (ii) Associating data blocks with real file names
PEB]  (iii) Defining what dictionaries the data block conforms to
PEB] 
PEB] (i) and (ii) are of immediate concern to us! Let's take one at a time.. 
PEB] 
PEB] ** (i) I would suggest that both of your previous suggestions be adopted...
PEB] 
PEB] First
PEB] -----
PEB] The first is to define a hierarchical block code that identifies the
PEB] relationship between data blocks. Examples of these are:
PEB] [this is a local thing and should be the policy of those responsible for
PEB]  the archive]
PEB] 
PEB]       data_HA543_manuscript_only
PEB]       data_HA543_structure_1_of_2
PEB]       data_HA543_structure_2_of_2
PEB]  ...
PEB] 
PEB] Second
PEB] ------
PEB] The second approach is to specify the links between data blocks within
PEB] the data. The most obvious place to do this is in the audit section. E.g.
PEB] 
PEB]       loop_
PEB]           _audit_linked_data_block
PEB]           _audit_linked_data_block_description
PEB] 
PEB]       HA543_structure_1_of_2    'structure HA543 Ag derivative'
PEB]       HA543_structure_2_of_2    'structure HA543 Au derivative'
PEB] 
PEB] ** (ii)
PEB] 
PEB] The data blocks still need to be associated with specific files and
PEB] additional details about that file eg who maintains it.. 
PEB] 
PEB] This could be done along the lines you suggested:
PEB] 
PEB]     loop_
PEB]         _audit_data_block_date
PEB]         _audit_data_block_name
PEB]         _audit_data_block_file
PEB]         _audit_data_block_locality
PEB]         _audit_data_block_special_details
PEB] 
PEB]    93:03:17   standards   pdb.std     pdb    'CIF file of standard groups'

Syd's remarks at the end of this review are also worth quoting:

SRH] Well, that's some of the history, warts and all!  An interesting footnote
SRH] to all of this, is that none of these data items were put forward for
SRH] eventual inclusion in the Core '91 definitions. Perhaps they are needed,
SRH] and perhaps they are not. The above discussions may aid in reaching a 
SRH] conclusion on this without too much repetition.

So, should the _audit_data_block_ names now go in the Core Dictionary? The
_audit_linked_ ones?  

Should we insist that data block names can NEVER be changed? (A corollary to
this is that files could not be concatenated, since block names may not recur
within a file - and that's not fixable by an "unglobal_" type solution!)

D11.2 File names
----------------
How to refer from within a CIF to another file? File naming and location are
not portable, so it's very tricky to devise a useful method of doing this.

Why should you want to? In Brian T.'s powder dictionary, _pd_dataset_id (which
was described above) can be used to catalogue the file; and one would wish to
retrieve the file based on this catalogue identifier. I suspect this is
possible only through local conventions (or conventions shared within
common-interest groups) - a table would relate the dataset_id's to filenames
on the local system, thus:
    Si-std|D500#1234-987|B.Toby|91-15-09|16:54    /usr/data/std/blah.blah
    SEPD7234|IPNS-SEPD|B.Toby|91-15-09|16:54      /usr/data/sep/bleh.bleh
and any relevant applications software needs to know how to do the lookup.
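
As a rough Python sketch of such a lookup (the function names are
hypothetical, and the table is assumed to hold one ID and one file name per
line, with no blanks in the file name):

```python
def build_catalogue(table_text):
    """Map dataset IDs to local file names, ignoring case in the ID."""
    catalogue = {}
    for line in table_text.splitlines():
        if line.strip():
            # IDs may not contain blanks, so the first gap ends the ID.
            dataset_id, path = line.split()
            catalogue[dataset_id.casefold()] = path
    return catalogue


def locate(catalogue, dataset_id):
    """Return the local file name for a dataset ID, or None if unknown."""
    return catalogue.get(dataset_id.casefold())
```

The case folding follows the recommendation in the _pd_dataset_id definition
that searches for dataset IDs be case insensitive.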

However, I can see two cases where the facility to point to external
"standard" filenames would be important: in identifying dictionaries against
which to test the validity of data items, and to point to keywords or
standard data files. Hence, we might have a data item "_conformance" which
indicates which CIF dictionaries should be read for validation, thus:
     loop_  _conformance   cifdic.C93   cifdic.M94   cifdic.mylab

How does the validation software know where to find cifdic.C93 etc? There
seems to be some similarity to modern compilers, where a preprocessor reads a
line such as "#include <stdio.h>" and knows from the angle brackets to look
in a standard directory (/usr/include, or whatever has been defined as
appropriate for the system on which the compiler resides). Shall we introduce
the convention that <cifdic.C93> means "the file named cifdic.C93 in the
standard directory for storing CIF defining data as established by the system
manager"? And perhaps, in the macromolecular dictionary entry for 
data_struct_keywords, there might be a convention such as 
     _enumeration  @<mmstruct.kwd>
where the "@" (or some other suitable symbol) has the same function as the
"#include" statement to the compiler - i.e. "read in at this point and expand
the contents of the file... (mmstruct.kwd in the standard directory)".
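
The angle-bracket convention could be resolved along these lines. This is a
Python sketch under stated assumptions: the function name is hypothetical,
the standard directory is simply passed in by the caller, and a name without
angle brackets is returned unchanged for local handling.

```python
import os
import re

# A reference of the form <name>, where name holds no path separators.
_STANDARD = re.compile(r"<([^<>/\\]+)>")


def resolve_reference(ref, standard_dir):
    """Turn '<name>' into a path under standard_dir.

    Any other reference is returned as given, leaving its interpretation
    to local convention.
    """
    m = _STANDARD.fullmatch(ref)
    if m:
        return os.path.join(standard_dir, m.group(1))
    return ref
```

Where the standard directory lives is left to the system manager, exactly as
with the compiler's include path.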

This latter suggestion has certain formal difficulties. If the file contains
more than a single term, the enumeration should be looped. So
"loop_  _enumeration  @<mmstruct.kwd>"  is indicated (and presumably is OK
even for a degenerate loop with only a single term present). But what if the
file contains _enumeration's and _enumeration_detail's?
     "loop_  _enumeration   _enumeration_detail    @<mmstruct.kwd>"
may be correct when the contents of the file are expanded, but is parsed
without file expansion as <loop_> <data name> <data name> <one char term only>
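
The mismatch is easy to demonstrate. In the Python sketch below the function
names are hypothetical, and the file contents are supplied as a dictionary
standing in for whatever the real keyword file would hold.

```python
def loop_is_well_formed(names, values):
    """A loop is consistent when its values fill a whole number of rows."""
    return len(values) > 0 and len(values) % len(names) == 0


def expand(values, files):
    """Replace each '@<name>' value with the values listed for that file.

    files : dict mapping a file name to its list of values
            (a stand-in for reading the file from the standard directory)
    """
    out = []
    for v in values:
        if v.startswith("@<") and v.endswith(">"):
            out.extend(files[v[2:-1]])
        else:
            out.append(v)
    return out
```

With the two data names _enumeration and _enumeration_detail and the single
value @<mmstruct.kwd>, loop_is_well_formed() fails before expansion and
succeeds afterwards only if the file supplies its values in pairs. A parser
that checks loops before expanding references would therefore reject the
construction.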

Paula: have you thought in detail about how you want to implement this?


Regards
Brian