(11) Restraints; naming data blocks and external files
- To: COMCIFS@uk.ac.iucr
- Subject: (11) Restraints; naming data blocks and external files
- From: bm@uk.ac.iucr (Brian McMahon)
- Date: Mon, 8 Nov 93 11:10:47 GMT
Dear Colleagues

Please forgive the recent flood of circulars from here. The CIFtools workshop
seemed a useful spur for placing these items on the discussion table, but
there is no particular urgency implied in forwarding them so thick and fast.

D4.1 Restraints
---------------
In response to my query to George Sheldrick in (8)D4.1 ("would it be possible
in principle to devise ... descriptors ... [to] enable a user to repeat the
refinement"), George has answered thus:

G> The answer is "I'm not sure", but the result would be much
G> less intuitive and elegant than the solution
G> based on a text field called _shelx_input_deck. Quite frankly, I fail
G> to see the point of this torturous exercise; it would not enable the
G> refinement to be repeated with any other program, and it would involve
G> an enormous amount of effort to no useful purpose. You have to remember
G> that the SHELX input file has been specifically designed for the purpose,
G> and is subject to none of the constraints of CIF, in particular those
G> concerned with lists. The reason why I would like to be able to define
G> a few descriptors starting with _shelx_ is NOT for use by my own programs,
G> but so that information of general interest to databases and other
G> users of the resulting CIF files can be included. In particular I have
G> in mind the summary of the restraints, which is present (as text) in most
G> recent PDB entries, but is to some extent specific to the refinement
G> program which has been used. This implies that there needs to be a
G> procedure for registering descriptors such as '_shelx_', '_tnt_',
G> '_xplor_' etc.; '_local_' would be of no use to me.

The question was prompted by the increasing number of suggestions that CIFs
may be used within databases, molecular modelling programs, and so forth - an
extreme interpretation of its "universality".
I was interested in the feasibility of using it as a program-specific
worksheet, but readily agree that this is not in general a useful path to
follow. The consensus seems to be that Paula's approach to handling this
problem in the macromolecular dictionary is appropriate and useful.

======

In my review of the Tarrytown workshop, and in other places, I have already
alluded to the "name space" problem. The problem is to enable the retrieval
of data in CIF format from the crystallographic universe of knowledge. There
are two aspects to this - providing pointers to other data blocks within the
same file, and locating other files.

There have already been extensive discussions between several of us on this
matter (Syd, for instance, need read no further!). In one sense, there is no
real problem: naming conventions may be established to suit local
requirements. The question is whether an "official" convention should be
established. If so, this Committee should, presumably, preside over such
conventions. Brian Toby has asked me to raise this matter (see his remarks
below), and the mmCIF'ers are quite agitated over it.

The intra-file problem might already be resolved: please read section D11.1
carefully, and contribute your thoughts on the suggestions contained therein.
The inter-file case hasn't been discussed so thoroughly, but obviously has
many points of similarity.

I let Brian Toby introduce the discussion:

B> Also under the category of DDL work. I mentioned at the MM meeting
B> that the current PD dictionary has two types of pointers that may have
B> general applicability:
B>   _pd_phase_id is used to point from one data_ block to another.
B>   _dataset_id  is used to point from one CIF (file) to another and
B>                has a standardized definition to avoid "name space"
B>                issues.
B>
B> The MM folks I spoke with felt that this will be generally useful and
B> should be defined in the DDL.

D11.1 Naming of data blocks
---------------------------
The discussions on this were well summarised by a lengthy message from Syd to
Phil Bourne (architect of the Tarrytown meeting and a member of Paula's mmCIF
working group) earlier this year, which I reproduce below with just a few
additional comments.

SRH] Discussions on the naming of data blocks go back to 1989! Much much water
SRH] has flowed under this bridge. The final conclusions were that this is an
SRH] archival or application consideration. PDB will handle this naming in the
SRH] way that best suits the PDB; the same applies to the CCDC or the IUCr (even
SRH] though the same data blocks may be involved). Provided that the application
SRH] knows whence the data came, then the rules of data block naming will be
SRH] understood.

Our interest here is whether we should (if we can) define conventions that
IUCr, PDB etc will all adopt, so that you can guarantee that the application
will understand the data block naming.

SRH] Of course, it is quite possible that IUCr, PDB, ICDD and CCDC will agree on
SRH] a data block naming construction -- but I wouldn't hold my breath! My
SRH] general inclination is to keep the data block name brief with a clearly
SRH] recognisable code (e.g. PDB_93_A_83) and embed the detailed information
SRH] within the data block.
SRH] Here is an example of how the powder people want to handle the internal id.
SRH]
SRH] data_pd_dataset_id
SRH]     _name        '_pd_dataset_id'
SRH]     _type        char
SRH]     _list        both
SRH]     loop_ _example
SRH]                  Si-std|D500#1234-987|B.Toby|91-15-09|16:54
SRH]                  SEPD7234|IPNS-SEPD|B.Toby|91-15-09|16:54
SRH]     _definition
SRH] ; Used to assign a unique character string to a set of data.
SRH]   This code is assigned by the originator of the dataset and may
SRH]   be used to catalogue the dataset.
SRH]
SRH]   Since CIF's will be modified, additional dataset ID's
SRH]   should be assigned to differentiate the revised CIF from
SRH]   the original. The previous and new ID's can then be looped.
SRH]   The original dataset ID should always be retained, but
SRH]   if there are multiple revisions, it is not necessary that
SRH]   ID's for all intermediate revisions be retained.
SRH]
SRH]   The format for the ID code is:
SRH]     <sample_id>|<instr._name>|<creator_name>|<date>|<time>
SRH]
SRH]   <sample_id>    is an arbitrary name assigned by the
SRH]                  originator of the initial dataset.
SRH]   <instr._name>  is a unique name [so far as possible] for
SRH]                  the data collection instrument, preferably
SRH]                  containing the instrument serial number for
SRH]                  commercial instruments.
SRH]   <creator_name> is the name of the person who collected the
SRH]                  diffraction data or who prepared the CIF.
SRH]   <date>         is the date the CIF was created or modified.
SRH]   <time>         is the time the CIF was created or modified.
SRH]
SRH]   As new names are assigned to the CIF, the original <sample_id>
SRH]   and <instr._name> should be retained, but the <creator_name>
SRH]   may be changed and the <date> and <time> will always change.
SRH]   The <date> and <time> will usually match either the
SRH]   _audit_creation_date or _audit_update_record entries.
SRH]
SRH]   Within each section of the code, the following characters
SRH]   may be used:
SRH]     A-Z a-z 0-9 # & * . : , - _ + / ( )
SRH]
SRH]   The sections are separated with vertical bars '|' which are
SRH]   not allowed within the sections. Blank spaces may also
SRH]   not be used. Capitalization may be used within the ID code
SRH]   but should not be considered significant -- searches for
SRH]   dataset ID names should be case insensitive.
SRH]
SRH]   Dates must follow the format 'yy-mm-dd' ...
SRH]   Times must have the format 'hh:mm' ...
SRH] ;

Note that this convention has been suggested to provide a mechanism for
pointing to files, and perhaps properly belongs to the discussion below.
However, I leave it here because it does illustrate the aversion to loading
the data block name with too much information.
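
The ID format quoted above can be sketched as a small checker. This is only an illustration under stated assumptions, not part of the powder dictionary: the names DatasetId, parse_dataset_id and same_dataset_id are hypothetical, empty sections are rejected (the definition does not say either way), and the date and time are checked for their two-digit shape only, because the rules elided by '...' in the definition are not reproduced here.

```python
import re
from typing import NamedTuple


class DatasetId(NamedTuple):
    """The five '|'-separated sections of a dataset ID (hypothetical type)."""
    sample_id: str
    instrument_name: str
    creator_name: str
    date: str
    time: str


# Characters the quoted definition allows within a section:
#   A-Z a-z 0-9 # & * . : , - _ + / ( )
_SECTION = re.compile(r"[A-Za-z0-9#&*.:,\-_+/()]+")
# Shape checks only; calendar validity is deliberately not tested.
_DATE = re.compile(r"\d{2}-\d{2}-\d{2}")   # 'yy-mm-dd'
_TIME = re.compile(r"\d{2}:\d{2}")         # 'hh:mm'


def parse_dataset_id(text: str) -> DatasetId:
    """Split a dataset ID into its sections, raising ValueError if malformed."""
    parts = text.split("|")
    if len(parts) != 5:
        raise ValueError(f"expected 5 sections, found {len(parts)}")
    for part in parts:
        # Assumption: an empty section is treated as an error.
        if not _SECTION.fullmatch(part):
            raise ValueError(f"illegal or empty section: {part!r}")
    if not _DATE.fullmatch(parts[3]):
        raise ValueError(f"date is not of the form yy-mm-dd: {parts[3]!r}")
    if not _TIME.fullmatch(parts[4]):
        raise ValueError(f"time is not of the form hh:mm: {parts[4]!r}")
    return DatasetId(*parts)


def same_dataset_id(a: str, b: str) -> bool:
    """Compare two IDs ignoring case, as the definition asks of searches."""
    return a.lower() == b.lower()
```

Because only the shape of the date is checked, the quoted example 'Si-std|D500#1234-987|B.Toby|91-15-09|16:54' is accepted as it stands; a stricter checker would need the full date rules from the dictionary.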

Compare this with Brian's other novel mechanism, the use of _pd_phase_id to
point to specific data block names:

data_pd_phase_id
    _name               '_pd_phase_id'
    _category           pd_phase
    _type               char
    _list               yes
    _list_mandatory     yes
    _list_uniqueness    '_pd_phase_id'
    _list_link_child    '_pd_refln_phase_id'
    _definition
; A code identifying each crystal phase contributing to the
  diffraction data. Crystallographic data for this phase, if
  stored in the current file, will be stored in a data_ block
  with a block code identical to this code.
;

So here is an example where a CIF Dictionary author is seeking to impose an
internal cross-referencing convention. It seems simple and sensible enough.
Is everyone happy with this? I can see one problem, which is elaborated below
in a different context. In our debate on block code names, we allowed for the
renaming of data blocks, so long as this was accompanied by an audit trail of
the past and current names. It is difficult to allow this AND adopt Brian's
convention, unless the _pd_phase_id value is also changed to match the
currently assigned block code. Could be done, of course, but seems messy.

Back to Syd's review:

SRH] ---------------------------------------------------------------------------
SRH] From syd Sat Oct 5 16:16:20 1991
SRH] To: pre10@phoenix.cambridge.ac.uk
SRH] Subject: data block names
SRH]
SRH] I will try to lay out as simply and as briefly as possible what the
SRH] block name requirements are. I hope to convince you that "the only
SRH] purpose of a data block name is to identify the data contained therein".
SRH]
SRH] Local block names
SRH]
SRH] At the *user* level (laboratory) the data name will probably (almost
SRH] certainly) be a simple structure code allocated by the CIF generating
SRH] software. For example, if the structure code is 'cc3' the program CIFIO
SRH] in Xtal will generate a CIF 'cc3.cif' containing a single data block headed
SRH] by 'data_cc3'.
SRH] In the laboratory it is likely that other cif-type archive
SRH] files, such as 'cc3.arc', will be generated at the same time with much
SRH] more information than in 'cc3.cif' (which contains only those data items
SRH] required by Acta C). So the code 'cc3' is a very important identifier to
SRH] the user and for the laboratory archiving system -- it has been allocated
SRH] by the laboratory for their own special local reasons!

So the creator of a CIF is not bound to any rules for data block naming.

SRH] If the single structure 'cc3' is to be submitted to a journal as a CIF, it
SRH] will probably be sent by email as a single file headed by 'data_cc3'.
SRH] This data block can contain the data AND the manuscript (most test CIF's
SRH] sent to Acta so far have been in this form). It IS possible that the user
SRH] may separate the data from manuscript -- in which case there may be two
SRH] data blocks in the email headed by 'data_cc3' and 'data_manuscript'.

We deliberately chose not to impose this structure on authors (acting in the
experimental spirit of "accept the problems raised by the great generality of
CIF"). It means we need to use some neat tricks to generate our papers, since
we don't know in advance the structure of the file. This shows that you don't
always need to have strict rules on such things - yet admittedly it would make
things easier if we knew that the text of the paper would ALWAYS be in
"data_manuscript".

SRH] If the structure 'cc3' is submitted to Acta for joint publication with
SRH] another related structure 'cc7', then the most common practice will be
SRH] to submit this as three data blocks 'data_manuscript', 'data_cc3' and
SRH] 'data_cc7'. The information in the first block will tie together the two
SRH] structures but it is quite possible that data in 'data_cc3' and 'data_cc7'
SRH] will NOT reference each other (why should they?).
SRH]
SRH] If you asked the user to systematize the block names, what would he/she do?
SRH] In the above case one might suggest 'data_cc3/cc7_manuscript' and
SRH] 'data_cc3/cc7_structure_1_of_2' etc. but this isn't very sensible -- and
SRH] what happens when there are twenty related structures? The user is not
SRH] interested in this; he doesn't want to do it; he doesn't know how to do it
SRH] and he won't do it (the she's won't either!). The user is only interested
SRH] in the fact that these structures are archived as 'cc3' and 'cc7'.
SRH]
SRH] Journal block names
SRH]
SRH] The journal receives the above manuscript/cc3/cc7 email submission. It
SRH] will be stored initially as a single file. The journal must assign its own
SRH] internal filename. The Co-editor has not seen it yet so there is not a
SRH] Co-editor code at this point in time.
SRH]
SRH] Clearly it is at the journal stage that systematization of the CIF
SRH] identification is needed. It is not crystal clear to me that this need
SRH] be at the 'block name' as well as the 'filename' level. Probably both, but
SRH] my guess is that only experience will tell. Remember also that in the
SRH] above scenario that hardcopy and CIF(s) are yet to go to the Co-editor.
SRH] The Co-editor will probably store the CIF rec'd by email as a single file
SRH] named according to the Co-editor code (e.g. 'hl0012.25may91'). It is
SRH] quite probable that the author(s) will be requested to make changes to
SRH] the submitted material. For example there are errors in the primary
SRH] data of 'cc7' which necessitate the regeneration and resubmission of the
SRH] 'data_cc7' data block. This must go to Chester directly for checking and
SRH] be merged with their existing files.
SRH]
SRH] So far there have been at least three levels of processing a set of data
SRH] blocks -- the user, the journal staff and the Co-editor. At some stage
SRH] it is likely that back references of these files will need to be accessed.
SRH] It is also obvious that inter-dependencies of data blocks, such as that
SRH] of 'data_manuscript' on the data blocks 'data_cc3' and 'data_cc7', must be
SRH] recorded within the linking data block (e.g. 'data_manuscript').
SRH]
SRH] Database block names
SRH]
SRH] Finally we get to the archiving of this data. This again will happen at
SRH] several levels. The journals will archive the submitted CIF's, and the
SRH] databases will archive and distribute subsets of this data. I am
SRH] certain that this will involve different file and data naming procedures.
SRH] This is primarily because it is at this stage that data blocks will need
SRH] to be concatenated into multi-block files. It is here that the
SRH] systematization of the block codes will be crucial -- and it is here that
SRH] most of my suggestions refer. Quite frankly I think these naming
SRH] conventions are strictly the business of the archiver -- but with the
SRH] proviso that each data block contains an up-to-date record of this and
SRH] past block names and file names.
SRH]
SRH]
SRH] This is my brief synopsis of the naming requirements. I DO understand that
SRH] the matter of systematization is absolutely crucial to CCDC, as it will
SRH] be to other databases and to the journals. I am again certain that each of
SRH] these organisations will adopt their own naming conventions and we are
SRH] wasting our time if we think we can impose conventions at this level.
SRH]
SRH] What we can do is supply mechanisms that encourage each level of CIF
SRH] processing to record the current archiving procedure. In this way EACH
SRH] data block will retain a record of what it was called (i.e. the block
SRH] code) and where it was stored (i.e. the filename) for the purpose of back
SRH] referencing.
SRH]
SRH] So I suggest that the following are essential:
SRH]
SRH] (1) That data blocks that have dependencies on other data blocks (e.g.
SRH]     'data_manuscript' blocks) contain data items that explicitly
SRH]     specify this dependency. E.g.
SRH]
SRH]     loop_
SRH]     _audit_linked_data_block
SRH]     _audit_linked_data_block_comments
SRH]
SRH]     cc3 'structure xyz related to pqr referred to manuscript'
SRH]     cc7 'structure pqr related to xyz referred to manuscript'
SRH]
SRH]     Note that neither 'data_cc3' nor 'data_cc7' data blocks need contain
SRH]     these data items.
SRH]
SRH] (2) That each data block should record its current block code and filename.
SRH]     Note that this is strictly for 'back-reference' purposes. This could
SRH]     be done within the _audit_update_record but it is more likely to be
SRH]     used if there are specific data items for this purpose. More important,
SRH]     specific data items will make the automatic appending of the data much
SRH]     easier for generating software, such as CIFIO and CIFER. This is how I
SRH]     would like to see it done.
SRH]
SRH]     loop_
SRH]     _audit_data_block_entry
SRH]     _audit_data_block_name
SRH]     _audit_data_block_file
SRH]     _audit_data_block_locality
SRH]     _audit_data_block_comments
SRH]
SRH]     91:03:17 cc3      cc3.cif        RSC,ANU,Canb. 'generated by CIFIO'
SRH]     91:03:26 cc3      AC5533         Acta,Chester  'with manuscript and cc7'
SRH]     91:03:28 cc3      AC5533.cc3     Acta,Chester  'data check completed'
SRH]     91:04:02 cc3      HL0012.02apr91 Hall,Perth    'with manuscript & cc7'
SRH]     91:06:12 cc3      AC5533.pub     Acta,Chester  'paper accepted'
SRH]     91:06:28 C910477  ACsep91        Acta,Chester  'structure archived'
SRH]     91:09:03 C910477  acta.oct91     CCDC,Cambr.   'acta data rec'd'
SRH]     91:09:23 B77-8854 CCDC91-6.4     CCDC,Cambr.   'data checked & archived'
SRH]
SRH] OK, I hope that this helps us converge quickly on a decision about this. We
SRH] all need to get on with the business of actually generating and processing
SRH] these files.
SRH]
SRH] Cheers, Syd.
SRH] ----------------------------------------------------------------------
...
SRH] The idea of referencing data blocks from within data blocks was raised
SRH] before by Brian McMahon for the IUCr archives.
SRH]
SRH] There appear to be two ways to go about 'linking' related data blocks.
SRH]
SRH] 1) The first is to define a hierarchical block code that identifies the
SRH] relationship between data blocks. Examples of these are:
SRH]
SRH]     data_HA543_manuscript_only
SRH]     data_HA543_structure_1_of_2
SRH]     data_HA543_structure_2_of_2
SRH]
SRH] There may also be some good reasons to systematise the structure code (e.g.
SRH] 'HA543') so that it relates to the publication, or to the order of
SRH] archiving. For example Chester may take the incoming CIF's and
SRH] systematise the data block codes to a scheme of its own. E.g.
SRH]
SRH]     data_ACC_91_7_manuscript       #<< Acta C 1991 paper 7
SRH]     data_ACC_91_7_data_1_of_2      #<< Acta C 1991 paper 7
SRH]     data_ACC_91_7_data_2_of_2      #<< Acta C 1991 paper 7
SRH] or
SRH]     data_IUCr_91_785_manuscript    #<< IUCr archived data no. 785 1991
SRH]     data_IUCr_91_785_data_1_of_2   #<< IUCr archived data no. 785 1991
SRH]     data_IUCr_91_785_data_2_of_2   #<< IUCr archived data no. 785 1991
SRH]
SRH] 2) The second approach is to specify the links between data blocks within
SRH] the data. The most obvious place to do this is in the audit section. E.g.
SRH]
SRH]     loop_
SRH]     _audit_linked_data_block
SRH]     _audit_linked_data_block_description
SRH]
SRH]     HA543_structure_1_of_2 'structure HA543 Ag derivative'
SRH]     HA543_structure_2_of_2 'structure HA543 Au derivative'
SRH]
SRH] This information could appear in each data block or, possibly, only in the
SRH] leading one (thus signaling that further related data blocks follow).

There is a bit of repetition in this, but I think it demonstrates the lines
of approach.
Phil Bourne's response to this was fairly positive:

PEB] There are 3 immediate problems:
PEB] (i) Associating data blocks either of related data or of structure
PEB]     data to data on standard and non-standard groups
PEB] (ii) Associating data blocks with real file names
PEB] (iii) Defining what dictionaries the data block conforms to
PEB]
PEB] (i) and (ii) are of immediate concern to us! Let's take one at a time..
PEB]
PEB] ** (i) I would suggest that both of your previous suggestions be adopted...
PEB]
PEB] First
PEB] -----
PEB] The first is to define a hierarchical block code that identifies the
PEB] relationship between data blocks. Examples of these are:
PEB] [this is a local thing and should be the policy of those responsible for
PEB] the archive]
PEB]
PEB]     data_HA543_manuscript_only
PEB]     data_HA543_structure_1_of_2
PEB]     data_HA543_structure_2_of_2
PEB]     ...
PEB]
PEB] Second
PEB] ------
PEB] The second approach is to specify the links between data blocks within
PEB] the data. The most obvious place to do this is in the audit section. E.g.
PEB]
PEB]     loop_
PEB]     _audit_linked_data_block
PEB]     _audit_linked_data_block_description
PEB]
PEB]     HA543_structure_1_of_2 'structure HA543 Ag derivative'
PEB]     HA543_structure_2_of_2 'structure HA543 Au derivative'
PEB]
PEB] ** (ii)
PEB]
PEB] The data blocks still need to be associated with specific files and
PEB] additional details about that file eg who maintains it..
PEB]
PEB] This could be done along the lines you suggested:
PEB]
PEB]     loop_
PEB]     _audit_data_block_date
PEB]     _audit_data_block_name
PEB]     _audit_data_block_file
PEB]     _audit_data_block_locality
PEB]     _audit_data_block_special_details
PEB]
PEB]     93:03:17 standards pdb.std pdb 'CIF file of standard groups'

Syd's remarks at the end of this review are also worth quoting:

SRH] Well, that's some of the history, warts and all!
SRH] An interesting footnote
SRH] to all of this, is that none of these data items were put forward for
SRH] eventual inclusion in the Core '91 definitions. Perhaps they are needed,
SRH] and perhaps they are not. The above discussions may aid in reaching a
SRH] conclusion on this without too much repetition.

So, should the _audit_data_block_ names now go in the Core Dictionary? The
_audit_linked_ ones? Should we insist that data block names can NEVER be
changed? (A corollary to this is that files could not be concatenated, since
block names may not recur within a file - and that's not fixable by an
"unglobal_" type solution!)

D11.2 File names
----------------
How to refer from within a CIF to another file? File naming and location are
not portable, so it's very tricky to devise a useful method of doing this.
Why should you want to? In Brian T.'s powder dictionary, _pd_dataset_id
(which was described above) can be used to catalogue the file; and one would
wish to retrieve the file based on this catalogue identifier. I suspect this
is possible only through local conventions (or conventions shared within
common-interest groups) - a table would relate the dataset_id's to filenames
on the local system, thus:

    Si-std|D500#1234-987|B.Toby|91-15-09|16:54    /usr/data/std/blah.blah
    SEPD7234|IPNS-SEPD|B.Toby|91-15-09|16:54      /usr/data/sep/bleh.bleh

and any relevant applications software needs to know how to do the lookup.

However, I can see two cases where the facility to point to external
"standard" filenames would be important: in identifying dictionaries against
which to test the validity of data items, and to point to keywords or
standard data files. Hence, we might have a data item "_conformance" which
indicates which CIF dictionaries should be read for validation, thus:

    loop_ _conformance
          cifdic.C93
          cifdic.M94
          cifdic.mylab

How does the validation software know where to find cifdic.C93 etc?
There seems to be some similarity to modern compilers, where a preprocessor
reads a line such as "#include <stdio.h>" and knows from the angle brackets
to look in a standard directory (/usr/include, or whatever has been defined
as appropriate for the system on which the compiler resides). Shall we
introduce the convention that <cifdic.C93> means "the file named cifdic.C93
in the standard directory for storing CIF defining data as established by the
system manager"?

And perhaps, in the macromolecular dictionary entry for data_struct_keywords,
there might be a convention such as

    _enumeration @<mmstruct.kwd>

where the "@" (or some other suitable symbol) has the same function as the
"#include" statement to the compiler - i.e. "read in at this point and expand
the contents of the file... (mmstruct.kwd in the standard directory)".

This latter suggestion has certain formal difficulties. If the file contains
more than a single term, the enumeration should be looped. So

    loop_ _enumeration @<mmstruct.kwd>

is indicated (and presumably is OK even for a degenerate loop with only a
single term present). But what if the file contains _enumeration's and
_enumeration_detail's?

    loop_ _enumeration _enumeration_detail @<mmstruct.kwd>

may be correct when the contents of the file are expanded, but is parsed
without file expansion as

    <loop_> <data name> <data name> <one char term only>

Paula: have you thought in detail about how you want to implement this?

Regards
Brian
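
The angle-bracket convention floated in D11.2 could be resolved along the following lines. This is a minimal sketch under stated assumptions, not an agreed mechanism: the function names, the example directory /usr/local/lib/cif and the rule that an unbracketed name is looked up in a local directory are all hypothetical, and POSIX-style paths are assumed throughout.

```python
import posixpath
import re

# Matches a name written as <name>, by analogy with '#include <stdio.h>'.
_BRACKETED = re.compile(r"<([^<>]+)>")


def split_include(token: str):
    """Return (is_include, reference) for a token that may start with '@'.

    '@' is the symbol suggested in the message for "read in and expand the
    contents of the file at this point".
    """
    if token.startswith("@"):
        return True, token[1:]
    return False, token


def resolve_reference(ref: str, standard_dir: str, local_dir: str = ".") -> str:
    """Map a file reference to a path.

    <name> -> name inside standard_dir (set by the system manager)
    name   -> name inside local_dir (assumption: not stated in the message)
    """
    match = _BRACKETED.fullmatch(ref)
    if match:
        return posixpath.join(standard_dir, match.group(1))
    return posixpath.join(local_dir, ref)
```

For example, with a standard directory of /usr/local/lib/cif, the reference <cifdic.C93> would resolve to /usr/local/lib/cif/cifdic.C93, while cifdic.mylab would be sought in the local directory. The sketch says nothing about the parsing difficulty with looped enumerations raised above; it only covers locating the file.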