(49) Remaining issues: block id, disorder, diffrn
- To: COMCIFS@iucr.ac.uk
- Subject: (49) Remaining issues: block id, disorder, diffrn
- From: bm
- Date: Wed, 25 Sep 1996 14:00:08 +0100
Dear Colleagues

I shall send out in an accompanying email the latest working draft of the
extended core dictionary, in which I have tried to pull together all the
currently open threads. There are one or two areas where I am still not
happy (which I discuss below); but I would like you to consider the draft
as complete except for these areas, and study it with a view to giving your
formal approval.

What I intend to do is send another circular next week, summarising any
discussions on the problems that are still unresolved, and addressing any
structural problems that have come to light in the interim. I shall then ask
you to submit your formal vote of approval at the end of next week; so it is
important that we resolve all current problems to our satisfaction in that
time.

Here are my remaining reservations:

D44.2 Disorder
--------------
I promised that I would have some discussions with John Davies on the
_atom_site_disorder_assembly and *_group items, and I have now done so.

Recall the intent behind these items (and here I correct a slip of the
keyboard in the example I posted in circular 44: the *_group and *_assembly
lines in the loop header should be reversed, so that the example is now as
given below). Where a disordered cluster of atoms (identified by a unique
value of _atom_site_disorder_assembly) can be represented by two (or more)
alternative conformations, the members of each separate conformation are
assigned the same value of _atom_site_disorder_group.

    loop_
    _atom_site_label              # *_assembly 'M' is a disordered methyl
    _atom_site_occupancy          # with configurations 'A' and 'B':
    _atom_site_disorder_assembly  #
    _atom_site_disorder_group     #    H11B     H11A     H13B
                                  #        .      |      .
      C1    1    .  .             #          .    |    .
      H11A  .5   M  A             #            .  |  .
      H12A  .5   M  A             #   C1 --------C2---
      H13A  .5   M  A             #            /  .  \
      H11B  .5   M  B             #          /    .    \
      H12B  .5   M  B             #        /      .      \
      H13B  .5   M  B             #    H12A     H12B     H13A

John points out that there are two approaches to describing disorder in the
crystal - that of the crystallographer and that of the chemist.

The crystallographer wishes to interpret a density map, and ends up with a
list of sites in the unit cell that are occupied by clumps of electron
density. He identifies them with atoms of one or more chemical species (a
particular site can be occupied by silver or silicon atoms in a non-periodic
fashion); and with complete or partial occupancy. So long as he can identify
all populated sites in this way, he is content.

The chemist wishes to identify sensibly-bonded chemical entities in the
cell, and so will interpret the contents as a superposition of distinct
moieties, each such moiety having some occupancy.

Consider the following hypothetical molecule. In the diagram below, suppose
the group of atoms labelled as 1-2-3-4 is sometimes disordered to the
conformation 1-2A-3A-4, and sometimes (but less frequently) to 1-2-3B-4.
Atoms 1 and 4 have 100% occupancy, atoms 2, 2A and 3A have 50% occupancy,
atom 3 40% occupancy and atom 3B 10% occupancy (so that the 1-2-3B-4
conformation occurs 10% of the time, and 1-2-3-4 40% of the time).

                       3B
                      /  \
                     /    \
                    /      \
    ---1------2-----3-----4---
        \                /
         \              /
          2A -------- 3A

This can be described in a CIF with the data names that refer to the
positions and occupancies, and one can even add detailed text descriptions
to help a human reader understand what is going on:

    loop_
    _atom_site_label
    _atom_site_occupancy
    _atom_site_description
      1   1.0 'ordered site, forms with 4 the static nodes of a disordered chain'
      2   0.5 'site in disordered chain 1-2-3-4, bonding to 3 or 3B'
      3   0.4 'site in disordered chain 1-2-3-4, split with site 3B'
      4   1.0 'ordered site, forms with 1 the static nodes of a disordered chain'
      2A  0.5 ...etc
      3A  0.5
      3B  0.1

but _atom_site_description is not of much use to a software application
seeking to extract the contents of one moiety.

The chemist's problem is resolved by labelling the three distinct moieties.
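As a check on the arithmetic, the site occupancies quoted above can be recovered by summing the populations of the conformations that contain each site. The following is only a minimal Python sketch: the names `conformers` and `site_occupancies` are illustrative, and the 50% population of 1-2A-3A-4 is an assumption inferred from the occupancies of 2A and 3A rather than stated outright.

```python
# Conformer populations in percent. The 40% and 10% figures are quoted in
# the text; the 50% for 1-2A-3A-4 is inferred from the 2A/3A occupancies.
conformers = {
    ("1", "2", "3", "4"): 40,
    ("1", "2A", "3A", "4"): 50,
    ("1", "2", "3B", "4"): 10,
}


def site_occupancies(conformers):
    """Sum, for each site, the populations of the conformers containing it."""
    totals = {}
    for sites, percent in conformers.items():
        for site in sites:
            totals[site] = totals.get(site, 0) + percent
    return totals


occupancy = site_occupancies(conformers)
for site in ("1", "2", "3", "4", "2A", "3A", "3B"):
    # Integer percentages avoid floating-point noise in the sums.
    print(site, occupancy[site] / 100)
```

Site 2 comes out at 50% because it belongs to both the 40% and the 10% conformations, which is the kind of bookkeeping a moiety-based description makes explicit.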
In CIF we could set up another table which encodes the following
information:

    Name of moiety     Contents
         A             1, 2, 3, 4   (+other atoms in the ordered part
                                      of the molecule)
         B             1, 2A, 3A, 4          "
         C             1, 2, 3B, 4           "

One could then go on to build descriptions of the moieties and their
relative abundance:

    loop_
    _moiety_name
    _moiety_occupancy
    _moiety_description
      A   .4   'disordered conformation of main compound'
      B   .5   'disordered conformation of main compound'
      C   .1   'disordered conformation of main compound'
      D  1.0   'solvate'

Because of the structure of CIF (which isn't very tolerant of an
indeterminate number of values for an item), the "name of moiety" table
would be rather verbose, but presumably easy enough to write from an
application that allowed the user to input chemical knowledge in
disentangling the disordered components.

In principle one could carry back into the _atom_site_ list the details of
which moiety or moieties each occupied site was associated with (that is,
instead of building a "name of moiety"/"contents" table), but again there
are some technical problems in doing this in CIF, related to the fact that
each atom may belong to several chemical moieties. In other words, a table
containing the entries

    loop_
    _atom_site_label
    _atom_site_occupancy
    _atom_site_associated_moiety
      1   1.0  ABC
      2   0.5  AC
      3   0.4  A
      4   1.0  ABC
      2A  0.5  B
      3A  0.5  B
      3B  0.1  C

COULD be interpreted in such a way as to give you a description of molecule
A (or B, or C...), though this runs counter to our distaste for having
separate parsing rules for separate data names.

I note that the mmCIF dictionary allows a much richer description of the
structural elements associated with the results of the refinement, and can
describe an ensemble of disordered fragments. However, I'm not sure that
it's appropriate to bring the full richness of that description to bear on
small molecules.

Another point that needed clarification was the use of negative values of
the *_group code.
This arises from the practice in SHELXL of controlling the connectivity by
grouping disordered atoms into different "PART"s. In general, each PART
number will map to a value of _atom_site_disorder_group. However, the
mapping can be one-to-many, for the PARTs are not intended as descriptions
of the disordered cluster per se, but are rather clusters of atoms in the
refinement that obey different constraints designed to model the effects of
disorder. Hence, automatic bond generation is inhibited between atoms with
different PART numbers, unless one of them is 0 (this is the value
associated with the main ordered fragment of the molecule).

Now, if you imagine a group disordered on a symmetry element, the group can
be described by a set of atoms all belonging to the same PART (because the
disordered positions will be generated when the symmetry element is
applied); but a connectivity algorithm runs the risk of building spurious
bonds between the atoms listed in the PART and their symmetry-generated
mates. SHELXL adopts the convention that such PARTs should be given a
negative PART number, as a flag to switch off special position constraints
and bonds to symmetry-generated atoms.

PROPOSED ACTION
---------------
Action on all this? I propose to include David's new definitions for
_atom_site_disorder_assembly and *_group in (45)D44.2, with a different
description of the negative part numbers, thus:

data_atom_site_disorder_assembly
    _name                      '_atom_site_disorder_assembly'
    _category                    atom_site
    _type                        char
    _list                        yes
    _list_reference            '_atom_site_label'
    loop_ _example
          _example_detail
          A    'disordered methyl assembly with groups 1 and 2'
          B    'disordered sites related by a mirror'
          S    'disordered sites independent of symmetry'
    _definition
;              A code which identifies a cluster of atoms that show long
               range positional disorder but are locally ordered. Within
               each such cluster of atoms, _atom_site_disorder_group is
               used to identify the sites that are simultaneously occupied.
               This field is only needed if there is more than one cluster
               of disordered atoms showing independent local order.
;

data_atom_site_disorder_group
    _name                      '_atom_site_disorder_group'
    _category                    atom_site
    _type                        char
    _list                        yes
    _list_reference            '_atom_site_label'
    loop_ _example
          _example_detail
          1    'unique disordered site in group 1'
          2    'unique disordered site in group 2'
         -1    'symmetry-independent disordered site'
    _definition
;              A code that identifies a group of positionally disordered
               atom sites that are locally simultaneously occupied. Atoms
               that are positionally disordered over two or more sites
               (e.g. the H atoms of a methyl group that exists in two
               orientations) can be assigned to two or more groups. Sites
               belonging to the same group are simultaneously occupied,
               but those belonging to different groups are not. A minus
               prefix (e.g. "-1") is used to indicate sites disordered
               about a special position.
;

This will not give a complete description of complex disordered structures,
but will satisfy the many cases where a functional group is simply
disordered into two or more distinct conformations. I suggest we leave to a
later revision of the dictionary any attempt to describe the ensemble of
moieties in a manner different from the mmCIF STRUCT categories.

D45.7 _audit_link_block_code
----------------------------
G> D45.7. I agree with your solution of transferring audit_link_block_code
G> to a block_id. What I do not know (or remember) is the scheme of
G> prefixes. Anyway is a good solution.

There seems to be a consensus on identifying data blocks and links between
blocks through the _audit_link_ category and the identifier "_block_id",
which I have in fact named as "_audit_block_code", as the "AUDIT" category
seems the right place for it. The current definitions are listed below, for
convenience.

The problem is in assigning a unique value to _audit_block_code. I can see
three ways to proceed:

(1) Require only uniqueness within the current file.
Then the linking mechanism is just a formal way of stating the relationship
between the data blocks in a file - by no means a useless function, since
the standard does not impose any particular function on any individual data
block.

(2) Formulate an algorithm that guarantees the generation of a unique
string. I am open to suggestions on how this may be achieved in a
machine/OS/program-independent way. The simplest solution seems to be Brian
Toby's, of setting up a registry of unique prefixes and assigning them to
individuals, then relying on the individuals to ensure the uniqueness of
every data block they generate.

(3) Adopt a naming philosophy along the lines of the URL/URI/URN rules that
we see on the web. The notion here is that an identifier can take the form

    (protocol:)(/location)(#fragment)

and that partial forms of the identifier can be employed for
cross-references of restricted scope. How might this work? Say we have the
following values of _audit_link_block_code (I use '|' for the fragment
separator, in view of the special meaning to STAR of the '#' character):

    |TOZ                                (a)
    file2|TOZ                           (b)
    ../peer/file3|TOZ                   (c)
    /top/mid/file4|TOZ                  (d)
    http://x.y.z/somewhere/file5|TOZ    (e)

(a) means "data block with _audit_block_code value of TOZ in the same
file", and is portable (subject to the file being retained as intact).

(b) means block TOZ in a file named file2 at the same level of the storage
hierarchy (usually this means "in the same directory", but it's intended to
be a bit more general than that).

(c) means block TOZ in file3 in a neighbouring container (directory) at the
same level in the hierarchy;

(d) means block TOZ within a file, file4, at a specific location in the
hierarchy that is currently accessible (normally this means on the same
file system).

(b), (c) and (d) are progressively less portable, in the sense that the
links will work only if the files linked to are carried around in the same
relative positions in the namespace.
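The five forms listed above can be told apart mechanically. The following is only a minimal Python sketch of how an application might do so: it assumes '|' as the fragment separator, and the function name `classify_link` and the scope labels it returns are hypothetical, not part of any dictionary or existing CIF software.

```python
def classify_link(value: str) -> tuple[str, str, str]:
    """Split 'location|block' and label the scope of the location part."""
    # Split on the last '|' so the block code is everything after it.
    location, sep, block = value.rpartition("|")
    if not sep:
        raise ValueError(f"no '|' fragment separator in {value!r}")
    if location == "":
        scope = "same file"          # form (a)
    elif "://" in location:
        scope = "network location"   # form (e)
    elif location.startswith("/"):
        scope = "absolute path"      # form (d)
    elif "/" in location:
        scope = "relative path"      # form (c)
    else:
        scope = "sibling file"       # form (b)
    return scope, location, block


examples = [
    "|TOZ",
    "file2|TOZ",
    "../peer/file3|TOZ",
    "/top/mid/file4|TOZ",
    "http://x.y.z/somewhere/file5|TOZ",
]
for value in examples:
    print(value, "->", classify_link(value))
```

The order of the tests matters: the network form is checked before the path forms because it also contains '/' characters.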
For instance, if two files, file1 and file2, are submitted to ACTA, both
published in the same issue of the journal, and both archived in the same
directory, then link (b) would work.

Link (e) is again entirely portable, in that it identifies the host machine
on the Internet and the data access protocol required to retrieve the file.

This scheme has the merit of allowing links on a local system, if only this
is required; but is extensible as the links need to be made public. The
values of the "block codes" need to be "unique" only within the scope
permitted by the form of the identifier - the http: value is unique across
the Internet, but is only assigned by the public archive hosting that file.

PROPOSED ACTION
---------------
Unless I hear compelling arguments to the contrary, I intend to implement
solution (1) at this point; it is possible that it might be extended to a
scheme such as (3) in the future.

data_audit_block_code
    _name                      '_audit_block_code'
    _category                    audit
    _type                        char
    _example                     TOZ_1991-03-20
    _definition
;              A code intended to identify uniquely the current data block.
;

data_audit_link_[]
    _name                      '_audit_link_[]'
    _category                    category_overview
    _type                        null
    loop_ _example
          _example_detail
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
;
    loop_
    _audit_link_block_code
    _audit_link_block_description
       .             'discursive text of paper with two structures'
       morA_(1)      'structure 1 of 2'
       morA_(2)      'structure 2 of 2'
;
;
    Example 1 - multiple structure paper, as illustrated in
                A Guide to CIF for Authors (1995). IUCr: Chester.
;
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
;
    loop_
    _audit_link_block_code
    _audit_link_block_description
       .             'publication details'
       KSE_COM       'experimental data common to ref./mod. structures'
       KSE_REF       'reference structure'
       KSE_MOD       'modulated structure'
;
;
    Example 2 - example file for the one-dimensional incommensurately
                modulated structure of K~2~SeO~4~.
;
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    _definition
;              Data items in the AUDIT_LINK category record details about
               the relationships between data blocks in the current CIF.
;

data_audit_link_block_code
    _name                      '_audit_link_block_code'
    _category                    audit_link
    _type                        char
    _list                        yes
    _list_mandatory              yes
    _definition
;              The value of _audit_block_code associated with a data block
               in the current file related to the current data block. The
               special value '.' may be used to refer to the current data
               block for completeness.
;

data_audit_link_block_description
    _name                      '_audit_link_block_description'
    _category                    audit_link
    _type                        char
    _list                        yes
    _list_reference            '_audit_link_block_code'
    _definition
;              A textual description of the relationship of the referenced
               data block to the current one.
;

D45.8 Inheritance across data blocks
------------------------------------
Gotzon sees another benefit to the _audit_link_ mechanism:

G> On the other hand audit_link_block_code would define the scope where
G> the parent/child link should work. I mean that the DDL requirement
G> "Identifies a data item....and which must be present in the same data
G> block..." would be "substituted" for certain CIF applications by:
G> "Identifies a data item....and which must be present in the same
G> (logical) data block...", where a logical data block is composed by those
G> physical data blocks that are included in the audit_link_block_code list
G> (uniqueness of block_id seems to be crucial).

This is a formal problem which I leave as an open issue at this point - it
doesn't immediately affect the current revision to the dictionary itself.

D48.1 _diffrn_ categories
-------------------------
I have adopted most of the changes suggested by David in his proposed
reworking of the various DIFFRN_ categories.
Following some detailed correspondence with him, the final version is
slightly different from the proposals in circular 48, and I would
appreciate it if you would review these sections carefully. There are cases
(e.g. _diffrn_radiation_detector and _diffrn_detector_device; likewise
_diffrn_radiation_detector_dtime and _diffrn_detector_dtime) where
essentially the same information is repeated in two categories, according
as one or multiple instruments or experiments are described. This is
somewhat unsatisfactory; but the overall logic of the DIFFRN categories is
at least better argued than used to be the case.

----------

Regards
Brian