
Re: [ddlm-group] Further discussion of proposal #2

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] Further discussion of proposal #2
From: "Herbert J. Bernstein" <[email protected]>
Date: Mon, 27 Jun 2016 20:45:05 -0400
In-Reply-To: <BY2PR0401MB0936761420ED47EB0D70EE8DE0210@BY2PR0401MB0936.namprd04.prod.outlook.com>
References: <CAM+dB2c4XhGDZQ7PBAHhUfmTXc7X7H2PboWBH3s1dapp0Gh_KQ@mail.gmail.com><BY2PR0401MB093685CBF929AF951C0626FCE0570@BY2PR0401MB0936.namprd04.prod.outlook.com><CAM+dB2fu4OXVqGf1X=s+Q0dRD1ftzPUvQ2+iEG-V5rhZKASCow@mail.gmail.com><BY2PR0401MB09365FA4D577C10A19A261BEE02A0@BY2PR0401MB0936.namprd04.prod.outlook.com><CAM+dB2d16xXsRO8ZsBO62T_KcEz4GhOV02gdTyk5dq0hv7o0Hg@mail.gmail.com><BY2PR0401MB0936761420ED47EB0D70EE8DE0210@BY2PR0401MB0936.namprd04.prod.outlook.com>

save_variant
    _category.description
;             Data items in the VARIANT category record
              the details about sets of variants of data items.
              
              There is sometimes a need to allow for multiple versions of the
              same data items in order to allow for refinements and corrections
              to earlier assumptions, observations and calculations.  In order
              to allow data sets to contain more than one variant of the same
              information, an optional ...variant data item as a pointer to
              _variant.variant has been added to the key of every category,
              as an implicit data item with a null (empty) default value.
              
              All rows in a category with the same variant value are considered 
              to be related to one another and to all rows in other categories
              with the same variant value.  For a given variant, all such rows
              are also considered to be related to all rows with a null variant
              value, except that a row with a null variant value for which
              all other components of its key are identical to those entries
              in another row with a non-null variant value is not related to
              the rows with that non-null variant value.  This behavior is 
              similar to the convention for identifying alternate conformers 
              in an atom list.
              
              An optional role may be specified for a variant as the value of
              _variant.role.  Possible roles are null, "preferred", 
              "raw data", "unsuccessful trial".
              
              Variants may carry an optional timestamp as the value of
              _variant.timestamp.
              
              Variants may be related to other variants from which they were
              derived by the value of _variant.variant_of.
              
              Further details about the variant may be specified as the value
              of _variant.details.
              
              In order to allow variant information from multiple datasets to
              be combined, _variant.diffrn_id and/or _variant.entry_id may
              be used. 
              
;
    _category.id                   variant
    _category.mandatory_code       no
     loop_
    _category_key.name             '_variant.variant'
                                   '_variant.diffrn_id'
                                   '_variant.entry_id'
     loop_
    _category_group.id             'inclusive_group'
                                   'variant_group'
     loop_
    _category_examples.detail
    _category_examples.case
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
;   Example 1 - Distinguishing between a raw beam center and a refined beam 
       center inferred after indexing.  Detector d1 is composed of 
       four CCD detector elements, each 200 mm by 200 mm, arranged 
       in a square, in the pattern

                   1     2
                      *
                   3     4

       Note that the beam centre is slightly displaced from each of the
       detector elements, just beyond the lower right corner of 1,
       the lower left corner of 2, the upper right corner of 3 and
       the upper left corner of 4.  For each element, the detector
       face coordinate system is assumed to have the fast axis
       running from left to right and the slow axis running from
       top to bottom with the origin at the top left corner.
       
       After indexing and refinement, the center is shifted by 0.2 mm
       left and 0.1 mm down.
        
        
;
;

        loop_
        _variant.variant
        _variant.role
        _variant.timestamp
        _variant.variant_of
        _variant.details
            . "raw data" 2007-08-03T23:20:00 . .
            indexed "preferred" 2007-08-04T01:17:28 .
              "indexed cell and refined beam center"
              
        loop_
        _diffrn_detector_element.detector_id
        _diffrn_detector_element.id
        _diffrn_detector_element.reference_center_fast
        _diffrn_detector_element.reference_center_slow
        _diffrn_detector_element.reference_center_units
        _diffrn_detector_element.variant
        d1     d1_ccd_1  201.5 201.5  mm  .
        d1     d1_ccd_2  -1.8  201.5  mm  .
        d1     d1_ccd_3  201.6  -1.4  mm  .
        d1     d1_ccd_4  -1.7   -1.5  mm  .
        d1     d1_ccd_1  201.3 201.6  mm  indexed
        d1     d1_ccd_2  -2.0  201.6  mm  indexed
        d1     d1_ccd_3  201.3  -1.5  mm  indexed
        d1     d1_ccd_4  -1.9   -1.6  mm  indexed
;
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

     save_

save__variant.details
    _item_description.description
;              A description of special aspects of the variant.
;
    _item.name                  '_variant.details'
    _item.category_id             variant
    _item.mandatory_code          no
    _item_type.code               text
    _item_examples.case
;                                indexed cell and refined beam center
;
     save_

save__variant.diffrn_id
    _item_description.description
;             This item is a pointer to _diffrn.id in the
                  diffrn category.
;
    _item.name                  '_variant.diffrn_id'
    _item.category_id             variant
    _item.mandatory_code          implicit
    _item_type.code               code
     save_
    
save__variant.entry_id
    _item_description.description
;             This item is a pointer to _entry.id in the
                  entry category.
;
    _item.name                  '_variant.entry_id'
    _item.category_id             variant
    _item.mandatory_code          yes
    _item_type.code               code
     save_

    
save__variant.role
    _item_description.description
;             The value of _variant.role specifies a role
              for this variant.  Possible roles are null, "preferred", 
              "raw data", and "unsuccessful trial".
    
              A null value for _variant.role leaves the
              precise role of the variant unspecified.  No inference should
              be made that the variant with the latest time stamp is
              preferred.
;
    _item.name                  '_variant.role'
    _item.category_id            variant
    _item.mandatory_code         no
    _item_type.code              uline
     loop_
    _item_enumeration.value
    _item_enumeration.detail
     "preferred"
;    A value of "preferred" indicates that rows of any categories specifying
     this variant should be used in preference to rows with the same key 
     specifying other variants or the null variant.  It is an error to specify
     two variants that appear in the same category with the same key as being 
     preferred, but it is not an error to specify more than one variant as 
     preferred in other cases.
;
     "raw data"
;    A value of "raw data" indicates data prior to any corrections, 
     calculations or refinements.  It is not necessarily an error for raw data
     to also be a variant of an earlier variant.  It may be replacement raw 
     data for earlier data believed to be erroneous.
;
     "unsuccessful trial"
;    A value of "unsuccessful trial" indicates data that should not be used
     for further calculation.
;
     save_


save__variant.timestamp
    _item_description.description
;              The date and time identifying a variant.  This is not 
               necessarily the precise time of the measurement or calculation
               of the individual related data items, but a timestamp that 
               reflects the order in which the variants were defined.
;
    _item.name                 '_variant.timestamp'
    _item.category_id          variant
    _item.mandatory_code       no
    _item_type.code            yyyy-mm-dd
     save_

    
save__variant.variant
    _item_description.description
;             The value of _variant.variant must uniquely identify
              each variant for the given diffraction experiment and/or entry.
                    
              This item has been made implicit and given a default value of 
              null.
;

     loop_
    _item.name
    _item.category_id
    _item.mandatory_code
             '_variant.variant'                    variant             implicit
             '_variant.variant_of'                 variant             implicit
             '_array_data.variant'                 array_data          implicit
             '_array_element_size.variant'         array_element_size  implicit
             '_array_intensities.variant'          array_intensities   implicit
             '_array_structure.variant'            array_structure     implicit
             '_array_structure_list.variant'       array_structure_list
                                                                       implicit
             '_array_structure_list_axis.variant'  array_structure_list_axis
                                                                       implicit
             '_array_structure_list_section.variant'  array_structure_list_section
                                                                       implicit
             '_axis.variant'                       axis                implicit
             '_diffrn_data_frame.variant'          diffrn_data_frame   implicit
             '_diffrn_detector.variant'            diffrn_detector     implicit
             '_diffrn_detector_axis.variant'       diffrn_detector_axis
                                                                       implicit
             '_diffrn_detector_element.variant'    diffrn_detector_element
                                                                       implicit
             '_diffrn_measurement.variant'         diffrn_measurement  implicit
             '_diffrn_measurement_axis.variant'    diffrn_measurement_axis
                                                                       implicit
             '_diffrn_radiation.variant'           diffrn_radiation    implicit
             '_diffrn_refln.variant'               diffrn_refln        implicit
             '_diffrn_scan.variant'                diffrn_scan         implicit
             '_diffrn_scan_axis.variant'           diffrn_scan_axis    implicit
             '_diffrn_scan_frame.variant'          diffrn_scan_frame   implicit
             '_diffrn_scan_frame_axis.variant'     diffrn_scan_frame_axis
                                                                       implicit
             '_diffrn_scan_frame_monitor.variant'  diffrn_scan_frame_monitor
                                                                       implicit
             '_map.variant'                        map                 implicit
             '_map_segment.variant'                map_segment         implicit


    _item_default.value           .
    _item_type.code               code
     loop_
    _item_linked.child_name
    _item_linked.parent_name
             '_array_data.variant'                 '_variant.variant'
             '_array_element_size.variant'         '_variant.variant'
             '_array_intensities.variant'          '_variant.variant'
             '_array_structure.variant'            '_variant.variant'
             '_array_structure_list.variant'       '_variant.variant'
             '_array_structure_list_axis.variant'  '_variant.variant'
             '_axis.variant'                       '_variant.variant'
             '_diffrn_data_frame.variant'          '_variant.variant'
             '_diffrn_detector.variant'            '_variant.variant'
             '_diffrn_detector_axis.variant'       '_variant.variant'
             '_diffrn_detector_element.variant'    '_variant.variant'
             '_diffrn_measurement.variant'         '_variant.variant'
             '_diffrn_measurement_axis.variant'    '_variant.variant'
             '_diffrn_radiation.variant'           '_variant.variant'
             '_diffrn_refln.variant'               '_variant.variant'
             '_diffrn_scan.variant'                '_variant.variant'
             '_diffrn_scan_axis.variant'           '_variant.variant'
             '_diffrn_scan_frame.variant'          '_variant.variant'
             '_diffrn_scan_frame_axis.variant'     '_variant.variant'
             '_diffrn_scan_frame_monitor.variant'  '_variant.variant'
             '_map.variant'                        '_variant.variant'
             '_map_segment.variant'                '_variant.variant'

     save_


save__variant.variant_of
    _item_description.description
;             The value of _variant.variant_of gives the variant
              from which this variant was derived.  If this value is not given,
              the variant is assumed to be derived from the default null 
              variant.

              This item is a pointer to _variant.variant in the
              VARIANT category.
;
    _item.name                  '_variant.variant_of'
    _item.category_id             variant
    _item.mandatory_code          no
    _item_type.code               code
     save_
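
The relation rule stated in the VARIANT category description can be illustrated with a short Python sketch. This is only an illustration under assumptions: the function name `rows_for_variant`, the representation of rows as dictionaries, the use of `None` for the null variant, and the choice of key fields are all hypothetical and are not part of imgCIF or of any CIF library.

```python
def rows_for_variant(rows, variant, key_fields):
    """Return the rows of one category that are related to `variant`.

    rows       -- list of dicts, one per row; row["variant"] is None for
                  the null variant
    variant    -- the variant of interest (None selects null rows only)
    key_fields -- names of the key components other than the variant
    """
    def key_of(row):
        return tuple(row[f] for f in key_fields)

    # Rows that carry the requested variant value directly.
    chosen = [r for r in rows if r["variant"] == variant]
    if variant is None:
        return chosen

    # Null-variant rows are related too, unless a row of the requested
    # variant has the same remaining key components.
    overridden = {key_of(r) for r in chosen}
    fallback = [r for r in rows
                if r["variant"] is None and key_of(r) not in overridden]
    return chosen + fallback


# Values taken from Example 1, reduced to two detector elements.  The
# situation in which only d1_ccd_1 has an "indexed" row is hypothetical.
rows = [
    {"detector_id": "d1", "id": "d1_ccd_1", "fast": 201.5, "slow": 201.5,
     "variant": None},
    {"detector_id": "d1", "id": "d1_ccd_2", "fast": -1.8, "slow": 201.5,
     "variant": None},
    {"detector_id": "d1", "id": "d1_ccd_1", "fast": 201.3, "slow": 201.6,
     "variant": "indexed"},
]
selected = rows_for_variant(rows, "indexed", ["detector_id", "id"])
```

Under these assumptions, `selected` holds the "indexed" row for d1_ccd_1 together with the null-variant row for d1_ccd_2, because only the former has been overridden.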

On Mon, Jun 27, 2016 at 6:31 PM, Bollinger, John C <[email protected]> wrote:

Dear James and Colleagues,

I hesitate to offer a comparison with imgCIF’s Variants, as I’m having trouble finding any actual specifications for it. The closest I’ve come is some relatively old discussion on imgcif-l. What I have seen makes me think that Variants address a different, but related, problem: handling certain situations where we discover a need for a single attribute to take multiple values. From what I’ve seen, Variants provides a generic way to identify and describe any number of values of such an attribute, entirely in the data, as an alternative to defining new, independent attributes. If the attribute being multiplexed is a child key, then this could produce a localized effect similar to a star schema.

The star schema proposal is similar in that it approaches the problem from a relational viewpoint, but the details differ. Perhaps the biggest difference is that the star schema approach keeps all the relationship definitions in the dictionary, unless you count use of the (optional) audit_schema category, whereas Variants appears to push some of that out into data files. That’s fine, just different.

In preparation for an example, let’s suppose that we follow mmCIF by naming the default hub category ENTRY. We would give a child key referencing ENTRY to each category that must be single-valued with respect to a "normal" data set (because that’s the kind of data that ENTRY represents). mmCIF in fact does just that. We must ensure that we do not end up with, say, multiple instances of CELL_LENGTH referencing the same ENTRY, for allowing that would introduce just the kind of ambiguity we want to avoid. One way to do that would be to make these child keys also be their categories’ category keys (leaving aside for the moment how we classify those categories). We would furthermore assign default values for the keys to enable them to be omitted from data files. Some Loop categories would need such child keys added to their own category keys as well.
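
A rough Python sketch of the two mechanisms just described, defaulted child keys and one instance per ENTRY, might look as follows. The helper names, the dictionary representation of rows, the `length_a` field, and the sample values are hypothetical and chosen only for illustration; the empty-string default is likewise an assumption.

```python
DEFAULT_ENTRY_ID = ""  # assumed default for an omitted entry_id child key


def with_default_key(rows, key_name, default=DEFAULT_ENTRY_ID):
    """Fill in an omitted child key with its default value."""
    return [{**row, key_name: row.get(key_name, default)} for row in rows]


def duplicate_keys(rows, key_name):
    """Return the key values that occur in more than one row."""
    seen, dupes = set(), set()
    for row in rows:
        value = row[key_name]
        (dupes if value in seen else seen).add(value)
    return dupes


# A conventional file: one cell, entry_id omitted and therefore defaulted.
single = with_default_key([{"length_a": 10.0}], "entry_id")

# A hypothetical multi-entry file: two cells are fine, but a third row that
# reuses "e1" would be ambiguous and is reported as a duplicate.
multi = with_default_key(
    [{"entry_id": "e1", "length_a": 10.0},
     {"entry_id": "e2", "length_a": 12.5},
     {"entry_id": "e1", "length_a": 10.1}],
    "entry_id",
)
```

Because the child key doubles as the category key, the ambiguity check reduces to an ordinary key-uniqueness check, and a file that never mentions entry_id still passes it.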

The exception in our current core dictionary is SPACE_GROUP, because it is defined as a Loop but used as a Set with respect to any given ENTRY. There we define the relationship in the other direction: ENTRY gets a child key referencing SPACE_GROUP. Note here that we do not need to prevent multiple ENTRY instances from referring to the same SPACE_GROUP.

Most categories that are already Loops also need to be tied somehow to an ENTRY. How that happens depends on their own key structure: in many cases, the Loop’s category key will need to be expanded with a child key referencing ENTRY, but if a Loop has a category key referencing a parent category, and the parent’s category key, if any, does not reference ENTRY, then the child’s will not reference ENTRY either. This would apply to SPACE_GROUP / SPACE_GROUP_SYMOP and probably to several of the PUBL_* categories, for example.

Taking your second example first, adding a new category that does not serve as a hub category does not require changes to any other category. No changes are needed because category instances associated with the same hub category recognize each other by virtue of their separate and independent associations with that hub, as opposed to relying on each other to be global or to have keys associating them directly. This applies, bidirectionally, to new categories just the same as it does to existing ones.

As for your first example, providing multiple instances of the CELL_* categories could be done multiple ways, but here are three of the more likely:

(1) provide multiple instances of ENTRY

This case can be exercised without any new definitions at all, but the different instances of the CELL_* categories would need to explicitly provide cell_*.entry_id values. Probably the other expressed categories that have child keys referencing ENTRY would need to express those child keys explicitly as well, but that depends to some extent on what information they are intended to convey. We might do this if we wanted to describe several structures in the same data block, for example. This has the advantage that we can use as much or as little as we want of each ENTRY – cell, space group, atom sites, even experimental details.

(2) define a new hub category

This is the route that would be taken when providing for multiple cells in order to describe data that are not adequately modeled either by an ENTRY or by a collection of them. In this case, we would need to define new keys associating the appropriate categories (which would not necessarily be all of them) with the new hub category. At this point I’m not actually seeing where we would find a need to aggregate multiple cells directly, instead of aggregating ENTRYs, but I can’t rule it out.

(3) give the CELL_* categories a surrogate key, and use a variants-like approach to associate additional cells with ENTRY.

I’m supposing here that this would be used only for cells that are “secondary” in some sense, for example if the cell parameters were measured at multiple temperatures but only one data set was collected and one structure determination performed. (If full structure determinations were done at multiple temperatures, then that might be better handled by providing multiple ENTRYs instead.) This approach might also require new DDLm semantics allowing us to specify that cell_*.entry_id, although no longer a category key, must not take duplicate values.

As for an example, although there is some room for variation within the star-schema approach, this is a cut at what I have primarily suggested:

====

save_ENTRY
_definition.id                          ENTRY
_definition.scope                       Category
_definition.class                       Loop
_definition.update                      2016-06-27
_description.text
;
    Represents a chemical or biological structure and associated experimental details.
;
_name.category_id                       CIF_CORE
_name.object_id                         ENTRY
_category.key_id                        '_entry.id'
save_

save__entry.id
_definition.id                          '_entry.id'
_definition.update                      2016-06-27
_description.text
;
     Identifies and distinguishes specific entries within a given
     data block.
;
_name.category_id                       entry
_name.object_id                         id
_type.purpose                           Key
_type.source                            Assigned
_type.container                         Single
_type.contents                          Text
_enumeration.default                    ''
save_

save__entry.sg_id
_definition.id                          '_entry.sg_id'
_definition.update                      2016-06-27
_description.text
;
     Identifies the space group with respect to which an entry's atom
     sites and geometry tables are intended to be interpreted.
;
_name.category_id                       entry
_name.object_id                         sg_id
_type.purpose                           Link
_type.source                            Related
_type.container                         Single
_type.contents                          Text
_type.contents_referenced_id            '_space_group.id'
_enumeration.default                    ''
save_

save_ATOM_SITE
_definition.id                          ATOM_SITE
_definition.scope                       Category
_definition.class                       Loop
_definition.update                      2016-06-27
_description.text
;
     The CATEGORY of data items used to describe atom site information
     used in crystallographic structure studies.
;
_name.category_id                       ATOM
_name.object_id                         ATOM_SITE
_category.key_id                        '_atom_site.key'
loop_
_category_key.name
         '_atom_site.entry_id'
         '_atom_site.label'
save_

save__atom_site.entry_id
_definition.id                          '_atom_site.entry_id'
_definition.update                      2016-06-27
_description.text
;
     Associates an atom site with the entry to which it pertains.
;
_name.category_id                       atom_site
_name.object_id                         entry_id
_type.purpose                           Link
_type.source                            Related
_type.container                         Single
_type.contents                          Text
_type.contents_referenced_id            '_entry.id'
_enumeration.default                    ''
save_

save__atom_site.key
_definition.id                          '_atom_site.key'
loop_
_alias.definition_id
         '_atom_site.key'
_definition.update                      2012-11-20
_description.text
;
     Value is a unique key to a set of ATOM_SITE items
     in a looped list.
;
_name.category_id                       atom_site
_name.object_id                         key
_type.purpose                           Key
_type.source                            Related
_type.container                         List
_type.contents                          'Text,Code'
loop_
_method.purpose
_method.expression
         Evaluation          '_atom_site.key = [_atom_site.entry_id,_atom_site.label]'
save_

====

If a new hub category OTHER were added that also had atom sites associated with it, the following revisions to the above definitions, together with new definitions in the above categories, might be needed (definitions for OTHER omitted):

save_ATOM_SITE
_definition.id                          ATOM_SITE
_definition.scope                       Category
_definition.class                       Loop
_definition.update                      2016-06-27
_description.text
;
     The CATEGORY of data items used to describe atom site information
     used in crystallographic structure studies.
;
_name.category_id                       ATOM
_name.object_id                         ATOM_SITE
_category.key_id                        '_atom_site.key'
loop_
_category_key.name
         '_atom_site.entry_id'
         '_atom_site.other_id'
         '_atom_site.label'
save_

save__atom_site.key
_definition.id                          '_atom_site.key'
loop_
_alias.definition_id
         '_atom_site.key'
_definition.update                      2012-11-20
_description.text
;
     Value is a unique key to a set of ATOM_SITE items
     in a looped list.
;
_name.category_id                       atom_site
_name.object_id                         key
_type.purpose                           Key
_type.source                            Related
_type.container                         List
_type.contents                          'Text,Text,Code'
loop_
_method.purpose
_method.expression
         Evaluation          '_atom_site.key = [_atom_site.entry_id,_atom_site.other_id,_atom_site.label]'
save_

save__atom_site.other_id
_definition.id                          '_atom_site.other_id'
_definition.update                      2016-06-27
_description.text
;
     Associates an atom site with the OTHER to which it pertains.
;
_name.category_id                       atom_site
_name.object_id                         other_id
_type.purpose                           Link
_type.source                            Related
_type.container                         Single
_type.contents                          Text
_type.contents_referenced_id            '_other.id'
_enumeration.default                    .
save_

Of course this is fairly speculative, because much depends on the nature of the relationship between OTHER, ATOM_SITE, ENTRY, and other categories.
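
To make the effect of the two Evaluation methods concrete, here is a small Python sketch of how the list-valued _atom_site.key might be assembled before and after OTHER is introduced. It is not dREL and not an existing API: the function `atom_site_key`, the `DEFAULTS` table, the use of `None` for the null (.) default, and the sample row are assumptions made for illustration.

```python
def atom_site_key(row, components, defaults):
    """Build the list-valued key from the named components.

    Components missing from the row take their default value, mirroring
    the _enumeration.default entries in the sketched definitions.
    """
    return [row.get(name, defaults.get(name)) for name in components]


# '' for entry_id and null (None here) for other_id, as sketched above.
DEFAULTS = {"entry_id": "", "other_id": None}

# Hypothetical atom site from an existing file: only the label is given.
row = {"label": "C1"}

key_v1 = atom_site_key(row, ["entry_id", "label"], DEFAULTS)
key_v2 = atom_site_key(row, ["entry_id", "other_id", "label"], DEFAULTS)
```

In both versions the row from an unmodified data file still yields a well-defined key, which is the sense in which existing files would be unaffected by the added key component.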

John

--

John C. Bollinger, Ph.D.

Computing and X-Ray Scientist

Department of Structural Biology

St. Jude Children's Research Hospital

[email protected]

(901) 595-3166 [office]

www.stjude.org

From: ddlm-group [mailto:[email protected]] On Behalf Of James Hester
Sent: Tuesday, June 21, 2016 12:51 AM
To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] Further discussion of proposal #2

Dear John,

Your idea of a hub and star layout sounds like it has potential, and perhaps is similar to the Variant category used in later versions of imgCIF, but I'm not sure that I've grasped it fully. Do you think you could flesh out an example? To make it concrete, how about giving some datanames and sketchy DDLm definitions showing how you expect it would work:

(1) If cell_parameters are looped - how would the atom_site category in particular change and what new definitions are required?

(2) If we then introduce a category as yet unknown in cif_core, such as Variant - how would the atom_site category change?

thanks,

James.

On 21 June 2016 at 07:47, Bollinger, John C <[email protected]> wrote:

Dear All,

My apologies for the elements of review in what follows. Writing them helped me organize my thoughts, so I hope that reading them will help communicate those thoughts.

As Herbert reminds us, for just about any category that might appear in a data file, one can imagine an experiment, a construct, a model, etc. whose description requires multiple instances of that category. As James observes, however, many categories in our current dictionaries so rarely require such treatment that we have gotten along fine with the DDL1 and DDLm core dictionaries not, technically, permitting multiple instances of those categories to be presented in the same data file at all. In mmCIF, on the other hand, substantially all categories are loopable in principle, with many of them associated together indirectly via the ENTRY category and its _entry.id attribute. Inasmuch as _entry.id "identifies the data block", however, that amounts to a distinction without much difference.

But mmCIF’s ENTRY category is nevertheless instructive. Formally, many categories defined as Sets in the DDLm core are associated with each other in mmCIF not by having global nature but by referring to the same ENTRY. This arrangement is similar to what is called a "star schema" in data warehousing: instead of a multitude of individual entities being global (which cannot generally be accommodated in a data warehouse) or all having direct relationships declared with a large number of other entities, they are instead all related to a single central entity; the relationships can be visualized as emanating in a star-like pattern from that central entity. In such a data warehouse, the central entity often represents a point in time; it constitutes the dimension along which all the other entities can jointly and concertedly vary.

So suppose we took the ENTRY idea from mmCIF, but allowed a block to contain multiple ENTRYs? As far as I can determine, that’s consistent with the machine-readable parts of the definitions of ENTRY and _entry.id anyway, though it seems inconsistent with their prose descriptions. In that way, a data file could be valid against mmCIF and nevertheless describe, say, multiple CELLs, without there being any ambiguity about which CELL went with which REFLNS. That’s similar to what we want to be able to do, but it doesn’t quite get us everywhere we want to go. The problem that we are grappling with can be viewed as how to deal with a situation wherein we want or need a different pattern of relationships between categories than the one described by the relationships with ENTRY.

James’s proposal #2 approaches the problem from a different angle. It acknowledges that there is more than one possible pattern of categories and relationships characterizing a data set, and it designates these as "schemas", which is indeed an apt term. It uses the category label 'Set' or maybe 'Global' (which I prefer for this purpose) to define a pattern of 1:1 relationships that serves as a functional substitute for mmCIF’s explicit relationships between ENTRY and other categories; it introduces a mechanism for declaring that a given data file in fact complies with a different schema than the default; and it provides a mechanism aimed at helping software determine whether and to what extent it can correctly interpret the file’s contents. At that high level, I don’t disagree with any of it, but we’ve gone several rounds over the details. Our main sticking point is related to how the relationships among categories should be described in dictionaries -- especially those that to date have been implicit in categories being defined as Sets.

Now suppose we combine the high-level idea of providing for multiple schemas with the mmCIF star schema structure. The DDLm core can model each distinct schema as a simple category and the hub of its own star schema, like mmCIF’s ENTRY. Existing categories can participate in more than one of these where appropriate, though initially there would be only one. Converting the existing DDLm core to this structure would involve creating one new key in each current Set category (mmCIF already has these keys), and possibly child keys in other categories. It does not necessarily affect existing data files at all, because we can define default values for the various keys. In this way, all needed keys can be explicitly defined, with a much more modest overall number of keys than if relationships were expressed directly among all categories, and consequently with much less impact when new categories are added.
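
The claim about the number of keys can be made concrete with a deliberately simplified count. The sketch below assumes, as a worst case, that the alternative to a hub is one direct relationship for every pair of categories; real dictionaries would need fewer, so the pairwise figure is an upper bound. The function names are hypothetical.

```python
def star_keys(n_categories):
    """One child key per category, each referencing the single hub."""
    return n_categories


def pairwise_keys(n_categories):
    """One direct relationship per unordered pair of categories."""
    return n_categories * (n_categories - 1) // 2


# Adding one more category costs one key in the star layout, but up to
# n additional relationships in the pairwise layout.
extra_star = star_keys(21) - star_keys(20)
extra_pairwise = pairwise_keys(21) - pairwise_keys(20)
```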

This also provides a fairly clean way to deal with SPACE_GROUP, and with any future categories that present a similar problem. Whereas with categories such as CELL we could enforce the restriction of one CELL per hub instance by making CELL’s category key be a child key referencing the hub category, we could reverse that for SPACE_GROUP and any similar category: give the hub category a child key referencing SPACE_GROUP.
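The two key directions can be sketched in Python as follows. This is one reading of the idea under stated assumptions: the names `hub_id`, `id`, `space_group_id` and `name_hall`, and the default value, are hypothetical. A CELL-style category is keyed by the hub key, so at most one packet per hub is allowed; for SPACE_GROUP the hub carries the child key instead, so several hub instances may point at the same space group.

```python
# All key names and the default value are illustrative assumptions.
def one_per_hub(packets, key="hub_id", default="default"):
    """True if no two packets share a hub key (the CELL-style restriction)."""
    keys = [p.get(key, default) for p in packets]
    return len(keys) == len(set(keys))


def space_group_for(hub_packet, space_groups,
                    child_key="space_group_id", default="default"):
    """Follow the hub's child key to the SPACE_GROUP packet it references."""
    wanted = hub_packet.get(child_key, default)
    for sg in space_groups:
        if sg.get("id", default) == wanted:
            return sg
    return None


cells = [{"hub_id": "A", "length_a": 10.0}, {"hub_id": "B", "length_a": 12.5}]
hubs = [{"id": "A", "space_group_id": "sg1"}, {"id": "B", "space_group_id": "sg1"}]
space_groups = [{"id": "sg1", "name_hall": "-P 2ybc"}]
```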

To wrap it all together and make it easier for software authors to deal with, we can add _audit.schema or something like it. One variation that occurs to me would be to have _audit_schema.name and _audit_schema.multiplicity, with the former taking as its values the names of schema hub categories, and the latter taking values from an enumerated set describing whether that category is present and if so, whether it is restricted to a single value. This would provide a fairly easy mechanism by which data files could advertise their structure to consumers, and for software to gauge whether they can handle the data.
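As a rough illustration of how software might use such a declaration, here is a short Python sketch. The enumerated values `ABSENT`, `SINGLE` and `MULTIPLE` are assumptions, since the actual enumeration has not been defined, and the hub name `entry` is used only as an example.

```python
from enum import Enum


class Multiplicity(Enum):
    """Hypothetical enumeration for _audit_schema.multiplicity."""
    ABSENT = 0     # the hub category is not present in the file
    SINGLE = 1     # present, restricted to a single value
    MULTIPLE = 2   # present, may take several values


def can_handle(declared, supported):
    """Check a file's declared schema against what a program supports.

    declared:  hub category name -> Multiplicity advertised by the file
    supported: hub category name -> highest Multiplicity the program handles
    """
    for name, mult in declared.items():
        if mult is Multiplicity.ABSENT:
            continue
        limit = supported.get(name, Multiplicity.ABSENT)
        if mult.value > limit.value:
            return False
    return True


legacy_reader = {"entry": Multiplicity.SINGLE}
single_file = {"entry": Multiplicity.SINGLE}
multi_file = {"entry": Multiplicity.MULTIPLE}
```

A reader written for single-entry files would accept `single_file` and reject `multi_file` before attempting to interpret any data values.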

Best regards,

John

From: ddlm-group [mailto:[email protected]] On Behalf Of James Hester
Sent: Monday, June 20, 2016 2:36 AM
To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] Further discussion of proposal #2

Dear John et al.


To summarise at the top, my principal objection to the 'default key' proposal is that it produces more complex dictionaries (more keys) with interactions that are initially surprising to a casual reader.

Now in detail:

I think our goal here is to come up with semantics that can (i) replicate DDL1/DDL2 'global category' behaviour and (ii) allow these global categories to become multi-packeted, with simultaneous loss of 'globality'. 'Global' categories (what I have referred to previously as 'Set' categories) are just a tool for simplification of dictionaries, and so the more complex we make their operation, the less benefit they provide. Likewise, the mainstream behaviour of feature (i) should be as easy as possible to use.

Proposal #2 as it currently stands (the 'Set' proposal) envisaged that the 'globality' of a category would be removed when using datanames defined within a separate dictionary (mostly key datanames), and software should use the _audit.schema dataname and potentially _audit.conform to shield itself from the change in meaning that this entails.

The 'default keys' proposal that John has outlined instead envisages making almost all 'Set' categories into 'Loop' categories, defining keys for them, and giving those keys default values. John has suggested that this does not now involve a change in DDLm, because the semantics of having a default key are clear - the dataname can be left out if there is only one packet. However, a 'global' category with only one packet does *not* (currently) act like a 'Loop' category with only one packet, because (unlike a single-packet 'Loop' category) the values appearing as non-key datanames in the 'global' category may be assumed when interpreting values from all other datanames in all other loops. 'Global' categories really are different to 'Loop' categories for this reason, regardless of whether or not a key dataname is provided.
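The distinction can be sketched in Python. This is a simplified model under stated assumptions, not the behaviour of any existing software: the class labels are passed as plain strings, and the key names `id` and `cell_id` are hypothetical. For a 'Global' category the single packet applies to every packet of every other loop; for a 'Loop' category a packet from another loop finds its context only through an explicit child key.

```python
def context_for(packet, category_packets, class_, child_key=None, parent_key="id"):
    """Find the packet of another category that applies to `packet`.

    class_ is 'Global' or 'Loop'; child_key and parent_key are hypothetical names.
    """
    if class_ == "Global":
        # Global behaviour: a single packet applies everywhere, no key needed.
        if len(category_packets) == 1:
            return category_packets[0]
        raise ValueError("a multi-packet category cannot act globally")
    # Loop behaviour: the relationship must be stated through a child key.
    if child_key is None or child_key not in packet:
        return None
    for candidate in category_packets:
        if candidate.get(parent_key) == packet[child_key]:
            return candidate
    return None


cell = [{"length_a": 10.0}]
atom = {"label": "C1"}
```

With `class_="Global"` the atom packet is interpreted using the one CELL packet; with `class_="Loop"` and no child key in the atom packet, no context is found, even though the CELL loop has only one packet.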

This difference between 'Global' and 'Loop' categories could be removed completely if all of the global category child keys were defined in parallel. In this case, the 'Global' category no longer acts 'Globally' but only in those categories for which a child key is defined. This 'simplification' comes at the expense of a whole lot of keys - in some categories, a key for every 'Set' category currently defined. At this point we have lost the practical simplification that we had obtained from 'Set' categories to start with. So, either you accept a change in DDLm (additional consequences of a default key) and define the child keys at a future date in another dictionary, or you keep DDLm unchanged and include the child keys in the main dictionary immediately, throwing out the considerable simplification afforded by having global values. I would be against the latter option as it introduces a bunch of rarely-used key definitions into the main dictionary and is likely to be confusing to a casual programmer.

(We could of course alternatively adopt the blanket rule that values appearing in a single-packet loop act globally with identical 'disappearing key' behaviour. While this is true enough mathematically, it now becomes permissible to drop keys that have up until now been required even for single packet loops and loops with foreign keys that point to those single-packet loops, and this would break current software. So I exclude this as an option, even if it is an elegant rule.)

So, given that we are stuck with two types of 'Loop' category, I would prefer communicating this clearly up front in the _definition.class tag, rather than relying on the presence of a default key value. What I think might communicate better than the current 'Set' definition, however, is a change from 'Set' to 'Global' (or 'Overall'), with a definition something like:

Global
;

    A special type of 'Loop' category. When single-valued (i.e.
    key-value pairs or single-row loops), datanames from a 'Global'
    category provide overall values for use in interpreting any
    other values in a datablock. Global categories may only be
    looped where a key has been defined.

;

I'm not sure if this is more likely to meet with approval.

I have added some more comments in John's email below.

On 18 June 2016 at 08:31, Bollinger, John C <[email protected]> wrote:

Dear James and Colleagues,

Comments in line below.

On Thursday, June 16, 2016 9:23 PM, James Hester wrote:
> I'm not at all concerned about tweaking DDLm. The proposed update to DDLm is a clarification and an extension, because the semantic interpretation of existing files would be unchanged. Is there any particular reason you are concerned about such measured changes to DDLm? From my point of view DDLm is the lowest-impact area of the framework - very few people actually care *how* we express the meaning of a dataname, as long as that meaning doesn't change, and those that do care deeply about DDL in general (in my experience, databases) have not done any work on DDLm yet.

Perhaps my concerns are misplaced, but it seems to me that the DDLs are the locations of greatest semantic leverage in our framework. On one hand, that means that we can make a large impact with changes there, but on the other hand it means that even small changes there can have large unintended side effects. Indeed, although I am unaware of any explicit assertion to this effect previously, it seems to me that we should have at least the same commitment to the stability of definitions in our DDL dictionaries that we do to the stability of definitions in our data dictionaries. But perhaps we can relax that a bit for DDLm, given that its use is still small.

Very little DDLm software has been written, and mostly by those in this group. A lot of thought and negotiation (I believe) has gone into DDLm, so we should not be too cavalier with our changes. Now is the best time to make them rather than later when we might hope for more widespread adoption.

> I'm not opposed to the concept of a default key value per se, I'm just unclear as to why you are arguing that this needs to be defined in a cif_core 'Set' category as opposed to an add-on dictionary.

I'm arguing that a category that has a key and permits multiple values per item is a de facto Loop, and that it is best to in fact define such a category as a Loop so that that is clear. In that case its key must be expressed in the dictionary that defines the category. It would also be acceptable to classify such a category with some new label, but in that case I still think it would be most sensible to define the key in the same dictionary that defines the category itself.

See my comments at the top of the email. I have provided a new label and definition, which indicates that the category can be looped, and under what conditions multiple packets may be expected. Perhaps this is acceptable?

I'm also arguing against the "magic keys" aspect of Proposal #2. I don't like magic, a.k.a. special cases, in specifications or in software, and I have presented a viable alternative in the form of default key values.

The reason for the special case 'Set' category is the considerable simplification it offers. We trade complexity of behaviour in one place for simplicity elsewhere. And we are ultimately stuck with it because of DDL1.

I'm furthermore arguing that even if we do give keys to Sets, wherever a category key or child key is itself defined is the proper place for any applicable default value for that key to be defined. The default value is an attribute of the definition of the key item, so I see only negatives to physically separating the two.

Absolutely, I wouldn't argue with this.

>> Let's consider the SPACE_GROUP category, since it sparked this whole discussion. I append a cut at what I think we should do with it (only frames containing modifications are presented); I think I have marked all the changes and additions within via CIF comments. I rarely wrangle dictionaries, so I apologize for any errors I have committed. The key defaulting presented within formalizes how, when, and why SPACE_GROUP's category key and the associated child key in SPACE_GROUP_SYMOP can be omitted from data files. To the best of my knowledge, nothing within relies on any DDLm changes.
>
> I think I understand your proposal to be using the existence of a default key value to signal that the key may be omitted in a single-value loop, *and* that child key datanames in other loops that would otherwise contain them may be omitted in this case.

I guess you can describe it as a "signal". I view it as deeper and more organic: where an explicit parent or child key may be omitted from data files, that is a direct consequence of the fact that it has a default value. That dictionary-driven software should handle such omissions naturally is also a consequence. These items can be omitted because they still take well-defined and suitable (default) values in that case.

I don't think I'm suggesting any change to the defined meaning of _enumeration.default; I'm just applying its existing meaning to the problem at hand in a way that we have not done before. The significance pertains not to _enumeration.default itself, but to its combination with a category key. That's not a change, it's a discovery. Even so, the underlying idea is not actually new. One can view it as a specific case of the same thing expressed by DDL2's _item.mandatory_code taking the value 'implicit'.
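A small Python sketch of dictionary-driven defaulting follows. It is a simplified model with assumptions labeled: the datanames `_parent.id`, `_child.parent_id` and `_child.value`, the attribute names `default` and `linked_to`, and the default value `"1"` are all hypothetical stand-ins. The point is that software which looks up omitted items in the dictionary can join child packets to their parent without any special case for missing keys.

```python
# Hypothetical mini-dictionary: dataname -> attributes of its definition.
DICTIONARY = {
    "_parent.id": {"default": "1"},
    "_child.parent_id": {"default": "1", "linked_to": "_parent.id"},
    "_child.value": {},
}

_MISSING = object()


def item_value(packet, dataname, dictionary=DICTIONARY):
    """Return the value in the packet, or the dictionary default if omitted."""
    if dataname in packet:
        return packet[dataname]
    default = dictionary.get(dataname, {}).get("default", _MISSING)
    if default is _MISSING:
        raise KeyError(f"{dataname} is absent and has no default")
    return default


def join(children, parents, child_key, dictionary=DICTIONARY):
    """Pair each child packet with the parent packet its child key points to."""
    parent_key = dictionary[child_key]["linked_to"]
    index = {item_value(p, parent_key, dictionary): p for p in parents}
    return [(c, index[item_value(c, child_key, dictionary)]) for c in children]


# Neither the parent key nor the child key is written in the data.
parents = [{"_parent.name": "only"}]
children = [{"_child.value": 1}, {"_child.value": 2}]
pairs = join(children, parents, "_child.parent_id")
```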

See my comments at the beginning for why I think there is more going on here than just logical consequences, i.e. there is global behaviour.

> I'm not clear whether you propose that these changes should happen in cif_core, or in an add-on dictionary.

For space_group, the dictionary changes should be applied to the core, in order to make the DDLm core consistent with our other dictionaries. I am generally inclined to put future (re-)keyings of core categories directly into the core dictionary as well, but that's a weaker opinion. Furthermore, I think there may be a way to do this so that we avoid an explosion of child keys, but I haven't worked all the way through that yet.

Your proposal on child keys would be interesting as I argue above that an explosion of child keys is a drawback and essentially removes the advantage gained by having global categories.

> In any case, I agree that this can be made precisely semantically equivalent to the 'Set' proposal, due to the fact that a default key value makes no sense in general and so the meaning of a default value for a key may be overloaded as you have done, with no implications elsewhere. This is still a change to DDLm, because the presence of _enumeration.default in certain definitions now has new implications (not that I'm opposed in principle to changing DDLm).

I agree that the "magic keys" aspect of Proposal #2 and the default keys approach I have presented both enable categories to have keys that are not expressed explicitly in data files. The former does it by fiat; the latter does it in a manner consistent with DDLm's existing semantics, even if our dictionaries have not exercised DDLm in quite that way before.

I agree that a default key value is not necessarily sensical for every present or conceivable category, but I disagree that I am overloading any definition, or that I am proposing a change to DDLm.

I would see no problem in a separate dictionary defining the key and default value for a 'Global' category as I've defined above. Essentially, conformance to this separate dictionary erases the 'Global' nature of the category and turns it into a normal 'Loop' category with default key, so that datafiles created according to the original specification remain valid with the new dictionary - we have in fact elegantly expanded the ontology.

Default key values do not make sense for categories that rely on natural keys, as does mmCIF's atom_type category, for example. Atom_type's key, _atom_type.symbol, is the chemical symbol for the element whose characteristics are described; it is a natural key because it has significance beyond distinguishing one atom_type from another. In other words, it is not just a key, but also part of the data.

On the other hand, space_group does not use a natural key, but rather a surrogate key -- one whose values have no inherent meaning other than to distinguish between different space_groups presented in the same data file. If only one space_group is presented then any key for it will do, because the keys are arbitrary. A default value for such a key is perfectly sensible.

Now, consider this: what kind of key will any new category have if that category requires one or more existing Set categories to become looped? We have previously discussed possibilities such as twin_component and variant, but as far as I can tell, these do not afford any clear, non-trivial, natural keys. Addition of any category that relies on a natural key would require existing sets to be looped only if that category's key is inherently single-valued with respect to those sets. I'm in fact having trouble seeing the circumstances under which it would make sense to add a new category that has a natural key and that requires existing sets to be looped. But even if we did discover a new category with a non-trivial, natural, candidate key, we always have the option of choosing a surrogate key instead. Indeed, that's what was done with space_group -- _space_group.name_Hall is a candidate key, I think, but we chose a surrogate key instead. If we choose surrogate keys then default key values present no semantic problem.

I agree with this - I'm not arguing that default key values are somehow bad or present problems, only that the 'global' behaviour is not captured.

> My preference would still be for the 'Set' proposal, because the semantics are wrapped up in a single enumerated value, at category level, rather than arising from an interaction between attributes of a particular dataname inside that category. I do not see any other distinguishing features. I believe that for programmers, dictionary authors, and casual dictionary readers, the 'Set' proposal is more accessible, as the particular special behaviour of the category is flagged explicitly and concisely, in the category definition, and described in a single place in the DDLm attribute dictionary.

[...]
Moreover, even if we did provide magic key behavior for Sets, I am not convinced that all the constituencies named would necessarily consider that a win, because it weakens the concept of a Set. There is a tremendous difference between "the items in a Set category take only one value each" and "the items in a Set category *ordinarily* take only one value each", especially when "ordinarily" really means when the data describe a particular kind of thing to which we have ascribed special status. In many respects, programming for, using, or interpreting the latter (the magic keys version) are all more difficult than programming for, using, or interpreting the former (the current version).

OK, point taken, I did say my objection wasn't critical.

> You will notice there is semantic convenience in referring to a category as a 'Set' category, rather than 'a category that has a default key value defined'. If you propose changing the cif_core dictionary rather than using an add-on dictionary, then the 'Set' proposal involves zero changes, whereas the default_value proposal involves a single extra key definition and adjustment to the definitions for each 'Set' category. Neither of these objections is particularly critical, of course.

The semantic convenience described comes at the cost of weakening the concept of a 'Set', and as a result, the comparison presented involves inequivalent expressions. The magic keys analog of 'a category that has a default key value defined' is 'a Set that has a category key defined'; these don't seem very different in weight to me. If we suppose that a Set may have one or more keys defined in a different dictionary than the one in which the Set itself is defined, then additionally we may not even be certain which kind of Set we're talking about, and if that ever changes then we cannot be confident of being able to recognize that from the dictionary at hand. That is of course where _audit.schema and _audit.conform come in, but I am not much liking the idea that an applicable item definition, taken in context of its dictionary, may not completely define the given item.

That is where we started - if we are to allow datanames to be used with global meaning and in multi-packet loops, then we are talking about different meanings, and only something like _audit.schema can insulate software from that. If we are to exclude changes in meaning, we have to define all child keys up front and then we need _audit.schema even more than before, as _audit.conform won't help. In the 'all child keys defined up front' scenario, we completely abandon global categories and _audit.schema becomes the signal as to when a datablock can be interpreted as for the old cif_core.

> Ultimately, this is going to be a matter of taste as the semantics can be made identical, and so I don't know quite what else you or I can say to convince each other on this point. We may have to rely on our colleagues to decide.

We do seem to have both settled into our positions. Would it sway you at all if I successfully devised a solution to the child key proliferation problem? I have some ideas in that direction that I haven't fleshed out yet.

It could indeed sway me as I think this is at the core of my objection. If we could effectively define all the child keys, while at the same time keeping the key definitions from swamping out the meat of the dictionary, and allow for the appearance of future 'used to be global' categories like twinning and variants adding their own child keys, then it would be worth serious thought. I'm pretty sure dREL can be brought along with whatever variation you propose.

>> Note, by the way, that I think the particular changes presented, or something very like them, are needed regardless of what we choose for the general case, because the DDL1 core and mmCIF are already structured this way.
> I was perhaps too diplomatic or long-winded in previous messages. The incorporation of space_group into cif_core as a looped category was a mistake that we must *not* perpetuate. We either correct it by dropping it from DDLm cif_core, which is impossible due to widespread DDL1 usage (as a 'Set' category), or we fix the semantics. So, in the case of space_group we can feel ourselves bound only by widespread current usage, not by the contradictory semantics of the DDL1 version.

I accept that the deprecation of SYMMETRY and SYMMETRY_EQUIV in favor of SPACE_GROUP and SPACE_GROUP_SYMOP was a mistake, but whatever fix we contemplate should adhere to our policy of keeping definitions stable, at least as well as we are able to make it do. Moreover, how to deal with SPACE_GROUP is a somewhat separate issue, because it involves definitions that already exist, as opposed to definitions that we may write in the future. It makes for a reasonable test case for our future direction, but it may be that a different solution is more suitable here than whatever we decide to do in the future, when we have no legacy definitions to deal with.

Our policy of keeping definitions stable is not an end in itself, but a logical requirement born of the need to guarantee that software that is already written remains valid. If everybody is using unlooped SPACE_GROUP to read and write structures I don't see any issue in fiddling with the meaning, as long as any changes are consistent with that expectation of an unlooped value.

I'll have more to say about these particular cases, in a separate message.

[...]

________________________________

--

T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

_______________________________________________
ddlm-group mailing list
[email protected]
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group
