
[ddlm-group] Discussion of hub-spoke proposal

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: [ddlm-group] Discussion of hub-spoke proposal
From: James Hester <[email protected]>
Date: Wed, 29 Jun 2016 11:45:03 +1000

Dear John and others,

John has provided a first cut of a 'hub and spokes' approach (NB: to avoid potential confusion with the 'Star' format, I've switched to 'Hub and Spoke'). I am not entirely sure that I've understood all of the variations that John speaks about, so the following can be seen as a stab at clarification.

The core points as I understand them:

(1) A 'Hub' category is defined with a default key

(2) Every 'Set' category is given a single key dataname that is a child of a 'Hub' key

(3) 'Loop' categories are given a single additional key dataname that is also a child of the 'Hub' key

This works as follows:

- Datafiles produced according to our current dictionaries are valid as all of the keys described above simply take default values and may be left out of datafiles. In particular, in the default scenario the 'Hub' key can only be single-valued and this constrains all 'Set' categories to be single-valued as their 'Hub' child keys can then only take a single value.

- where a 'Set' category becomes looped:

(i) the 'Hub' child key in that 'Set' necessarily becomes multi-valued, meaning that the 'Hub' category, listing the values of the 'Hub' keys, must now also be present in the datafile. When packets in different 'Set' categories have a one-to-one correspondence, the same 'Hub' child key value can be used to show this; otherwise distinct 'Hub' key values are necessary.

(ii) those Loop categories that are affected by a multi-valued Set category have 'Hub' child keys set to values that match the appropriate packet of that 'Set' category.

If I have misunderstood John's proposal I trust he will correct me.

Given the above understanding the most immediate problem is how we would deal with a Loop category that relies on global values from multiple 'Set' categories. If the particular packets in those 'Set' categories have different 'Hub' key values for even one of the loop category packets, it becomes impossible to specify a 'Hub' key for the Loop packet. For concreteness, suppose we have multiple twins and multiple space groups, with no necessary correspondence between the twin individual and the particular space group (e.g. two different settings are presented for every twin individual). Values in the 'refln' category will depend on both the twin individual chosen and the space group. There is no unique 'Hub' key and so we cannot cover this situation.
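
The difficulty can be made concrete with a small brute-force search. The following Python sketch is only an illustration under stated assumptions: the twin and setting labels, the hub key values 'h1' to 'h4' and the function consistent_assignment are all hypothetical and appear in no dictionary. It assumes that the 'Hub' child key is the sole key of each 'Set' category (so its values must be distinct within a category) and asks whether every (twin, space group) combination used by 'refln' can share a single 'Hub' key value.

```python
from itertools import permutations


def consistent_assignment(twins, space_groups, combos, hub_values):
    """Return (twin_hub, sg_hub) dicts if every used (twin, space group)
    combination can share one hub key value, otherwise None."""
    # permutations() yields distinct values, reflecting the assumption that
    # the hub child key is the only key of each Set category.
    for twin_keys in permutations(hub_values, len(twins)):
        twin_hub = dict(zip(twins, twin_keys))
        for sg_keys in permutations(hub_values, len(space_groups)):
            sg_hub = dict(zip(space_groups, sg_keys))
            # A refln packet needs ONE hub key that matches both Set packets.
            if all(twin_hub[t] == sg_hub[s] for t, s in combos):
                return twin_hub, sg_hub
    return None


# Hypothetical labels, for illustration only.
twins = ["twin1", "twin2"]
space_groups = ["setting1", "setting2"]
hub_values = ["h1", "h2", "h3", "h4"]

# Case A: each twin individual goes with exactly one setting.
one_to_one = [("twin1", "setting1"), ("twin2", "setting2")]
# Case B: both settings are presented for every twin individual.
all_pairs = [(t, s) for t in twins for s in space_groups]

print(consistent_assignment(twins, space_groups, one_to_one, hub_values))
print(consistent_assignment(twins, space_groups, all_pairs, hub_values))
```

In the one-to-one case an assignment is found. Once every twin individual is paired with both settings, the search returns None, because both twin packets would have to carry the same 'Hub' key value.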

I think I understand the 'Other' example category in John's email to be providing, in this case, a separate 'Hub' which would solve this particular situation. Extrapolating, however, this means that there would be a new 'Hub' category for every Loop category that depends on a unique combination of global values from more than one category. That is unwieldy and leads to a further multiplication of keys, this time in a more complex scheme, so it doesn't appear to have won us anything. I suggest that we can be more economical in the hub and spoke paradigm as follows, which I think is how John envisaged the SPACE_GROUP category working:

(1) 'Set' categories are given their own default-valued key.

(2) A 'Hub' category is defined with a single dataname acting as the key, and all other datanames in this 'Hub' category are child keys of the Set category keys defined in (1).

(3) Loop categories are given a single additional, default-valued key dataname which points to the 'Hub' key (same as (3) in the above scheme).

This scheme then works as follows:

- Datafiles produced according to our current dictionaries are valid as all of the keys described above simply take default values and may be left out of datafiles
- A datafile which introduces multiple values for a 'Set' category:
(i) lists those multiple values in that 'Set' category using the 'Set' key created in point 1 above
(ii) provides a 'Hub' key value for all those combinations of its 'Set' key values with other non-default 'Set' key values that are used by Loop categories, listing these key values in the 'Hub' category loop
(iii) when creating values in a 'Loop' category affected by any of the newly-looped 'Set' categories, sets the Loop's Hub child key defined at point (3) above to point to the row of the 'Hub' category corresponding to the particular values of the 'Set' categories that are relevant to the current Loop category packet.

In terms of how this would be used in e.g. a conversion of a fractional coordinate to Cartesian coordinates, the software (or dREL) would use the hub child key in 'atom_site' to find an entry in the 'Hub' category. This entry contains the 'Set' category key values relevant to that particular atom_site row, and these are then used to obtain the values that are necessary - so the hub.space_group_id key indexes into the space group category, and the hub.cell_id indexes into the cell parameters. One drawback of this scheme is that a value must be provided in each 'Hub' category loop packet for every 'Set' category key, regardless of whether the loop that uses that particular 'Hub' value has any dependence on that 'Set' category. Because of this, the default value must have a special notation so that software can understand when a particular key is irrelevant to a particular loop - 'dot' would suit here.
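
As a rough illustration of that lookup, here is a minimal Python sketch. The table layout, the column name hub_id, the function resolve_sets and all data values are assumptions made up for the example; only hub.space_group_id and hub.cell_id correspond to names used above. A 'dot' in a hub column is treated as "this Set category is irrelevant to the loop packet".

```python
# Hypothetical in-memory tables; all values are invented for illustration.
hub = {
    "h1": {"space_group_id": "sg1", "cell_id": "c1"},
    "h2": {"space_group_id": "sg2", "cell_id": "c2"},
    "h3": {"space_group_id": "sg1", "cell_id": "."},  # cell irrelevant here
}
space_group = {
    "sg1": {"multiplicity": 4},
    "sg2": {"multiplicity": 8},
}
cell = {
    "c1": {"length_a": 10.0},
    "c2": {"length_a": 10.5},
}
atom_site = [
    {"hub_id": "h1", "label": "C1"},
    {"hub_id": "h2", "label": "C1"},
    {"hub_id": "h3", "label": "O1"},
]

# Maps each hub column to the Set category table it indexes into.
SET_TABLES = {"space_group_id": space_group, "cell_id": cell}


def resolve_sets(packet):
    """Follow the packet's hub child key to the hub row, then use the Set
    keys stored in that row to fetch the relevant Set category packets."""
    hub_row = hub[packet["hub_id"]]
    resolved = {}
    for column, table in SET_TABLES.items():
        key_value = hub_row[column]
        # 'dot' marks a Set category on which this packet does not depend.
        resolved[column] = None if key_value == "." else table[key_value]
    return resolved


for site in atom_site:
    print(site["label"], site["hub_id"], resolve_sets(site))
```

The sketch also shows the drawback mentioned above: every hub row has to carry a value for every 'Set' key, even where the only sensible value is the 'dot' placeholder.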

Some comments on both these variations:

Datafiles
=======

(1) These proposals meet the criterion of ensuring that current datafiles remain valid

(2) The presence of a multi-packet 'Hub' category fulfills the same role as _audit.schema in protecting software from misinterpretation. To a workable approximation, a simple text search for the hub category master key dataname would be sufficient to distinguish old-style files from new-style files.
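
A minimal sketch of such a text search in Python follows. The dataname _hub.id is a placeholder (no hub key dataname has been agreed), and the function name looks_new_style is likewise hypothetical. The match is case-insensitive because CIF datanames are, and it is only an approximation: a dataname quoted inside a text field or comment would also match.

```python
import re


def looks_new_style(cif_text, hub_key="_hub.id"):
    """Return True if the hub key dataname occurs as a whole token.

    '_hub.id' is a placeholder name, not an agreed dataname.
    """
    # Require whitespace (or start/end of text) on both sides so that a
    # longer name such as '_hub.identifier' does not count as a match.
    pattern = r"(?:^|\s)" + re.escape(hub_key) + r"(?=\s|$)"
    return re.search(pattern, cif_text, flags=re.IGNORECASE) is not None


old_style = "data_example\n_cell.length_a 10.0\n"
new_style = "data_example\nloop_\n_hub.id\n_hub.space_group_id\nh1 sg1\nh2 sg2\n"

print(looks_new_style(old_style))
print(looks_new_style(new_style))
```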

Dictionaries
=========

(3) It is notable in the schemes (as I have interpreted them) that we are unable to specify which 'Set' categories influence which 'Loop' categories, as we are simply providing a Hub key and a Hub category that contains all defined 'Set' category keys. The 'Set' - 'Loop' link is up to the datafile writer. I do not feel comfortable that CIF software authors will come to identical conclusions on how to model the relatively complex situations we are talking about here: indeed, the job of the dictionaries is to describe usage in sufficient detail that all conforming CIF writers and readers agree on interpretation.

dREL
====

(4) It is highly desirable that a dREL routine does not change each time another 'Set' category becomes looped, as it would require new routines to be written for each combination of looped 'Set' categories. Under the current proposals, for us to preserve a given dREL method unchanged, dREL would have to give 'hub' keys and 'hub' categories special treatment, so that any time a 'Set' category was accessed in dREL, the 'Hub' key value for the current packet is used to index into the appropriate packet of the 'Set' category. We should therefore create a new class of Category (e.g. 'Hub') specifying this behaviour. Note that this logic cannot be explicitly laid out in dREL methods (even if we wanted to) because we cannot know ahead of time what other Set categories might appear. For example, in the 'Variant' scenario, when doing atom_site calculations we just want to pick the set of atom_sites corresponding to our current unit cell variant, but when we wrote the dREL method 'variants' did not exist. I have added an appendix with further analysis of this.

In summary, the present proposal requires the creation of a new category class (e.g. 'Hub'), but DDLm is otherwise unaffected. dREL semantics for the 'dot' operator need to be changed, but that is also true of Proposal #2. My key objection is that we are ceding the modelling of complex cases to individual software authors, thus risking a failure of the standard to ensure correct communication. This is almost (but not quite) saying 'as long as _audit.schema is non-standard, loop whatever you want and figure it out between yourselves'. So I remain in favour of proposal #2 with 'Global' replacing 'Set'.

James.

Appendix: How dREL should work under proposals #2 and 'Hub and spoke'
============================================================

Consider the following piece of dREL code from the current draft cif_core dictionary for calculating site multiplicity. This code contains a loop over a different category, as well as accessing a "Set" value (space_group.multiplicity), so it is likely to be sensitive to our decisions.

     With a as atom_site

        mul = 0
        xyz = a.fract_xyz

        Loop s as space_group_symop {

            sxyz = s.R * xyz + s.T
            diff = Mod( 99.5 + xyz - sxyz, 1.0 ) - 0.5

            If ( Norm ( diff ) < 0.1 ) mul += 1
        }

        _atom_site.site_symmetry_multiplicity = _space_group.multiplicity / mul

Suppose now that we have multiple space groups *and* multiple variants, so that the key for _atom_site consists of the label and either (#2) the variant and the space group or (H&S) the hub key. Suppose also that the category space_group_symop has only two keys: the symop number and either (#2) the space group or (H&S) the hub key. First, under current dREL rules this code will probably fail miserably, as it will apply symmetry operators from both space groups and then attempt to access a unique value for _space_group.multiplicity.

So I propose that we adopt the following simple dREL rules, which are really just deconstructing the category structure back to the original schema:

(i) any loops over categories filter those categories on the current values of "sibling" keys (after hub category examination for H&S).

(ii) any access to 'Set' categories implicitly uses the current value of any related (sibling or parent) keys (after hub category examination for H&S)

The above dREL executes as follows:
(1) the code is executed for each packet in _atom_site, so, at execution, all key values are defined. For the H&S proposal, the dREL engine indexes into the hub category using the hub child key and sets all Set category child keys to the given values: these are then placed in scope to recreate the situation of proposal #2.

(2) space_group_symop shares a sibling key with atom_site: both have child keys of space_group. Therefore, under rule (i) the 'Loop' only handles those packets of space_group_symop that have the same value of space_group_id as the current _atom_site packet. 'mul' is calculated identically to the single space group case (i.e. correctly).

(3) the reference to space_group.multiplicity uses the row of space_group indexed by the only related key, space_group_id, for the current site packet

If we decide that we have 'variants' for space_group (and therefore necessarily space_group_symop) as well, then the above loop and dereference would additionally restrict their selection using the value of the variant key.
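
To make rules (i) and (ii) concrete, here is a minimal Python sketch of the same multiplicity calculation with the implicit filtering written out explicitly. It is not dREL and not a proposed implementation: the table layout, the key name space_group_id on each packet and all data values are assumptions for illustration. It corresponds to the proposal #2 situation; under H&S the space_group_id would first be looked up through the hub category.

```python
import math

IDENTITY = ((1, 0, 0), (0, 1, 0), (0, 0, 1))
INVERSION = ((-1, 0, 0), (0, -1, 0), (0, 0, -1))

# Two space groups in one data block; values are invented for illustration.
space_group = {"sg1": {"multiplicity": 1}, "sg2": {"multiplicity": 2}}
space_group_symop = [
    {"space_group_id": "sg1", "id": 1, "R": IDENTITY, "T": (0.0, 0.0, 0.0)},
    {"space_group_id": "sg2", "id": 1, "R": IDENTITY, "T": (0.0, 0.0, 0.0)},
    {"space_group_id": "sg2", "id": 2, "R": INVERSION, "T": (0.0, 0.0, 0.0)},
]
atom_site = [
    {"space_group_id": "sg1", "label": "C1", "fract_xyz": (0.0, 0.0, 0.0)},
    {"space_group_id": "sg2", "label": "C1", "fract_xyz": (0.0, 0.0, 0.0)},
    {"space_group_id": "sg2", "label": "C2", "fract_xyz": (0.1, 0.2, 0.3)},
]


def site_multiplicity(a):
    """Site multiplicity for one atom_site packet, filtering on sibling keys."""
    sg_id = a["space_group_id"]
    xyz = a["fract_xyz"]
    mul = 0
    for s in space_group_symop:
        # Rule (i): only loop over packets sharing the sibling key value.
        if s["space_group_id"] != sg_id:
            continue
        sxyz = [sum(s["R"][i][j] * xyz[j] for j in range(3)) + s["T"][i]
                for i in range(3)]
        diff = [(99.5 + xyz[i] - sxyz[i]) % 1.0 - 0.5 for i in range(3)]
        if math.sqrt(sum(d * d for d in diff)) < 0.1:
            mul += 1
    # Rule (ii): the Set access uses the packet selected by the related key.
    return space_group[sg_id]["multiplicity"] / mul


for a in atom_site:
    print(a["space_group_id"], a["label"], site_multiplicity(a))
```

Without the filter in rule (i), the site at the origin in 'sg2' would be compared against all three symmetry operators, and the reference to the space group multiplicity would have no unique packet to read from.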

====

On 28 June 2016 at 08:31, Bollinger, John C <[email protected]> wrote:
[...]

In preparation for an example, let’s suppose that we follow mmCIF by naming the default hub category ENTRY. We would give a child key referencing ENTRY to each category that must be single-valued with respect to a "normal" data set (because that’s the kind of data that ENTRY represents). mmCIF in fact does just that. We must ensure that we do not end up with, say, multiple instances of CELL_LENGTH referencing the same ENTRY, for allowing that would introduce just the kind of ambiguity we want to avoid. One way to do that would be to make these child keys also be their categories’ category keys (leaving aside for the moment how we classify those categories). We would furthermore assign default values for the keys to enable them to be omitted from data files. Some Loop categories would need such child keys added to their own category keys as well.

The exception in our current core dictionary is SPACE_GROUP, because it is defined as a Loop but used as a Set with respect to any given ENTRY. There we define the relationship in the other direction: ENTRY gets a child key referencing SPACE_GROUP. Note here that we do not need to prevent multiple ENTRY instances from referring to the same SPACE_GROUP.

Most categories that are already Loops also need to be tied somehow to an ENTRY. How that happens depends on their own key structure: in many cases, the Loop’s category key will need to be expanded with a child key referencing ENTRY, but if a Loop has a category key referencing a parent category, and the parent’s category key, if any, does not reference ENTRY, then the child’s will not reference ENTRY either. This would apply to SPACE_GROUP / SPACE_GROUP_SYMOP and probably to several of the PUBL_* categories, for example.

Taking your second example first, adding a new category that does not serve as a hub category does not require changes to any other category. No changes are needed because category instances associated with the same hub category recognize each other by virtue of their separate and independent associations with that hub, as opposed to relying on each other to be global or to have keys associating them directly. This applies, bidirectionally, to new categories just the same as it does to existing ones.

As for your first example, providing multiple instances of the CELL_* categories could be done multiple ways, but here are three of the more likely:

(1) provide multiple instances of ENTRY

This case can be exercised without any new definitions at all, but the different instances of the CELL_* categories would need to explicitly provide cell_*.entry_id values. Probably the other expressed categories that have child keys referencing ENTRY would need to express those child keys explicitly as well, but that depends to some extent on what information they are intended to convey. We might do this if we wanted to describe several structures in the same data block, for example. This has the advantage that we can use as much or as little as we want of each ENTRY – cell, space group, atom sites, even experimental details.

(2) define a new hub category

This is the route that would be taken when providing for multiple cells in order to describe data that are not adequately modeled either by an ENTRY or by a collection of them. In this case, we would need to define new keys associating the appropriate categories (which would not necessarily be all of them) with the new hub category. At this point I’m not actually seeing where we would find a need to aggregate multiple cells directly, instead of aggregating ENTRYs, but I can’t rule it out.

(3) give the CELL_* categories a surrogate key, and use a variants-like approach to associate additional cells () with ENTRY.

I’m supposing here that this would be used only for cells that are “secondary” in some sense, for example if the cell parameters were measured at multiple temperatures but only one data set was collected and only one structure determination was performed. (If full structure determinations were done at multiple temperatures then that might be better handled by providing multiple ENTRYs instead.) This approach might also require new DDLm semantics allowing us to specify that cell_*.entry_id, although not (any longer) a category key, must not take duplicate values.

As for an example, although there is some room for variation within the star-schema approach, this is a cut at what I have primarily suggested:

====

save_ENTRY

_definition.id                          ENTRY
_definition.scope                       Category
_definition.class                       Loop
_definition.update                      2016-06-27
_description.text
;
    Represents a chemical or biological structure and associated experimental details.
;
_name.category_id                       CIF_CORE
_name.object_id                         ENTRY
_category.key_id                        '_entry.id'

save_

save__entry.id

_definition.id                          '_entry.id'
_definition.update                      2016-06-27
_description.text
;
     Identifies and distinguishes specific entries within a given
     data block.
;
_name.category_id                       entry
_name.object_id                         id
_type.purpose                           Key
_type.source                            Assigned
_type.container                         Single
_type.contents                          Text
_enumeration.default                    ''

save_

save__entry.sg_id

_definition.id                          '_entry.sg_id'
_definition.update                      2016-06-27
_description.text
;
     Identifies the space group with respect to which an entry's atom
     sites and geometry tables are intended to be interpreted.
;
_name.category_id                       entry
_name.object_id                         sg_id
_type.purpose                           Link
_type.source                            Related
_type.container                         Single
_type.contents                          Text
_type.contents_referenced_id            '_space_group.id'
_enumeration.default                    ''

save_

save_ATOM_SITE

_definition.id                          ATOM_SITE
_definition.scope                       Category
_definition.class                       Loop
_definition.update                      2016-06-27
_description.text
;
     The CATEGORY of data items used to describe atom site information
     used in crystallographic structure studies.
;
_name.category_id                       ATOM
_name.object_id                         ATOM_SITE
_category.key_id                        '_atom_site.key'
loop_
_category_key.name
         '_atom_site.entry_id'
         '_atom_site.label'

save_

save__atom_site.entry_id

_definition.id                          '_atom_site.entry_id'
_definition.update                      2016-06-27
_description.text
;
     Associates an atom site with the entry to which it pertains.
;
_name.category_id                       atom_site
_name.object_id                         entry_id
_type.purpose                           Link
_type.source                            Related
_type.container                         Single
_type.contents                          Text
_type.contents_referenced_id            '_entry.id'
_enumeration.default                    ''

save_

save__atom_site.key

_definition.id                          '_atom_site.key'
loop_
_alias.definition_id
         '_atom_site.key'
_definition.update                      2012-11-20
_description.text
;
     Value is a unique key to a set of ATOM_SITE items
     in a looped list.
;
_name.category_id                       atom_site
_name.object_id                         key
_type.purpose                           Key
_type.source                            Related
_type.container                         List
_type.contents                          'Text,Code'
loop_
_method.purpose
_method.expression
         Evaluation   '_atom_site.key = [_atom_site.entry_id,_atom_site.label]'

save_

====

If a new hub category OTHER were added that also had atom sites associated with it, the following revisions to the above definitions, together with new definitions in the above categories, might be needed (definitions for OTHER omitted):

save_ATOM_SITE

_definition.id                          ATOM_SITE
_definition.scope                       Category
_definition.class                       Loop
_definition.update                      2016-06-27
_description.text
;
     The CATEGORY of data items used to describe atom site information
     used in crystallographic structure studies.
;
_name.category_id                       ATOM
_name.object_id                         ATOM_SITE
_category.key_id                        '_atom_site.key'
loop_
_category_key.name
         '_atom_site.entry_id'
         '_atom_site.other_id'
         '_atom_site.label'

save_

save__atom_site.key

_definition.id                          '_atom_site.key'
loop_
_alias.definition_id
         '_atom_site.key'
_definition.update                      2012-11-20
_description.text
;
     Value is a unique key to a set of ATOM_SITE items
     in a looped list.
;
_name.category_id                       atom_site
_name.object_id                         key
_type.purpose                           Key
_type.source                            Related
_type.container                         List
_type.contents                          'Text,Text,Code'
loop_
_method.purpose
_method.expression
         Evaluation   '_atom_site.key = [_atom_site.entry_id,_atom_site.other_id,_atom_site.label]'

save_

save__atom_site.other_id

_definition.id                          '_atom_site.other_id'
_definition.update                      2016-06-27
_description.text
;
     Associates an atom site with the OTHER to which it pertains.
;
_name.category_id                       atom_site
_name.object_id                         other_id
_type.purpose                           Link
_type.source                            Related
_type.container                         Single
_type.contents                          Text
_type.contents_referenced_id            '_other.id'
_enumeration.default                    .

save_

Of course this is fairly speculative, because much depends on the nature of the relationship between OTHER, ATOM_SITE, ENTRY, and other categories.

John

--

John C. Bollinger, Ph.D.

Computing and X-Ray Scientist

Department of Structural Biology

St. Jude Children's Research Hospital

[email protected]

(901) 595-3166 [office]

www.stjude.org

From: ddlm-group [mailto:[email protected]] On Behalf Of James Hester
Sent: Tuesday, June 21, 2016 12:51 AM
To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] Further discussion of proposal #2

Dear John,

Your idea of a hub and star layout sounds like it has potential, and perhaps is similar to the Variant category used in later versions of imgCIF, but I'm not sure that I've grasped it fully. Do you think you could flesh out an example? To make it concrete, how about giving some datanames and sketchy DDLm definitions showing how you expect it would work:

(1) If cell_parameters are looped - how would the atom_site category in particular change and what new definitions are required?

(2) If we then introduce a category as yet unknown in cif_core, such as Variant - how would the atom_site category change?

thanks,

James.

On 21 June 2016 at 07:47, Bollinger, John C <[email protected]> wrote:

Dear All,

My apologies for the elements of review in what follows. Writing them helped me organize my thoughts, so I hope that reading them will help communicate those thoughts.

As Herbert reminds us, for just about any category that might appear in a data file, one can imagine an experiment, a construct, a model, etc. whose description requires multiple instances of that category. As James observes, however, many categories in our current dictionaries so rarely require such treatment that we have gotten along fine with the DDL1 and DDLm core dictionaries not, technically, permitting multiple instances of those categories to be presented in the same data file at all. In mmCIF, on the other hand, substantially all categories are loopable in principle, with many of them associated together indirectly via the ENTRY category and its _entry.id attribute. Inasmuch as _entry.id "identifies the data block", however, that amounts to a distinction without much difference.

But mmCIF’s ENTRY category is nevertheless instructive. Formally, many categories defined as Sets in the DDLm core are associated with each other in mmCIF not by having global nature but by referring to the same ENTRY. This arrangement is similar to what is called a "star schema" in data warehousing: instead of a multitude of individual entities being global (which cannot generally be accommodated in a data warehouse) or all having direct relationships declared with a large number of other entities, they are instead all related to a single central entity; the relationships can be visualized as emanating in a star-like pattern from that central entity. In such a data warehouse, the central entity often represents a point in time; it constitutes the dimension along which all the other entities can jointly and concertedly vary.

So suppose we took the ENTRY idea from mmCIF, but allowed a block to contain multiple ENTRYs? As far as I can determine, that’s consistent with the machine-readable parts of the definitions of ENTRY and _entry.id anyway, though it seems inconsistent with their prose descriptions. In that way, a data file could be valid against mmCIF and nevertheless describe, say, multiple CELLs, without there being any ambiguity about which CELL went with which REFLNS. That’s similar to what we want to be able to do, but it doesn’t quite get us everywhere we want to go. The problem that we are grappling with can be viewed as how to deal with a situation wherein we want or need a different pattern of relationships between categories than the one described by the relationships with ENTRY.

James’s proposal #2 approaches the problem from a different angle. It acknowledges that there is more than one possible pattern of categories and relationships characterizing a data set, and it designates these as "schemas", which is indeed an apt term. It uses the category label 'Set' or maybe 'Global' (which I prefer for this purpose) to define a pattern of 1:1 relationships that serves as a functional substitute for mmCIF’s explicit relationships between ENTRY and other categories; it introduces a mechanism for declaring that a given data file in fact complies with a different schema than the default; and it provides a mechanism aimed at helping software determine whether and to what extent it can correctly interpret the file’s contents. At that high level, I don’t disagree with any of it, but we’ve gone several rounds over the details. Our main sticking point is related to how the relationships among categories should be described in dictionaries -- especially those that to date have been implicit in categories being defined as Sets.

Now suppose we combine the high-level idea of providing for multiple schemas with the mmCIF star schema structure. The DDLm core can model each distinct schema as a simple category and the hub of its own star schema, like mmCIF’s ENTRY. Existing categories can participate in more than one of these where appropriate, though initially there would be only one. Converting the existing DDLm core to this structure would involve creating one new key in each current Set category (mmCIF already has these keys), and possibly child keys in other categories. It does not necessarily affect existing data files at all, because we can define default values for the various keys. In this way, all needed keys can be explicitly defined, with a much more modest overall number of keys than if relationships were expressed directly among all categories, and consequently with much less impact when new categories are added.

This also provides a fairly clean way to deal with SPACE_GROUP, and with any future categories that present a similar problem. Whereas with categories such as CELL we could enforce the restriction of one CELL per hub instance by making CELL’s category key be a child key referencing the hub category, we could reverse that for SPACE_GROUP and any similar category: give the hub category a child key referencing SPACE_GROUP.

To wrap it all together and make it easier for software authors to deal with, we can add _audit.schema or something like it. One variation that occurs to me would be to have _audit_schema.name and _audit_schema.multiplicity, with the former taking as its values the names of schema hub categories, and the latter taking values from an enumerated set describing whether that category is present and if so, whether it is restricted to a single value. This would provide a fairly easy mechanism by which data files could advertise their structure to consumers, and for software to gauge whether they can handle the data.

Best regards,

John


_______________________________________________
ddlm-group mailing list
[email protected]
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group
