Re: [ddlm-group] Discussion of hub-spoke proposal

Dear John B and others,

I had indeed subtly failed to grasp the key elements of Hub and Spoke (H&S), and now that I have grasped them (I hope), I think it does indeed minimise child key proliferation. My mistake, in case anyone else was confused, was that I had missed the crucial point that hub child keys are always added to categories (regardless of whether they are 'Set' or 'Loop'), rather than 'Set' category child keys being added to hubs. John has given good reasons for doing it this way.  Just to summarise my understanding, then, the H&S proposal is that:

(1) For each distinct data modelling task, a 'Hub' category is defined
(2) All categories that interact in this data model are given child keys of the 'Hub'
(3) Hubs may be layered.

So the current situation with 'Global' categories can be instead modelled as follows:

(1) Create a notional 'Datablock' hub category
(2) Give every category the same default child key of 'Datablock'

This in fact precisely mimics the effect of enclosing all the datanames within a datablock, thereby associating the datablock name with every category without specifying which categories actually influence which other categories, and I suggest that this notional category is assumed always to exist (more later).

Now for a worked example as a check: 

Suppose we wish to include twinning.  We assume at first that we have a number of twin individuals which have identical space groups and cell parameters, but different orientations. So, we create a new Hub, 'Twin', which lists twin-specific information (e.g. mass fraction and orientation matrix).  Category 'refln' is given a new 'Twin' child key, so that reflection measurements and calculations from different twin individuals can be listed.  We do not give twin child keys to cell_parameters or space_group, thus restricting them to their single notional 'Datablock' hub key.

Meanwhile, another data modelling exercise has produced a Hub category 'Multi-settings', which allows multiple settings of the same space group to be specified, so that atoms, reflections, cell parameters etc. can be listed in each setting.  In this case, atom_site, refln and cell_parameters all have child keys of the 'Multi-settings' Hub.

Particularly intrepid data modellers then wish to describe their twin data in several settings.  They can simply overlay the two dictionaries, which define completely separate datanames, and now have a consistent description of their data.  For example, the refln category has two new keys, one for the setting and one for the twin individual.
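
As a rough illustration of this overlay (a Python sketch, not DDLm; the key datanames such as 'twin_id' and 'setting_id' are hypothetical and appear in no dictionary), each dictionary can be pictured as a mapping from category name to the key datanames it contributes:

```python
# Hypothetical sketch: each dictionary maps category name -> key datanames.
# None of these key names are taken from a real DDLm dictionary.
CORE = {
    "refln": ["datablock_id", "hkl"],
    "atom_site": ["datablock_id", "label"],
    "cell_parameters": ["datablock_id"],
    "space_group": ["datablock_id"],
}
# The 'Twin' hub adds a child key to refln only.
TWIN = {
    "twin": ["datablock_id", "twin_id"],
    "refln": ["twin_id"],
}
# The 'Multi-settings' hub adds a child key to three categories.
SETTINGS = {
    "multi_settings": ["datablock_id", "setting_id"],
    "refln": ["setting_id"],
    "atom_site": ["setting_id"],
    "cell_parameters": ["setting_id"],
}


def overlay(*dictionaries):
    """Merge dictionaries, accumulating each category's keys in order."""
    merged = {}
    for dictionary in dictionaries:
        for category, keys in dictionary.items():
            target = merged.setdefault(category, [])
            for key in keys:
                if key not in target:
                    target.append(key)
    return merged


combined = overlay(CORE, TWIN, SETTINGS)
print(combined["refln"])        # keys contributed by all three dictionaries
print(combined["space_group"])  # untouched by either hub
```

Because the two hub dictionaries define disjoint datanames, the order in which they are overlaid affects only the order of the keys, not their content.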

Finally, they also want to use 'variants' to track the progress of their data reduction. So they create a 'Variant' hub category and add child keys of it to everything.

Next, some refreshed observations which will hopefully be correct this time, and which in some aspects simply reiterate in my own words what John has already said:

(1) If there is to be more than one 'Hub' category, simply searching for a 'Hub' dataname is not a substitute for _audit.schema, as there could potentially be various different such 'Hub' datanames unknown at software creation time.  I suggest solving this using _audit.schema.
(2) The 'datablock' category described above is invariant: it can always take the default value for the parent key, and child keys are present in all categories, with default values throughout.  It can essentially be left out of all datablocks. It therefore makes sense to me to leave the 'Set' category defined as a reminder (perhaps renamed to 'Global'), on the understanding that the actual behaviour is described by a default-valued Hub category.
(3) The 'twin' hub category is the same as the twin_individual category introduced in the twinning dictionary; in that case, we had to create a copy of the refln category with a different name as we lacked the mechanism being discussed here. Likewise, 'variant' in imgCIF works precisely as described here.  This is a sign that we are on the right track, I think.

I have no particular observations regarding dictionaries, beyond the very nice composability offered by this proposal.

Transformations between 'schema'
================================

A pleasant property of proposal #2 was the guaranteed ability to mechanically transform a datablock between schema, in particular, to create datablocks that conformed to the default schema.  For H&S, this is also possible by emitting a datablock for each value of each Hub category key.
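
A minimal Python sketch of such a decomposition for a single hub key follows. It assumes a datablock is held as a mapping from category name to a list of packets (dicts), and that categories without the hub child key apply to every emitted block; the function name and the 'twin_id' key are hypothetical.

```python
def split_by_hub(block, hub_key):
    """Emit one default-schema block per value of `hub_key`.

    `block` maps category name -> list of packets (dicts).
    Packets lacking `hub_key` are copied into every output block.
    """
    values = sorted(
        {p[hub_key] for packets in block.values() for p in packets if hub_key in p}
    )
    result = {}
    for value in values:
        new_block = {}
        for category, packets in block.items():
            new_block[category] = [
                # Drop the hub key: the default schema does not know about it.
                {k: v for k, v in p.items() if k != hub_key}
                for p in packets
                if p.get(hub_key, value) == value
            ]
        result[value] = new_block
    return result


block = {
    "cell": [{"length_a": 5.0}],
    "refln": [
        {"twin_id": "1", "hkl": (1, 0, 0), "F_meas": 10.0},
        {"twin_id": "2", "hkl": (1, 0, 0), "F_meas": 12.0},
    ],
}
blocks = split_by_hub(block, "twin_id")
```

With several hubs, the same step could be applied once per hub key, giving one block per combination of hub key values.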

dREL
====

I repeat the desirability of setting things up so that dREL methods do not need editing when a new 'Hub' child key appears in a Loop category.  This is possible with the following rules:

(1) Any accesses within a dREL method to a value in a different category are taken to refer to the packet that matches the complete set of common sibling keys (i.e. including hub keys).
(2) Where a category contains keys that are themselves siblings, the dREL method must explicitly state the values of those keys when accessing other categories (see GEOM_ANGLE).
(3) A dREL method may never be rewritten because of the addition of a new key to the category. Instead, a new dataname should be defined.
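
Rule (1) can be pictured with the following Python sketch (not dREL). It makes the simplifying assumption that sibling keys carry the same name in both categories; the key names 'sg_id' and 'variant_id' are hypothetical, and whether space_group_symop would really acquire such keys is a dictionary design decision.

```python
def related_packets(current, current_keys, target_packets, target_keys):
    """Packets of the target category agreeing with `current` on every shared key."""
    common = [k for k in current_keys if k in target_keys]
    return [p for p in target_packets if all(p[k] == current[k] for k in common)]


# Before the new hub: the two categories share one (hypothetical) key.
atom = {"label": "C1", "sg_id": "A"}
symops = [
    {"sg_id": "A", "id": 1},
    {"sg_id": "A", "id": 2},
    {"sg_id": "B", "id": 1},
]
before = related_packets(atom, ["label", "sg_id"], symops, ["id", "sg_id"])

# After a 'variant' hub key is added to both categories, the same call
# narrows the selection further without any change to the function.
atom_v = {"label": "C1", "sg_id": "A", "variant_id": "v2"}
symops_v = [
    {"sg_id": "A", "id": 1, "variant_id": "v1"},
    {"sg_id": "A", "id": 1, "variant_id": "v2"},
    {"sg_id": "B", "id": 1, "variant_id": "v2"},
]
after = related_packets(
    atom_v,
    ["label", "sg_id", "variant_id"],
    symops_v,
    ["id", "sg_id", "variant_id"],
)
```

The point of the sketch is that the list of common keys comes from the dictionary, not from the method, so a method written before 'variant_id' existed needs no editing.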

Regarding (3), if we perform decomposition of a datablock into the 'default' schema, then this will mean all datanames have their usual interpretation and will have the exact relationship defined in the dREL.

At first glance, it appears that 'Hub' categories are in no way special from a DDLm/dREL point of view and do not need a special designation.

I inserted a few further comments regarding dREL below.  Once we have tidied up any loose ends, I suggest we settle on a proposal that essentially boils down to defining _audit.schema appropriately.

James.

On 2 July 2016 at 02:29, Bollinger, John C <John.Bollinger@stjude.org> wrote:
Dear James and others,

[...edited out]

(JRH)
> Note that this logic cannot be explicitly laid out in dREL methods (even if we wanted to) because we cannot know ahead of time what other Set categories might appear.  For example, in the 'Variant' scenario, when doing atom_site calculations we just want to pick the set of atom_sites corresponding to our current unit cell variant, but when we wrote the dREL method 'variants' did not exist.  I have added an appendix with further analysis of this.

(John B)
If we change the way that unit cells are associated with a collection of structural data so that there can be several multiplexed variants, then it is unsurprising to need to change some dREL routines correspondingly.  dREL cannot be expected to withstand every imaginable change to the relational structure of the data on which it operates -- and that would be a good reason to avoid performing such a change on the primary unit cell of a structure.  If we wanted to provide for associating additional, secondary cells, however, then it might be sensible to provide for those (only) via variants.  There is no essential incompatibility between variants and H&S.

We can absolutely write dREL methods that explicitly traverse hub categories if we wish to do so (because the relationships traversed have well-defined significance), and if we have designed our data well then we can expect that such methods will rarely, if ever, require changes.  Certainly adding a new category does not inherently require any methods to change, as in itself it has no effect on keys of other categories, or on the nature of relationships between other categories in general.

(JRH)
One of the motivations behind dREL was that it provided an algorithm that was accessible to non-programmers but simultaneously machine-readable.  The less explicit category indexing we do, the better.  I think also that the dREL method captures the 'meaning' of a dataname, so if it has to be rewritten (modulo the rules I proposed) then we should instead create a new dataname.  In any case, I don't think that there are particular objections that can be levelled at H&S from the dREL point of view.
 

> Appendix: How dREL should work under proposals #2 and 'Hub and spoke'
> ============================================================
> Consider the following piece of dREL code from the current draft cif_core dictionary for calculating site multiplicity. This code contains a loop over a different category, as well as accessing a "Set" value (space_group.multiplicity), so it is likely to be sensitive to our decisions.
>
>      With  a  as  atom_site
>
>         mul  =   0
>         xyz  =   a.fract_xyz
>
>         Loop  s  as  space_group_symop  {
>
>              sxyz  =   s.R * xyz + s.T
>              diff  =   Mod( 99.5 + xyz - sxyz, 1.0) - 0.5
>
>              If ( Norm ( diff ) < 0.1 ) mul +=  1
>         }
>      _atom_site.site_symmetry_multiplicity =  _space_group.multiplicity / mul
>
> Suppose now that we have multiple space groups *and* multiple variants, so that the key for _atom_site consists of the label and either (#2) the variant and the space group or (H&S) the hub key.  Suppose also that the category space_group_symop has only two keys: the symop number and either (#2) the space group or (H&S) the hub key.  First, under current dREL rules this code will probably fail miserably, as it will apply symmetry operators from both space groups and then attempt to access a unique value for _space_group.multiplicity.


(John B)
You can't just say "we have multiple space groups" or "we have multiple variants."  Everything depends on how those are associated with the rest of the data and with each other, which depends in part on *why* there are multiple space groups and multiple variants.  All that is bound up in the definitions of the categories, hub or otherwise, that make it possible.  There are many ways that could be accomplished with either hub and spokes or with Prop 2.  It is unreasonable to assume that the one chosen by the dictionary maintainers would make the data ambiguous or make common operations difficult to perform.  Although I can believe that the given method, running against the current dictionary, could go badly wrong, it is not clear that every dictionary that accommodates such an arrangement of data necessarily affords the possibility that it will go wrong.

(JRH)
It must go wrong simply because of the 'Global' nature of space_group.  Under current dREL and DDLm semantics, there is no way the final line of dREL can access a unique value if more than one SPACE_GROUP packet is present, regardless of what Hubs and child keys have been configured. This is a straightforward outcome of the 'Set'/'Loop' dichotomy in DDLm.


(John B)
With respect to SPACE_GROUP_SYMOP, I would not suppose that it has a hub key as part of its key.  I see no reason to structure SPACE_GROUP and SPACE_GROUP_SYMOP any differently than they are structured in symCIF and the DDL1 core, and I maintain my previous assertion that they should be structured the same in the DDLm core.  Since that gives SPACE_GROUP a surrogate key for SPACE_GROUP_SYMOP to reference, and since I see no reason why that key would need to be expanded, SPACE_GROUP_SYMOP would not need any additional keys.

(JRH)
To avoid the distraction of SPACE_GROUP, suppose in my example that 'space_group.multiplicity' is 'symmetry.multiplicity', as was done in the original DDLm version before I replaced SYMMETRY with SPACE_GROUP and started this all off.  As I've said before, I give no weight to the way SPACE_GROUP has been defined in symCIF or DDL1. It is a mistake that we are trying to fix, and regardless of the way DDL1 and symCIF define SPACE_GROUP, this category is used as a 'Global' category in all structural CIFs.  The dREL given above explicitly assumes that SPACE_GROUP is a 'Set' category.

> So I propose that we adopt the following simple dREL rules, which are really just deconstructing the category structure back to the original schema:
>
> (i) any loops over categories filter those categories on the current values of "sibling" keys (after hub category examination for H&S).
> (ii) any access to 'Set' categories implicitly uses the current value of any related (sibling or parent) keys (after hub category examination for H&S)

(John B)
Maybe this is equivalent, but the basic rule that I advocate is that references to other categories always follow the defined relationships, with the semantics of a relational JOIN operation.  Thus, 'Loop  s  as  space_group_symop' would mean a loop over those space_group_symop rows that are related to the atom_site row on which the method is invoked.  That seems natural for dREL.

(JRH)
I think that this is equivalent.
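
To make the join reading concrete, here is a Python transcription of the site multiplicity method from the appendix, with the loop and the 'Set' access both filtered on shared keys. It is a sketch under stated assumptions: the linking key 'sg_id' and the packet layout are hypothetical, and the filtering is done by an explicit helper rather than by any dREL engine.

```python
import math


def apply_symop(symop, xyz):
    """Return R * xyz + T for a 3x3 matrix R and translation T."""
    R, T = symop["R"], symop["T"]
    return [sum(R[i][j] * xyz[j] for j in range(3)) + T[i] for i in range(3)]


def site_multiplicity(atom, symops, space_groups, shared_keys=("sg_id",)):
    """Multiplicity of `atom`, using only packets related to it."""

    def related(packet):
        return all(packet[k] == atom[k] for k in shared_keys)

    xyz = atom["fract_xyz"]
    mul = 0
    for s in filter(related, symops):  # rule (i): filtered loop
        sxyz = apply_symop(s, xyz)
        diff = [((99.5 + x - sx) % 1.0) - 0.5 for x, sx in zip(xyz, sxyz)]
        if math.sqrt(sum(d * d for d in diff)) < 0.1:
            mul += 1
    # Rule (ii): exactly one related space_group packet must remain.
    (group,) = [g for g in space_groups if related(g)]
    return group["multiplicity"] / mul


IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
INVERSION = [[-1, 0, 0], [0, -1, 0], [0, 0, -1]]
ZERO = [0.0, 0.0, 0.0]

space_groups = [
    {"sg_id": "P-1", "multiplicity": 2},
    {"sg_id": "P1", "multiplicity": 1},
]
symops = [
    {"sg_id": "P-1", "id": 1, "R": IDENTITY, "T": ZERO},
    {"sg_id": "P-1", "id": 2, "R": INVERSION, "T": ZERO},
    {"sg_id": "P1", "id": 1, "R": IDENTITY, "T": ZERO},
]
on_centre = {"label": "Fe1", "sg_id": "P-1", "fract_xyz": [0.0, 0.0, 0.0]}
general = {"label": "C1", "sg_id": "P-1", "fract_xyz": [0.1, 0.2, 0.3]}

m_centre = site_multiplicity(on_centre, symops, space_groups)
m_general = site_multiplicity(general, symops, space_groups)
```

Without the filter, the atom on the inversion centre would be counted against all three operators and the final lookup would find two space_group packets, which is the failure described in the appendix.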

(John B)
It gets more complicated if there are multiple paths along which the two categories are related, however.  Generally speaking, there is a specific chain of relationships that is appropriate for any particular purpose.  Discovering such chains is a graph path-finding problem.  We can provide heuristics similar to the rules you suggest where these help choose the correct path, but I am not confident that we can provide rules that can be relied upon always to work.  The longer the relationship chain, the better an idea it is for dREL code to express it specifically.  Which it totally can do.

(JRH)
I would still allow explicit expression of category key lookups, where necessary. A useful example is the 'geom_angle' category, where there are multiple keys pointing to the 'model_site' category.  There is no unambiguous way to refer to a 'model_site' packet from a 'geom_angle' packet without explicitly stating which of the 'geom_angle' keys should be used.  Apart from this, if we have to explicitly code category lookups for a dataname after adding a new 'Hub' key, where we did not have to do so under the above rules, then we should create a new dataname instead, as discussed above.
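
The ambiguity can be seen in a small Python sketch (packet layout and dataname spellings are illustrative only): three keys of one 'geom_angle' packet all point at 'model_site', so an implicit match cannot pick a single packet and the method has to name the key it means.

```python
# Hypothetical packets; the dataname spellings are illustrative only.
model_site = [
    {"label": "O1", "fract_x": 0.00},
    {"label": "C1", "fract_x": 0.10},
    {"label": "O2", "fract_x": 0.20},
]
geom_angle = {
    "atom_site_label_1": "O1",
    "atom_site_label_2": "C1",  # apex atom
    "atom_site_label_3": "O2",
}


def lookup(packets, key, value):
    """Return the single packet whose `key` equals `value`."""
    matches = [p for p in packets if p[key] == value]
    if len(matches) != 1:
        raise LookupError(f"expected one packet with {key}={value!r}, got {len(matches)}")
    return matches[0]


# An implicit rule would find three candidate model_site packets.
candidates = [p for p in model_site if p["label"] in geom_angle.values()]

# The explicit form states which geom_angle key drives the lookup.
apex = lookup(model_site, "label", geom_angle["atom_site_label_2"])
```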

[...]
> If we decide that we have 'variants' for space_group (and therefore necessarily space_group_symop) as well, then the above loop and dereference would additionally restrict their selection using the value of the variant key.

(John B)
And this is where heuristics come in.  I think such heuristics can carry us a long way, and I am not opposed to defining and implementing them, but I think it's ultimately safer for dREL code to traverse the intended relationships explicitly.

(JRH)
The point of the heuristics is to avoid having to change dREL methods every time a new hub child key is added to a category.  If these heuristics are insufficient to capture the new relationships, a new dataname should be defined to accompany whatever new configuration has been created, with the old dataname retaining the original dREL. The old dataname will still have a well-defined value with the new Hub key, as otherwise there would have been ambiguity in the original definition.

In any case, this is in no way a criticism of H&S, just working out how to simplify the system.

all the best,
James.