
Re: [ddlm-group] Discussion of hub-spoke proposal

  • To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
  • Subject: Re: [ddlm-group] Discussion of hub-spoke proposal
  • From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
  • Date: Fri, 1 Jul 2016 16:29:40 +0000
  • In-Reply-To: <CAM+dB2eq=Kn9V=y+11C=Pg4pyL00QMOMFqtRmKhA+Lsvy3QjLw@mail.gmail.com>
  • References: <CAM+dB2eq=Kn9V=y+11C=Pg4pyL00QMOMFqtRmKhA+Lsvy3QjLw@mail.gmail.com>
Dear James and others,

I had some trouble determining how to respond to your comments of June 28, for some of them present valid points about hub & spokes, and others seem only slightly off, but some seem so far off the mark that initially I did not understand where they came from.  I decided I must have been only partially effective at communicating the idea, but I had to read the whole e-mail several times to form a plausible hypothesis about where communication broke down.

Here is where I think the problem lies: although you have understood the mechanics reasonably well, your comments suppose that H&S would provide an *artificial* skeleton with the exclusive purpose of introducing indirection of category relationships.  This is not at all the intention.  A hub category represents a natural entity in our ontology, albeit perhaps a higher-level one than those we usually work with.  The relationships between a hub category and other categories have specific meaning characteristic of the nature of that particular hub category. In this sense, use of a particular hub category is roughly analogous to declaring a particular value of the proposed _audit.schema item.

For example, the initial hub category for core CIF would need to represent an overall crystal structure, in the same sense that we presently rely on a whole data block to provide.  It would make the implicit category relationships explicit, very much as mmCIF's ENTRY does.  Thus, that a CELL_LENGTH row is associated with the same ENTRY key as a particular ATOM_SITE means not just that the one generally affects the other, but specifically that the given CELL_LENGTH row presents the unit cell lengths to which the ATOM_SITE row's fractional coordinates are referred.
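
To make the ENTRY example concrete, here is a minimal Python sketch.  It is only an illustration: the row layout, the entry_id item name, and the cell_for helper are my assumptions, not anything defined in a dictionary.

```python
# Hypothetical in-memory rows; category and item names are illustrative only.
cell_length = [
    {"entry_id": "s1", "a": 5.0},
    {"entry_id": "s2", "a": 10.0},
]
atom_site = [
    {"entry_id": "s1", "label": "C1", "fract_x": 0.5},
    {"entry_id": "s2", "label": "C1", "fract_x": 0.5},
]

def cell_for(site, cells):
    """Return the single cell_length row that shares the site's hub key."""
    matches = [c for c in cells if c["entry_id"] == site["entry_id"]]
    if len(matches) != 1:
        raise ValueError("expected exactly one cell row per hub key")
    return matches[0]

# Distance along the a axis for each site, using the cell its coordinates refer to.
along_a = [site["fract_x"] * cell_for(site, cell_length)["a"] for site in atom_site]
```

The same fractional coordinate yields a different length for each site because the shared hub key, not mere co-occurrence in the file, selects the cell.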

Names matter.  I perhaps harmed my earlier exposition of the semantic aspect of the H&S idea by choosing to follow mmCIF with respect to the example name of the initial hub category.  Indeed, I waffled over that decision before finally coming down on the side of mmCIF's ENTRY.  The main alternative I considered was STRUCTURE, whose linguistic relationship with the proposed _audit.schema value 'structural' is not accidental.

I present further comments and responses to specific points in-line below.


On Tuesday, June 28, 2016 8:45 PM, James Hester wrote:

> The core points as I understand them:
> (1) A 'Hub' category is defined with a default key


Clarification: a 'Hub' is a *kind* of category, not a single specific category.  Core CIF needs only one at the moment (or two, if we counted space_group as one of them), but we might need others in the future, or in other dictionaries.


> (2) Every 'Set' category is given a single key dataname that is a child of a 'Hub' key


Not necessarily every current Set, only those for which it is appropriate.  That would be most, but not necessarily all.


> (3) 'Loop' categories are given a single additional key dataname that is also a child of the 'Hub' key


Those loop categories for which it is appropriate -- which are not necessarily all of them -- would get an additional key dataname that is also a child of the appropriate hub category's key.

Although the term "Hub" (or "Star") focuses on the structural role such categories would play, it is important to keep in mind the significant semantic role these categories play.  A Hub category organizes instances of other categories into a consistent assembly with a specific meaning.  Implicit or explicit use of such a category not only shows which instances of which categories are related to each other; it also communicates the nature of those relationships and what the overall assembly means.

Additionally, I prefer not to characterize the proposal in terms of Set and Loop categories, as those designations, especially the former, are tightly coupled to the existing relational structure, and that coupling is exactly the problem we are trying to solve.  How we should apply those designations is a separate and controversial issue that I think would be better shelved until after we decide what to do with the Hub and Spokes proposal, because the latter may bear on the former.

There are categories in our dictionaries that I don't think we currently contemplate affording multiple instances per block, such as PUBL and especially AUDIT.  From a relational perspective, we should indeed view these categories as having keys, but those (implicit) keys can continue to be the trivial one consisting of zero attributes.  Such categories do not need child keys of a hub category, or at least not of the initial hub category I have proposed.


> This works as follows:
> - Datafiles produced according to our current dictionaries are valid as all of the keys described above simply take default values and may be left out of datafiles. In particular, in the default scenario the 'Hub' key can only be single-valued and this constrains all 'Set' categories to be single-valued as their 'Hub' child keys can then only take a single value.


Yes.
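
A rough sketch of how a reader might apply such a default, assuming a hypothetical hub child key named entry_id and a hypothetical default value "default" (neither is taken from any dictionary):

```python
# Hypothetical names: neither "entry_id" nor the value "default" comes from a dictionary.
DEFAULT_HUB_KEY = "default"

def with_default_hub_key(rows, child_key="entry_id"):
    """Return copies of rows in which a missing hub child key takes the default."""
    return [{child_key: DEFAULT_HUB_KEY, **row} for row in rows]

# A legacy-style category: no hub child key is presented at all.
legacy_cell = [{"length_a": 5.0}]
# A new-style category: the hub child key is explicit.
explicit_cell = [
    {"entry_id": "s1", "length_a": 5.0},
    {"entry_id": "s2", "length_a": 10.0},
]

legacy_keys = {row["entry_id"] for row in with_default_hub_key(legacy_cell)}
explicit_keys = {row["entry_id"] for row in with_default_hub_key(explicit_cell)}
```

In the legacy case every row ends up on the one default hub value, which is what keeps such categories single-valued.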


> - where a 'Set' category becomes looped:
> (i)  the 'Hub' child key in that 'Set' necessarily becomes multi-valued, meaning that the 'Hub' category must also now be present in the datafile listing the values of the 'Hub' keys.  When packets in different 'Set' categories have a one-to-one correspondence, the same 'Hub' child key value can be used to show this, otherwise distinct 'Hub' key values are necessary


I think the view presented fails to appreciate the nature of the relationship between the hub and the other category, but yes, to present in the same data block multiple instances of a category whose key consists solely of a child key, it would be necessary to also explicitly present the instances of the (hub) category that carry the corresponding parent keys.  The parent keys may be the only attributes explicitly presented in that category.

On the other hand, "becomes looped" suggests a dictionary change, and indeed if we wanted to allow a given category to take multiple values along some different dimension than we already specifically accommodate, then we would need to define a way to represent variation along that dimension.  We already accommodate variation along the dimensions of any hub categories referenced, but if none of those correspond to the dimension along which we want to vary, then we would need to add another key.


>  (ii) those Loop categories that are affected by a multi-valued Set category have 'Hub' child keys set to values that match the appropriate packet of that 'Set' category.


Basically yes, but there are really two cases here: where the traditionally multi-valued category, C, has a child key referencing a non-hub category, and where it does not.  In the former case, the parent category will necessarily have also had its key expanded to include a child key referencing the hub, and C's key must be expanded as a consequence.  In the latter case, C gets a child key directly referencing the appropriate Hub category because that is the natural expression of C being "affected" by other categories associated with that hub.

Also, although I accept "affected" as a generic term for what we're talking about, each relationship described that way has its own specific semantics, defined in a dictionary.


> If I have misunderstood John's proposal I trust he will correct me.
>
> Given the above understanding the most immediate problem is how we would deal with a Loop category that relies on global values from multiple 'Set' categories.  If the particular packets in those 'Set' categories have different 'Hub' key values for even one of the loop category packets, it becomes impossible to specify a 'Hub' key for the Loop packet.


The situation you describe is not a problem to be overcome, but rather a natural and desirable outcome.  It is the CIF version of referential integrity.

Again, it is important to understand that a relationship between a hub category and another category has specific semantics.  If there is no way to assign keys that express a particular relationship then that is because our dictionary does not contemplate or provide for expressing that relationship.  I apologize for speaking so abstractly here, but the objection is also abstract.
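
As a small illustration of what I mean by referential integrity, a validator could flag rows whose child key matches no hub row.  The names below (entry, refln, entry_id, unresolved_rows) are assumptions chosen for the sketch.

```python
def unresolved_rows(hub_rows, spoke_rows, hub_key="id", child_key="entry_id"):
    """Return the spoke rows whose child key matches no hub row."""
    known = {row[hub_key] for row in hub_rows}
    return [row for row in spoke_rows if row[child_key] not in known]

# Hypothetical data: two hub rows and two reflection rows.
entry = [{"id": "s1"}, {"id": "s2"}]
refln = [
    {"entry_id": "s1", "index_h": 1},
    {"entry_id": "s3", "index_h": 2},  # refers to a hub row that is not present
]

bad = unresolved_rows(entry, refln)
```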


>  For concreteness, suppose we have multiple twins and multiple space groups, with no necessary correspondence between the twin individual and the particular space group (e.g. two different settings are presented for every twin individual).  Values in the 'refln' category will depend on both the twin individual chosen and the space group.  There is no unique 'Hub' key and so we cannot cover this situation.


Why would there be no unique Hub key?  That's completely under our control.

But this does get back again to the semantics defined for hub categories.  A hub category that represents the kind of data that a structural CIF now presents would not necessarily be suitable for representing the kind of twin data imagined.  There are two general approaches to handling that: modify the hub to support it, or define a new hub.  The latter may be more conveniently accomplished if the new hub is made to use the other hub instead of completely replacing it -- in particular, that might minimize the number of existing categories that require a child key referencing the new hub directly.


> I think I understand the 'Other' example category in John's email to be providing, in this case, a separate 'Hub' which would solve this particular situation.  Extrapolating, however, this means that there would be a new 'Hub' category for every Loop category that depends on a unique combination of global values from more than one category, which is unwieldy and leads to further multiplication of keys, this time in a more complex scheme, so doesn't appear to have won us anything.


I think you're looking at it the wrong way around.  The need for hub categories and their design is driven by what kinds of data we want to present.  Category keys *follow* from that; they don't drive it.  Additionally, it is possible to establish relationships between one hub category and another, and in that way to reduce or even eliminate the need to add keys to other categories.  In any case, even if we did end up with several hub categories and with categories having several child keys for different hubs, I do not accept that the hub categories would ever constitute more than a small fraction of categories overall.  In that case we still have a big key-reduction win: with H hubs and C non-hub categories, there would be on the order of C*H total keys, as opposed to C*C.
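
The arithmetic behind that estimate is simple.  The sketch below uses made-up counts and treats both figures as rough upper bounds.

```python
def key_counts(non_hub_categories, hubs):
    """Rough upper bounds on the number of child keys under each scheme."""
    hub_and_spoke = non_hub_categories * hubs            # each category keyed to each hub
    pairwise = non_hub_categories * non_hub_categories   # each category keyed to each other
    return hub_and_spoke, pairwise

# Illustrative numbers only; they are not counts from any real dictionary.
hs, pw = key_counts(non_hub_categories=100, hubs=3)
```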


>  I suggest that we can be more economical in the hub and spoke paradigm as follows, which I think is how John envisaged the SPACE_GROUP category working:


Note that SPACE_GROUP is a special case, for the relationship between the proposed hub and SPACE_GROUP has multiplicity n:1, with the proposed hub category on the *n* side.  (That is, any number of structures may have the same space group.)  In a relational model, that requires either a child key on the n (hub) side, or a new category modeling the association between hub and SPACE_GROUP.  I suggested the former.
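
A sketch of the n:1 arrangement with the child key on the hub side follows.  The hub name "structure", the key names, and the multiplicity values are all hypothetical.

```python
# Hypothetical rows: several hub (structure) rows may reference one space group.
space_group = {
    "sg_a": {"multiplicity": 4},
    "sg_b": {"multiplicity": 8},
}
structure = [
    {"id": "s1", "space_group_id": "sg_a"},
    {"id": "s2", "space_group_id": "sg_a"},  # same space group as s1
    {"id": "s3", "space_group_id": "sg_b"},
]

def space_group_of(structure_id):
    """Follow the hub's child key to the one space group it references."""
    for row in structure:
        if row["id"] == structure_id:
            return space_group[row["space_group_id"]]
    raise KeyError(structure_id)

multiplicities = [space_group_of(s["id"])["multiplicity"] for s in structure]
```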


> (1) 'Set' categories are given their own default-valued key.
> (2) A 'Hub' category is defined with a single dataname acting as the key, and all other datanames in this 'Hub' category are child keys of the Set category keys defined in (1).
> (3) Loop categories are given a single additional, default-valued key dataname which points to the 'Hub' key (same as (3) in the above scheme)
>
> This scheme then works as follows:
>
> - Datafiles produced according to our current dictionaries are valid as all of the keys described above simply take default values and may be left out of datafiles
> - A datafile which introduces multiple values for a 'Set' category:
>  (i) lists those multiple values in that 'Set' category using the 'Set' key created in point 1 above
>  (ii) provides a 'Hub' key value for all those combinations of its 'Set' key values with other non-default 'Set' key values that are used by Loop categories, listing these key values in the 'Hub' category loop
>  (iii) when creating values in a 'Loop' category affected by any of the newly-looped 'Set' categories, sets the Loop's Hub child key defined at point (3) above to point to the row of the 'Hub' category corresponding to the particular values of the 'Set' categories that are relevant to the current Loop category packet.


Yes, where there is a 1:1 relationship between a hub category and another category, it is possible to put the child key on either side.  In fact, I considered the approach you describe, but instead proposed what I did because

1) I do not anticipate the hypothetical key proliferation scenario you described playing out as a significant real-world problem.

2) It is more consistent if all non-hub categories carry a child key referencing the appropriate hub(s) than if the child key is on the hub for 1:1 relationships but on the non-hub for 1:n relationships.  Such consistency has both conceptual and practical advantages.

3) If it is the non-hub category that has the child key, then we can defer any decision about an analog of SQL's 'UNIQUE' constraint by making the child key the category key of erstwhile single-valued categories.  On the other hand, if it is the hub that has the child keys then we immediately need a means to define that no two instances of a given hub category bear the same child key for another category.

4) Putting the keys in the hub requires hub categories' definitions to be updated when ordinary categories are added (which I predict will be much more frequent than the reverse).

5) It is consistent with mmCIF.


The other alternative does have merits, however.


> In terms of how this would be used in e.g. a conversion of a fractional coordinate to Cartesian coordinates, the software (or dREL) would use the hub child key in 'atom_site' to find an entry in the 'Hub' category.  This entry contains the 'Set' category key values relevant to that particular atom_site row, and these are then used to obtain the values that are necessary - so the hub.space_group_id key indexes into the space group category, and the hub.cell_id indexes into the cell parameters.   One drawback of this scheme is that a value must be provided in each 'Hub' category loop packet for every 'Set' category key, regardless of whether the loop that uses that particular 'Hub' value has any dependence on that 'Set' category.  Because of this, the default value must have a special notation so that software can understand when a particular key is irrelevant to a particular loop - 'dot' would suit here.


I think what you're describing with your drawback scenario is that if a hub category is presented explicitly, so that some of its attributes can take multiple values, then some default values for its attributes end up being harmful instead of helpful.  When a hub category takes multiple values, those child keys that must not be duplicated between hub instances must all have values presented explicitly, lest the default values be duplicated.

I hadn't considered that issue before, but it is very serious if we contemplate adding categories associated with existing hubs.  If in conjunction with doing so we give the hub category a child key that must take unique values, then any existing CIFs that present multiple values for that hub will be invalid with respect to the revised dictionary.  This turns out to be another reason to favor putting the keys on the other side.  Note, however, that adding a new category does not inherently require associating it with any particular hub category.
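
The failure mode can be sketched as follows, assuming a hypothetical new child key new_cat_id with a hypothetical default value "default" and a uniqueness requirement on that key:

```python
from collections import Counter

NEW_CHILD_KEY = "new_cat_id"  # hypothetical child key added by a dictionary revision
DEFAULT = "default"           # hypothetical default value for that key

def duplicated_values(hub_rows, child_key, default):
    """Return child key values carried by more than one hub row, applying the default."""
    counts = Counter(row.get(child_key, default) for row in hub_rows)
    return sorted(value for value, n in counts.items() if n > 1)

# An existing file: two hub rows written before the new child key existed.
existing_hub = [{"id": "s1"}, {"id": "s2"}]
clash = duplicated_values(existing_hub, NEW_CHILD_KEY, DEFAULT)

# The same hub with explicit, distinct values passes the check.
revised_hub = [{"id": "s1", "new_cat_id": "n1"}, {"id": "s2", "new_cat_id": "n2"}]
no_clash = duplicated_values(revised_hub, NEW_CHILD_KEY, DEFAULT)
```

Both existing rows silently acquire the same default, so the uniqueness requirement fails for a file that was valid before the revision.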


> Some comments on both these variations:
> Datafiles
> =======
> (1) These proposals meet the criterion of ensuring that current datafiles remain valid


Agreed.


> (2) The presence of a multi-packet 'Hub' category fulfills the same role as _audit.schema in protecting software from misinterpretation. To a workable approximation, a simple text search for the hub category master key dataname would be sufficient to distinguish old-style files from new-style files.


Agreed.  And I already stipulated that a direct analog of _audit.schema could be provided in conjunction with an H&S-based dictionary revision if we want.


> Dictionaries
> =========
> (3) It is notable in the schemes (as I have interpreted them) that we are unable to specify which 'Set' categories influence which 'Loop' categories, as we are simply providing a Hub key and a Hub category that contains all defined 'Set' category keys.  The 'Set' - 'Loop' link is up to the datafile writer.  I do not feel comfortable that CIF software authors will come to identical conclusions on how to model the relatively complex situations we are talking about here: indeed, the job of the dictionaries is to describe usage in sufficient detail that all conforming CIF writers and readers agree on interpretation.


This is where I first started to get confused by the response.  I hope the clarification I presented at the beginning makes the following clear as well.

Regardless of whether a category is designated a 'Set' or a 'Loop', it has relationships directly and indirectly with other categories as defined by category keys and related child keys.  These define what is related to what, both at the dictionary level and at the data level -- that's their purpose, and nothing about that changes with H&S.  Where we intend to provide for more complex relationship patterns than we already do, it is our responsibility to define hub categories appropriately.  And here again I return to the fact that hub categories provide not just structure, but semantics.  Regardless of whether we put child keys on the hub side or the opposite side, our choice of hub category or categories conveys the overall meaning of the data block.  And certainly these details would be expressed in dictionaries.


> dREL
> ====
> (4) It is highly desirable that a dREL routine does not change each time another 'Set' category becomes looped, as it would require new routines to be written for each combination of looped 'Set' categories.


I agree that it is desirable for dREL routines to be invariant under dictionary changes.  I am not seeing why dREL routines would require changes in the event that an unrelated category changes, regardless of the nature of the change, under any proposal now on the table.  On the other hand, I am not seeing how dREL routines could be expected to _not_ require changes in the event that the key structure of a _related_ category changes in any of several ways.  I am certainly not seeing why a new version of each routine would be required for every distinct overall pattern of category relationships, but even if it were, I am missing how that would distinguish between current proposals.


>   Under the current proposals, for us to preserve a given dREL method unchanged, dREL would have to give 'hub' keys and 'hub' categories special treatment, so that anytime a 'Set' category was accessed in dREL, the 'Hub' key value for the current packet is used to index into the appropriate packet of the 'Set' category.   We should therefore create a new class of Category, (e.g. 'Hub') specifying this behaviour.


I am not necessarily against defining a new class of category, but I am not convinced that doing so would be required.  Certainly, however, dREL implementations would need to be modified.  In particular, if we want the dREL routines we already have to work unchanged, then dREL needs an ability to recognize and implicitly traverse relationships, at least where they are unambiguous.  Perhaps that would be easier, clearer, or cleaner if it can be restricted to relationships with specially-marked categories (i.e. hubs), but it would be more flexible and powerful if it worked without such restrictions.  I'm not presently seeing what prevents it from doing so, but I don't deny that there may be something that does.


> Note that this logic cannot be explicitly laid out in dREL methods (even if we wanted to) because we cannot know ahead of time what other Set categories might appear.  For example, in the 'Variant' scenario, when doing atom_site calculations we just want to pick the set of atom_sites corresponding to our current unit cell variant, but when we wrote the dREL method 'variants' did not exist.  I have added an appendix with further analysis of this.


If we change the way that unit cells are associated with a collection of structural data so that there can be several multiplexed variants, then it is unsurprising to need to change some dREL routines correspondingly.  dREL cannot be expected to withstand every imaginable change to the relational structure of the data on which it operates -- and that would be a good reason to avoid performing such a change on the primary unit cell of a structure.  If we wanted to provide for associating additional, secondary cells, however, then it might be sensible to provide for those (only) via variants.  There is no essential incompatibility between variants and H&S.

We can absolutely write dREL methods that explicitly traverse hub categories if we wish to do so (because the relationships traversed have well-defined significance), and if we have designed our data well then we can expect that such methods will rarely, if ever, require changes.  Certainly adding a new category does not inherently require any methods to change, as in itself it has no effect on keys of other categories, or on the nature of relationships between other categories in general.


> In summary, the present proposal requires the creation of a new category class (e.g. 'Hub'), but DDLm is otherwise unaffected.  dREL semantics for the 'dot' operator need to be changed, but that is also true of Proposal  #2.  My key objection is that we are ceding the modelling of complex cases to individual software authors, thus risking a failure of the standard to ensure correct communication.


I did not initially understand this objection.  Now I think it is based on a misunderstanding of the H&S idea.  H&S in no way cedes any modelling responsibility to individual software authors.  It is a pattern for how _we_ structure _dictionaries_.  All of the modelling is the responsibility of dictionary authors and maintainers, not of individual software authors.


> This is almost (but not quite) saying 'as long as _audit.schema is non-standard, loop whatever you want and figure it out between yourselves'.


No, it isn't anything like that.  A reasonably decent implementation of the hub and spokes pattern establishes specific, well-defined, meaningful relationships between categories.  Those are defined in our dictionaries, for all software authors to rely upon.  There is no reason to suppose that we would be unable to create and maintain "reasonably decent" dictionaries in this sense.  Once we have done so, software authors would not have to sort out relationship semantics for themselves.


> So I remain in favour of proposal #2 with 'Global' replacing 'Set'.
> James.
>
> Appendix: How dREL should work under proposals #2 and 'Hub and spoke'
> ============================================================
> Consider the following piece of dREL code from the current draft cif_core dictionary for calculating site multiplicity. This code contains a loop over a different category, as well as accessing a "Set" value (space_group.multiplicity), so it is likely to be sensitive to our decisions.
>
>      With  a  as  atom_site
>
>         mul  =   0
>         xyz  =   a.fract_xyz
>
>         Loop  s  as  space_group_symop  {
>
>              sxyz  =   s.R * xyz + s.T
>              diff  =   Mod( 99.5 + xyz - sxyz, 1.0) - 0.5
>
>              If ( Norm ( diff ) < 0.1 ) mul +=  1
>         }
>      _atom_site.site_symmetry_multiplicity =  _space_group.multiplicity / mul
>
> Suppose now that we have multiple space groups *and* multiple variants, so that the key for _atom_site consists of the label and either (#2) the variant and the space group or (H&S) the hub key.  Suppose also that the category space_group_symop has only two keys: the symop number and either (#2) the space group or (H&S) the hub key.  First, under current dREL rules this code will probably fail miserably, as it will apply symmetry operators from both space groups and then attempt to access a unique value for _space_group.multiplicity.


You can't just say "we have multiple space groups" or "we have multiple variants."  Everything depends on how those are associated with the rest of the data and with each other, which depends in part on *why* there are multiple space groups and multiple variants.  All that is bound up in the definitions of the categories, hub or otherwise, that make it possible.  There are many ways that could be accomplished with either hub and spokes or with Prop 2.  It is unreasonable to assume that the one chosen by the dictionary maintainers would make the data ambiguous or make common operations difficult to perform.  Although I can believe that the given method, running against the current dictionary, could go badly wrong, it is not clear that every dictionary that accommodates such an arrangement of data necessarily affords the possibility that it will go wrong.

With respect to SPACE_GROUP_SYMOP, I would not suppose that it has a hub key as part of its key.  I see no reason to structure SPACE_GROUP and SPACE_GROUP_SYMOP any differently than they are structured in symCIF and the DDL1 core, and I maintain my previous assertion that they should be structured the same in the DDLm core.  Since that gives SPACE_GROUP a surrogate key for SPACE_GROUP_SYMOP to reference, and since I see no reason why that key would need to be expanded, SPACE_GROUP_SYMOP would not need any additional keys.


> So I propose that we adopt the following simple dREL rules, which are really just deconstructing the category structure back to the original schema:
>
> (i) any loops over categories filter those categories on the current values of "sibling" keys (after hub category examination for H&S).
> (ii) any access to 'Set' categories implicitly uses the current value of any related (sibling or parent) keys (after hub category examination for H&S)


Maybe this is equivalent, but the basic rule that I advocate is that references to other categories always follow the defined relationships, with the semantics of a relational JOIN operation.  Thus, ' Loop  s  as  space_group_symop' would mean a loop over those space_group_symop rows that are related to the atom_site row on which the method is invoked.  That seems natural for dREL.
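
In Python terms, the JOIN-like behaviour I have in mind might look like the sketch below.  The hub name "structure", the key names, and the related_symops helper are my own assumptions; this is not a description of any existing dREL engine.

```python
# Hypothetical rows and key names, for illustration only.
structure = {
    "s1": {"space_group_id": "sg_a"},
    "s2": {"space_group_id": "sg_b"},
}
space_group_symop = [
    {"space_group_id": "sg_a", "id": 1},
    {"space_group_id": "sg_a", "id": 2},
    {"space_group_id": "sg_b", "id": 1},
]

def related_symops(site):
    """Follow atom_site -> structure (hub) -> space_group -> space_group_symop."""
    sg_id = structure[site["structure_id"]]["space_group_id"]
    return [s for s in space_group_symop if s["space_group_id"] == sg_id]

site = {"structure_id": "s1", "label": "C1"}
symop_ids = [s["id"] for s in related_symops(site)]
```

The loop body would then see only the symmetry operations belonging to the space group of the structure that the current atom_site row is part of.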

It gets more complicated if there are multiple paths along which the two categories are related, however.  Generally speaking, there is a specific chain of relationships that is appropriate for any particular purpose.  Discovering such chains is a graph path-finding problem.  We can provide heuristics similar to the rules you suggest where these help choose the correct path, but I am not confident that we can provide rules that can be relied upon always to work.  The longer the relationship chain, the better an idea it is for dREL code to express it specifically.  Which it totally can do.
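
To illustrate the path-finding view, the following sketch enumerates the simple paths between two categories in an undirected relationship graph; more than one path signals that an implicit traversal would be ambiguous.  The relationship list, including the "variant" category, is hypothetical.

```python
def build_graph(relationships):
    """Build an undirected adjacency map from (category, category) pairs."""
    graph = {}
    for a, b in relationships:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def simple_paths(graph, start, goal, path=None):
    """Return every path from start to goal that visits no category twice."""
    path = (path or []) + [start]
    if start == goal:
        return [path]
    found = []
    for nxt in sorted(graph.get(start, ())):
        if nxt not in path:
            found.extend(simple_paths(graph, nxt, goal, path))
    return found

# Hypothetical relationships between categories.
base = [
    ("atom_site", "structure"),
    ("structure", "space_group"),
    ("space_group", "space_group_symop"),
]
extra = [("atom_site", "variant"), ("variant", "space_group_symop")]

unambiguous = simple_paths(build_graph(base), "atom_site", "space_group_symop")
ambiguous = simple_paths(build_graph(base + extra), "atom_site", "space_group_symop")
```

With only the base relationships there is a single chain to follow; once a second route exists, either a heuristic or an explicit traversal in the method has to choose between them.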


> The above dREL executes as follows:
> (1) the code is executed for each packet in _atom_site, so, at execution, all key values are defined. For the H&S proposal, the dREL engine indexes into the hub category using the hub child key and sets all Set category child keys to the given values: these are then placed in scope to recreate the situation of proposal #2.
> (2) space_group_symop shares a sibling key with atom_site: both have child keys of space_group. Therefore, under rule (i) the 'Loop' only handles those packets of space_group_symop that have the same value of space_group_id as the current _atom_site packet. 'Mul' is calculated identically to the single space group case (i.e. correctly)
> (3) the reference to space_group.multiplicity uses the row of space_group indexed by the only related key, space_group_id, for the current site packet


This is essentially the same as my "follow relationships" overall rule requires, though I'm focusing on a slightly higher level: the defined relationships between categories, rather than the details of how keys are allocated and assigned to embody those relationships.  One thing that follows from that higher-level view is that dREL should be able to traverse every relationship in both directions, regardless of which side bears the parent key and which the child key.


> If we decide that we have 'variants' for space_group (and therefore necessarily space_group_symop) as well, then the above loop and dereference would additionally restrict their selection using the value of the variant key.


And this is where heuristics come in.  I think such heuristics can carry us a long way, and I am not opposed to defining and implementing them, but I think it's ultimately safer for dREL code to traverse the intended relationships explicitly.


John

--
John C. Bollinger, Ph.D.
Computing and X-Ray Scientist
Department of Structural Biology
St. Jude Children's Research Hospital
John.Bollinger@StJude.org
(901) 595-3166 [office]
www.stjude.org




