Re: [ddlm-group] Further discussion of proposal #2
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Further discussion of proposal #2
- From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
- Date: Mon, 20 Jun 2016 21:47:57 +0000
- In-Reply-To: <CAM+dB2fu4OXVqGf1X=s+Q0dRD1ftzPUvQ2+iEG-V5rhZKASCow@mail.gmail.com>
- References: <CAM+dB2c4XhGDZQ7PBAHhUfmTXc7X7H2PboWBH3s1dapp0Gh_KQ@mail.gmail.com><BY2PR0401MB093685CBF929AF951C0626FCE0570@BY2PR0401MB0936.namprd04.prod.outlook.com><CAM+dB2fu4OXVqGf1X=s+Q0dRD1ftzPUvQ2+iEG-V5rhZKASCow@mail.gmail.com>
Dear All,

My apologies for the elements of review in what follows. Writing them helped me organize my thoughts, so I hope that reading them will help communicate those thoughts.

As Herbert reminds us, for just about any category that might appear in a data file, one can imagine an experiment, a construct, a model, etc. whose description requires multiple instances of that category. As James observes, however, many categories in our current dictionaries so rarely require such treatment that we have gotten along fine with the DDL1 and DDLm core dictionaries not, technically, permitting multiple instances of those categories to be presented in the same data file at all.

In mmCIF, on the other hand, substantially all categories are loopable in principle, with many of them associated together indirectly via the ENTRY category and its _entry.id attribute. Inasmuch as _entry.id "identifies the data block", however, that amounts to a distinction without much difference.

But mmCIF’s ENTRY category is nevertheless instructive. Formally, many categories defined as Sets in the DDLm core are associated with each other in mmCIF not by having global nature but by referring to the same ENTRY. This arrangement is similar to what is called a "star schema" in data warehousing: instead of a multitude of individual entities being global (which cannot generally be accommodated in a data warehouse) or all having direct relationships declared with a large number of other entities, they are instead all related to a single central entity; the relationships can be visualized as emanating in a star-like pattern from that central entity. In such a data warehouse, the central entity often represents a point in time; it constitutes the dimension along which all the other entities can jointly and concertedly vary.

So suppose we took the ENTRY idea from mmCIF, but allowed a block to contain multiple ENTRYs? As far as I can determine, that’s consistent with the machine-readable parts of the definitions of ENTRY and _entry.id anyway, though it seems inconsistent with their prose descriptions. In that way, a data file could be valid against mmCIF and nevertheless describe, say, multiple CELLs, without there being any ambiguity about which CELL went with which REFLNS. That’s similar to what we want to be able to do, but it doesn’t quite get us everywhere we want to go. The problem that we are grappling with can be viewed as how to deal with a situation wherein we want or need a different pattern of relationships between categories than the one described by the relationships with ENTRY.

James’s proposal #2 approaches the problem from a different angle. It acknowledges that there is more than one possible pattern of categories and relationships characterizing a data set, and it designates these as "schemas", which is indeed an apt term. It uses the category label 'Set' or maybe 'Global' (which I prefer for this purpose) to define a pattern of 1:1 relationships that serves as a functional substitute for mmCIF’s explicit relationships between ENTRY and other categories; it introduces a mechanism for declaring that a given data file in fact complies with a different schema than the default; and it provides a mechanism aimed at helping software determine whether and to what extent it can correctly interpret the file’s contents. At that high level, I don’t disagree with any of it, but we’ve gone several rounds over the details. Our main sticking point is related to how the relationships among categories should be described in dictionaries -- especially those that to date have been implicit in categories being defined as Sets.

Now suppose we combine the high-level idea of providing for multiple schemas with the mmCIF star schema structure. The DDLm core can model each distinct schema as a simple category and the hub of its own star schema, like mmCIF’s ENTRY. Existing categories can participate in more than one of these where appropriate, though initially there would be only one. Converting the existing DDLm core to this structure would involve creating one new key in each current Set category (mmCIF already has these keys), and possibly child keys in other categories. It does not necessarily affect existing data files at all, because we can define default values for the various keys. In this way, all needed keys can be explicitly defined, with a much more modest overall number of keys than if relationships were expressed directly among all categories, and consequently with much less impact when new categories are added.

This also provides a fairly clean way to deal with SPACE_GROUP, and with any future categories that present a similar problem. Whereas with categories such as CELL we could enforce the restriction of one CELL per hub instance by making CELL’s category key be a child key referencing the hub category, we could reverse that for SPACE_GROUP and any similar category: give the hub category a child key referencing SPACE_GROUP.

To wrap it all together and make it easier for software authors to deal with, we can add _audit.schema or something like it. One variation that occurs to me would be to have _audit_schema.name and _audit_schema.multiplicity, with the former taking as its values the names of schema hub categories, and the latter taking values from an enumerated set describing whether that category is present and if so, whether it is restricted to a single value. This would provide a fairly easy mechanism by which data files could advertise their structure to consumers, and for software to gauge whether they can handle the data.

Best regards,

John

From: ddlm-group [mailto:ddlm-group-bounces@iucr.org]
On Behalf Of James Hester

Dear John et al.

To summarise at the top, my principal objection to the 'default key' proposal is that it produces more complex dictionaries (more keys) with interactions that are initially surprising to a casual reader. Now in detail:

I think our goal here is to come up with semantics that can (i) replicate DDL1/DDL2 'global category' behaviour and (ii) allow these global categories to become multi-packeted, with simultaneous loss of 'globality'. 'Global' categories (what I have referred to previously as 'Set' categories) are just a tool for simplification of dictionaries, and so the more complex we make their operation, the less benefit they provide. Likewise, the mainstream behaviour of feature (i) should be as easy as possible to use.

Proposal #2 as it currently stands (the 'Set' proposal) envisaged that the 'globality' of a category would be removed when using datanames defined within a separate dictionary (mostly key datanames), and software should use the _audit.schema dataname and potentially _audit.conform to shield itself from the change in meaning that this entails.

The 'default keys' proposal that John has outlined instead envisages making almost all 'Set' categories into 'Loop' categories, defining keys for them, and giving those keys default values. John has suggested that this does not now involve a change in DDLm, because the semantics of having a default key are clear - the dataname can be left out if there is only one packet. However, a 'global' category with only one packet does *not* (currently) act like a 'Loop' category with only one packet, because (unlike a single-packet 'Loop' category) the values appearing as non-key datanames in the 'global' category may be assumed when interpreting values from all other datanames in all other loops. 'Global' categories really are different to 'Loop' categories for this reason, regardless of whether or not a key dataname is provided.

This difference between 'Global' and 'Loop' categories could be removed completely if all of the global category child keys were defined in parallel. In this case, the 'Global' category no longer acts 'Globally' but only in those categories for which a child key is defined. This 'simplification' comes at the expense of a whole lot of keys - in some categories, a key for every 'Set' category currently defined. At this point we have lost the practical simplification that we had obtained from 'Set' categories to start with.

So, either you accept a change in DDLm (additional consequences of a default key) and define the child keys at a future date in another dictionary, or you keep DDLm unchanged and include the child keys in the main dictionary immediately, throwing out the considerable simplification afforded by having global values. I would be against the latter option as it introduces a bunch of rarely-used key definitions into the main dictionary and is likely to be confusing to a casual programmer.

(We could of course alternatively adopt the blanket rule that values appearing in a single-packet loop act globally with identical 'disappearing key' behaviour. While this is true enough mathematically, it now becomes permissible to drop keys that have up until now been required even for single-packet loops and loops with foreign keys that point to those single-packet loops, and this would break current software. So I exclude this as an option, even if it is an elegant rule.)

So, given that we are stuck with two types of 'Loop' category, I would prefer communicating this clearly up front in the _definition.class tag, rather than relying on the presence of a default key value. What I think might communicate better than the current 'Set' definition, however, is a change from 'Set' to 'Global' (or 'Overall'), with a definition something like:

Global A special type of 'Loop' category. When single-valued, (i.e.
;

I'm not sure if this is more likely to meet with approval. I have added some more comments in John's email below.

On 18 June 2016 at 08:31, Bollinger, John C <John.Bollinger@stjude.org> wrote:
Very little DDLm software has been written, and mostly by those in this group. A lot of thought and negotiation (I believe) has gone into DDLm, so we should not be too cavalier with our changes. Now is the best time to make them rather
than later when we might hope for more widespread adoption.
See my comments at the top of the email. I have provided a new label and definition, which indicates that the category can be looped, and under what conditions multiple packets may be expected. Perhaps this is acceptable?
The reason for the special case 'Set' category is the considerable simplification it offers. We trade complexity of behaviour in one place for simplicity elsewhere. And we are ultimately stuck with it because of DDL1.
Absolutely, I wouldn't argue with this.
See my comments at the beginning for why I think there are more than just logical consequences going on here, i.e. there is global behaviour.
Your proposal on child keys would be interesting as I argue above that an explosion of child keys is a drawback and essentially removes the advantage gained by having global categories.
I would see no problem in a separate dictionary defining the key and default value for a 'Global' category as I've defined above. Essentially, conformance to this separate dictionary erases the 'Global' nature of the category and turns
it into a normal 'Loop' category with default key, so that datafiles created according to the original specification remain valid with the new dictionary - we have in fact elegantly expanded the ontology.
I agree with this - I'm not arguing that default key values are somehow bad or present problems, only that the 'global' behaviour is not captured.
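To make the default-key mechanics under discussion concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration only: the datanames `_cell.hub_id` and `_refln.hub_id`, the default value ".", and the representation of a category as a list of packet dictionaries are hypothetical and come from no existing dictionary. It shows that a legacy block which omits the keys still resolves, and also that the link exists only where a child key has been defined, which is the limitation on 'global' behaviour noted above.

```python
# Hypothetical default value shared by the hub key and its child keys.
DEFAULT_KEY = "."


def with_default_key(packets, key_name, default=DEFAULT_KEY):
    """Return copies of the packets, filling in key_name where it was omitted."""
    # A value already present in the packet overrides the default.
    return [{key_name: default, **packet} for packet in packets]


def cell_for_each_refln(cells, reflns):
    """Pair every REFLN packet with the CELL packet that shares its hub key.

    '_cell.hub_id' and '_refln.hub_id' are invented child keys that both
    point at a hub category in the star layout.
    """
    keyed_cells = with_default_key(cells, "_cell.hub_id")
    cells_by_hub = {cell["_cell.hub_id"]: cell for cell in keyed_cells}
    keyed_reflns = with_default_key(reflns, "_refln.hub_id")
    return [(refln, cells_by_hub[refln["_refln.hub_id"]]) for refln in keyed_reflns]


# Legacy-style block: no keys given, so every packet takes the default key.
legacy = cell_for_each_refln(
    [{"_cell.length_a": 5.4}],
    [{"_refln.index_h": 1}, {"_refln.index_h": 2}],
)

# Multi-hub block: keys are explicit, so each REFLN finds its own CELL.
multi = cell_for_each_refln(
    [{"_cell.hub_id": "a", "_cell.length_a": 5.4},
     {"_cell.hub_id": "b", "_cell.length_a": 7.1}],
    [{"_refln.hub_id": "b", "_refln.index_h": 1}],
)
```

A category with no child key pointing at the hub would not appear in any such lookup, so its relationship to CELL would have to come from somewhere other than the keys.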
OK, point taken, I did say my objection wasn't critical.
That is where we started - if we are to allow datanames to be used with global meaning and in multi-packet loops, then we are talking about different meanings, and only something like _audit.schema can insulate
software from that. If we are to exclude changes in meaning, we have to define all child keys up front and then we need _audit.schema even more than before, as _audit.conform won't help. In the 'all child keys defined up front' scenario, we completely abandon
global categories and _audit.schema becomes the signal as to when a datablock can be interpreted as for the old cif_core.
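As a rough illustration of how software might use such a signal, the Python sketch below follows the _audit_schema.name / _audit_schema.multiplicity variation John floated. The enumerated values ("Absent", "Single", "Multiple"), the hub names, and the rule that an undeclared hub is treated as absent are all assumptions made for the example, not anything that has been agreed.

```python
# Hypothetical: the multiplicities this program can cope with, per schema hub.
SUPPORTED = {"ENTRY": {"Absent", "Single"}}


def can_interpret(audit_schema_packets, supported=SUPPORTED):
    """Return True if every declared hub has a multiplicity we can handle.

    Each packet is a dict holding '_audit_schema.name' and
    '_audit_schema.multiplicity'. A hub this program has never heard of
    is acceptable only when it is declared 'Absent'.
    """
    for packet in audit_schema_packets:
        name = packet["_audit_schema.name"]
        multiplicity = packet["_audit_schema.multiplicity"]
        if multiplicity not in supported.get(name, {"Absent"}):
            return False
    return True


# A block declaring one ENTRY can be read by this (hypothetical) program.
old_style = [
    {"_audit_schema.name": "ENTRY", "_audit_schema.multiplicity": "Single"},
]

# A block declaring several instances of an unfamiliar hub cannot.
new_style = [
    {"_audit_schema.name": "ENTRY", "_audit_schema.multiplicity": "Single"},
    {"_audit_schema.name": "TWIN", "_audit_schema.multiplicity": "Multiple"},
]

readable_old = can_interpret(old_style)
readable_new = can_interpret(new_style)
```

Treating an empty declaration as readable amounts to assuming old cif_core behaviour when nothing is declared, which is itself a design choice that would need to be agreed.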
It could indeed sway me as I think this is at the core of my objection. If we could effectively define all the child keys, while at the same time keeping the key definitions from swamping out the meat of the dictionary, and allow for the
appearance of future 'used to be global' categories like twinning and variants adding their own child keys, then it would be worth serious thought. I'm pretty sure dREL can be brought along with whatever variation you propose.
Our policy of keeping definitions stable is not an end in itself, but a logical requirement born of the need to guarantee that software that is already written remains valid. If everybody is using unlooped SPACE_GROUP to read and write
structures I don't see any issue in fiddling with the meaning, as long as any changes are consistent with that expectation of an unlooped value.