Discussion List Archives

Re: [ddlm-group] Multi block principles

Dear DDLm Group,

Deferring for the moment any response to the technical proposal on the table, I agree with James that a GitHub discussion thread would provide a more convenient venue to carry on the conversation.

John

-----Original Message-----
From: James H <jamesrhester@gmail.com>
Sent: Sunday, December 5, 2021 11:34 PM
To: Bollinger, John C <John.Bollinger@STJUDE.ORG>
Cc: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] Multi block principles


Dear DDLm group,

Here at the top I'll provide an updated, renamed "dictionary driven" proposal, with comments afterwards and inline as well. I note that it is probably quite hard to follow progress; I would be happy to move this discussion to GitHub and communicate the result here if others agree.

Dictionary Driven Proposal for Data Block Aggregation into a Dataset
=====================================================================

(i) Multiple data blocks are aggregated into a data set by being presented in the same data container (e.g. file, zip archive, directory).
(ii) An aggregate is "valid" against a relational schema if the data contained can, in principle, be assigned unambiguously and consistently to cells in that relational schema.

The remaining principles apply only to data items for which a mapping to a CIF-dictionary-defined data name is available.

(iii) Dictionary conformance of individual data objects is specified either by audit_conform data names in each block, or as-yet-undefined dataset-level audit_dataset data names in any block. Dataset-level dictionaries override block-level dictionaries.
(iv) Appearance: A category "appears" in a data block if any data name belonging to it is either present, or referred to via `_name.linked_item_id` of a data name that is present.
(v) If a data name appearing in a data block is not defined in the dictionary to which that data block conforms, it is considered absent for the purposes of these principles.
(vi) Compatibility: Two dictionaries describing data blocks are compatible if, for all categories that appear in those data blocks, the dictionaries prescribe the same set of key data names.
(vii) Data blocks are compatible if either (1) their dictionaries are compatible or (2) a unique value can be determined or assigned for any key data names that are absent from one of the incompatible dictionaries.
(viii) A "Set" category is a "Loop" category for which only one row may be presented in a data block.
(ix) The set of "Set" categories for a data block is determined by the dictionary to which that data block conforms.
(x) A "Set" category may only be aggregated from multiple data blocks if at least one key data name for the "Set" category has been provided in the dictionary to which those data blocks conform.
(xi) Where a "Set" category has been provided with a key data name in the dictionary as per (x), and that data name is not itself a child data name of some other data name, all child data names must also be provided in the dictionary.
(xii) The value of "Set" category key data names for a given data block may be assigned arbitrarily if missing
(xiii) Values for child data names of "Set" category key data names may always be elided.
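
As an illustration only, principles (iv) to (vi) could be sketched in Python roughly as follows. The `Dictionary` class, its `items` and `keys` fields, and both function names are hypothetical simplifications and not part of any CIF library. The sketch also assumes that a category unknown to one dictionary has an empty key set there, which the principles above do not actually specify.

```python
from typing import Dict, Iterable, Set


class Dictionary:
    """Hypothetical, heavily simplified stand-in for a DDLm dictionary."""

    def __init__(self, items: Dict[str, dict], keys: Dict[str, Set[str]]) -> None:
        # items: data name -> {"category": str, "linked": str or None}
        #        ("linked" plays the role of _name.linked_item_id)
        # keys:  category -> set of key data names
        self.items = items
        self.keys = keys


def appearing_categories(block_names: Iterable[str], dictionary: Dictionary) -> Set[str]:
    """Principle (iv): categories that 'appear' in a block."""
    cats: Set[str] = set()
    for name in block_names:
        defn = dictionary.items.get(name)
        if defn is None:
            continue  # principle (v): undefined names count as absent
        cats.add(defn["category"])
        linked = defn.get("linked")
        if linked is not None and linked in dictionary.items:
            # The category of the linked (parent) item also appears.
            cats.add(dictionary.items[linked]["category"])
    return cats


def dictionaries_compatible(block_a: Iterable[str], dict_a: Dictionary,
                            block_b: Iterable[str], dict_b: Dictionary) -> bool:
    """Principle (vi): same key data names for every appearing category."""
    cats = appearing_categories(block_a, dict_a) | appearing_categories(block_b, dict_b)
    # Assumption: a category missing from a dictionary has an empty key set.
    return all(dict_a.keys.get(c, set()) == dict_b.keys.get(c, set()) for c in cats)
```

In this sketch two blocks that conform to the same dictionary are always compatible, and a dictionary that adds a key data name to an appearing category becomes incompatible with one that does not, which is where clause (2) of principle (vii) would have to take over.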

The guiding principle in designing these rules is "can we uniquely assign the values in a given data block into cells of the relational structure describing the whole dataset?". Reasons for failure that I can see might include:
1. Not being able to determine a single relational schema for the dataset (contradictory dictionaries)
2. Not knowing the values of all the key data names in a row
3. Contradictory attribute values for the same key data name values

The conditions I've described above in (iii)-(vii) attempt to exclude these failure modes while also allowing maximum leeway in unambiguous situations (e.g. a category has only one value for a single key data name for the whole dataset).
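
To make failure modes 2 and 3 concrete, here is a minimal Python sketch of assigning rows gathered from several data blocks to cells of one category table. The function name `merge_rows` and the representation of a row as a plain dict are assumptions made for illustration. Failure mode 1 (no single schema) is not covered: the key data names are simply passed in as already agreed.

```python
from typing import Dict, Iterable, List, Tuple


def merge_rows(key_names: List[str],
               rows: Iterable[Dict[str, str]]) -> Dict[Tuple[str, ...], Dict[str, str]]:
    """Assign rows (possibly from different blocks) to cells of one table.

    Raises ValueError if a row lacks a key value (failure mode 2) or if two
    rows with the same key values disagree on an attribute (failure mode 3).
    """
    table: Dict[Tuple[str, ...], Dict[str, str]] = {}
    for row in rows:
        missing = [k for k in key_names if k not in row]
        if missing:
            raise ValueError(f"cannot place row, missing key values: {missing}")
        key = tuple(row[k] for k in key_names)
        cells = table.setdefault(key, {})
        for name, value in row.items():
            if name in cells and cells[name] != value:
                raise ValueError(
                    f"contradictory values for {name} at key {key}: "
                    f"{cells[name]!r} vs {value!r}"
                )
            cells[name] = value
    return table
```

Rows that share key values and agree on every attribute they have in common are merged into one table row, so the same information split across blocks in different ways ends up in the same cells.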

On Thu, 2 Dec 2021 at 07:42, Bollinger, John C <John.Bollinger@stjude.org> wrote:
[edit]

> In this case the example provided by John would remain as two separate data blocks, as the appropriate "Set" category key has not been defined.
>
> Would it be acceptable to couch that last statement in different terms?  In particular, how about “the example provided by John would not be a valid aggregate, as […]”?  I am concerned with the conclusion that might otherwise be drawn, that every collection of data blocks should _automatically_ be viewed as an aggregate, assembled in whatever manner the data permit.  Also, considering the use case of extending aggregates simply by dropping in additional data blocks without other modification, I am much more comfortable with the idea that that might break the aggregate than with the idea that it might cause the data present in the aggregate before the addition to be interpreted differently after the addition.

Quite agree. I indeed meant to imply that it was not a valid aggregate, that is, the blocks remain completely independent of one another.

> For the same reason that no single dictionary could support all the patterns of inferred relationships that one might plausibly want, no single dictionary can support all those patterns of explicit relationships, either.  Therefore, I think we are now talking about having dictionaries specific to various families of aggregation patterns.  In that case, some kind of representation of these dictionaries will need to be created and maintained.  Where?  By whom?  Is it assumed that COMCIFS or any of the existing dictionary maintenance groups would play a role?  Is it assumed that the producers or consumers of such data will provide these directly?  Dare I suggest an option of delivering some form of a dictionary description inside the aggregate?

I have taken some of these comments into account in the revised rules above.

To answer John's questions specifically:
Dictionaries specific to aggregation patterns? At this point I believe that our current set of dictionaries describes a single relational space, and that we can relatively easily extend them to cover typical aggregates without those extensions contradicting one another. But I cannot rule out the possibility that sometime in the future we will run up against incompatibilities, so...

Creation and maintenance: COMCIFS would register and maintain dictionaries used in describing common aggregates, as necessary. These would be developed by interested working groups.

Internal provision of dictionaries: Data objects may state conformance against a custom dictionary by providing an appropriate location in the audit_conform category, which may resolve to a dictionary file within the aggregate. I don't think such embedding is to be encouraged, as it is essentially impossible for software to support arbitrary custom dictionaries in any meaningful way. The information of interest to software authors is in the text descriptions of data names, so even if such embedded dictionaries allowed everything to be put into a single relational schema, nothing further can be done until a software author reads the actual dictionary to find out what the data names mean. At which point custom software will need to be written.

> Does this scheme permit aggregating blocks that declare different _audit.schema?  Is _audit.schema even still relevant if we rely on explicit dictionaries to direct how aggregation can be performed?

I have removed the reference to _audit.schema in the rules, but yes, it definitely does allow different _audit.schema. I think it is still useful as a minimal-effort way to allow software that is still being maintained to detect data that does not conform with legacy expectations. We could perhaps simplify it to be binary-valued, e.g. either 'Base' or 'Dictionary'? I'm also still a bit on the fence about removing this, as I'd like a single powder dictionary to allow _phase.id to be looped and unlooped in different blocks, without needing a different dictionary to be created. Might require some more thought.

> Similarly, “the dictionary to which those data blocks conform” could be read to indicate that data blocks may be aggregated only if they conform to the same dictionary as each other.  Is that intentional?  Desirable?

Not intentional or particularly desirable; I have updated the principles above.

> Also, are data blocks required to explicitly specify the dictionary to which they conform (via the audit_conform category) in order to be eligible for aggregation?  If not, then how do we know which dictionary(-ies) to use, since it seems likely that we will have distinct dictionaries with overlapping definitions.

See the revised principles above. I don't think we should ever have overlapping contradictory definitions? If a data name is defined twice it must be referring to the same relational attribute. I have also introduced the possibility of a single dictionary that would override any block-level dictionaries for the whole aggregate.

> If a Set category is not given at all in a particular data block, then may its child categories appear in that block?  If so, then are all the child data names required to be the same as each other?  Another way to look at this may be to consider whether the automatic assignment of a value for the Set category’s key applies in this case.

I may not have fully appreciated the thrust of this question, but let me try to reply. Only if items from that Set category appear in no other data blocks would automatic assignment be possible, as only in this case is there no ambiguity.  Otherwise it is a violation. I think this is logical. If a single-phase powder sample has a separate data block giving the atomic positions but no explicit phase id or child key data name in atom_site, then there is no ambiguity, but if another data block is added that lists 3 phases the original atomic position data block is now impossible to aggregate unambiguously.
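
The single-phase versus three-phase example can be sketched as follows. This is only one possible reading: it treats a dataset with at most one distinct Set key value as unambiguous, and the function name `assign_elided_set_key`, the block names and the placeholder value are all hypothetical.

```python
from typing import Dict, List


def assign_elided_set_key(declared: Dict[str, List[str]],
                          placeholder: str = "auto_1") -> Dict[str, List[str]]:
    """Fill in a Set category key for blocks that do not state it.

    `declared` maps a block name to the key values (e.g. phase ids) that the
    block states explicitly; an empty list means the key was elided.
    """
    distinct = {value for values in declared.values() for value in values}
    elided = [block for block, values in declared.items() if not values]
    if elided and len(distinct) > 1:
        # Several candidate values exist, so the elided key is ambiguous.
        raise ValueError(f"cannot assign a key value for blocks {elided}")
    # Zero distinct values: any arbitrary value will do (principle xii).
    fill = next(iter(distinct)) if distinct else placeholder
    return {block: (list(values) if values else [fill])
            for block, values in declared.items()}
```

With a lone atomic positions block the key is assigned arbitrarily; once another block lists three phases, the same positions block can no longer be placed and the call raises an error, which mirrors the violation described above.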

> The proposal’s wording seems to assume that a Set category will have at most one key data name.  Loop categories are not subject to such a constraint, so is this a special rule for Set categories?  That might become an issue if we ever move toward forming aggregates of aggregates.  And maybe even if we don’t.

Hmm. Interesting question. I have updated the principles as it is indeed possible that Set categories can accumulate key data names that are children of other categories, e.g. "cell" could accumulate key data names referring both to powder phase and to diffraction conditions (_diffrn.id). Aggregates of aggregates I would treat simply as appending all contents together.

> As I understand the proposal, it provides for cross-data-block relationships only where those are explicitly specified by parent and child keys appearing in one and the other block.  That’s a perfectly valid choice, but it does mean that one probably cannot perform much useful aggregation of data blocks that were not written or modified specifically for the purpose.  Is that an acceptable constraint?

I think so. I'd suggest that this is an essential element of the relational model, and that if data blocks do form a data set then those relationships necessarily exist, otherwise the data set could be split into separate parts with no loss of information.

> Should the proposal account more directly for the implicit relationships among otherwise disconnected data that arise from those data being presented in the same data block?  The allowance for elision of (certain) child keys depends on these relationships for correctness, and I think dREL relies implicitly on them, too.  I am uneasy about not taking them fully into consideration.

It seems intuitively reasonable to me that any meaningful link between categories would be reflected by the provision of linked data names. Where these do not exist, then the two categories could be presented in separate data blocks with no loss of information. The example that immediately comes to mind is the journal data names: assuming there actually is no reference to items in the other categories, these data items could be placed in a separate data block in the same aggregate with no loss of information (assuming one article per dataset). Whatever the case, this is not a hill I care about dying on: we can, if we want, define both a "_dataset.id" and "_block.id" to be implicitly added to all categories "just in case".

I think we can design dREL to not care about blocks. As I am imagining it, any missing information required for cross-category references would be obtained from explicitly-defined key data name relationships, not from same-block relationships.

> Operation of dREL
>
> ==============
[edit]
> My gut tells me that dREL methods cannot reliably be imported from dictionary A into dictionary B if B adds any key data names to categories defined by A.  I don’t doubt that many specific cases could be made to work, but I predict insurmountable difficulties in the general case if backwards compatibility is required.

We do allow and accept that dREL methods might need to be rewritten by dictionary B, but I think the assumptions that we have defined for dREL make this rewriting less likely.

> What I could see happening is methods from A operating on multiblock
> data in terms of A-shaped slices.  I think that is roughly the same
> idea as presented in the above discussion. I do not consider it the
> same thing as importing the methods into B, at least not as any of the
> kinds of methods that we currently support.  A per-slice approach
> would, of course, depend on a mechanism to define the slices.  I think
> that would be straightforward if we could identify slices with data
> blocks, but my understanding of the discussion so far is that that is
> not an acceptable delineation

I am indeed inclined towards making data blocks "disappear" in the final relational schema. I'm open to alternatives, but my vision is that data blocks represent possible slices of the full relational schema, and that the way in which these slices are taken should not affect the meaning as embodied in dREL. From a different perspective, making dREL aware of blocks creates thorny problems when writing dREL that refers to data scattered across multiple blocks: if I want to calculate a measured powder intensity at multiple temperatures at an angle with contributions from all phases, I would like to loop over "_phase.id" for each single "_diffrn.id". The rules I propose above (edited out in this reply) are designed to make this the natural operation in a category that has a child key data name of `_diffrn.id`. However, if I am instead to make cross-block references explicit, I need a way to refer to those blocks in dREL or some principles that allow dREL to make sense both in the context that single phases are in separate blocks, and in the context that all the phases are grouped together in a single block (the so-called "summary block" I mentioned at the beginning). Anybody is welcome to suggest an elegant solution to this as I haven't been able to find a good one.
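
The block-independent calculation described above can be illustrated in Python (not dREL) along these lines. The row fields `diffrn_id`, `phase_id`, `two_theta` and `intensity` are hypothetical stand-ins for whatever data names a powder dictionary would define; the point is only that the result depends on key values and not on which block a row came from.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def total_intensity(contributions: Iterable[dict]) -> Dict[Tuple[str, float], float]:
    """Sum per-phase contributions for each (diffrn_id, two_theta) pair.

    Each row is a dict with the hypothetical fields 'diffrn_id', 'phase_id',
    'two_theta' and 'intensity'. The block a row came from is never consulted.
    """
    totals: Dict[Tuple[str, float], float] = defaultdict(float)
    for row in contributions:
        totals[(row["diffrn_id"], row["two_theta"])] += row["intensity"]
    return dict(totals)


# Rows as they might be read from two per-phase blocks...
phase_a_block = [{"diffrn_id": "T100", "phase_id": "A", "two_theta": 10.0, "intensity": 3.0}]
phase_b_block = [{"diffrn_id": "T100", "phase_id": "B", "two_theta": 10.0, "intensity": 1.5}]
# ...or from a single summary block holding both phases.
summary_block = phase_a_block + phase_b_block

separate = total_intensity(phase_a_block + phase_b_block)
combined = total_intensity(summary_block)
```

Here `separate` and `combined` are identical because the loop over phases is driven entirely by the key values, which is the behaviour being asked of dREL in a category with a child key data name of `_diffrn.id`.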

I should reiterate that we always have the fallback option as far as dREL is concerned of rewriting the dREL - any scheme I'm proposing is just to simplify common cases, and so any new scheme should also aim for some simplification.

> I think there are a number of questions still to be answered (see above) before that one can be.  Right now, I’d have to say “maybe.”

Hopefully the rewritten rules at the top make some forward progress.

all the best,
James.
--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
