

Re: [ddlm-group] Multi block principles

Dear DDLm group,
Here at the top I'll provide an updated, renamed "dictionary driven" proposal, with comments afterwards and comments inline as well. I note that it is probably quite hard to follow progress - I would be happy to move this discussion to GitHub and communicate the result here if others agree.
Dictionary Driven Proposal for Data Block Aggregation into a Dataset
=====================================================
(i) Multiple data blocks are aggregated into a data set by being presented in the same data container (e.g. file, zip archive, directory).
(ii) An aggregate is "valid" against a relational schema if the data contained can, in principle, be assigned unambiguously and consistently to cells in that relational schema.
The remaining principles apply only to data items for which a mapping to a CIF-dictionary-defined data name is available.
(iii) Dictionary conformance of individual data objects is specified either by audit_conform data names in each block, or as-yet-undefined dataset-level audit_dataset data names in any block. Dataset-level dictionaries override block-level dictionaries.
(iv) Appearance: A category "appears" in a data block if any data name belonging to it is either present, or referred to via `_name.linked_item_id` of a data name that is present.
(v) If a data name appearing in a data block is not defined in the dictionary to which that data block conforms, it is considered absent for the purposes of these principles.
(vi) Compatibility: Two dictionaries describing data blocks are compatible if, for all categories that appear in those data blocks, the dictionaries prescribe the same set of key data names.
(vii) Data blocks are compatible if either (1) their dictionaries are compatible or (2) a unique value can be determined or assigned for any key data names that are absent from one of the incompatible dictionaries.
(viii) A "Set" category is a "Loop" category for which only one row may be presented in a data block.
(ix) The set of "Set" categories for a data block is determined by the dictionary to which that data block conforms.
(x) A "Set" category may only be aggregated from multiple data blocks if at least one key data name for the "Set" category has been provided in the dictionary to which those data blocks conform.
(xi) Where a "Set" category has been provided with a key data name in the dictionary as per (x), and that data name is not itself a child data name of some other data name, all child data names must also be provided in the dictionary.
(xii) The value of "Set" category key data names for a given data block may be assigned arbitrarily if missing.
(xiii) Values for child data names of "Set" category key data names may always be elided.
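As a rough illustration of principles (iv)-(vi), and not part of the proposal itself, the appearance and compatibility checks might be sketched in Python as follows. The `Dictionary` class, the function names and the example data name `_atom_site.phase_id` are all hypothetical, and I am assuming that a category missing from one of the two dictionaries counts as a mismatch.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Dictionary:
    """Hypothetical, heavily simplified stand-in for a DDLm dictionary."""
    category_of: dict[str, str]                               # data name -> category
    linked_item: dict[str, str] = field(default_factory=dict)  # child name -> parent name
    keys: dict[str, frozenset] = field(default_factory=dict)   # category -> key data names


def appearing_categories(block_names, dictionary):
    """Principle (iv): categories that 'appear' in a block, ignoring
    names the dictionary does not define, as per (v)."""
    cats = set()
    for name in block_names:
        if name not in dictionary.category_of:
            continue  # (v): undefined names are treated as absent
        cats.add(dictionary.category_of[name])
        parent = dictionary.linked_item.get(name)
        if parent in dictionary.category_of:
            cats.add(dictionary.category_of[parent])  # referred to via linked item
    return cats


def dictionaries_compatible(block_a, dict_a, block_b, dict_b):
    """Principle (vi): same key data names for every appearing category.
    Assumption: a category unknown to one dictionary counts as a mismatch."""
    cats = appearing_categories(block_a, dict_a) | appearing_categories(block_b, dict_b)
    return all(dict_a.keys.get(c) == dict_b.keys.get(c) for c in cats)


# Illustrative dictionaries only; the real definitions may differ.
base = Dictionary(
    category_of={"_atom_site.label": "atom_site", "_cell.length_a": "cell"},
    keys={"atom_site": frozenset({"_atom_site.label"}), "cell": frozenset()},
)
powder = Dictionary(
    category_of={"_atom_site.label": "atom_site", "_atom_site.phase_id": "atom_site",
                 "_cell.length_a": "cell", "_phase.id": "phase"},
    linked_item={"_atom_site.phase_id": "_phase.id"},
    keys={"atom_site": frozenset({"_atom_site.label", "_atom_site.phase_id"}),
          "cell": frozenset(), "phase": frozenset({"_phase.id"})},
)
```

With these toy dictionaries, two blocks containing only `_cell.length_a` are compatible, whereas two blocks containing `_atom_site.label` are not, because the key data names prescribed for atom_site differ.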
The guiding principle in designing these rules is "can we uniquely assign the values in a given data block into cells of the relational structure describing the whole dataset?". Reasons for failure that I can see might include:
1. Not being able to determine a single relational schema for the dataset (contradictory dictionaries)
2. Not knowing the values of all the key data names in a row
3. Contradictory attribute values for the same key data name values
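Failure modes 2 and 3 can be made concrete with a small sketch, assuming each row has already been reduced to a mapping from data name to value and that the key data names of the category are known. `merge_rows`, `AggregationError` and the data name `_phase.name` are hypothetical names used purely for illustration.

```python
class AggregationError(Exception):
    """Raised when rows cannot be placed unambiguously in the relational table."""


def merge_rows(key_names, rows):
    """Assign rows to cells of one relational table, indexed by key values."""
    table = {}
    for row in rows:
        missing = [k for k in key_names if k not in row]
        if missing:
            # Failure mode 2: a key value for this row is unknown
            raise AggregationError(f"key value(s) unknown: {missing}")
        key = tuple(row[k] for k in key_names)
        cells = table.setdefault(key, {})
        for name, value in row.items():
            if name in cells and cells[name] != value:
                # Failure mode 3: same key values, contradictory attribute
                raise AggregationError(f"contradictory {name} for key {key}")
            cells[name] = value
    return table


# Rows for the same key may come from different data blocks.
merged = merge_rows(
    ["_phase.id"],
    [{"_phase.id": "A", "_phase.name": "rutile"},
     {"_phase.id": "A"},
     {"_phase.id": "B", "_phase.name": "anatase"}],
)
```

Rows that agree are simply merged into the same cells, so repeating information across blocks is harmless in this sketch as long as it is consistent.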
The conditions I've described above in (iii)-(vii) attempt to exclude these failure modes while also allowing maximum leeway in unambiguous situations (e.g. a category has only one value for a single key data name for the whole dataset).
On Thu, 2 Dec 2021 at 07:42, Bollinger, John C <John.Bollinger@stjude.org> wrote:
[edit]
> In this case the example provided by John would remain as two separate data blocks, as the appropriate "Set" category key has not been defined.
>
> Would it be acceptable to couch that last statement in different terms? In particular, how about “the example provided by John would not be a valid aggregate, as […]”? I am concerned with the conclusion that might otherwise be drawn, that every collection of data blocks should _automatically_ be viewed as an aggregate, assembled in whatever manner the data permit. Also, considering the use case of extending aggregates simply by dropping in additional data blocks without other modification, I am much more comfortable with the idea that that might break the aggregate than with the idea that it might cause the data present in the aggregate before the addition to be interpreted differently after the addition.
Quite agree. I indeed meant to imply that it was not a valid aggregate, that is, the blocks remain completely independent of one another.
> For the same reason that no single dictionary could support all the patterns of inferred relationships that one might plausibly want, no single dictionary can support all those patterns of explicit relationships, either.  Therefore, I think we are now talking about having dictionaries specific to various families of aggregation patterns.  In that case, some kind of representation of these dictionaries will need to be created and maintained.  Where?  By whom?  Is it assumed that COMCIFS or any of the existing dictionary maintenance groups would play a role?  Is it assumed that the producers or consumers of such data will provide these directly?  Dare I suggest an option of delivering some form of a dictionary description inside the aggregate?
I have taken some of these comments into account in the revised rules above.
To answer John's questions specifically:
Dictionaries specific to aggregation patterns? At this point I believe that our current set of dictionaries describes a single relational space, and that we can relatively easily extend them to cover typical aggregates without those extensions contradicting one another. But I cannot rule out the possibility that sometime in the future we will run up against incompatibilities, so...
Creation and maintenance: COMCIFS would register and maintain dictionaries used in describing common aggregates, as necessary. These would be developed by interested working groups.
Internal provision of dictionaries: Data objects may state conformance against a custom dictionary by providing an appropriate location in the audit_conform category, which may resolve to a dictionary file within the aggregate. I don't think such embedding is to be encouraged, as it is essentially impossible for software to support arbitrary custom dictionaries in any meaningful way. The information of interest to software authors is in the text descriptions of data names, so even if such embedded dictionaries allowed everything to be put into a single relational schema, nothing further can be done until a software author reads the actual dictionary to find out what the data names mean. At which point custom software will need to be written.
> Does this scheme permit aggregating blocks that declare different _audit.schema?  Is _audit.schema even still relevant if we rely on explicit dictionaries to direct how aggregation can be performed?
I have removed the reference to _audit.schema in the rules, but yes, it definitely does allow different _audit.schema. I think it is still useful as a minimal-effort way to allow software that is still being maintained to detect data that does not conform with legacy expectations. We maybe could simplify it to be binary-valued, e.g. either 'Base' or 'Dictionary'? I'm also still a bit on the fence about removing this, as I'd like a single powder dictionary to allow _phase.id to be looped and unlooped in different blocks, without needing a different dictionary to be created. Might require some more thought.
> Similarly, “the dictionary to which those data blocks conform” could be read to indicate that data blocks may be aggregated only if they conform to the same dictionary as each other.  Is that intentional?  Desirable?
Not intentional or particularly desirable; I have updated the principles above.
> Also, are data blocks required to explicitly specify the dictionary to which they conform (via the audit_conform category) in order to be eligible for aggregation?  If not, then how do we know which dictionary(-ies) to use, since it seems likely that we will have distinct dictionaries with overlapping definitions.
See the revised principles above. I don't think we should ever have overlapping contradictory definitions? If a data name is defined twice it must be referring to the same relational attribute. I have also introduced the possibility of a single dictionary that would override any block-level dictionaries for the whole aggregate.
> If a Set category is not given at all in a particular data block, then may its child categories appear in that block?  If so, then are all the child data names required to be the same as each other?  Another way to look at this may be to consider whether the automatic assignment of a value for the Set category’s key applies in this case.
I may not have fully appreciated the thrust of this question, but let me try to reply. Only if items from that Set category appear in no other data blocks would automatic assignment be possible, as only in this case is there no ambiguity. Otherwise it is a violation. I think this is logical. If a single-phase powder sample has a separate data block giving the atomic positions but no explicit phase id or child key data name in atom_site, then there is no ambiguity, but if another data block is added that lists 3 phases the original atomic position data block is now impossible to aggregate unambiguously.
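The powder example can be sketched as below, under some loud assumptions: a block is a plain mapping from data name to value(s), the category is taken to be the part of the data name before the first dot, and `_atom_site.phase_id` is a hypothetical child key of `_phase.id`. None of the function names come from existing software.

```python
def category(name):
    """Category part of a data name, e.g. '_phase.id' -> 'phase' (assumed convention)."""
    return name.lstrip("_").split(".", 1)[0]


def has_category(block, cat):
    """True if any data name in the block belongs to category `cat`."""
    return any(category(name) == cat for name in block)


def fill_elided_child_key(blocks, set_key, child_key, auto_value="auto"):
    """Assign an arbitrary value to an elided child key, but only when the
    Set category appears in no other block; otherwise raise ValueError."""
    set_cat, child_cat = category(set_key), category(child_key)
    filled = []
    for i, block in enumerate(blocks):
        block = dict(block)  # do not modify the caller's data
        needs_value = (has_category(block, child_cat)
                       and child_key not in block
                       and not has_category(block, set_cat))
        if needs_value:
            if any(has_category(other, set_cat)
                   for j, other in enumerate(blocks) if j != i):
                raise ValueError(
                    f"block {i}: cannot assign {child_key} unambiguously")
            block[child_key] = auto_value
        filled.append(block)
    return filled


positions = {"_atom_site.label": ["Fe1", "O1"]}   # no phase information at all
three_phases = {"_phase.id": ["A", "B", "C"]}     # block added later

alone = fill_elided_child_key([positions], "_phase.id", "_atom_site.phase_id")
```

On its own the atomic position block receives an arbitrary phase identifier; passing `[positions, three_phases]` instead raises `ValueError`, which mirrors the violation described above.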
> The proposal’s wording seems to assume that a Set category will have at most one key data name.  Loop categories are not subject to such a constraint, so is this a special rule for Set categories?  That might become an issue if we ever move toward forming aggregates of aggregates.  And maybe even if we don’t.
Hmm. Interesting question. I have updated the principles as it is indeed possible that Set categories can accumulate key data names that are children of other categories, e.g. "cell" could accumulate key data names referring both to powder phase and to diffraction conditions (_diffrn.id). Aggregates of aggregates I would treat simply as appending all contents together.
> As I understand the proposal, it provides for cross-data-block relationships only where those are explicitly specified by parent and child keys appearing in one and the other block.  That’s a perfectly valid choice, but it does mean that one probably cannot perform much useful aggregation of data blocks that were not written or modified specifically for the purpose.  Is that an acceptable constraint?
I think so. I'd suggest that this is an essential element of the relational model, and that if data blocks do form a data set then those relationships necessarily exist, otherwise the data set could be split into separate parts with no loss of information.
> Should the proposal account more directly for the implicit relationships among otherwise disconnected data that arise from those data being presented in the same data block?  The allowance for elision of (certain) child keys depends on these relationships for correctness, and I think dREL relies implicitly on them, too.  I am uneasy about not taking them fully into consideration.
It seems intuitively reasonable to me that any meaningful link between categories would be reflected by the provision of linked data names. Where these do not exist, then the two categories could be presented in separate data blocks with no loss of information. The example that immediately comes to mind is the journal data names: assuming there actually is no reference to items in the other categories, these data items could be placed in a separate data block in the same aggregate with no loss of information (assuming one article per dataset). Whatever the case, this is not a hill I care about dying on: we can, if we want, define both a "_dataset.id" and "_block.id" to be implicitly added to all categories "just in case".
I think we can design dREL to not care about blocks. As I am imagining it, any missing information required for cross-category references would be obtained from explicitly-defined key data name relationships, not from same-block relationships.
> Operation of dREL
> ==============
[edit]
> My gut tells me that dREL methods cannot reliably be imported from dictionary A into dictionary B if B adds any key data names to categories defined by A. I don’t doubt that many specific cases could be made to work, but I predict insurmountable difficulties in the general case if backwards compatibility is required.
We do allow and accept that dREL methods might need to be rewritten by dictionary B, but I think the assumptions that we have defined for dREL make this rewriting less likely.
> What I could see happening is methods from A operating on multiblock data in terms of A-shaped slices.  I think that is roughly the same idea as presented in the above discussion. I do not consider it the same thing as importing the methods into B, at least not as any of the kinds of methods that we currently support.  A per-slice approach would, of course, depend on a mechanism to define the slices.  I think that would be straightforward if we could identify slices with data blocks, but my understanding of the discussion so far is that that is not an acceptable delineation
I am indeed inclined towards making data blocks "disappear" in the final relational schema. I'm open to alternatives, but my vision is that data blocks represent possible slices of the full relational schema, and that the way in which these slices are taken should not affect the meaning as embodied in dREL. From a different perspective, making dREL aware of blocks creates thorny problems when writing dREL that refers to data scattered across multiple blocks: if I want to calculate a measured powder intensity at multiple temperatures at an angle with contributions from all phases, I would like to loop over "_phase.id" for each single "_diffrn.id". The rules I propose above (edited out in this reply) are designed to make this the natural operation in a category that has a child key data name of `_diffrn.id`. However, if I am instead to make cross-block references explicit, I need a way to refer to those blocks in dREL or some principles that allow dREL to make sense both in the context that single phases are in separate blocks, and in the context that all the phases are grouped together in a single block (the so-called "summary block" I mentioned at the beginning). Anybody is welcome to suggest an elegant solution to this as I haven't been able to find a good one.
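To illustrate what I mean by blocks "disappearing", here is a Python stand-in (not dREL) for the powder calculation, assuming each block has already been expanded into rows carrying explicit key values. The row fields `diffrn_id`, `phase_id` and `intensity`, and both function names, are hypothetical.

```python
from collections import defaultdict


def flatten(blocks):
    """Discard block boundaries: the relational table is the union of all rows."""
    return [row for block in blocks for row in block]


def summed_intensity(rows):
    """Sum the contributions of all phases for each diffraction condition."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["diffrn_id"]] += row["intensity"]
    return dict(totals)


# One block per phase...
per_phase_blocks = [
    [{"diffrn_id": "100K", "phase_id": "A", "intensity": 10.0},
     {"diffrn_id": "300K", "phase_id": "A", "intensity": 8.0}],
    [{"diffrn_id": "100K", "phase_id": "B", "intensity": 5.0},
     {"diffrn_id": "300K", "phase_id": "B", "intensity": 4.0}],
]
# ...or the same rows grouped into a single "summary block".
summary_block = [flatten(per_phase_blocks)]

from_separate = summed_intensity(flatten(per_phase_blocks))
from_summary = summed_intensity(flatten(summary_block))
```

Because the calculation only ever sees rows and their key values, the per-phase layout and the summary-block layout give the same totals, which is the behaviour I would like dREL to have.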
I should reiterate that we always have the fallback option as far as dREL is concerned of rewriting the dREL - any scheme I'm proposing is just to simplify common cases, and so any new scheme should also aim for some simplification.
> I think there are a number of questions still to be answered (see above) before that one can be.  Right now, I’d have to say “maybe.”
Hopefully the rewritten rules at the top make some forward progress.
all the best,
James.
--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group
