Re: [ddlm-group] Multi block principles

Dear DDLm group,

 

My apologies to James and all of you for being slow to respond to James’s prior message.  Please find my comments in response to his latest message inline below.

 

 

On Wednesday, December 1, 2021 12:05 AM, James H <jamesrhester@gmail.com> wrote:

Having had some time to think about this further, I've had the following, not entirely groundbreaking, thought: there is no need to simplify the work of dictionary authors, as we have developed a dictionary style guide. This style guide allows the writing of automated tools to process dictionaries that do not introduce spurious changes in ordering or whitespace that would confuse people looking for substantial differences. Therefore, rather than simplify the task of dictionary authors by defining special behaviour and automatic implicit parent-child key data name relationships, it would be sufficient to produce a tool that could be fed a list of category relationships and a dictionary, outputting an updated dictionary with the relevant key data names defined appropriately. This in turn means that human dictionary readers can continue to simply look at the list of category keys and associated definitions to understand relationships.

 

 

Ok, that sounds reasonable.

 

 

I was also operating under the misguided assumption that implicit key data names and their children could be deduced for all Set categories in our current dictionaries. I'm no longer sure that this is true, in which case explicit specification of category keys and links is going to be required. I note that the "cif_multiblock" proposal assumed this from the start.

 

 

That’s gracious, thank you. Yes, I have been approaching the issue from the start with the view that a general-purpose scheme for recognizing and handling multiblock data sets would need to allow for explicit specification of keys and links, and the cif_multiblock proposal reflects that. With that said, I do think that some relationships could be deduced automatically, especially with the help of hints embedded in dictionaries. If that were enough then I think one of the earlier forms of the “bare bones” proposal probably would be viable. However, I do not accept that any single dictionary could support inferring all the relationships that one might reasonably want to form. For example, one can consider pairs of Set categories that plausibly could be assembled with a parent-child relationship that runs in either direction. The direction doesn’t matter within a single block, but it does if we aggregate multiple rows of one or both of the categories. I don’t see how a deterministic approach could reliably choose the correct direction in every case based on the same dictionary and without additional input.

 

 

So an updated "bare bones" proposal reads as follows:

 

(i) Multiple data blocks are aggregated into a data set by being presented in the same data container (e.g. file, zip archive, directory).

(ii) The following backwards-compatible assumption is made for DDLm: A "Set" category is a "Loop" category for which only one row may be presented in a data block.

(iii) The set of "Set" categories for a data block is determined by _audit.schema.

(iv) A "Set" category may only be aggregated from multiple data blocks if a key data name for the "Set" category has been provided in the dictionary to which those data blocks conform.

(v) Where a "Set" category has been provided with a key data name in the dictionary as per (iv), all child data names must also be provided in the dictionary.

(vi) The value of a "Set" category key data name for a given data block may be explicitly stated or, if missing, an arbitrary, unique value is assigned.

(vii) Values for child data names of "Set" category key data names may always be elided.
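As a rough illustration of rules (iv) and (vi) only, the following Python sketch aggregates one "Set" category across several data blocks. It rests on assumptions that are not part of the proposal: data blocks are modelled as plain dicts, the dictionary is reduced to a mapping from category name to key data name, `_cell.id` is a hypothetical key data name, and the `auto_N` form of the generated values is arbitrary.

```python
import itertools


def aggregate_set_category(blocks, category, dictionary_keys):
    """Collect the rows of one "Set" category from several data blocks.

    blocks: list of dicts, each mapping a category name to its single row
            (a dict of data name -> value).
    dictionary_keys: dict mapping a category name to the key data name the
            dictionary defines for it (a hypothetical stand-in for a real
            DDLm dictionary).
    """
    key_name = dictionary_keys.get(category)
    if key_name is None:
        # Rule (iv): without a dictionary-defined key there is no aggregation.
        raise ValueError(f"Set category {category!r} has no key data name")

    # Copy each block's row so the input blocks are left unmodified.
    rows = [dict(block[category]) for block in blocks if category in block]
    used = {row[key_name] for row in rows if key_name in row}

    counter = itertools.count(1)
    for row in rows:
        if key_name not in row:
            # Rule (vi): assign an arbitrary value that is not already in use.
            value = f"auto_{next(counter)}"
            while value in used:
                value = f"auto_{next(counter)}"
            used.add(value)
            row[key_name] = value
    return rows


blocks = [
    {"cell": {"_cell.length_a": "5.0"}},
    {"cell": {"_cell.id": "auto_1", "_cell.length_a": "6.0"}},
]
rows = aggregate_set_category(blocks, "cell", {"cell": "_cell.id"})
```

The sketch collects explicitly stated key values before generating any, so a generated value cannot collide with a stated one. It does not decide what should happen when two blocks state the same key value explicitly, which the rules as written leave open.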

 

In this case the example provided by John would remain as two separate data blocks, as the appropriate "Set" category key has not been defined.

 

 

Would it be acceptable to couch that last statement in different terms?  In particular, how about “the example provided by John would not be a valid aggregate, as […]”?  I am concerned with the conclusion that might otherwise be drawn, that every collection of data blocks should _automatically_ be viewed as an aggregate, assembled in whatever manner the data permit.  Also, considering the use case of extending aggregates simply by dropping in additional data blocks without other modification, I am much more comfortable with the idea that that might break the aggregate than with the idea that it might cause the data present in the aggregate before the addition to be interpreted differently after the addition.

 

Additional considerations include:

 

  • My understanding of this version of the proposal is that aggregation would be performed only as directed and allowed by an explicit dictionary.  If that’s accurate, then perhaps a more appropriate designation for it would be “dictionary directed” or similar.  I think we’ve moved beyond “bare bones” territory by this point.

 

  • For the same reason that no single dictionary could support all the patterns of inferred relationships that one might plausibly want, no single dictionary can support all those patterns of explicit relationships, either.  Therefore, I think we are now talking about having dictionaries specific to various families of aggregation patterns.  In that case, some kind of representation of these dictionaries will need to be created and maintained.  Where?  By whom?  Is it assumed that COMCIFS or any of the existing dictionary maintenance groups would play a role?  Is it assumed that the producers or consumers of such data will provide these directly?  Dare I suggest an option of delivering some form of a dictionary description inside the aggregate?

 

  • Does this scheme permit aggregating blocks that declare different _audit.schema?  Is _audit.schema even still relevant if we rely on explicit dictionaries to direct how aggregation can be performed?

 

  • Similarly, “the dictionary to which those data blocks conform” could be read to indicate that data blocks may be aggregated only if they conform to the same dictionary as each other.  Is that intentional?  Desirable?

 

  • Also, are data blocks required to explicitly specify the dictionary to which they conform (via the audit_conform category) in order to be eligible for aggregation?  If not, then how do we know which dictionary(-ies) to use, since it seems likely that we will have distinct dictionaries with overlapping definitions?

 

  • If a Set category is not given at all in a particular data block, then may its child categories appear in that block?  If so, then are all the child data names required to be the same as each other?  Another way to look at this may be to consider whether the automatic assignment of a value for the Set category’s key applies in this case.

 

  • The proposal’s wording seems to assume that a Set category will have at most one key data name.  Loop categories are not subject to such a constraint, so is this a special rule for Set categories?  That might become an issue if we ever move toward forming aggregates of aggregates.  And maybe even if we don’t.

 

  • As I understand the proposal, it provides for cross-data-block relationships only where those are explicitly specified by parent and child keys appearing in one and the other block.  That’s a perfectly valid choice, but it does mean that one probably cannot perform much useful aggregation of data blocks that were not written or modified specifically for the purpose.  Is that an acceptable constraint?

 

  • Should the proposal account more directly for the implicit relationships among otherwise disconnected data that arise from those data being presented in the same data block?  The allowance for elision of (certain) child keys depends on these relationships for correctness, and I think dREL relies implicitly on them, too.  I am uneasy about not taking them fully into consideration.

 

 

 

Operation of dREL
=================

 

The only dREL construct that still needs to be considered is the "Loop" over a category. This would normally consider every row of a category in turn. For example, one might loop over atom sites to determine the density of the compound. However, in a situation where a category has been equipped by a dictionary with extra key data names, the dREL routine needs to decide whether it should still consider all rows, or only a subset. So for a multi-phase powder sample, we would want to calculate density for each compound in turn, not simply use all atom sites as before. There are a number of ways of making this work: for example, a default rule that the loop varies only those key data names of the looped-over category that are *not* child or parent data names of the keys of the category in which the calculation takes place would cover most situations that I can see. This rule would not cover surrogate keys, so might need to be expanded to cover all key data name relationships that can be deduced.
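One possible reading of that default rule, sketched in Python rather than dREL: key data names of the looped-over category that are linked to a key of the calculating category are held fixed at the current row's values, and the loop ranges over the remaining rows. Everything named here is an assumption made for illustration, including the `rows_for_loop` helper, the `links` mapping, and the `_atom_site.phase_id` and `_phase.id` data names. Surrogate keys are not handled.

```python
def rows_for_loop(loop_rows, loop_key_names, links, current_row):
    """Select the rows of a looped-over category relevant to one calculation.

    loop_rows: rows (dicts) of the looped-over category.
    loop_key_names: key data names of the looped-over category.
    links: maps a key data name of the looped-over category to the related
           key data name of the calculating category (hypothetical).
    current_row: the row of the calculating category being evaluated.
    """
    # Linked keys are pinned to the current row's values; unlinked keys vary.
    fixed = {
        name: current_row[links[name]]
        for name in loop_key_names
        if name in links
    }
    return [
        row for row in loop_rows
        if all(row[name] == value for name, value in fixed.items())
    ]


atom_sites = [
    {"_atom_site.label": "Fe1", "_atom_site.phase_id": "A"},
    {"_atom_site.label": "O1", "_atom_site.phase_id": "A"},
    {"_atom_site.label": "Si1", "_atom_site.phase_id": "B"},
]
key_names = ["_atom_site.label", "_atom_site.phase_id"]
links = {"_atom_site.phase_id": "_phase.id"}

selected = rows_for_loop(atom_sites, key_names, links, {"_phase.id": "A"})
labels = [row["_atom_site.label"] for row in selected]
```

With an empty `links` mapping the function returns every row, which matches the single-block behaviour in which a loop considers all rows of the category.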

 

 

My gut tells me that dREL methods cannot reliably be imported from dictionary A into dictionary B if B adds any key data names to categories defined by A.  I don’t doubt that many specific cases could be made to work, but I predict insurmountable difficulties in the general case if backwards compatibility is required.

 

What I could see happening is methods from A operating on multiblock data in terms of A-shaped slices.  I think that is roughly the same idea as presented in the above discussion. I do not consider it the same thing as importing the methods into B, at least not as any of the kinds of methods that we currently support.  A per-slice approach would, of course, depend on a mechanism to define the slices.  I think that would be straightforward if we could identify slices with data blocks, but my understanding of the discussion so far is that that is not an acceptable delineation.

 

 

I'm wondering if the "bare bones" proposal above is acceptable?

 

 

I think there are a number of questions still to be answered (see above) before that one can be.  Right now, I’d have to say “maybe.”

 

 

Best Regards,

 

John

 

--

John C. Bollinger, Ph.D., RHCSA

Computing and X-ray Scientist

Department of Structural Biology

St. Jude Children's Research Hospital

 


