
Re: [ddlm-group] Multi block principles

Dear DDLm group and John,

The main difference in opinion I see here is regarding the possibility of arbitrary separation of data block contents. As per John's example, I am indeed proposing that *any* Set category *should* be spread across multiple blocks whenever the necessity arises to provide multiple values for its data names. John points out that this would lose the information that these items of information are linked, and so such splitting should not take place unless that original link can be reconstituted.

My "solution" to this is to rely on the context to aggregate data blocks and files into something that the context asserts is a "dataset".  I have ended up at this position as a result of our previous discussion (see https://www.iucr.org/__data/iucr/lists/ddlm-group/msg01626.html and following comments): there is no bulletproof way to insert the appropriate information into all data blocks in all situations. "Context" more verbosely means "the collection of data objects to which this data object belongs and which has been designated by the context as a coherent dataset". The "coherency" requirement means that there are no contradictions (relations with the same key data values but different values in the corresponding rows) after assembly of the data blocks into a single set of relations.

Particular communities have specified the use of the "summary blocks" mentioned in the proposal: a separate data block in which a list of all of the blocks and their roles is collected. Powder diffraction, for example, uses a data block in which "phase_id" is looped together with the block_id of the data block containing the phase information for that phase_id. Such summary blocks cannot, in general, be guaranteed to include data blocks added after the summary block was generated (e.g. calibration data collected separately from the experiment), and they do not cover historical uses of CIF in which information about a single dataset has been split across two or more data blocks in a single file.
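As a concrete sketch, such a powder summary block might look like the following (the data names and block codes here are purely illustrative, not the actual pdCIF definitions):

```
data_summary
loop_
    _pd_phase.id
    _pd_phase.block_id
     1    phase_corundum_block
     2    phase_anatase_block
```

Here each "block_id" value names the data block holding the structural description of the corresponding phase.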

So, some options for strengthening links between data blocks, without attempting to be bulletproof:
1. A new data name, e.g. "_audit.multiblock" (true/false), indicating that more than one data block is to be expected.
2. A new data name, "_audit.dataset_id". A shared value for this data name in separate blocks is sufficient, but not necessary, for those blocks to be considered part of the same dataset.
3. Recommend or require that a summary block be included in a dataset: this covers all data objects known at the time the summary block was written.
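For illustration, option 2 might look like this in practice: two data blocks carrying the same value of the proposed (not yet adopted) data name "_audit.dataset_id", marking them as candidates for aggregation (other contents abbreviated):

```
data_structure
_audit.dataset_id    expt7_2021-11-20
_cell.length_a       4.0495

data_calibration
_audit.dataset_id    expt7_2021-11-20
# ... calibration data names ...
```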

I would of course welcome some bulletproof yet practical way to link blocks together in a CIF way.

Let's sort this piece of the puzzle out before the zero-column keys and parent-child mechanism.

all the best,

On Sat, 20 Nov 2021 at 03:17, Bollinger, John C <John.Bollinger@stjude.org> wrote:

Dear DDLm group,


Comments inline below.


On Thursday, November 18, 2021 8:41 PM, James Hester wrote:

On Wed, 17 Nov 2021 at 03:41, Bollinger, John C <John.Bollinger@stjude.org> wrote:

 3. It might be useful to clarify that writing data blocks using the _audit.schema 'Base' implies that all the categories on which each category depends -- both Set categories and related Loop categories -- are presented in the same data block.  In relational terms, one might say that each data block provides a distinct, implicit key value that associates the categories presented within.  I would like to avoid giving the impression that multiple data blocks, each specifying _audit.schema 'Base' and valid against (say) the Core dictionary, and without any duplicate items or key conflicts among them, can or should be interpreted the same as a single data block containing the union of the multiple blocks’ contents.


I'm not sure I follow. If a powder diffraction experiment splits the structures of, say, 3 component phases over 3 blocks + 1 block for the invariable information, all conforming to 'Base', isn't that information identical to presenting it all in a single block (no longer conforming to 'Base') with appropriate key data names added?
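(For concreteness, the merged single-block form described here might look roughly like the following, with an illustrative key data name added to the formerly-Set category; the names are invented for this example:)

```
data_merged
loop_
    _cell.phase_id
    _cell.length_a
     1    4.0495
     2    5.4310
     3    9.7610
```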


Perhaps, then, it is me who does not follow.  Let me try again.


I do not think it is the intention to allow CIF data to be split _arbitrarily_ among data blocks, but if it is, then I object to the “arbitrarily” part.  Up to now, different CIF data items have been associated with each other at a basic level by appearing in the same data block.  I understand that some communities may have layered additional conventions on top of that, but no such additions have been baked into CIF or any of our DDLs.  I want to maintain the data block as a coherent unit of data, such that splitting this …



data_example
_set_category_1.x  5.00
_set_category_2.a  'example'

... into this …

data_example_part_1
_set_category_1.x  5.00

data_example_part_2
_set_category_2.a  'example'

... loses information (that the two items are associated with each other) except in cases that are explicitly provided for by dictionaries and conveyed by data explicitly presented among the relevant collection of data blocks.


The message that one can split data across different blocks should not be de-contextualized or oversold.



4. Speaking of data blocks defining an implicit key, I think the draft overemphasizes the relationships between Loop categories and Set categories.  When considering _audit.schema values other than 'Base', one has to recognize and account for the fact that there are relationships between pairs of Set categories, too.  These tend to be weak in the Core dictionary because it is fairly well factored, but for an example, take Set categories _exptl_crystal and _chemical_formula.  There is a non-trivial dependency there via (at least) _exptl_crystal.density_diffrn.  Also along these lines, it would be appropriate to say not that Set categories *may be* equipped with a category key, but that they *are* equipped with one.  If that can’t be considered technically correct, then we should make it so.  We could introduce the possibility of a zero-column key for this, which would offer some mathematical consistency both with there being only one possible category key value for Set categories, and with the effects of expanding that key with additional columns.


What is a zero-column key?  Is that like an implicit key with no actual values stored?



I’m not sure it’s a concept that anyone else uses, but it’s a simple generalization of standard ideas:


A key for a relation consists of some subset of attributes of that relation (columns).  No two distinct rows of the relation can match in every key attribute, so in the absence of other constraints, as many rows can be present as there are distinct combinations of values drawn from the key attributes’ domains.


Now suppose we want a relation that is restricted to a single row.  One way to do that would be to give the relation an attribute whose domain contains only one value, and to designate that as the only key attribute. But in most cases that’s artificial and untidy.


There is a cleaner and simpler alternative: designate a key consisting of _zero_ attributes.  There is only one distinct combination of zero values: the empty set / tuple / dictionary. Therefore, a zero-attribute (zero-column) key affords only one row.  If we contemplate adding key attributes to a category, then adding them to an existing zero-attribute key is both logically and structurally simpler than converting a category that does not have a key at all into one that does.
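As a sketch of this idea outside CIF, here is a small Python illustration (not part of any CIF software; the function and category names are invented). A relation is modelled as a mapping from key-value tuples to rows: with one key column many rows are possible, while with zero key columns every row has the same key, the empty tuple, so only one row can exist.

```python
def insert(relation, key_columns, row):
    """Insert a row (a dict) into a relation (a dict keyed by key-value
    tuples). Raise ValueError on a key collision, as a relational key
    requires."""
    key = tuple(row[c] for c in key_columns)
    if key in relation:
        raise ValueError(f"duplicate key {key!r}")
    relation[key] = row

# A Loop-style category with a one-column key admits many rows ...
atoms = {}
insert(atoms, ["label"], {"label": "C1", "x": 0.1})
insert(atoms, ["label"], {"label": "C2", "x": 0.2})

# ... whereas a Set-style category with a zero-column key admits
# exactly one: the second insert collides on the empty-tuple key ().
cell = {}
insert(cell, [], {"length_a": 5.43})
try:
    insert(cell, [], {"length_a": 6.10})
except ValueError:
    pass  # collision, as expected
```

Expanding the zero-column key to a one-column key is then just a change in `key_columns`, which is the quantitative (rather than qualitative) difference argued for above.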


For DDLm dictionaries, that concept could be applied to give category keys to Set categories without defining any new attributes in those categories.  Where Set categories need to be changed, possibly dynamically, into Loop categories, that’s made simpler if the fundamental difference is quantitative (the number of key attributes) rather than qualitative.



8. I disfavor relying on parent categories to identify their child categories.  That approach already constrains how DDL2 dictionaries may be supplemented by extension dictionaries, even in the more constrained context where Set categories do not need to participate in child-declaration, and especially if one wants to use multiple extension dictionaries together.  I just don’t see it being sustainable in an environment where we must consider substantially every category to be a potential Loop category.  A plan that localizes the required definition changes as much as possible is to be preferred.  As an alternative, it may be useful to come up with a standard way to encode the additional dependency information into DDLm dictionaries *now*.  That could at least provide for automating the generation of the needed additional definitions in extension dictionaries.


So how about reversing it, and the child categories instead identify their parent categories using a new DDLm attribute?



Of course, we already have exactly this for Loop – Loop relationships, and we will need to use it for Loop – Set_turned_into_Loop relationships, at least conceptually.  It takes the form of the _name.linked_item_id of an item with _type.purpose Link.  I think it’s sensible to handle Loop – Set relationships analogously.  It’s too late to design a single mechanism that could handle both, but that would have been ideal.
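A definition fragment handling a Loop – Set link analogously might look roughly like this, using the imaginary twinning example discussed below (abbreviated; the data names '_refln.twin_id' and '_twin.id' are invented for illustration, though _name.linked_item_id and _type.purpose Link are existing DDLm attributes):

```
save_refln.twin_id
    _definition.id        '_refln.twin_id'
    _name.category_id     refln
    _name.object_id       twin_id
    _name.linked_item_id  '_twin.id'
    _type.purpose         Link
save_
```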



This would still require extension dictionaries to add information to core categories from time to time. One example might be an imaginary twinning dictionary that introduces 'twin_id' in category 'twin'. Before this dictionary, the 'refln' category implicitly assumed a single value of this identifier, so the dictionary would redefine 'refln' to also depend on 'twin', as would 'diffrn_refln' and some others.  The test comes when e.g. the modulated structures dictionary does not know about the existence of the 'twin' dictionary and redefines 'refln' in its own way.


This missing information is, however, not a problem as it simply retains the meaning of 'single individual twin' for a modulated structure. If someone wants to describe a twinned modulated structure, then the modulated structures dictionary categories can be updated accordingly, and as long as the 'Base' schema is retained legacy software will be OK.  We still retain the option of explicitly defining the parent/child data names for complex situations.



I acknowledge that we can expect that extensions will still sometimes need to define modifications to core categories.  I am ok with that in principle, and although it will require some care in practice, I think it is workable.



On reflection, the original extension dictionary mechanism (adding explicit key data names to child categories) was really just creating these dependencies in child categories, but at the cost of a proliferation of extension dictionaries (e.g. a modulated-structures-twin dictionary, a modulated-structures-Laue-twin dictionary, etc.). It seems much neater to gradually expand the lists of "parent" categories in child categories within the single dictionary as the need arises. If we agree on this approach I'll draft a definition for a new DDLm attribute that we can discuss.



As may be evident from my previous comments, I’m not sure I recognize a distinction between an original extension mechanism and a new one.  At minimum, I guess I have not been viewing the multi-block proposal as a dictionary extension mechanism, though upon reflection, I see how it has a form of that rolled in.  If you are satisfied that we have enough common ground to consider specifics of a DDLm attribute then I would be happy to have that conversation.



Best regards,




ddlm-group mailing list
