Discussion List Archives


Re: [ddlm-group] Multi block principles

  • To: "james.r.hester@gmail.com" <james.r.hester@gmail.com>, "Group finalisingDDLm and associated dictionaries" <ddlm-group@iucr.org>
  • Subject: Re: [ddlm-group] Multi block principles
  • From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
  • Date: Fri, 19 Nov 2021 16:16:59 +0000
  • In-Reply-To: <CAM+dB2frS4Xg7fhxy5GQcw5t0WJ+pvia-HvqwLAzC-ySZoU+QQ@mail.gmail.com>
  • References: <CAM+dB2fajH1c1vhrCJU9v-QQw0kt4Y2udDEx4HBK9QzDq=LD3w@mail.gmail.com><CH2PR04MB6950E54AF550C819FF598F35E0999@CH2PR04MB6950.namprd04.prod.outlook.com><CAM+dB2frS4Xg7fhxy5GQcw5t0WJ+pvia-HvqwLAzC-ySZoU+QQ@mail.gmail.com>

Dear DDLm group,

 

Comments inline below.

 

On Thursday, November 18, 2021 8:41 PM, James Hester wrote:

On Wed, 17 Nov 2021 at 03:41, Bollinger, John C <John.Bollinger@stjude.org> wrote:

 3. It might be useful to clarify that writing data blocks using the _audit.schema 'Base' implies that all the categories on which each category depends -- both Set categories and related Loop categories -- are presented in the same data block.  In relational terms, one might say that each data block provides a distinct, implicit key value that associates the categories presented within.  I would like to avoid giving the impression that multiple data blocks, each specifying _audit.schema 'Base' and valid against (say) the Core dictionary, and without any duplicate items or key conflicts among them, can or should be interpreted the same as a single data block containing the union of the multiple blocks’ contents.

 

I'm not sure I follow. If a powder diffraction experiment splits the structures of, say, 3 component phases over 3 blocks + 1 block for the invariable information, all conforming to 'Base', isn't that information identical to presenting it all in a single block (no longer conforming to 'Base') with appropriate key data names added?

 

Perhaps, then, it is me who does not follow.  Let me try again.

 

I do not think it is the intention to allow CIF data to be split _arbitrarily_ among data blocks, but if so, then I do object to the “arbitrarily” part.  Up to now, different CIF data items have been associated with each other at a basic level by appearing in the same data block.  I understand that some communities may have layered additional conventions on top of that, but no such additions have been baked into CIF or any of our DDLs.  I want to maintain the data block as a coherent unit of data, such that splitting this …

data_combined
_set_category_1.x  5.00
_set_category_2.a  'example'

... into this ...

data_1
_set_category_1.x  5.00

data_2
_set_category_2.a  'example'

... loses information (that the two items are associated with each other) except in cases that are explicitly provided for by dictionaries and conveyed by data explicitly presented among the relevant collection of data blocks.
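As a rough illustration only, the loss can be sketched in Python. The model here is an assumption made for this sketch and is not part of CIF or any DDL: a data block is a dictionary of item names to values, a file is a dictionary of block codes to blocks, and the helper names `associated` and `naive_union` are hypothetical.

```python
# Hypothetical model: a CIF file is a dict of block code -> {item name: value}.
combined = {
    "combined": {"_set_category_1.x": "5.00", "_set_category_2.a": "example"},
}
split = {
    "1": {"_set_category_1.x": "5.00"},
    "2": {"_set_category_2.a": "example"},
}


def associated(cif, name_a, name_b):
    """True if some block presents both items, i.e. they share an implicit key value."""
    return any(name_a in items and name_b in items for items in cif.values())


def naive_union(cif):
    """Merge every block into one dict, discarding block membership."""
    merged = {}
    for items in cif.values():
        merged.update(items)
    return merged


# The item values survive the split, but the association between them does not.
same_items = naive_union(combined) == naive_union(split)
linked_before = associated(combined, "_set_category_1.x", "_set_category_2.a")
linked_after = associated(split, "_set_category_1.x", "_set_category_2.a")
```

In this toy model the union of the split blocks equals the contents of the combined block, yet only the combined form records that the two items belong together. Recovering that from the split form would need extra data carried explicitly in the blocks.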

 

The message that one can split data across different blocks should not be de-contextualized or oversold.

 

 

4. Speaking of data blocks defining an implicit key, I think the draft overemphasizes the relationships between Loop categories and Set categories.  When considering _audit.schema values other than 'Base', one has to recognize and account for the fact that there are relationships between pairs of Set categories, too.  These tend to be weak in the Core dictionary because it is fairly well factored, but for an example, take Set categories _exptl_crystal and _chemical_formula.  There is a non-trivial dependency there via (at least) _exptl_crystal.density_diffrn.  Also along these lines, it would be appropriate to say not that Set categories *may be* equipped with a category key, but that they *are* equipped with one.  If that can’t be considered technically correct, then we should make it so.  We could introduce the possibility of a zero-column key for this, which would offer some mathematical consistency both with there being only one possible category key value for Set categories, and with the effects of expanding that key with additional columns.

 

What is a zero-column key?  Is that like an implicit key with no actual values stored?

 

 

I’m not sure it’s a concept that anyone else uses, but it’s a simple generalization of standard ideas:

 

A key for a relation consists of some subset of attributes of that relation (columns).  No two distinct rows of the relation can match in every key attribute, so in the absence of other constraints, as many rows can be present as there are distinct combinations of values drawn from the key attributes’ domains.

 

Now suppose we want a relation that is restricted to a single row.  One way to do that would be to give the relation an attribute whose domain contains only one value, and to designate that as the only key attribute. But in most cases that’s artificial and untidy.

 

There is a cleaner and simpler alternative: designate a key consisting of _zero_ attributes.  There is only one distinct combination of zero values: the empty set / tuple / dictionary. Therefore, a zero-attribute (zero-column) key affords only one row.  If we contemplate adding key attributes to a category, then adding them to an existing zero-attribute key is both logically and structurally simpler than converting a category that does not have a key at all into one that does.
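A minimal Python sketch of the idea, under the assumption that a relation can be modelled as a dictionary from key-value tuples to rows. The class `Relation`, its methods, and the attribute names `x` and `id` are all hypothetical, chosen only for illustration.

```python
class Relation:
    """Toy relation: rows are dicts, the key is a tuple of attribute names."""

    def __init__(self, key=()):
        self.key = tuple(key)
        self.rows = {}  # key-value tuple -> row

    def insert(self, row):
        # With zero key attributes this tuple is always (), so only one row fits.
        key_value = tuple(row[attr] for attr in self.key)
        if key_value in self.rows:
            raise KeyError(f"duplicate key value {key_value!r}")
        self.rows[key_value] = dict(row)

    def extend_key(self, attr, value):
        """Return a relation whose key has one more attribute; existing rows get `value`."""
        extended = Relation(self.key + (attr,))
        for row in self.rows.values():
            extended.insert({**row, attr: value})
        return extended


set_like = Relation()                       # zero-attribute key: at most one row
set_like.insert({"x": 5.0})
loop_like = set_like.extend_key("id", "a")  # one key attribute: many rows allowed
loop_like.insert({"id": "b", "x": 6.0})
```

A second `insert` into `set_like` would raise `KeyError`, because every row maps to the same empty key tuple. Extending the key needs no special case for the zero-attribute starting point, which is the structural simplicity argued for above.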

 

For DDLm dictionaries, that concept could be applied to give category keys to Set categories without defining any new attributes in those categories.  Where Set categories need to be changed, possibly dynamically, into Loop categories, that’s made simpler if the fundamental difference is quantitative (the number of key attributes) rather than qualitative.

 

 

8. I disfavor relying on parent categories to identify their child categories.  That approach already constrains how DDL2 dictionaries may be supplemented by extension dictionaries in the more constrained context of Set categories not needing to participate in child-declaration, especially if one wants to use multiple extension dictionaries together.  I just don’t see it being sustainable in an environment where we must consider substantially every category to be a potential Loop category.  A plan that localizes the required definition changes as much as possible is to be preferred.  As an alternative, it may be useful to come up with a standard way to encode the additional dependency information into DDLm dictionaries *now*.  That could at least provide for automating the generation of the needed additional definitions in extension dictionaries.

 

So how about reversing it, and the child categories instead identify their parent categories using a new DDLm attribute?

 

 

Of course, we already have exactly this for Loop – Loop relationships, and we will need to use it for Loop – Set_turned_into_Loop relationships, at least conceptually.  It takes the form of the _name.linked_item_id of an item with _type.purpose Link.  I think it’s sensible to handle Loop – Set relationships analogously.  It’s too late to design a single mechanism that could handle both, but that would have been ideal.
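As a hedged sketch of how child-declared links localize the dependency information, the Python below derives each category's parent categories from its own item definitions. The attribute names `_definition.id`, `_name.category_id`, `_name.linked_item_id` and `_type.purpose` are DDLm attributes. The two definitions themselves (`_twin.id`, `_refln.twin_id`) are imaginary, loosely following the twinning example in this thread, and the function `parent_categories` is hypothetical.

```python
from collections import defaultdict

# Imaginary item definitions, reduced to the attributes relevant here.
definitions = [
    {
        "_definition.id": "_twin.id",
        "_name.category_id": "twin",
        "_type.purpose": "Key",
    },
    {
        "_definition.id": "_refln.twin_id",
        "_name.category_id": "refln",
        "_type.purpose": "Link",
        "_name.linked_item_id": "_twin.id",
    },
]


def parent_categories(defs):
    """Map each child category to the categories its Link items point at."""
    category_of = {d["_definition.id"]: d["_name.category_id"] for d in defs}
    parents = defaultdict(set)
    for d in defs:
        if d.get("_type.purpose") == "Link":
            # Assumes the linked item is defined in the same collection.
            target = d["_name.linked_item_id"]
            parents[d["_name.category_id"]].add(category_of[target])
    return dict(parents)


links = parent_categories(definitions)
```

Because each link lives in the child's definition, an extension dictionary can add a dependency by adding one item to the child category, without editing the parent category's definition.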

 

 

This would still require extension dictionaries to add information to core categories from time to time. One example might be an imaginary twinning dictionary that introduces 'twin_id' in category 'twin'. Until this dictionary, the 'refln' category implicitly assumed a single value of this identifier, so the dictionary would redefine 'refln' to also depend on 'twin', as would 'diffrn_refln' and some others.  The test comes when e.g. the modulated structure dictionary does not know about the existence of the 'twin' dictionary and redefines 'refln' in its own way.

 

This missing information is, however, not a problem as it simply retains the meaning of 'single individual twin' for a modulated structure. If someone wants to describe a twinned modulated structure, then the modulated structures dictionary categories can be updated accordingly, and as long as the 'Base' schema is retained legacy software will be OK.  We still retain the option of explicitly defining the parent/child data names for complex situations.

 

 

I acknowledge that we can expect that extensions will still sometimes need to define modifications to core categories.  I am ok with that in principle, and although it will require some care in practice, I think it is workable.

 

 

On reflection, the original extension dictionary mechanism (adding explicit key data names to child categories) was really just creating these dependencies in child categories, but at the cost of proliferation of extension dictionaries (e.g. a modulated-structure-twin dictionary, a modulated-structures-laue-twin dictionary etc.). It seems much neater to simply gradually expand the lists of "parent" categories in child categories within the single dictionary as the need arises. If we are agreeable with this approach I'll draft a definition for a new DDLm attribute that we can discuss.

 

 

As may be evident from my previous comments, I’m not sure I recognize a distinction between an original extension mechanism and a new one.  At minimum, I guess I have not been viewing the multi-block proposal as a dictionary extension mechanism, though upon reflection, I see how it has a form of that rolled in.  If you are satisfied that we have enough common ground to consider specifics of a DDLm attribute then I would be happy to have that conversation.

 

 

Best regards,

 

John

 


