Re: [ddlm-group] Multi block principles
- To: "Bollinger, John C" <John.Bollinger@stjude.org>
- Subject: Re: [ddlm-group] Multi block principles
- From: James H <jamesrhester@gmail.com>
- Date: Fri, 26 Nov 2021 18:17:10 +1100
- Cc: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- In-Reply-To: <CH2PR04MB69507BF8620D6DC960DECEB6E0609@CH2PR04MB6950.namprd04.prod.outlook.com>
(i) Multiple data blocks are aggregated into a data set by being presented in the same data container (e.g. file, zip archive, directory).
(ii) The following backwards-compatible assumptions are made for DDLm:
* A "Set" category is a "Loop" category for which only one row may be presented in a data block
* The set of "Set" categories for a data block is determined by _audit.schema
* The key data name of a "Set" category may be omitted from the dictionary if it is never referred to explicitly elsewhere in the dictionary (e.g. using _name.linked_item_id)
* If a key data name of a "Set" category is omitted from the dictionary, child relationships with that key data name *must* be defined using a new category-level DDLm attribute in order for multi-block presentation of that Set category to be possible.
(iii) If a Set category key is absent from a data file, it and its child data names may be assigned a unique, arbitrary value. If any data blocks need to refer to Set category values in other data blocks, explicit values for Set category key data names must necessarily be provided in both the dictionary and data block.
(iv) If a Set category key data name is absent from a dictionary and data file, values are populated as for (iii) as if arbitrary, unique data names existed for the key data name and children.
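As an illustration only, the key assignment described in (iii) and (iv) might be sketched in Python as follows. The in-memory layout (block name mapped to category mapped to attribute values), the function name `aggregate_set_category`, and the `auto1`, `auto2` naming pattern are all assumptions made for this sketch; nothing here comes from DDLm or from existing CIF software. Propagating the generated value to child data names is not shown.

```python
import itertools


def aggregate_set_category(blocks, category, key_attr="id"):
    """Return one row per data block for `category`, each with a key value.

    `blocks` maps a block name to {category: {attribute: value}}
    (a hypothetical layout chosen for this sketch).
    """
    # Key values that were given explicitly must never be reused.
    used = {
        block[category][key_attr]
        for block in blocks.values()
        if category in block and key_attr in block[category]
    }
    counter = itertools.count(1)
    rows = []
    for block in blocks.values():
        if category not in block:
            continue
        row = dict(block[category])
        if key_attr not in row:
            # Principle (iii): assign an arbitrary value that is unique
            # within the aggregate.
            candidate = f"auto{next(counter)}"
            while candidate in used:
                candidate = f"auto{next(counter)}"
            used.add(candidate)
            row[key_attr] = candidate
        rows.append(row)
    return rows


blocks = {
    "first": {"cell": {"length_a": "6.000(2)"}},
    "second": {"cell": {"id": "auto1", "length_a": "5.999(2)"}},
}
cells = aggregate_set_category(blocks, "cell")
```

Here the first block's cell receives a generated key that avoids the explicit `auto1` already present in the second block, so the two rows can coexist in a single looped category.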
And let me for the sake of comparison provide the "fallback" scheme which we originally envisaged:
(i) No aggregation mechanism was specified
(ii) No principle for distribution among data blocks was specified
(iii) All multi-row Set categories are explicitly provided with key data names by extension dictionaries
(iv) All categories with child data names of Set category key data names are explicitly provided with them
Our goal as I see it is to resolve (i) and (ii), as well as to minimise the work involved in (iii) and (iv).
The constraints we are operating under are:
(i) Current single-data-block CIF data files must remain valid
(ii) If possible the approaches adopted by pdCIF and msCIF should remain valid
(iii) Data spread between multiple data blocks must map into a strictly relational structure
(iv) dREL methods must either remain valid or be updated
Note that I haven't mentioned constraint (iv) previously as I had overlooked it. dREL as we have developed it allows references to values of data names in other categories to be resolved implicitly based on parent-child relationships of key data names, so as long as these relationships are unambiguous, methods will generally not need to be redefined.
Going back to our "fallback" scheme, my assessment of our current situation is that there is approximate agreement on aggregation (+/- an extra data block) and on how best to distribute data between data blocks. We are not so sure about the best way to minimise dictionary writing. The updated bare bones approach above requires the use of extension dictionaries, but these are in any case unavoidable when completely new categories are defined (e.g. PD_PHASE in powder diffraction). Such dictionaries would in addition add dependencies on the new categories to core categories, but would not always have to write out every key data name definition.
Further comments inline below.
Dear DDLm Group,
James and I seem to agree that it is essential for any multiblock aggregation mechanism to address how to define the relational structure of the aggregate data set. I take this to be consistent with Herbert’s view as well.
I am not especially tied to the particular model mechanism I proposed, but I do think that working out a functional model aggregation mechanism is an important exercise for clarifying the issues involved and focusing on possible solutions. Working out two such mechanisms would be even better. From that perspective, I offer additional comments inline below.
The _cif_multiblock proposal should be contrasted with the bare-bones proposal below, which requires dictionary authors to specify ahead of time inter-category dependencies. For example, in John's example multiple chemical formulae require multiple cells. Is this "universal"? Are there different perspectives that are contradictory? Without attempting to determine such implicit inter-category relationships in the core dictionary I'm not sure.
For the record, in the example, there is lexically one formula and multiple cells. This might arise, for example, in a diffraction experiment where the specimen is a multicrystal of one compound, with each component having been indexed independently. Supporting this sort of data de-duplication is one of the multiblock objectives that was presented earlier. I acknowledge, however, it may not be apparent from my short-form description why the given procedure results in multiple CHEMICAL_FORMULA rows when the input contains only one.
#####
data_flattened
_chemical_formula.sum C6H12O6
loop_
_cell.id #arbitrary cell identifier
_cell.length_a
_cell.length_b
_cell.length_c
_cell.angle_alpha
_cell.angle_beta
_cell.angle_gamma
xyz 6.000(2) 7.000(2) 8.000(2) 90 100.0(3) 90
abc 5.999(2) 7.003(2) 8.000(3) 90 100.1(3) 90
#######
I claim that we know from the coexistence of their definitions in the Core dictionary that there is or can be a formula that goes with each cell. In that sense, yes, it is universal. But on the other hand, what does it matter? I infer that the concern is that it might be possible to express nonsensical relationships, but what if it is? I am not much concerned with an opening for creating bad data as long as it is not unreasonably difficult to create good data. I am much more interested in the ability to express all the information I want to convey.
I think a more fundamental need is to establish the requirements and responsibilities for aggregation mechanisms in general. Here is my initial cut at a list of mandatory, desirable, and additional characteristics for a viable aggregation mechanism:
1. It must identify all of the data blocks comprised by an aggregate data set. This is the essential and most fundamental requirement.
2. It must designate all data names needed for forming or expanding category keys for the aggregate data set.
3. It must specify values for the added key attributes on a per-datablock basis. Together with the previous, this is what makes a simple aggregate into a data set.
4. It should not depend on adding items to component data blocks.
5. It may define an overall data set identifier, though that is not required.
6. It should be machine actionable. This is not an essential characteristic in any abstract sense, but I don’t see much scope for an aggregation scheme that is not machine actionable being of interest.
7. It may itself be based on CIF syntax, but that is not required. I see no inherent reason to insist on a specific machine-actionable form for the aggregation metadata.
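To make the list above concrete, here is a minimal Python sketch of aggregation metadata held outside the component data blocks (item 4) in a machine-actionable, non-CIF form (items 6 and 7). The class name `AggregationManifest`, its field names, and the consistency check are assumptions for illustration, not part of any proposal in this thread. The check also assumes that every block receives a value for every designated key data name, which is stricter than item 3 strictly requires.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AggregationManifest:
    block_ids: List[str]                   # item 1: every member data block
    key_names: Dict[str, List[str]]        # item 2: category -> added key data names
    key_values: Dict[str, Dict[str, str]]  # item 3: block id -> {data name: value}
    dataset_id: Optional[str] = None       # item 5: optional overall identifier

    def problems(self) -> List[str]:
        """List inconsistencies between the designated keys and the values."""
        issues = []
        designated = {n for names in self.key_names.values() for n in names}
        for block_id in self.block_ids:
            given = set(self.key_values.get(block_id, {}))
            for name in sorted(designated - given):
                issues.append(f"{block_id}: no value for {name}")
            for name in sorted(given - designated):
                issues.append(f"{block_id}: {name} is not a designated key data name")
        for block_id in self.key_values:
            if block_id not in self.block_ids:
                issues.append(f"{block_id}: values given for an unlisted block")
        return issues


manifest = AggregationManifest(
    block_ids=["first", "second"],
    key_names={"cell": ["_cell.id"]},
    key_values={"first": {"_cell.id": "xyz"}, "second": {"_cell.id": "abc"}},
)
```

Calling `manifest.problems()` returns an empty list when each listed block has a value for each designated key data name.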
Commenting on these points inline caused gmail to fiddle with the numbering, so I'll collect the comments here:
(1) Agreed. If I place a tar.gz file at the end of a DOI, I think I have identified all components of an aggregate data set (everything in the archive).
I agree that “all the contents of the archive” or “all the contents of the directory” or “all the contents of the CIF” are possible ways of identifying the components of a multiblock data set, but I do not accept that those are necessary interpretations of data delivered in an archive, or in a directory, or in a CIF containing multiple data blocks. Nor do I accept that the whole contents of X is always the best or most appropriate way to identify components of a multiblock set. There is room for different mechanisms to do this differently, or for a given mechanism to allow multiple options.
(2) We can specify that either a Set category key data name is given explicitly (e.g. phase_id is provided in each data block), or else an arbitrary value distinct from all other values for that data name in the aggregate is chosen. In that case there is no need for a separate designation, I would have thought.
At present, DDLm Set categories do not have category keys (unless we accept some kind of anonymous implicit key). DDLm expressly specifies that _category_key.name gives a data name that is part of a category key for a Loop category, and the DDLm core dictionary accordingly does not define category keys for Set categories. There are no existing key data names available to be given for Set categories, so if we want any then we need either to generate them or to designate them. And it’s not really any different with Loop categories. Although these do have category keys, it is a natural use case that we would want to present multiblock data in which one or more loop categories’ keys are expanded with additional data names. For example, a multiblock data set providing information about multiple distinct structures might need to present ATOM_SITE data for each one, which would require expanding the ATOM_SITE key to distinguish among the rows for the different structures and to avoid the risk of duplicate keys.
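The ATOM_SITE case can be illustrated with a short Python sketch. It assumes rows are plain dictionaries, that the existing key is the atom label, and that the added key attribute is called `structure_id`; that name and both helper functions are hypothetical and chosen only for this example.

```python
def has_duplicate_keys(rows, key_attrs):
    """True if two rows share the same values for all of `key_attrs`."""
    keys = [tuple(row[a] for a in key_attrs) for row in rows]
    return len(keys) != len(set(keys))


def expand_key(blocks, extra_attr):
    """Combine rows from all blocks, tagging each with its source block."""
    return [
        {extra_attr: block_name, **row}
        for block_name, rows in blocks.items()
        for row in rows
    ]


# Two structures that both contain an atom labelled C1.
atom_site_blocks = {
    "structure_1": [
        {"label": "C1", "type_symbol": "C"},
        {"label": "O1", "type_symbol": "O"},
    ],
    "structure_2": [{"label": "C1", "type_symbol": "C"}],
}
combined = expand_key(atom_site_blocks, "structure_id")
```

With the label alone as the key, `has_duplicate_keys(combined, ["label"])` reports a clash between the two `C1` rows; with the expanded key `["structure_id", "label"]` the rows are distinguishable.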
But I feel like I may have lost the thread of the conversation. I thought one of the objectives was to avoid creating permanent, formal extension dictionaries to support key extensions for the various multiblock scenarios that are not adequately supported for category keys as defined in our present dictionaries. A combinatorial argument against that was raised, and an argument about how that would result in adding key data names that we would prefer to avoid including literally in multiblock CIF documents. So what is the target, actually?
So again, (ii) supposes that there are Set category keys, and in particular, keys that can take multiple values. I will understand this as implying that the relevant categories are actually Loop categories in a possibly-virtual extension dictionary obtained by converting some of the Set categories from a base dictionary. Furthermore, this then seems to be proposing that contrary to our usual expectations for valid CIF, data for a category may be presented without presenting the (full) category key, at least in the special case where at most one value is presented for each attribute of the category in a given CIF. This is supported in part by engaging a unique key generator (notionally, at least).
Requirement (2) is met by (ii) by not eliding any key data names whose values are not arbitrary.
I think we have a misunderstanding here. By (2) I mean that the identities of all the data names composing each category key must be established deterministically by the aggregation scheme. This supposes that these might be a superset of those defined by any given category’s dictionary definition. The “bare bones” scheme seems to assume that these are always defined in an existing dictionary, no extension required. That does satisfy (2), but that has little to do with the ability or duty to elide presentation of any data names.
Note that such values a priori existed if the data made sense before output, regardless of any standard we might be talking about, because if the values are not arbitrary they must be referred to by a defined data name, and for that reference to make sense in software it must have a referent.
I do not accept that. At minimum, collocation in the same CIF data block establishes relationships between data from different categories that are not expressed via any data name. Those relationships need to be preserved when multiple blocks are combined into a larger data set, without also establishing unwanted relationships. That will sometimes require new data names to be chosen and retained for which no data items exist in the dictionary or in component data blocks.
I think there are other plausible cases, too.
Suppose, in the scheme proposed by John, that the special _cif_multiblock data block is absent in an aggregate of data, but the requirements in (ii) above are met: as far as I can tell software is able to populate a full relational schema without problems. So I'd like to see a case where the cif_multiblock block provides information that the bare bones approach would not.
The example I presented is already such a case, as understood in relation directly to the (DDLm) Core dictionary. The CHEMICAL_FORMULA category does not have a category key containing any data names, so no key value for it can be expressed explicitly in the data, neither for the CHEMICAL_FORMULA data themselves, nor for the CELL data to refer to. Additionally, the CELL category does not have a category key containing any data names either, and without that, the combined data set, expressing two rows of CELL data, cannot be valid against the Core dictionary.
At minimum, to be comparable with the cif_multiblock scheme, the bare bones scheme needs to be extended with a description of how and under what circumstances category keys are created / expanded, and what the validity requirements are for the resulting combined data set.
No doubt a mechanism could be defined that does generate that result without reference to instructions included with the data, but the “bare bones” approach, as defined, does not seem to be such a mechanism. More on that below.
Where the ingesting software has followed the following logic:
1. Set category "chemical_formula" is repeated, with differing values, and no category key value is provided, so arbitrary, unique values for the key data name are assigned (xyz and abc)
2. Set category "cell" is repeated. As it depends on "chemical_formula", and this is repeated, no further action is taken.
3. The list of dependent categories for "chemical_formula" is provided with values for the child data names of chemical_formula.id
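The three steps above can be sketched in Python as follows. This is a rough illustration under stated assumptions: `DEPENDENTS` stands in for a dictionary-supplied list of dependent categories, the child data name is formed as `<parent>_id`, and arbitrary keys are drawn from `uuid.uuid4()`; none of these names come from DDLm. The sketch only handles blocks that present the parent category, so it says nothing about blocks that contain a dependent category without its parent.

```python
import uuid

# Hypothetical stand-in for a dictionary-supplied list of dependent categories.
DEPENDENTS = {"chemical_formula": ["cell"]}


def ingest(blocks, parent, dependents=DEPENDENTS):
    """Build one table per category from a list of single-row data blocks."""
    child_attr = f"{parent}_id"  # hypothetical child data name
    tables = {parent: []}
    for child in dependents[parent]:
        tables[child] = []
    for block in blocks:
        if parent not in block:
            continue
        # Step 1: use the explicit key if present, else an arbitrary unique one.
        key = block[parent].get("id") or uuid.uuid4().hex
        tables[parent].append({**block[parent], "id": key})
        for child in dependents[parent]:
            if child in block:
                # Steps 2 and 3: the dependent category gets no key of its
                # own; it only receives the child data name of the parent key.
                tables[child].append({**block[child], child_attr: key})
    return tables


blocks = [
    {"chemical_formula": {"sum": "C6H12O6"}, "cell": {"length_a": "6.000(2)"}},
    {"chemical_formula": {"sum": "C12H22O11"}, "cell": {"length_a": "5.999(2)"}},
]
tables = ingest(blocks, "chemical_formula")
```

Each row of `tables["cell"]` then carries a `chemical_formula_id` equal to the `id` assigned to the formula from the same data block.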
I don’t see how this is consistent with “If any data blocks need to refer to Set category values in other data blocks, explicit values for Set category keys must necessarily be provided.” Explicit values for the Set category keys are *not* provided (and cannot be, because Set categories have no key data names), yet it is desired to form cross-data-block associations. It seems, then, like bare bones should not even be applicable to the example input. Or at least I assume that that provision is meant as a constraint on input, for if it were meant instead as a description of the result then it would leave a big gap in the “getting from here to there” department.
With respect to the specific steps above,
(1) No, the CHEMICAL_FORMULA category is _not_ repeated in the original multiblock representation. There is no key data name either, but if we suppose that a virtual one is synthesized then yes, we can choose a value for it to go with the one provided value of _chemical_formula.sum (just ‘xyz’, then).
(2) Yes, the CELL category is repeated, but why does it depend on CHEMICAL_FORMULA (see also below)? How does it matter to the bare bones procedure, as given, which categories are repeated or have multiple values?
(3) If CHEMICAL_FORMULA did have multiple values, as supposed, then it’s unclear to me how those would be matched to the CELL data.
I think we could define some automatic matching for simple cases like this, but I expect that such a scheme would struggle to handle more complex cases. I think it would also be difficult to handle cases where relationships could be formed, but should not be.
Importantly, steps (2) and (3) depend on a list of dependent categories being provided by the dictionary.
I certainly agree that steps (2) and (3) have such a dependency, but I don’t see why a dictionary would define one of the two categories involved as being dependent on the other, unless specifically to serve this particular pattern of multiblock combination. That was among the reasons that I chose these particular categories for the example. A chemical formula is meaningful without a unit cell. A unit cell is meaningful without a chemical formula. Neither has a functional dependency on the other.
And there is a variety of other pairs of categories having a bidirectional association that does not involve one being dependent on or subordinate to another. If we are to provide for multiblock aggregation schemes that rely on per-category lists of associated categories, then we need to think out the details carefully. For sure, “depends on” is not a sufficient predicate for deciding about the contents of such lists, unless we want only very narrowly scoped facilities, and probably only very few of them.
Best regards,
John