Re: [ddlm-group] Multi block principles
- To: "james.r.hester@gmail.com" <james.r.hester@gmail.com>
- Subject: Re: [ddlm-group] Multi block principles
- From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
- Date: Tue, 23 Nov 2021 22:40:53 +0000
- Cc: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- In-Reply-To: <CAM+dB2co6kOup_P9ZawKontOM+yE_CfFPh2NuVU0S6fvtT_T3w@mail.gmail.com>
Dear DDLm Group,

James and I seem to agree that it is essential for any multiblock aggregation mechanism to address how to define the relational structure of the aggregate data set. I take this to be consistent with Herbert’s view as well. I am not especially tied to the particular model mechanism I proposed, but I do think that working out a functional model aggregation mechanism is an important exercise for clarifying the issues involved and focusing on possible solutions.
Working out two such mechanisms would be even better. From that perspective, I offer additional comments inline below.

On Tuesday, November 23, 2021 1:08 AM, James H jamesrhester@gmail.com wrote:

> The _cif_multiblock proposal relies on the _cif_multiblock contents to define data names and values for any elided Set category key data names and their children. This has the advantage of allowing us to avoid defining lots of technical, uninformative data names in the dictionaries for Sets (and Loops), as the _cif_multiblocks would specify the names. The other advantage is that the data set creator has the flexibility to decide which categories are related, as only categories that required child data names of the Set categories would be collected in separate data blocks.

I largely agree with that assessment. I should say, though, that I am not particularly focused on distinctions between Set and Loop categories, which are functions of particular domain-dictionary projections of the universe of data. The existence of these somewhat artificial distinctions is one of the secondary issues we are dealing with.

> The _cif_multiblock proposal should be contrasted with the bare-bones proposal below, which requires dictionary authors to specify ahead of time inter-category dependencies. For example, in John's example multiple chemical formulae require multiple cells. Is this "universal"? Are there different perspectives that are contradictory? Without attempting to determine such implicit inter-category relationships in the core dictionary I'm not sure.
For the record, in the example, there is lexically one formula and multiple cells. This might arise, for example, in a diffraction experiment where the specimen is a multicrystal of one compound, with each component having been indexed independently. Supporting this sort of data de-duplication is one of the multiblock objectives that was presented earlier. I acknowledge, however, it may not be apparent from my short-form description why the given procedure results in multiple CHEMICAL_FORMULA rows when the input contains only one.

I claim that we know from the coexistence of their definitions in the Core dictionary that there is or can be a formula that goes with each cell. In that sense, yes, it is universal. But on the other hand, what does it matter? I infer that the concern is that it might be possible to express nonsensical relationships, but what if it is? I am not much concerned with an opening for creating bad data as long as it is not unreasonably difficult to create good data. I am much more interested in the ability to express all the information I want to convey.

> On Tue, 23 Nov 2021 at 06:34, Bollinger, John C <John.Bollinger@stjude.org> wrote:
> My general point is that the phase_id -> block_id mapping is surplus information as each data block restates its phase_id and the data blocks are always in the same CIF file. In other words you could completely delete the data block containing the phase_id -> block_id mapping and retain all the information as long as the data blocks stay in the same file, which is a good assumption as we don't usually worry about pieces of files becoming separated from one another.
>
> I'm not saying that such a mapping block should be ignored or not allowed, just that it is not strictly necessary.

A mapping block may not be needed in that particular case, but that case is a special one (or so I say – see below). I take pd_CIF to have been intentionally designed in a manner that supports that variation on multiblock aggregation, but if not, then the design is at least fortuitous in that regard. If we care only about such cases then we do not need anything more general or more uniform than we already have.
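
The observation quoted above, that a phase_id -> block_id mapping can be recovered from the data blocks themselves when each block restates its phase_id, can be sketched roughly as follows. This is only an illustration: the in-memory layout (block name mapped to a dictionary of items), the plain `phase_id` key, and the block names are assumptions made for the example, not pd_CIF definitions.

```python
def phase_to_block(blocks, phase_name="phase_id"):
    """Rebuild a phase_id -> block_id mapping from the blocks themselves.

    `blocks` is a hypothetical in-memory view of one multi-block CIF:
    {block name: {data name: value}}. `phase_name` is a stand-in for
    whatever data name carries the phase identifier.
    """
    mapping = {}
    for block_id, items in blocks.items():
        phase_id = items.get(phase_name)
        if phase_id is None:
            continue  # this block does not describe a phase
        if phase_id in mapping:
            # Two blocks restating the same phase_id would make the
            # derived mapping ambiguous, so refuse to guess.
            raise ValueError(f"phase {phase_id!r} stated by more than one block")
        mapping[phase_id] = block_id
    return mapping


# Illustrative data only.
example_blocks = {
    "phase_A": {"phase_id": "alpha", "length_a": "6.000(2)"},
    "phase_B": {"phase_id": "beta", "length_a": "5.999(2)"},
    "overall": {"creation_date": "2021-11-23"},
}
print(phase_to_block(example_blocks))
```

The sketch works only because every phase block carries its own identifier, which is the special case discussed here; it says nothing about aggregates in which no such data name exists.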
> Commenting on these points inline caused gmail to fiddle with the numbering, so I'll collect the comments here:
>
> (1) Agreed. If I place a tar.gz file at the end of a DOI, I think I have identified all components of an aggregate data set (everything in the archive).

I agree that “all the contents of the archive” or “all the contents of the directory” or “all the contents of the CIF” are possible ways of identifying the components of a multiblock data set, but I do not accept that those are necessary interpretations of data delivered in an archive, or in a directory, or in a CIF containing multiple data blocks. Nor do I accept that the whole contents of X is always the best or most appropriate way to identify components of a multiblock set. There is room for different mechanisms to do this differently, or for a given mechanism to allow multiple options.

> (2) We can specify that either a Set category key data name is given explicitly (e.g. phase_id is provided in each data block), or else an arbitrary value distinct from all other values for that data name in the aggregate is chosen. In that case there is no need for a separate designation I would have thought.

At present, DDLm Set categories do not have category keys (unless we accept some kind of anonymous implicit key). DDLm expressly specifies that _category_key.name gives a data name that is part of a category key for a Loop category, and the DDLm core dictionary accordingly does not define category keys for Set categories. There are no existing key data names available to be given for Set categories, so if we want any then we need either to generate them or to designate them. And it’s not really any different with Loop categories. Although these do have category keys, it is a natural use case that we would want to present multiblock data in which one or more loop categories’ keys are expanded with additional data names. For example, a multiblock data set providing information about multiple distinct structures might need to present ATOM_SITE data for each one, which would require expanding the ATOM_SITE key to distinguish among the rows for the different structures and to avoid the risk of duplicate keys.

But I feel like I may have lost the thread of the conversation. I thought one of the objectives was to avoid creating permanent, formal extension dictionaries to support key extensions for the various multiblock scenarios that are not adequately supported for category keys as defined in our present dictionaries. A combinatorial argument against that was raised, and an argument about how that would result in adding key data names that we would prefer to avoid including literally in multiblock CIF documents. So what is the target, actually?

> I note that there is no need for a special block name (_cif_multiblock) as the presence of the multiblock.* data names is sufficient to identify the "special" block.

I agree that a special data block name is not an essential component of that scheme. I think it serves a useful practical purpose to have such a name, but yes, it would be possible to identify the aggregation metadata block by its contents alone.

> From the same perspective, let me describe a "bare bones" scheme as a point of comparison:
>
> (i) Multiple data blocks are aggregated into a data set by being presented in the same data container (e.g. file, zip archive, directory).
>
> (ii) Set categories are single-row in each data block. If a Set category key is absent, it and its child data names may be assigned a unique, arbitrary value. If any data blocks need to refer to Set category values in other data blocks, explicit values for Set category keys must necessarily be provided.

So again, (ii) supposes that there are Set category keys, and in particular, keys that can take multiple values. I will understand this as implying that the relevant categories are actually Loop categories in a possibly-virtual extension dictionary obtained by converting some of the Set categories from a base dictionary. Furthermore, this then seems to be proposing that contrary to our usual expectations for valid CIF, data for a category may be presented without presenting the (full) category key, at least in the special case where at most one value is presented for each attribute of the category in a given CIF. This is supported in part by engaging a unique key generator (notionally, at least).

> So requirement (1) is met by (i).

Agreed.

> Requirement (2) is met by (ii) by not eliding any key data names whose values are not arbitrary.

I think we have a misunderstanding here. By (2) I mean that the identities of all the data names composing each category key must be established deterministically by the aggregation scheme. This supposes that these might be a superset of those defined by any given category’s dictionary definition. The “bare bones” scheme seems to assume that these are always defined in an existing dictionary, no extension required. That does satisfy (2), but that has little to do with the ability or duty to elide presentation of any data names.

> Note that such values a priori existed if the data made sense before output, regardless of any standard we might be talking about, because if the values are not arbitrary they must be referred to by a defined data name, and for that reference to make sense in software it must have a referent.

I do not accept that. At minimum, collocation in the same CIF data block establishes relationships between data from different categories that are not expressed via any data name. Those relationships need to be preserved when multiple blocks are combined into a larger data set, without also establishing unwanted relationships. That will sometimes require new data names to be chosen and retained for which no data items exist in the dictionary or in component data blocks. I think there are other plausible cases, too.

> Suppose, in the scheme proposed by John, that the special _cif_multiblock data block is absent in an aggregate of data, but the requirements in (ii) above are met: as far as I can tell software is able to populate a full relational schema without problems. So I'd like to see a case where the cif_multiblock block provides information that the bare bones approach would not.

The example I presented is already such a case, as understood in relation directly to the (DDLm) Core dictionary. The CHEMICAL_FORMULA category does not have a category key containing any data names, so no key value for it can be expressed explicitly in the data, neither for the CHEMICAL_FORMULA data themselves, nor for the CELL data to refer to. Additionally, the CELL category does not have a category key containing any data names either, and without that, the combined data set, expressing two rows of CELL data, cannot be valid against the Core dictionary.

At minimum, to be comparable with the cif_multiblock scheme, the bare bones scheme needs to be extended with a description of how and under what circumstances category keys are created / expanded, and what the validity requirements are for the resulting combined data set.

> Anyway, if I take John's starting example and remove the special _cif_multiblock category, changing nothing else, then reconstitute the data block according to the "bare bones" approach, it would look like:
> #####
> data_flattened
> loop_
> _chemical_formula.id
> _chemical_formula.sum
> xyz C6H12O6
> abc C6H12O6
> loop_
> _cell.chemical_formula_id
> _cell.length_a
> _cell.length_b
> _cell.length_c
> _cell.angle_alpha
> _cell.angle_beta
> _cell.angle_gamma
> xyz 6.000(2) 7.000(2) 8.000(2) 90 100.0(3) 90
> abc 5.999(2) 7.003(2) 8.000(3) 90 100.1(3) 90
> #######

No doubt a mechanism could be defined that does generate that result without reference to instructions included with the data, but the “bare bones” approach, as defined, does not seem to be such a mechanism. More on that below.
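
For illustration only, one mechanism of that general kind might look like the sketch below. It is not the "bare bones" scheme as stated, and every name in it is hypothetical: it assumes each data block carries its own single-row copy of the parent category, that a list of dependent categories is supplied from somewhere (the `dependents` argument), and that a synthesized key attribute called `id` is acceptable. It does not cover the case of one formula shared by several cells, and it silently ignores categories that appear in no dependency list.

```python
from itertools import count


def flatten(blocks, dependents):
    """Combine single-row categories from several blocks into multi-row loops.

    blocks:     list of {category: {attribute: value}}, one row per category
                per block (a hypothetical in-memory form of the data blocks).
    dependents: hypothetical {parent category: [child categories]} list.
    Returns {category: [row, ...]} with synthesized parent keys and
    matching "<parent>_id" attributes on the child rows.
    """
    ids = count(1)
    loops = {}
    for block in blocks:
        keys = {}
        # First pass: give every parent row in this block a key.
        for parent in dependents:
            if parent in block:
                row = dict(block[parent])
                if "id" not in row:
                    # Arbitrary generated value; no check is made here
                    # against explicitly supplied ids.
                    row["id"] = f"k{next(ids)}"
                keys[parent] = row["id"]
                loops.setdefault(parent, []).append(row)
        # Second pass: link children to the parent row from the same block.
        for parent, children in dependents.items():
            for child in children:
                if child in block and parent in keys:
                    row = dict(block[child])
                    row[f"{parent}_id"] = keys[parent]
                    loops.setdefault(child, []).append(row)
    return loops


# Illustrative input: two blocks, each with its own formula and cell.
example = [
    {"chemical_formula": {"sum": "C6H12O6"}, "cell": {"length_a": "6.000(2)"}},
    {"chemical_formula": {"sum": "C6H12O6"}, "cell": {"length_a": "5.999(2)"}},
]
result = flatten(example, {"chemical_formula": ["cell"]})
```

The only association the sketch can form is collocation in the same block, so any relationship that spans blocks would still have to be stated explicitly somewhere.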
> Where the ingesting software has followed the following logic:
>
> 1. Set category "chemical_formula" is repeated, with differing values, and no category key value is provided, so arbitrary, unique values for the key data name are assigned (xyz and abc)
> 2. Set category "cell" is repeated. As it depends on "chemical_formula", and this is repeated, no further action is taken.
> 3. The list of dependent categories for "chemical_formula" is provided with values for the child data names of chemical_formula.id

I don’t see how this is consistent with “If any data blocks need to refer to Set category values in other data blocks, explicit values for Set category keys must necessarily be provided.” Explicit values for the Set category keys are *not* provided (and cannot be, because Set categories have no key data names), yet it is desired to form cross-data-block associations. It seems, then, like bare bones should not even be applicable to the example input. Or at least I assume that that provision is meant as a constraint on input, for if it were meant instead as a description of the result then it would leave a big gap in the “getting from here to there” department.

With respect to the specific steps above,

(1) No, the CHEMICAL_FORMULA category is _not_ repeated in the original multiblock representation. There is no key data name either, but if we suppose that a virtual one is synthesized then yes, we can choose a value for it to go with the one provided value of _chemical_formula.sum (just ‘xyz’, then).

(2) Yes, the CELL category is repeated, but why does it depend on CHEMICAL_FORMULA (see also below)? How does it matter to the bare bones procedure, as given, which categories are repeated or have multiple values?

(3) If CHEMICAL_FORMULA did have multiple values, as supposed, then it’s unclear to me how those would be matched to the CELL data.

I think we could define some automatic matching for simple cases like this, but I expect that such a scheme would struggle to handle more complex cases. I think it would also be difficult to handle cases where a relationship could be formed, but should not be.

> Importantly, steps (2) and (3) depend on a list of dependent categories being provided by the dictionary.

I certainly agree that steps (2) and (3) have such a dependency, but I don’t see why a dictionary would define one of the two categories involved as being dependent on the other, unless specifically to serve this particular pattern of multiblock combination. That was among the reasons that I chose these particular categories for the example.

A chemical formula is meaningful without a unit cell. A unit cell is meaningful without a chemical formula. Neither has a functional dependency on the other. And there is a variety of other pairs of categories having a bidirectional association that does not involve one being dependent on or subordinate to another.

If we are to provide for multiblock aggregation schemes that rely on per-category lists of associated categories, then we need to think out the details carefully. For sure, “depends on” is not a sufficient predicate for deciding about the contents of such lists, unless we want only very narrowly scoped facilities, and probably only very few of them.

Best regards,

John