

Re: [ddlm-group] Multi block principles

Dear DDLm Group,


James and I seem to agree that it is essential for any multiblock aggregation mechanism to address how to define the relational structure of the aggregate data set.  I take this to be consistent with Herbert’s view as well.


I am not especially tied to the particular model mechanism I proposed, but I do think that working out a functional model aggregation mechanism is an important exercise for clarifying the issues involved and focusing on possible solutions.  Working out two such mechanisms would be even better.  From that perspective, I offer additional comments inline below.


On Tuesday, November 23, 2021 1:08 AM, James H jamesrhester@gmail.com wrote:

The _cif_multiblock proposal relies on the _cif_multiblock contents to define data names and values for any elided Set category key data names and their children. This has the advantage of allowing us to avoid defining lots of technical, uninformative data names in the dictionaries for Sets (and Loops), as the _cif_multiblocks would specify the names. The other advantage is that the data set creator has the flexibility to decide which categories are related, as only categories that required child data names of the Set categories would be collected in separate data blocks.



I largely agree with that assessment.  I should say, though, that I am not particularly focused on distinctions between Set and Loop categories, which are functions of particular domain-dictionary projections of the universe of data.  The existence of these somewhat artificial distinctions is one of the secondary issues we are dealing with.



The _cif_multiblock proposal should be contrasted with the bare-bones proposal below, which requires dictionary authors to specify inter-category dependencies ahead of time. For example, in John's example multiple chemical formulae require multiple cells. Is this "universal"? Are there different perspectives that are contradictory? Without attempting to determine such implicit inter-category relationships in the core dictionary, I'm not sure.



For the record, in the example, there is lexically one formula and multiple cells.  This might arise, for example, in a diffraction experiment where the specimen is a multicrystal of one compound, with each component having been indexed independently.  Supporting this sort of data de-duplication is one of the multiblock objectives that was presented earlier.  I acknowledge, however, it may not be apparent from my short-form description why the given procedure results in multiple CHEMICAL_FORMULA rows when the input contains only one.


I claim that we know from the coexistence of their definitions in the Core dictionary that there is or can be a formula that goes with each cell.  In that sense, yes, it is universal.  But on the other hand, what does it matter?  I infer that the concern is that it might be possible to express nonsensical relationships, but what if it is?  I am not much concerned with an opening for creating bad data as long as it is not unreasonably difficult to create good data.  I am much more interested in the ability to express all the information I want to convey.



On Tue, 23 Nov 2021 at 06:34, Bollinger, John C <John.Bollinger@stjude.org> wrote:

On Sunday, November 21, 2021 10:57 PM, James H jamesrhester@gmail.com wrote:

Particular communities have specified the use of the "summary blocks" mentioned in the proposal: a separate data block where a list of all of the blocks and their roles may be collected. Powder diffraction, for example, uses a data block where "phase_id" is looped together with the block_id of the data block containing the phase information for that phase_id. Such summary blocks are not, in general, guaranteed to include data blocks added after the summary blocks were generated (e.g. calibration data collected separately from the experiment), and they do not cover historical uses of CIF where information about a single dataset has been split into two or more data blocks in a single file.


I think summary blocks have most of the right characteristics.  Together with the conventions surrounding their present use, I’m sure that the ones already in play have all the characteristics needed within their particular scopes of application.


From the perspective of relational schema, it is helpful to view a CIF data block as a projection of a wider, higher-dimensional relational space.  Only selected attributes of each selected relation are presented, and some of the attributes *not* presented are elements of the higher-dimensional keys (more so with DDLm than with DDL2).  If one wants to form a data set from multiple projections of that sort, then one must reconstitute a representation of the missing key components. That’s essentially what powder CIF does by providing a mapping from phase_id to block_id.
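
As a rough illustration of the projection view, the sketch below models each data block as a Python dict and uses a powder-CIF-style mapping from phase_id to block_id to restore the missing key component when the blocks are merged. The block names, the item values, and the helper `merge_with_mapping` are hypothetical; only the idea of a phase_id to block_id mapping comes from the discussion.

```python
# Hypothetical data blocks: each is a projection that omits the phase key.
blocks = {
    "phase_A_block": {"_cell.length_a": "6.000(2)"},
    "phase_B_block": {"_cell.length_a": "5.999(2)"},
}

# Summary-block style mapping: phase_id -> block_id.
phase_to_block = {"A": "phase_A_block", "B": "phase_B_block"}


def merge_with_mapping(blocks, mapping, key_name):
    """Build one row per mapped block, adding the key value from the mapping."""
    rows = []
    for key_value, block_id in sorted(mapping.items()):
        row = {key_name: key_value}
        row.update(blocks[block_id])
        rows.append(row)
    return rows


cell_rows = merge_with_mapping(blocks, phase_to_block, "phase_id")
```

Each resulting row carries the restored key value alongside the items from its block, so rows from different blocks stay distinguishable in the combined relation.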


My general point is that the phase_id -> block_id mapping is surplus information as each data block restates its phase_id and the data blocks are always in the same CIF file. In other words you could completely delete the data block containing the phase_id -> block_id mapping and retain all the information as long as the data blocks stay in the same file, which is a good assumption as we don't usually worry about pieces of files becoming separated from one another. I'm not saying that such a mapping block should be ignored or not allowed, just that it is not strictly necessary.



A mapping block may not be needed in that particular case, but that case is a special one (or so I say – see below).  I take pd_CIF to have been intentionally designed in a manner that supports that variation on multiblock aggregation, but if not, then the design is at least fortuitous in that regard.  If we care only about such cases then we do not need anything more general or more uniform than we already have.



I think a more fundamental need is to establish the requirements and responsibilities for aggregation mechanisms in general. Here is my initial cut at a list of mandatory, desirable, and additional characteristics for a viable aggregation mechanism:


1.       It must identify all of the data blocks comprised by an aggregate data set.  This is the essential and most fundamental requirement.

2.       It must designate all data names needed for forming or expanding category keys for the aggregate data set.

3.       It must specify values for the added key attributes on a per-datablock basis.  Together with the previous, this is what makes a simple aggregate into a data set.

4.       It should not depend on adding items to component data blocks.

5.       It may define an overall data set identifier, though that is not required.

6.       It should be machine actionable.  This is not an essential characteristic in any abstract sense, but I don’t see much scope for an aggregation scheme that is not machine actionable being of interest.

7.       It may itself be based on CIF syntax, but that is not required.  I see no inherent reason to insist on a specific machine-actionable form for the aggregation metadata.
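
To make items 1 to 3 concrete, here is a minimal sketch of what aggregation metadata held outside the component data blocks might look like, with a check for missing pieces. The class `AggregationSpec`, its field names, and the data name `_cell.example_id` are all hypothetical placeholders, and the check makes the simplifying assumption that every member block needs a value for every added key data name.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AggregationSpec:
    """Hypothetical aggregation metadata, kept outside the component blocks."""

    block_ids: List[str]                   # item 1: every member data block
    key_names: Dict[str, List[str]]        # item 2: category -> added key data names
    key_values: Dict[str, Dict[str, str]]  # item 3: block_id -> {key data name: value}
    dataset_id: Optional[str] = None       # item 5: optional overall identifier

    def problems(self) -> List[str]:
        """Return human-readable gaps with respect to items 1 to 3."""
        found = []
        needed = {name for names in self.key_names.values() for name in names}
        for block_id in self.block_ids:
            given = self.key_values.get(block_id, {})
            for name in sorted(needed - set(given)):
                found.append(f"{block_id}: no value for {name}")
        for block_id in sorted(set(self.key_values) - set(self.block_ids)):
            found.append(f"{block_id}: values given for an unlisted block")
        return found


# "_cell.example_id" is a placeholder, not a data name from any dictionary.
spec = AggregationSpec(
    block_ids=["cell_1", "cell_2"],
    key_names={"cell": ["_cell.example_id"]},
    key_values={
        "cell_1": {"_cell.example_id": "1"},
        "cell_2": {"_cell.example_id": "2"},
    },
)
```

Because the metadata lives in its own object, nothing has to be added to the component data blocks (item 4), and the same structure could be serialized as CIF or in some other form (item 7).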

Commenting on these points inline caused gmail to fiddle with the numbering, so I'll collect the comments here:

(1) Agreed. If I place a tar.gz file at the end of a DOI, I think I have identified all components of an aggregate data set (everything in the archive).



I agree that “all the contents of the archive” or “all the contents of the directory” or “all the contents of the CIF” are possible ways of identifying the components of a multiblock data set, but I do not accept that those are necessary interpretations of data delivered in an archive, or in a directory, or in a CIF containing multiple data blocks. Nor do I accept that the whole contents of X is always the best or most appropriate way to identify components of a multiblock set.  There is room for different mechanisms to do this differently, or for a given mechanism to allow multiple options.



(2) We can specify that either a Set category key data name is given explicitly (e.g. phase_id is provided in each data block), or else an arbitrary value distinct from all other values for that data name in the aggregate is chosen. In that case there is no need for a separate designation I would have thought.



At present, DDLm Set categories do not have category keys (unless we accept some kind of anonymous implicit key).  DDLm expressly specifies that _category_key.name gives a data name that is part of a category key for a Loop category, and the DDLm core dictionary accordingly does not define category keys for Set categories. There are no existing key data names available to be given for Set categories, so if we want any then we need either to generate them or to designate them.  And it’s not really any different with Loop categories.  Although these do have category keys, it is a natural use case that we would want to present multiblock data in which one or more loop categories’ keys are expanded with additional data names.  For example, a multiblock data set providing information about multiple distinct structures might need to present ATOM_SITE data for each one, which would require expanding the ATOM_SITE key to distinguish among the rows for the different structures and to avoid the risk of duplicate keys.
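
The ATOM_SITE case can be sketched as follows. The two blocks, their values, the added key attribute `structure_id`, and the helper `combine` are hypothetical; the sketch only shows why a key limited to _atom_site.label collides when structures are combined, and how one added key component avoids that.

```python
# Two hypothetical structures, each with an atom site labelled "C1".
structure_blocks = {
    "structure_1": [{"_atom_site.label": "C1", "_atom_site.fract_x": "0.1000"}],
    "structure_2": [{"_atom_site.label": "C1", "_atom_site.fract_x": "0.2500"}],
}


def combine(blocks, key_names):
    """Merge rows from all blocks, indexed by the given key data names.

    Raises ValueError if two rows share the same key.
    """
    table = {}
    for block_id, rows in blocks.items():
        for row in rows:
            # "structure_id" is a hypothetical added key attribute whose value
            # is taken here from the block identifier.
            full_row = {"structure_id": block_id, **row}
            key = tuple(full_row[name] for name in key_names)
            if key in table:
                raise ValueError(f"duplicate key {key}")
            table[key] = full_row
    return table


# The unexpanded key collides across the two structures ...
try:
    combine(structure_blocks, ["_atom_site.label"])
    collided = False
except ValueError:
    collided = True

# ... whereas the expanded key keeps the rows distinct.
expanded = combine(structure_blocks, ["structure_id", "_atom_site.label"])
```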


But I feel like I may have lost the thread of the conversation.  I thought one of the objectives was to avoid creating permanent, formal extension dictionaries to support key extensions for the various multiblock scenarios that are not adequately supported for category keys as defined in our present dictionaries.  A combinatorial argument against that was raised, and an argument about how that would result in adding key data names that we would prefer to avoid including literally in multiblock CIF documents.  So what is the target, actually?



I note that there is no need for a special block name (_cif_multiblock) as the presence of the multiblock.* data names is sufficient to identify the "special" block.



I agree that a special data block name is not an essential component of that scheme.  I think it serves a useful practical purpose to have such a name, but yes, it would be possible to identify the aggregation metadata block by its contents alone.



From the same perspective, let me describe a "bare bones" scheme as a point of comparison:

(i) Multiple data blocks are aggregated into a data set by being presented in the same data container (e.g. file, zip archive, directory).

(ii) Set categories are single-row in each data block. If a Set category key is absent, it and its child data names may be assigned a unique, arbitrary value. If any data blocks need to refer to Set category values in other data blocks, explicit values for Set category keys must necessarily be provided.
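
A minimal sketch of points (i) and (ii), assuming that a key data name is available for the Set category in question: all blocks in one container are treated as one data set, explicit key values are kept, and absent ones receive an arbitrary value that is unique within the aggregate. The helper `fill_set_keys`, the `auto` prefix, and the block contents are hypothetical; `phase_id` is used as the key name only because it appears earlier in this thread.

```python
import itertools


def fill_set_keys(container, set_key_names):
    """Give every block a value for each Set key data name it omits.

    `container` maps block_id -> {data name: value}. Explicit key values are
    kept; missing ones get an arbitrary value unique within the aggregate.
    """
    counter = itertools.count(1)
    # Values already in use for each key data name anywhere in the container.
    used = {
        name: {items[name] for items in container.values() if name in items}
        for name in set_key_names
    }
    result = {}
    for block_id, items in container.items():
        filled = dict(items)
        for name in set_key_names:
            if name not in filled:
                candidate = f"auto{next(counter)}"
                while candidate in used[name]:
                    candidate = f"auto{next(counter)}"
                used[name].add(candidate)
                filled[name] = candidate
        result[block_id] = filled
    return result


container = {
    "block_1": {"phase_id": "alpha", "_cell.length_a": "6.000(2)"},
    "block_2": {"_cell.length_a": "5.999(2)"},
}
filled = fill_set_keys(container, ["phase_id"])
```

Note that the generated values are only useful inside a single block; a block that must refer to another block's Set category has to carry an explicit value, as point (ii) says.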



So again, (ii) supposes that there are Set category keys, and in particular, keys that can take multiple values.  I will understand this as implying that the relevant categories are actually Loop categories in a possibly-virtual extension dictionary obtained by converting some of the Set categories from a base dictionary.  Furthermore, this then seems to be proposing that contrary to our usual expectations for valid CIF, data for a category may be presented without presenting the (full) category key, at least in the special case where at most one value is presented for each attribute of the category in a given CIF.  This is supported in part by engaging a unique key generator (notionally, at least).



So requirement (1) is met by (i).






Requirement (2) is met by (ii) by not eliding any key data names whose values are not arbitrary.



I think we have a misunderstanding here.  By (2) I mean that the identities of all the data names composing each category key must be established deterministically by the aggregation scheme.  This supposes that these might be a superset of those defined by any given category’s dictionary definition.  The “bare bones” scheme seems to assume that these are always defined in an existing dictionary, no extension required.  That does satisfy (2), but that has little to do with the ability or duty to elide presentation of any data names.



Note that such values a priori existed if the data made sense before output, regardless of any standard we might be talking about, because if the values are not arbitrary they must be referred to by a defined data name, and for that reference to make sense in software it must have a referent.



I do not accept that.  At minimum, collocation in the same CIF data block establishes relationships between data from different categories that are not expressed via any data name.  Those relationships need to be preserved when multiple blocks are combined into a larger data set, without also establishing unwanted relationships.  That will sometimes require new data names to be chosen and retained for which no data items exist in the dictionary or in component data blocks.


I think there are other plausible cases, too.



Suppose, in the scheme proposed by John, that the special _cif_multiblock data block is absent in an aggregate of data, but the requirements in (ii) above are met: as far as I can tell software is able to populate a full relational schema without problems.  So I'd like to see a case where the cif_multiblock block provides information that the bare bones approach would not.



The example I presented is already such a case, as understood in relation directly to the (DDLm) Core dictionary.  The CHEMICAL_FORMULA category does not have a category key containing any data names, so no key value for it can be expressed explicitly in the data, neither for the CHEMICAL_FORMULA data themselves, nor for the CELL data to refer to.  Additionally, the CELL category does not have a category key containing any data names either, and without that, the combined data set, expressing two rows of CELL data, cannot be valid against the Core dictionary.


At minimum, to be comparable with the cif_multiblock scheme, the bare bones scheme needs to be extended with a description of how and under what circumstances category keys are created / expanded, and what the validity requirements are for the resulting combined data set.



Anyway, if I take John's starting example and remove the special _cif_multiblock category, changing nothing else, then reconstitute the data block according to the "bare bones" approach, it would look like:

xyz C6H12O6
abc C6H12O6

xyz 6.000(2) 7.000(2) 8.000(2) 90 100.0(3) 90
abc 5.999(2) 7.003(2) 8.000(3) 90 100.1(3) 90


No doubt a mechanism could be defined that does generate that result without reference to instructions included with the data, but the “bare bones” approach, as defined, does not seem to be such a mechanism.  More on that below.


Where the ingesting software has followed the following logic:


1. Set category "chemical_formula" is repeated, with differing values, and no category key value is provided, so arbitrary, unique values for the key data name are assigned (xyz and abc)

2. Set category "cell" is repeated. As it depends on "chemical_formula", and this is repeated, no further action is taken.

3. The list of dependent categories for "chemical_formula" is provided with values for the child data names of chemical_formula.id
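
One literal reading of these three steps can be sketched as below. It rests on assumptions that are not settled in this thread: that each input block carries both a formula and a cell, and that a dictionary supplies a list of dependent categories. The `DEPENDENTS` table, the `ingest` helper, the attribute names `id` and `chemical_formula_id`, and the block layout are all hypothetical.

```python
# Hypothetical dictionary-provided dependency list: parent category -> children.
DEPENDENTS = {"chemical_formula": ["cell"]}


def ingest(blocks, parent, labels):
    """Apply the three steps to blocks shaped as {category: {attribute: value}}.

    Returns {category: [row, ...]}. Each parent row gets an arbitrary key
    ("id"); each dependent row in the same block gets a matching child value.
    """
    tables = {parent: []}
    for child in DEPENDENTS[parent]:
        tables[child] = []
    for label, block in zip(labels, blocks):
        if parent not in block:
            continue
        # Step 1: assign an arbitrary, unique key value to the parent row.
        tables[parent].append({"id": label, **block[parent]})
        # Steps 2 and 3: dependent categories in the same block reuse it.
        for child in DEPENDENTS[parent]:
            if child in block:
                tables[child].append({f"{parent}_id": label, **block[child]})
    return tables


blocks = [
    {"chemical_formula": {"sum": "C6H12O6"}, "cell": {"length_a": "6.000(2)"}},
    {"chemical_formula": {"sum": "C6H12O6"}, "cell": {"length_a": "5.999(2)"}},
]
tables = ingest(blocks, "chemical_formula", ["xyz", "abc"])
```

Under those assumptions the sketch yields two formula rows and two cell rows keyed by xyz and abc. It says nothing about inputs in which the formula appears in only one block.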


I don’t see how this is consistent with “If any data blocks need to refer to Set category values in other data blocks, explicit values for Set category keys must necessarily be provided.”  Explicit values for the Set category keys are *not* provided (and cannot be, because Set categories have no key data names), yet it is desired to form cross-data-block associations.  It seems, then, like bare bones should not even be applicable to the example input.  Or at least I assume that that provision is meant as a constraint on input, for if it were meant instead as a description of the result then it would leave a big gap in the “getting from here to there” department.

With respect to the specific steps above,

(1) No, the CHEMICAL_FORMULA category is _not_ repeated in the original multiblock representation.  There is no key data name either, but if we suppose that a virtual one is synthesized then yes, we can choose a value for it to go with the one provided value of _chemical_formula.sum (just ‘xyz’, then).

(2) Yes, the CELL category is repeated, but why does it depend on CHEMICAL_FORMULA (see also below)?  How does it matter to the bare bones procedure, as given, which categories are repeated or have multiple values?

(3) If CHEMICAL_FORMULA did have multiple values, as supposed, then it’s unclear to me how those would be matched to the CELL data.


I think we could define some automatic matching for simple cases like this, but I expect that such a scheme would struggle to handle more complex cases.  I think it would also be difficult to handle cases where relationships could be formed, but should not be.



Importantly, steps (2) and (3) depend on a list of dependent categories being provided by the dictionary.



I certainly agree that steps (2) and (3) have such a dependency, but I don’t see why a dictionary would define one of the two categories involved as being dependent on the other, unless specifically to serve this particular pattern of multiblock combination.  That was among the reasons that I chose these particular categories for the example.  A chemical formula is meaningful without a unit cell.  A unit cell is meaningful without a chemical formula.  Neither has a functional dependency on the other.


And there is a variety of other pairs of categories having a bidirectional association that does not involve one being dependent on or subordinate to another.  If we are to provide for multiblock aggregation schemes that rely on per-category lists of associated categories, then we need to think out the details carefully.  For sure, “depends on” is not a sufficient predicate for deciding about the contents of such lists, unless we want only very narrowly scoped facilities, and probably only very few of them.



Best regards,



