Discussion List Archives


Re: [ddlm-group] Multi block principles

Dear DDLm group,

I think the key discussion point following John's email below is how to define the values of absent Set category keys and those of their child data names. I have inserted comments inline below, but I'll summarise my response here at the top. The _cif_multiblock proposal relies on the _cif_multiblock contents to define data names and values for any elided Set category key data names and their children. This has the advantage of allowing us to avoid defining lots of technical, uninformative data names in the dictionaries for Sets (and Loops), as the _cif_multiblock blocks would specify the names. The other advantage is that the data set creator has the flexibility to decide which categories are related, as only categories that require child data names of the Set categories would be collected in separate data blocks.

The _cif_multiblock proposal should be contrasted with the bare-bones proposal below, which requires dictionary authors to specify inter-category dependencies ahead of time. For example, in John's example multiple chemical formulae require multiple cells. Is this relationship universal, or are there contradictory perspectives? Without attempting to enumerate such implicit inter-category relationships in the core dictionary, I'm not sure.

Further comments inline.

On Tue, 23 Nov 2021 at 06:34, Bollinger, John C <John.Bollinger@stjude.org> wrote:

Dear DDLm Group and James,


Comments inline below.


On Sunday, November 21, 2021 10:57 PM, James H jamesrhester@gmail.com wrote:

The main difference in opinion I see here is regarding the possibility of arbitrary separation of data block contents. As per John's example, I am indeed proposing that *any* Set category *should* be spread across multiple blocks whenever the necessity arises to provide multiple values for its data names. John points out that this would lose the information that these items of information are linked, and so such splitting should not take place unless that original link can be reconstituted.


My "solution" to this is to rely on the context to aggregate data blocks and files into something that the context asserts is a "dataset".  I have ended up at this position as a result of our previous discussion (see https://www.iucr.org/__data/iucr/lists/ddlm-group/msg01626.html and following comments): there is no bulletproof way to insert the appropriate information into all data blocks in all situations. "Context" more verbosely means "the collection of data objects to which this data object belongs and which has been designated by the context as a coherent dataset". The "coherency" requirement means that there are no contradictions (relations with the same key data values but different values in the corresponding rows) after assembly of the data blocks into a single set of relations.
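The coherency requirement can be made concrete with a short sketch; the data structures and data names below are illustrative only, not part of any proposal:

```python
# A minimal sketch of the "coherency" check: after pooling the rows of one
# relation from every data block, no two rows may share key values while
# disagreeing in their other values.

def merge_coherently(blocks, key_names):
    """Pool rows for one category from several blocks.
    key_names lists the category's key data names."""
    merged = {}
    for block in blocks:
        for row in block:
            key = tuple(row[k] for k in key_names)
            if key in merged and merged[key] != row:
                raise ValueError(f"contradiction for key {key}")
            merged[key] = row
    return list(merged.values())

# Two blocks agree on phase 1, so they pool into a single relation:
block_a = [{"phase_id": 1, "formula": "C6H12O6"}]
block_b = [{"phase_id": 1, "formula": "C6H12O6"},
           {"phase_id": 2, "formula": "H2O"}]
rows = merge_coherently([block_a, block_b], ["phase_id"])
```

A block repeating phase_id 1 with a different formula would instead raise the contradiction error, i.e. the aggregate would not be coherent.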



I think it appropriate at this point to recapitulate one of Herbert’s comments from the previous discussion about multiblock data sets: “It is very important that the relational database schema be cleanly and clearly described independent of the syntax of the container languages we use, so that we can work with interoperable presentations in CIF, XML, json, etc.”  That applies to the current proposal as much as it did to the prior one.  Additionally, I think a central reason why a clean and clear description of the schema supports interoperable presentations in multiple languages is that it provides for the relationships between data to be clear and well defined *in CIF*.  This is one of my main areas of concern with the new proposal.


Another area is how nebulous the role of the context seems to be in determining the contents and boundaries of a multi-block data set.  I think I would be willing to accept an absence of specific contextual datablock aggregation mechanisms, but the proposal needs to be clear about that.  Also, I think it needs to be clear about what are the responsibilities and constraints of contextual aggregation mechanisms.  Presumably, such mechanisms must at least identify all the data blocks contributing to a given multiblock data set.  Do they also have a role in defining (expanded) keys and relationships?  Are they obligated to provide identifiers that enable all contributing data blocks actually to be retrieved?  Are there other considerations they need to address?



Particular communities have specified the use of the "summary blocks" mentioned in the proposal: a separate data block where a list of all of the blocks and their roles may be collected. Powder diffraction, for example, uses a data block where "phase_id" is looped together with the block_id of the data block containing the phase information for that phase_id. Such summary blocks cannot, in general, be guaranteed to include data blocks added after the summary block was generated (e.g. calibration data collected separately from the experiment), and they do not cover historical uses of CIF where information about a single dataset has been split across two or more data blocks in a single file.
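For readers unfamiliar with the powder convention, such a summary block might look roughly as follows; the data names follow the pdCIF pattern but are illustrative here rather than quoted from the dictionary:

```cif
data_phase_summary
loop_
  _pd_phase_id
  _pd_phase_block_id
   1   alumina_phase_data
   2   silicon_std_data
```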



I think summary blocks have most of the right characteristics.  Together with the conventions surrounding their present use, I’m sure that the ones already in play have all the characteristics needed within their particular scopes of application.


From the perspective of relational schema, it is helpful to view a CIF data block as a projection of a wider, higher-dimensional relational space.  Only selected attributes of each selected relation are presented, and some of the attributes *not* presented are elements of the higher-dimensional keys (more so with DDLm than with DDL2).  If one wants to form a data set from multiple projections of that sort, then one must reconstitute a representation of the missing key components. That’s essentially what powder CIF does by providing a mapping from phase_id to block_id.
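The reconstitution step described here can be sketched in a few lines; each block is treated as a projection missing the phase_id column, and the phase_id -> block_id mapping puts it back. All names are illustrative:

```python
# Re-attach an elided key column to the rows of each projected block,
# using a mapping from key value to block id (as powder CIF provides).

def reconstitute(blocks_by_id, phase_to_block):
    """Return all rows with the missing phase_id key restored."""
    rows = []
    for phase_id, block_id in phase_to_block.items():
        for row in blocks_by_id[block_id]:
            rows.append({"phase_id": phase_id, **row})
    return rows

blocks = {"blk1": [{"cell_length_a": "6.000(2)"}],
          "blk2": [{"cell_length_a": "5.999(2)"}]}
full = reconstitute(blocks, {1: "blk1", 2: "blk2"})
```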

My general point is that the phase_id -> block_id mapping is surplus information, as each data block restates its phase_id and the data blocks are always in the same CIF file. In other words, you could completely delete the data block containing the phase_id -> block_id mapping and retain all the information, as long as the data blocks stay in the same file, which is a good assumption since we don't usually worry about pieces of files becoming separated from one another. I'm not saying that such a mapping block should be ignored or disallowed, just that it is not strictly necessary.
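The claim that the mapping block is redundant can be demonstrated directly: when every block restates its own phase_id, the phase_id -> block_id mapping is rebuildable from the blocks themselves. Structures and names are illustrative:

```python
# Derive the phase_id -> block_id mapping from the blocks alone, showing
# that a separate mapping block carries no extra information.

def derive_mapping(blocks):
    """blocks: dict of block_id -> flat dict of data names to values."""
    return {contents["_pd_phase_id"]: block_id
            for block_id, contents in blocks.items()}

cif_file = {"blk1": {"_pd_phase_id": "1", "_cell.length_a": "6.000(2)"},
            "blk2": {"_pd_phase_id": "2", "_cell.length_a": "5.999(2)"}}
mapping = derive_mapping(cif_file)
```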


In a more general context, there may be more than one extra key that needs to be reconstituted. Also, the intention seems to be that the mappings from data blocks to added keys need not be one-to-one, which is to say that in general, block ids cannot be assumed to be keys. There is also a desire to provide for normalized representations, to avoid the need to duplicate data.


To answer one of my previous questions, then: a contextual aggregation mechanism does have a role in defining keys and relationships if the result of the aggregation is to be useful as a single data set.



So, some options for strengthening links between data blocks, without attempting to be bulletproof:

1. A new data name e.g. "_audit.multiblock" (true/false) that indicates more than one data block is to be expected

2. A new data name "_audit.dataset_id". The same value for this in separate blocks is sufficient but not necessary for those blocks to be considered part of the same dataset

3. Recommend/require that a summary block be included in a dataset: this covers all known data objects at the time the summary block was written.
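As a concrete illustration, options (1) and (2) would amount to one or two extra lines in each block. Neither data name exists in a dictionary yet, so both, and the dataset identifier value, are hypothetical:

```cif
data_component1
_audit.multiblock     true
_audit.dataset_id     2021-11-expt4
_cell.length_a        6.000(2)
```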


I would of course welcome some bulletproof yet practical way to link blocks together in a CIF way.



I think a more fundamental need is to establish the requirements and responsibilities for aggregation mechanisms in general. Here is my initial cut at a list of mandatory, desirable, and additional characteristics for a viable aggregation mechanism:


  1. It must identify all of the data blocks comprised by an aggregate data set.  This is the essential and most fundamental requirement.
  2. It must designate all data names needed for forming or expanding category keys for the aggregate data set.
  3. It must specify values for the added key attributes on a per-datablock basis.  Together with the previous, this is what makes a simple aggregate into a data set.
  4. It should not depend on adding items to component data blocks.
  5. It may define an overall data set identifier, though that is not required.
  6. It should be machine actionable.  This is not an essential characteristic in any abstract sense, but I don’t see much scope for interest in an aggregation scheme that is not machine actionable.
  7. It may itself be based on CIF syntax, but that is not required.  I see no inherent reason to insist on a specific machine-actionable form for the aggregation metadata.
Commenting on these points inline caused gmail to fiddle with the numbering, so I'll collect the comments here:
(1) Agreed. If I place a tar.gz file at the end of a DOI, I think I have identified all components of an aggregate data set (everything in the archive).
(2) We can specify that either a Set category key data name is given explicitly (e.g. phase_id is provided in each data block), or else an arbitrary value distinct from all other values for that data name in the aggregate is chosen. In that case, I would have thought there is no need for a separate designation.
(3) - (7) agreed.


From that perspective, here is a sketch of a CIF-based design for such a mechanism:


  • Multiple data blocks are physically aggregated into a data set by being presented in the same file, together with a special data block described by the remaining points.
  • A data block in the same file and with the special id _cif_multiblock_ provides information about the relationships among the data in the block. Specifically,
    • There is a loop category “multiblock” with attributes “block_id” and “extra_keys”, where
    • extra_keys values are tables associating extension attribute names with the one-per-block values that they take within the scope of a given data block.
    • All the key attributes so designated for each block are added to all the categories presented within that block, and to those categories’ keys.
    • The values for these key attributes within each data block are taken from the corresponding extra_keys table in the _cif_multiblock_ block.
  • The overall multiblock data set is formed from all the items in the data blocks listed in the “multiblock” category, as expanded with additional keys
  • Validity against the implicit expanded dictionary for the multiblock data set is determined by considering each distinct combination of all the extra keys:
    • All the data associated with such a combination are considered as a group.  Categories that include only a subset of the full set of added keys are considered to be in every group whose values match the keys they do include – this can be formalized in terms of natural joins.
    • The added attributes are otherwise ignored for the purposes of validating each group, and
    • The remaining data are validated against the appropriate dictionary.
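The grouping step in these bullets can be sketched as follows: rows are grouped by each distinct combination of the added keys, rows carrying only a subset of those keys join every compatible group, and the added keys are dropped before per-group validation. The data structures are illustrative, and dictionary validation itself is not shown:

```python
# Group rows by each combination of added keys; rows lacking an added key
# join every group whose other key values they match (a natural-join view).
from itertools import product

def groups_for_validation(rows, extra_keys):
    # Distinct values seen for each added key across all rows.
    seen = {k: {r[k] for r in rows if k in r} for k in extra_keys}
    groups = {}
    for combo in product(*(sorted(seen[k]) for k in extra_keys)):
        target = dict(zip(extra_keys, combo))
        members = [
            {k: v for k, v in r.items() if k not in extra_keys}
            for r in rows
            # a row missing an added key matches any value of that key
            if all(r.get(k, target[k]) == target[k] for k in extra_keys)
        ]
        groups[combo] = members
    return groups

rows = [{"component": 1, "length_a": "6.000(2)"},
        {"component": 2, "length_a": "5.999(2)"},
        {"sum": "C6H12O6"}]            # no added key: joins every group
g = groups_for_validation(rows, ["component"])
```

Each value of `g` would then be validated against the appropriate dictionary as an ordinary single-block data set.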



This …


### multiblock_example.cif ###


data_common
_chemical_formula.sum   C6H12O6

data_component1
_cell.length_a      6.000(2)
_cell.length_b      7.000(2)
_cell.length_c      8.000(2)
_cell.angle_alpha   90
_cell.angle_beta    100.0(3)
_cell.angle_gamma   90

data_component2
_cell.length_a      5.999(2)
_cell.length_b      7.003(2)
_cell.length_c      8.000(3)
_cell.angle_alpha   90
_cell.angle_beta    100.1(3)
_cell.angle_gamma   90

data__cif_multiblock_
loop_
  _multiblock.block_id
  _multiblock.extra_keys
  common        { }
  component1    { 'component': 1 }
  component2    { 'component': 2 }



… would correspond to this flattened representation:








loop_
  _chemical_formula.component    # added key from extra_keys
  _chemical_formula.sum
   1   C6H12O6
   2   C6H12O6

loop_
  _cell.component                # added key from extra_keys
  _cell.length_a
  _cell.length_b
  _cell.length_c
  _cell.angle_alpha
  _cell.angle_beta
  _cell.angle_gamma
   1   6.000(2)   7.000(2)   8.000(2)   90   100.0(3)   90
   2   5.999(2)   7.003(2)   8.000(3)   90   100.1(3)   90




I note that there is no need for a special block name (_cif_multiblock_), as the presence of the multiblock.* data names is sufficient to identify the "special" block.
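That observation amounts to a one-pass scan over the blocks; the in-memory representation here is illustrative:

```python
# Locate the "special" block by the presence of multiblock.* data names,
# rather than by a reserved block name.

def find_multiblock_block(blocks):
    """blocks: dict of block_id -> flat dict of data names to values."""
    for block_id, contents in blocks.items():
        if any(name.startswith("_multiblock.") for name in contents):
            return block_id
    return None

cif_file = {"component1": {"_cell.length_a": "6.000(2)"},
            "index": {"_multiblock.block_id": ["component1"],
                      "_multiblock.extra_keys": [{"component": 1}]}}
```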
From the same perspective, let me describe a "bare bones" scheme as a point of comparison:
(i) Multiple data blocks are aggregated into a data set by being presented in the same data container (e.g. file, zip archive, directory).
(ii) Set categories are single-row in each data block. If a Set category key is absent, it and its child data names may be assigned a unique, arbitrary value. If any data blocks need to refer to Set category values in other data blocks, explicit values for Set category keys must be provided.

So requirement (1) is met by (i), and requirement (2) is met by (ii), by not eliding any key data names whose values are not arbitrary.  Note that such values necessarily existed before output if the data made sense, regardless of any standard we might adopt: if the values are not arbitrary they must be referred to by a defined data name, and for that reference to make sense in software it must have a referent. The other requirements are trivially satisfied.
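Rule (ii) is simple to implement; a sketch follows, using uuid4 as one way of obtaining distinct arbitrary values (data names and structures illustrative):

```python
# Assign a unique, arbitrary value to an absent Set category key in each
# block; explicitly given values are kept as-is.
import uuid

def fill_set_key(blocks, key_name):
    for contents in blocks:
        if key_name not in contents:
            contents[key_name] = uuid.uuid4().hex
    return blocks

blocks = [{"_chemical_formula.sum": "C6H12O6"},
          {"_chemical_formula.id": "phase2",
           "_chemical_formula.sum": "H2O"}]
fill_set_key(blocks, "_chemical_formula.id")
```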

Suppose, in the scheme proposed by John, that the special _cif_multiblock data block is absent in an aggregate of data, but the requirements in (ii) above are met: as far as I can tell, software is able to populate a full relational schema without problems.  So I'd like to see a case where the _cif_multiblock block provides information that the bare-bones approach would not.

Anyway, if I take John's starting example and remove the special _cif_multiblock category, changing nothing else, then reconstitute the data block according to the "bare bones" approach, it would look like:







loop_
  _chemical_formula.id
  _chemical_formula.sum
   xyz   C6H12O6
   abc   C6H12O6

loop_
  _cell.chemical_formula_id      # child of _chemical_formula.id; name illustrative
  _cell.length_a
  _cell.length_b
  _cell.length_c
  _cell.angle_alpha
  _cell.angle_beta
  _cell.angle_gamma
   xyz   6.000(2)   7.000(2)   8.000(2)   90   100.0(3)   90
   abc   5.999(2)   7.003(2)   8.000(3)   90   100.1(3)   90


where the ingesting software has applied the following logic:

1. Set category "chemical_formula" is repeated with differing values, and no category key value is provided, so arbitrary, unique values for the key data name are assigned (xyz and abc).

2. Set category "cell" is repeated. As it depends on "chemical_formula", which is itself repeated, no further action is taken.

3. The categories in the dictionary's list of dependents for "chemical_formula" are provided with values for the child data names of chemical_formula.id.

Importantly, steps (2) and (3) depend on a list of dependent categories being provided by the dictionary.
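The three steps can be sketched as follows. The dependency list that the last sentence calls for is here a hard-coded dict standing in for dictionary-provided information; all names and structures are illustrative:

```python
# Bare-bones ingestion: assign arbitrary keys to a repeated Set category
# and propagate the child key values into its dependent categories.

DEPENDENTS = {"chemical_formula": ["cell"]}   # stand-in for the dictionary

def ingest(blocks, set_cat, key_name):
    # Step 1: repeated Set category with no key -> assign arbitrary keys.
    keys = []
    for i, block in enumerate(blocks):
        key = block.get(set_cat, {}).get(key_name, f"block{i}")
        block.setdefault(set_cat, {})[key_name] = key
        keys.append(key)
    # Steps 2-3: dependent categories inherit the child key value.
    for block, key in zip(blocks, keys):
        for dep in DEPENDENTS[set_cat]:
            if dep in block:
                block[dep][f"{set_cat}_{key_name}"] = key
    return blocks

blocks = [{"chemical_formula": {"sum": "C6H12O6"},
           "cell": {"length_a": "6.000(2)"}},
          {"chemical_formula": {"sum": "C6H12O6"},
           "cell": {"length_a": "5.999(2)"}}]
ingest(blocks, "chemical_formula", "id")
```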

all the best,

T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148