
Re: [ddlm-group] Multi block principles

Dear DDLm Group and James,


Comments inline below.


On Sunday, November 21, 2021 10:57 PM, James H <jamesrhester@gmail.com> wrote:

The main difference in opinion I see here is regarding the possibility of arbitrary separation of data block contents. As per John's example, I am indeed proposing that *any* Set category *should* be spread across multiple blocks whenever the necessity arises to provide multiple values for its data names. John points out that this would lose the information that these items of information are linked, and so such splitting should not take place unless that original link can be reconstituted.


My "solution" to this is to rely on the context to aggregate data blocks and files into something that the context asserts is a "dataset".  I have ended up at this position as a result of our previous discussion (see https://www.iucr.org/__data/iucr/lists/ddlm-group/msg01626.html and following comments): there is no bulletproof way to insert the appropriate information into all data blocks in all situations. "Context" more verbosely means "the collection of data objects to which this data object belongs and which the context has designated as a coherent dataset". The "coherency" requirement means that there are no contradictions (relations with the same key data values but different values in the corresponding rows) after assembly of the data blocks into a single set of relations.
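As a concrete illustration of the coherency requirement, here is a minimal Python sketch that flags contradictions among assembled blocks.  The in-memory representation and the data values are hypothetical, chosen only to show the check; this is not part of any CIF API.

```python
def find_contradictions(blocks):
    """blocks: iterable of dicts mapping (category, key) -> row dict.
    Returns the (category, key) identifiers whose rows disagree across
    blocks -- i.e. same key data values, different corresponding rows."""
    seen = {}        # (category, key) -> first row encountered
    conflicts = []
    for block in blocks:
        for ident, row in block.items():
            if ident in seen and seen[ident] != row:
                conflicts.append(ident)
            else:
                seen.setdefault(ident, row)
    return conflicts

# A consistent duplicate is allowed; a disagreement is a contradiction.
block_a = {("cell", "1"): {"length_a": "6.000(2)"}}
block_b = {("cell", "1"): {"length_a": "6.000(2)"}}   # consistent
block_c = {("cell", "1"): {"length_a": "5.999(2)"}}   # contradicts block_a

assert find_contradictions([block_a, block_b]) == []
assert find_contradictions([block_a, block_c]) == [("cell", "1")]
```

Under this reading, "coherent" aggregation is exactly the condition that `find_contradictions` returns an empty list for the assembled set of relations.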



I think it appropriate at this point to recapitulate one of Herbert’s comments from the previous discussion about multiblock data sets: “It is very important that the relational database schema be cleanly and clearly described independent of the syntax of the container languages we use, so that we can work with interoperable presentations in CIF, XML, json, etc.”  That applies to the current proposal as much as it did to the prior one.  Additionally, I think a central reason why a clean and clear description of the schema supports interoperable presentations in multiple languages is that it provides for the relationships between data to be clear and well defined *in CIF*.  This is one of my main areas of concern with the new proposal.


Another area is how nebulous the role of the context seems to be in determining the contents and boundaries of a multi-block data set.  I think I would be willing to accept an absence of specific contextual data block aggregation mechanisms, but the proposal needs to be clear about that.  Also, it needs to be clear about what the responsibilities and constraints of contextual aggregation mechanisms are.  Presumably, such mechanisms must at least identify all the data blocks contributing to a given multiblock data set.  Do they also have a role in defining (expanded) keys and relationships?  Are they obligated to provide identifiers that enable all contributing data blocks actually to be retrieved?  Are there other considerations they need to address?



Particular communities have specified the use of the "summary blocks" mentioned in the proposal: a separate data block in which a list of all of the blocks and their roles may be collected. Powder diffraction, for example, uses a data block in which "phase_id" is looped together with the block_id of the data block containing the phase information for that phase_id. Such summary blocks cannot, in general, be guaranteed to include data blocks added after the summary blocks were generated (e.g. calibration data collected separately from the experiment), and they do not cover historical uses of CIF in which information about a single dataset has been split into two or more data blocks in a single file.



I think summary blocks have most of the right characteristics.  Indeed, together with the conventions surrounding their present use, I’m sure that the ones already in play have all the characteristics needed within their particular scopes of application.


From the perspective of a relational schema, it is helpful to view a CIF data block as a projection of a wider, higher-dimensional relational space.  Only selected attributes of each selected relation are presented, and some of the attributes *not* presented are elements of the higher-dimensional keys (more so with DDLm than with DDL2).  If one wants to form a data set from multiple projections of that sort, then one must reconstitute a representation of the missing key components. That’s essentially what powder CIF does by providing a mapping from phase_id to block_id.
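Reconstituting a projected-out key amounts to re-attaching a per-block value to every row of the projected categories.  A Python sketch, with a hypothetical block-to-phase mapping and hypothetical data names, in the spirit of the powder CIF case:

```python
# Hypothetical mapping from block_id to the phase_id value that was
# projected out when each phase's data went into its own block.
phase_of_block = {"block_phase1": "1", "block_phase2": "2"}

def reconstitute(block_id, rows):
    """Re-attach the missing phase_id key component to each row of a
    category presented in the given data block."""
    phase_id = phase_of_block[block_id]
    return [dict(row, phase_id=phase_id) for row in rows]

# An atom-site row from the phase-2 block regains its phase_id key.
rows = reconstitute("block_phase2", [{"label": "C1", "fract_x": "0.25"}])
assert rows == [{"label": "C1", "fract_x": "0.25", "phase_id": "2"}]
```

The point is that the mapping lives outside the projected blocks, so the blocks themselves need not carry the extra key.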


In a more generalized context, there may be more than one extra key that needs to be reconstituted.  Also, the intention seems to be that the mappings from data blocks to added keys need not be one-to-one, which is to say that in general, block ids cannot be assumed to be keys.  Furthermore, there is a desire to provide for normalized representations that avoid the need to duplicate data.


To answer one of my previous questions, then: a contextual aggregation mechanism does have a role in defining keys and relationships if the result of the aggregation is to be useful as a single data set.



So, some options for strengthening links between data blocks, without attempting to be bulletproof:

1. A new data name e.g. "_audit.multiblock" (true/false) that indicates more than one data block is to be expected

2. A new data name "_audit.dataset_id". The same value for this in separate blocks is sufficient but not necessary for those blocks to be considered part of the same dataset

3. Recommend/require that a summary block be included in a dataset: this covers all known data objects at the time the summary block was written.


I would of course welcome some bulletproof yet practical way to link blocks together in a CIF way.



I think a more fundamental need is to establish the requirements and responsibilities for aggregation mechanisms in general. Here is my initial cut at a list of mandatory, desirable, and additional characteristics for a viable aggregation mechanism:


  1. It must identify all of the data blocks comprised by an aggregate data set.  This is the essential and most fundamental requirement.
  2. It must designate all data names needed for forming or expanding category keys for the aggregate data set.
  3. It must specify values for the added key attributes on a per-datablock basis.  Together with the previous, this is what makes a simple aggregate into a data set.
  4. It should not depend on adding items to component data blocks.
  5. It may define an overall data set identifier, though that is not required.
  6. It should be machine actionable.  This is not an essential characteristic in any abstract sense, but I see little prospect of an aggregation scheme that is not machine actionable attracting much interest.
  7. It may itself be based on CIF syntax, but that is not required.  I see no inherent reason to insist on a specific machine-actionable form for the aggregation metadata.


From that perspective, here is a sketch of a CIF-based design for such a mechanism:


  • Multiple data blocks are physically aggregated into a data set by being presented in the same file, together with a special data block described by the remaining points.
  • A data block in the same file and with the special id _cif_multiblock_ provides information about the relationships among the data in the block. Specifically,
    • There is a loop category “multiblock” with attributes “block_id” and “extra_keys”, where
    • extra_keys values are tables associating extension attribute names with the one-per-block values that they take within the scope of a given data block.
    • All the key attributes so designated for each block are added to all the categories presented within that block, and to those categories’ keys.
    • The values for these key attributes within each data block are taken from the corresponding extra_keys table in the _cif_multiblock_ block.
  • The overall multiblock data set is formed from all the items in the data blocks listed in the “multiblock” category, as expanded with the additional keys.
  • Validity against the implicit expanded dictionary for the multiblock data set is determined by considering each distinct combination of all the extra keys:
    • All the data associated with such a combination are considered as a group.  Categories presenting only a subset of the full set of added keys are considered to belong to every group whose values match all the added keys they do include – this can be formalized in terms of natural joins.
    • The added attributes are otherwise ignored for the purposes of validating each group, and
    • The remaining data are validated against the appropriate dictionary.
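To make the mechanics concrete, here is a rough Python sketch of that aggregation step, assuming the blocks have already been parsed into plain dictionaries.  The parsing layer, names, and toy values are hypothetical; this is a sketch of the expansion-and-merge idea, not a definitive implementation.

```python
def flatten(blocks, extra_keys):
    """blocks: {block_id: {category: [row dicts]}}
    extra_keys: {block_id: {key_name: value}} taken from the
    "multiblock" loop in the special _cif_multiblock_ block.
    Returns the merged {category: [row dicts]} with each row expanded
    by the key attributes recorded for its source block."""
    merged = {}
    for block_id, categories in blocks.items():
        added = extra_keys.get(block_id, {})
        for category, rows in categories.items():
            for row in rows:
                merged.setdefault(category, []).append({**added, **row})
    return merged

blocks = {
    "common":     {"chemical_formula": [{"sum": "C6H12O6"}]},
    "component1": {"cell": [{"length_a": "6.000(2)"}]},
    "component2": {"cell": [{"length_a": "5.999(2)"}]},
}
extra_keys = {"common": {}, "component1": {"component": 1},
              "component2": {"component": 2}}

flat = flatten(blocks, extra_keys)
assert flat["cell"] == [{"component": 1, "length_a": "6.000(2)"},
                        {"component": 2, "length_a": "5.999(2)"}]
assert flat["chemical_formula"] == [{"sum": "C6H12O6"}]
```

Validation would then proceed per distinct combination of the extra keys, joining in any categories (such as chemical_formula here) that carry only a subset of those keys.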




This …


### multiblock_example.cif ###

data_common

_chemical_formula.sum C6H12O6


data_component1

_cell.length_a 6.000(2)
_cell.length_b 7.000(2)
_cell.length_c 8.000(2)
_cell.angle_alpha 90
_cell.angle_beta 100.0(3)
_cell.angle_gamma 90


data_component2

_cell.length_a 5.999(2)
_cell.length_b 7.003(2)
_cell.length_c 8.000(3)
_cell.angle_alpha 90
_cell.angle_beta 100.1(3)
_cell.angle_gamma 90


data__cif_multiblock_

loop_
  _multiblock.block_id
  _multiblock.extra_keys
common        { }
component1 { 'component': 1 }
component2 { 'component': 2 }



… would correspond to this flattened representation:

loop_
  _chemical_formula.component
  _chemical_formula.sum
1 C6H12O6
2 C6H12O6

loop_
  _cell.component
  _cell.length_a
  _cell.length_b
  _cell.length_c
  _cell.angle_alpha
  _cell.angle_beta
  _cell.angle_gamma
1 6.000(2) 7.000(2) 8.000(2) 90 100.0(3) 90
2 5.999(2) 7.003(2) 8.000(3) 90 100.1(3) 90




Of course, the above is equally applicable to categories that start out as Loop categories, and to data blocks presenting multiple categories.  However, let us not overlook that these specifics are intended as an example.  I am prepared to talk about them, but I don’t think it’s useful to go very far in that direction without first coming to an agreement about the properties we want such a scheme to have.



Best regards,






John C. Bollinger, Ph.D., RHCSA

Computing and X-Ray Scientist

Department of Structural Biology

St. Jude Children's Research Hospital


(901) 595-3166 [office]




Email Disclaimer: www.stjude.org/emaildisclaimer
Consultation Disclaimer: www.stjude.org/consultationdisclaimer