
Re: [ddlm-group] Multi block principles

To: "Herbert J. Bernstein" <[email protected]>
Subject: Re: [ddlm-group] Multi block principles
From: James H <[email protected]>
Date: Tue, 23 Nov 2021 18:26:46 +1100
Cc: Group finalising DDLm and associated dictionaries <[email protected]>
In-Reply-To: <CABcsX24aHapqVdcAViQvyTHn61ZDpHyowsDmORmA0v1kM6=Xaw@mail.gmail.com>
References: <CAM+dB2fajH1c1vhrCJU9v-QQw0kt4Y2udDEx4HBK9QzDq=LD3w@mail.gmail.com><CH2PR04MB6950E54AF550C819FF598F35E0999@CH2PR04MB6950.namprd04.prod.outlook.com><CAM+dB2frS4Xg7fhxy5GQcw5t0WJ+pvia-HvqwLAzC-ySZoU+QQ@mail.gmail.com><CH2PR04MB695069448B0B396DE843ECEAE09C9@CH2PR04MB6950.namprd04.prod.outlook.com><CAM+dB2e-hSg5m4C7+MUbEWhK1ni2GOWWEWoGJCwS5_Xkq6uzuA@mail.gmail.com><CH2PR04MB6950FD4238F32E64416CAC4EE09F9@CH2PR04MB6950.namprd04.prod.outlook.com><CABcsX24aHapqVdcAViQvyTHn61ZDpHyowsDmORmA0v1kM6=Xaw@mail.gmail.com>

Dear Herbert,

Thank you for the reminder. As far as I can tell both John B and I are assiduously speaking in relational terms. If you do detect anything that does not map cleanly to the relational model, let us know.

all the best,

James.

On Tue, 23 Nov 2021 at 07:22, Herbert J. Bernstein <[email protected]> wrote:

Dear Colleagues,
There is a lot of value in what is being proposed, just as there was a lot of value in network databases,
object-oriented databases, and no-sql databases, etc., but the truth of the matter is: if you cannot cleanly
and clearly map what you are doing to and from relational databases, you are going to have a lot
of trouble with practical application of your approach. It is very sad that John Westbrook is no longer
with us. If he were, I am sure he would have a clear view on how to say this so it might be understood,
but let me suggest one way to do this that just uses the tools we already have:

Until and unless we have a clean presentation of the proposed new core CIF with the proposed
multiblock schema in DDL2 we still have work to do. DDL2 provides a very good approximation
to a true relational presentation. Yes, it lacks the distinction between sets and loops, but if we need
the distinction to present any real data, we are in trouble.

Regards,
Herbert


On Mon, Nov 22, 2021 at 2:34 PM Bollinger, John C <[email protected]> wrote:

Dear DDLm Group and James,

Comments inline below.

On Sunday, November 21, 2021 10:57 PM, James H [email protected] wrote:

The main difference in opinion I see here is regarding the possibility of arbitrary separation of data block contents. As per John's example, I am indeed proposing that *any* Set category *should* be spread across multiple blocks whenever the necessity arises to provide multiple values for its data names. John points out that this would lose the information that these items of information are linked, and so such splitting should not take place unless that original link can be reconstituted.

My "solution" to this is to rely on the context to aggregate data blocks and files into something that the context asserts is a "dataset". The reason I have ended up at this position is as a result of our previous discussion (see https://www.iucr.org/__data/iucr/lists/ddlm-group/msg01626.html and following comments): there is no bulletproof way to insert the appropriate information into all data blocks in all situations. "Context" more verbosely means "the collection of data objects to which this data object belongs and which has been designated by the context as a coherent dataset". The "coherency" requirement means that there are no contradictions (relations with the same key data values but different values in the corresponding rows) after assembly of the data blocks into a single set of relations.
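The coherency requirement above can be illustrated with a minimal Python sketch. It assumes each data block has already been parsed into a hypothetical in-memory form (category name mapped to a list of key/row pairs) and that values are compared as plain strings; neither the data layout nor the name `find_contradictions` comes from any CIF library.

```python
from collections import defaultdict


def find_contradictions(blocks):
    """Report rows that share key values but disagree on some other value.

    `blocks` is a list of dicts mapping a category name to a list of
    (key, row) pairs, where `key` is a tuple of key data values and `row`
    maps data names to values. This layout is an assumption of the sketch.
    """
    merged = defaultdict(dict)  # (category, key) -> row assembled so far
    conflicts = []
    for block in blocks:
        for category, rows in block.items():
            for key, row in rows:
                target = merged[(category, key)]
                for name, value in row.items():
                    if name in target and target[name] != value:
                        # Same key, different value: the aggregate is not coherent.
                        conflicts.append((category, key, name, target[name], value))
                    else:
                        target.setdefault(name, value)
    return conflicts


# Hypothetical blocks: block_b extends block_a, block_c contradicts it.
block_a = {"atom_site": [(("C1",), {"fract_x": "0.1"})]}
block_b = {"atom_site": [(("C1",), {"fract_x": "0.1", "fract_y": "0.2"})]}
block_c = {"atom_site": [(("C1",), {"fract_x": "0.3"})]}

print(find_contradictions([block_a, block_b]))
print(find_contradictions([block_a, block_c]))
```

An empty result means the blocks can be assembled into a single set of relations without contradiction under these assumptions.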

I think it appropriate at this point to recapitulate one of Herbert’s comments from the previous discussion about multiblock data sets: “It is very important that the relational database schema be cleanly and clearly described independent of the syntax of the container languages we use, so that we can work with interoperable presentations in CIF, XML, json, etc.” That applies to the current proposal as much as it did to the prior one. Additionally, I think a central reason why a clean and clear description of the schema supports interoperable presentations in multiple languages is that it provides for the relationships between data to be clear and well defined *in CIF*. This is one of my main areas of concern with the new proposal.

Another area is how nebulous the role of the context seems to be in determining the contents and boundaries of a multi-block data set. I think I would be willing to accept an absence of specific contextual datablock aggregation mechanisms, but the proposal needs to be clear about that. Also, I think it needs to be clear about what are the responsibilities and constraints of contextual aggregation mechanisms. Presumably, such mechanisms must at least identify all the data blocks contributing to a given multiblock data set. Do they also have a role in defining (expanded) keys and relationships? Are they obligated to provide identifiers that enable all contributing data blocks actually to be retrieved? Are there other considerations they need to address?

Particular communities have specified the use of the "summary blocks" mentioned in the proposal: a separate data block where a list of all of the blocks and their roles may be collected. Powder diffraction, for example, uses a data block where "phase_id" is looped together with the block_id of the data block containing the phase information for that phase_id. Such summary blocks may not, in general, be guaranteed to include data blocks added after the summary blocks were generated (e.g. calibration data collected separately from the experiment) and do not cover historical uses of CIF where information about a single dataset has been split into a couple or more data blocks in a single file.

I think summary blocks have most of the right characteristics. Together with the conventions surrounding their present use, I’m sure that the ones already in play have all the characteristics needed within their particular scopes of application.

From the perspective of relational schema, it is helpful to view a CIF data block as a projection of a wider, higher-dimensional relational space. Only selected attributes of each selected relation are presented, and some of the attributes *not* presented are elements of the higher-dimensional keys (more so with DDLm than with DDL2). If one wants to form a data set from multiple projections of that sort, then one must reconstitute a representation of the missing key components. That’s essentially what powder CIF does by providing a mapping from phase_id to block_id.

In a more generalized context, there may be more than one extra key that needs to be reconstituted. The intention also seems to be that the mappings from data blocks to added keys need not be one-to-one, which is to say that, in general, block ids cannot be assumed to be keys. There is, in addition, a desire to provide for normalized representations to avoid the need to duplicate data.

To answer one of my previous questions, then: a contextual aggregation mechanism does have a role in defining keys and relationships if the result of the aggregation is to be useful as a single data set.

So, some options for strengthening links between data blocks, without attempting to be bulletproof:

1. A new data name, e.g. "_audit.multiblock" (true/false), that indicates that more than one data block is to be expected.

2. A new data name "_audit.dataset_id". The same value for this in separate blocks is sufficient but not necessary for those blocks to be considered part of the same dataset

3. Recommend/require that a summary block be included in a dataset: this covers all known data objects at the time the summary block was written.
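As an illustration of option 2, here is a minimal Python sketch of grouping blocks by a shared identifier. It assumes blocks are already parsed into plain dicts keyed by data name; "_audit.dataset_id" is only the name proposed above, and `group_by_dataset_id` is a hypothetical helper, not part of any existing CIF tool.

```python
from collections import defaultdict


def group_by_dataset_id(blocks):
    """Group block ids by the value of the proposed _audit.dataset_id item.

    `blocks` maps a block id to a dict of data name -> value (an assumed
    in-memory form). Blocks without the item are returned separately,
    because a shared id is sufficient but not necessary for membership.
    """
    groups = defaultdict(list)
    unassigned = []
    for block_id, items in blocks.items():
        dataset_id = items.get("_audit.dataset_id")
        if dataset_id is None:
            unassigned.append(block_id)
        else:
            groups[dataset_id].append(block_id)
    return dict(groups), unassigned


# Hypothetical input: two blocks share an id, one carries none.
example_blocks = {
    "a": {"_audit.dataset_id": "ds1"},
    "b": {"_audit.dataset_id": "ds1"},
    "c": {},
}
print(group_by_dataset_id(example_blocks))
```

The unassigned list is where the context, or a summary block, would still have to decide membership.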

I would of course welcome some bulletproof yet practical way to link blocks together in a CIF way.

I think a more fundamental need is to establish the requirements and responsibilities for aggregation mechanisms in general. Here is my initial cut at a list of mandatory, desirable, and additional characteristics for a viable aggregation mechanism:

It must identify all of the data blocks comprised by an aggregate data set. This is the essential and most fundamental requirement.
It must designate all data names needed for forming or expanding category keys for the aggregate data set.
It must specify values for the added key attributes on a per-datablock basis. Together with the previous, this is what makes a simple aggregate into a data set.
It should not depend on adding items to component data blocks.
It may define an overall data set identifier, though that is not required.
It should be machine actionable. This is not an essential characteristic in any abstract sense, but I don’t see much scope for an aggregation scheme that is not machine actionable being of interest.
It may itself be based on CIF syntax, but that is not required. I see no inherent reason to insist on a specific machine-actionable form for the aggregation metadata.

From that perspective, here is a sketch of a CIF-based design for such a mechanism:

Multiple data blocks are physically aggregated into a data set by being presented in the same file, together with a special data block described by the remaining points.
A data block in the same file and with the special id _cif_multiblock_ provides information about the relationships among the data in the block. Specifically,

    There is a loop category “multiblock” with attributes “block_id” and “extra_keys”, where extra_keys values are tables associating extension attribute names with the one-per-block values that they take within the scope of a given data block.
    All the key attributes so designated for each block are added to all the categories presented within that block, and to those categories’ keys.
    The values for these key attributes within each data block are taken from the corresponding extra_keys table in the _cif_multiblock_ block.

The overall multiblock data set is formed from all the items in the data blocks listed in the “multiblock” category, as expanded with additional keys.
Validity against the implicit expanded dictionary for the multiblock data set is determined by considering each distinct combination of all the extra keys:

    All the data associated with such a combination are considered as a group. A category that carries only a subset of the full set of added keys is considered to belong to every group that matches it on all the added keys it does include; this can be formalized in terms of natural joins.
    The added attributes are otherwise ignored for the purposes of validating each group, and
    The remaining data are validated against the appropriate dictionary.
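The grouping rule for validation can be sketched in Python as follows. This is one possible reading, not a definitive implementation: it enumerates combinations of the observed extra-key values and keeps only those in which every added key is actually supplied by some member block. The function name `validation_groups` and the dict layout are assumptions, and the key values are assumed to be mutually sortable.

```python
from itertools import product


def validation_groups(extra_keys):
    """Assign blocks to validation groups, one per extra-key combination.

    `extra_keys` maps a block id to its table of added key name -> value.
    A block belongs to every combination that agrees with all the added
    keys it does carry (a block with no added keys belongs to all groups).
    """
    names = sorted({name for keys in extra_keys.values() for name in keys})
    observed = {
        name: sorted({keys[name] for keys in extra_keys.values() if name in keys})
        for name in names
    }
    groups = {}
    for combo in product(*(observed[name] for name in names)):
        wanted = dict(zip(names, combo))
        members = [
            block_id
            for block_id, keys in extra_keys.items()
            if all(wanted[name] == value for name, value in keys.items())
        ]
        # Assumption: keep a combination only if its members jointly
        # supply every added key, so unobserved pairings are dropped.
        covered = {name for block_id in members for name in extra_keys[block_id]}
        if covered == set(names):
            groups[combo] = members
    return names, groups


# Hypothetical extra_keys tables for three blocks.
names, groups = validation_groups(
    {"common": {}, "component1": {"component": 1}, "component2": {"component": 2}}
)
print(names, groups)
```

Each resulting group would then be validated against the ordinary dictionary with the added attributes ignored.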

Example:

This …

### multiblock_example.cif ###

data_common
_chemical_formula.sum C6H12O6

data_component1
_cell.length_a 6.000(2)
_cell.length_b 7.000(2)
_cell.length_c 8.000(2)
_cell.angle_alpha 90
_cell.angle_beta 100.0(3)
_cell.angle_gamma 90

data_component2
_cell.length_a 5.999(2)
_cell.length_b 7.003(2)
_cell.length_c 8.000(3)
_cell.angle_alpha 90
_cell.angle_beta 100.1(3)
_cell.angle_gamma 90

data__cif_multiblock_
loop_
_multiblock.block_id
_multiblock.extra_keys
common { }
component1 { 'component': 1 }
component2 { 'component': 2 }

######

… would correspond to this flattened representation:

######

data_flattened

loop_
_chemical_formula.component
_chemical_formula.sum
1 C6H12O6
2 C6H12O6

loop_
_cell.component
_cell.length_a
_cell.length_b
_cell.length_c
_cell.angle_alpha
_cell.angle_beta
_cell.angle_gamma
1 6.000(2) 7.000(2) 8.000(2) 90 100.0(3) 90
2 5.999(2) 7.003(2) 8.000(3) 90 100.1(3) 90

######

Of course, the above is equally applicable to categories that start out as Loop categories, and to data blocks presenting multiple categories. However, let us not overlook that these specifics are intended as an example. I am prepared to talk about them, but I don’t think it’s useful to go very far in that direction without first coming to an agreement about the properties we want such a scheme to have.
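For concreteness, the flattening shown in the example can be sketched in Python. This is only an illustration under stated assumptions: blocks are already parsed into dicts holding one row per category, a block that omits an added key is expanded over every observed value of that key (which reproduces the replicated _chemical_formula rows above), and the name `flatten` and the data layout are hypothetical.

```python
from itertools import product


def flatten(blocks, extra_keys):
    """Expand per-block rows with the added keys from the multiblock table.

    `blocks` maps block id -> {category: {attribute: value}} (one row per
    category); `extra_keys` maps block id -> {added key name: value}.
    Returns {category: [row, ...]} with the added keys merged into each row.
    """
    names = sorted({name for keys in extra_keys.values() for name in keys})
    observed = {
        name: sorted({keys[name] for keys in extra_keys.values() if name in keys})
        for name in names
    }
    flattened = {}
    for block_id, keys in extra_keys.items():
        # Use the block's own value for a key, else every observed value.
        choices = [[keys[name]] if name in keys else observed[name] for name in names]
        for combo in product(*choices):
            added = dict(zip(names, combo))
            for category, row in blocks[block_id].items():
                flattened.setdefault(category, []).append({**added, **row})
    return flattened


# Data taken from the example (only length_a is shown for the cell).
blocks = {
    "common": {"chemical_formula": {"sum": "C6H12O6"}},
    "component1": {"cell": {"length_a": "6.000(2)"}},
    "component2": {"cell": {"length_a": "5.999(2)"}},
}
extra_keys = {"common": {}, "component1": {"component": 1}, "component2": {"component": 2}}

flattened = flatten(blocks, extra_keys)
for category, rows in flattened.items():
    print(category, rows)
```

Rows that start out in Loop categories could be handled the same way by storing a list of rows per category instead of a single row.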

Best regards,

John

--

John C. Bollinger, Ph.D., RHCSA

Computing and X-Ray Scientist

Department of Structural Biology

St. Jude Children's Research Hospital

[email protected]

(901) 595-3166 [office]

www.stjude.org

Email Disclaimer: www.stjude.org/emaildisclaimer
Consultation Disclaimer: www.stjude.org/consultationdisclaimer

_______________________________________________
ddlm-group mailing list
[email protected]
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group


