
Re: [ddlm-group] Multi block principles

I have incorporated some of the changes as suggested by John B as a pull request to the document on github. I suggest further technical editing there in order to leave this group for discussing the substantial issues. The link to the changes is https://github.com/jamesrhester/comcifs.github.io/pull/1/commits/02dc80d41b341833a1fbd50739d6e5ddbefeab1c and comments may be made from the 'Conversation' tab on that page.

all the best,

On Wed, 17 Nov 2021 at 03:41, Bollinger, John C <John.Bollinger@stjude.org> wrote:

Dear DDLm group,


I have no fundamental objection to setting out specific principles guiding the use of multiple data blocks to represent coherent collections of data.  I do, however, have some comments about the specifics of the draft presented.  Most of these are technical, but a couple are editorial:


1.  “Choice” is widely considered desirable.  How about choosing a less positively connoted word to use in the introduction, such as “variability”?


2. The document refers to _audit.schema a lot, and in particular, at places it recommends defining new values that it may take.  It would be helpful to clarify that _audit.schema is defined in the Core dictionary, and especially to give or point to some guidance on exactly how one would define new values for it.  I suggest also clarifying that although the values for this item do indicate which categories are Set categories, they do not provide direct support for enumerating those categories programmatically.


3. It might be useful to clarify that writing data blocks using the _audit.schema 'Base' implies that all the categories on which each category depends -- both Set categories and related Loop categories -- are presented in the same data block.  In relational terms, one might say that each data block provides a distinct, implicit key value that associates the categories presented within.  I would like to avoid giving the impression that multiple data blocks, each specifying _audit.schema 'Base' and valid against (say) the Core dictionary, and without any duplicate items or key conflicts among them, can or should be interpreted the same as a single data block containing the union of the multiple blocks’ contents.


4. Speaking of data blocks defining an implicit key, I think the draft overemphasizes the relationships between Loop categories and Set categories.  When considering _audit.schema values other than 'Base', one has to recognize and account for the fact that there are relationships between pairs of Set categories, too.  These tend to be weak in the Core dictionary because it is fairly well factored, but for an example, take Set categories _exptl_crystal and _chemical_formula.  There is a non-trivial dependency there via (at least) _exptl_crystal.density_diffrn.  Also along these lines, it would be appropriate to say not that Set categories *may be* equipped with a category key, but that they *are* equipped with one.  If that can’t be considered technically correct, then we should make it so.  We could introduce the possibility of a zero-column key for this, which would offer some mathematical consistency both with there being only one possible category key value for Set categories, and with the effects of expanding that key with additional columns.


5. “categories whose values do not depend on any of the `Set` category values” does not make sense to me.  I understand that the objective is to specify that unnecessary data duplication should be avoided, but surely it is a matter of the selected _audit.schema which categories need to be presented in which data blocks.  Right?  If this is about schema design, then maybe it would be better to express the principle in terms of the non-duplication objective.


6. It is unclear what “allow[ing] the context to determine aggregation” means.  It could be taken to imply that there is some well-defined contextual mechanism available.  I think less would be more here: “The CIF standard does not stipulate how to identify data blocks belonging to a single data set.  Optionally, dictionaries may define data names that help in this task.[END]”  Or maybe only the first of those sentences.


7. I don’t understand in any significant detail what the description of summary blocks is trying to say.  Perhaps it would be more meaningful to someone experienced in using powder or modulated structure CIF, but the name and description convey only a vague impression to me.


8. I disfavor relying on parent categories to identify their child categories.  That approach already constrains how DDL2 dictionaries may be supplemented by extension dictionaries in the more constrained context of Set categories not needing to participate in child-declaration, especially if one wants to use multiple extension dictionaries together.  I just don’t see it being sustainable in an environment where we must consider substantially every category to be a potential Loop category.  A plan that localizes the required definition changes as much as possible is to be preferred.  As an alternative, it may be useful to come up with a standard way to encode the additional dependency information into DDLm dictionaries *now*.  That could at least provide for automating the generation of the needed additional definitions in extension dictionaries.



Best regards,






From: ddlm-group <ddlm-group-bounces@iucr.org> On Behalf Of James H
Sent: Monday, November 15, 2021 8:18 PM
To: ddlm-group <ddlm-group@iucr.org>
Subject: [ddlm-group] Multi block principles


Dear DDLm group,


Please see below a draft version of principles guiding the use of multiple data blocks for encapsulating CIF data. Something similar to this has long been in use for powder data and modulated structure data, and this is essentially an attempt to formalise an approach in terms of DDLm. Getting this right has implications both for those two dictionaries, and for how we combine imgCIF data names with data names from our other dictionaries. As far as I can tell what I am proposing conforms pretty closely to what we have already agreed. Please comment either here or in the repository (github.com/COMCIFS/comcifs.github.io).






# Principles for reading and writing CIF information using multiple data blocks

Version: 0.1
Author: J Hester
Date: November 2021
Status: Draft

## Introduction

Data described by CIF dictionaries can be spread between multiple data
containers. In some cases these data containers may be in a variety of
non-CIF formats (e.g. HDF5, columnar ASCII).  In more complex
scenarios, a number of choices exist as to how such data should be
distributed between data containers. Choice is generally undesirable
in standards, as it complicates the task of aligning reading and
writing software. Therefore, these principles have been developed to
describe how CIF writing software should distribute and describe data
dispersed over multiple data containers.

For the purposes of this document, the data container is assumed to be
a CIF data block.

## Background

All CIF data names are defined in DDLm dictionaries to belong either
to a `Set` or `Loop` category. Data names in a `Set` category are
by default single-valued within a single data block.

The value of data name `_audit.schema` can be used to determine the
list of `Set` categories in a given data block. The default value for
`_audit.schema` of `Base` corresponds to the `Set` categories defined
in the core CIF dictionary. Dictionaries building on the core CIF
dictionary may add further `Set` categories to this list. A
non-default value for `_audit.schema` will usually imply that some
or all of these `Set` categories are looped within a data block.

In general, many `Loop` categories will have an implicit dependence on
items that appear in `Set` categories. For example, atomic positions
in an `atom_site` list depend on the space group and unit cell
information, which both appear in `Set` categories.  In relational
terms, there is an additional key data name in such `Loop` categories
that is a child of the (implicit) key data name for such `Set`
categories. If the parent value belongs to a `Set` category, this also
requires that the child data name for such loops takes only the
stated value of the parent, which allows child data names to be
dropped from the data block as their value is unambiguous. In the
following, "child data names" refers only to these child data names of
`Set` categories.
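
As a rough illustration of why such child data names can be dropped,
the sketch below restores them mechanically. It assumes a loop is held
as a list of plain Python dictionaries; the child data name
`_atom_site.structure_id` and the parent value `block_1` are
hypothetical and are not defined in any dictionary.

```python
def restore_child_keys(loop_rows, child_name, parent_value):
    """Make an implicit child data name explicit in every row of a loop.

    loop_rows: list of dicts, one per row of a Loop category.
    child_name: hypothetical child data name (illustration only).
    parent_value: the single key value of the parent Set category.
    """
    restored = []
    for row in loop_rows:
        # A row that already carries the child data name must agree with
        # the parent, because the Set category holds only one key value.
        if row.get(child_name, parent_value) != parent_value:
            raise ValueError(f"{child_name} conflicts with the parent value")
        restored.append({**row, child_name: parent_value})
    return restored


atoms = [{"_atom_site.label": "C1"}, {"_atom_site.label": "O1"}]
explicit = restore_child_keys(atoms, "_atom_site.structure_id", "block_1")
```

Because only one parent value exists in the block, the restored column
carries no information that was not already implied by the block itself.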

## Principles: Writing

1. Where that choice is available, all data blocks should be written
using the `Base` `_audit.schema`.  In other words, distinct values for
any items from `Set` categories and any child data names are placed
in separate data blocks.

2. Where multiple data blocks are used, categories whose values do not
depend on any of the `Set` category values (for example, an author
list) should be collected into a separate data block to avoid
unnecessary repetition in each data block.

3. The CIF standard does not stipulate how to identify data blocks
   belonging to a single data set.  Dictionaries may define data names
   that help in this task, or allow the context to determine aggregation.

4. Summary blocks: where desired, the information for one or more
`Set` categories that has been scattered over multiple data blocks may
be repeated in a summary loop in a separate data block, for example, a
list of powder phases with block pointers. In this case
`_audit.schema` *must* be changed from the default appropriately, and
the values listed in the summary loop *must* match the values provided
in each individual data block.
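
The matching requirement in principle 4 could be checked along the
following lines. This is only a sketch: data blocks are assumed to be
plain dictionaries keyed by block identifier, and the data names
`_example_phase.block_id` and `_example_phase.name` are hypothetical
stand-ins for whatever block pointer and `Set` items a dictionary
actually defines.

```python
def summary_mismatches(summary_rows, blocks, pointer_name, checked_names):
    """List disagreements between a summary loop and the blocks it points to.

    summary_rows: list of dicts, one per row of the summary loop.
    blocks: dict mapping block identifier to a dict of data names.
    pointer_name: data name in the summary loop holding the block pointer.
    checked_names: data names whose values must match in both places.
    """
    problems = []
    for row in summary_rows:
        block_id = row[pointer_name]
        block = blocks.get(block_id)
        if block is None:
            problems.append((block_id, None, "block not found"))
            continue
        for name in checked_names:
            if block.get(name) != row.get(name):
                problems.append((block_id, name, "value mismatch"))
    return problems


blocks = {
    "phase_a": {"_example_phase.name": "quartz"},
    "phase_b": {"_example_phase.name": "corundum"},
}
summary = [
    {"_example_phase.block_id": "phase_a", "_example_phase.name": "quartz"},
    {"_example_phase.block_id": "phase_b", "_example_phase.name": "rutile"},
]
problems = summary_mismatches(
    summary, blocks, "_example_phase.block_id", ["_example_phase.name"]
)
```

Here `problems` would flag the second row, whose name disagrees with
the value given in the block it points to.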

## Principles: Reading

1. Always check `_audit.schema` to ensure that the value is that
expected by your software. This is particularly important for
detecting and handling summary blocks (see above).
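
A minimal sketch of such a check follows, assuming a data block has
already been parsed into a dictionary mapping data names to values. The
function names and the set of supported schemas are illustrative, and
values are compared exactly, which is a simplification.

```python
DEFAULT_SCHEMA = "Base"  # default value of _audit.schema, as described above


def audit_schema(block):
    """Return the block's _audit.schema value, or the default if absent."""
    return block.get("_audit.schema", DEFAULT_SCHEMA)


def check_schema(block, supported=frozenset({"Base"})):
    """Raise ValueError unless the block uses a schema the reader supports."""
    schema = audit_schema(block)
    if schema not in supported:
        raise ValueError(f"unsupported _audit.schema value: {schema!r}")
    return schema
```

A reader that also understands summary blocks would pass a larger
`supported` set and branch on the returned value.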

## Examples

### Powder diffraction

Consider the results of refining powder diffraction patterns from a
sample containing multiple compounds ("phases") measured at
multiple temperatures. In this case, the results for each phase
at each temperature should be presented in a separate data block,
as both `pd_phase` and `diffrn` are `Set` categories in CIF core.
Optionally, the data block containing information that does not
vary with phase or diffraction conditions (e.g. information about
the diffractometer) can also contain a list of the phases and
a list of the diffraction conditions, as long as `_audit.schema`
is set appropriately.
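
The distribution described above can be sketched as follows, assuming
blocks are represented as plain dictionaries. The block naming scheme
is invented for the example, and the data names `_pd_phase.id` and
`_diffrn.ambient_temperature` are used purely for illustration.

```python
from itertools import product


def plan_blocks(phases, temperatures):
    """Return one data block per (phase, temperature) combination."""
    blocks = {}
    for phase, temperature in product(phases, temperatures):
        # Each block holds exactly one value for each Set category item,
        # so the default schema applies.
        blocks[f"{phase}_{temperature}K"] = {
            "_audit.schema": "Base",
            "_pd_phase.id": phase,
            "_diffrn.ambient_temperature": temperature,
        }
    return blocks


blocks = plan_blocks(["quartz", "corundum"], [100, 300])
```

Two phases measured at two temperatures therefore yield four data
blocks, each single-valued in both `Set` categories.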

### Using imgCIF with cif_core

The imgCIF dictionary loops several categories that are `Set`
categories in CIF core.  One of those categories is `diffrn_detector`,
covering situations where multiple detectors are used to collect
data. According to the present principles, information about each
detector should be placed in its own data block. A separate
document is in preparation that goes into more detail.

## Implications for dictionary authors

1. A `Set` category may be equipped with a category key.
2. Child data names of `Set` category keys must be indicated "somehow"
   (see below)
3. If the desired presentation of data differs from that implied by
the `Set` categories, a new value of `_audit.schema` should be defined.

### A new DDLm data name?

If the above principles are accepted, it will become necessary to
indicate which `Loop` categories have implicit child keys of `Set`
categories. Previously, it was proposed that extension dictionaries
would explicitly add these keys at the same time as turning the `Set`
categories into `Loop` categories. However, this is in almost all
cases a mechanical exercise that simply identifies which `Loop`
categories depend on which `Set` categories, and almost all of the
resulting child data names, at least when the above principles are
followed, end up being dropped from data blocks.

DDL2 solves this inconvenience by listing the child data names
within the definition of the parent data name.

If the above principles are acceptable, we could do something
similar by defining a new DDLm category attribute listing the
categories which implicitly depend on the defining category. When
using the `Base` `_audit.schema`, this would be sufficient. If using
a different schema, explicit data names would need to be defined
and in this case it would be appropriate to provide definitions
within the dictionary itself.

If these principles are acceptable I will prepare such a proposal
for a new DDLm attribute.

T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

ddlm-group mailing list