Discussion List Archives


Re: [ddlm-group] Multi block principles

Dear DDLm group,

I have prepared a brief draft for a new DDLm attribute (below) as a continuation of this discussion. Please feel free to comment here or on GitHub (see https://github.com/COMCIFS/comcifs.github.io/blob/master/draft/ddlm_attr_implicit_key.md for a nicely-formatted version).

thanks,
James.
==========
# Proposal for new DDLm attribute 'category_implicit.link'

Version: 0.1
Authors: J Hester
Status: Draft
Date: November 2021

## Introduction

The values of data names in a given category often depend implicitly
on the values found in other categories.  For example, atomic
positions (`atom_site`) will change if the cell parameters (`cell`)
and/or space group (`space_group`) change, but there is no
machine-readable information in the `atom_site` category to indicate
this link. This is not a problem when only one cell and space group are
listed in the data block, as the particular cell that a row of the
atom site loop refers to is unambiguous.

This inter-category dependence information does become important when
information is spread over multiple data blocks. In such circumstances,
the preferred approach is to separate information relating to a
particular value of data names in certain categories into separate
data blocks. In order to understand which categories are affected,
the implicit links should be available. For example, if a powder
diffraction result is spread over multiple data blocks depending on
which phase (compound) is being described, it is important to describe
in a machine-readable way which categories should be presented in
each data block, and which categories can be collected together in a
single additional data block.

This proposal suggests a new data name `_category_implicit.link`
that would list the categories that the present category depends
on.

## Definition

```
save_category_implicit.link
    _definition.id              '_category_implicit.link'
    _definition.class           Loop
    _description.text
;
    Values for the category being defined will in general depend
    on the values found in the Set categories listed under this
    data name.  As dependency is transitive, only those
    categories that are sufficient to derive a full list need
    to be listed.
   
    In relational terms, the defining category has an implicit
    data name that is a child data name of the key data name
    of each of the listed Set categories.
   
    Categories listed here are in addition to, and may
    duplicate, explicit relationships defined using
    `_name.linked_item_id` and `_category_key.name`.
;
    _name.category_id           category_implicit
    _name.object_id             link

    _type.container             Single
    _type.contents              Word
    loop_
    _description_example.case
    _description_example.detail
;
    loop_
      _category_implicit.link
      CELL
;

;
    The list of parent categories of ATOM_SITE.  As the cell
    depends on the chosen space group, only CELL needs to
    be listed.
;

save_
```
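
The transitivity mentioned in the description can be illustrated with a minimal Python sketch. It assumes the `_category_implicit.link` values have already been read into a plain dictionary; the names `IMPLICIT_LINKS` and `all_parents` are hypothetical, and the link lists simply mirror the `atom_site` / `cell` / `space_group` example above rather than any released dictionary.

```python
# Hypothetical in-memory form of `_category_implicit.link` values,
# keyed by the defining category. The link lists are illustrative only.
IMPLICIT_LINKS = {
    "atom_site": ["cell"],
    "cell": ["space_group"],
    "space_group": [],
}


def all_parents(category, links):
    """Return every category that `category` depends on, directly or transitively."""
    found = []
    stack = list(links.get(category, []))
    while stack:
        parent = stack.pop()
        if parent in found:
            continue  # already visited; also guards against cycles
        found.append(parent)
        stack.extend(links.get(parent, []))
    return sorted(found)


print(all_parents("atom_site", IMPLICIT_LINKS))  # ['cell', 'space_group']
```

Only `cell` is listed for `atom_site`, yet the derived list also contains `space_group`, which is the sense in which only a sufficient subset of categories needs to be listed.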

## Discussion

The mechanism for defining category relationships described here
is mostly useful where `_audit.schema` is `Base`.  In this case,
multi-valued `Set` categories are split over multiple data blocks,
so any child key data names have a single, unambiguous value and
may be omitted. Similarly, the `Set` category key data name may
also be omitted. Thus the attribute serves to link the categories
without the need to define extra data names that will be largely
unused.

Conversely, if `_audit.schema` is not `Base`, in at least some
cases explicit data names will need to be defined in order
to allow looping over those key data names and to remove ambiguity
about which of those key data values is being referred to in
loops.
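
To make the contrast concrete, here is a rough Python sketch of what happens when several `Base` data blocks are combined into one looped presentation: an explicit key column has to be added to keep the rows apart. The function `merge_blocks`, the block names and the key data name `_atom_site.phase_id` are assumptions made for illustration only; no such data name is defined in this proposal.

```python
def merge_blocks(blocks, key_name):
    """Combine per-block rows into one loop, adding an explicit key column.

    `blocks` maps a block name to the list of row dicts for one Loop category.
    `key_name` is a hypothetical explicit child key data name.
    """
    merged = []
    for block_name, rows in blocks.items():
        for row in rows:
            # The block name stands in for the implicit key value.
            merged.append({key_name: block_name, **row})
    return merged


# Two illustrative blocks, one per phase.
per_phase = {
    "phase_a": [{"_atom_site.label": "Fe1"}],
    "phase_b": [{"_atom_site.label": "O1"}, {"_atom_site.label": "O2"}],
}
for row in merge_blocks(per_phase, "_atom_site.phase_id"):
    print(row)
```

Within each original block the key value is the same for every row, which is why it can be left out there and only becomes necessary after merging.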

## Next steps

If the present proposal is acceptable, test versions of the CIF
core dictionary and CIF powder dictionary will be prepared to
make sure these ideas work in practice.
==========

On Fri, 19 Nov 2021 at 13:40, James H <jamesrhester@gmail.com> wrote:
Dear DDLm group,

I respond to John's comments inline below.

On Wed, 17 Nov 2021 at 03:41, Bollinger, John C <John.Bollinger@stjude.org> wrote:

Dear DDLm group,

 

I have no fundamental objection to setting out specific principles guiding the use of multiple data blocks to represent coherent collections of data.  I do, however, have some comments about the specifics of the draft presented.  Most of these are technical, but a couple are editorial:


I'm glad that the general principles are acceptable. I'll edit the document to take into account the technical comments, and confine myself to the substantial points here:

[edit]

 

3. It might be useful to clarify that writing data blocks using the _audit.schema 'Base' implies that all the categories on which each category depends -- both Set categories and related Loop categories -- are presented in the same data block.  In relational terms, one might say that each data block provides a distinct, implicit key value that associates the categories presented within.  I would like to avoid giving the impression that multiple data blocks, each specifying _audit.schema 'Base' and valid against (say) the Core dictionary, and without any duplicate items or key conflicts among them, can or should be interpreted the same as a single data block containing the union of the multiple blocks’ contents.


I'm not sure I follow. If a powder diffraction experiment splits the structures of, say, 3 component phases over 3 blocks + 1 block for the invariable information, all conforming to 'Base', isn't that information identical to presenting it all in a single block (no longer conforming to 'Base') with appropriate key data names added?

 

4. Speaking of data blocks defining an implicit key, I think the draft overemphasizes the relationships between Loop categories and Set categories.  When considering _audit.schema values other than 'Base', one has to recognize and account for the fact that there are relationships between pairs of Set categories, too.  These tend to be weak in the Core dictionary because it is fairly well factored, but for an example, take Set categories _exptl_crystal and _chemical_formula.  There is a non-trivial dependency there via (at least) _exptl_crystal.density_diffrn.  Also along these lines, it would be appropriate to say not that Set categories *may be* equipped with a category key, but that they *are* equipped with one.  If that can’t be considered technically correct, then we should make it so.  We could introduce the possibility of a zero-column key for this, which would offer some mathematical consistency both with there being only one possible category key value for Set categories, and with the effects of expanding that key with additional columns.


What is a zero-column key?  Is that like an implicit key with no actual values stored?

 

5. “categories whose values do not depend on any of the `Set` category values” does not make sense to me.  I understand that the objective is to specify that unnecessary data duplication should be avoided, but surely it is a matter of the selected _audit.schema which categories need to be presented in which data blocks.  Right?  If this is about schema design, then maybe it would be better to express the principle in terms of the non-duplication objective.


The point is that `_audit.schema` doesn't force you to include a category, only how it should be presented (looped or single-valued) in a given data block. So journal information may be put in a separate data block, or put together with one of the other data blocks. Neither might be a violation of an `_audit.schema` that simply says that some journal category is looped, and the contents don't depend on any other category. So the specification is trying to get at the point that this class of information should be kept separate in the interests of reducing variability as to where it is found.  I will rewrite.

 

6. It is unclear what “allow[ing] the context to determine aggregation” means.  It could be taken to imply that there is some well-defined contextual mechanism available.  I think less would be more here: “The CIF standard does not stipulate how to identify data blocks belonging to a single data set.  Optionally, dictionaries may define data names that help in this task.[END]”  Or maybe only the first of those sentences.

 

7. I don’t understand in any significant detail what the description of summary blocks is trying to say.  Perhaps it would be more meaningful to someone experienced in using powder or modulated structure CIF, but the name and description convey only a vague impression to me.


I will expand this.

 

8. I disfavor relying on parent categories to identify their child categories.  That approach already constrains how DDL2 dictionaries may be supplemented by extension dictionaries in the more constrained context of Set categories not needing to participate in child-declaration, especially if one wants to use multiple extension dictionaries together.  I just don’t see it being sustainable in an environment where we must consider substantially every category to be a potential Loop category.  A plan that localizes the required definition changes as much as possible is to be preferred.  As an alternative, it may be useful to come up with a standard way to encode the additional dependency information into DDLm dictionaries *now*.  That could at least provide for automating the generation of the needed additional definitions in extension dictionaries.


So how about reversing it, and the child categories instead identify their parent categories using a new DDLm attribute? This would still require extension dictionaries to add information to core categories from time to time. One example might be an imaginary twinning dictionary that introduces 'twin_id' in category 'twin'. Until this dictionary, the 'refln' category implicitly assumed a single value of this identifier, so the dictionary would redefine 'refln' to also depend on 'twin', as would 'diffrn_refln' and some others.  The test comes when e.g. the modulated structure dictionary does not know about the existence of the 'twin' dictionary and redefines 'refln' its own way.

This missing information is, however, not a problem as it simply retains the meaning of 'single individual twin' for a modulated structure. If someone wants to describe a twinned modulated structure, then the modulated structures dictionary categories can be updated accordingly, and as long as the 'Base' schema is retained legacy software will be OK.  We still retain the option of explicitly defining the parent/child data names for complex situations.

On reflection, the original extension dictionary mechanism (adding explicit key data names to child categories) was really just creating these dependencies in child categories, but at the cost of proliferation of extension dictionaries (e.g. a modulated-structure-twin dictionary, a modulated-structures-laue-twin dictionary etc.). It seems much neater to simply gradually expand the lists of "parent" categories in child categories within the single dictionary as the need arises. If we are agreeable with this approach I'll draft a definition for a new DDLm attribute that we can discuss.

all the best
James.

 

 

 

From: ddlm-group <ddlm-group-bounces@iucr.org> On Behalf Of James H
Sent: Monday, November 15, 2021 8:18 PM
To: ddlm-group <ddlm-group@iucr.org>
Subject: [ddlm-group] Multi block principles

 


 

Dear DDLm group,

 

Please see below a draft version of principles guiding the use of multiple data blocks for encapsulating CIF data. Something similar to this has long been in use for powder data and modulated structure data, and this is essentially an attempt to formalise an approach in terms of DDLm. Getting this right has implications both for those two dictionaries, and for how we combine imgCIF data names with data names from our other dictionaries. As far as I can tell what I am proposing conforms pretty closely to what we have already agreed. Please comment either here or in the repository (github.com/COMCIFS/comcifs.github.io).

 

 

thanks,

James.

==========

# Principles for reading and writing CIF information using multiple data blocks

Version: 0.1
Author: J Hester
Date: November 2021
Status: Draft

## Introduction

Data described by CIF dictionaries can be spread between multiple data
containers. In some cases these data containers may be in a variety of
non-CIF formats (e.g. HDF5, columnar ASCII).  In more complex
scenarios, a number of choices exist as to how such data should be
distributed between data containers. Choice is generally undesirable
in standards, as it complicates the task of aligning reading and
writing software. Therefore, these principles have been developed to
describe how CIF writing software should distribute and describe data
dispersed over multiple data containers.

For the purposes of this document, the data container is assumed to be
a CIF data block.

## Background

All CIF data names are defined in DDLm dictionaries to belong either
to a `Set` or `Loop` category. Data names in a `Set` category are
by default single-valued within a single data block.

The value of data name `_audit.schema` can be used to determine the
list of `Set` categories in a given data block. The default value for
`_audit.schema` of `Base` corresponds to the `Set` categories defined
in the core CIF dictionary. Dictionaries building on the core CIF
dictionary may add further `Set` categories to this list. A
non-default value for `_audit.schema` will usually imply that some
or all of these `Set` categories are looped within a data block.

In general, many `Loop` categories will have an implicit dependence on
items that appear in `Set` categories. For example, atomic positions
in an `atom_site` list depend on the space group and unit cell
information, which both appear in `Set` categories.  In relational
terms, there is an additional key data name in such `Loop` categories
that is a child of the (implicit) key data name for such `Set`
categories. If the parent value belongs to a `Set` category, this also
requires that the child data name for such loops takes only the
stated value of the parent, which allows child data names to be
dropped from the data block as their value is unambiguous. In the
following, "child data names" refers only to these child data names of
`Set` categories.

## Principles: Writing

1. Where that choice is available, all data blocks should be written
using the `Base` `_audit.schema`.  In other words, distinct values for
any items from `Set` categories and any child data names are placed
in separate data blocks.

2. Where multiple data blocks are used, categories whose values do not
depend on any of the `Set` category values (for example, an author
list) should be collected into a separate data block to avoid
unnecessary repetition in each data block.

3. The CIF standard does not stipulate how to identify data blocks
   belonging to a single data set.  Dictionaries may define data names
   that help in this task, or allow the context to determine aggregation.

4. Summary blocks: where desired, the information for one or more
`Set` categories that has been scattered over multiple data blocks may
be repeated in a summary loop in a separate data block, for example, a
list of powder phases with block pointers. In this case
`_audit.schema` *must* be changed from the default appropriately, and
the values listed in the summary loop *must* match the values provided
in each individual data block.
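
Writing principles 1 and 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: rows are plain dictionaries carrying an explicit key column, the function name `split_into_blocks`, the `block_...` naming scheme and the `"loop"` entry are hypothetical, and no real CIF library is used.

```python
from collections import defaultdict


def split_into_blocks(rows, key_name, invariant):
    """Distribute looped rows over one block per distinct key value.

    rows      -- list of row dicts that carry an explicit key column
    key_name  -- the key data name whose values distinguish the blocks
    invariant -- dict of items that do not depend on any Set category
    """
    grouped = defaultdict(list)
    for row in rows:
        # Drop the key: within its own block the value is unambiguous.
        rest = {k: v for k, v in row.items() if k != key_name}
        grouped[row[key_name]].append(rest)
    result = {
        f"block_{value}": {"_audit.schema": "Base", "loop": body}
        for value, body in grouped.items()
    }
    # Principle 2: items independent of the Set categories get their own block.
    result["block_common"] = {"_audit.schema": "Base", **invariant}
    return result


rows = [
    {"phase": "a", "_atom_site.label": "Fe1"},
    {"phase": "b", "_atom_site.label": "O1"},
]
blocks = split_into_blocks(rows, "phase", {"_audit_author.name": "A. N. Author"})
print(sorted(blocks))  # ['block_a', 'block_b', 'block_common']
```

The sketch does not produce a summary block (principle 4); doing so would require a non-default `_audit.schema` in that block.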

## Principles: Reading

1. Always check `_audit.schema` to ensure that the value is that
expected by your software. This is particularly important for
detecting and handling summary blocks (see above).
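
A reader-side check along these lines might look like the following Python sketch. It assumes a data block is available as a dictionary of data names to values; `schema_of` and `check_schema` are hypothetical helper names, and the fallback to `Base` follows the default described in the Background section.

```python
DEFAULT_SCHEMA = "Base"  # default value described in the Background section


def schema_of(block):
    """Return the block's `_audit.schema`, falling back to the default."""
    return block.get("_audit.schema", DEFAULT_SCHEMA)


def check_schema(block, supported=("Base",)):
    """Raise ValueError if the block uses a schema this reader does not handle."""
    schema = schema_of(block)
    if schema not in supported:
        raise ValueError(f"unsupported _audit.schema: {schema!r}")
    return schema


print(check_schema({"_cell.length_a": "5.43"}))  # Base
```

A block that declares a different schema, such as a summary block, is rejected by default, so software has to opt in explicitly by passing the schema values it understands in `supported`.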

## Examples

### Powder diffraction

Consider the results of refining powder diffraction patterns from a
sample containing multiple compounds ("phases") measured at
multiple temperatures. In this case, the results for each phase
at each temperature should be presented in a separate data block,
as both `pd_phase` and `diffrn` are `Set` categories in CIF core.
Optionally, the data block containing information that does not
vary with phase or diffraction conditions (e.g. information about
the diffractometer) can also contain a list of the phases and
a list of the diffraction conditions, as long as `_audit.schema`
is set appropriately.

### Using imgCIF with cif_core

The imgCIF dictionary loops several categories that are `Set`
categories in CIF core.  One of those categories is `diffrn_detector`,
covering situations where multiple detectors are used to collect
data. According to the present principles, information about each
detector should be distributed over several data blocks. A separate
document is in preparation that goes into more detail.

## Implications for dictionary authors

1. A `Set` category may be equipped with a category key.
2. Child data names of `Set` category keys must be indicated "somehow"
   (see below).
3. If the desired presentation of data differs from that implied by
the `Set` categories, a new value of `_audit.schema` should be
created.

### A new DDLm data name?

If the above principles are accepted, it will become necessary to
indicate which `Loop` categories have implicit child keys of `Set`
categories. Previously, it was proposed that extension dictionaries
would explicitly add these keys at the same time as turning the `Set`
categories into `Loop` categories. However, this is in almost all
cases a mechanical exercise that simply identifies which `Loop`
categories depend on which `Set` categories, and almost all of the
resulting child data names, at least when the above principles are
followed, end up being dropped from data blocks.

DDL2 solves this inconvenience by listing the child data names
within the definition of the parent data name.

If the above principles are acceptable, we could do something
similar by defining a new DDLm category attribute listing the
categories which implicitly depend on the defining category. When
using the `Base` `_audit.schema`, this would be sufficient. If using
a different schema, explicit data names would need to be defined
and in this case it would be appropriate to provide definitions
within the dictionary itself.

If these principles are acceptable I will prepare such a proposal
for a new DDLm attribute.
--

T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148



_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group

