
Re: [ddlm-group] Second proposal to allow looping of 'Set'categories

  • To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
  • Subject: Re: [ddlm-group] Second proposal to allow looping of 'Set'categories
  • From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
  • Date: Fri, 10 Jun 2016 14:00:35 +0000
  • In-Reply-To: <CAM+dB2fTUYtNQNaFMGQFnNyqnAgmU4koexAu-ZsiKm5L+S7qBg@mail.gmail.com>
  • References: <CAM+dB2fTUYtNQNaFMGQFnNyqnAgmU4koexAu-ZsiKm5L+S7qBg@mail.gmail.com>

Dear all,

 

These distinguishing details of James’s new proposal, "Proposal #2", stand out to me (comments interspersed):

 

• It depends on a new data name, which must be assumed to be well-known to all CIF processors, regardless of which dictionary, if any, actually contains its definition.

 

• The proposal gives COMCIFS (or our delegate) the responsibility to maintain a controlled vocabulary for values of the new data name.

 

• The value, if any, associated with the new data name modulates the definitions of other items appearing in the same data block (or save frame?).

 

• New data names representing category keys and child keys must be created in conjunction with maintaining the vocabulary for the new data name.

 

At first I thought the key idea there was that CIF data files that make use of loopability of Set categories should affirmatively declare that they are doing so, on a category-by-category basis.  Perhaps that was indeed the intent, but CIF data files express that same thing more effectively and less redundantly by simply providing the looped data.  Use of an additional item provides no advantage with respect to interpreting data files, and especially not with respect to existing software avoiding misinterpretation of new data files.

I later decided that the primary effect of requiring looped-Set usage to be explicitly declared would be to maintain central control over which Set categories can be presented as multi-packet loops.  Leaving aside for the moment the question of whether that’s an appropriate objective, the proposal still assumes that definitions of the relevant parent and child keys will be created, and that provides the same measure of control by itself.

 

The only other purpose I have come up with for the proposed new item is to support cross validation.  That is, given a CIF data file containing a multi-packet loop of items belonging to a Set category, one could consult the new item to confirm that the looped data were presented as such intentionally, with knowledge that the usage of the category is out of the ordinary.  I can accept that as a rationale, but I find it pretty weak.
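
To make the cross-validation idea concrete, here is a minimal Python sketch of such a check. It is an illustration only: the function name, the dict-of-lists representation of a data block, and the way the declared categories are passed in are all assumptions, not part of the proposal or of any existing CIF software.

```python
def undeclared_looped_sets(loops, set_categories, declared_loopable):
    """Return Set categories presented as multi-packet loops without declaration.

    loops             -- assumed model of a data block: category name -> list of packets
    set_categories    -- names of the categories the dictionary defines as Sets
    declared_loopable -- Set categories that the block's schema item says may be looped
    """
    return sorted(
        category
        for category, packets in loops.items()
        if category in set_categories
        and len(packets) > 1
        and category not in declared_loopable
    )
```

An empty result means every looped Set category was declared; a non-empty one names the categories whose looping may be accidental.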

 

• The proposal retains the distinction between Set and Loop categories, while nevertheless allowing Set categories to be presented as multi-packet loops under some circumstances.

 

I think I understand why the proposal does this: it maintains a distinction between categories that ordinarily are not looped and those that ordinarily are looped.  It also helps support the restrictions on which categories may be presented as multi-packet loops, as discussed above.  I am not yet persuaded, however, that this approach should be preferred over simply making most or all categories defined by data dictionaries (as opposed to DDLm itself) be Loops.  It also maintains a bias towards ordinary / customary uses of items that may or may not actually be warranted – that’s what got us into this situation in the first place, after all.

 

• Permission to omit category keys of Set categories is expressed in prose, not machine-readable form.

 

This would by no means be the only aspect of CIF data definitions whose expression is not machine-readable, but if there were a way to express this aspect in machine-readable form -- and I think there is -- then that would be preferable.

 

• The proposal has no particular provision for accommodating the implicit relationships between each Set category and every other category.

 

I’m talking here about the relationships that arise simply by virtue of categories being Sets -- all other items in the same container are at least potentially associated with every set that appears in the container.  These relationships can be expressed in English in the form "The FOO appearing in the same data block".  In effect, DDLm Sets are like global variables.

 

We rely on this all over the place -- for example the REFLNS (Set) and REFLN (Loop) categories rely on the DIFFRN (Set) category to provide the associated experimental details.  If DIFFRN were looped, then both of these categories (and potentially many others) would need child keys, too.

 

Overall, any proposal that requires COMCIFS’s or a DMG’s intervention to enable new usages of existing data names, and that causes such changes to have global scope, as proposal #2 does, destabilizes CIF by increasing the frequency of disruptive changes.  I think it would be better to find an alternative that solves the problem once and for all.  Adopting such an approach probably would mean relinquishing some of the control that the present proposal would afford us, but I think that’s an essential aspect of the problem space: the more control we exert over what data can be expressed, the more occasions will arise when we need to make changes to allow more or different data expressions.

 

It will be obvious by this point that I have significant reservations about proposal #2.  Lest I seem relentlessly negative, I do have a general idea for an alternative.  This e-mail is already more than long enough, however, so I will present that separately.

 

 

Best regards,

 

John

 

 

From: ddlm-group [mailto:ddlm-group-bounces@iucr.org] On Behalf Of James Hester
Sent: Tuesday, June 07, 2016 10:59 PM
To: ddlm-group <ddlm-group@iucr.org>
Subject: [ddlm-group] Second proposal to allow looping of 'Set' categories

 

Introduction
============

The previous proposal
(http://www.iucr.org/__data/iucr/lists/ddlm-group/msg01428.html) was
deemed inadequate (see discussion in that thread). The two key issues are

(i) current software must not misinterpret files produced
according to any new semantic principles.

(ii) we wish to minimise the number of datanames that software must
potentially check when searching for a particular item of information,
as each new dataname is a required update to all CIF reading software
that processes aliases of that dataname (unfortunately the vast bulk
of software does not use the latest version of the dictionary to find
aliases).

Please carefully consider and improve the following proposal:

Proposal #2 for allowing loopable 'Set' categories
==================================================

Step 1
------

(a) A new dataname '_audit.schema' (or similar) is defined, and all
CIF reading software is expected, after a transitional period, to
check its value; if missing, the value defaults to 'Structural',
corresponding to all current CIF1 datafiles.  Here is a sketchy
definition:

_definition.id    '_audit.schema'
_description.text
;
     This dataname identifies the type of information contained in the datablock. Each
     possible value of this dataname is a list of 'Set' categories that may have more than a
     single value for each dataname in that category (that is, may have more than one row in
     the category loop).
;
loop_
_enumeration_state.code
_enumeration_state.detail
    'Structural'          [ ]
    'Space group tables'  [ space_group ]
_enumeration.default      'Structural'

(b) The 'Set' _definition.class attribute is updated to read as follows ("magic keys"):

                Set
               
;                 Datanames from a Set category usually appear as part of a key-value
                  pair or in a single-row loop, in which case instance files may
                  omit datanames that are linked to the Set category's key (if
                  such a key is defined).
;
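
As a rough illustration of Step 1(a), reading software might resolve the schema value along the following lines. This is only a Python sketch: the `SCHEMA_LOOPABLE` table, the `loopable_set_categories` function, and the plain-dict model of a datablock are assumptions made for illustration, not part of the proposal or of any existing CIF library.

```python
# Hypothetical in-memory copy of the enumeration sketched above: each
# _audit.schema value maps to the Set categories that may have several rows.
SCHEMA_LOOPABLE = {
    "Structural": frozenset(),
    "Space group tables": frozenset({"space_group"}),
}
DEFAULT_SCHEMA = "Structural"  # used when _audit.schema is absent


def loopable_set_categories(block):
    """Return the Set categories that may be looped in this datablock.

    `block` is an assumed representation: a dict of dataname -> value.
    """
    schema = block.get("_audit.schema", DEFAULT_SCHEMA)
    if schema not in SCHEMA_LOOPABLE:
        # An unknown value means the reader cannot safely interpret the block.
        raise ValueError(f"unrecognised _audit.schema value: {schema!r}")
    return SCHEMA_LOOPABLE[schema]
```

Rejecting unknown values, rather than falling back to the default, is one possible design choice; it keeps older software from silently misreading a block written to a newer schema.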

Step 2
------

Approval of new values for _audit.schema should consider the
possible impact on the community in light of adoption rates of
the _audit.schema dataname and number of categories affected by
the changes.

Whenever a new value for _audit.schema is approved, the list of
newly-looped categories is added to the above enumeration list, and:

(i) The newly-looped categories are given a category key, probably in
a separate dictionary

(ii) All looped categories that depend on any newly-looped categories
are updated to always include key dataname(s) that point to the
dependent categor(ies).  For example, "atom_site" would have a
"atom_site.cell_id" dataname added if cell parameters were looped.
The precise meaning of 'depends' is that, if the depended-upon
category loop has multiple rows, then the dependent category would
need to include the key dataname pointing to the depended-upon
category in order to uniquely identify a row.  Again, these extra
key datanames would appear in a separate dictionary.
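
The meaning of 'depends' given in (ii) can be sketched as a simple check. The Python below is an illustration under stated assumptions: a datablock is modelled as a dict of category name -> list of rows, and the `dependencies` table and function name are hypothetical.

```python
def missing_child_keys(loops, dependencies):
    """Report dependent categories that lack a required child key.

    loops        -- assumed model: category name -> list of row dicts
    dependencies -- dependent category -> {depended-upon category: child key name},
                    e.g. {"atom_site": {"cell": "cell_id"}}
    A child key is only required when the depended-upon category has
    more than one row, since otherwise the row is identified implicitly.
    """
    problems = []
    for dependent, parents in dependencies.items():
        rows = loops.get(dependent, [])
        for parent, child_key in parents.items():
            parent_is_looped = len(loops.get(parent, [])) > 1
            if parent_is_looped and any(child_key not in row for row in rows):
                problems.append((dependent, child_key))
    return problems
```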

Discussion
==========

Effects on current standards
----------------------------

This proposal affects the DDLm/dREL standards only, and has no
implications for DDL2 or DDL1 dictionaries.  DDLm dictionaries will
still reproduce DDL1 behaviour, that is, all CIF1 files remain
semantically valid after application of DDLm aliases.

DDLm
----

It is no longer possible to specify Set categories as children of
other Set categories, as this would stop the parent becoming looped.
As the Set-Set parent-child relationship had no semantic meaning
(only organisational), this has no semantic implications.  Where
looping of the parent necessarily implies looping of the child, the
parent-child relationship can remain, but in this case it would
additionally allow optional merging of the parent and child loops,
which may not be intended.

dREL
----

dREL item methods reference datanames from 'Set' categories directly,
in "category.object" notation. All dREL methods can be considered to
operate on the current row of the category within which they are
defined, which means that the current value of any future Set category
child key dataname is available whenever such a category.object
reference is made. If dREL is tweaked to say that any category.object
references use the current value of the child key for that category,
the whole system works (and assuming a default key for non-looped
'Set' categories) and indeed is simplified in many cases, as the
explicit "category[foreign_key].object" notation can often be dropped
where a single key dataname to that category is defined. Some
categories use more than one key to the same category (e.g. geom_bond
has two datanames for the two atoms at each end of a bond) in which
case an explicit reference would still be necessary.
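
The lookup rule described above might be sketched as follows. This is Python rather than dREL, and everything in it is an assumption made for illustration: the `resolve` function, the `child_keys` table, the dict-based rows, and the convention that each looped Set category has a key named `id`.

```python
def resolve(loops, current_row, child_keys, category, obj, parent_key="id"):
    """Resolve a bare category.object reference for the current row.

    loops       -- assumed model: category name -> list of row dicts
    current_row -- row of the category in which the method is defined
    child_keys  -- category -> name of the single child key pointing at it
    """
    rows = loops[category]
    key_name = child_keys.get(category)
    if key_name is None or key_name not in current_row:
        # Non-looped Set category: behave as if a default key selected its only row.
        if len(rows) != 1:
            raise LookupError(f"{category} is looped but no child key is available")
        return rows[0][obj]
    wanted = current_row[key_name]
    for row in rows:
        if row[parent_key] == wanted:
            return row[obj]
    raise LookupError(f"no {category} row with {parent_key} == {wanted!r}")
```

The sketch handles only the single-key case; a category such as geom_bond, with two keys into the same category, would still need an explicit reference.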

Other notes
===========

Datafiles conforming to any of the schema can be automatically
transformed to datafiles conforming to any of the other schema by
splitting items that now need to be single-valued into separate
datablocks, filtering all dependent loops in each new datablock using
the corresponding value of the child key, then dropping the child key
from the filtered loop.
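
The transformation just described might look like the following Python sketch. The function name, the dict-of-lists model of a datablock, and the `child_keys` table are hypothetical and serve only to illustrate the split, filter, and drop steps.

```python
def split_by_parent(loops, parent, parent_key, child_keys):
    """Split one datablock into one datablock per row of `parent`.

    loops      -- assumed model: category name -> list of row dicts
    parent     -- the looped category that must become single-valued
    parent_key -- name of that category's key within its rows
    child_keys -- dependent category -> name of its key pointing at `parent`
    """
    blocks = []
    for parent_row in loops[parent]:
        value = parent_row[parent_key]
        block = {}
        for category, rows in loops.items():
            if category == parent:
                # Each new block keeps exactly one row of the parent.
                block[category] = [dict(parent_row)]
            elif category in child_keys:
                child_key = child_keys[category]
                # Keep only rows pointing at this parent row, then drop the child key.
                block[category] = [
                    {name: v for name, v in row.items() if name != child_key}
                    for row in rows
                    if row.get(child_key) == value
                ]
            else:
                # Categories that do not depend on the parent are copied unchanged.
                block[category] = [dict(row) for row in rows]
        blocks.append(block)
    return blocks
```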

The _audit.schema dataname acts differently to the dictionary
versioning datanames. _audit.schema provides a concise, precise
description of the compatibility of datablock contents, and permits
machine transformation between different schema.  In contrast, the
dictionary versioning mechanism cannot indicate whether or not a given
datablock will be incorrectly interpreted against a later or earlier
dictionary, and given that we have undertaken not to change the
meanings of datanames, it is reasonable for a programmer to assume
that datanames mean what the dictionary they are referring to at
program creation time says, regardless of the dictionary version
stated in a datafile.

Space_group presents no legacy issues as it behaves precisely as
described here.  Furthermore, the original vision of the symmetry
dictionary authors can be safely implemented to e.g. include
transformation matrices between different cell settings in a single
datablock.

Actions
=======

Approve the updated DDLm _definition.class definition in this group,
with note to COMCIFS.

Develop the definition of _audit.schema to link a CIF2 list to
the enumerated states rather than a text string.

Approve the _audit.schema dataname through cif_core and COMCIFS.

Write clear documentation for these enhancements and distribute
to cif-developers and on CIF website.

Update dREL implementations to properly interpret Set category
references.

Create one or more datafiles to test software conformance.

Advertise the new dataname and actively work with authors of
popular CIF-reading software to update software.

--

T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148


