Re: [ddlm-group] Proposal to enhance the behaviour of a DDLm "Set" category: please consider

  • To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
  • Subject: Re: [ddlm-group] Proposal to enhance the behaviour of a DDLm "Set" category: please consider
  • From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
  • Date: Wed, 25 May 2016 14:52:27 +0000
  • In-Reply-To: <CAM+dB2cQ3c3HSOBiyH=F4Bm55ceZmL4g4KrTjHCcTHTsmYn3cw@mail.gmail.com>
  • References: <CAM+dB2cQ3c3HSOBiyH=F4Bm55ceZmL4g4KrTjHCcTHTsmYn3cw@mail.gmail.com>

Dear James and DDLm group,


I’m not sure that I have fully comprehended the proposal to alter the meaning of the 'Set' definition class, so let me try to summarize in my own words:


(*) Presently, DDLm categories defined as 'Sets' contain items that must not be looped, or at least must not appear in multi-packet loops.  Items in such categories take at most one value per data block or save frame.


(*) The choice between the 'Set' and 'Loop' category classes is made by dictionary developers based on the envisioned use of the category in data files.  For example, the SYMMETRY category in the DDLm version of the core dictionary is defined to be a 'Set' because the dictionary is structured around the idea that each data block or save frame in a data file describes at most one structure, and a structure has exactly one set of symmetry information.


(*) Substantially the same item may be relevant to different kinds of overall data sets, and the appropriate choice between 'Set' and 'Loop' (as they are presently defined) may vary between kinds of data sets.  This mismatch prevents some desired re-uses of definitions across dictionaries.


(*) To enable the desired kinds of re-use, it is proposed that the 'Set' category class be redefined to require uniqueness only with respect to a category key.  New constraints are placed on the other categories that can appear in the same block or frame, so as to ensure that each datum can be associated with at most one value for any item in any 'Set' category.
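
As an illustration of the contrast drawn in the summary above, here is a minimal Python sketch of the two rules. The function names and the representation of a category as a list of packets (dictionaries) are assumptions made for this example only; they are not taken from the DDLm specification or from any existing software. The datanames are used purely as sample data, and the sketch ignores the proposed constraints on other categories in the same block.

```python
from typing import Dict, List, Sequence

Packet = Dict[str, str]  # one row of a category: dataname -> value


def valid_set_current(packets: List[Packet]) -> bool:
    """Present rule: a 'Set' category holds at most one packet per block."""
    return len(packets) <= 1


def valid_set_proposed(packets: List[Packet], key: Sequence[str]) -> bool:
    """Proposed rule as summarized above: several packets are allowed
    only if a key is defined and every packet has a distinct key value."""
    if len(packets) <= 1:
        return True
    if not key:
        return False  # no category key, so looping is not permitted
    key_values = [tuple(p.get(name) for name in key) for p in packets]
    return len(set(key_values)) == len(key_values)


# Illustrative data only: two space groups in one block.
space_groups = [
    {"_space_group.id": "1", "_space_group.name_H-M_alt": "P 21/c"},
    {"_space_group.id": "2", "_space_group.name_H-M_alt": "C 2/c"},
]
print(valid_set_current(space_groups))                        # False
print(valid_set_proposed(space_groups, ["_space_group.id"]))  # True
```

Under the present rule the two-packet category is rejected outright; under the proposed rule it is accepted because the key values differ.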


Based on that understanding of the proposal:


1. I am concerned about the proposed new constraint on other categories that may appear in the same container with a 'Set' category.  I think I understand the purpose, but I also think this will be easier to get wrong and more complicated to validate.  Moreover, it introduces an unresolved conflict with categories that really ought to be 'Sets' as they currently are defined, as the proposal itself acknowledges with respect to the AUDIT category.


2. The proposed change almost completely erases the distinction between 'Set' and 'Loop' categories.  I am not convinced that retaining the two as separate classes with such a fine distinction between them is the best course of action.


3. I am not fond of how conditional the proposed new definition text is.


4. It seems likely that all existing methods of current 'Set' items would be broken by the proposed change.



My present thinking is that changing specific 'Set' categories into bona fide 'Loop' categories would be better than making all 'Sets' loop-like without actually making them 'Loops'.  This could be reconciled with existing data files by introducing a mechanism for defaulting category key values or by allowing category keys to be omitted from category data when only one set of data from that category is presented.  I think an approach along these lines could solve the problem at hand while addressing my concerns 1-3.  I am uncertain whether a solution is possible that fully addresses my concern #4, but if we convert 'Sets' into 'Loops' only selectively, then at least we narrow the scope of the problems with methods, and perhaps also allow an incremental approach to be taken for updating dictionaries.
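
The key-defaulting idea in the paragraph above could look roughly like the following Python sketch. It is only one possible reading: the function name `fill_default_key`, the packet representation, and the placeholder value `"default"` are all hypothetical, and nothing here is an existing DDLm mechanism.

```python
from typing import Dict, List

Packet = Dict[str, str]  # one row of a category: dataname -> value


def fill_default_key(packets: List[Packet], key_name: str,
                     default: str = "default") -> List[Packet]:
    """Return copies of the packets with an omitted key set to a default.

    Omission is tolerated only when the category holds a single packet;
    with several packets the key is needed to tell them apart.
    """
    if len(packets) > 1 and any(key_name not in p for p in packets):
        raise ValueError(f"{key_name} is required when several packets are present")
    # Values already present in a packet take precedence over the default.
    return [{key_name: default, **p} for p in packets]


# A legacy-style block: one space group, no key given (illustrative data).
legacy = [{"_space_group.name_H-M_alt": "P 21/c"}]
print(fill_default_key(legacy, "_space_group.id"))
```

With this reading, an existing single-packet data file would be read as a one-row loop without being edited, while a multi-packet file would have to state its keys explicitly.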









John C. Bollinger, Ph.D.

Computing and X-Ray Scientist

Department of Structural Biology

St. Jude Children's Research Hospital


(901) 595-3166 [office]





From: ddlm-group [mailto:ddlm-group-bounces@iucr.org] On Behalf Of James Hester
Sent: Tuesday, May 24, 2016 6:42 PM
To: ddlm-group <ddlm-group@iucr.org>
Subject: [ddlm-group] Proposal to enhance the behaviour of a DDLm "Set" category: please consider


Dear DDLm group,

Please find below a proposal to add additional behaviour to DDLm Set categories.  The "Background" section provides some of the motivation for this.  In a nutshell, the proposal creates a mechanism which would allow normally single-valued datanames to take multiple values (i.e. become looped) but only within tightly specified conditions. 


Draft proposal to adjust meaning of 'Set' Categories

Version: 1  Date: 2016-05-23


It is proposed that the text accompanying the description of a 'Set'
category in the DDLm attribute dictionary (in the definition of
_definition.class) be changed as follows:

Old text:

;                 Category of items that form a set (but not a
                  loopable list). These items may be referenced
                  as a class of items in a dREL methods expression.

New text:

;                 Category of items that are usually not looped.  Items from this
                  category may only be looped if the following conditions hold:
                  (1) A category key is defined
                  (2) All other datanames appearing in the same datablock are taken from
                  categories that:
                      (i) Include a dataname with a name.linked_item_id that refers directly or
                      indirectly to the category key defined in (1)
                      (ii) Include the dataname (i) in the _category_key.name loop


Background

There have been persistent requests over the years to re-use
notionally single-valued datanames in contexts that would allow
multiple values. For example, although datablocks containing CIF
structural descriptions expect a single space group, an application
that wished to tabulate space groups together with symmetry operators
and transformations requires multiple space groups to appear in the
data block.

Simply allowing a previously single-valued dataname to optionally take
multiple values causes the meaning of those datanames whose
interpretation has implicitly depended on the assumption of a single,
overall value to become ambiguous.  For example, fractional
coordinates and reflection hkl are calculated and interpreted relative
to a particular space group and set of cell parameters. As soon as
multiple space groups are available, an unambiguous interpretation of
these items is impossible.  Therefore, DDL1 dictionaries have never
been expanded to allow looping of previously single-valued datanames.

In apparent contrast, all categories in DDL2 are notionally loopable
and are provided with a category key.  In order to reproduce the DDL1
behaviour, at the domain dictionary level a dataname is defined
("entry.id" for mmCIF) that identifies the datablock and is
constrained by the definition to have a single value. All categories
that should only have single-valued datanames are given a category key
that is a child of this dataname. For example, it is not possible to
provide multiple space groups in a single datablock using mmCIF
datanames, as the symmetry category has a key that points to entry.id,
and is thus constrained to a single value.
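
The effect of keying a category on the single-valued entry identifier can be shown with a short Python sketch. It assumes only that key values must be unique within a category and must equal the one entry id of the datablock; the function name `key_violations`, the packet representation and the entry id `"X1"` are hypothetical and used for illustration.

```python
from typing import Dict, List

Packet = Dict[str, str]  # one row of a category: dataname -> value


def key_violations(packets: List[Packet], key_name: str, entry_id: str) -> List[str]:
    """List the problems with a category whose key is a child of entry.id."""
    problems = []
    seen = set()
    for i, packet in enumerate(packets):
        value = packet.get(key_name)
        if value != entry_id:
            problems.append(
                f"packet {i}: {key_name} is {value!r}, not the entry id {entry_id!r}"
            )
        elif value in seen:
            problems.append(f"packet {i}: duplicate key value {value!r}")
        seen.add(value)
    return problems


symmetry = [
    {"_symmetry.entry_id": "X1", "_symmetry.space_group_name_H-M": "P 21/c"},
    {"_symmetry.entry_id": "X1", "_symmetry.space_group_name_H-M": "C 2/c"},
]
print(key_violations(symmetry[:1], "_symmetry.entry_id", "X1"))  # []
print(len(key_violations(symmetry, "_symmetry.entry_id", "X1")))  # 1
```

A second packet must either repeat the entry id (a duplicate key) or use a different value (not the entry id), so any category keyed this way is limited to one packet.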

At the present time the DDL1 core dictionary is being translated to
DDLm, and we have promised that datablocks written according to the
old DDL1 dictionaries will continue to be interpreted in exactly the
same way after application of aliases found in the new DDLm
dictionaries.  At the same time, we seek to integrate the DDL2
symmetry dictionary into the core dictionary, because both msCIF and
the draft magCIF dictionaries build on datanames defined within it.
As a DDL2 dictionary, the symmetry dictionary defined a looped space
group category, although the msCIF and magCIF uses of it assume a
single overall space group.  See below under 'legacy issues' for
further discussion of this.


Any change in the single-valuedness of a dataname must meet the
following practical requirements:

(1) The interpretation of existing datablocks must not change (after
transformation of datanames according to aliases)
(2) Existing software must either fail or correctly interpret
datablocks written according to the new standard.

We immediately conclude that any datanames whose interpretation relies
on an overall value for some particular dataname *may not appear* in
datablocks that have instead multiple values for that dataname.
Otherwise, a pre-existing program may read in the values of these
dependent datanames, unaware that they are to be interpreted in
conjunction with a particular value of the newly looped dataname,
leading to implementation-dependent failure or, in the worst case,
incorrect results (for example, generation of too many
symmetry-equivalent atomic positions).

Evaluation of proposed change against the requirements

(1) The interpretation of already existing datablocks is not changed
by the above modifications.  As no category key historically existed
for categories with non-looped datanames, condition (2) in the new
definition cannot be met (as no datanames pointing to the non-existent
category key could have existed) and so all datanames are interpreted
as for the old DDL1 scheme.

(2) Existing software expects a single value for the previously
unlooped dataname. When confronted with multiple values, it will
either fail (which is acceptable) or choose a particular value. As the
remainder of the datanames in the datablock did not exist when the
existing software was written, the software will not be able to
proceed to perform any calculations or retrieve information that is
liable to misinterpretation.  The one use case for which
misinterpretation is possible is that in which only information from
the single, newly-looped category is sought, for example, collection
of space-group statistics.

Legacy issues

The symmetry dictionary includes two categories, space_group_symop and
space_group_Wyckoff, that include category keys that point to the
overall space_group category key and therefore meet the requirements
of section (2) of the new definition.  It is thus possible to produce
datafiles containing multiple space groups and symmetry operator lists
that may be misinterpreted by existing software if, as proposed,
space_group_symop datanames are aliased to the symmetry_equiv category
in DDL1, violating our requirement (2).  However, as discussed above,
no other currently defined space-group-dependent datanames may appear
in such multi-spacegroup files and so the potential for
misinterpretation is restricted to applications that expect a single
space group and only deal with symmetry operators, which would appear
to be an unusual use-case.

To mitigate any ongoing problems from this legacy issue, we propose
prominently suggesting that software authors explicitly check for
multiple values when reading any items from the space group category.
Ideally, the space_group_symop category would also be renamed in the
symmetry dictionary, but such renaming causes problems for existing
software authors and would need to be conducted only after
consultation with the relevant community (e.g. the Bilbao
crystallographic server).
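
The explicit check suggested above might be as simple as the following Python sketch. The `Block` type (a dictionary mapping each dataname to either one string or a list of strings), the `single_value` helper and the `MultipleValuesError` class are hypothetical and do not correspond to any particular CIF library.

```python
from typing import Dict, List, Optional, Union

Block = Dict[str, Union[str, List[str]]]  # hypothetical parsed data block


class MultipleValuesError(ValueError):
    """Raised when a dataname assumed to be single-valued is looped."""


def single_value(block: Block, dataname: str) -> Optional[str]:
    """Return the one value of dataname, or None if it is absent.

    Fails loudly instead of silently picking one of several values.
    """
    value = block.get(dataname)
    if value is None:
        return None
    if isinstance(value, list):
        if len(value) > 1:
            raise MultipleValuesError(
                f"{dataname} has {len(value)} values; expected one"
            )
        return value[0] if value else None
    return value


block = {"_space_group.name_H-M_alt": ["P 21/c", "C 2/c"]}
try:
    single_value(block, "_space_group.name_H-M_alt")
except MultipleValuesError as err:
    print(err)
```

Failing in this way corresponds to the acceptable outcome under requirement (2): the software stops rather than choosing one space group arbitrarily.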

Note that choosing different names for the datanames contained
in the DDLm core_CIF dictionary is not a desirable option, as both
the magCIF and msCIF dictionaries base their naming schemes on
the symmetry CIF dictionary.

Future development

The proposed change slightly reduces definition proliferation by
allowing both single-valued and multiple-valued versions of a dataname
to share the same definition.  However, all categories and datanames
that depend on the single-valued dataname must still have alternative
names defined when each single-valued dependency becomes
multiple-valued, leading to proliferation of definition blocks that
add very little information. Future work would create (e.g.)  DDLm
category attributes that auto-defined category datanames based on the
contents of other categories.  Using these attributes, add-on
dictionaries could be created economically and semi-automatically.

The restriction (2) in the new definition is excessive, in that some
categories may never have depended on overall space group (e.g. audit
information) but will nevertheless be excluded from datablocks. Future
work would develop a way to list explicitly the (known) dependencies
on overall values within each category - for calculated values, this
information is already automatically extractable from dREL methods -
to allow partial relaxation of (2). In practice, we expect that
categories that 'obviously' do not depend on the overall values will
still be included in datablocks, but it would be good to capture this
information in an attribute.
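
A partial relaxation of (2) along these lines could be sketched as follows in Python. Everything here is hypothetical: no such dependency attribute exists yet, the function name `excluded_categories` is invented, and the dependency table is made-up sample input. Categories with no declared dependencies are treated conservatively as dependent.

```python
from typing import Dict, List, Set


def excluded_categories(present: List[str], looped: str,
                        keyed_to_looped: Set[str],
                        depends_on: Dict[str, Set[str]]) -> List[str]:
    """Return the categories that may not accompany a looped 'Set' category."""
    excluded = []
    for cat in present:
        if cat == looped or cat in keyed_to_looped:
            continue  # the looped category itself, or one keyed to it
        deps = depends_on.get(cat)
        # Undeclared dependencies are treated as "may depend".
        if deps is None or looped in deps:
            excluded.append(cat)
    return excluded


present = ["space_group", "space_group_symop", "audit", "atom_site", "exptl_crystal"]
depends_on = {                 # illustrative declarations only
    "audit": set(),            # declared independent of any overall value
    "atom_site": {"space_group", "cell"},
}
print(excluded_categories(present, "space_group",
                          {"space_group_symop"}, depends_on))
# ['atom_site', 'exptl_crystal']
```

Here audit is admitted because it declares no dependency on the space group, while exptl_crystal is still excluded simply because nothing has been declared for it.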


