Discussion List Archives


Re: [ddlm-group] Further discussion of proposal #2

Dear Brian and colleagues,

On 21 June 2016 at 03:03, Brian McMahon <bm@iucr.org> wrote:

(2) It seems to me that a formal approach to the distinction might be to define a Set as a category of data items that - if looped without an explicit key - assume a default value ('') of the category key item.
This obliges dictionary writers to specify a data name that plays the role of a formal key for *every* category, but it does not require data files to carry instances of every such key data name. [Or maybe it's a little more forgiving than that: "lazy" dictionary writers only need to specify key data names when real use cases demand looping of what had been expected to be single-value values; but then it is incumbent on them to stir out of their laziness and ensure that all consequent child key relationships are consistent across the new use cases that have arisen.]

The proliferation of child keys is exactly what I'd like to avoid - e.g. do we really need keys in atom_site pointing to cell and space group?  As I've said before, the real point of 'Set' categories is to provide a simplification for dictionary writers and software authors. I would like to keep the simplicity, but provide a route to gracefully remove the simplification.  So, any option that makes the 'Set' category scenario more difficult to work with (e.g. extra key definitions, extra child keys, extra things for a dictionary writer or software author to do) goes against the whole reason for having 'Set' categories, which is why my proposal #2 moved all the additional complexity into add-on dictionaries that only require engagement for non-default _audit.schema.
 

(3) The _audit.schema proposal has its attractions, though I'm not sure how it works in practice. I mean, suppose I define an "INCOMMENSURATE" schema to indicate that multiple space groups describe multiple discernible symmetries in a real atomic (quasi-)lattice, and a "TABLES" schema to indicate that this is just a list of symmetry operations in all the distinct space groups. It could be useful for validation purposes to know that "INCOMMENSURATE" also requires additional information/relations between other categories (e.g. are there different origins or orientation matrices associated with each space group?). Is _audit.schema necessary and sufficient to capture these additional requirements? If not, could it be made so? I think this is moving into the sort of thing that Simon is interested in - can we elegantly define application profiles that say "this is a single-crystal untwinned structure", "this is an incommensurate powder structure with twinning", "this is a structure refinement with its own database of neutron absorption coefficients"?

The _audit.schema proposal as it currently stands (although I'm waiting for positive responses to this) is for software to behave as follows:
(1) Always check _audit.schema. If absent or default value, software may interpret datanames in the datablock according to those definitions found in dictionaries associated with the default schema with no further runtime checks of dictionaries necessary.  This rule captures the mainstream path, under which 'Set' categories simplify our life.
(2) If _audit.schema is not default, software is only guaranteed to correctly interpret datanames if it can handle the 'Set' category child key list provided in the versions of dictionaries given in _audit_conform. This is the path that reveals the complexity hidden behind the 'Set' behaviour. The _audit_conform check is lifted by condition (4) below.
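
As a rough illustration of rules (1) and (2), the decision could be sketched in Python as below. The function name, the "Default" label for the default schema value, and the representation of a datablock as a plain dict are all assumptions made for this sketch; none of them is part of the proposal.

```python
DEFAULT_SCHEMA = "Default"  # assumed label for the default _audit.schema value


def interpretation_path(datablock, supported_schemas):
    """Decide which of rules (1) and (2) applies to a datablock.

    datablock: dict mapping datanames to values (hypothetical representation).
    supported_schemas: set of non-default schema values for which this
        software can handle the 'Set' category child key list.
    """
    schema = datablock.get("_audit.schema", DEFAULT_SCHEMA)
    if schema == DEFAULT_SCHEMA:
        # Rule (1): absent or default value, so the default-schema
        # definitions apply and no runtime dictionary checks are needed.
        return "default"
    if schema in supported_schemas:
        # Rule (2): non-default schema whose child keys the software handles.
        return "schema-aware"
    # Non-default schema that the software was not written for:
    # correct interpretation is not guaranteed.
    return "unsupported"
```

Under these assumptions a datablock with no _audit.schema at all takes the "default" path, which is the mainstream case that rule (1) is meant to keep simple.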

Consideration of your example leads me to suggest the following supporting rules:

(3) All dictionaries must indicate which _audit.schema they are associated with (a new dictionary-level DDLm tag)
(4) All dictionaries must define child keys of their looped 'Set' categories for all relevant looped categories that they import from other dictionaries. This reduces the reliance on _audit_conform in (2), and note that all dictionaries will import cif_core if they loop any 'Set' categories from cif_core.
(5) All definitions appear in one dictionary only (this is probably already a rule).
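
A dictionary checker enforcing rule (4) might look something like the following Python sketch. The function name and the data structures are hypothetical, and which categories count as "relevant" to a looped 'Set' category is simply taken as an input here, since that remains a judgement for the dictionary author.

```python
def missing_child_keys(looped_sets, relevant_categories, defined_child_keys):
    """List (category, set_category) pairs that still lack a child key.

    looped_sets: 'Set' categories that this dictionary loops,
        e.g. {"space_group"}.
    relevant_categories: dict mapping each looped 'Set' category to the
        imported looped categories it is relevant to (supplied by the
        dictionary author; an assumption of this sketch).
    defined_child_keys: set of (category, set_category) pairs for which
        the dictionary already defines a child key.
    """
    missing = []
    for set_cat in sorted(looped_sets):
        for cat in sorted(relevant_categories.get(set_cat, ())):
            if (cat, set_cat) not in defined_child_keys:
                missing.append((cat, set_cat))
    return missing
```

An empty result would mean the dictionary satisfies rule (4) for the categories it was told about; the category names used to exercise it (atom_site, geom_bond) are only examples.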

Now let's flesh out your example: suppose that INCOMMENSURATE and TABLES both belong to the same _audit.schema as they both involve only looping space_group.  We have a TABLES dictionary which, by looping 'space_group', is required by rule (4) to add child keys to all cif_core categories for which space_group is relevant. Our INCOMMENSURATE dictionary is forced to do the same when it loops 'space_group' (by rule (5), in practice INCOMMENSURATE would import TABLES).  Therefore, software which is written expecting the 'space group looped' schema will not misinterpret datablocks based on either dictionary.  Of course, it will also *not* be able to distinguish the INCOMMENSURATE and TABLES cases using _audit.schema, although note that the space group tabulation software will be able to correctly extract space group tabulation material from an incommensurate file - this is the behaviour that we enable with the _audit.schema idea.

A more complex example: suppose that Herbert subsequently comes along with his 'Variant' schema.  A dictionary corresponding to [Variant] would define 'variant' child keys for almost all cif_core looped categories. A dictionary corresponding to [Variant Space_group] would additionally provide a variant key to the space_group category and space_group keys as before to all cif_core definitions.  So, what happens to looped categories defined in the INCOMMENSURATE dictionary but not present in the [Variant Space_group] dictionary?  Well, the incommensurate dictionary conforms to the [space_group] schema, so is not compatible with the [Variant space_group] schema.  A *further* dictionary must be defined which imports the INCOMMENSURATE and Variant dictionaries and adds Variant child keys to all the incommensurate loop categories.

As a side note, programs written for schema [a b c] automatically handle all combinations of a, b and c, and software that additionally/instead examines the dictionaries provided in _audit_conform can provide universal schema compatibility for all non-key datanames that it was programmed to expect.  For this reason I believe that dREL methods can be made schema-independent.
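
The compatibility claim for a program written for schema [a b c] amounts to a subset test, which could be sketched in Python as follows. Treating a schema as a set of component names, with the empty set standing for the default schema, is an assumption of this sketch rather than something the proposal defines.

```python
def handles(program_schema, datablock_schema):
    """True if every component of the datablock schema is known to the program.

    Both arguments are collections of schema component names (a hypothetical
    representation); an empty datablock schema stands for the default schema.
    """
    return set(datablock_schema) <= set(program_schema)
```

On this reading, software written for [Variant Space_group] also reads plain [Space_group] and default datablocks, whereas software written only for [Space_group] cannot safely read a [Variant Space_group] datablock.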

Anyway, we end up with many dictionaries, each corresponding to a combination of schemas and full of child keys and perhaps a few original categories.  There is therefore perhaps scope for us to define a virtual dictionary creation protocol where e.g. the dictionary header just lists all of the imported looped categories that require a child key and the key names are generated automatically. I would prefer to leave this sort of discussion to a later date, if and when dictionary proliferation becomes a problem.
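
To give a flavour of what automatic key generation in such a virtual dictionary might involve, here is a minimal Python sketch. The naming pattern `_<category>.<set_category>_<key>`, the function name, and the "parent" field are purely hypothetical; no such convention has been agreed.

```python
def generate_child_keys(set_category, imported_categories, parent_key="id"):
    """Generate child key names for each imported looped category.

    set_category: the looped 'Set' category, e.g. "space_group".
    imported_categories: looped categories listed in the dictionary header
        as requiring a child key.
    parent_key: object name of the key in the 'Set' category (assumed "id").
    """
    parent = f"_{set_category}.{parent_key}"
    return {
        cat: {
            # Hypothetical naming pattern for the generated child key.
            "name": f"_{cat}.{set_category}_{parent_key}",
            # The key dataname in the looped 'Set' category that it points to.
            "parent": parent,
        }
        for cat in imported_categories
    }
```

For example, calling it with "space_group" and a header that lists atom_site would produce a child key named _atom_site.space_group_id pointing at _space_group.id.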


(4) Probably an obtuse question, but is it possible to retain in the DDLm version of the core a SYMMETRY category that is a Set, and a separate SPACE_GROUP category that is a Loop? Hardly elegant, but a way of owning up to the historical mistake? Then the relationship between the different datanames would not be through the alias mechanism, but rather by some dREL transformation?

The issue is that SPACE_GROUP is used in practice as a Set (it'd be good if you could check IUCr archives for statistics on this) and is therefore assumed to provide global information, even though such behaviour for a Loop category is nowhere specified (yet).  On the other hand, if space_group *is* looped in a datafile a whole host of categories become ambiguous.  While this objective loss of meaning *might* be enough to stop users attempting to mix looped space groups and e.g. atom_site lists, as a standards body we have to specify how to handle such cases or better still make sure it never happens. As additional fuel to the fire, magCIF extends SPACE_GROUP based on long-entrenched code in the magnetic community (assuming Set behaviour) *and* there have been requests from other quarters to preserve SPACE_GROUP loopability.  Believe me, I'd much rather do as you suggest but that would be simply ignoring the definitional problems that we have.


(5) So I've not commented specifically on the 'Global' proposal below. As I understand it, the change in name is designed to make clearer the circumstances in which, as it were, you want to force a category not to loop its values. If 'globality' is indeed the only reason that you would enforce such a constraint, and if that helps programmers to understand what's going on, I'd be in favour of it; but I want to think some more about it before committing myself to that first opinion!

Brian




--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
