Re: [ddlm-group] Second proposal to allow looping of 'Set' categories

Dear All,

I have commented on John's comments below.

On 15 June 2016 at 07:12, Bollinger, John C <John.Bollinger@stjude.org> wrote:
Dear all,

I have waxed rather long-winded, so here's the executive summary: as proposed, _audit.schema would create a modest data design liability in exchange for a modest practical advantage.  It is moreover inadequate to cover some plausible future changes.  The existing audit_conform category, though currently little used, could solve the same problem and cover *all* future changes.

Please find detailed comments inline below.  Because this is already long, I will again follow up with additional comments in a separate message.
 
I guess my executive summary is that I do not see it as inadequate to cover some plausible future changes, and that a modest practical advantage is better than none.
 
Regards,

John


> On Saturday, June 11, 2016 3:32 AM, James Hester wrote:
>> On 11 June 2016 at 00:00, Bollinger, John C <John.Bollinger@stjude.org> wrote:
> [...]
>> At first I thought the key idea there was that CIF data files that make use of loopability of Set categories should affirmatively declare that they are doing so, on a category-by-category basis.  Perhaps that was indeed the intent, but CIF data files express that same thing more effectively and less redundantly by simply providing the looped data.  Use of an additional item provides no advantage with respect to interpreting data files, and especially not with respect to existing software avoiding misinterpretation of new data files.
>
> I agree that the new dataname provides no semantic advantage in the sense that it simply summarises the information available in the datablock. However, it is intended to provide a considerable *practical* advantage to CIF readers. Consider: all we are doing is asking software authors to adjust their software to read a single extra dataname and check the value. Without this dataname, they would have to check that all unlooped categories were indeed unlooped, or at least those that they know from the dictionary or from their own understanding might one day affect the looped categories that they read. As a programmer I know which I would prefer, and bear in mind that many programmers of CIF reading software have a primary focus elsewhere, and don't want to spend days or even hours rewriting the CIF input portion. Checking a dataname and value - that is easy.


I'm inclined to believe that the practical advantage is less than you suppose.  Well-built software will check whether its expectations are satisfied regardless of any assertion to that effect embedded in the data.  For example, suppose a program anticipates that multiple space_groups may be presented and wants to reject CIFs for which that is the case.  That program should be prepared for the possibility that the value provided by _audit.schema or assumed based on its absence does not describe the true state of the file.  And software that wants to be maximally accepting will do the same, for the value(s) of _audit.schema could describe the *potential* for, say, multiple space groups, even where in actuality a given block expresses only one.  And that's not so expensive anyway, for most CIF parsers I've written or seen would one way or another alert the host program when it tried to access multi-valued data as if there could be only one value, even if the host program wasn't specifically watching for that possibility.

I think it is unreasonable to expect scientific software writers to perform comprehensive validation on files. While there is a good chance that most software will detect an unexpectedly looped dataname, there is a lower chance that the same software will detect that the combination of datanames that should form a key in a loop is not actually a key anymore due to the addition of an extra key dataname.  Two examples from a Google search on GitHub for 'atom_site':

Example 1
========

https://github.com/atztogo/cogue/blob/master/cogue/interface/cif.py

Now I am not interested in criticising the approach taken by these authors: they have a particularly well-laid-out CIF in mind, one with an empty line or another loop after every loop and key-value pairs on a single line, and the code will mostly fail if these assumptions are not met, which is better than silently getting the wrong value.  My point is that this software reads the symmetry operators and applies them to the atom sites to construct the unit cell. There is no check for uniqueness of the operators, or for _symmetry_space_group_name etc., and so this little program will calculate incorrect unit cells if space groups are looped.  It is just conceivable that such software efforts could be encouraged to read _audit.schema and bail if necessary (cut and paste four lines: put in the dataname and, if the value is not equal to 'Structural', exit; that would be one minute's work); it is less certain that they would be persuaded to check all loop keys for uniqueness.
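To give a concrete idea of the scale of that change, here is a minimal sketch of the sort of guard described above. It assumes the CIF has already been read into a plain Python dictionary mapping datanames to values by whatever reader the program already uses; the function name and that dictionary representation are illustrative assumptions, not code from the project above.

    # Hedged sketch only: 'block' is assumed to be a plain dict mapping
    # datanames to string values, produced by an existing CIF reader.
    def check_schema(block, supported=('Structural',)):
        # An absent _audit.schema is taken to imply the default
        # 'Structural' schema, as in the proposal under discussion.
        schema = block.get('_audit.schema', 'Structural')
        if schema not in supported:
            raise SystemExit("Unsupported _audit.schema value: %r" % schema)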

Example 2
=========

So let's look at some popular 'serious' software. diffpy (www.diffpy.org) is a large project based at Brookhaven which at one time was the recipient of US government largesse. Looking at https://github.com/diffpy/diffpy.Structure/blob/master/diffpy/Structure/Parsers/P_cif.py it becomes clear that there is never any check that the atom label is unique (follow the call through self.stru.AddNewAtom and then the Atom constructor). So this code would also fail in the face of an additional key dataname in the atom_site loop (e.g. Variant).  Again, this is in no way a criticism of the software; rather it shows that CIF software authors feel no need to check that the promises made in the standard are kept.
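For comparison, the check that is missing is itself only a few lines. Here is a hedged sketch (not taken from diffpy) of verifying that a nominated key dataname, _atom_site.label by way of example, really does take a distinct value on every row; the list-of-dicts row representation is an assumption made purely for illustration.

    from collections import Counter

    # Hedged sketch: 'rows' is assumed to be a list of dicts, one per loop row.
    def assert_unique_key(rows, key='_atom_site.label'):
        counts = Counter(row[key] for row in rows)
        duplicates = [value for value, n in counts.items() if n > 1]
        if duplicates:
            raise ValueError("%s is not a unique key; repeated values: %s"
                             % (key, ', '.join(map(str, duplicates))))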

If these examples do not convince you that (a) dictionary-aware CIF software is rare and (b) the effect of newly-looped Set categories will not be detected in a significant number of cases by a significant amount of software, then I can continue to produce more (I could look at GSAS-II, PyMOL, Jmol, ...).


I generally favor having a single source of truth for every fact, and there is already another, more fundamental source of truth for whether multiple values are provided for any given item in any given data block.  I don't imagine I would bother checking _audit.schema at all for anything I write.  It would be more work, not less, albeit not much more.  Of course, just because I wouldn't use it doesn't mean that nobody would or should; I'm just saying that I don't see it as the clear win that you seem to.

You are overlooking the fact that software can legitimately never need to read a 'Set' dataname and still be tripped up, as in Example 1 above. The space group is never read, but the symmetry operators are, so the addition of another key to the symmetry operator list would be missed and many more operators than are present for a single space group would be applied.  You are indeed thorough if your software is prepared to check all 'Set' datanames that a given category may one day depend upon, but I suspect you instead check that keys are really unique for each row.  I think the examples above give an indication that this is far from universal.
 

Additionally, "asking software authors to adjust their software" is akin to, albeit less disruptive than, adding new names for existing items.  It supports the perception of many developers that CIF is a moving target.

Indeed, checking a single additional dataname (whether audit_conform or _audit.schema) is an order of magnitude less intrusive than saying "all of these datanames can now be written this way as well" (as we are planning to do with cif_core), and even then I would dearly love to be able to get away without asking for it. But I think we have convinced ourselves that that would be impossible given our other constraints.


> Also built-in to proposal #2 is robustness against dictionary expansion. If, say, we discover the concept of twinning, we can define a new 'Set' category that lists twin individuals.  This category is by definition single-valued (i.e. only a single individual, i.e. no twinning) for all values of _audit.schema previously defined, and so no change in software past, present or future would be required to cope with blocks that had more than a single twin individual. However, without '_audit.schema', any time a new 'Set' category is defined all CIF-reading software would need to be updated to additionally check that this category was not looped.  So I would emphasise the strong practical benefits of the _audit.schema part of the proposal.


I think you're mixing concepts there.  A category that has not yet been defined is not necessarily a Set.  All we can say about it for the purposes of the present discussion is that no defined category has an undefined category's key as part of its own key.  Defining a new category as a Loop presents a problem only when that requires other categories' keys to be modified.  Surely the case where other categories' keys do not need to be modified has been exercised many times in the past with no ill effects, especially in the various DDL2 dictionaries.

I indeed assume that undefined categories that have an effect on already-defined categories are always 'Set' categories, that is, have constant values for already-defined categories. This is based on fundamental considerations. If the non-key datanames in a currently-defined category take values that depend on a hidden key in addition to the values of the explicit keys, then it is possible for identical values of the currently-defined keys to produce different row contents if the 'hidden' key had a non-constant value. This would be a data modelling error, in which case we abandon the old category and define a new one.  The implication is that software written using the old category is also incorrect. 

As an example of this, suppose that, by analogy to X-rays, we tabulate neutron scattering cross-section against atomic number.  Later on, we discover that both atomic number and atomic weight combine to determine scattering cross-section. There is no way that any previous loops could be transformed by the simple addition of atomic weight as an extra column, as the scattering values are all different.  So in this case we abandon our old datanames and define new ones that describe our updated view of the world, with appropriate transformations provided between the two.

If you have a counter-example where a newly-defined Loop category (not 'Set' category) can add keys to a previously-defined 'Loop' category, I would be very interested to discuss it as I don't see how it can exist without fundamentally altering the meaning of the rest of the datanames.

It is true, however, that we might someday want to introduce a new Loop category that does require other categories' keys to be modified.  The twin component example is apt here.  But suppose we afterward introduce another new category, and it requires some of the same categories' keys to be modified.  As proposed, _audit.schema cannot distinguish between data files using one of the new categories, those using the other, and those using both, at least with respect to any given category that ends up with keys to both.  That leaves us right back where we are now with respect to software that can handle one, but not the other.

I believe in your first sentence you mean that we introduce a new category that was notionally a 'Set' category with a default value in previous datafiles?  I think the proposal should handle the case you describe, as the intention is that _audit.schema values will correspond to the list of all those Set categories that are now looped. So if we have twinning and variants as the two newly-looped Set categories, there would be a separate enumerated value for 'Variants', 'Twinning' and 'Variants + Twinning'.  We would not expect all software to handle all combinations, although we could provide automatic transformation tools.
 

It seems that the way around that would be to use a code instead of a category name to disambiguate, or maybe additional data, but at that point we've come around pretty close to audit_conform.  And indeed, audit_conform already exists and could do the same job, if only CIF writers would use it.  I don't see any reason to expect that _audit.schema would be more used than audit_conform.

It is worth noting, however, that audit_conform is ironically and inexplicably defined as a *Set* category in the DDLm version of the core.  Both mmCIF and the DDL1 core define audit_conform as a Loop category.

I think I can see the outline of your alternative proposal. As newly-looped Set categories would be described in a separate dictionary, audit_conform.dictionary would be a reasonable signal that the looping is taking place. I will address this in my reply to your next email.

I agree that audit_conform needs a closer look as far as being a Set or Loop category. As I wrote to Simon, I surmise that the intention was that 'import' now covers dictionary merging in a disciplined way, and so multiple dictionaries should never need to appear.  We also need to have due regard to the architecture of dREL and the notion of a 'Root' category. It is nevertheless possible to imagine two non-overlapping dictionaries (e.g. core_cif and planetary_exploration), and it seems pointless to create (where?) a trivial dictionary that simply imports both.  Indeed, if audit_conform is looped, then we can define it to mean that each dictionary is considered to be an import with appropriate settings for duplicates.
 


> [...]
>
>> () The proposal has no particular provision for accommodating the implicit relationships between each Set category and every other category.
>>
>> I’m talking here about the relationships that arise simply by virtue of categories being Sets -- all other items in the same container are at least potentially associated with every set that appears in the container.  These relationships can be expressed in English in the form "The FOO appearing in the same data block".  In effect, DDLm Sets are like global variables.
>>
>> We rely on this all over the place -- for example the REFLNS (Set) and REFLN (Loop) categories rely on the DIFFRN (Set) category to provide the associated experimental details.  If DIFFRN were looped, then both of these categories (and potentially many others) would need child keys, too.
>
> Yes, this is true and is exactly the problem we're trying to deal with. The intention is that these relationships are made explicit at the point that a looped application of the Set category is defined, at which point all Set or Loop categories that depend on the newly-looped Set category have their child keys defined, but in a separate dictionary related to the application to avoid cluttering the main dictionary with extra, rarely-used keys everywhere.  We would have to go to considerable effort now to add all those keys, for no benefit until somebody comes up with a use case. I would rather that those with an unusual use-case make that effort when creating the new dictionary.


I don't have any objection to putting all the additional keys in a separate dictionary if we indeed go with the plan of keeping existing Sets as Sets, but making them provisionally allow multiple values.  Nor do I object to deferring writing definitions for those keys until we discover a need for them.  These are both advantages of proposal #2.

I guess my point was that we cannot rely on future changes that would interact with this proposal to be limited to requiring just one Set to be made loopable, nor even just one Set category and its children.  This is not inherently a flaw in the proposal, and I bring it up only to say that if there were an alternative that handled this more gracefully -- and I'm not necessarily saying there is -- then that would be a point in its favor.

I think that multiple looped Set categories are easily handled by this proposal, and, as stated above, that newly-defined Loop categories will almost always have no implications for the keys of already-defined Loop categories. 
 
>> Overall, any proposal that requires COMCIFS’s or a DMG’s intervention to enable new usages of existing data names, and that causes such changes to have global scope, as proposal #2 does, destabilizes CIF by increasing the frequency of disruptive changes.  I think it would be better to find an alternative that solves the problem once for all.  Adopting such an approach probably would mean relinquishing some of the control that the present proposal would afford us, but I think that’s an essential aspect of the problem space: the more control we exert over what data can be expressed, the more occasions will arise when we need to make changes to allow more or different data expressions.
>
> My hope was that the single '_audit.schema' dataname would once and for all remove the disruptive effect of such looped 'Set' changes, at the one-off, 15 minute time cost of programming a check for an additional dataname.  Do you disagree that _audit.schema would minimise disruption?


I think your idea is that each piece of software that makes use of _audit.schema would have an internal list of the values of that item that it can cope with.  I agree that implementing a check against such a list would be fairly quick.  Inasmuch as most software would start with an empty list, implementing the list itself would be very quick.  Authors who never want to consider any future changes of the type with which we're now struggling, and who want to rely on _audit.schema to recognize data files that rely on such changes, will not need to adapt to future changes.  In that sense, proposal #2 would minimize disruption for those authors.
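As a rough sketch of what such an internal list might amount to (the names and values below are assumptions for illustration, not part of the proposal):

    # Sketch of a per-program list of _audit.schema values the software can
    # cope with; most programs would presumably start with just the default.
    SUPPORTED_SCHEMAS = {'Structural'}   # later perhaps {'Structural', 'Variants'}

    def schema_is_supported(block):
        # Absence of _audit.schema is assumed to imply the default schema.
        return block.get('_audit.schema', 'Structural') in SUPPORTED_SCHEMAS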

Proposal #2 would have no particular benefit, but also no significant cost for those like me who wouldn't rely on it anyway, whether because they detect unsupported data files by other means or because they simply ignore the whole issue.

But my point was different.  I was arguing that dictionaries that rely on this approach (the effective dictionaries arising from merging the base definitions with the secondary dictionaries containing all the extra keys) are likely to go through many incompatible revisions along this path because each Set that acquires a category key requires such a change.  Upon further reflection, however, I am not confident that this is a problem we can solve.

I am not sure it is a problem that needs to be solved; it is sufficient to convince ourselves that we are not making useful things impossible. Yes, anybody who relies on a looped 'Set' category had better pay close attention to the dictionary version provided in _audit.conform. Yes, there is a combinatorial explosion possible. So there could be the 'Variant' dictionary adding a variant_id to many categories, and the twinning dictionary adding a twin_id to several categories. In order for the variant + twin system to work, a further (small) dictionary would have to be defined that imported cif_core, variant, and twin dictionaries, then added (at least) variant_id to the twin_individual category. My philosophy is that each of these applications is niche (thus they have been kicked down the road for 20 years) and so the onus is on those that want this functionality to do the work.  It is also worth noting that each of the variations may be mechanically converted back to the standard form given only the list of looped 'Set' categories, and the dictionaries.
