Re: [ddlm-group] Second proposal to allow looping of 'Set' categories

Dear all,

I have waxed rather long-winded, so here's the executive summary: as proposed, _audit.schema would create a modest data design liability in exchange for a modest practical advantage.  It is moreover inadequate to cover some plausible future changes.  The existing audit_conform category, though currently little used, could solve the same problem and cover *all* future changes.

Please find detailed comments inline below.  Because this is already long, I will again follow up with additional comments in a separate message.



> On Saturday, June 11, 2016 3:32 AM, James Hester wrote:
>> On 11 June 2016 at 00:00, Bollinger, John C <John.Bollinger@stjude.org> wrote:
> [...]
>> At first I thought the key idea there was that CIF data files that make use of loopability of Set categories should affirmatively declare that they are doing so, on a category-by-category basis.  Perhaps that was indeed the intent, but CIF data files express that same thing more effectively and less redundantly by simply providing the looped data.  Use of an additional item provides no advantage with respect to interpreting data files, and especially not with respect to existing software avoiding misinterpretation of new data files.
> I agree that the new dataname provides no semantic advantage in the sense that it simply summarises the information available in the datablock. However, it is intended to provide a considerable *practical* advantage to CIF readers. Consider: all we are doing is asking software authors to adjust their software to read a single extra dataname and check the value. Without this dataname, they would have to check that all unlooped categories were indeed unlooped, or at least those that they know from the dictionary or from their own understanding might one day affect the looped categories that they read. As a programmer I know which I would prefer, and bear in mind that many programmers of CIF reading software have a primary focus elsewhere, and don't want to spend days or even hours rewriting the CIF input portion. Checking a dataname and value - that is easy.

I'm inclined to believe that the practical advantage is less than you suppose.  Well-built software will check whether its expectations are satisfied regardless of any assertion to that effect embedded in the data.  For example, suppose a program anticipates that multiple space_groups may be presented and wants to reject CIFs for which that is the case.  That program should be prepared for the possibility that the value provided by _audit.schema or assumed based on its absence does not describe the true state of the file.  And software that wants to be maximally accepting will do the same, for the value(s) of _audit.schema could describe the *potential* for, say, multiple space groups, even where in actuality a given block expresses only one.  And that's not so expensive anyway, for most CIF parsers I've written or seen would one way or another alert the host program when it tried to access multi-valued data as if there could be only one value, even if the host program wasn't specifically watching for that possibility.
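To illustrate the kind of direct check I have in mind, here is a minimal sketch in Python. The in-memory model (a data block as a mapping from data name to the list of values parsed for it) and the helper names are assumptions made for illustration only; they are not the API of any particular CIF parser.

```python
from typing import Dict, List, Set

# Hypothetical model: each data name maps to the values parsed for it,
# one entry per loop row (or a single entry for an unlooped item).
Block = Dict[str, List[str]]


def category_of(data_name: str) -> str:
    """Return the category part of a DDLm-style name such as '_space_group.id'."""
    return data_name.lstrip("_").split(".", 1)[0].lower()


def multi_valued_categories(block: Block, watched: Set[str]) -> Set[str]:
    """Return the watched categories that actually carry more than one row."""
    return {
        category_of(name)
        for name, values in block.items()
        if category_of(name) in watched and len(values) > 1
    }


def reject_if_looped(block: Block, watched: Set[str]) -> None:
    """Raise if any category the program assumes to be single-valued is looped."""
    looped = multi_valued_categories(block, watched)
    if looped:
        raise ValueError("unexpected looped categories: " + ", ".join(sorted(looped)))
```

A program that cannot cope with multiple space groups would call reject_if_looped(block, {"space_group"}) before reading, whatever _audit.schema says or implies. The check looks only at the data actually present, so it cannot disagree with the file.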

I generally favor having a single source of truth for every fact, and there is already another, more fundamental source of truth for whether multiple values are provided for any given item in any given data block.  I don't imagine I would bother checking _audit.schema at all for anything I write.  It would be more work, not less, albeit not much more.  Of course, just because I wouldn't use it doesn't mean that nobody would or should; I'm just saying that I don't see it as the clear win that you seem to.

Additionally, "asking software authors to adjust their software" is akin to, albeit less disruptive than, adding new names for existing items.  It supports the perception of many developers that CIF is a moving target.

> Also built into proposal #2 is robustness against dictionary expansion. If, say, we discover the concept of twinning, we can define a new 'Set' category that lists twin individuals.  This category by definition is single-valued (i.e. only a single individual, i.e. no twinning) for all values of _audit.schema previously defined and so no change in software past, present or future would be required to cope with blocks that had more than a single twin individual. However, without '_audit.schema', any time a new 'Set' category is defined all CIF-reading software would need to be updated to additionally check that this category was not looped.  So I would emphasise the strong practical benefits of the _audit.schema part of the proposal.

I think you're mixing concepts there.  A category that has not yet been defined is not necessarily a Set.  All we can say about it for the purposes of the present discussion is that no defined category has an undefined category's key as part of its own key.  Defining a new category as a Loop presents a problem only when that requires other categories' keys to be modified.  Surely the case where other categories' keys do not need to be modified has been exercised many times in the past with no ill effects, especially in the various DDL2 dictionaries.

It is true, however, that we might someday want to introduce a new Loop category that does require other categories' keys to be modified.  The twin component example is apt here.  But suppose we afterward introduce another new category, and it requires some of the same categories' keys to be modified.  As proposed, _audit.schema cannot distinguish between data files using one of the new categories, those using the other, and those using both, at least with respect to any given category that ends up with keys to both.  That leaves us back right where we are now with respect to software that can handle one, but not the other.

It seems that the way around that would be to use a code instead of a category name to disambiguate, or maybe additional data, but at that point we've come around pretty close to audit_conform.  And indeed, audit_conform already exists and could do the same job, if only CIF writers would use it.  I don't see any reason to expect that _audit.schema would be more used than audit_conform.

It is worth noting, however, that audit_conform is ironically and inexplicably defined as a *Set* category in the DDLm version of the core.  Both mmCIF and the DDL1 core define audit_conform as a Loop category.
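As a rough sketch of how audit_conform could do this job, assuming it is treated as a Loop category (as in mmCIF and the DDL1 core), a reader could compare the declared dictionaries against the ones it understands. The block model, the function names, and the extension dictionary names in the usage note are hypothetical; only the data name _audit_conform.dict_name is taken from the core dictionary.

```python
from typing import Dict, List, Set

# Hypothetical model: each data name maps to the values parsed for it,
# one entry per loop row.
Block = Dict[str, List[str]]


def declared_dictionaries(block: Block) -> List[str]:
    """Return the dictionary names listed in audit_conform, one per row."""
    return [name.lower() for name in block.get("_audit_conform.dict_name", [])]


def unsupported_dictionaries(block: Block, understood: Set[str]) -> List[str]:
    """Return the declared dictionaries that this program does not understand."""
    return [name for name in declared_dictionaries(block) if name not in understood]
```

Because each extension dictionary is declared by its own row, a file that uses one hypothetical extension (say "twin_ext.dic"), another ("other_ext.dic"), or both is distinguishable, and a program that understands only one of them can tell exactly which files it can handle.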

> [...]
>> () The proposal has no particular provision for accommodating the implicit relationships between each Set category and every other category.
>> I’m talking here about the relationships that arise simply by virtue of categories being Sets -- all other items in the same container are at least potentially associated with every set that appears in the container.  These relationships can be expressed in English in the form "The FOO appearing in the same data block".  In effect, DDLm Sets are like global variables.
>> We rely on this all over the place -- for example the REFLNS (Set) and REFLN (Loop) categories rely on the DIFFRN (Set) category to provide the associated experimental details.  If DIFFRN were looped, then both of these categories (and potentially many others) would need child keys, too.
> Yes, this is true and is exactly the problem we're trying to deal with. The intention is that these relationships are made explicit at the point that a looped application of the Set category is defined, at which point all Set or Loop categories that depend on the newly-looped Set category have their child keys defined, but in a separate dictionary related to the application to avoid cluttering the main dictionary with extra, rarely-used keys everywhere.  We would have to go to considerable effort now to add all those keys, for no benefit until somebody comes up with a use case. I would rather that those with an unusual use-case make that effort when creating the new dictionary.

I don't have any objection to putting all the additional keys in a separate dictionary if we indeed go with the plan of keeping existing Sets as Sets, but making them provisionally allow multiple values.  Nor do I object to deferring writing definitions for those keys until we discover a need for them.  These are both advantages of proposal #2.

I guess my point was that we cannot rely on future changes that would interact with this proposal to be limited to requiring just one Set to be made loopable, nor even just one Set category and its children.  This is not inherently a flaw in the proposal, and I bring it up only to say that if there were an alternative that handled this more gracefully -- and I'm not necessarily saying there is -- then that would be a point in its favor.

>> Overall, any proposal that requires COMCIFS’s or a DMG’s intervention to enable new usages of existing data names, and that causes such changes to have global scope, as proposal #2 does, destabilizes CIF by increasing the frequency of disruptive changes.  I think it would be better to find an alternative that solves the problem once for all.  Adopting such an approach probably would mean relinquishing some of the control that the present proposal would afford us, but I think that’s an essential aspect of the problem space: the more control we exert over what data can be expressed, the more occasions will arise when we need to make changes to allow more or different data expressions.
> My hope was that the single '_audit.schema' dataname would once and for all remove the disruptive effect of such looped 'Set' changes, at the one-off, 15 minute time cost of programming a check for an additional dataname.  Do you disagree that _audit.schema would minimise disruption?

I think your idea is that each piece of software that makes use of _audit.schema would have an internal list of the values of that item that it can cope with.  I agree that implementing a check against such a list would be fairly quick.  Inasmuch as most software would start with an empty list, implementing the list itself would be very quick.  Authors who never want to consider any future changes of the type with which we're now struggling, and who want to rely on _audit.schema to recognize data files that rely on those, will have no need to adapt to future changes.  In that sense, proposal #2 would minimize disruption for those authors.
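For concreteness, the check described above might look something like the following sketch. It assumes that files making no use of looped Sets simply omit _audit.schema, so that an empty list of supported values means "default files only"; the block model and the names are hypothetical, and the actual values the item would take are whatever the proposal defines.

```python
from typing import Dict, FrozenSet, List

# Hypothetical model: each data name maps to the values parsed for it.
Block = Dict[str, List[str]]

# Assumption: a program that supports no looped-Set usage starts with an
# empty list of _audit.schema values it can cope with.
SUPPORTED_SCHEMAS: FrozenSet[str] = frozenset()


def can_process(block: Block, supported: FrozenSet[str] = SUPPORTED_SCHEMAS) -> bool:
    """Accept a block only if every declared _audit.schema value is supported.

    A block that does not declare _audit.schema at all is accepted, on the
    assumption that absence implies the default, unlooped interpretation.
    """
    declared = block.get("_audit.schema", [])
    return all(value in supported for value in declared)
```

Note that this only tests what the file asserts about itself; it does not establish that the data actually conform to the assertion.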

Proposal #2 would have no particular benefit, but also no significant cost for those like me who wouldn't rely on it anyway, whether because they detect unsupported data files by other means or because they simply ignore the whole issue.

But my point was different.  I was arguing that dictionaries that rely on this approach (the effective dictionaries arising from merging the base definitions with the secondary dictionaries containing all the extra keys) are likely to go through many incompatible revisions along this path because each Set that acquires a category key requires such a change.  Upon further reflection, however, I am not confident that this is a problem we can solve.


