Re: [ddlm-group] Second proposal to allow looping of 'Set' categories

Dear John and DDLm-group:

I have commented inline below.


On 11 June 2016 at 00:00, Bollinger, John C <John.Bollinger@stjude.org> wrote:

Dear all,

 

These distinguishing details of James’s new proposal, "Proposal #2", stand out to me (comments interspersed):

 

- It depends on a new data name, which must be assumed to be well-known to all CIF processors, regardless of which dictionary, if any, actually contains its definition.

 

- The proposal gives COMCIFS (or our delegate) the responsibility to maintain a controlled vocabulary for values of the new data name.

 

- The value, if any, associated with the new data name modulates the definitions of other items appearing in the same data block (or save frame?).

 

- New data names representing category keys and child keys must be created in conjunction with maintaining the vocabulary for the new data name.

 

At first I thought the key idea there was that CIF data files that make use of loopability of Set categories should affirmatively declare that they are doing so, on a category-by-category basis.  Perhaps that was indeed the intent, but CIF data files express that same thing more effectively and less redundantly by simply providing the looped data.  Use of an additional item provides no advantage with respect to interpreting data files, and especially not with respect to existing software avoiding misinterpretation of new data files.


I agree that the new dataname provides no semantic advantage in the sense that it simply summarises the information available in the datablock. However, it is intended to provide a considerable *practical* advantage to CIF readers. Consider: all we are doing is asking software authors to adjust their software to read a single extra dataname and check the value. Without this dataname, they would have to check that all unlooped categories were indeed unlooped, or at least those that they know from the dictionary or from their own understanding might one day affect the looped categories that they read. As a programmer I know which I would prefer, and bear in mind that many programmers of CIF reading software have a primary focus elsewhere, and don't want to spend days or even hours rewriting the CIF input portion. Checking a dataname and value - that is easy.

Also built into proposal #2 is robustness against dictionary expansion. If, say, we discover the concept of twinning, we can define a new 'Set' category that lists twin individuals.  This category is by definition single-valued (i.e. only a single individual, so no twinning) for all previously defined values of _audit.schema, and so no change in software, past, present or future, would be required to cope with blocks that had more than a single twin individual. However, without '_audit.schema', any time a new 'Set' category is defined, all CIF-reading software would need to be updated to additionally check that this category was not looped.  So I would emphasise the strong practical benefits of the _audit.schema part of the proposal.
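
To make the comparison concrete, here is a rough Python sketch of the two reader strategies. It is only an illustration under stated assumptions: the in-memory block layout, the function names, the schema value "Base", the behaviour when _audit.schema is absent, and the data name _twin_individual.id are all hypothetical and not taken from any dictionary.

```python
# Hypothetical in-memory form of a parsed data block: each data name maps to
# the list of values found for it (a single value when the item is unlooped).
Block = dict[str, list[str]]

# Assumption: the schema values this reader was written to handle, and the
# value assumed when _audit.schema is absent. "Base" is a placeholder.
SUPPORTED_SCHEMAS = {"Base"}
DEFAULT_SCHEMA = "Base"


def accept_by_schema(block: Block) -> bool:
    """Proposal #2 style: read one data name and compare its value."""
    values = block.get("_audit.schema", [DEFAULT_SCHEMA])
    return len(values) == 1 and values[0] in SUPPORTED_SCHEMAS


def accept_by_inspection(block: Block, known_set_categories: set[str]) -> bool:
    """Without _audit.schema: confirm every known Set category is unlooped.

    known_set_categories must be extended by hand whenever a dictionary
    introduces a new Set category.
    """
    for name, values in block.items():
        category = name.lstrip("_").split(".")[0]
        if category in known_set_categories and len(values) > 1:
            return False
    return True


# A block that loops a Set category invented after the reader was written.
twinned = {
    "_audit.schema": ["Twinned"],        # hypothetical schema value
    "_twin_individual.id": ["1", "2"],   # hypothetical looped Set category
    "_cell.length_a": ["5.4"],
}

old_knowledge = {"cell", "diffrn"}       # predates the twin category
print(accept_by_schema(twinned))                     # reader declines the block
print(accept_by_inspection(twinned, old_knowledge))  # reader wrongly accepts it
```

The inspection-based reader only rejects the block once "twin_individual" has been added to its list of Set categories, which is exactly the software update that the schema check avoids.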
 

I later decided that the primary effect of requiring looped-Set usage to be explicitly declared would be to maintain central control over which Set categories can be presented as multi-packet loops.  Leaving aside for the moment the question of whether that’s an appropriate objective, the proposal still assumes that definitions of the relevant parent and child keys will be created, and that provides the same measure of control by itself.

 

The only other purpose I have come up with for the proposed new item is to support cross validation.  That is, given a CIF data file containing a multi-packet loop of items belonging to a Set category, one could consult the new item to confirm that the looped data were presented as such intentionally, with knowledge that the usage of the category is out of the ordinary.  I can accept that as a rationale, but I find it pretty weak.

 

- The proposal retains the distinction between Set and Loop categories, while nevertheless allowing Set categories to be presented as multi-packet loops under some circumstances.

 

I think I understand why the proposal does this: it maintains a distinction between categories that ordinarily are not looped and those that ordinarily are looped.  It also helps support the restrictions on which categories may be presented as multi-packet loops, as discussed above.  I am not yet persuaded, however, that this approach should be preferred over simply making most or all categories defined by data dictionaries (as opposed to DDLm itself) be Loops.  It also maintains a bias towards ordinary / customary uses of items that may or may not actually be warranted – that’s what got us into this situation in the first place, after all.


You have correctly identified the reasons for maintaining the distinction.  If you like, a 'Set' category is a 'Loop' category with special behaviour when there is only one packet.  This special behaviour is very useful and widespread, thus it is worth describing separately.

 

- Permission to omit category keys of Set categories is expressed in prose, not machine-readable form.

 

This would by no means be the only aspect of CIF data definitions whose expression is not machine-readable, but if there were a way to express this aspect in machine-readable form -- and I think there is -- then that would be preferable.


I was hoping to be able to link the value of _audit.schema to a list but haven't spent time working out how to do that in DDLm. Is that what you mean?

 

- The proposal has no particular provision for accommodating the implicit relationships between each Set category and every other category.

 

I’m talking here about the relationships that arise simply by virtue of categories being Sets -- all other items in the same container are at least potentially associated with every set that appears in the container.  These relationships can be expressed in English in the form "The FOO appearing in the same data block".  In effect, DDLm Sets are like global variables.

 

We rely on this all over the place -- for example the REFLNS (Set) and REFLN (Loop) categories rely on the DIFFRN (Set) category to provide the associated experimental details.  If DIFFRN were looped, then both of these categories (and potentially many others) would need child keys, too.

 

Yes, this is true and is exactly the problem we're trying to deal with. The intention is that these relationships are made explicit at the point that a looped application of the Set category is defined, at which point all Set or Loop categories that depend on the newly-looped Set category have their child keys defined, but in a separate dictionary related to the application to avoid cluttering the main dictionary with extra, rarely-used keys everywhere.  We would have to go to considerable effort now to add all those keys, for no benefit until somebody comes up with a use case. I would rather that those with an unusual use-case make that effort when creating the new dictionary.
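
As a rough sketch of what "making the relationship explicit" means for a reader, the Python below resolves the DIFFRN packet that a REFLN packet belongs to. The packet layout and the function name are assumptions for illustration, and the child key _refln.diffrn_id is a hypothetical name of the kind the application dictionary would define.

```python
Packet = dict[str, str]  # hypothetical: one loop packet, data name -> value


def diffrn_for(refln: Packet, diffrn_packets: list[Packet]) -> Packet:
    """Return the DIFFRN packet associated with a REFLN packet."""
    if len(diffrn_packets) == 1:
        # Ordinary Set behaviour: "the DIFFRN appearing in the same data
        # block". No key is needed, so none has to be present.
        return diffrn_packets[0]

    # DIFFRN is looped: the implicit relationship is ambiguous, so the
    # REFLN packet must carry a child key pointing at _diffrn.id.
    key = refln.get("_refln.diffrn_id")  # hypothetical child key name
    if key is None:
        raise ValueError("DIFFRN is looped but the REFLN packet has no child key")
    for packet in diffrn_packets:
        if packet.get("_diffrn.id") == key:
            return packet
    raise ValueError(f"no DIFFRN packet with _diffrn.id = {key!r}")


single = [{"_diffrn.id": "a", "_diffrn.ambient_temperature": "100"}]
looped = single + [{"_diffrn.id": "b", "_diffrn.ambient_temperature": "295"}]

print(diffrn_for({"_refln.index_h": "1"}, single))
print(diffrn_for({"_refln.index_h": "1", "_refln.diffrn_id": "b"}, looped))
```

Only the looped branch needs the extra key, which is why the key definitions can live in the dictionary for the unusual application rather than in the main one.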
 

Overall, any proposal that requires COMCIFS’s or a DMG’s intervention to enable new usages of existing data names, and that causes such changes to have global scope, as proposal #2 does, destabilizes CIF by increasing the frequency of disruptive changes.  I think it would be better to find an alternative that solves the problem once for all.  Adopting such an approach probably would mean relinquishing some of the control that the present proposal would afford us, but I think that’s an essential aspect of the problem space: the more control we exert over what data can be expressed, the more occasions will arise when we need to make changes to allow more or different data expressions.


My hope was that the single '_audit.schema' dataname would once and for all remove the disruptive effect of such looped 'Set' changes, at the one-off, 15-minute time cost of programming a check for an additional dataname.  Do you disagree that _audit.schema would minimise disruption? As for going through COMCIFS, this is simply a service we provide to allow CIF writers and readers to produce a mutually-understandable file without contacting each other directly. We can have the best of both worlds by defining an _audit.schema value prefix for uncontrolled use and adding official values for those communities who request them.
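
A minimal sketch of how a reader might tell official values from uncontrolled ones is given below. Both the official value "Base" and the prefix "Local:" are placeholders chosen for illustration; the actual vocabulary and prefix would be whatever COMCIFS settles on.

```python
# Assumptions: placeholder vocabulary and prefix, not agreed values.
OFFICIAL_SCHEMAS = {"Base"}
UNCONTROLLED_PREFIX = "Local:"


def classify_schema(value: str) -> str:
    """Classify an _audit.schema value as official, uncontrolled or unrecognised."""
    if value in OFFICIAL_SCHEMAS:
        return "official"
    # An uncontrolled value is the prefix followed by a non-empty name.
    if value.startswith(UNCONTROLLED_PREFIX) and len(value) > len(UNCONTROLLED_PREFIX):
        return "uncontrolled"
    return "unrecognised"


for value in ("Base", "Local:my-lab-twins", "Twinned"):
    print(value, "->", classify_schema(value))
```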

 

It will be obvious by this point that I have significant reservations about proposal #2.  Lest I seem relentlessly negative, I do have a general idea for an alternative.  This e-mail is already more than long enough, however, so I will present that separately.

  

Best regards,

 

John

 

 
