Re: [ddlm-group] Proposal to enhance the behaviour of a DDLm "Set" category: please consider

Dear James and Group,

Comments inline below.

Best Regards,

John


On Friday, May 27, 2016 9:48 PM James Hester wrote:

> Dear John and group,
> I've discussed John's points within his email, below.
>
> On 26 May 2016 at 00:52, Bollinger, John C <John.Bollinger@stjude.org> wrote:

[...]

>>  (*) To enable the desired kinds of re-use, it is proposed that the 'Set' category class be redefined to require uniqueness only with respect to a category key.  New constraints are placed on the other categories that can appear in the same block or frame, so as to ensure that each datum can be associated with at most one value for any item in any 'Set' category.
>
> This is correct, although note that it is already trivially the case that each datum is associated with at most one value for any item in the 'Set' category.


Acknowledged.


> The idea is now to allow *only* those categories that explicitly depend on the 'Set' category key to be included in the data block when the 'Set' category is looped, to maintain this unique association.  Thus existing users of the 'Set' category see no change (as the categories they use did not, and will not, have a key dataname pointing to the 'Set' category), and so the onus is on those who would loop the 'Set' category to come up with new categories that include the 'Set' category key within their loops.


Ok, I think I've got it.


>> Based on that understanding of the proposal:
>>
>> 1. I am concerned about the proposed new constraint on other categories that may appear in the same container with a 'Set' category.  I think I understand the purpose, but I also think this will be easier to get wrong and more complicated to validate.  Moreover, it introduces an unresolved conflict with categories that really ought to be 'Sets' as they currently are defined, as the proposal itself acknowledges with respect to the AUDIT category.
>
> If a category key is not defined for a 'Set' category, then it behaves as it always has. So, if 'AUDIT' is a Set category, and no category key is defined, then no changes occur. The validation complexity is essentially "1. Is this 'Set' category looped? 2. If so, list the categories in the datablock. Do they all have a dataname that points to the 'Set' category?"  This cost in complexity (which is almost exclusively borne by those who have the unusual use-cases, as it only occurs if step 1 gives 'yes') is what we should balance against the benefit of increased utility.


I think I understand.  I also think that's not going to be sufficient.  That is, it could work for allowing reuse of the DDLm Core dictionary as desired in a DDLm version of the symmetry dictionary, but I think we will ultimately want more flexibility than that.
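
To check that reading, the two-step validation described above could be sketched roughly as follows.  This is only an illustration in Python, not anything taken from an existing validator: the in-memory layout (a dict of category name to list of row dicts), the function name, and the `child_keys` table are all assumptions, and atom_site_with_spacegroup is used purely as a hypothetical category name.

```python
def categories_breaking_rule(block, set_category, child_keys):
    """Return the categories that may not accompany a looped 'Set' category.

    block        -- hypothetical layout: {category name: [row dict, ...]}
    set_category -- name of the 'Set' category, e.g. "space_group"
    child_keys   -- hypothetical dictionary information:
                    {category name: dataname that points to the Set key}
    """
    # Step 1: is the 'Set' category looped?  If not, nothing changes.
    if len(block.get(set_category, [])) <= 1:
        return []

    # Step 2: every other category must carry a dataname that points
    # to the 'Set' category key, in every row.
    offenders = []
    for name, rows in block.items():
        if name == set_category:
            continue
        child = child_keys.get(name)
        if child is None or any(child not in row for row in rows):
            offenders.append(name)
    return sorted(offenders)


block = {
    "space_group": [{"_space_group.id": "a"}, {"_space_group.id": "b"}],
    "atom_site": [{"_atom_site.label": "C1"}],
    "atom_site_with_spacegroup": [
        {"_atom_site_with_spacegroup.space_group_id": "a"}
    ],
}
child_keys = {
    "atom_site_with_spacegroup": "_atom_site_with_spacegroup.space_group_id"
}
print(categories_breaking_rule(block, "space_group", child_keys))
```

Under these assumptions the ordinary atom_site category is the one flagged, because it has no dataname pointing at the looped category.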

In fact, let's look at the current state of affairs.  The DDL1 core dictionary has _space_group_id, defined as "This is an identifier needed if _space_group_ items are looped."  That looks like a category key to me, but the DDLm version of the core defines SPACE_GROUP as a Set.  For its part, the DDLm core has _space_group.id, which it aliases to _space_group_id.  That seems odd for a Set category.  The item is defined as "Code identifying a space group if multiple symmetries.  See _exptl_crystals.key", which appears to conflict with defining SPACE_GROUP as a Set.

I'm inclined to think that this apparent conflict arises from the same inherent tension between Set and Loop that inspired the present proposal.  Observe, however, that the definition hypothesizes a specific scenario in which items from SPACE_GROUP would appear looped in a Core CIF document (though I'm having trouble finding item _exptl_crystals.key in the DDLm core).  Even if there are no other such Sets in the present DDLm Core dictionary, I think it's likely that others will be discovered as the dictionary is developed and expanded.  Indeed, the use case suggested by the _space_group.id definition seems to imply that those categories would be the rule, not the exception.  As I understand it, your proposal would not lend itself well to such uses, and especially not to using more than one unrelated Set category in looped form in the same CIF.


>> 2. The proposed change almost completely erases the distinction between 'Set' and 'Loop' categories.  I am not convinced that retaining the two as separate classes with such a fine distinction between them is the best course of action.
>
> DDL2 adopted the position that 'everything is a loop', and indeed, in ontological terms any concept you can think of is 'looped', that is, you can imagine situations where multiple values are available.  'Set' categories are simply a space-saving device.

[...]

> So the distinction is not so much 'Set' and 'Loop', but 'overall information' and 'per datum information'.  What the proposal describes is under what circumstances you can re-use the 'overall information' dataname as a 'per datum' dataname.


Understood.  However, I prefer an approach based as much as possible on defining what various combinations of data mean, and as little as possible on limiting what combinations of data are permitted.  Perhaps that means that indeed most categories defined in data dictionaries should be loops, or perhaps that our data dictionaries would benefit from greater distinction between defining data and defining relationships between data.


>> 3. I am not fond of how conditional the proposed new definition text is.
>
> Neither am I, but I don't believe it is possible to come up with anything simpler that meets the requirements.  Whether the requirements are justified may also be debated, of course.
>
>> 4. It seems likely that all existing methods of current 'Set' items would be broken by the proposed change.
>
> Actually, a lot less than you might expect. If you look at the definition for _cell.volume, it works just as well on a list of cells as on a single cell.  This is inherent in the design of dREL.  The methods that fail are those that expect overall ('Set') information in a Looped category, so for example, when the model sites are calculated (see category method for model_site) they assume a single value of _atom_sites_Cartn_transform.mat. This matrix will be looped if the cell parameters are looped, and so the model_site method will fail.  For this reason I have stipulated the condition that any category appearing with a looped 'Set' category must explicitly know that the set category can be looped by defining a key that points to the set category key.  This ringfences the currently existing categories, such as model_site, that assume overall information, from any effect of this change.


My dREL is shaky, and it may be that my concern is more about implementing dREL itself than about specific methods.  If so, that might be considered a more tractable problem.  What I'm considering, however, is how you identify the wanted items from a Set category if that category has a category key.  In principle, you ought to identify the items by name and category key, but no existing method does that (or maybe it's dREL that doesn't) because Set categories don't presently have category keys.


>> My present thinking is that changing specific 'Set' categories into bona fide 'Loop' categories would be better than making all 'Sets' loop-like without actually making them 'Loops'.  This could be reconciled with existing data files by introducing a mechanism for defaulting category key values or by allowing category keys to be omitted from category data when only one set of data from that category is presented.  I think an approach along these lines could solve the problem at hand while addressing my concerns 1-3.  I am uncertain whether a solution is possible that fully addresses my concern #4, but if we convert 'Sets' into 'Loops' only selectively, then at least we narrow the scope of the problems with methods, and perhaps also allow an incremental approach to be taken for updating dictionaries.
>
> Unfortunately I don't believe that your suggested change would work, because (i) we need to take into account the response of existing software to datafiles with looped 'Set' categories (ii) we still need a way to indicate overall data in order to match DDL1 dataname meanings.
>
> Regarding (i),  while a system of defaulting key values and omitting them if only a single item is defined is a consistent description of currently-existing datafiles, the real issue is the opposite: what will happen when current software is faced with a file that *does* have multiple values in a 'Set' category?  Are we sure that it will not silently e.g. calculate too many atomic sites because we have listed symmetry operators from multiple spacegroups?


I think you're talking about the behavior of existing software when dealing with data that would not be valid with respect to the current DDLm core dictionary, but that would be made valid with respect to that dictionary by the kind of change I suggest.  In effect, you're arguing that whatever changes we make to DDLm and the DDLm Core dictionary should not allow future data files to be validated that present software is unprepared to accept.  I don't take that as given.  Furthermore, to the extent that anything we discuss involves modifying DDLm, are we not discussing causing the same or worse problems anyway, but for a different set of software?

Additionally, I'm not certain how large or new this issue really is.  Take our space group example: the DDL1 Core already makes provision for the items in this category and its child categories to be looped, even though they rarely are.  Nothing prevents anyone from writing a valid CIF now, using those facilities, and presenting it to a piece of software that misunderstands it.


> Regarding (ii), we gain nothing by use of the PDB 'entry.id' trick as we are once again left with single-valued categories that can't have more than one datum.


I didn't go into my proposed alternative in any detail, but what I have in mind does allow for multiple data per item.  It would work like this (taking SPACE_GROUP as an example):

 (*) we change SPACE_GROUP from a Set category to a Loop category, consistent with the DDL1 Core.
 (*) we define the existing item _space_group.id to be the category key.
 (*) we define a default value (_enumeration.default) for _space_group.id, maybe an empty string; this permits one set of items from the category to be presented in a CIF that does not provide any explicit _space_group.id, without preventing multiple sets being provided by including _space_group.id values explicitly.  Multiple sets require the key to be provided explicitly, for otherwise key values would be duplicated.
 (*) when a method references SPACE_GROUP outside the context of any child key referring to _space_group.id, the related items are identified based on the defined default value of _space_group.id; the data are invalid if that key does not appear (whether explicitly or implicitly) in the file.
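
As a rough illustration of the last two points, here is a minimal Python sketch of the lookup I have in mind.  It assumes the default key value is the empty string and that a category is held as a list of row dicts; the function names and that layout are hypothetical and not part of DDLm or dREL.

```python
DEFAULT_ID = ""  # assumed _enumeration.default for _space_group.id


def keyed_rows(rows, key="_space_group.id", default=DEFAULT_ID):
    """Index rows by category key, filling in the default where omitted."""
    table = {}
    for row in rows:
        value = row.get(key, default)
        if value in table:
            # Multiple sets without explicit keys would duplicate the default.
            raise ValueError(f"duplicate key value {value!r}")
        table[value] = row
    return table


def lookup(rows, child_key_value=None, key="_space_group.id",
           default=DEFAULT_ID):
    """Find the row a method should use.

    With a child key value, select by it; without one, fall back to the
    default key value, and treat the data as invalid if it is absent.
    """
    table = keyed_rows(rows, key, default)
    wanted = default if child_key_value is None else child_key_value
    if wanted not in table:
        raise ValueError(f"no row with key value {wanted!r}")
    return table[wanted]


# An existing-style file: one set of items, no explicit key.
legacy = [{"_space_group.crystal_system": "monoclinic"}]
print(lookup(legacy)["_space_group.crystal_system"])
```

A file that loops the category with explicit key values "a" and "b" would resolve `lookup(rows, "b")` normally, while a keyless `lookup(rows)` on that same file would be rejected, which is the "data are invalid" case above.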

Of those, only the last requires a change to DDLm itself, and even that change could be avoided if we were willing to add actual child keys to all other categories.  The potential combinatorial problems involved in doing that could be significantly mitigated if there is a way to create a shared dictionary module containing all the needed definitions, and importing just that module into all/most categories.


> Regarding altering dREL methods, I think it is worth realising that the dREL methods express the mathematical relationships that are 'out there' in correctly-written user software. We can no more change a correctly-written dREL method than we can change all the software that expresses that method, and as soon as any proposal involves rewriting dREL methods, that proposal is effectively a non-starter for core_CIF.  The converse is that, if we somehow control all the software that expresses that method, then we can change it.


Is it fair to say that for the purposes of the present discussion, what you're saying here means that adopting your proposal would require existing dREL implementations to be updated?  I'm again thinking of how to resolve item associations when at least one item belongs to a Set category with a category key, and the other to a different category without a corresponding child key.


> A clean alternative is simply to define a whole new set of datanames corresponding to a 'Set' category becoming looped.  This could be economically done with (i) a category definition using a single new DDLm attribute to say that all the datanames in the new category are derived from the datanames in the 'Set' category by just changing the name.category_id component (ii) a similar new category definition for each category which now needs a key dataname pointing to the newly-looped category.  To be honest, this would be my preferred option, although we would be stuck with space_group as a legacy category.  Anyway, under this alternative, machine transformation of datafiles between the 'Loop' and 'Set' versions would be straightforward, and capture all the changes in dependent categories (so the space group column in the atom_site_with_spacegroup category would disappear upon transformation to _atom_site, and multiple data blocks, one for each space group, would be generated).
> What do you think of that?


I like that idea much better than the original proposal.  I can conceptualize it in terms of distinguishing the nature of a thing (category) from how that thing is used in a particular context (dictionary), which makes more sense to me.  And I think it addresses all the concerns I enumerated in my previous message.  I would like it especially well if there were a way to leverage _import.get for this purpose instead of modifying DDLm.
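
For concreteness, the machine transformation from the 'Loop' version to the 'Set' version that you describe might look something like the sketch below (Python).  The dataname _atom_site_with_spacegroup.space_group_id, the function name, and the row-dict layout are my own assumptions for illustration; only the category name atom_site_with_spacegroup comes from your message.

```python
def split_blocks(space_groups, sites,
                 key="_space_group.id",
                 child="_atom_site_with_spacegroup.space_group_id",
                 old_prefix="_atom_site_with_spacegroup.",
                 new_prefix="_atom_site."):
    """Split looped data into one block per space group.

    Returns {space group id: {"space_group": row, "atom_site": [rows]}}.
    The space group column is dropped from each site row, and the
    remaining datanames are renamed to their legacy _atom_site form.
    A site that references an unknown space group raises KeyError.
    """
    blocks = {}
    for sg in space_groups:
        blocks[sg[key]] = {"space_group": dict(sg), "atom_site": []}
    for site in sites:
        row = {
            name.replace(old_prefix, new_prefix, 1): value
            for name, value in site.items()
            if name != child
        }
        blocks[site[child]]["atom_site"].append(row)
    return blocks


space_groups = [
    {"_space_group.id": "a", "_space_group.crystal_system": "monoclinic"},
    {"_space_group.id": "b", "_space_group.crystal_system": "cubic"},
]
sites = [
    {"_atom_site_with_spacegroup.label": "C1",
     "_atom_site_with_spacegroup.space_group_id": "a"},
    {"_atom_site_with_spacegroup.label": "O1",
     "_atom_site_with_spacegroup.space_group_id": "b"},
    {"_atom_site_with_spacegroup.label": "N1",
     "_atom_site_with_spacegroup.space_group_id": "a"},
]
for sg_id, content in split_blocks(space_groups, sites).items():
    print(sg_id, content["atom_site"])
```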

On the other hand, I am not certain I prefer it to making most categories in data dictionaries be bona fide Loops, with an internally-consistent mechanism to elide category keys where unneeded, as I proposed.  In fact, I'm not confident that we will avoid wanting such a mechanism in any case.  Nevertheless, if the group is not prepared to go there at this point then that idea can go back on the shelf.

