Discussion List Archives


Re: [ddlm-group] Proposal to enhance the behaviour of a DDLm "Set" category: please consider

Dear Colleagues,

  In the 1970s we had a serious conflict in the database world between people who accepted Codd's relational database model and advocates of a very large number of supposedly more powerful and flexible alternatives.  In the end it turned out that Codd was right: the best way to represent databases of information so as to allow simultaneous readers and writers with reasonable efficiency and reliability was to keep _everything_ in relational tables in which particular rows are identified by the unique combination of key values in each row.

  Ultimately, if we are going to get the most use from the information we are gathering in CIFs it would be helpful if the CIFs could easily be mapped into relational tables.  That works well with DDL2.  It would be nice if the characteristics of DDL2 that permit such easy use there could be adhered to in the DDLm core dictionary.

  I don't see that any harm will arise from allowing appropriate keys to be defined for all categories and allowing looping of any categories for which keys have been defined.  Even if a key is added to a category for which there are existing datasets without that key, having DDLm and dREL we can easily provide a default value to be used.  If the category has not been looped in a particular dataset any default value will do.  If it has been looped the category must already have a key with unique values for each row.  Yes, a set is different from a relational table, but it can be effectively represented the same way as a relational table.  The distinction is in the semantics in the dictionary, not in the data file.  We invalidate nothing by allowing some CIFs with unlooped versions of the same category that is looped in other CIFs.  Failing to declare as an error something that has a clear and unambiguous meaning would not be a loss to anybody.
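
  As a rough illustration of the mapping described above, here is a minimal Python sketch. It assumes a category has already been read into a list of rows (one dict per row); the function name, the "." default key value and the sample data are all invented for the example and are not taken from DDLm or any CIF library.

```python
# Illustrative only: a category is modelled as a list of rows, each row a
# dict of item name -> value. This is not a CIF parser.
DEFAULT_KEY = "."  # hypothetical default key value supplied by the dictionary


def as_relational_rows(rows, key_item, default=DEFAULT_KEY):
    """Return {key value: row}, supplying the key for a single unlooped row."""
    if len(rows) == 1 and key_item not in rows[0]:
        # Unlooped category: any default value will do.
        rows = [{**rows[0], key_item: default}]
    table = {}
    for row in rows:
        if key_item not in row:
            raise ValueError(f"looped category needs an explicit {key_item}")
        key = row[key_item]
        if key in table:
            raise ValueError(f"duplicate key value {key!r}")
        table[key] = row
    return table


# The same category, unlooped in one file and looped in another.
unlooped = [{"name_H-M_alt": "P 21/c"}]
looped = [
    {"id": "1", "name_H-M_alt": "P 21/c"},
    {"id": "2", "name_H-M_alt": "C 2/c"},
]
```

  Both forms end up as the same kind of keyed table, which is the sense in which the distinction lives in the dictionary rather than in the data file.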

  James' suggestion does not introduce complexity.  It removes some.

  Regards,
    Herbert

On Wed, Jun 1, 2016 at 10:33 PM, James Hester <jamesrhester@gmail.com> wrote:
Dear John and group,

Here is a summary of my latest response in light of John's comments. Read my interspersed comments below for further details.

* My original proposal is flawed because
    (i) it excludes categories from a datablock unnecessarily
    (ii) every new category created implies a further aliased dataname for CIF readers to check even for common use cases
    (iii) some Set category dREL methods may need to be rewritten, and dREL behaviour may change subtly when using Set categories

* John's alternative is flawed because it changes the interpretation of already existing datanames

* My "synthetic category" alternative is flawed in the same way as (ii) above

I am working on an alternative proposal which I will present shortly.
 
Read on for detailed comments interspersed with John's most recent reply, including some CIF archaeology. I've edited out areas where there is no issue.

On 1 June 2016 at 06:47, Bollinger, John C <John.Bollinger@stjude.org> wrote:


On Friday, May 27, 2016 9:48 PM James Hester wrote:



>> Based on that understanding of the proposal:
>>
>> 1. I am concerned about the proposed new constraint on other categories that may appear in the same container with a 'Set' category.  I think I understand the purpose, but I also think this will be easier to get wrong and more complicated to validate.  Moreover, it introduces an unresolved conflict with categories that really ought to be 'Sets' as they currently are defined, as the proposal itself acknowledges with respect to the AUDIT category.
>
> If a category key is not defined for a 'Set' category, then it behaves as it always has. So, if 'AUDIT' is a Set category, and no category key is defined, then no changes occur. The validation complexity is essentially "1. Is this 'Set' category looped? 2. If so, list the categories in the datablock. Do they all have a dataname that points to the 'Set' category?"  This cost in complexity (which is almost exclusively borne by those who have the unusual use-cases, as it only occurs if step 1 gives 'yes') is what we should balance against the benefit of increased utility.
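
The two-step check quoted above could be sketched roughly as follows. This is only an illustration in Python: the datablock is assumed to be a dict of category name to rows, and `pointers` is a hypothetical table, derived from the dictionary, of the item in each category that points at the 'Set' category key.

```python
def check_looped_set(datablock, set_category, pointers):
    """Return the categories that may not appear alongside a looped Set.

    datablock: {category name: [row, ...]}
    pointers:  {category name: item name pointing at the Set category key}
               (hypothetical; would be derived from the dictionary)
    """
    # Step 1: is the 'Set' category looped? If not, nothing changes.
    if len(datablock.get(set_category, [])) <= 1:
        return []
    # Step 2: every other category must define a pointer to the Set category.
    return sorted(
        category
        for category in datablock
        if category != set_category and category not in pointers
    )


pointers = {"space_group_symop": "sg_id"}  # illustrative pointer item name
block = {
    "space_group": [{"id": "1"}, {"id": "2"}],
    "space_group_symop": [{"sg_id": "1", "operation_xyz": "x,y,z"}],
    "atom_site": [{"label": "C1"}],
}
```

With this sample data `check_looped_set(block, "space_group", pointers)` reports `atom_site`, and reports nothing once `space_group` has a single row.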


I think I understand.  I also think that's not going to be sufficient.  That is, it could work for allowing reuse of the DDLm Core dictionary as desired in a DDLm version of the symmetry dictionary, but I think we will ultimately want more flexibility than that.

In fact, let's look at the current state of affairs.  The DDL1 core dictionary has _space_group_id, defined as "This is an identifier needed if _space_group_ items are looped."  That looks like a category key to me, but the DDLm version of the core defines SPACE_GROUP as a Set.  For its part, the DDLm core has _space_group.id, which it aliases to _space_group_id.  That seems odd for a Set category.  The item is defined as "Code identifying a space group if multiple symmetries.  See _exptl_crystals.key", which appears to conflict with defining SPACE_GROUP as a Set.

This has prompted me to do some CIF archaeology, and so I find that the space_group category was introduced to core_cif in 2003 and was intended to supersede the symmetry category (see http://www.iucr.org/__data/iucr/lists/coredmg/msg00208.html). Following the initial introduction of this change in http://www.iucr.org/__data/iucr/lists/coredmg/msg00176.html, David Brown in msg00178 outlines the background, but there seems to have been no further public discussion beyond a concern that symmetry operators could now be referred to using a character rather than an integer.  I assume that mmCIF in providing matching datanames for those found in core_CIF likewise included the space_group category found there.

Volume G summarises the information in the preceding paragraph in section 3.2.4.4 (p 110) and reiterates that the symmetry category is deprecated.

So, as of 2005, we had the situation where two ways of defining space group information were present in core_cif.  As long as space_group was not looped, the symmetry and space_group categories were indeed interchangeable. As soon as space_group is looped, of those categories describing structural information only the space_group_symop category makes sense (as it has a dataname giving the space_group id) and space_group is *not* a proper replacement for symmetry.  This situation has presumably gone unremarked simply because we have provided no coherent way to have multiple space groups in a structural file so nobody does it. Or perhaps software simply continues to output the old symmetry items.

(The following is a reconstruction, not based on specific discussions with the DDLm developers)

Enter DDLm and, more importantly, dREL.  The dREL method expresses the precise relationship between datanames for the general case.  If space_group could be looped, then you must everywhere write your dREL as if it were looped. That's fine, but it requires that every category that uses space group information has a dataname that points to the particular space group that is used.  And these don't exist in DDL1 cif_core, so in order to capture the cif_core behaviour, the original "overall value" behaviour of the symmetry category has to be retained.

Rather than jettison space_group, however, the original DDLm draft prepared by the Perth group used the DDLm 'ref-loop' concept to allow looping of space groups: the category 'SPACE_GROUPS' was defined as a ref-loop category, that is, a loop of references to save frames containing only items from child categories (in this case SPACE_GROUP and SPACE_GROUP_SYMOP).  In this way DDLm allowed multiple space groups and corresponding symops to be contained in a datablock, firewalled from the rest of the datablock inside save frames. If you wanted a structural description, you could use the symmetry category or the space_group category, both of which were defined as Set categories and therefore take only a single value in a single save frame or data block.

Whatever your opinion of the original DDLm solution, it is not available to us as we have rejected save frames in CIF2 data blocks.  And so we find ourselves here, trying to re-solve this issue.


I'm inclined to think that this apparent conflict arises from the same inherent tension between Set and Loop that inspired the present proposal.  Observe, however, that the definition hypothesizes a specific scenario in which items from SPACE_GROUP would appear looped in a Core CIF document (though I'm having trouble finding item _exptl_crystals.key in the DDLm core).  Even if there are no other such Sets in the present DDLm Core dictionary, I think it's likely that others will be discovered as the dictionary is developed and expanded.  Indeed, the use case suggested by the _space_group.id definition seems to imply that those categories would be the rule, not the exception.  As I understand it, your proposal would not lend itself well to such uses, and especially not to using more than one unrelated Set category in looped form in the same CIF.


The mention of exptl_crystals.key is a vestige of an analogous 'ref-loop' for crystals, and so your supposition that there might be other cases where this is desired is correct.  My proposal does have the flaw that it blanket excludes all categories that don't contain a dataname that refers to the newly-looped category, and so e.g. if you want multiple crystals and multiple space groups it would exclude every category that didn't have keys for both, even though few categories intrinsically have a dependency on both.  So, if we go down this path, we need to create a new DDLm category-level attribute that lists all the categories that the given category depends upon, and then the condition on the other categories present in the same datablock as a Set category would become "only those categories that do not depend on the 'Set' category, or depend on the 'Set' category and contain a dataname pointing to the 'Set' category, may be present in the same datablock".

Note that this makes the conditionality somewhat more complex, but at the same time this is essentially a statement of the hidden status quo in DDL1 for space_group.
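
A minimal Python sketch of the refined condition, for illustration only. The `depends_on` table stands in for the new category-level DDLm attribute suggested above, which does not exist yet; `pointers` lists, per category, the looped categories it has a pointer dataname for. All names and sample entries are invented.

```python
def permitted(category, looped_set, depends_on, pointers):
    """May `category` share a datablock with the looped Set category?"""
    if looped_set not in depends_on.get(category, set()):
        return True  # no dependency on the Set category, so unaffected
    # Dependent categories must carry a dataname pointing at the Set category.
    return looped_set in pointers.get(category, set())


# Hypothetical dictionary information.
depends_on = {
    "atom_site": {"space_group"},
    "space_group_symop": {"space_group"},
    "audit": set(),
}
pointers = {"space_group_symop": {"space_group"}}
```

Under these assumptions `audit` and `space_group_symop` are permitted next to a looped `space_group`, while `atom_site` is not until it gains a pointer.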
 

>> 2. The proposed change almost completely erases the distinction between 'Set' and 'Loop' categories.  I am not convinced that retaining the two as separate classes with such a fine distinction between them is the best course of action.
>
> DDL2 adopted the position that 'everything is a loop', and indeed, in ontological terms any concept you can think of is 'looped', that is, you can imagine situations where multiple values are available.  'Set' categories are simply a space-saving device.

[...]

> So the distinction is not so much 'Set' and 'Loop', but 'overall information' and 'per datum information'.  What the proposal describes is under what circumstances you can re-use the 'overall information' dataname as a 'per datum' dataname.


Understood.  However, I prefer an approach based as much as possible on defining what various combinations of data mean, and as little as possible on limiting what combinations of data are permitted.  Perhaps that means that indeed most categories defined in data dictionaries should be loops, or perhaps that our data dictionaries would benefit from greater distinction between defining data and defining relationships between data.


I would go as far as saying that a datablock is completely characterised by the list of loopable categories of that datablock. If we had our time over, we would have a compulsory dataname or keyword that would identify this list, and software would check this before proceeding.  We can't retrofit this, as current software would not check it and thus risks misinterpreting other datablocks.


>> 3. I am not fond of how conditional the proposed new definition text is.
>
> Neither am I, but I don't believe it is possible to come up with anything simpler that meets the requirements.  Whether the requirements are justified may also be debated, of course.
>
>> 4. It seems likely that all existing methods of current 'Set' items would be broken by the proposed change.
>
> Actually, a lot less than you might expect. If you look at the definition for _cell.volume, it works just as well on a list of cells as on a single cell.  This is inherent in the design of dREL.  The methods that fail are those that expect overall ('Set') information in a Looped category, so for example, when the model sites are calculated (see category method for model_site) they assume a single value of _atom_sites_Cartn_transform.mat. This matrix will be looped if the cell parameters are looped, and so the model_site method will fail.  For this reason I have stipulated the condition that any category appearing with a looped 'Set' category must explicitly know that the set category can be looped by defining a key that points to the set category key.  This ringfences the currently existing categories, such as model_site, that assume overall information, from any effect of this change.


My dREL is shaky, and it may be that my concern is more about implementing dREL itself than about specific methods.  If so, that might be considered a more tractable problem.  What I'm considering, however, is how you identify the wanted items from a Set category if that category has a category key.  In principle, you ought to identify the items by name and category key, but no existing method does that (or maybe it's dREL that doesn't) because Set categories don't presently have category keys.

Actually you are right. My example of _cell.volume involved a calculation ultimately within the same notional category ('cell'), where the dREL implicitly uses values on the same row, so there is no need for a reference to a category key.  When obtaining an item from another looped category, a key would be required unless it was an aggregate calculation (i.e. a dREL Loop).  It is not clear to me that there are many instances of this cross-category referencing at present, partly because it is not clear which 'Set' categories would actually belong together in a single loop, but certainly what you say is potentially correct.
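
To make the distinction concrete, here is a Python stand-in (not dREL) for the two situations. The volume calculation uses only values from the same row, so it applies unchanged to one cell or to a list of cells; the cross-category lookup cannot proceed without a pointer. The `cell_id` pointer item and the sample cells are assumptions made for the example.

```python
import math


def cell_volume(cell):
    """Volume from cell lengths and angles (degrees), using one row only."""
    ca, cb, cg = (
        math.cos(math.radians(cell[name]))
        for name in ("angle_alpha", "angle_beta", "angle_gamma")
    )
    return (
        cell["length_a"] * cell["length_b"] * cell["length_c"]
        * math.sqrt(1 - ca * ca - cb * cb - cg * cg + 2 * ca * cb * cg)
    )


cells = [
    {"id": "A", "length_a": 10, "length_b": 10, "length_c": 10,
     "angle_alpha": 90, "angle_beta": 90, "angle_gamma": 90},
    {"id": "B", "length_a": 5, "length_b": 6, "length_c": 7,
     "angle_alpha": 90, "angle_beta": 90, "angle_gamma": 90},
]
# Same-row calculation: works for any number of rows.
volumes = [cell_volume(cell) for cell in cells]


def cell_for(row, cells, pointer="cell_id"):
    """Cross-category reference: needs a (hypothetical) pointer item."""
    if pointer not in row:
        raise LookupError("row does not say which cell it belongs to")
    return next(cell for cell in cells if cell["id"] == row[pointer])
```

A row from another category with no `cell_id` raises an error here rather than silently picking a cell, which is the case a key would have to resolve.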


>> My present thinking is that changing specific 'Set' categories into bona fide 'Loop' categories would be better than making all 'Sets' loop-like without actually making them 'Loops'.  This could be reconciled with existing data files by introducing a mechanism for defaulting category key values or by allowing category keys to be omitted from category data when only one set of data from that category is presented.  I think an approach along these lines could solve the problem at hand while addressing my concerns 1-3.  I am uncertain whether a solution is possible that fully addresses my concern #4, but if we convert 'Sets' into 'Loops' only selectively, then at least we narrow the scope of the problems with methods, and perhaps also allow an incremental approach to be taken for updating dictionaries.
>
> Unfortunately I don't believe that your suggested change would work, because (i) we need to take into account the response of existing software to datafiles with looped 'Set' categories (ii) we still need a way to indicate overall data in order to match DDL1 dataname meanings.
>
> Regarding (i),  while a system of defaulting key values and omitting them if only a single item is defined is a consistent description of currently-existing datafiles, the real issue is the opposite: what will happen when current software is faced with a file that *does* have multiple values in a 'Set' category?  Are we sure that it will not silently e.g. calculate too many atomic sites because we have listed symmetry operators from multiple spacegroups?


I think you're talking about the behavior of existing software when dealing with data that would not be valid with respect to the current DDLm core dictionary, but that would be made valid with respect to that dictionary by the kind of change I suggest.  In effect, you're arguing that whatever changes we make to DDLm and the DDLm Core dictionary should not allow future data files to be validated that present software is unprepared to accept.  I don't take that as given.  Furthermore, to the extent that anything we discuss involves modifying DDLm, are we not discussing causing the same or worse problems anyway, but for a different set of software?

My objection is more subtle - I don't mind current software being unprepared to accept a file (i.e. error message or clear failure) but I do mind current software silently processing what it thinks is a file conforming to its expectations (as all the datanames it knows about are present) and giving the wrong answer.
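
The kind of silent failure meant here can be illustrated with a small Python sketch. The rows and the `sg_id` pointer name are invented for the example; `legacy_operators` mimics a reader written before space groups could be looped.

```python
# Symmetry operators from two space groups in one datablock (made-up data).
symops = [
    {"sg_id": "1", "operation_xyz": "x,y,z"},
    {"sg_id": "1", "operation_xyz": "-x,-y,-z"},
    {"sg_id": "2", "operation_xyz": "x,y,z"},
    {"sg_id": "2", "operation_xyz": "-x,y+1/2,-z"},
]


def legacy_operators(rows):
    """Old-style reader: knows operation_xyz, ignores the pointer item."""
    return [row["operation_xyz"] for row in rows]


def key_aware_operators(rows, space_group_id):
    """Reader that selects the operators of one space group."""
    return [
        row["operation_xyz"] for row in rows if row["sg_id"] == space_group_id
    ]
```

The legacy reader returns four operators without complaint and would go on to generate too many sites; nothing in the file looks wrong to it, because every dataname it knows about is present.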
 

Additionally, I'm not certain how large or new this issue really is.  Take our space group example: the DDL1 Core already makes provision for the items in this category and its child categories to be looped, even though they rarely are.  Nothing prevents anyone from writing a valid CIF now, using those facilities, and presenting it to a piece of software that misunderstands it.

Indeed it is true that we have survived 13 years with this problem embedded in our standard, but I would like not to perpetuate it. My proposal (with the new condition added above) simply describes how, in actual fact, problems have been avoided.  Fortunately nobody has had the bright idea of looped space groups in the same datablock as reflection lists or structural information, or if they have, the lack of a space-group ID pointer has made them realise that they can't do it.
 


> Regarding (ii), we gain nothing by use of the PDB 'entry.id' trick as we are once again left with single-valued categories that can't have more than one datum.


I didn't go into my proposed alternative in any detail, but what I have in mind does allow for multiple data per item.  It would work like this (taking SPACE_GROUP as an example):

 (*) we change SPACE_GROUP from a Set category to a Loop category, consistent with the DDL1 Core.
 (*) we define the existing item _space_group.id to be the category key.
 (*) we define a default value (_enumeration.default) for _space_group.id, maybe an empty string; this permits one set of items from the category to be presented in a CIF that does not provide any explicit _space_group.id, without preventing multiple sets being provided by including _space_group.id values explicitly.  Multiple sets require the key to be provided explicitly, for otherwise key values would be duplicated.
 (*) when a method references SPACE_GROUP outside the context of any child key referring to _space_group.id, the related items are identified based on the defined default value of _space_group.id; the data are invalid if that key does not appear (whether explicitly or implicitly) in the file

Of those, only the last requires a change to DDLm itself, and even that change could be avoided if we were willing to add actual child keys to all other categories.  The potential combinatorial problems involved in doing that could be significantly mitigated if there is a way to create a shared dictionary module containing all the needed definitions, and importing just that module into all/most categories.
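
A rough Python sketch of how the last step in the list above might resolve an item, assuming the empty string as the default key value (the text only says "maybe an empty string"). The function and the sample rows are invented for illustration.

```python
DEFAULT_ID = ""  # assumed default for _space_group.id


def lookup(rows, item, key=None):
    """Fetch `item` from SPACE_GROUP rows.

    With no child key in play (key=None) the default key value is used;
    rows lacking an explicit id are treated as carrying the default.
    """
    wanted = DEFAULT_ID if key is None else key
    for row in rows:
        if row.get("id", DEFAULT_ID) == wanted:
            return row[item]
    # The data are invalid if the wanted key does not appear.
    raise LookupError(f"no space group with id {wanted!r}")


legacy_file = [{"name_H-M_alt": "P 21/c"}]  # no explicit id
looped_file = [
    {"id": "1", "name_H-M_alt": "P 21/c"},
    {"id": "2", "name_H-M_alt": "C 2/c"},
]
```

Here `lookup(legacy_file, "name_H-M_alt")` succeeds through the default key, while the same call on `looped_file` fails unless one of the rows carries the default id or the caller supplies a key.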

While this would work prospectively, it still doesn't fix the problem of old software that doesn't know about the new child keys. Adding new keys to a category unequivocally changes the meaning of all of the non-key datanames in that category, and old software will operate using the old meanings.  If this is a point of disagreement, I will write a blog post about it as I will need to use some pictures to explain why I think this.
 

> Regarding altering dREL methods, I think it is worth realising that the dREL methods express the mathematical relationships that are 'out there' in correctly-written user software. We can no more change a correctly-written dREL method than we can change all the software that expresses that method, and as soon as any proposal involves rewriting dREL methods, that proposal is effectively a non-starter for core_CIF.  The converse is that, if we somehow control all the software that expresses that method, then we can change it.


Is it fair to say that for the purposes of the present discussion, what you're saying here means that adopting your proposal would require existing dREL implementations to be updated?  I'm again thinking of how to resolve item associations when at least one item belongs to a Set category with a category key, and the other to a different category without a corresponding child key.

Yes. As far as I know there are only 2 dREL implementations in existence, mine in PyCIFRW and Doug's JavaScript, so altering the implementation is possible. But my point was more that the dREL method in the dictionary expresses relationships that *any* crystallographic software already implements, not because of what is written in the dREL method but simply based on the human-readable definitions - so the dREL for 'cell.volume' describes what software already does just as much as it defines what software should do.  While some dREL rewriting would admittedly be purely technical and make no change, most changes would imply that the way that any software (not just dREL interpreters) should calculate something has changed.

> A clean alternative is simply to define a whole new set of datanames corresponding to a 'Set' category becoming looped.  This could be economically done with (i) a category definition using a single new DDLm attribute to say that all the datanames in the new category are derived from the datanames in the 'Set' category by just changing the name.category_id component (ii) a similar new category definition for each category which now needs a key dataname pointing to the newly-looped category.  To be honest, this would be my preferred option, although we would be stuck with space_group as a legacy category.  Anyway, under this alternative, machine transformation of datafiles between the 'Loop' and 'Set' versions would be straightforward, and capture all the changes in dependent categories (so the space group column in the atom_site_with_spacegroup category would disappear upon transformation to _atom_site, and multiple data blocks, one for each space group, would be generated).
> What do you think of that?
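
The machine transformation mentioned above might look something like this in Python. It is a sketch only: `space_group_id` is a made-up name for the pointer column in the hypothetical atom_site_with_spacegroup category, and each resulting "datablock" is reduced to its list of atom_site rows.

```python
def split_by_space_group(rows, pointer="space_group_id"):
    """Turn looped-version rows into one plain atom_site list per space group.

    The pointer column disappears from the output rows, since each output
    block implicitly holds a single space group.
    """
    blocks = {}
    for row in rows:
        plain = {name: value for name, value in row.items() if name != pointer}
        blocks.setdefault(row[pointer], []).append(plain)
    return blocks


atom_site_with_spacegroup = [
    {"space_group_id": "1", "label": "C1"},
    {"space_group_id": "2", "label": "C1"},
    {"space_group_id": "2", "label": "O1"},
]
```

With the sample rows this yields two blocks, keyed "1" and "2", whose rows contain only the `label` item.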


I like that idea much better than the original proposal.  I can conceptualize it in terms of distinguishing the nature of a thing (category) from how that thing is used in a particular context (dictionary), which makes more sense to me.  And I think it addresses all the concerns I enumerated in my previous message.  I would like it especially well if there were a way to leverage _import.get for this purpose instead of modifying DDLm.

Perhaps you have misunderstood me: the categories constructed in this way would be distinct categories with distinct datanames, and while they could be defined in separate dictionaries, they could also appear in the same dictionary.  In terms of this latest proposal, the space_group category as originally created in cif_sym was the looped version of the symmetry category, and should simply never have been presented as perfectly equivalent to the symmetry category.

On the other hand, I am not certain I prefer it to making most categories in data dictionaries be bona fide Loops, with an internally-consistent mechanism to elide category keys where unneeded, as I proposed.  In fact, I'm not confident that we will avoid wanting such a mechanism in any case.  Nevertheless, if the group is not prepared to go there at this point then that idea can go back on the shelf.


If it weren't for the legacy software issue and the need to mimic the behaviour of the old DDL datanames, this would be a viable alternative. Indeed, the imgCIF dictionary works in exactly this way, allowing scan_id keys to be elided if there is only one scan.

Which brings up another consequence of the synthetic proposal that I hadn't really considered.  If we consider the space_group example, a programmer wishing to obtain the symmetry operators currently has to check the old symmetry category as well as the new space_group_symops category if there is only one space_group category entry (although I doubt anybody actually bothers to check the latter in structural work).  So even though we might create a bunch of synthetic categories purely for a particular, obscure use case, everybody who reads in the original categories will still need to check the new categories, as they become equivalent to the original category when there is one entry in the category created from the original 'Set' category (the same situation in which the key elision could operate).  This annoyance is also the case for my original proposal - because although the original 'Set' category keeps the same name, the new categories dependent on it will have different names but be equivalent for a single entry in the 'Set' category.

Which leads me to think that we need a new proposal to tackle this, as I think we are already annoying CIF-reader programmers that have to deal with an ever-increasing list of equivalent datanames for common items.
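
For illustration, the burden on a reader looks roughly like this in Python. The datablock is assumed to be a plain dict of dataname to list of values, and the list of equivalent names is an example, not an exhaustive or authoritative alias table.

```python
# Example list of datanames a reader may have to try for symmetry operators.
EQUIVALENT_SYMOP_NAMES = [
    "_space_group_symop_operation_xyz",
    "_symmetry_equiv_pos_as_xyz",
]


def read_symops(block, names=EQUIVALENT_SYMOP_NAMES):
    """Return the operators from the first equivalent dataname present."""
    for name in names:
        if name in block:
            return block[name]
    return None


old_style = {"_symmetry_equiv_pos_as_xyz": ["x,y,z", "-x,-y,-z"]}
new_style = {"_space_group_symop_operation_xyz": ["x,y,z", "-x,-y,-z"]}
```

Every synthetic category added for an unusual use case would lengthen a list like `EQUIVALENT_SYMOP_NAMES` for every reader, including those that only ever see a single space group.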

James.

_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group



--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

