Re: [ddlm-group] Further discussion of proposal #2

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] Further discussion of proposal #2
From: Brian McMahon <[email protected]>
Date: Mon, 20 Jun 2016 18:03:53 +0100
In-Reply-To: <CAM+dB2fu4OXVqGf1X=s+Q0dRD1ftzPUvQ2+iEG-V5rhZKASCow@mail.gmail.com>
References: <CAM+dB2c4XhGDZQ7PBAHhUfmTXc7X7H2PboWBH3s1dapp0Gh_KQ@mail.gmail.com><BY2PR0401MB093685CBF929AF951C0626FCE0570@BY2PR0401MB0936.namprd04.prod.outlook.com><CAM+dB2fu4OXVqGf1X=s+Q0dRD1ftzPUvQ2+iEG-V5rhZKASCow@mail.gmail.com>
Dear Colleagues
I have had little time to keep pace with the discussion as it developed, and have tried to review the whole conversation today, but there's too much for me to assimilate in one sitting. Just a couple of comments and queries to help clear away some of my confusion.
(1) I'm very much in favour of the proposal to make AUDIT_CONFORM a Loop and not a Set.
(2) It seems to me that a formal approach to the distinction might be to define a Set as a category of data items that - if looped without an explicit key - assume a default value ('') of the category key item. This obliges dictionary writers to specify a data name that plays the role of a formal key for *every* category, but it does not require data files to carry instances of every such key data name. [Or maybe it's a little more forgiving than that: "lazy" dictionary writers only need to specify key data names when real use cases demand looping of what had been expected to be single-value values; but then it is incumbent on them to stir out of their laziness and ensure that all consequent child key relationships are consistent across the new use cases that have arisen.]
(3) The _audit.schema proposal has its attractions, though I'm not sure how it works in practice. I mean, suppose I define an "INCOMMENSURATE" schema to indicate that multiple space groups describe multiple discernible symmetries in a real atomic (quasi-)lattice, and a "TABLES" schema to indicate that this is just a list of symmetry operations in all the distinct space groups. It could be useful for validation purposes to know that "INCOMMENSURATE" also requires additional information/relations between other categories (e.g. are there different origins or orientation matrices associated with each space group?). Is _audit.schema necessary and sufficient to capture these additional requirements? If not, could it be made so? I think this is moving into the sort of thing that Simon is interested in - can we elegantly define application profiles that say "this is a single-crystal untwinned structure", "this is an incommensurate powder structure with twinning", "this is a structure refinement with its own database of neutron absorption coefficients"?
(4) Probably an obtuse question, but is it possible to retain in the DDLm version of the core a SYMMETRY category that is a Set, and a separate SPACE_GROUP category that is a Loop? Hardly elegant, but a way of owning up to the historical mistake? Then the relationship between the different datanames would not be through the alias mechanism, but rather by some dREL transformation?
(5) So I've not commented specifically on the 'Global' proposal below. As I understand it, the change in name is designed to make clearer the circumstances in which, as it were, you want to force a category not to loop its values. If 'globality' is indeed the only reason that you would enforce such a constraint, and if that helps programmers to understand what's going on, I'd be in favour of it; but I want to think some more about it before committing myself to that first opinion!
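To make point (2) a little more concrete, here is a minimal Python sketch of how software might fill in an omitted category key. It is an illustration under stated assumptions only: the function name packets_with_keys and the constant DEFAULT_KEY are hypothetical, the choice of _space_group.id as the key data name is assumed for the example, and nothing here is taken from DDLm or from any existing CIF library.

```python
# Hypothetical illustration: these names are not part of DDLm or any CIF library.
DEFAULT_KEY = ""  # the default value ('') assumed for an omitted category key


def packets_with_keys(packets, key_name):
    """Return copies of the packets with the category key filled in.

    A packet that omits key_name receives DEFAULT_KEY. Packets that end up
    sharing a key value are rejected, so a category with more than one
    packet must carry explicit, distinct keys.
    """
    seen = set()
    result = []
    for packet in packets:
        filled = dict(packet)  # copy, so the caller's data is not modified
        filled.setdefault(key_name, DEFAULT_KEY)
        key = filled[key_name]
        if key in seen:
            raise ValueError(f"duplicate value {key!r} for key {key_name}")
        seen.add(key)
        result.append(filled)
    return result


# A data file that treats the category as a Set: one packet, no key given.
single = packets_with_keys(
    [{"_space_group.name_Hall": "-P 2ybc"}], "_space_group.id"
)

# A data file that loops the category supplies distinct explicit keys.
looped = packets_with_keys(
    [
        {"_space_group.id": "1", "_space_group.name_Hall": "-P 2ybc"},
        {"_space_group.id": "2", "_space_group.name_Hall": "-P 1"},
    ],
    "_space_group.id",
)
```

In this sketch the single-packet case silently acquires the key '', while two keyless packets collide on that default and are refused. It covers only the key itself; the child key relationships mentioned in the bracketed remark are not modelled.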
Brian

On 20/06/2016 08:35, James Hester wrote:
> Dear John et. al.
>
> To summarise at the top, my principal objection to the 'default key'
> proposal is that it produces more complex dictionaries (more keys) with
> interactions that are initially surprising to a casual reader.
>
> Now in detail:
>
> I think our goal here is to come up with semantics that can (i)
> replicate DDL1/DDL2 'global category' behaviour and (ii) allow these
> global categories to become multi-packeted, with simultaneous loss of
> 'globality'. 'Global' categories (what I have referred to previously as
> 'Set' categories) are just a tool for simplification of dictionaries,
> and so the more complex we make their operation, the less benefit they
> provide. Likewise, the mainstream behaviour of feature (i) should be as
> easy as possible to use.
>
> Proposal #2 as it currently stands (the 'Set' proposal) envisaged that
> the 'globality' of a category would be removed when using datanames
> defined within a separate dictionary (mostly key datanames), and
> software should use the _audit.schema dataname and potentially
> _audit.conform to shield itself from the change in meaning that this
> entails.
>
> The 'default keys' proposal that John has outlined instead envisages
> making almost all 'Set' categories into 'Loop' categories, defining keys
> for them, and giving those keys default values. John has suggested that
> this does not now involve a change in DDLm, because the semantics of
> having a default key are clear - the dataname can be left out if there
> is only one packet.
> However, a 'global' category with only one packet
> does *not* (currently) act like a 'Loop' category with only one packet,
> because (unlike a single-packet 'Loop' category) the values appearing as
> non-key datanames in the 'global' category may be assumed when
> interpreting values from all other datanames in all other loops.
> 'Global' categories really are different to 'Loop' categories for this
> reason, regardless of whether or not a key dataname is provided.
>
> This difference between 'Global' and 'Loop' categories could be removed
> completely if all of the global category child keys were defined in
> parallel. In this case, the 'Global' category no longer acts 'Globally'
> but only in those categories for which a child key is defined. This
> 'simplification' comes at the expense of a whole lot of keys - in some
> categories, a key for every 'Set' category currently defined. At this
> point we have lost the practical simplification that we had obtained
> from 'Set' categories to start with. So, either you accept a change in
> DDLm (additional consequences of a default key) and define the child
> keys at a future date in another dictionary, or you keep DDLm unchanged
> and include the child keys in the main dictionary immediately, throwing
> out the considerable simplification afforded by having global values. I
> would be against the latter option as it introduces a bunch of
> rarely-used key definitions into the main dictionary and is likely to be
> confusing to a casual programmer.
>
> (We could of course alternatively adopt the blanket rule that values
> appearing in a single-packet loop act globally with identical
> 'disappearing key' behaviour. While this is true enough mathematically,
> it now becomes permissible to drop keys that have up until now been
> required even for single packet loops and loops with foreign keys that
> point to those single-packet loops, and this would break current
> software.
> So I exclude this as an option, even if it is an elegant rule.)
>
> So, given that we are stuck with two types of 'Loop' category, I would
> prefer communicating this clearly up front in the _definition.class tag,
> rather than relying on the presence of a default key value. What I
> think might communicate better than the current 'Set' definition,
> however, is a change from 'Set' to 'Global' (or 'Overall'), with a
> definition something like:
>
> Global
> ;
>     A special type of 'Loop' category. When single-valued, (i.e.
>     key-value pairs or single-row loops) datanames from a 'Global'
>     category provide overall values for use in interpreting any
>     other values in a datablock.  Global categories may only be
>     looped where a key has been defined.
> ;
>
> I'm not sure if this is more likely to meet with approval.
>
> I have added some more comments in John's email below.
>
> On 18 June 2016 at 08:31, Bollinger, John C <[email protected]> wrote:
>
>     Dear James and Colleagues,
>
>     Comments in line below.
>
>     On Thursday, June 16, 2016 9:23 PM, James Hester wrote:
>     > I'm not at all concerned about tweaking DDLm. The proposed update to DDLm is a clarification and an extension, because the semantic interpretation of existing files would be unchanged. Is there any particular reason you are concerned about such measured changes to DDLm?
>     > From my point of view DDLm is the lowest-impact area of the framework - very few people actually care *how* we express the meaning of a dataname, as long as that meaning doesn't change, and those that do care deeply about DDL in general (in my experience, databases) have not done any work on DDLm yet.
>
>     Perhaps my concerns are misplaced, but it seems to me that the DDLs
>     are the locations of greatest semantic leverage in our framework.
>     On one hand, that means that we can make a large impact with changes
>     there, but on the other hand it means that even small changes there
>     can have large unintended side effects. Indeed, although I am
>     unaware of any explicit assertion to this effect previously, it
>     seems to me that we should have at least the same commitment to the
>     stability of definitions in our DDL dictionaries that we do to the
>     stability of definitions in our data dictionaries. But perhaps we
>     can relax that a bit for DDLm, given that its use is still small.
>
> Very little DDLm software has been written, and mostly by those in this
> group. A lot of thought and negotiation (I believe) has gone into DDLm,
> so we should not be too cavalier with our changes. Now is the best time
> to make them rather than later when we might hope for more widespread
> adoption.
>
>     > I'm not opposed to the concept of a default key value per se, I'm just unclear as to why you are arguing that this needs to be defined in a cif_core 'Set' category as opposed to an add-on dictionary.
>
>     I'm arguing that a category that has a key and permits multiple
>     values per item is a de facto Loop, and that it is best to in fact
>     define such a category as a Loop so that that is clear. In that
>     case its key must be expressed in the dictionary that defines the
>     category.
>     It would also be acceptable to classify such a category
>     with some new label, but in that case I still think it would be most
>     sensible to define the key in the same dictionary that defines the
>     category itself.
>
> See my comments at the top of the email. I have provided a new label
> and definition, which indicates that the category can be looped, and
> under what conditions multiple packets may be expected. Perhaps this is
> acceptable?
>
>     I'm also arguing against the "magic keys" aspect of Proposal #2. I
>     don't like magic, a.k.a. special cases, in specifications or in
>     software, and I have presented a viable alternative in the form of
>     default key values.
>
> The reason for the special case 'Set' category is the considerable
> simplification it offers. We trade complexity of behaviour in one place
> for simplicity elsewhere. And we are ultimately stuck with it because
> of DDL1.
>
>     I'm furthermore arguing that even if we do give keys to Sets,
>     wherever a category key or child key is itself defined is the proper
>     place for any applicable default value for that key to be defined.
>     The default value is an attribute of the definition of the key item,
>     so I see only negatives to physically separating the two.
>
> Absolutely, I wouldn't argue with this.
>
>     >> Let's consider the SPACE_GROUP category, since it sparked this whole discussion. I append a cut at what I think we should do with it (only frames containing modifications are presented); I think I have marked all the changes and additions within via CIF comments. I rarely wrangle dictionaries, so I apologize for any errors I have committed. The key defaulting presented within formalizes how, when, and why SPACE_GROUP's category key and the associated child key in SPACE_GROUP_SYMOP can be omitted from data files.
>     >> To the best of my knowledge, nothing within relies on any DDLm changes.
>     >
>     > I think I understand your proposal to be using the existence of a default key value to signal that the key may be omitted in a single-value loop, *and* that child key datanames in other loops that would otherwise contain them may be omitted in this case.
>
>     I guess you can describe it as a "signal". I view it as deeper and
>     more organic: where an explicit parent or child key may be omitted
>     from data files, that is a direct consequence of the fact that it
>     has a default value. That dictionary-driven software should handle
>     such omissions naturally is also a consequence. These items can be
>     omitted because they still take well-defined and suitable (default)
>     values in that case.
>
>     I don't think I'm suggesting any change to the defined meaning of
>     _enumeration.default; I'm just applying its existing meaning to the
>     problem at hand in a way that we have not done before. The
>     significance pertains not to _enumeration.default itself, but to its
>     combination with a category key. That's not a change, it's a
>     discovery. Even so, the underlying idea is not actually new. One
>     can view it as a specific case of the same thing expressed by DDL2's
>     _item.mandatory_code taking the value 'implicit'.
>
> See my comments at the beginning for why I think there is more than just
> logical consequences going on here i.e. there is global behaviour.
>
>     > I'm not clear whether you propose that these changes should happen in cif_core, or in an add-on dictionary.
>
>     For space_group, the dictionary changes should be applied to the
>     core, in order to make the DDLm core consistent with our other
>     dictionaries. I am generally inclined to put future (re-)keyings of
>     core categories directly into the core dictionary as well, but
>     that's a weaker opinion.
>     Furthermore, I think there may be a way to
>     do this so that we avoid an explosion of child keys, but I haven't
>     worked all the way through that yet.
>
> Your proposal on child keys would be interesting as I argue above that
> an explosion of child keys is a drawback and essentially removes the
> advantage gained by having global categories.
>
>     > In any case, I agree that this can be made precisely semantically equivalent to the 'Set' proposal, due to the fact that a default key value makes no sense in general and so the meaning of a default value for a key may be overloaded as you have done, with no implications elsewhere. This is still a change to DDLm, because the presence of _enumeration_default in certain definitions now has new implications (not that I'm opposed in principle to changing DDLm).
>
>     I agree that the "magic keys" aspect of Proposal #2 and the default
>     keys approach I have presented both enable categories to have keys
>     that are not expressed explicitly in data files.
>     The former does it
>     by fiat; the latter does it in a manner consistent with DDLm's
>     existing semantics, even if our dictionaries have not exercised DDLm
>     in quite that way before.
>
>     I agree that a default key value is not necessarily sensical for
>     every present or conceivable category, but I disagree that I am
>     overloading any definition, or that I am proposing a change to DDLm.
>
> I would see no problem in a separate dictionary defining the key and
> default value for a 'Global' category as I've defined above.
> Essentially, conformance to this separate dictionary erases the 'Global'
> nature of the category and turns it into a normal 'Loop' category with
> default key, so that datafiles created according to the original
> specification remain valid with the new dictionary - we have in fact
> elegantly expanded the ontology.
>
>     Default key values do not make sense for categories that rely on
>     natural keys, as does mmCIF's atom_type category, for example.
>     Atom_type's key, _atom_type.symbol, is the chemical symbol for the
>     element whose characteristics are described; it is a natural key
>     because it has significance beyond distinguishing one atom_type from
>     another. In other words, it is not just a key, but also part of the
>     data.
>
>     On the other hand, space_group does not use a natural key, but
>     rather a surrogate key -- one whose values have no inherent meaning
>     other than to distinguish between different space_groups presented
>     in the same data file. If only one space_group is presented then
>     any key for it will do, because the keys are arbitrary. A default
>     value for such a key is perfectly sensible.
>
>     Now, consider this: what kind of key will any new category have if
>     that category requires one or more existing Set categories to become
>     looped?
>     We have previously discussed possibilities such as
>     twin_component and variant, but as far as I can tell, these do not
>     afford any clear, non-trivial, natural keys. Addition of any
>     category that relies on a natural key would require existing sets to
>     be looped only if that category's key is inherently single-valued
>     with respect to those sets. I'm in fact having trouble seeing the
>     circumstances under which it would make sense to add a new category
>     that has a natural key and that requires existing sets to be
>     looped. But even if we did discover a new category with a
>     non-trivial, natural, candidate key, we always have the option of
>     choosing a surrogate key instead. Indeed, that's what was done with
>     space_group -- _space_group.name_Hall is a candidate key, I think,
>     but we chose a surrogate key instead. If we choose surrogate keys
>     then default key values present no semantic problem.
>
> I agree with this - I'm not arguing that default key values are somehow
> bad or present problems, only that the 'global' behaviour is not captured.
>
>     > My preference would still be for the 'Set' proposal, because the semantics are wrapped up in a single enumerated value, at category level, rather than arising from an interaction between attributes of a particular dataname inside that category. I do not see any other distinguishing features. I believe that for programmers, dictionary authors, and casual dictionary readers, the 'Set' proposal is more accessible, as the particular special behaviour of the category is flagged explicitly and concisely, in the category definition, and described in a single place in the DDLm attribute dictionary.
>
>     [...]
>     Moreover, even if we did provide magic key behavior for Sets, I am
>     not convinced that all the constituencies named would necessarily
>     consider that a win, because it weakens the concept of a Set.
>     There
>     is a tremendous difference between "the items in a Set category take
>     only one value each" and "the items in a Set category *ordinarily*
>     take only one value each", especially when "ordinarily" really means
>     when the data describe a particular kind of thing to which we have
>     ascribed special status. In many respects, programming for, using,
>     or interpreting the latter (the magic keys version) are all more
>     difficult than programming for, using, or interpreting the former
>     (the current version).
>
> OK, point taken, I did say my objection wasn't critical.
>
>     > You will notice there is semantic convenience in referring to a category as a 'Set' category, rather than 'a category that has a default key value defined'. If you propose changing the cif_core dictionary rather than using an add-on dictionary, then the 'Set' proposal involves zero changes, whereas the default_value proposal involves a single extra key definition and adjustment to the definitions for each 'Set' category. Both these objections are not particularly critical, of course.
>
>     The semantic convenience described comes at the cost of weakening
>     the concept of a 'Set', and as a result, the comparison presented
>     involves inequivalent expressions. The magic keys analog of 'a
>     category that has a default key value defined' is 'a Set that has a
>     category key defined'; these don't seem very different in weight to
>     me. If we suppose that a Set may have one or more keys defined in a
>     different dictionary than the one in which the Set itself is
>     defined, then additionally we may not even be certain which kind of
>     Set we're talking about, and if that ever changes then we cannot be
>     confident of being able to recognize that from the dictionary at
>     hand.
>     That is of course where _audit.schema and audit_conform come
>     in, but I am not much liking the idea that an applicable item
>     definition, taken in context of its dictionary, may not completely
>     define the given item.
>
> That is where we started - if we are to allow datanames to be used with
> global meaning and in multi-packet loops, then we are talking about
> different meanings, and only something like _audit.schema can insulate
> software from that. If we are to exclude changes in meaning, we have to
> define all child keys up front and then we need _audit.schema even more
> than before, as _audit.conform won't help. In the 'all child keys
> defined up front' scenario, we completely abandon global categories and
> _audit.schema becomes the signal as to when a datablock can be
> interpreted as for the old cif_core.
>
>     > Ultimately, this is going to be a matter of taste as the semantics can be made identical, and so I don't know quite what else you or I can say to convince each other on this point. We may have to rely on our colleagues to decide.
>
>     We do seem to have both settled into our positions. Would it sway
>     you at all if I successfully devised a solution to the child key
>     proliferation problem? I have some ideas in that direction that I
>     haven't fleshed out yet.
>
> It could indeed sway me as I think this is at the core of my objection.
> If we could effectively define all the child keys, while at the same
> time keeping the key definitions from swamping out the meat of the
> dictionary, and allow for the appearance of future 'used to be global'
> categories like twinning and variants adding their own child keys, then
> it would be worth serious thought.
> I'm pretty sure dREL can be brought
> along with whatever variation you propose.
>
>     >> Note, by the way, that I think the particular changes presented, or something very like them, are needed regardless of what we choose for the general case, because the DDL1 core and mmCIF are already structured this way.
>     > I was perhaps too diplomatic or long-winded in previous messages. The incorporation of space_group into cif_core as a looped category was a mistake that we must *not* perpetuate. We either correct it by dropping it from DDLm cif_core, which is impossible due to widespread DDL1 usage (as a 'Set' category), or we fix the semantics. So, in the case of space_group we can feel ourselves bound only by widespread current usage, not by the contradictory semantics of the DDL1 version.
>
>     I accept that the deprecation of SYMMETRY and SYMMETRY_EQUIV in
>     favor of SPACE_GROUP and SPACE_GROUP_SYMOP was a mistake, but
>     whatever fix we contemplate should adhere to our policy of keeping
>     definitions stable, at least as well as we are able to make it do.
>     Moreover, how to deal with SPACE_GROUP is a somewhat separate issue,
>     because it involves definitions that already exist, as opposed to
>     definitions that we may write in the future. It makes for a
>     reasonable test case for our future direction, but it may be that a
>     different solution is more suitable here than whatever we decide to
>     do in the future, when we have no legacy definitions to deal with.
>
> Our policy of keeping definitions stable is not an end in itself, but a
> logical requirement born of the need to guarantee that software that is
> already written remains valid.
> If everybody is using unlooped
> SPACE_GROUP to read and write structures I don't see any issue in
> fiddling with the meaning, as long as any changes are consistent with
> that expectation of an unlooped value.
>
>     I'll have more to say about these particular cases, in a separate
>     message.
>
>     [...]
>     ________________________________
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
>
> _______________________________________________
> ddlm-group mailing list
> [email protected]
> http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group