
Re: [ddlm-group] Further discussion of proposal #2

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] Further discussion of proposal #2
From: "Herbert J. Bernstein" <[email protected]>
Date: Mon, 20 Jun 2016 05:04:58 -0400

Dear James,

My thanks to you and to John for your efforts. I look forward to the results.

Regards,

Herbert

On Mon, Jun 20, 2016 at 3:47 AM, James Hester <[email protected]> wrote:

Hi Herbert,

We are working out the details of a proposal that goes beyond anything that DDL1 or DDL2 currently envisage (essentially an automated, future-proof change in database schema), so it necessarily involves detailed discussion. While users are obviously free to use CIF however they wish in the privacy of their own laboratory, as a standards body we need to ensure that *no* files written according to the standard are misinterpreted by *any* software also written according to the standard. The user in your example may gleefully loop a Set, but that file could be woefully misunderstood by most software.

My feeling is that we are making progress, so please be patient for a little longer.

all the best,
James.

On 18 June 2016 at 11:23, Herbert J. Bernstein <[email protected]> wrote:
Dear Colleagues,

I am now totally lost as to what issue is being discussed, so please let me intrude with what I think are the issues relevant to the users of CIF as opposed to the users of DDL1, DDL2 or DDLm.

A given data CIF is a container for tags and associated values. Another way to think of it is as a container for objects as instances of classes. A particular atom_site is an object with properties such as an x, y and z coordinate, an element type, an atom name, an occupancy, etc. Most data CIFs will have lots of atom_site objects. For most users, their data CIF will have only one cell/space group object. It does not matter to a user if we call some object a set or a loop. What matters to the user is that we have provided a container that will hold all the relevant object instances for the user's data.

Now suppose a user decides to do a study of a crystal that happens to have two incommensurate lattices intertwined in the same crystal. Maybe they decide to handle this with quasi-crystal notation, but in some cases, they may simply decide to provide two cells and space groups for the same data CIF. All they want from us is to tell them a clean simple way to put both cells and space groups in the same CIF at the same time. That user could do a perfectly fine job just looping the two sets of data. In this case there is no ambiguity about the meaning. No explicit keys are needed. However, having started the foray into such substances, our user decides to study a few more, and even make a small database of them. Now, it would be nice if the user can also put his database into one CIF. Now he needs some explicit keys.

If we have a clean way to do these three things -- present a single substance with one cell, present a single substance with two or more cells, present a database with multiple substances, each of which may have multiple cells -- we have made the user happy. It is obvious how the user can do these things in a CIF, so some user is going to do it -- they are going to loop what DDLm is calling a Set both without and with new keys, and they are not going to pay any attention to protests that they should not do that. They'll just say they are using DDL2. If DDLm does not allow it, we need to change DDLm. If it makes understanding or programming for DDLm a little more complicated, we will have to live with that -- that is the entire point of having software, to let users do complicated things easily.
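
The three cases can be pictured as plain row lists. What follows is only a rough Python sketch, assuming a hypothetical in-memory model in which a category is a list of dicts. The item names `substance_id` and `cell_id`, the numeric values, and the helper `cells_for` are invented for illustration and are not taken from any CIF dictionary.

```python
# Hypothetical in-memory model: a category is a list of rows (dicts).
# All item names and values here are assumptions made for illustration.

# Case 1: a single substance with one cell; no key is needed.
one_cell = [{"length_a": 5.43, "length_b": 5.43, "length_c": 5.43}]

# Case 2: a single substance with two cells, simply looped without keys.
two_cells = [
    {"length_a": 5.43, "length_b": 5.43, "length_c": 5.43},
    {"length_a": 7.10, "length_b": 7.10, "length_c": 12.30},
]

# Case 3: a small database; each row now carries explicit keys.
database = [
    {"substance_id": "s1", "cell_id": "1", "length_a": 5.43},
    {"substance_id": "s1", "cell_id": "2", "length_a": 7.10},
    {"substance_id": "s2", "cell_id": "1", "length_a": 9.87},
]


def cells_for(rows, substance_id):
    """Select the cell rows that belong to one substance, using the explicit key."""
    return [row for row in rows if row.get("substance_id") == substance_id]
```

In this sketch `cells_for(database, "s1")` picks out two rows, whereas the unkeyed loop in case 2 offers nothing to select on, which is why explicit keys only become necessary in the database case.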

I have no trouble with either of James' proposals. I do have trouble with the idea of making life more difficult for users in order to make life easier for programmers.

Please, let us agree on something that can be used for this purpose and move on.

Regards,
    Herbert

On Fri, Jun 17, 2016 at 6:31 PM, Bollinger, John C <[email protected]> wrote:
Dear James and Colleagues,

Comments in line below.

On Thursday, June 16, 2016 9:23 PM, James Hester wrote:
> I'm not at all concerned about tweaking DDLm. The proposed update to DDLm is a clarification and an extension, because the semantic interpretation of existing files would be unchanged. Is there any particular reason you are concerned about such measured changes to DDLm? From my point of view DDLm is the lowest-impact area of the framework - very few people actually care *how* we express the meaning of a dataname, as long as that meaning doesn't change, and those that do care deeply about DDL in general (in my experience, databases) have not done any work on DDLm yet.

Perhaps my concerns are misplaced, but it seems to me that the DDLs are the locations of greatest semantic leverage in our framework. On one hand, that means that we can make a large impact with changes there, but on the other hand it means that even small changes there can have large unintended side effects. Indeed, although I am unaware of any explicit assertion to this effect previously, it seems to me that we should have at least the same commitment to the stability of definitions in our DDL dictionaries that we do to the stability of definitions in our data dictionaries. But perhaps we can relax that a bit for DDLm, given that its use is still small.

> I'm not opposed to the concept of a default key value per se, I'm just unclear as to why you are arguing that this needs to be defined in a cif_core 'Set' category as opposed to an add-on dictionary.

I'm arguing that a category that has a key and permits multiple values per item is a de facto Loop, and that it is best to in fact define such a category as a Loop so that that is clear. In that case its key must be expressed in the dictionary that defines the category. It would also be acceptable to classify such a category with some new label, but in that case I still think it would be most sensible to define the key in the same dictionary that defines the category itself.

I'm also arguing against the "magic keys" aspect of Proposal #2. I don't like magic, a.k.a. special cases, in specifications or in software, and I have presented a viable alternative in the form of default key values.

I'm furthermore arguing that even if we do give keys to Sets, wherever a category key or child key is itself defined is the proper place for any applicable default value for that key to be defined. The default value is an attribute of the definition of the key item, so I see only negatives to physically separating the two.

>> Let's consider the SPACE_GROUP category, since it sparked this whole discussion. I append a cut at what I think we should do with it (only frames containing modifications are presented); I think I have marked all the changes and additions within via CIF comments. I rarely wrangle dictionaries, so I apologize for any errors I have committed. The key defaulting presented within formalizes how, when, and why SPACE_GROUP's category key and the associated child key in SPACE_GROUP_SYMOP can be omitted from data files. To the best of my knowledge, nothing within relies on any DDLm changes.
>
> I think I understand your proposal to be using the existence of a default key value to signal that the key may be omitted in a single-value loop, *and* that child key datanames in other loops that would otherwise contain them may be omitted in this case.

I guess you can describe it as a "signal". I view it as deeper and more organic: where an explicit parent or child key may be omitted from data files, that is a direct consequence of the fact that it has a default value. That dictionary-driven software should handle such omissions naturally is also a consequence. These items can be omitted because they still take well-defined and suitable (default) values in that case.
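
As a minimal sketch of how dictionary-driven software might treat an omitted key, assume a hypothetical lookup table `KEY_DEFAULTS` standing in for the `_enumeration.default` values read from a dictionary. The key name `_space_group.id`, the default value "1", and the function name are assumptions for illustration, not taken from any published dictionary revision.

```python
# Hypothetical stand-in for dictionary lookups: key dataname -> default value.
# The default "1" is an assumed value, used for illustration only.
KEY_DEFAULTS = {"_space_group.id": "1"}


def apply_key_default(rows, key_name, defaults=KEY_DEFAULTS):
    """Return copies of rows in which an omitted key takes its default value."""
    if key_name not in defaults:
        # No default is defined, so an omitted key simply stays missing.
        return [dict(row) for row in rows]
    # A value given explicitly in the row overrides the default.
    return [{key_name: defaults[key_name], **row} for row in rows]


# A data file that presents a single space group and omits the key entirely.
space_group = [{"_space_group.name_Hall": "-P 2ybc"}]
filled = apply_key_default(space_group, "_space_group.id")
```

Here `filled` carries the assumed default key alongside the Hall symbol, and the input rows are left untouched. A fuller implementation would presumably also reject a loop in which two rows end up with the same key value.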

I don't think I'm suggesting any change to the defined meaning of _enumeration.default; I'm just applying its existing meaning to the problem at hand in a way that we have not done before. The significance pertains not to _enumeration.default itself, but to its combination with a category key. That's not a change, it's a discovery. Even so, the underlying idea is not actually new. One can view it as a specific case of the same thing expressed by DDL2's _item.mandatory_code taking the value 'implicit'.

> I'm not clear whether you propose that these changes should happen in cif_core, or in an add-on dictionary.

For space_group, the dictionary changes should be applied to the core, in order to make the DDLm core consistent with our other dictionaries. I am generally inclined to put future (re-)keyings of core categories directly into the core dictionary as well, but that's a weaker opinion. Furthermore, I think there may be a way to do this so that we avoid an explosion of child keys, but I haven't worked all the way through that yet.

> In any case, I agree that this can be made precisely semantically equivalent to the 'Set' proposal, due to the fact that a default key value makes no sense in general and so the meaning of a default value for a key may be overloaded as you have done, with no implications elsewhere. This is still a change to DDLm, because the presence of _enumeration.default in certain definitions now has new implications (not that I'm opposed in principle to changing DDLm).

I agree that the "magic keys" aspect of Proposal #2 and the default keys approach I have presented both enable categories to have keys that are not expressed explicitly in data files. The former does it by fiat; the latter does it in a manner consistent with DDLm's existing semantics, even if our dictionaries have not exercised DDLm in quite that way before.

I agree that a default key value is not necessarily sensical for every present or conceivable category, but I disagree that I am overloading any definition, or that I am proposing a change to DDLm.

Default key values do not make sense for categories that rely on natural keys, as does mmCIF's atom_type category, for example. Atom_type's key, _atom_type.symbol, is the chemical symbol for the element whose characteristics are described; it is a natural key because it has significance beyond distinguishing one atom_type from another. In other words, it is not just a key, but also part of the data.

On the other hand, space_group does not use a natural key, but rather a surrogate key -- one whose values have no inherent meaning other than to distinguish between different space_groups presented in the same data file. If only one space_group is presented then any key for it will do, because the keys are arbitrary. A default value for such a key is perfectly sensible.

Now, consider this: what kind of key will any new category have if that category requires one or more existing Set categories to become looped? We have previously discussed possibilities such as twin_component and variant, but as far as I can tell, these do not afford any clear, non-trivial, natural keys. Addition of any category that relies on a natural key would require existing sets to be looped only if that category's key is inherently single-valued with respect to those sets. I'm in fact having trouble seeing the circumstances under which it would make sense to add a new category that has a natural key and that requires existing sets to be looped. But even if we did discover a new category with a non-trivial, natural, candidate key, we always have the option of choosing a surrogate key instead. Indeed, that's what was done with space_group -- _space_group.name_Hall is a candidate key, I think, but we chose a surrogate key instead. If we choose surrogate keys then default key values present no semantic problem.

> My preference would still be for the 'Set' proposal, because the semantics are wrapped up in a single enumerated value, at category level, rather than arising from an interaction between attributes of a particular dataname inside that category. I do not see any other distinguishing features. I believe that for programmers, dictionary authors, and casual dictionary readers, the 'Set' proposal is more accessible, as the particular special behaviour of the category is flagged explicitly and concisely, in the category definition, and described in a single place in the DDLm attribute dictionary.

By "the 'Set' proposal", I think you mean that we give category keys to certain Set categories, and rely on the "magic keys" provision of proposal #2 to allow associated child keys (and the category keys themselves?) to be omitted from data files when that does not create an ambiguity. Thus, I interpret you to be saying that the "magic keys" behavior would be associated specifically with Set categories, and to be asserting that various constituencies will favor that. There, however, I think you're setting up a bit of a straw man.

If indeed there is any magic / special case behavior then I agree that it would be best to indicate it as clearly and concisely as possible (and I observe in passing that labelling a category as a Set in fact does so only provisionally, depending on whether a category key is also defined). But the alternatives are not "magic behavior expressed concisely" and "magic behavior expressed complexly"; rather, they are "magic behavior" and "non-magic, machine-readable behavior". I argue that any kind of magic behavior is inherently unfavorable. Behavior that follows from the machine-readable semantics of DDLm is to be preferred, in large part because the details are precisely described for both humans and machines. That does not prevent describing the behavior with prose as well, of course.

Moreover, even if we did provide magic key behavior for Sets, I am not convinced that all the constituencies named would necessarily consider that a win, because it weakens the concept of a Set. There is a tremendous difference between "the items in a Set category take only one value each" and "the items in a Set category *ordinarily* take only one value each", especially when "ordinarily" really means when the data describe a particular kind of thing to which we have ascribed special status. In many respects, programming for, using, or interpreting the latter (the magic keys version) are all more difficult than programming for, using, or interpreting the former (the current version).

> You will notice there is semantic convenience in referring to a category as a 'Set' category, rather than 'a category that has a default key value defined'. If you propose changing the cif_core dictionary rather than using an add-on dictionary, then the 'Set' proposal involves zero changes, whereas the default_value proposal involves a single extra key definition and adjustment to the definitions for each 'Set' category. Both these objections are not particularly critical, of course.

The semantic convenience described comes at the cost of weakening the concept of a 'Set', and as a result, the comparison presented involves inequivalent expressions. The magic keys analog of 'a category that has a default key value defined' is 'a Set that has a category key defined'; these don't seem very different in weight to me. If we suppose that a Set may have one or more keys defined in a different dictionary than the one in which the Set itself is defined, then additionally we may not even be certain which kind of Set we're talking about, and if that ever changes then we cannot be confident of being able to recognize that from the dictionary at hand. That is of course where _audit.schema and audit_conform come in, but I am not much liking the idea that an applicable item definition, taken in context of its dictionary, may not completely define the given item.

> Ultimately, this is going to be a matter of taste as the semantics can be made identical, and so I don't know quite what else you or I can say to convince each other on this point. We may have to rely on our colleagues to decide.

We do seem to have both settled into our positions. Would it sway you at all if I successfully devised a solution to the child key proliferation problem? I have some ideas in that direction that I haven't fleshed out yet.

>> Note, by the way, that I think the particular changes presented, or something very like them, are needed regardless of what we choose for the general case, because the DDL1 core and mmCIF are already structured this way.
> I was perhaps too diplomatic or long-winded in previous messages. The incorporation of space_group into cif_core as a looped category was a mistake that we must *not* perpetuate. We either correct it by dropping it from DDLm cif_core, which is impossible due to widespread DDL1 usage (as a 'Set' category), or we fix the semantics. So, in the case of space_group we can feel ourselves bound only by widespread current usage, not by the contradictory semantics of the DDL1 version.

I accept that the deprecation of SYMMETRY and SYMMETRY_EQUIV in favor of SPACE_GROUP and SPACE_GROUP_SYMOP was a mistake, but whatever fix we contemplate should adhere to our policy of keeping definitions stable, at least as well as we are able to make it do. Moreover, how to deal with SPACE_GROUP is a somewhat separate issue, because it involves definitions that already exist, as opposed to definitions that we may write in the future. It makes for a reasonable test case for our future direction, but it may be that a different solution is more suitable here than whatever we decide to do in the future, when we have no legacy definitions to deal with.

I'll have more to say about these particular cases, in a separate message.

> Regarding your space-group example below, I may have missed something in your proposal: you have added a key to space_group_symop pointing to space_group. Why have you not done this for all other loop categories that rely on the value of space_group, for example, 'atom_site', 'refln' etc.?

I added _space_group_symop.sg_id in the sense that the DDLm core had not previously defined it. It is not actually new, however, because mmCIF and symCIF *do* define it, and the DDL1 core defines an analogue in _space_group_symop_sg_id. Because the other versions of the core define it, the DDLm core requires it, too.

My intention was to capture the key structure defined in symCIF and the DDL1 core (but strangely missing from mmCIF, despite the item's description), and to use default key values to cover the usage in most existing CIFs. This was a vehicle for a detailed example of how I propose to use default key values, intended to be sufficient to make the DDLm core consistent with the other versions of the core and with symCIF with respect to the categories presented. It was never intended as a complete proposal for a dictionary revision.
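
The parent/child key structure described above can be sketched as a join between the two categories. This is a hedged illustration only, assuming rows are held as Python dicts and that both the parent key (written here as `_space_group.id`) and `_space_group_symop.sg_id` take the same assumed default "1" when omitted. The helper name `symops_for` is hypothetical.

```python
DEFAULT_KEY = "1"  # assumed default shared by the parent key and the child key


def symops_for(space_group_row, symop_rows, default=DEFAULT_KEY):
    """Select the symops whose child key matches the space group's key."""
    sg_id = space_group_row.get("_space_group.id", default)
    return [
        row
        for row in symop_rows
        if row.get("_space_group_symop.sg_id", default) == sg_id
    ]


# Typical existing file: neither key is present, so every symop matches.
legacy_sg = {"_space_group.name_Hall": "-P 1"}
legacy_symops = [
    {"_space_group_symop.operation_xyz": "x,y,z"},
    {"_space_group_symop.operation_xyz": "-x,-y,-z"},
]
matched = symops_for(legacy_sg, legacy_symops)
```

Under these assumptions a file that omits both keys behaves exactly as it does today, while a file that states them explicitly gets each symop matched to its own space group.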

> Note also that my example #1 from yesterday's email was a published program that would fail when presented with a datafile conforming to the definitions below (doesn't check space group loopiness, does loop over symops to get atomic positions), i.e. these changes can only be made after a way of protecting existing software from them is established.

Yes, but we have already agreed that we cannot prevent existing software from misinterpreting data files. Moreover, your example #1 can misinterpret data files that comply with the DDL1 core, too, so I'm not prepared to give much weight to the misinterpretation issue in this case.
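
The failure mode under discussion can be illustrated with a small sketch. It does not reproduce the published program referred to above; both function names are hypothetical, and the guard shown is just one possible way for software to decline input it cannot interpret.

```python
def naive_symops(symop_rows):
    """Mimic software that uses every symop row and ignores any sg_id column."""
    return [row["_space_group_symop.operation_xyz"] for row in symop_rows]


def checked_symops(space_group_rows, symop_rows):
    """Refuse to proceed when more than one space group is presented."""
    if len(space_group_rows) > 1:
        raise ValueError("multiple space groups present; sg_id must be honoured")
    return naive_symops(symop_rows)
```

Given a file with two space groups, `naive_symops` would silently mix the operations of both, whereas `checked_symops` raises an error instead of producing wrong atomic positions.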

John

_______________________________________________
ddlm-group mailing list
[email protected]
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
