Discussion List Archives


Re: [ddlm-group] Third and final proposal to enhance dREL

Dear All,

I understand John's comments to be based on the view that a dREL method for checking correct use of DDLm is essentially tautological, as the execution of such a method will necessarily involve an assumption of those same semantics.  I can't see how this is true. For example, it is quite possible for a DDLm definition in a domain dictionary to use a non-existent value of an enumerated type DDLm attribute (e.g. "_type.container   Vector"), and for software dealing with that dictionary not to notice, if it happens not to rely on that particular piece of machine-readable information.  For this reason, software that checks the correctness of domain dictionaries is useful. A dREL method is no more than a language-agnostic form of such checking software.
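To make the point concrete, here is a minimal Python sketch of the kind of checking software described above: verifying that a dictionary's uses of the DDLm attribute _type.container are drawn from its enumerated list. The allowed set below is illustrative and possibly incomplete, and the plain-dict representation of a dictionary is a stand-in for what a real CIF parser would provide; a dREL Validation method would express the same logic in a language-agnostic form.

```python
# Illustrative (possibly incomplete) subset of the enumerated values that
# the DDLm attribute dictionary allows for _type.container. A real tool
# would read this list from the DDLm attribute dictionary via a CIF parser.
ALLOWED_TYPE_CONTAINER = {"Single", "Multiple", "List", "Array", "Matrix", "Table"}

def check_type_container(definitions):
    """Yield (definition_id, bad_value) for each misuse of _type.container.

    `definitions` maps each definition id to a dict of its DDLm attributes;
    this is a hypothetical stand-in for a parsed domain dictionary.
    """
    for def_id, attrs in definitions.items():
        value = attrs.get("_type.container")
        if value is not None and value not in ALLOWED_TYPE_CONTAINER:
            yield def_id, value

# The erroneous definition from the example above slips past software that
# happens not to rely on _type.container, but an explicit check catches it:
defs = {
    "_demo.good": {"_type.container": "Single"},
    "_demo.bad": {"_type.container": "Vector"},
}
print(list(check_type_container(defs)))  # [('_demo.bad', 'Vector')]
```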

From some of John's comments I perceive that perhaps the scenario I have inadvertently put forward is where each and every domain dictionary definition would have a series of dREL checks explicitly associated with it. This is not my intention, indeed the intention was to avoid having to do this, by defining a way for dREL to access dictionary information. So there would be one dREL method for checking that enumerated values for DDLm attributes are those found in the DDLm attribute dictionary. This single dREL method would be able to be applied to all DDLm dictionaries.  There might be 20 such methods altogether for DDLm as it currently stands.

I suspect that a working demonstration is necessary to properly convey the benefits, so I might work on that if this email is not convincing.

I have added some comments inline as well, below.

On Thu, 20 Sep 2018 at 07:54, Bollinger, John C <John.Bollinger@stjude.org> wrote:
Dear DDLm group,

Please find my comments inline below.

On Monday, September 17, 2018 7:39 PM, James Hester wrote:
> On Tue, 18 Sep 2018 at 00:49, Bollinger, John C <John.Bollinger@stjude.org> wrote:
>> 1. I’m not sure I follow the intended purpose of the “enhance meaning of 'Validation' methods” item.  As I understand it, the proposal is to expose all the details of each item’s definition to dREL for the use of validation methods.  But the example of checking an item’s value against the allowed values of its enumerated type is something that I would expect a DDLm-based validator to do at its own initiative, without need of a dREL method being defined in the dictionary.  More generally, I consider it the role and responsibility of a DDLm-based validator to validate all the per-item and inter-item characteristics that the relevant dictionary defines via DDLm semantics.
> If a dictionary is viewed as a data file that provides (ontological) data conforming to the DDLm attributes, then a Validation dREL method applied to the dictionary fulfills the same function as a dREL method for validating that a data file contains data that are consistent with the domain dictionaries. Following your argument, dREL is not necessary in domain dictionaries either, because calculations are more properly the domain of 'dictionary-aware software'.

I am comfortable with considering domain dictionaries to be data files conforming to the ontology-domain formalisms defined by DDLm, but that does not imply that arguments about dictionaries and data files can be cleanly shifted between the ontological level and the domain level, or vice versa.  DDLm is distinguished from other dictionaries and data files because it provides the formal definition of its own semantics instead of relying for that on some other dictionary.  As a result, although we can use DDLm semantics to understand domain dictionaries, we cannot use them to understand DDLm itself, because that would be tautological.

Perhaps there has been a misunderstanding here: I am proposing checking domain dictionary usage of DDLm, and data file values against DDLm dictionary specifications, not necessarily usage of DDLm in the DDLm attribute dictionary itself, although that would be a side-effect. We should distinguish between semantics understandable by humans (the _definition.text items) and the behaviour (semantics) that we associate with attributes in computer programs. For this reason, I cannot see how machine-checking of DDLm dictionaries is tautological: we are simply describing relationships that we (humans) already understand from plain text in a form that computers can also understand. The analogy between dictionaries and data files does therefore carry across: for dictionaries and data files, we have to instruct software about the particular relationships between data items that we ourselves already understand from human-readable text, and we can use dREL methods for this purpose. Likewise, when checking the correctness of DDLm usage, software does not understand that e.g. values must be drawn from an enumerated list, or that the data name given in '_name.linked_item_id' should exist, and so a dREL method stating these relationships is providing new information not otherwise available to the software (i.e. not tautological).

And that bears directly on my point.  In order for DDLm and DDLm-based dictionaries to be useful at all, we need an entry point into ontological space, some prior and external comprehension of DDLm semantics.  That we can use such a comprehension to test the completeness and consistency of DDLm's self-expression is a bit of a sideshow: we do not need to do that because we have taken a comprehension of DDLm as granted.  Domain dictionaries do not need to use dREL to express specific cases of the semantics that follow, according to DDLm, from their definitions, because such methods can be understood in the first place only in a context in which they are redundant with the required external comprehension of DDLm.

There is human comprehension, and software comprehension.  A human may comprehend that '_type.purpose' cannot have value 'Vector', but software, a priori, does not.  Note that I am not positing that these dREL methods will be in domain dictionaries under specific definitions, but rather in a separate validation dictionary that is applicable to all domain dictionaries.

I have argued, furthermore, that such redundant dREL methods are not only unneeded but undesirable.  This is a more subjective consideration, and open to debate.  From my perspective, the inherent redundancy is an invitation to introduce inconsistencies with domain dictionaries, and any significant exercise of such redundancies furthermore carries unwanted costs in storage space and possibly processing resources.

The behaviour I propose would result in a 'DDLm validation' dictionary, and would be no more redundant than providing dREL methods for data files, where the relationships expressed in dREL are already stated in human-readable text.

I do still remain open to the possibility that dREL access to DDLm attributes of items' definitions could have some utility other than redundantly expressing DDLm semantics, but I have not yet seen or conceived any examples.

I also know of no other examples. 

>> 2. The proposed new functions seem also to be aimed at supporting validation of DDLm-based semantics via methods expressed in data dictionaries.  Here too, I am inclined to think that the method behaviors that these are intended to support are not appropriate for expression in data dictionaries.  It ought not to be necessary, and I’m not presently seeing how it would be advantageous.
> The use case I'm thinking of is that these 'validation' dREL methods would appear in dictionaries full of validation data names. A validator would then evaluate each of these data names in order to check that a domain dictionary is correctly written, in the same way as CheckCIF runs through a series of checks on a data file. By expressing the conditions for validity in dREL, the specification is not bound to a particular concrete programming language or set of CIF access libraries.

I'm hearing that my understanding of the purpose of the proposed methods is accurate.  I'm not persuaded that this is a good or useful purpose.

I return to my earlier point.  In order to understand a dictionary expressed in DDLm, you need to already understand DDLm semantics.  In particular, any implementations of the proposed functions would necessarily be based on such an understanding.  Since that has to already be in there somewhere, I do not see the appeal of using it to re-express itself in dREL, and especially not in the form of many specific cases instead of a small number of general ones.
And I return to my points: the context in which one of these proposed new dREL methods executes does not presuppose the DDLm semantics that it is checking. 
Expressing DDLm-derived constraints that way is not freeing.  That the dREL is not bound to a concrete programming language or CIF library is moot, because the dREL is redundant in the first place.  You need some tool that *is* bound to programming language and libraries to process it, and such a tool needs to be able to perform the same validations without the dREL.

No, a tool written in some concrete language does not necessarily have to perform the same validations.  It is entirely reasonable for software ingesting COMCIFS-approved dictionaries to assume that all key data names are, in fact, data names that are defined in the dictionary, or that the values taken by _type.container are from the enumerated list in the DDLm attribute dictionary.  Software may assume some relationships (keys, enumerated values) depending on how it is used, but by the same token it may ignore other errors or fail silently.

>> 3. Overall, I have previously understood “Validation” methods as being aimed at supporting item cross validations that cannot be expressed via DDLm attributes.  It is unclear to me why or in what circumstances it would be necessary or appropriate for such validations to depend on DDLm attributes. As far as I can see, the semantics of DDLm ought to be handled at a different level -- dictionary authors should not be responsible for providing for them.  In a strategic sense, not only do I not think we _need_ to provide for externalizing validation of DDLm semantics, I don’t think we _want_ to do that.  However, it is possible that there are good use cases that I have not considered, so I am prepared to be persuaded.
> I think your understanding of the current intention of 'Validation' methods is correct, because the single example of their use in current dictionaries is to check that cell parameters match the crystal system. However, as I wrote in the proposal, the same result can be achieved by defining a separate data name (e.g. '_valid.crystal_system') and using a normal 'Evaluation' method, so that use of 'Validation' appears a bit pointless.

Yes, I followed that observation in your proposal, and I agree that cross validations could be defined as you describe.  I do not take that as rendering 'Validation' methods pointless, however.  They serve at least two related purposes: to enable cross validations to be defined _without_ introducing synthetic data names, and to bind such validations directly to the items being validated.

If you want to argue that one or both of those is undesirable then let's do have that discussion.  Be aware that one of the points I already see myself raising concerns the propriety of defining items that are inappropriate for explicit use in data files.

These "synthetic" data names created purely for validation do have a reasonable justification for existing as separate concepts. First, CheckCIF comprises distinct checks, each with its own identifier and an attached explanation, so that it can be referred to by name; it is therefore perfectly normal and practically useful to equate a data item with a validation result. Second, some validation checks do not belong to any particular data name: the cell-parameter/crystal-system check does not obviously belong with one data name rather than the other, so if 'binding to a data name' is important the check would need to be duplicated, which would be wasteful. Having a dictionary of validation items avoids this, and allows information about errors to be included in the enumerated values for the validation result and in the descriptive text.
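As an illustration of the cell-parameter/crystal-system check discussed above, here is a Python stand-in for the evaluation that a hypothetical '_valid.crystal_system' item might perform (the name and function signature are mine, not from any dictionary). Only three crystal systems are covered in this sketch; the metric constraints shown are the standard crystallographic ones.

```python
import math

def cell_matches_system(system, a, b, c, alpha, beta, gamma, tol=1e-3):
    """Return True if the cell parameters satisfy the metric constraints
    of the stated crystal system. Hypothetical sketch: only three systems
    are checked; the rest pass unconditionally here.
    """
    def eq(x, y):
        return math.isclose(x, y, abs_tol=tol)

    right_angles = all(eq(ang, 90.0) for ang in (alpha, beta, gamma))
    if system == "cubic":
        return right_angles and eq(a, b) and eq(b, c)   # a = b = c, all angles 90
    if system == "tetragonal":
        return right_angles and eq(a, b)                # a = b, all angles 90
    if system == "orthorhombic":
        return right_angles                             # all angles 90
    return True  # remaining systems not covered by this sketch

print(cell_matches_system("cubic", 5.43, 5.43, 5.43, 90, 90, 90))  # True
print(cell_matches_system("cubic", 5.43, 5.43, 6.10, 90, 90, 90))  # False
```

Expressed this way, the check is a single named item whose result (and explanatory text) can live in a validation dictionary, rather than being duplicated under both the cell-parameter and crystal-system definitions.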

> Note that I am not proposing that domain dictionary authors would ever need to use these 'Validation' methods. I am instead proposing that these methods would have a niche use, e.g. in a dictionary listing a series of data names whose dREL methods validate the use of DDLm attributes. This niche use is similar to the way in which quite a few DDLm attributes and attribute values are only ever used in the DDLm attribute definition dictionary itself.  If the word 'Validation' is not appropriate, we can choose a word with less baggage, such as 'Technical'.  Whatever the name, having a list of checks that can be run over domain dictionaries, in a form that allows use in any environment supported by dREL, would be useful. My experiments with the Lark generator suggest that generating code from dREL is a lot easier than one might think.

I don't think I'm catching your vision there.  So riddle me this: what would prevent all these 'Technical' methods themselves being machine generated, whether in dREL or in some other form, from the dictionary to be validated?  (Or whose associated data files are to be validated?) And if such an external representation can be so generated, then why does it need to be externalized at all?

Well, the answer to the riddle is that these technical checks could not be machine generated unless we first tell "the machine" what those checks are. That is what I am proposing. Without a dREL (or other) recipe, "the machine" just has a bunch of attributes.  Perhaps you could explain why you think generation of these recipes is possible without human input that explains how to behave given some DDLm attributes?  And if human input is necessary, how are dREL methods anything other than a modular encapsulation of that input?

Also, if we're talking about validating domain dictionaries, not data files, then wouldn't the appropriate place for any dREL be the dictionaries' dictionary, i.e. DDLm itself?  And would not dREL appearing there _naturally_ have access to all the details you're proposing to expose via new functions?

Interesting question. It was in exploring the implications of that question that it became clear to me that an alternative context was required for dREL methods, as the dREL context as presented in the dREL paper populates data name values from a single data block, and only recognises data names from the dictionary in which the dREL method is located. Therefore, a dREL method located in the DDLm attribute dictionary would only see the values of the DDLm attributes in a single definition of the domain dictionary (treated as a data file in this context), and could only loop over the particular DDLm attribute category in which it was located. These are severe restrictions which make it impossible to cross-check consistency with other definitions.  So a better approach is to posit an alternative dREL context and to have a separate set of validation definitions that are not artificially linked to a particular DDLm attribute and can include information in their definitions for tools to use.

> Another driver for this is the 'CheckCIF for raw data' project. I would prefer that any checks for raw data are written in dREL, to maintain independence from a particular set of libraries or language.  I would also envisage eventually rewriting CheckCIF checks in dREL to put it on a more robust footing. However, these CheckCIF-type projects only really need the proposed 'Known' built-in function, so you may wish to comment on that separately.

Those checks that are inherent in the DDLm semantics of items' definitions are *already* expressed in a form that is independent of libraries or (programming) language: DDLm!  If there are any desired checks that are not inherent in DDLm semantics but nevertheless are based on attributes of items' definitions then I would be truly interested to hear about them.  As for other checks, I haven't yet recognized a reason to think that dREL is not sufficient as-is.

DDLm is not machine-actionable on its own. A programmer must read the definitions and write code that performs the checking.  So, while a human can deduce that the data name referenced by '_name.linked_item_id' should exist in the dictionary, and that the two sets of values taken by the linked data names are related in a certain way, there is no explicit algorithm in the dictionary that performs such a check. This is precisely the same as the information linking cell parameters and crystal system being clear to human readers but needing an explicit dREL method for checking, even though one could argue that the information is already there.
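The '_name.linked_item_id' case above can be made concrete with a short sketch. This is my own Python rendering of the existence check, not dictionary machinery: a parsed dictionary is represented as a plain dict mapping (lower-cased) defined data names to their attributes, and the example data names are illustrative.

```python
def check_linked_item_ids(definitions):
    """Yield (definition_id, target) where _name.linked_item_id refers to
    a data name not defined anywhere in the dictionary.

    `definitions` maps lower-cased defined data names to dicts of their
    DDLm attributes; a real tool would build this from a parsed dictionary.
    """
    for def_id, attrs in definitions.items():
        target = attrs.get("_name.linked_item_id")
        if target is not None and target.lower() not in definitions:
            yield def_id, target

defs = {
    "_atom_site.label": {},
    "_geom_bond.atom_site_label_1": {"_name.linked_item_id": "_atom_site.label"},
    "_geom_bond.atom_site_label_2": {"_name.linked_item_id": "_atom_site.lable"},  # typo
}
# Only the definition with the misspelled target is reported:
print(list(check_linked_item_ids(defs)))
```

A human reading the definitions spots the misspelling at once; software has no way to do so until the relationship is stated explicitly, whether in Python as here or, language-agnostically, in dREL.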

Now I do see that it might be desirable to be able to write dREL methods associated with particular items but not residing (directly) in those items' dictionaries.  I also see that it might be useful to be able to associate identifiers with dREL methods, especially if they are physically separated from the associated item(s).  I don't think I like defining synthetic data items for this purpose, but we should be able to come up with an alternative if this is something worth pursuing.

I would be interested to hear of any alternatives that bring us to the same point: environment-agnostic validation of domain dictionary conformance to DDLm and data file conformance to domain dictionary. What I have proposed is an additional dREL execution environment and some additional built-in functions, following which we can leverage off existing standards and tools.

all the best,

John C. Bollinger, Ph.D.
Computing and X-Ray Scientist
Department of Structural Biology
St. Jude Children's Research Hospital
(901) 595-3166 [office]


ddlm-group mailing list
