Re: [ddlm-group] Third and final proposal to enhance dREL

Dear All,

It seems that I, too, have given an incorrect impression.  I did not intend to suggest that dREL methods could not or should not be used to validate domain dictionaries.  I was attempting to make a narrower meta-argument: that arguments and conclusions about data files and their relationships with their domain dictionaries do not necessarily apply to domain dictionaries relative to DDLm, or vice versa, because DDLm has characteristics that distinguish it from (other) domain dictionaries.

In particular, it is tautological to validate the DDLm dictionary itself. Although doing so can demonstrate consistency between a validator implementation and an electronic representation of DDLm, it cannot prove that a validated object correctly expresses a particular human understanding of DDLm.  However, given a machine-actionable description of DDLm that we accept, on some other basis, as being correct, it is certainly meaningful to validate other dictionaries expressed in DDLm via a validation program that is demonstrably consistent with the accepted representation of DDLm.  Furthermore, that can give us reason to accept such validated dictionaries for validating, in turn, data files conforming to those dictionaries.


> From some of John's comments I perceive that perhaps the scenario I have inadvertently put forward is where each and every domain dictionary definition would have a series of dREL checks explicitly associated with it. This is not my intention, indeed the intention was to avoid having to do this, by defining a way for dREL to access dictionary information. So there would be one dREL method for checking that enumerated values for DDLm attributes are those found in the DDLm attribute dictionary. This single dREL method would be able to be applied to all DDLm dictionaries.  There might be 20 such methods altogether for DDLm as it currently stands.

I'm a little confused because your comments seem to go back and forth between validating dictionaries and validating data files (conforming to dictionaries other than DDLm itself).  The requirements and implications of these two related activities are different.

If the idea were to validate dictionaries conforming to DDLm, and you wanted to provide dREL serving that purpose, then the appropriate thing to do would be to use existing DDLm facilities to define validation methods in the DDLm dictionary.  I don't think that would require any new dREL features.  Since you've instead developed a rather involved proposal for multiple new dREL features, I deduce that the idea is primarily to validate other data files.  In that case, you should consider that

 - dREL methods' data access is scoped to the dictionary in which the method is defined

 - dREL methods do not have a mechanism for accepting parameters (this seems partially taken into account)

 - It would be organizationally cleaner to associate items' validation methods with the items themselves, as indeed DDLm already provides for.

 - dREL relies on the host dictionary for data typing; it therefore seems unlikely that a dREL-based approach can be applied to validating the DDLm dictionary itself (not that I think that's a very useful thing to do)

Furthermore, a robust DDLm implementation sufficient to support a dREL engine necessarily implements, natively, at least most of the features needed to validate data files against the semantics of the DDLm definitions of the items within.  This is the source of my skepticism in general about the value of dREL methods re-expressing constraints already specified by DDLm itself, even if only in text form.
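
As a rough illustration of that point, here is a minimal Python sketch (not dREL) of an enumeration check driven directly by the attribute definitions, with no validation method involved. The in-memory layout, the function name, and the lists of states are assumptions made for the example; the states shown are an illustrative subset, not the authoritative DDLm enumerations.

```python
# Hypothetical in-memory form of a few DDLm attribute definitions.
# The enumerated states are an illustrative subset, not the authoritative lists.
DDLM_ATTRIBUTES = {
    "_type.purpose": {
        "enumeration": {"Describe", "Encode", "State", "Key", "Link",
                        "Number", "Measurand"},
    },
    "_type.container": {
        "enumeration": {"Single", "Multiple", "List", "Array", "Matrix", "Table"},
    },
    "_definition.id": {"enumeration": None},  # free text, nothing to enumerate
}


def check_enumerations(definition, attributes=DDLM_ATTRIBUTES):
    """Return (attribute, value) pairs whose value is not an enumerated state."""
    problems = []
    for name, value in definition.items():
        spec = attributes.get(name)
        if spec is None or spec["enumeration"] is None:
            continue  # unknown or unenumerated attribute: out of scope here
        if value not in spec["enumeration"]:
            problems.append((name, value))
    return problems


# Two made-up domain-dictionary definitions, treated as plain attribute maps.
good = {"_definition.id": "_cell.length_a",
        "_type.purpose": "Measurand",
        "_type.container": "Single"}
bad = {"_definition.id": "_cell.vector_a",
       "_type.purpose": "Vector",
       "_type.container": "Single"}

print(check_enumerations(good))
print(check_enumerations(bad))
```

The single generic function applies to every enumerated attribute, which is the sort of check a DDLm-aware program would be expected to carry natively.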


> I am thinking that a proper demonstration is probably necessary to properly explain the benefits, so I might work on that if this email is not convincing.

I suspect that we have some philosophical differences here, but I am open to considering a demonstration.


> Perhaps there has been a misunderstanding here - I am proposing checking domain dictionary usage of DDLm, and data file values against DDLm dictionary specifications, not necessarily usage of DDLm in the DDLm attribute dictionary itself, although that would be a side-effect.  We should distinguish between semantics understandable by humans (the _definition.text items) and the behaviour (semantics) that we associate with attributes in computer programs. For this reason, I cannot see how machine-checking of DDLm dictionaries is tautological - we are simply describing relationships that we (humans) already understand from plain text in a form that computers can also understand.

I appreciate the distinction between human-readable specification and executable instructions.  I also appreciate the value of validating both domain dictionaries and data files against their respective dictionaries (except DDLm against itself).  What I do not presently appreciate is the value of expressing machine-readable validation instructions in a form that can be consumed only by a program that does not (should not) need to rely on them.

>> And that bears directly on my point.  In order for DDLm and DDLm-based dictionaries to be useful at all, we need an entry point into ontological space, some prior and external comprehension of DDLm semantics.  That we can use such a comprehension to test the completeness and consistency of DDLm's self-expression is a bit of a sideshow: we do not need to do that because we have taken a comprehension of DDLm as granted.  Domain dictionaries do not need to use dREL to express specific cases of the semantics that follow, according to DDLm, from their definitions, because such methods can be understood in the first place only in a context in which they are redundant with the required external comprehension of DDLm.

> There is human comprehension, and software comprehension.  A human may comprehend that '_type.purpose' cannot have value 'Vector', but software, a priori, does not.

I was a bit too indirect, I think.  The particular "prior and external comprehension of DDLm semantics" that we need has to manifest in software.  That is, methods can be understood _by software_ only if that software already understands DDLm.  In practice, of course, the preparation of such software is driven by human comprehension, but that's beside the point.


>  Note that I am not positing that these dREL methods will be in domain dictionaries under specific definitions, but rather in a separate validation dictionary that is applicable to all domain dictionaries.

How putting validation methods in a separate dictionary could be made to work is a separate issue altogether.  It seems to require changes to dREL semantics that I do not recognize the proposals as addressing.


>> I have argued, furthermore, that such redundant dREL methods are not only unneeded but undesirable.  This is a more subjective consideration, and open to debate.  From my perspective, the inherent redundancy is an invitation to introduce inconsistencies with domain dictionaries, and any significant exercise of such redundancies furthermore carries unwanted costs in storage space and possibly processing resources.

> The behaviour I propose would result in a 'DDLm validation' dictionary and would be no more redundant than providing dREL methods for data files, where those dREL relationships are already expressed in human-readable text.

The redundancy I'm talking about is not between human-readable text and machine-actionable instructions.  It is between the machine-actionable instructions expressed in dREL and other machine-actionable instructions that must already be present in a program that would be in a position to use the dREL.  dREL helps programs understand relationships between items in data files, but it is not needed to help programs understand relationships between items in dictionaries if they have enough support to use dREL in the first place.


> The use case I'm thinking of is that these 'validation' dREL methods would appear in dictionaries full of validation data names. A validator would then evaluate each of these data names in order to check that a domain dictionary is correctly written, in the same way as CheckCIF runs through a series of checks on a data file.

There are things to like about that idea.  I like its modularity, for example, and dREL is the most natural choice of language possible for such a task.

Again, however, if we were talking about validating _dictionaries_ then I think all we would need to do would be to add validation methods directly to the DDLm dictionary, using the existing provisions for that purpose.  For validating data files, on the other hand, we would need to incorporate the validation data names into the dictionaries to which the data files conform, else all the data the methods wanted to access would be outside their scope.  Or we would need additional changes to dREL.

Furthermore, most of the proposal seems aimed toward generic methods that could be reused for validating multiple different items, but it's unclear to me how that would work.  That is, what are "the particular value and loop being validated" if they are not the item on which the method is defined (a special validation data name) and its category?  More seems needed here before this could work.

And of course there's my main thesis, that in order to use dREL, you have to already have an underlying DDLm implementation that could support the wanted validations more directly.

> By expressing the conditions for validity in dREL, the specification is not bound to a particular concrete programming language or set of CIF access libraries.

I still don't see dREL re-expression of DDLm semantics as providing any advantage.


> And I return to my points: the context in which one of these proposed new dREL methods executes does not presuppose the DDLm semantics that it is checking.

Evidently my estimate of how much of DDLm semantics needs to be implemented in support of a complete dREL engine implementation is considerably larger than yours.  I'm uncertain how to resolve that.

>> Expressing DDLm-derived constraints that way is not freeing.  That the dREL is not bound to a concrete programming language or CIF library is moot, because the dREL is redundant in the first place.  You need some tool that *is* bound to programming language and libraries to process it, and such a tool needs to be able to perform the same validations without the dREL.

> No, a tool written in some concrete language does not necessarily have to perform the same validations.  It is entirely reasonable for software ingesting COMCIFS-approved dictionaries to assume that all key data names are, in fact, data names that are defined in the dictionary, or that the values taken by _type.container are from the enumerated list in the DDLm attribute dictionary.  Software may assume some relationships (keys, enumerated values) depending on how it is used, but by the same token it may ignore other errors or fail silently.

Of course you're right.  As far as I can tell, most software that consumes CIF performs as little validation as possible, and that ad hoc.

But we're not talking about most CIF software.  We're talking about software that actually cares to validate comprehensively, that either is packaged with a dictionary containing validation methods or has the wherewithal to go out and get one, and that has access to a dREL engine with which to execute methods.  THAT software, I assert, already has most, if not all, the tools and knowledge it needs to validate DDLm semantics without reference to any dREL methods.  Those tools and that knowledge are needed for the implementation of dREL itself.


> These "synthetic" data names that are created purely for validation do have a reasonable justification for existing as separate concepts. First of all, CheckCIF has distinct checks, that have their own little name, and have an explanation attached, and can be referred to by name, so it is perfectly normal and practically useful to equate a data item with a validation result.

OK, I accept that it could be useful to be able to identify a validation check and its result by name.

> Secondly, there are validation checks that do not belong to particular data names: the cell parameters - crystal system check does not obviously belong with one or the other data name, so if 'binding to a data name' is important then it would need to be duplicated, which would be wasteful. Having a dictionary of validation items avoids this, and allows information about errors to be included in enumerated values for the validation result and in the descriptive text.

This now makes an abrupt generalization from validating DDLm semantics to validating domain semantics.  Whereas I agree that such validations as you now describe do not rest well as associated with one specific item, we should consider our options.  For example, I'd be inclined to look into whether we could support methods on categories, and maybe on dictionaries as a whole.
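
One way to picture methods on categories is the following minimal Python sketch (not dREL). The registry, the decorator, and the check identifier are all hypothetical, and the check itself is deliberately simplified: it only tests that a cell declared cubic has equal lengths and 90 degree angles. Note that it reads an item from another category, which is part of why dictionary-level methods might also be wanted.

```python
# Hypothetical registry mapping a category name to its named checks.
CATEGORY_CHECKS = {}


def category_check(category, check_id):
    """Register a named check against a whole category rather than one item."""
    def register(func):
        CATEGORY_CHECKS.setdefault(category, {})[check_id] = func
        return func
    return register


@category_check("cell", "CUBIC_CELL_CONSISTENT")
def cubic_cell_consistent(block, tol=1e-3):
    """For a cell declared cubic, require a = b = c and all angles 90 degrees."""
    if block.get("_space_group.crystal_system") != "cubic":
        return True  # this sketch only covers the cubic case
    lengths = [block[f"_cell.length_{axis}"] for axis in "abc"]
    angles = [block[f"_cell.angle_{name}"] for name in ("alpha", "beta", "gamma")]
    return (max(lengths) - min(lengths) <= tol
            and all(abs(angle - 90.0) <= tol for angle in angles))


def run_checks(category, block):
    """Run every check registered for a category; results are keyed by check name."""
    checks = CATEGORY_CHECKS.get(category, {})
    return {check_id: func(block) for check_id, func in checks.items()}


block = {
    "_space_group.crystal_system": "cubic",
    "_cell.length_a": 5.0, "_cell.length_b": 5.0, "_cell.length_c": 5.0,
    "_cell.angle_alpha": 90.0, "_cell.angle_beta": 90.0, "_cell.angle_gamma": 90.0,
}
print(run_checks("cell", block))
```

Each check keeps its own name and result, so it can still be referred to individually without being bound to a single data name or duplicated across several.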

>>> Note that I am not proposing that domain dictionary authors would ever need to use these 'Validation' methods. I am instead proposing that these methods would have a niche use, e.g. in a dictionary listing a series of datanames whose dREL methods validate the use of DDLm attributes. This niche use is similar to the way in which quite a few DDLm attributes and attribute values are only ever used in the DDLm attribute definition dictionary itself.  If the word 'Validation' is not appropriate, we can choose a word with less baggage, such as 'Technical'.  Whatever the name, having a list of checks that can be run over domain dictionaries in a form that allows use in any environment supported by dREL would be useful. My experiments with the Lark generator suggest to me that generating code from dREL is a lot easier than one might think.

>> I don't think I'm catching your vision there.  So riddle me this: what would prevent all these 'Technical' methods themselves being machine generated, whether in dREL or in some other form, from the dictionary to be validated?  (Or whose associated data files are to be validated?) And if such an external representation can be so generated, then why does it need to be externalized at all?

> Well, the answer to the riddle is that these technical checks could not be machine generated unless we first tell "the machine" what those checks are. That is what I am proposing. Without a dREL (or other) recipe, "the machine" just has a bunch of attributes.  Perhaps you could explain why you think generation of these recipes is possible without human input that explains how to behave given some DDLm attributes?  And if human input is necessary, how dREL methods are not just encapsulating that input in a modular way?

I thought I was referring here to validation methods that express the DDLm semantics of domain items' definitions.  Generation of such methods is entirely possible without direct human input because we can, indeed must, program DDLm semantics, and we have a DDLm domain dictionary (for dREL to be applicable in the first place).

We can also automatically write a validation method for any item that has an evaluation method, at least in principle, as all it needs to do is check whether the result of the evaluation matches the actual value.
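
A minimal Python sketch of that idea follows (not dREL). The helper name, the tolerance, and the stand-in evaluation function are assumptions for illustration; the stand-in computes the volume of an orthogonal cell only, in place of whatever a real evaluation method would do.

```python
import math


def make_validator(evaluate, rel_tol=1e-6):
    """Derive a validation check from an evaluation function.

    `evaluate` takes a data block (a dict of data name -> value) and returns
    the value the item should have according to its evaluation method.
    """
    def validate(block, name):
        if name not in block:
            return True  # no recorded value, so nothing to contradict
        return math.isclose(block[name], evaluate(block), rel_tol=rel_tol)
    return validate


# Stand-in for an evaluation method: volume of an orthogonal cell only.
def cell_volume(block):
    return (block["_cell.length_a"]
            * block["_cell.length_b"]
            * block["_cell.length_c"])


check_volume = make_validator(cell_volume)

block = {"_cell.length_a": 2.0, "_cell.length_b": 3.0,
         "_cell.length_c": 4.0, "_cell.volume": 24.0}
print(check_volume(block, "_cell.volume"))
```

Nothing in `make_validator` is specific to the item being checked, which is what makes generating such checks mechanically plausible.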


>> Also, if we're talking about validating domain dictionaries, not data files, then wouldn't the appropriate place for any dREL be the dictionaries' dictionary, i.e. DDLm itself?  And would not dREL appearing there _naturally_ have access to all the details you're proposing to expose via new functions?

> Interesting question. It was in exploring the implications of that question that it became clear to me that an alternative context was required for dREL methods, as the dREL context as presented in the dREL paper populates data name values from a single data block, and only recognises data names from the dictionary in which the dREL method is located. Therefore, a dREL method located in the DDLm attribute dictionary would only see the values of the DDLm attributes in a single definition of the domain dictionary (treated as a data file in this context), and could only loop over the particular DDLm attribute category in which it was located. These are severe restrictions which make it impossible to cross-check consistency with other definitions.  So a better approach is to post an alternative dREL context and to have a separate set of validation definitions that are not artificially linked to a particular DDLm attribute and can include information in their definitions for tools to use.

It's not clear to me how the proposals on the table get around those limitations.

> Another driver for this is the 'CheckCIF for raw data' project. I would prefer that any checks for raw data are written in dREL, to maintain independence from a particular set of libraries or language.  I would also envisage eventually rewriting CheckCIF checks in dREL to put it on a more robust footing. However, these CheckCIF-type projects only really need the proposed 'Known' built-in function, so you may wish to comment on that separately.

It's hard for me to evaluate the proposal without a clear understanding of the objectives.  It is especially difficult for me to consider alternatives.  I think you have persuaded me that we should do _something_, but I am far from convinced that the specifics you have proposed are it.


>> Now I do see that it might be desirable to be able to write dREL methods associated with particular items but not residing (directly) in those items' dictionaries.  I also see that it might be useful to be able to associate identifiers with dREL methods, especially if they are physically separated from the associated item(s).  I don't think I like defining synthetic data items for this purpose, but we should be able to come up with an alternative if this is something worth pursuing.

> I would be interested to hear of any alternatives that bring us to the same point: environment-agnostic validation of domain dictionary conformance to DDLm and data file conformance to domain dictionary. What I have proposed is an additional dREL execution environment and some additional built-in functions, following which we can leverage off existing standards and tools.

The more I think about it, the more I'm inclined to think that we need to consider a wholesale (backwards compatible) update of the dREL specification, and that we need to reconsider how dREL is embedded in dictionaries.  I have some ideas about that, but it's late, and this e-mail is already far too long.  I'm still not much swayed by the idea of expressing DDLm semantics in dREL form for validation (or other) purposes, for I continue to consider DDLm more fundamental than dREL, but if a mechanism for that happens to come out of such an effort then I've no reason to oppose actually writing the methods.


John C. Bollinger, Ph.D.
Computing and X-Ray Scientist
Department of Structural Biology
St. Jude Children's Research Hospital
(901) 595-3166 [office]


Email Disclaimer: www.stjude.org/emaildisclaimer
Consultation Disclaimer: www.stjude.org/consultationdisclaimer
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group
