
Re: [ddlm-group] Third and final proposal to enhance dREL

To: ddlm-group <[email protected]>
Subject: Re: [ddlm-group] Third and final proposal to enhance dREL
From: James Hester <[email protected]>
Date: Tue, 18 Sep 2018 10:39:17 +1000

On Tue, 18 Sep 2018 at 00:49, Bollinger, John C <[email protected]> wrote:

Dear James and DDLm group,

First, I remark that whereas I received an e-mail “Proposal to update dREL, part I” and obviously the “Third and final proposal to enhance dREL”, I do not recall, nor do I find any record of, a second or part II proposal, which it appears would have contained items numbered 3 and 4.

Oops. I have now sent out the second part.

With regard to this part III proposal, however:

1. I’m not sure I follow the intended purpose of the “enhance meaning of 'Validation' methods” item. As I understand it, the proposal is to expose all the details of each item’s definition to dREL for the use of validation methods. But the example of checking an item’s value against the allowed values of its enumerated type is something that I would expect a DDLm-based validator to do at its own initiative, without need of a dREL method being defined in the dictionary. More generally, I consider it the role and responsibility of a DDLm-based validator to validate all the per-item and inter-item characteristics that the relevant dictionary defines via DDLm semantics.

If a dictionary is viewed as a data file that provides (ontological) data conforming to the DDLm attributes, then a Validation dREL method applied to the dictionary fulfills the same function as a dREL method for validating that a data file contains data that are consistent with the domain dictionaries. Following your argument, dREL is not necessary in domain dictionaries either, because calculations are more properly the domain of 'dictionary-aware software'.

2. The proposed new functions seem also to be aimed at supporting validation of DDLm-based semantics via methods expressed in data dictionaries. Here too, I am inclined to think that the method behaviors that these are intended to support are not appropriate for expression in data dictionaries. It ought not to be necessary, and I’m not presently seeing how it would be advantageous.

The use case I'm thinking of is that these 'validation' dREL methods would appear in dictionaries full of validation data names. A validator would then evaluate each of these data names in order to check that a domain dictionary is correctly written, in the same way as CheckCIF runs through a series of checks on a data file. By expressing the conditions for validity in dREL, the specification is not bound to a particular concrete programming language or set of CIF access libraries.

3. Overall, I have previously understood “Validation” methods as being aimed at supporting item cross validations that cannot be expressed via DDLm attributes. It is unclear to me why or in what circumstances it would be necessary or appropriate for such validations to depend on DDLm attributes. As far as I can see, the semantics of DDLm ought to be handled at a different level -- dictionary authors should not be responsible for providing for them. In a strategic sense, not only do I not think we _need_ to provide for externalizing validation of DDLm semantics, I don’t think we _want_ to do that. However, it is possible that there are good use cases that I have not considered, so I am prepared to be persuaded.

I think your understanding of the current intention of 'Validation' methods is correct, because the single example of their use in current dictionaries is to check that cell parameters match the crystal system. However, as I wrote in the proposal, the same result can be achieved by defining a separate data name (e.g. '_valid.crystal_system') and using a normal 'Evaluation' method, so that use of 'Validation' appears a bit pointless.

Note that I am not proposing that domain dictionary authors would ever need to use these 'Validation' methods. I am instead proposing that these methods would have a niche use, e.g. in a dictionary listing a series of datanames whose dREL methods validate the use of DDLm attributes. This niche use is similar to the way in which quite a few DDLm attributes and attribute values are only ever used in the DDLm attribute definition dictionary itself. If the word 'Validation' is not appropriate, we can choose a word with less baggage, such as 'Technical'. Whatever the name, having a list of checks that can be run over domain dictionaries in a form that allows use in any environment supported by dREL would be useful. My experiments with the Lark generator suggest to me that generating code from dREL is a lot easier than one might think.

Another driver for this is the 'CheckCIF for raw data' project. I would prefer that any checks for raw data are written in dREL, to maintain independence from a particular set of libraries or language. I would also envisage eventually rewriting CheckCIF checks in dREL to put it on a more robust footing. However, these CheckCIF-type projects only really need the proposed 'Known' built-in function, so you may wish to comment on that separately.

James.

From: ddlm-group [mailto:[email protected]] On Behalf Of James Hester
Sent: Monday, September 10, 2018 1:30 AM
To: ddlm-group <[email protected]>
Subject: [ddlm-group] Third and final proposal to enhance dREL

Dear DDLm group,

Please see below my final proposal for enhancements to dREL. This one is somewhat more substantial, but concerns only built-in functions and an altered execution context for an unused aspect of dREL/DDLm. I urge you to try and pick holes or find better ways to encode validation in dREL, as these ideas are being presented in public for the first time. I will leave this on the table for a while, meanwhile implementing these built-in functions and trying them out with some Validation methods. Given a positive outcome of any discussion here and at Github, I would plan to fold these changes into a '1.0' version of DDLm for publication on our website and inclusion in the next edition of Volume G.

This proposal may also be read as more nicely formatted text at https://github.com/COMCIFS/dREL/blob/master/drel_changes_3.rst, and comments in the 'Issues' tab there are welcome.

James.

====

Proposed changes to dREL, part III
==================================

Introduction
------------

dREL is a machine-actionable language describing data relationships
and designed to be embedded in DDLm dictionaries. The language is
defined both explicitly in the dREL publication [1] and implicitly by
the dREL code appearing in the DDLm core CIF dictionary. Note that
the code in the core CIF dictionary significantly expands the language
presented in the paper, for example, by adding category methods.

The present proposals concern the built-in functions that are
supported by dREL. No syntax changes or enhancements are proposed.

Proposal 5: enhance meaning of 'Validation' methods
---------------------------------------------------

It is proposed that a ``_method.purpose`` of ``Validation`` will imply
that all DDLm attribute categories may appear as variables within the
associated dREL method, and that the values of these attributes are
those for the data name being validated. Additional predefined
variables ``value`` and ``category`` are bound to the particular value
and loop being validated.

Explanation
~~~~~~~~~~~

DDLm requires that each method have an associated ``_method.purpose``.
The current DDLm attributes dictionary defines ``Evaluation``, ``Validation``
and ``Definition``. The ``Validation`` purpose is given as

    method compares an evaluation with existing item value

This type of method appears only once in the DDLm core dictionary, to
show how the crystal system can be checked against the cell parameters.
This could equally well be performed by creating a data name whose
``Evaluation`` dREL method returns ``True`` if the conditions are met.

The present proposal therefore suggests repurposing Validation methods for
more general validation by providing them with access to all attributes
of the definition of any data name that they are used to check. This allows
the methods to confirm, for example, that a value for a data name matches
the list of allowed enumerated values.

Note that currently, all categories from the dictionary in which a
dREL method appears can appear as pre-defined variables in that dREL
method, with values obtained from an associated data block. This proposal
enhances that list with the attributes for the definition of the data name
being checked.

The ``category`` and ``value`` pre-bound variables are required in order
to represent the generic value(s) being checked.

Example
~~~~~~~

The following code finds enumerated values that are not allowed. It
would appear in the definition for a data name
``valid.bad_enumerated_value``. The ``enumeration_set`` variable
contains the contents of the ``enumeration_set`` category in the
definition for a given data name, and ``value`` is bound by the
execution environment to the particular value being checked. The
execution environment is responsible for collating values of this
data name for each data name in the data block being checked. ::

    # Check whether a value is missing from the enumeration
    bad = 'True'
    # Loop over enumerated states in the definition for this
    # data name
    loop e as enumeration_set {
        if (value == e.state) bad = 'False';
    }
    valid.bad_enumerated_value = bad
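
For readers more familiar with Python, the check described above (flagging a value that is absent from the enumeration) could be sketched roughly as follows. This is only an illustration and not part of the proposal: the function name, the list-of-dicts layout for ``enumeration_set`` and the sample states are all assumptions made for the sketch.

```python
# Hypothetical stand-in for the execution environment: in the proposal,
# `value` and `enumeration_set` would be pre-bound; here they are passed in.
def bad_enumerated_value(value, enumeration_set):
    """Return True if `value` is not one of the enumerated states."""
    for e in enumeration_set:          # mirrors: loop e as enumeration_set
        if value == e["state"]:
            return False               # the value is an allowed state
    return True                        # no enumerated state matched


# Invented sample data for illustration only.
states = [{"state": "cubic"}, {"state": "hexagonal"}]
print(bad_enumerated_value("cubic", states))
print(bad_enumerated_value("cubik", states))
```

A validator would call such a function once per value of each data name whose definition carries an enumeration.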

Proposal 5: Extra validation functions
--------------------------------------

It is proposed to add the following functions to the list of those
allowed in dREL methods:

Reference(name,attribute)
    The value of ``attribute`` in the dictionary definition of ``name``
    is returned. Both ``attribute`` and ``name`` are either string
    literals or string-valued variables. Where the result would be a
    loop, an appropriate dREL category object is returned.

Instance(category)
    Returns an instance of category ``category`` in the data block
    provided to the dREL method.

PacketData(container,object)
    Returns specific data corresponding to ``object`` in ``container``.
    The functional equivalent of the syntax ``cat.obj``,
    where ``cat`` is the value of ``container`` and ``obj`` is the value of
    ``object``. If ``container`` is a category, the row must be
    unambiguous from context, if necessary using the resolution rules
    of the proposals in Part II.

Lookup(category,keys)
    The functional equivalent of ``cat[k1=val1,k2=val2,...]``
    where ``cat`` is the value of ``category`` and ``keys`` is the dictionary
    ``{'k1':val1,'k2':val2,...}``.

Known(name)
    Evaluates to true if a value for the object referenced by
    ``name`` can be found, false otherwise. If ``name`` does not resolve
    to a ``category.object`` reference, or the particular row of a
    multi-row category is unknown, ``Known`` returns false.
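
To make the intended behaviour of these five functions more concrete, here is a minimal Python model of them. Everything in it is an assumption for illustration: the ``DICTIONARY`` and ``DATA_BLOCK`` structures, the lower-case function names and the sample data names are invented, and the row-resolution rules of Part II are not modelled (``packet_data`` takes a single row, and ``known`` treats any multi-row category as having an unknown row).

```python
# Toy model (hypothetical): a dictionary maps data names to DDLm attributes;
# a data block maps category names to lists of rows.
DICTIONARY = {
    "atom_site.type_symbol": {
        "_name.category_id": "atom_site",
        "_name.object_id": "type_symbol",
        "_name.linked_item_id": "atom_type.symbol",
    },
    "atom_type.symbol": {
        "_name.category_id": "atom_type",
        "_name.object_id": "symbol",
    },
}
DATA_BLOCK = {
    "atom_type": [{"symbol": "C"}, {"symbol": "O"}],
    "atom_site": [{"label": "C1", "type_symbol": "C"}],
}


def reference(name, attribute):
    """Value of `attribute` in the dictionary definition of `name`."""
    return DICTIONARY[name][attribute]


def instance(category):
    """All rows of `category` found in the data block."""
    return DATA_BLOCK[category]


def packet_data(row, obj):
    """Equivalent of `row.obj` for one already-selected row."""
    return row[obj]


def lookup(category, keys):
    """Equivalent of cat[k1=val1,...]: the single row matching all keys."""
    matches = [r for r in instance(category)
               if all(r.get(k) == v for k, v in keys.items())]
    if len(matches) != 1:
        raise KeyError(keys)
    return matches[0]


def known(name):
    """True only if `name` is category.object and one row supplies a value."""
    category, _, obj = name.partition(".")
    if not obj:
        return False                   # not a category.object reference
    rows = DATA_BLOCK.get(category, [])
    return len(rows) == 1 and obj in rows[0]


print(reference("atom_site.type_symbol", "_name.linked_item_id"))
print(packet_data(lookup("atom_type", {"symbol": "O"}), "symbol"))
print(known("atom_site.label"), known("atom_type.symbol"))
```

A real implementation would resolve the current row from the surrounding ``Loop`` or ``With`` context rather than refusing multi-row categories outright.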

Explanation
~~~~~~~~~~~

A dREL method for checking conformance to requirements arising out of
DDLm attributes (for example, that a value is drawn from a list of values
of a different data name) cannot have 'hard-coded' ``<category>.<object>``
names, as the method would no longer be applicable to all data names.
The above functions are therefore required to provide access into categories
and data in a generalised way.

Examples
~~~~~~~~

``Reference('atom_type.symbol','enumeration_set')``
    Return the contents of the ``enumeration_set`` loop in the definition
    of ``atom_type.symbol``.

``Loop i as Instance(Reference(name.linked_item_id,'_name.category_id'))``
    Loop over all rows of the category containing the data name held in
    variable ``name.linked_item_id``. Note that ``name.linked_item_id`` is
    not enclosed in quotes and is therefore evaluated to the value given in
    the definition of the data name being validated. The ``Reference``
    function returns a string naming the category of the linked data name,
    and the ``Instance`` function takes that string and returns a category
    object that is populated with the values in the data file.

Finding values that are not child values. ::

    # Find values that are not those of the linked data name.
    result = 'False'
    linked_object = Reference(name.linked_item_id,'_name.object_id')
    loop i as Instance(Reference(name.linked_item_id,'_name.category_id')) {
        if (PacketData(i,linked_object) == value) result = 'True'
    }
    valid.is_child_key = result
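
The same linked-item check can be sketched in Python to show which lookup each proposed function stands for. The function signature, the plain-dict representation of the dictionary and data block, and the sample data are all assumptions for this sketch, not anything defined by dREL or DDLm.

```python
def is_child_key(value, definition, dictionary, data_block):
    """Return True if `value` occurs among the values of the linked item."""
    linked_name = definition["_name.linked_item_id"]
    linked_def = dictionary[linked_name]
    # Reference(name.linked_item_id, '_name.object_id')
    linked_object = linked_def["_name.object_id"]
    # Instance(Reference(name.linked_item_id, '_name.category_id'))
    parent_rows = data_block[linked_def["_name.category_id"]]
    # PacketData(i, linked_object) == value, for any row i
    return any(row[linked_object] == value for row in parent_rows)


# Invented sample dictionary and data block.
dictionary = {
    "atom_type.symbol": {
        "_name.category_id": "atom_type",
        "_name.object_id": "symbol",
    },
}
site_symbol_def = {"_name.linked_item_id": "atom_type.symbol"}
data_block = {"atom_type": [{"symbol": "C"}, {"symbol": "O"}]}

print(is_child_key("C", site_symbol_def, dictionary, data_block))
print(is_child_key("N", site_symbol_def, dictionary, data_block))
```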

Finding and returning repeated values of a key data name as the
value of data name ``valid.not_unique``. Note the use of variable
``category`` to refer to the loop being checked. ::

    # find key values that are not unique.
    not_unique = []
    # Accumulate the object ids of the key data names
    keylist = []
    # Accumulate the key values seen so far
    seen = []
    # get the object id for each key data name
    Loop k as category_key {
        keylist ++= Reference(k.name,'_name.object_id') #Append
    }
    Loop c as category {
        new_val = []
        for ko in keylist {
            new_val ++= PacketData(c,ko) #Append
        }
        if (new_val in seen) {
            not_unique ++= new_val
        }
        else {
            seen ++= new_val
        }
    }
    valid.not_unique = not_unique
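
As a cross-check of the intended logic, the uniqueness test can be written in Python as below. The function name, the list-of-dicts representation of the looped category and the sample rows are assumptions. One deliberate difference is stated in the code: each repeated key is reported once, however many times it recurs.

```python
def not_unique(rows, key_objects):
    """Return the key-value combinations that occur more than once in `rows`."""
    seen = []        # key values encountered so far
    repeated = []    # key values encountered at least twice
    for row in rows:                                  # Loop c as category
        key_value = [row[ko] for ko in key_objects]   # PacketData(c, ko)
        if key_value in seen:
            # Design choice for this sketch: report each repeat only once.
            if key_value not in repeated:
                repeated.append(key_value)
        else:
            seen.append(key_value)
    return repeated


# Invented sample loop with a compound key (id, n).
rows = [
    {"id": "a", "n": 1},
    {"id": "b", "n": 1},
    {"id": "a", "n": 1},
    {"id": "a", "n": 2},
]
print(not_unique(rows, ["id", "n"]))
print(not_unique(rows, ["id"]))
```

Lists rather than sets are used here only to stay close to the list operations available in dREL.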



Proposal 6: Extension of 'in' to substrings
-------------------------------------------

It is proposed that the construction ``<string1> in <string2>`` be interpreted
as a boolean statement that returns true if ``<string1>`` is a substring of
``<string2>``.

Explanation
~~~~~~~~~~~

``in`` in dREL is currently only applied to testing membership in a
List or Array. dREL as published proposes using the ``Substr``
function to test for membership of a string in another string. This
could be more economically performed using the ``in`` keyword without
compromising the use for Lists or Arrays. This also accords with the
use of ``in`` in Python.
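
For reference, this is how the two meanings of ``in`` coexist in Python, which is the behaviour the proposal points to. The variable names and sample strings are invented for illustration.

```python
# The right-hand operand decides the meaning of `in`:
# a list gives a membership test, a string gives a substring test.
symmetry_ops = ["x,y,z", "-x,-y,-z"]

is_member = "x,y,z" in symmetry_ops       # list membership
is_substring = "-x" in "-x,-y,-z"         # substring test
partial_in_list = "x,y" in symmetry_ops   # substring of an element only

print(is_member, is_substring, partial_in_list)
```

The last line illustrates that the two uses do not interfere: a string that is merely a substring of a list element is not a member of the list.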

Proposal 7: Removal of built-in functions
-----------------------------------------

The following functions are proposed for removal from the list of
provided functions:

TopLo, TopHi (sorting low->high, high->low)
    functionality duplicated by combinations of sort() and reverse()

Substr
    functionality replaced by Proposal 6.
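
The sort-then-reverse combination mentioned for ``TopLo`` and ``TopHi`` can be illustrated with its Python analogue. This assumes that the two functions simply return the sorted list, which the text above implies but does not spell out. The Python built-ins ``sorted`` and ``reversed`` stand in for the dREL ``sort()`` and ``reverse()``, and the sample values are invented.

```python
cell_lengths = [10.2, 5.1, 7.8]

# Ascending order: a plain sort.
low_to_high = sorted(cell_lengths)
# Descending order: the same sort followed by a reverse.
high_to_low = list(reversed(low_to_high))

print(low_to_high)
print(high_to_low)
```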

[1] Spadaccini et al. (2012) *J. Chem. Inf. Model.* **52**(8), pp 1917-1925

--

T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

Email Disclaimer: www.stjude.org/emaildisclaimer
Consultation Disclaimer: www.stjude.org/consultationdisclaimer

_______________________________________________
ddlm-group mailing list
[email protected]
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group

