Re: [ddlm-group] Third and final proposal to enhance dREL

On Tue, 18 Sep 2018 at 00:49, Bollinger, John C <John.Bollinger@stjude.org> wrote:

Dear James and DDLm group,

 

First, I remark that whereas I received an e-mail “Proposal to update dREL, part I” and obviously the “Third and final proposal to enhance dREL”, I do not recall, nor do I find any record of, a second or part II proposal, which it appears would have contained items numbered 3 and 4.


Oops. I have now sent out the second part. 

 

With regard to this part III proposal, however:

 

1. I’m not sure I follow the intended purpose of the “enhance meaning of 'Validation' methods” item.  As I understand it, the proposal is to expose all the details of each item’s definition to dREL for the use of validation methods.  But the example of checking an item’s value against the allowed values of its enumerated type is something that I would expect a DDLm-based validator to do at its own initiative, without need of a dREL method being defined in the dictionary.  More generally, I consider it the role and responsibility of a DDLm-based validator to validate all the per-item and inter-item characteristics that the relevant dictionary defines via DDLm semantics.


If a dictionary is viewed as a data file that provides (ontological) data conforming to the DDLm attributes, then a Validation dREL method applied to the dictionary fulfills the same function as a dREL method for validating that a data file contains data that are consistent with the domain dictionaries. Following your argument, dREL is not necessary in domain dictionaries either, because calculations are more properly the domain of 'dictionary-aware software'. 
 

2. The proposed new functions seem also to be aimed at supporting validation of DDLm-based semantics via methods expressed in data dictionaries.  Here too, I am inclined to think that the method behaviors that these are intended to support are not appropriate for expression in data dictionaries.  It ought not to be necessary, and I’m not presently seeing how it would be advantageous.


The use case I'm thinking of is that these 'validation' dREL methods would appear in dictionaries full of validation data names. A validator would then evaluate each of these data names in order to check that a domain dictionary is correctly written, in the same way as CheckCIF runs through a series of checks on a data file. By expressing the conditions for validity in dREL, the specification is not bound to a particular concrete programming language or set of CIF access libraries. 

 

3. Overall, I have previously understood “Validation” methods as being aimed at supporting item cross validations that cannot be expressed via DDLm attributes.  It is unclear to me why or in what circumstances it would be necessary or appropriate for such validations to depend on DDLm attributes. As far as I can see, the semantics of DDLm ought to be handled at a different level -- dictionary authors should not be responsible for providing for them.  In a strategic sense, not only do I not think we _need_ to provide for externalizing validation of DDLm semantics, I don’t think we _want_ to do that.  However, it is possible that there are good use cases that I have not considered, so I am prepared to be persuaded.


I think your understanding of the current intention of 'Validation' methods is correct, because the single example of their use in current dictionaries is to check that cell parameters match the crystal system. However, as I wrote in the proposal, the same result can be achieved by defining a separate data name (e.g. '_valid.crystal_system') and using a normal 'Evaluation' method, so that use of 'Validation' appears a bit pointless.

Note that I am not proposing that domain dictionary authors would ever need to use these 'Validation' methods. I am instead proposing that these methods would have a niche use, e.g. in a dictionary listing a series of data names whose dREL methods validate the use of DDLm attributes. This niche use is similar to the way in which quite a few DDLm attributes and attribute values are only ever used in the DDLm attribute definition dictionary itself.  If the word 'Validation' is not appropriate, we can choose a word with less baggage, such as 'Technical'.  Whatever the name, having a list of checks that can be run over domain dictionaries in a form that allows use in any environment supported by dREL would be useful. My experiments with the Lark generator suggest to me that generating code from dREL is a lot easier than one might think.

Another driver for this is the 'CheckCIF for raw data' project. I would prefer that any checks for raw data are written in dREL, to maintain independence from a particular set of libraries or language.  I would also envisage eventually rewriting CheckCIF checks in dREL to put it on a more robust footing. However, these CheckCIF-type projects only really need the proposed 'Known' built-in function, so you may wish to comment on that separately.

James.

 

From: ddlm-group [mailto:ddlm-group-bounces@iucr.org] On Behalf Of James Hester
Sent: Monday, September 10, 2018 1:30 AM
To: ddlm-group <ddlm-group@iucr.org>
Subject: [ddlm-group] Third and final proposal to enhance dREL

 

Dear DDLm group,

 

Please see below my final proposal for enhancements to dREL. This one is somewhat more substantial, but concerns only built-in functions and an altered execution context for an unused aspect of dREL/DDLm.  I urge you to try and pick holes or find better ways to encode validation in dREL, as these ideas are being presented in public for the first time. I will leave this on the table for a while, meanwhile implementing these built-in functions and trying them out with some Validation methods.  Given a positive outcome of any discussion here and at Github, I would plan to fold these changes into a '1.0' version of DDLm for publication on our website and inclusion in the next edition of Volume G.

 

This proposal may also be read as more nicely formatted text at  https://github.com/COMCIFS/dREL/blob/master/drel_changes_3.rst , and comments in the 'Issues' tab there are welcome.

 

James.

====

 

Proposed changes to dREL, part III

==================================

 

Introduction

------------

 

dREL is a machine-actionable language describing data relationships

and designed to be embedded in DDLm dictionaries. The language is

defined both explicitly in the dREL publication [1] and implicitly by

the dREL code appearing in the DDLm core CIF dictionary. Note that

the code in the core CIF dictionary significantly expands the language

presented in the paper, for example, by adding category methods.

 

The present proposals concern the built-in functions that are

supported by dREL. No syntax changes or enhancements are proposed.

 

Proposal 5: enhance meaning of 'Validation' methods

---------------------------------------------------

 

It is proposed that a ``_method.purpose`` of ``Validation`` will imply

that all DDLm attribute categories may appear as variables within the

associated dREL method, and that the values of these attributes are

those for the data name being validated.  Additional predefined

variables ``value`` and ``category`` are bound to the particular value

and loop being validated.

 

Explanation

~~~~~~~~~~~

 

DDLm requires that each method have an associated ``_method.purpose``.

The current DDLm attributes dictionary defines ``Evaluation``, ``Validation``

and ``Definition``.  The ``Validation`` purpose is given as

 

     method compares an evaluation with existing item value

 

This type of method appears only once in the DDLm core dictionary, to

show how the crystal system can be checked against the cell parameters.

This could equally well be performed by creating a data name whose

``Evaluation`` dREL method returns ``True`` if the conditions are met.

The present proposal therefore suggests repurposing Validation methods for

more general validation by providing them with access to all attributes

of the definition of any data name that they are used to check. This allows

the methods to confirm, for example, that a value for a data name matches

the list of allowed enumerated values.

 

Note that currently, all categories from the dictionary in which a

dREL method appears can appear as pre-defined variables in that dREL

method, with values obtained from an associated data block. This proposal

enhances that list with the attributes for the definition of the data name

being checked.

 

The ``category`` and ``value`` pre-bound variables are required in order

to represent the generic value(s) being checked.

 

Example

~~~~~~~

 

The following code finds values that are not among the allowed

enumerated values. It would appear in the definition for a data name

``valid.bad_enumerated_value``.  The ``enumeration_set`` variable

contains the contents of the ``enumeration_set`` category in the

definition for a given data name, and ``value`` is bound by the

execution environment to the particular value being checked.  The

execution environment is responsible for collating values of this

data name for each data name in the data block being checked.::

 

    # Flag a value that is not listed in the enumeration

    bad = 'True'

    # Loop over enumerated states in the definition for this

    # data name

    loop e as enumeration_set {

        if (value == e.state) bad = 'False';

        }

    valid.bad_enumerated_value = bad
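
For readers more at home in Python, the enumeration check above can be sketched roughly as follows. This is an illustration only: the function name ``is_enumerated``, the list-of-dicts layout used for the ``enumeration_set`` loop, and the sample states are all assumptions made here, not part of dREL or of any CIF library.

```python
# Hypothetical layout: the enumeration_set loop of a definition is modelled
# as a list of rows, each row a dict with a 'state' key.
def is_enumerated(value, enumeration_set):
    """Return True if value equals the state of any enumeration_set row."""
    for row in enumeration_set:
        if value == row["state"]:
            return True
    return False


# Illustrative rows only; a real definition supplies its own states.
example_set = [{"state": "triclinic"}, {"state": "monoclinic"}]

print(is_enumerated("monoclinic", example_set))  # listed state
print(is_enumerated("hexagonal", example_set))   # not listed in example_set
```

A value would be reported as bad exactly when ``is_enumerated`` returns ``False``; the execution environment would call it once per value being checked.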

 

Proposal 5: Extra validation functions

--------------------------------------

 

It is proposed to add the following functions to the list of those

allowed in dREL methods:

 

Reference(name,attribute)

    The value of ``attribute`` in the

    dictionary definition of ``name`` is returned.  Both ``attribute`` and

    ``name`` are either string literals or string-valued variables. Where

    the result would be a loop, an appropriate dREL category object would

    be returned.

 

Instance(category)

    Returns an instance of category ``category`` in the data block

    provided to the dREL method.

 

PacketData(container,object)

    Returns specific data corresponding to ``object`` in ``container``.

    The functional equivalent of the syntax ``cat.obj``,

    where ``cat`` is the value of ``container`` and ``obj`` is the value of

    ``object``. If ``container`` is a category, the row must be

    unambiguous from context, if necessary using the resolution rules

    of the proposals in Part II.

 

Lookup(category,keys)

    The functional equivalent of ``cat[k1=val1,k2=val2,...]``

    where ``cat`` is the value of ``category`` and ``keys`` is the dictionary

    ``{'k1':val1,'k2':val2,...}``.

 

Known(name)

    Evaluates to true if a value for the object referenced by

    ``name`` can be found, false otherwise.  If ``name`` does not resolve

    to a ``category.object`` reference, or the particular row of a

    multi-row category is unknown, the function returns false.

 

Explanation

~~~~~~~~~~~

 

A dREL method for checking conformance to requirements arising out of

DDLm attributes (for example, that a value is drawn from a list of values

of a different data name) cannot have 'hard-coded' ``<category>.<object>``

names, as the method would no longer be applicable to all data names.

The above functions are therefore required to provide access into categories

and data in a generalised way. 
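
To make the intended behaviour of the five functions more concrete, here is a minimal Python sketch over plain dictionaries. Everything in it is an assumption for illustration: the lower-case function names, the in-memory layout of ``definitions`` and ``data_block``, and the sample values. In particular, this ``known`` has no notion of a "current row", so it simply treats any multi-row category as ambiguous.

```python
# Hypothetical in-memory model, for illustration only:
#   definitions: data name -> {attribute name: value}
#   data_block:  category name -> list of rows, each row {object id: value}
definitions = {
    "atom_type.symbol": {
        "_name.category_id": "atom_type",
        "_name.object_id": "symbol",
    },
    "cell.length_a": {
        "_name.category_id": "cell",
        "_name.object_id": "length_a",
    },
}
data_block = {
    "atom_type": [{"symbol": "C"}, {"symbol": "O"}],
    "cell": [{"length_a": 5.4}],
}


def reference(name, attribute):
    """Value of `attribute` in the dictionary definition of `name`."""
    return definitions[name][attribute]


def instance(category):
    """Rows of `category` found in the data block (empty list if absent)."""
    return data_block.get(category, [])


def packet_data(container, obj):
    """Functional form of `cat.obj`, applied to a single row."""
    return container[obj]


def lookup(category, keys):
    """Functional form of cat[k1=val1,...]: the one row matching all keys."""
    rows = instance(category)
    matches = [r for r in rows if all(r.get(k) == v for k, v in keys.items())]
    if len(matches) != 1:
        raise KeyError(keys)
    return matches[0]


def known(name):
    """True only for category.object names with a value in a single-row category."""
    category, sep, obj = name.partition(".")
    if not sep:
        return False  # not a category.object reference
    rows = instance(category)
    return len(rows) == 1 and rows[0].get(obj) is not None


# Generic access: nothing below hard-codes `atom_type.symbol` as syntax.
cat = reference("atom_type.symbol", "_name.category_id")
obj = reference("atom_type.symbol", "_name.object_id")
symbols = [packet_data(row, obj) for row in instance(cat)]
print(symbols)
print(lookup("atom_type", {"symbol": "O"}))
print(known("cell.length_a"), known("cell.length_b"), known("atom_type.symbol"))
```

The point of the sketch is that both the category and the object are ordinary string values at run time, which is what allows a single method to be applied to every data name.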

 

Examples

~~~~~~~~

 

``Reference('atom_type.symbol','enumeration_set')``

    Return the contents

    of the ``enumeration_set`` loop in the definition of ``atom_type.symbol``.

 

``Loop i as Instance( Reference( name.linked_item_id,'_name.category_id'))``

    Loop over all rows of the category

    containing the data name contained in variable

    ``name.linked_item_id``. Note that ``name.linked_item_id`` is not

    contained in quotes and therefore will be assigned the value given in the

    definition of the data name being validated. The ``Reference`` function returns

    a string naming the category of the linked data name, and the ``Instance``

    function takes that string and returns a category object that is populated

    with the values in the data file.

 

Checking whether a value matches one of the values of the linked data name. ::

 

    # Find values that are not those of the linked data name.

    result = 'False'

    linked_object = Reference(name.linked_item_id,'_name.object_id')

    loop i as Instance(Reference(name.linked_item_id,'_name.category_id')) {

        if (PacketData(i,linked_object) == value) result = 'True'

    }

    valid.is_child_key = result

 

Finding and returning repeated values of a key data name as the

value of data name ``valid.not_unique``. Note the use of variable

``category`` to refer to the loop being checked. ::

 

    # find key values that are not unique.

    not_unique = []

    # Key value combinations already encountered

    seen = []

    # Accumulate keys

    keylist = []

    # get the object id for each key data name

    Loop k as category_key {

        keylist ++= Reference(k.name,'_name.object_id') #Append

        }

    Loop c as category {

        new_val = []

        for ko in keylist {

            new_val ++= PacketData(c,ko) #Append

            }

        if (new_val in seen) {

            not_unique ++= new_val

            }

        else {

            seen ++= new_val

            }

        }

    valid.not_unique = not_unique
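
The uniqueness check can also be sketched in Python, which may make the intended logic easier to follow. The function name ``repeated_keys``, the row layout (one dict per loop row) and the sample rows are assumptions made for this illustration; key values are assumed to be hashable.

```python
def repeated_keys(rows, key_objects):
    """Return the key-value tuples that repeat an earlier row, in row order."""
    seen = set()       # key combinations already encountered
    not_unique = []    # one entry per row that repeats an earlier key
    for row in rows:
        key = tuple(row[obj] for obj in key_objects)
        if key in seen:
            not_unique.append(key)
        else:
            seen.add(key)
    return not_unique


# Illustrative loop with a repeated single key and a repeated compound key.
rows = [
    {"id": "a", "n": 1},
    {"id": "b", "n": 1},
    {"id": "a", "n": 1},
    {"id": "a", "n": 2},
]
print(repeated_keys(rows, ["id"]))
print(repeated_keys(rows, ["id", "n"]))
```

Note that the list of key object ids and the collection of values already seen are kept separate, so that the keys being extracted never change while the rows are processed.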

    

Proposal 6: Extension of 'in' to substrings

-------------------------------------------

 

It is proposed that the construction ``<string1> in <string2>`` be interpreted

as a boolean statement that returns true if ``<string1>`` is a substring of

``<string2>``.

 

Explanation

~~~~~~~~~~~

 

``in`` in dREL is currently only applied to testing membership in a

List or Array.  dREL as published proposes using the ``Substr``

function to test for membership of a string in another string. This

could be more economically performed using the ``in`` keyword without

compromising the use for Lists or Arrays. This also accords with the

use of ``in`` in Python.
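
For comparison, the Python behaviour referred to above is shown below; the variable names and sample values are purely illustrative. The same keyword performs a substring test when the right-hand side is a string and an element-wise membership test when it is a list.

```python
cell_setting = "monoclinic"
allowed = ["monoclinic", "triclinic"]

substring_hit = "mono" in cell_setting  # substring test on a string
list_hit = "mono" in allowed            # membership test: compares whole elements
exact_hit = "monoclinic" in allowed     # whole element present in the list

print(substring_hit, list_hit, exact_hit)  # True False True
```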

 

Proposal 7: Removal of built-in functions

-----------------------------------------

 

The following functions are proposed for removal from the list of

provided functions:

 

TopLo, TopHi (sorting low->high, high->low)

    functionality duplicated by combinations of sort() and reverse()

 

Substr

    functionality replaced by Proposal 6.
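
As a rough illustration of why the sorting functions are redundant, both orderings follow from a sort and a reversal. The sketch below uses Python's ``sorted`` and ``reversed`` as stand-ins for the dREL ``sort()`` and ``reverse()`` mentioned above. The sample values and variable names are assumptions, and since only the two orderings are described here, no claim is made about which function name corresponds to which order.

```python
values = [3.2, 1.5, 2.7]  # illustrative data only

low_to_high = sorted(values)                # ascending order
high_to_low = list(reversed(low_to_high))   # descending order via reversal

print(low_to_high)   # [1.5, 2.7, 3.2]
print(high_to_low)   # [3.2, 2.7, 1.5]
```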

 

 

[1] Spadaccini et al.

(2012) *J. Chem. Inf. Model.* **52**(8), pp. 1917-1925

 

--

T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148



_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group

