Re: [ddlm-group] Refocusing discussion on dREL use for validation

I have added comments inline below.

On Fri, 19 Oct 2018 at 01:39, Bollinger, John C <John.Bollinger@stjude.org> wrote:

On Thursday, October 18, 2018 12:17 AM, James Hester wrote:

[....]
> I believe that expressing these checks in a programming-language-agnostic way is important, as this would avoid us being pinned to particular environments and systems over time.  Furthermore, I think that dREL would be a good choice, as it is tightly matched to the dictionary environment and tools that transform it to <insert your favourite language+CIF environment here> can be re-used.


As I'm sure is clear by now, I do not attribute much importance to expressing such checks in dREL.  In particular, I do not consider that objective a sufficient justification for a broad suite of additions and changes.  I do agree that dREL is well matched to the dictionary environment, so it seems a reasonable choice from that perspective.

Before moving on, however, I'd like to point out that if changing dREL is on the table then there is a broad spectrum of possible approaches, including such things as requiring dREL to perform type checking automatically or adding a validateType() function.  These are not the sort of thing presented in the previous proposal, but they should not be dismissed out of hand, especially if we look at the question from the point of view of maintaining the dREL language in general, as opposed to specifically enabling it to serve the purpose we're discussing.  The more features we add, the fewer implementations we can expect, and that could easily mean that in-principle tool independence is actually single-tool dependence in practice.

It is a good point that the greater the complexity, the fewer implementations there will be. The original proposal deliberately confined any changes to methods with a specific _method.purpose, so that a dREL implementation could target only a particular purpose, and it added no new syntax.  Furthermore, the built-in functions that were suggested generally mirrored functions that had to be implemented anyway.  Additionally, programming languages generally feel free to add new built-in functions, but are much more careful about changing syntax.

> So, given that we wish to use dREL, can we make it work for our simple task of checking enumerated values?  dREL as currently conceived executes in a well-defined environment, which can be described as follows, if a dREL definition is located in the definition for object 'd' in category 'c', with supplied data block 'f':
>
> The following immutable bindings have been made:
> (i) a single packet of category 'c' is bound to 'c'
> (ii) values for all objects 'o' in 'c' are bound to 'c.o' using values from 'f', except for 'd'
> (iii) all other categories are available through their names, and after a packet is specified, individual data are accessed in the same way as 'c'
>
> In addition, dREL engines need to make use of the following semantic information from the dictionary in which the definition appears:
> (i) category keys are used to identify packets in categories other than 'c'
> (ii) linked items could be used to resolve key values (not yet agreed with this group)
> (iii) item type and dimension is determined using type information for the relevant data name
> (iv) correspondence between data name in the data file and category.object in the dREL
>
> Given this environment, we cannot write a dREL method for checking enumerated values of even a single, specific data name, because no explicit access to domain dictionary contents is exposed in the dREL method, whether through built-in functions, syntactic constructs, or pre-existing bindings (feel free to try). Furthermore, if we wish to write a single dREL method for all enumerated value data names (which is much more economical), then we no longer even have bindings to 'c'.
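
To make the restriction concrete, the bindings listed above can be modelled outside dREL. The following Python sketch is only an illustration: `make_environment`, the dict layout, and the sample values are assumptions of mine, not part of dREL or of any existing implementation. It shows that the environment exposes values from data block 'f', but nothing from the domain dictionary, such as the list of enumerated states.

```python
# Hypothetical model of the dREL execution environment described above.
# All names and sample values are illustrative assumptions.

def make_environment(data_block, category, defined_object, packet_index=0):
    """Build the bindings visible to a method defining category.defined_object."""
    packet = dict(data_block[category][packet_index])
    # Every object in the packet is bound except the one being defined.
    packet.pop(defined_object, None)
    env = {category: packet}
    # Other categories are reachable by name (modelled here as lists of packets).
    for other, packets in data_block.items():
        if other != category:
            env[other] = packets
    return env

data_block = {
    "refine": [{"ls_weighting_scheme": "sigma", "ls_number_reflns": 1200}],
    "cell": [{"length_a": 5.4}],
}

env = make_environment(data_block, "refine", "ls_weighting_scheme")
# env holds only file data; no dictionary attribute such as the
# enumeration states for refine.ls_weighting_scheme is reachable from it.
```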


Thank you.  That is exactly how I would have liked to start the discussion, and I am pleased to be there now.


> Therefore, my initial proposal posited enhancing the execution environment to remove these restrictions, with the change flagged by the value of the '_method.purpose' attribute. I think this is a low-impact solution to this conundrum, but I would welcome alternative suggestions.


There is room to extend dREL, but do we need to do that to solve the particular narrow problem we are considering?  I think not.  The challenge revolves around the fact that we seem to want methods that access data from multiple ontological levels, and a large component of the previously proposed solution can be characterized as adding introspective capabilities to dREL to support that.  But that's not the only way to get the wanted data.

For example, one alternative would be for validation methods in the domain dictionary to be generated, with appropriate ontology-level data inserted literally, by a method residing in the DDLm dictionary.  That would take the form of an evaluation method for attribute _method.expression.  It would have multiple advantages, among them that
 (i) the information about how to encode DDLm requirements in methods would be presented in the DDLm dictionary itself.
 (ii) there would be no need for any explicit validation methods (for this purpose) in domain dictionaries or external dictionaries.  They would be generated at need.
 (iii) method implementations would automatically track dictionary changes.
 (iv) dREL can support this already, I think, or with minimal changes at most.  I am optimistic enough to think it plausible that some existing dREL implementations would support it out of the box.

An intriguing idea. The issues I see are how:
(i) to avoid stepping on the toes of any _method.expression values that are already present in the domain dictionary,
(ii) to allow multiple validation methods for a single data name, and
(iii) to generate multiple validation methods for a single data name.

In the spirit of our concrete problem, a dREL method for generating a dREL method that validates an enumerated state might appear as follows. This method would appear in the '_method.expression' definition of the DDLm attribute dictionary. It therefore has access to the DDLm attribute dictionary for the semantic information listed in my previous email (quoted above), and the DDLm attributes themselves are bound to the values of those attributes in the domain dictionary definition for which the method is being generated. In order to overcome the three issues I've mentioned above, assume that a category-level dREL method has previously assigned an arbitrary value to a new DDLm attribute '_method.id'.  The dREL method below is therefore provided with a particular value for '_method.id', which it uses to determine which check to generate.  As I've written it, the check is built from a template in which placeholders are marked with dollar signs. The placeholders are filled in by a separate dictionary function called 'resolve_variables', which relies on a new dREL built-in function that I've invented, 'substitute(a,b,c)', which replaces all occurrences of string 'b' in string 'a' with string 'c'.

# Other test generators here, and then..

enum_template = """
if ($dataname in $enumeration_set) found = 'True'
else found = 'False'
return found
"""
 
if (method.id == 'enumeration' ) {
    states = ''
    loop s as enumeration_set {
        states = states + "'" + s.state + "', "   #Create list of states
        }
    states = "[" + states + "]"
    with_states = substitute(enum_template,"$enumeration_set",states)
    method.expression = resolve_variables(with_states)
   }

if (method.id == 'range') {
#.... and so on
# more test generators follow

### Elsewhere in the DDLm attribute dictionary appears the following function for general use:

Function resolve_variables(template: [Single,Text]) {
    # Bindings to file data are available in functions too
    object = name.object  
    category = name.category
    dataname = category + "." + object
    template = substitute(template,"$dataname",dataname)
    template = substitute(template,"$object",object)
    template = substitute(template,"$category",category)
    resolve_variables = template
}
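
For anyone who wants to experiment outside a dREL engine, the generator can be mimicked in Python. This is a sketch under assumptions: `substitute` is modelled with `str.replace`, the bound DDLm attributes (name.category, name.object, and the enumeration_set states) are passed in explicitly as arguments rather than being pre-bound, and the function name `generate_enumeration_check` is mine.

```python
ENUM_TEMPLATE = """
if ($dataname in $enumeration_set) found = 'True'
else found = 'False'
return found
"""

def substitute(text, old, new):
    """Stand-in for the proposed dREL built-in: replace every occurrence."""
    return text.replace(old, new)

def resolve_variables(template, category, obj):
    """Fill in the data-name placeholders for one definition."""
    dataname = category + "." + obj
    template = substitute(template, "$dataname", dataname)
    template = substitute(template, "$object", obj)
    template = substitute(template, "$category", category)
    return template

def generate_enumeration_check(category, obj, states):
    """Return dREL source text that checks a value against the allowed states."""
    state_list = "[" + ", ".join("'" + s + "'" for s in states) + "]"
    with_states = substitute(ENUM_TEMPLATE, "$enumeration_set", state_list)
    return resolve_variables(with_states, category, obj)

generated = generate_enumeration_check(
    "refine", "ls_weighting_scheme", ["sigma", "unit", "calc"]
)
print(generated)
```

Joining the states with `", "` avoids a trailing comma in the generated list. The sketch does no escaping, so a state value containing a quote character would need extra handling.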


How this could work in practice is as follows: an application interested in validation, when loading a DDLm dictionary, would see that a 'Validation' _method.expression is missing for a given definition (say refine.ls_weighting_scheme). It would then execute a category-level dREL method for the 'METHOD' category, which populates the domain dictionary definition with the names of applicable dREL checks.  The application could then ask for the '_method.expression' for any of these method names, at which point the above code would execute to produce the following string (for definition refine.ls_weighting_scheme):

if (refine.ls_weighting_scheme in ['sigma', 'unit', 'calc',]) found = 'True'
else found = 'False'
return found
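
The sequence just described (notice that a 'Validation' method is missing, list the applicable check names, then generate an expression only when it is requested) could be sketched as follows. Everything here is hypothetical: the `Definition` class, `applicable_checks`, and `method_expression` are names invented for illustration, and definitions are plain Python objects rather than DDLm definitions.

```python
from dataclasses import dataclass, field

@dataclass
class Definition:
    category: str
    obj: str
    states: list = field(default_factory=list)   # enumeration_set states
    methods: dict = field(default_factory=dict)  # method id -> dREL text or None

def applicable_checks(defn):
    """Stand-in for the category-level method that assigns _method.id values."""
    checks = []
    if defn.states:
        checks.append("enumeration")
    return checks

def method_expression(defn, method_id):
    """Generate the expression on first request and cache it."""
    if defn.methods.get(method_id) is None:
        if method_id != "enumeration":
            raise KeyError(f"no generator for {method_id!r}")
        states = ", ".join(f"'{s}'" for s in defn.states)
        defn.methods[method_id] = (
            f"if ({defn.category}.{defn.obj} in [{states}]) found = 'True'\n"
            "else found = 'False'\n"
            "return found"
        )
    return defn.methods[method_id]

defn = Definition("refine", "ls_weighting_scheme", ["sigma", "unit", "calc"])
for check in applicable_checks(defn):
    defn.methods.setdefault(check, None)  # names known, text not yet generated
text = method_expression(defn, "enumeration")
```

Only the 'enumeration' generator is modelled; a fuller version would dispatch on the method id in the same way as the chain of `if (method.id == ...)` tests in the dREL above.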


So: I think I have shown that this approach could work in a dynamic environment in which functions can be parsed and executed. The additional changes required to dREL and DDLm would be:
(i) Clarification of the execution environment for 'Validation' methods, as the return value is no longer the value of the defined data name (actually we need to do this anyway)
(ii) Addition of some more string-handling built-in functions for convenience
(iii) an extra DDLm attribute to give each method a name
(iv) allowing multiple methods of a given type

I have gone into such detail in order for us to have a clear view of the way in which the two approaches would differ in practice. The original approach requires some extra built-in functions, an expansion of the dREL execution environment, and would locate the tests inside their own definitions. The 'auto-generate' approach involves no change in the execution environment, would benefit from a new (generally useful) built-in function, and would locate all tests for a dataname together with that dataname (semantically).  John has listed some other desirable characteristics above.

I nevertheless somewhat prefer the original approach, as 
(i) it allows tests to be documented via the test's DDLm definition
(ii) the meaning of the test is not hidden behind string manipulations such as those above (although templating and some further construction functions could improve this)
(iii) the auto-generation approach will involve a couple of very long methods (for method.expression and the method category) in the DDLm attribute dictionary, whereas the original approach was modular and could be kept completely separate from the dictionaries themselves.

That said, I am more concerned about just having the ability to do these tests in dREL, and using the above as a proof that it is worth investigating, I can live with the auto-generation approach.  I will investigate the possibility of doing this for a "Task II: linked items" in a separate email.

Another alternative would be even simpler: to write a separate code generator for the wanted dREL methods, and to incorporate all the resulting methods into the domain dictionary.  One would simply re-run the validation method generator each time the dictionary is updated, as a late step in the process of issuing a new release.  That would somewhat enlarge domain dictionaries, but probably not all that much if the only validations we generate are those we are specifically discussing at the moment -- validating values of items having enumerated types.
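
A release-time generator of this kind might look roughly like the Python sketch below. The dictionary is modelled as a plain mapping from data name to definition, and the keys `enumeration_states` and `validation_method`, like the function names, are assumptions made for illustration rather than real DDLm attribute names.

```python
def enumeration_method(dataname, states):
    """Return dREL text that checks `dataname` against its allowed states."""
    state_list = ", ".join(f"'{s}'" for s in states)
    return (
        f"if ({dataname} in [{state_list}]) found = 'True'\n"
        "else found = 'False'\n"
        "return found"
    )

def add_validation_methods(dictionary):
    """Return a copy of `dictionary` with generated validation methods added."""
    updated = {}
    for dataname, definition in dictionary.items():
        definition = dict(definition)  # leave the source definition untouched
        states = definition.get("enumeration_states")
        # Do not overwrite a method that is already present.
        if states and "validation_method" not in definition:
            definition["validation_method"] = enumeration_method(dataname, states)
        updated[dataname] = definition
    return updated

domain_dictionary = {
    "refine.ls_weighting_scheme": {"enumeration_states": ["sigma", "unit", "calc"]},
    "cell.length_a": {"type": "Real"},
}
released = add_validation_methods(domain_dictionary)
```

Because existing methods are never overwritten, the generator in this sketch would need to be run against the source dictionary (without previously generated methods) at each release, so that changes to the enumerated states are picked up.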

I would estimate around 10-20 different validation tests, not all of them applicable, for each data name.  Having these explicitly in the dictionary would go against the goal of keeping the dictionaries reasonably human-readable, and would hide the important dREL methods in amongst a pile of tests that are not usually scientifically relevant, so I would not like them to appear in published dictionaries.  I think the only viable approach for a validation application would be to autogenerate the tests as above, whether at compilation time or at execution time. 


Regards,

John

--
John C. Bollinger, Ph.D.
Computing and X-Ray Scientist
Department of Structural Biology
St. Jude Children's Research Hospital
John.Bollinger@StJude.org
(901) 595-3166 [office]
www.stjude.org





--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148