Discussion List Archives

DDLm implementation discussion

  • To: coreCIFchem@iucr.org, "Discussion list of the IUCr Committee for the Maintenance of the CIF Standard (COMCIFS)" <comcifs@iucr.org>
  • Subject: DDLm implementation discussion
  • From: David Brown <idbrown@mcmaster.ca>
  • Date: Tue, 10 Mar 2009 15:45:39 -0400
Dear Colleagues,

My discussion paper on the use of intermediate computation items in DDLm has produced many interesting suggestions but not a lot of consensus.  I have included an edited version of the discussion, bringing related comments together under three headings.  The first deals with the desirability or otherwise of defining beta in the dictionary, the second addresses the implications of methods serving as definitions, and the third concerns the problem of hiding or removing intermediate items so as not to clutter or otherwise compromise the CIF.  The solution I am proposing as a result of the discussion is that all the derived items (i.e. those with methods) should be defined in a 'derived-items' dictionary, while the experimental and assigned items will appear in an archive dictionary.  Anyone looking for derived items will obtain them if they read an archive CIF under the control of the combined archive and derived-items dictionaries.  This is more fully described at the end of this document.

AMBIGUITY IN THE USE OF THE BETA FORM OF THE ADP

Carol Brock (Senior editor of Acta Cryst. C and a user of CIF) in a private email provides a strong argument for why betas should be invisible.

What confusion ADPs cause.  I came into crystallography at about the time that calls were being made for a switch from betas (used in ORFLS and its variants) to Us.  While reading your email a picture of the page in the ORTEP manual where all the ADP types are listed flashed through my mind.  You are so right about betas being nearly useless because of the confusion about the factor of two.  I remember drawing ellipsoids with and without the factor of two for structures in the literature when attempting (usually in vain) to figure out which form had been used.  Please be sure no calculated beta is ever available where somebody might use it in the archival literature.

I would be very happy to help argue with the IT people.  Nobody should ever see a beta again.  A complication is that it is only the fossils who remember how much confusion betas caused.

Doug Duboulay (DD) on the other hand provides equally persuasive arguments why beta should be accessible.

If I understand correctly, beta is the *only*
true tensor form of the ADPs. If you want to convert between
different unit cells, transforming the Uijs is only possible by converting to
tensor form before the symmetry transform and back to Uijs afterwards.
To obfuscate its role in a definitive treatise seems lacking.
Of course I may be wrong :))

Nick Spadaccini (NS)
The _matrix_beta item is an item that does warrant definition and as
Doug states is a true tensorial form of ADPs. David alludes to another
problem that does not exist - there can be NO confusion in the definition of
the individual off-diagonal beta terms (historically there is a factor of 2
confusion). Why? Because the individual off diagonal terms can never be
accessed because there is no definition for them. There is never an error in
building the _matrix_beta because it is constructed directly from the U or B
matrices (where there is no confusion).

IDB Comment:
Nick is referring here to the way matrices are constructed from the individual matrix elements stored in the CIF.  The assumption is that experimental information will be given as matrix elements and the matrices themselves would only be created under dictionary control.  So Nick is right as long as no-one enters the beta matrix as a matrix because individual matrix elements do not appear in the dictionary.  However, there is nothing I am aware of in DDLm that prevents the matrix from being generated by external software or a word processor, and the convenience of doing so may encourage users to take this shortcut.  The question is how to prevent this.
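
The factor-of-two ambiguity discussed in this section can be illustrated with a minimal Python sketch. It assumes the usual relation beta_ij = 2*pi^2 * a*_i * a*_j * U_ij for CIF-style U_ij; the function names and the numerical values are hypothetical and are not taken from any dictionary. The last step shows what happens when a beta matrix stored under the rival convention (off-diagonal terms already doubled) is read as if it followed the first one.

```python
import math

TWO_PI_SQ = 2.0 * math.pi ** 2


def u_to_beta(u, recip):
    """Convert U_ij (in A^2) to dimensionless beta_ij = 2*pi^2 * a*_i * a*_j * U_ij."""
    return [[TWO_PI_SQ * recip[i] * recip[j] * u[i][j] for j in range(3)]
            for i in range(3)]


def beta_to_u(beta, recip):
    """Inverse of u_to_beta, assuming the same convention for beta."""
    return [[beta[i][j] / (TWO_PI_SQ * recip[i] * recip[j]) for j in range(3)]
            for i in range(3)]


def doubled_off_diagonals(beta):
    """Rival historical convention: off-diagonal terms stored as 2*beta_ij."""
    return [[beta[i][j] * (1.0 if i == j else 2.0) for j in range(3)]
            for i in range(3)]


# Hypothetical example values (not from any real structure).
recip = [0.10, 0.08, 0.05]            # a*, b*, c* in A^-1
u = [[0.030, 0.004, 0.002],
     [0.004, 0.025, -0.003],
     [0.002, -0.003, 0.040]]

beta = u_to_beta(u, recip)
u_back = beta_to_u(beta, recip)                            # consistent convention
u_misread = beta_to_u(doubled_off_diagonals(beta), recip)  # convention mismatch
```

With a consistent convention the round trip recovers U; with the mismatch the diagonal terms survive but every off-diagonal U_ij comes back doubled, which is the error Carol Brock describes trying to detect by drawing ellipsoids both ways.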

METHODS AS DEFINITIONS

>> METHODS ARE THE NEW DEFINITIONS
>> At the meeting of COMCIFS in Osaka it was decided that when a method is
>> present in the dictionary it takes precedence over text in defining the
>> item.  One immediate corollary to this is that only one method is
>> allowed for each data item.

DD: I am not sure why that is a corollary.
Why is it not possible to have fallback methods when a particular
evaluation strategy fails?

My understanding of the derivation pathway was more like:

              U11,U22,...   B11,B22...
                /           /
 3    beta -> Uaniso -> Baniso
                 \         \
                  Uiso       Biso

i.e. there are discrete component forms  as well as matrix forms
as well as tensor forms.

Also, for calculating H atom adp's isn't there an algorithm based on Uiso of their coordinating C,N,O atoms?
How are you going to get Uiso from Uaniso?

IDB Comment.
Uiso is not the same as Uequiv.  If Uequiv is required we should define Uequiv -> Uaniso -> Uiso.  Presumably Uequiv will be the same as Uiso if no value of Uaniso is provided, providing we supply the correct methods.

NS: David suggests there is a problem with the fact that
different evaluation pathways exist for obtaining a value for an item, i.e., that having multiple paths is problematic, whereas this is the express design of dREL.

The beta->Uani->Bani->Uiso->Biso pathway David describes is not correct, and
Doug makes an attempt to clarify (and gets closer). The actual calculation
pathway is

    |->Uani
    |->Bani
Beta|
    |->Uiso
    |->Biso

Which seems perfectly logical to me. dREL is a Turing complete language and
while there is one evaluation method in an item definition you can create a
multitude of paths to the answer - and that is a GOOD THING.

IDB Comment
In the proof-of-principle dictionary this branching depends on the value of .adp_type which may not be present (it has no method in the proof-of-principle dictionary and an enumeration default of Uiso which could be incorrect).  It would be better if branching were based on an ordered test for the existence of an item, thus if Uaniso is not present, look for Baniso.  However, as Doug points out, a search for the existence of Uaniso would have to allow Uaniso the opportunity to generate itself from its individual elements.  I am not sure how this is done in practice but that is a problem for the dREL implementation.

There seems to be a consensus here that branching is built into dREL and is desirable in a definition.  It is not clear if this has to be achieved using if-then constructs or an ordered loop defining the different branch methods.  If one goes for the if-then construction, how does one ensure the tested item is present, or can one make provision for a default procedure if it is not?  If the definitions form a tree (rather than a network with closed loops) it should not be possible to get stuck in a loop as mentioned below by DD.
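
One way to picture the "ordered test for the existence of an item" is the following Python sketch. It is not dREL and does not describe any existing implementation; the names `resolve`, `Unresolved` and the item names are hypothetical. Each item has an ordered list of branches, a branch is skipped when one of its inputs cannot be obtained, and an input is given the chance to generate itself from its own elements before it is declared missing. A set of items currently being evaluated guards against going round a closed loop.

```python
import math


class Unresolved(Exception):
    """Raised when no branch of an item can be evaluated."""


def resolve(name, data, methods, _active=None):
    """Return the value of `name`, trying its branches in order.

    data    : items actually present in the CIF (name -> value)
    methods : name -> ordered list of (input_names, function) branches
    """
    if name in data:
        return data[name]
    active = _active if _active is not None else set()
    if name in active:
        raise Unresolved(f"circular evaluation path through {name}")
    active.add(name)
    try:
        for inputs, func in methods.get(name, []):
            try:
                # Each input may itself be generated by its own method.
                args = [resolve(i, data, methods, active) for i in inputs]
            except Unresolved:
                continue  # this branch is unavailable, fall back to the next
            return func(*args)
        raise Unresolved(f"no branch of {name} could be evaluated")
    finally:
        active.discard(name)


EIGHT_PI_SQ = 8.0 * math.pi ** 2
U_ELEMENTS = ("U11", "U22", "U33", "U12", "U13", "U23")
B_ELEMENTS = ("B11", "B22", "B33", "B12", "B13", "B23")

methods = {
    # First choice: build from the stored U elements; fallback: convert from B.
    "U_aniso": [(U_ELEMENTS, lambda *e: list(e)),
                (("B_aniso",), lambda b: [x / EIGHT_PI_SQ for x in b])],
    "B_aniso": [(B_ELEMENTS, lambda *e: list(e))],
}

# A CIF that reports only B elements (hypothetical values).
data = dict(zip(B_ELEMENTS, (2.0, 2.4, 3.0, 0.1, 0.0, -0.2)))
u_aniso = resolve("U_aniso", data, methods)
```

Here the first branch of `U_aniso` fails because no U elements are present, so the second branch is used and `B_aniso` is assembled from its elements on the way. No test on `.adp_type` is needed, and a missing item simply causes a fallback rather than an error.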

HIDING THE INTERMEDIATE ITEMS

>POSSIBLE SOLUTIONS from IDB's original discussion paper
>> Tree 3

    beta -> Uaniso -> Baniso -> Uiso -> Biso

>> can be made to work if the beta form can be made
>> invisible.  It cannot be completely invisible as it must appear in the
>> appropriate CIF dictionary, and its text description will be displayed
>> by any CIF editor such as publCIF or enCIFer.
>>
>> One possible solution is to include a flag in the dictionary definition
>> to indicate that the item should be hidden from the user or deleted
>> after the calculation is complete.
>>
>> A second possibility is to give the item a dataname that disguises its
>> identity, e.g., a name such as _atom_site_aniso.intermediate1. The
>> dictionary would contain the .description 'This item is an intermediate in
>> an ADP calculation and is not to be used for archival or retrieval
>> purposes'.
>>
>> A third solution would be to rearrange the method for calculating the
>> structure factors so that it works directly with Uaniso and does not
>> generate beta as an intermediate.  In this case there is no need to define
>> beta in the dictionary.

DD: I would tend towards the first solution, but with multiple evaluation
strategies (i.e. loop_ed), combined with dREL software that checks
for multiple iterations around an evaluation path and which falls back
to an alternative if it exists, as well as the flag to say don't print
this in a result CIF (and possibly hide this in an editor!).

James Hester (JH)
I agree with IDB's diagnosis of the problem, and, rather than clutter
the dictionary with unnecessary baggage as solutions 1 and 2 do, I
would suggest a variant of your 3rd solution:

Solution 4: As beta is used primarily for calculational convenience, a
dREL function 'beta()' is defined which calculates the beta value.
The structure factor calculation is rewritten to call this function.
A given dREL implementation can choose to cache values returned by
this function to improve efficiency.

In [David's] first solution invisibility can only be maintained if everyone respects the new DDLm attribute that will be created to flag it.  The value [of the item] will leak out.

The second solution is better than the first solution, but I believe it is an unnecessary cluttering of the dictionary.
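
A rough Python analogue of Solution 4 is sketched below. It is only an illustration under stated assumptions: dREL itself is not shown, the function names are hypothetical, beta is taken as beta_ij = 2*pi^2 * a*_i * a*_j * U_ij, and `functools.lru_cache` stands in for whatever caching an implementation might choose. The point is that beta exists only as a function result and never as a named data item that could be written to a CIF.

```python
import math
from functools import lru_cache


@lru_cache(maxsize=None)
def beta(u, recip):
    """Return the beta matrix as a tuple of tuples.

    u     : 3x3 U_ij in A^2, as a tuple of tuples (hashable, so cacheable)
    recip : (a*, b*, c*) reciprocal cell lengths in A^-1
    """
    k = 2.0 * math.pi ** 2
    return tuple(tuple(k * recip[i] * recip[j] * u[i][j] for j in range(3))
                 for i in range(3))


def temperature_factor(hkl, u, recip):
    """Anisotropic temperature factor exp(-sum_ij beta_ij * h_i * h_j)."""
    b = beta(u, recip)  # computed once per atom, then served from the cache
    exponent = sum(b[i][j] * hkl[i] * hkl[j]
                   for i in range(3) for j in range(3))
    return math.exp(-exponent)


# Hypothetical values for one atom.
u = ((0.030, 0.004, 0.002),
     (0.004, 0.025, -0.003),
     (0.002, -0.003, 0.040))
recip = (0.10, 0.08, 0.05)

t1 = temperature_factor((1, 0, 0), u, recip)
t2 = temperature_factor((0, 2, 1), u, recip)  # second call reuses cached beta
```

Because the cache is private to the function, there is no dataname under which the value could leak into an output file, which is the property James is after.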

Herbert Bernstein (HB)
I like the idea of defining useful functions in the dictionary that
will not in and of themselves generate tags, but we need to
provide some control over scope and namespaces to make it easy to
combine useful functions from multiple dictionaries -- perhaps
adopting python's module based dotted notation to resolve
conflicts.

JH: Yes.  Currently all dREL functions (and one hopes builtin functions in
the final standard) belong to DDLm category 'Function'.  We could
usefully sketch out a hierarchy of subcategories here when putting
together the final DDLm standard, or alternatively/additionally the
standard DDLm importation mechanism when importing other dictionaries
could resolve name conflicts.

NS: Correct, any function definition can be handled by the "functions" category.

A quick response to David's original post is "there is no identifiable
problem" in what David has written.

The interim data items don't have to be exported. Many of them are the
actual legitimate crystallographic objects (like the cell vectors rather
than their scalar dimensions), and hence I personally believe they should
be part of the dictionary. They are exported by our prototype parser only
because I am too lazy to clean up the output; I simply dump the in-memory
Python dictionary.

This aspect of what David sees as a problem, can be made to go away by using
DDLm's import facility. That is, the parser reads in the core dictionary
(with only the data items David/community would like to see in a submission
file) and imports a "fuller" dictionary to handle everything. On output the
parser can be restricted to only exporting the data items in the core
dictionary. Problem solved. The user would never see or know about the extra
data items.
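
The export restriction Nick describes can be pictured with a very small Python sketch. It assumes a dictionary can be reduced to a set of datanames, which ignores everything else a DDLm dictionary holds; the item names and the helper `export_items` are illustrative only, and the DDLm import mechanism itself is not modelled.

```python
# Items an author or archive would expect to see in a submission file.
CORE_ITEMS = {"_cell.length_a", "_cell.length_b", "_cell.length_c"}

# The "fuller" dictionary: the core items plus interim items.
FULL_ITEMS = CORE_ITEMS | {"_cell.vector_a"}


def export_items(working_set, export_dictionary):
    """Keep only the items defined in the dictionary chosen for output."""
    return {name: value for name, value in working_set.items()
            if name in export_dictionary}


# In-memory state after evaluation under the fuller dictionary.
working_set = {
    "_cell.length_a": 10.0,
    "_cell.length_b": 12.0,
    "_cell.length_c": 15.0,
    "_cell.vector_a": [10.0, 0.0, 0.0],  # interim item, never exported
}

unknown = set(working_set) - FULL_ITEMS               # checked against full
archive = export_items(working_set, CORE_ITEMS)       # written against core
```

The interim item is fully defined and usable during the calculation, yet it is absent from `archive` because the output is filtered against the core dictionary alone.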

James' idea of creating functions would work also, but there are two quite
different classes of items here. The first are truly library/utility
functions, like those that split the ORTEP-like object 2_567 into a symmetry
pointer 2 and a cell displacement vector [0, 1, 2]. That is a function.

The other class consists of legitimate crystallographic items that merit
definition and should not be obfuscated in code. For instance the cell
displacement vector is a legitimate item and merits definition.
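
For concreteness, such a utility function might look like the following Python sketch. The name `split_symmetry_code` is hypothetical, and the sketch accepts only the full n_klm form, rejecting anything else rather than guessing at defaults.

```python
def split_symmetry_code(code):
    """Split an ORTEP-style code such as '2_567' into (pointer, displacement).

    The pointer is the symmetry operation number; each translation digit
    is offset by 5, so '567' means a displacement of [0, 1, 2] cells.
    """
    pointer, _, translation = code.partition("_")
    if not pointer.isdigit() or len(translation) != 3 or not translation.isdigit():
        raise ValueError(f"not a symmetry code of the form n_klm: {code!r}")
    return int(pointer), [int(digit) - 5 for digit in translation]


pointer, displacement = split_symmetry_code("2_567")  # (2, [0, 1, 2])
```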
Crystallographically [0, 1, 2] is meaningful whereas _567 is actually
syntactic rubbish - albeit popular syntactic rubbish.

You may insist on only seeing _567 but to deny the ability to define a truly
crystallographic object like [0, 1, 2] is not sensible. Especially when it
can be hidden using an importing functionality.

The solution I describe by using importation will solve any perceived
problem and is the very basis on which DDLm had an importing functionality
created.

IDB's PROPOSED SOLUTION
Branching will be allowed in definitions, though a tree structure would be necessary, with items arranged in a hierarchy to ensure that loops are not created and that all paths end in experimental or assigned items such as cell dimensions or symmetry operations.  The application of this will require care to respect the crystallographic integrity of the definitions, e.g., if U is calculated from B there should be no route by which B can be calculated from U.  The question remains how this is implemented: by if-then or loop construction.  I welcome advice on the best way to include multiple methods without having to test a potentially missing item.
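
Whether a set of definitions really forms a tree could be checked when the dictionary is built rather than when a CIF is read. The Python sketch below is one possible check, under the assumption that each derived item can be summarised by the list of items its method reads (all branches pooled together); `find_cycle`, `dangling_leaves` and the item names are hypothetical.

```python
def find_cycle(depends_on):
    """Return one dependency loop as a list of names, or None if there is none.

    depends_on : derived item name -> names of the items its method reads
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state, path = {}, []

    def visit(name):
        state[name] = GREY
        path.append(name)
        for dep in depends_on.get(name, ()):
            seen = state.get(dep, WHITE)
            if seen == GREY:                      # back on the current path
                return path[path.index(dep):] + [dep]
            if seen == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[name] = BLACK
        return None

    for name in list(depends_on):
        if state.get(name, WHITE) == WHITE:
            cycle = visit(name)
            if cycle:
                return cycle
    return None


def dangling_leaves(depends_on, experimental):
    """Items that are read by some method but are neither derived nor experimental."""
    used = {dep for deps in depends_on.values() for dep in deps}
    return sorted(used - set(depends_on) - set(experimental))


experimental = {"B11", "B22", "B33", "B12", "B13", "B23"}
tree = {"U_aniso": ["B_aniso"], "B_aniso": sorted(experimental)}
looped = dict(tree, B_aniso=["U_aniso"])  # B may now also be derived from U

ok = find_cycle(tree)         # None: every path ends in experimental items
loop = find_cycle(looped)     # ['U_aniso', 'B_aniso', 'U_aniso']
```

A dictionary that passes both checks has no closed loops and every path ends in an experimental or assigned item, so the U-from-B versus B-from-U situation is caught before any data are processed.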

Various solutions for the treatment of intermediate items have their supporters, including flagging them, eliminating them in favour of functions, or segregating them into a 'derived-items' dictionary.  It is clear that most intermediates are good crystallographic items that would not be out of place in the CIF output, but, leaving aside the special problem of beta, there would be a danger of cluttering up the CIF with large numbers of derived items, many of them duplicating in a different format information already present.  A further danger is one that Herbert pointed out in Osaka.  He suggested that DDLm CIF dictionaries would make CIF more dynamic compared with the static character of CIF1 and CIF2, and that if some experimental or assigned value (e.g., the cell constants) were changed there would be no way we could ensure that all the derived items would be automatically updated.

The relevant items in DDLm dictionaries can be classified as either derived or experimental (or assigned), ignoring those used for data management and description, which do not concern us.  The derived items will all have methods; in principle the experimental values will not.  Following Nick's suggestions, we could arrange that the derived items are placed in a 'derived-items' dictionary, while those giving the experimental and assigned values, such as cell constants and space group symmetry, would appear in an archive dictionary.

This represents an important change in the way we handle and think about CIF: the basic experimental information would be found in an archival CIF, but reading this CIF under DDLm dictionary control would allow any desired derived items to be retrieved as if they had been originally in the CIF.  These requested items could be passed to a user program or exported as a CIF, though the default exported CIF would contain only items from the archival CIF.  Provision would be needed to allow derived items already in the archival CIF to be optionally retained rather than recalculated.  Derived items currently present in CIF such as cell volume and calculated density would appear in the derived-items dictionary.  The derived-items dictionary would contain a rich supply of derived items.  For example, a person requiring a set of bond vectors could retrieve these by supplying the labels and symops of the terminal atoms, which in turn might be generated by an external program so as to include all the bond distances of interest to the user.  The derived-items dictionary could include a definition of the beta matrix since this might be required by a user program designed to transform cell settings, but being in the derived-items dictionary there would be little temptation to use it archivally.  Editors such as publCIF which are designed to help in producing archival CIFs would not need to import the derived-items dictionary.
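
The retrieval behaviour described here, including the option to retain or recalculate a derived item already present in the archival CIF, might be sketched in Python as follows. This is an assumption-laden illustration, not a description of any DDLm software: `get_item`, the `DERIVED` table and the numerical values are hypothetical, and only the cell volume (from the standard triclinic formula) is defined as a derived item.

```python
import math


def cell_volume(a, b, c, alpha, beta, gamma):
    """Cell volume from lengths (A) and angles (degrees)."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1.0 - ca * ca - cb * cb - cg * cg
                                 + 2.0 * ca * cb * cg)


# Stand-in for the derived-items dictionary: name -> (inputs, method).
DERIVED = {
    "_cell.volume": (("_cell.length_a", "_cell.length_b", "_cell.length_c",
                      "_cell.angle_alpha", "_cell.angle_beta",
                      "_cell.angle_gamma"), cell_volume),
}


def get_item(cif, name, recalculate=False):
    """Return an item as if it were in the CIF, deriving it when necessary.

    A derived value already stored in the CIF is retained unless
    recalculate=True, in which case its method is evaluated afresh.
    """
    if name in cif and not (recalculate and name in DERIVED):
        return cif[name]
    inputs, method = DERIVED[name]  # KeyError if the item cannot be derived
    return method(*(get_item(cif, i, recalculate) for i in inputs))


# Archival CIF whose stored volume is stale (hypothetical values).
cif = {"_cell.length_a": 10.0, "_cell.length_b": 12.0, "_cell.length_c": 15.0,
       "_cell.angle_alpha": 90.0, "_cell.angle_beta": 90.0,
       "_cell.angle_gamma": 90.0, "_cell.volume": 1750.0}

retained = get_item(cif, "_cell.volume")                   # stored value
fresh = get_item(cif, "_cell.volume", recalculate=True)    # about 1800
```

The difference between `retained` and `fresh` is the situation Herbert raised in Osaka: a derived value stored alongside edited cell constants can silently go out of date, whereas a value derived on request cannot.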

Of course we could also make more use of functions as suggested by James and Herbert if there are no issues with importing functions, but only when the intermediate item is not a potentially useful derived item.

As is well known, the devil is in the details, and in adapting the CIF dictionaries I will have to make many decisions in matters of detail.  But if there is agreement on splitting the dictionaries into archive and derived-items dictionaries as described above, I will have a guideline to work with.  I will undoubtedly come back with other problems in the future but this split seems to be in the spirit of DDLm and appears to solve a number of important problems.

Is this plan acceptable to everyone?  If so, I will start to apply it and move this discussion on to the next problem.

David Brown



I. David Brown
Professor Emeritus
Brockhouse Institute for Materials Research, McMaster University
King St. W, Hamilton, Ontario L8S 4M1, Canada
Email: idbrown@mcmaster.ca
Tel: +905 525 9140 x 24710
Fax: +905 521 2773
