A DDLm problem
- To: "Discussion list of the IUCr Committee for the Maintenance of the CIFStandard (COMCIFS)" <email@example.com>, coreCIFchem@iucr.org
- Subject: A DDLm problem
- From: David Brown <firstname.lastname@example.org>
- Date: Tue, 24 Feb 2009 14:38:19 -0500
I have now resumed work converting the coreCIF dictionary to the DDLm standard. This email describes a problem I have encountered in the treatment of intermediates generated during the application of a method. I need your feedback to ensure that the solution I am suggesting is generally acceptable. Skip to the end of the email if you want to know my proposed solution (though I recommend reading the rest of the email to find out what problem the solution is designed to resolve).
Let me remind you that the main new feature in DDLm is the inclusion of 'methods' in the CIF dictionaries. Methods are machine-executable algebraic expressions that can be used by a program to calculate the value of a derived item from measured or assigned items in the CIF.
You are receiving this email because you are on one or more mailing lists of people whose advice and approval I need for dealing with a number of issues that methods raise. It is important that these issues be discussed while there is still time to influence the decisions that will have to be made. The first of these issues is described in this email.
I am circulating this email to two lists, and if you are on both you will get it twice. I apologize and recommend that you quickly delete the second copy (unless you wish to reply to both lists).
METHODS ARE THE NEW DEFINITIONS
At the meeting of COMCIFS in Osaka it was decided that when a method is present in the dictionary it takes precedence over text in defining the item. One immediate corollary to this is that only one method is allowed for each data item. If a program call is made to an item that is not present in the CIF, the method will initiate a call to the other items needed to calculate its value. These in turn may call further items. The route between a derived item and the measured or assigned values in terms of which it is ultimately defined constitutes a tree, and because the method is the definition, the tree must be unique.
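The call chain described above can be sketched in Python. This is only an illustration under stated assumptions, not how any DDLm implementation works: the CIF is modelled as a plain dictionary, each dataname has at most one method, and the names make_getter and _demo.* are hypothetical.

```python
from typing import Callable, Dict

Getter = Callable[[str], float]
Method = Callable[[Getter], float]


def make_getter(cif: Dict[str, float], methods: Dict[str, Method]) -> Getter:
    """Return a lookup that falls back on the item's single method."""

    def get(name: str) -> float:
        # An item present in the CIF is returned as it stands.
        if name in cif:
            return cif[name]
        # An absent item can only be obtained through its one method.
        if name not in methods:
            raise KeyError(f"{name} is absent and has no method")
        # The method calls `get` for the items it needs, which may in
        # turn trigger further methods: this is the definition tree.
        value = methods[name](get)
        cif[name] = value  # assumption: derived values are kept in the CIF
        return value

    return get


# Hypothetical items, chosen only to show a two-level tree.
cif = {"_demo.width": 2.0, "_demo.height": 3.0, "_demo.depth": 4.0}
methods = {
    "_demo.area": lambda get: get("_demo.width") * get("_demo.height"),
    "_demo.volume": lambda get: get("_demo.area") * get("_demo.depth"),
}
get = make_getter(cif, methods)
volume = get("_demo.volume")  # calls _demo.area, which calls width and height
```

Because each name maps to exactly one method, the route from _demo.volume down to the stored values is fixed, which is the uniqueness property required of a definition.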
'If-then' constructions give rise to branching. They must therefore be treated with care because they alter the definition of an item according to the value of the CIF item that is tested. It is reasonable to include such a construction if the branching depends, e.g., on whether the structure was determined by X-ray or neutron diffraction, but is it wise to use it if it depends only on the way the CIF is structured, rather than the conditions of the experiment that is being reported? In other words, should a definition in this case be made to depend on the value of a second item that might not actually be present in the CIF?
AN EXAMPLE OF A DEFINITION TREE
Atomic displacement parameters (ADPs) illustrate the kind of problems that can arise. ADPs are expressed in a number of different forms such as B, U and beta (the latter with two different definitions). Furthermore, each form also has an isotropic and an anisotropic version. When the original core dictionary was prepared we decided to standardize on U because it has a direct physical significance in real space. Standardizing on a single form simplifies programming because anyone reading a CIF can rely on finding the ADPs in the U form. However, in response to an insistent request from the macromolecular community, we also allowed ADPs to be given in the B form since this form is universally used in that field. Thus anyone reading a CIF must now be prepared to find ADPs in either the U or the B form. There is currently no definition of the beta form.
The ADP is a measured quantity and therefore cannot strictly be calculated, but with DDLm we can make life easier by adding a method to the definition of U that will calculate U from B if B, but not U, is present. This gives rise to the tree 1.
1 U -> B
(The arrow -> indicates that U calls B to convert the contents of B to U.)
U is now treated as a derivable item; but the value appearing in the CIF may be either directly measured or may be derived from the directly measured B if the ADP was originally stored in the CIF as B. Introducing this method means that any program looking for an ADP now only has to look for U. If the ADPs are given in the B form, a call to U will automatically result in a conversion from B to U. The reverse of course will not be true since the tree is unique and cannot be read backwards.
However, life is more complicated than this because the ADPs may be isotropic or anisotropic, so the full sequence is shown as tree 2:
2 Uaniso -> Baniso -> Uiso -> Biso
which is beginning to look a little cumbersome but, hey, the computer doesn't care if it has to make three conversions instead of just one, and one can always express an isotropic ADP in the form of an anisotropic one. Of course if an external program intercepts this sequence further down, say at Baniso, it will fail to find an ADP given as Uaniso. There is no way around this difficulty, but if the hierarchy is defined and understood, it shouldn't be a problem. Note that the iso- versions will only be called if no aniso versions are present so there is no problem if both are given. If the Ueq (which is stored as Uiso) is required the sequence can be intercepted at Uiso. If U and B are both present (by some accident) then U automatically takes precedence since if U is present there will be no call to B.
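Tree 2 can be sketched as four small functions, each of which looks for its own item and calls the next one only if that item is absent. This is a minimal sketch under assumptions that are not part of the proposal: the short keys Uaniso, Baniso, Uiso and Biso stand in for the real datanames, tensors are stored as [X11, X22, X33, X12, X13, X23], and the isotropic-to-anisotropic step assumes orthogonal axes so that the off-diagonal terms are zero. The only physics used is the standard relation B = 8*pi^2*U.

```python
import math
from typing import Any, Dict, List

EIGHT_PI_SQ = 8.0 * math.pi ** 2  # B = 8*pi^2 * U

Site = Dict[str, Any]


def get_biso(site: Site) -> float:
    # End of the tree: Biso has no method, so it must be present.
    if "Biso" not in site:
        raise KeyError("no ADP found in any form")
    return site["Biso"]


def get_uiso(site: Site) -> float:
    if "Uiso" not in site:
        site["Uiso"] = get_biso(site) / EIGHT_PI_SQ
    return site["Uiso"]


def get_baniso(site: Site) -> List[float]:
    if "Baniso" not in site:
        b = EIGHT_PI_SQ * get_uiso(site)
        # Assumption: orthogonal axes, so an isotropic ADP becomes a
        # diagonal tensor with zero cross terms.
        site["Baniso"] = [b, b, b, 0.0, 0.0, 0.0]
    return site["Baniso"]


def get_uaniso(site: Site) -> List[float]:
    # Head of the tree: the only call an external program needs.
    if "Uaniso" not in site:
        site["Uaniso"] = [b / EIGHT_PI_SQ for b in get_baniso(site)]
    return site["Uaniso"]


site = {"Biso": EIGHT_PI_SQ * 0.05}
u = get_uaniso(site)  # three conversions; Uiso and Baniso are stored on the way
```

If Uaniso is already present, get_uaniso returns it without looking any further, so a stray Baniso, Uiso or Biso in the same CIF is never consulted.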
So far so good. No real problems.
THE PROBLEM OF INTERMEDIATE ITEMS
Some method calls may involve the calculation of intermediate items not currently included in the dictionary. An example, taken from the proof-of-concept DDLm CIF dictionary, is the beta form of the ADP which is used in the calculation of structure factors because it makes the calculation simpler. The method for generating beta uses an 'if-then' construction to test the value of _atom_site.adp_type to decide whether to calculate beta from Uiso, Biso, Uaniso or Baniso. The method for _atom_site.beta thus contains several different algorithms for making the conversion, depending on the value of .adp_type. This means that the definition is not unique but is determined by the value of an item that is not itself part of the tree. Problems could arise if .adp_type is missing (a distinct possibility) or if it does not correctly describe the ADP format, either of which would cause the calculation to fail. The sequential calls described in tree 2 are more robust because the calculation is based on the presence or absence of the items in the tree itself.
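The difference in robustness can be illustrated with two ways of choosing the source item for the conversion. This is a sketch only: the function names and short keys are hypothetical, and the flag values Uani, Bani, Uiso and Biso are assumed here for illustration rather than taken from the proof-of-concept dictionary.

```python
from typing import Any, Dict

# Order of tree 2: the first item found is the one used.
ORDER = ["Uaniso", "Baniso", "Uiso", "Biso"]

# Assumed mapping from adp_type values to the item they describe.
FLAG_TO_ITEM = {"Uani": "Uaniso", "Bani": "Baniso", "Uiso": "Uiso", "Biso": "Biso"}


def source_by_flag(site: Dict[str, Any]) -> str:
    """If-then style: trust adp_type to say where the ADP is stored."""
    flag = site.get("adp_type")
    if flag not in FLAG_TO_ITEM:
        raise ValueError("adp_type is missing or unrecognised")
    item = FLAG_TO_ITEM[flag]
    if item not in site:
        raise KeyError(f"adp_type says {flag} but {item} is absent")
    return item


def source_by_presence(site: Dict[str, Any]) -> str:
    """Tree style: use whichever item in the tree is actually present."""
    for item in ORDER:
        if item in site:
            return item
    raise KeyError("no ADP present in any form")


unflagged = {"Biso": 1.2}                       # adp_type missing
mislabelled = {"adp_type": "Uani", "Biso": 1.2}  # adp_type wrong
```

Both test sites contain a usable Biso. source_by_presence finds it in each case, whereas source_by_flag fails on both, once because the flag is absent and once because it points at an item that is not there.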
Since beta is an intermediate which must have its own method, it must have a dictionary definition and it must have a place in tree 2. Only the beginning or the end of the tree makes any sense, and since it cannot go at the end for various obvious reasons, not the least being that in that position it is not allowed to have a method, the only logical place is at the beginning of the sequence,
3 beta -> Uaniso -> Baniso -> Uiso -> Biso
Since external programs should access ADPs using the item at the head of the sequence, they should call beta rather than Uaniso, even though U has now superseded beta in normal use. We cannot pretend that beta does not exist since it is defined in the dictionary, and sooner or later someone is going to archive their measured ADPs in this form. This almost certainly will happen since many early papers report the ADPs in the beta form and they are stored this way in the ICSD (for example). ICSD has indicated that they would prefer to output ADPs from these early papers in the beta form if it were available. However, the ADPs would not then be found by an external program that calls Uaniso, and most modern programs would consider it retrograde to have to call beta.
ANOTHER GOOD REASON FOR DISCOURAGING THE USE OF BETA
There is an excellent unrelated reason for not wanting to include beta in the dictionary. There are two incompatible definitions of beta, depending on whether the '2' in the cross term is explicit or implicit. Unfortunately most papers that report betas do not state which convention they use, which makes the information virtually useless. The definition in the proof-of-concept dictionary arbitrarily chooses one of these conventions, but many people are not aware that there is an ambiguity, and allowing people to archive ADPs as betas would likely lead to many incorrect CIFs. CIF should strive to avoid making it easy for such errors to occur.
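The size of the ambiguity is easy to show numerically. The sketch below evaluates the temperature factor exp(-(beta11*h^2 + beta22*k^2 + beta33*l^2 + cross terms)) under both readings: with the '2' explicit the cross terms are 2*beta12*h*k and so on, and with it implicit they are beta12*h*k. The function name, the argument layout and the sample numbers are illustrative assumptions, not values from any dictionary or paper.

```python
import math
from typing import Sequence


def temperature_factor(beta: Sequence[float], hkl: Sequence[int],
                       explicit_two: bool) -> float:
    """Temperature factor for beta = [b11, b22, b33, b12, b13, b23]."""
    b11, b22, b33, b12, b13, b23 = beta
    h, k, l = hkl
    cross = b12 * h * k + b13 * h * l + b23 * k * l
    if explicit_two:
        # Convention with the '2' written out: 2*b12*h*k + ...
        cross *= 2.0
    return math.exp(-(b11 * h * h + b22 * k * k + b33 * l * l + cross))


# Made-up numbers, for illustration only.
beta = [0.01, 0.02, 0.03, 0.004, 0.0, 0.0]
hkl = (1, 1, 0)

t_explicit = temperature_factor(beta, hkl, explicit_two=True)
t_implicit = temperature_factor(beta, hkl, explicit_two=False)
```

The same six numbers give two different temperature factors whenever a cross term is non-zero, and nothing in the numbers themselves reveals which reading was intended. Only the diagonal terms are safe from the ambiguity.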
Tree 3 above can be made to work if the beta form can be made invisible. It cannot be completely invisible as it must appear in the appropriate CIF dictionary, and its text description will be displayed by any CIF editor such as publCIF or enCIFer.
One possible solution is to include a flag in the dictionary definition to indicate that the item should be hidden from the user or deleted after the calculation is complete.
A second possibility is to give the item a dataname that disguises its identity, e.g., a name such as _atom_site_aniso.intermediate1. The dictionary would contain the .description 'This item is an intermediate in an ADP calculation and is not to be used for archival or retrieval purposes'.
A third solution would be to rearrange the method for calculating the structure factors so that it works directly with Uaniso and does not generate beta as an intermediate. In this case there is no need to define beta in the dictionary.
THE SOLUTION I PROPOSE TO ADOPT
I propose that we adopt the second solution, i.e., we agree to use datanames and descriptions that indicate that the item is not to be used for archival purposes. There will of course be many intermediates that are perfectly acceptable for archiving. For example, when Uaniso is calculated from Biso, Uiso and Baniso are generated along the way and there is no need to hide them. Calculation of structure factors would also generate _atom_site_aniso.intermediate1 items containing the ADPs in the beta form, so the CIF may end up being a bit cluttered, but it should be possible to write a program that would clean up the CIF by removing any unwanted intermediate items. In any case, since external programs only need to search for Uaniso, the presence of the other items will be of no concern.
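Such a clean-up program could be very short. The sketch below assumes, purely for illustration, that the CIF has been read into a dictionary keyed by dataname and that every hidden intermediate has a name ending in .intermediate followed by a number. The function name and the sample datanames other than _atom_site_aniso.intermediate1 are hypothetical.

```python
import re
from typing import Any, Dict

# Assumed naming rule: hidden intermediates end in ".intermediate<N>".
INTERMEDIATE = re.compile(r"\.intermediate\d+$")


def strip_intermediates(cif: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the CIF without the hidden intermediate items."""
    return {name: value for name, value in cif.items()
            if not INTERMEDIATE.search(name)}


cluttered = {
    "_atom_site_aniso.U_11": 0.05,            # archival item: kept
    "_atom_site_aniso.intermediate1": 0.002,  # beta in disguise: removed
    "_atom_site.U_iso_or_equiv": 0.05,        # acceptable intermediate: kept
}
cleaned = strip_intermediates(cluttered)
```

Intermediates that are acceptable for archiving keep their ordinary names and are therefore left untouched. Only items following the disguised naming rule are removed.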
I am looking for feedback. However, if I receive none, I will assume that you agree that unwanted intermediates should be hidden by giving them meaningless datanames and text definitions that conceal their content.
Please circulate your thoughts on this problem to the whole discussion list.
begin:vcard fn:I.David Brown n:Brown;I.David org:McMaster University;Brockhouse Institute for Materials Research adr:;;King St. W;Hamilton;Ontario;L8S 4M1;Canada email;internet:email@example.com title:Professor Emeritus tel;work:+905 525 9140 x 24710 tel;fax:+905 521 2773 version:2.1 end:vcard