Discussion List Archives


Re: CIF Infoset

Here are a few more comments from IDB:

So how do you intend to get around this namespace issue? No CIFs that I
have encountered have ever declared their conformance to any dictionary.
Even if they did, there is something called the dictionary stacking
protocol, which allows those definitions to be overridden without
declaring a namespace. On top of that there is the boundless capacity for
making up your own data names on the fly, for which there may never be any
dictionary definition at all. How can you reliably assign anything but a
generic namespace to an infoset? It's all just ad hoc guesswork.

The core dictionary defines three items which can be looped:
    _audit_conform_dict_name
    _audit_conform_dict_version
    _audit_conform_dict_location        # Contains the URL where the dictionary can be found
As far as I know these have not been widely used - Acta Cryst. should start insisting that these be included in submitted papers.  There is no need to give the dictionary version in anything as ephemeral as a comment.
# start Validation Reply Form
_vrf_DIFF020_114
;PROBLEM: _diffrn_standards_interval_count and
RESPONSE: ... We have used an image-plate system
;

If intelligent software was ever intended to deal with such _vrf_s, why
embed the only pointer to their purpose in supposedly non-parsable data
names rather than in looped, discrete sets of tags such as

loop_
    _vrf_suite _vrf_subroutine _vrf_error_code _vrf_authors_response
This would tidy things up, but the parser must be able to handle ad hoc data names without choking.
Q Is the order of "rows" in a loop_ unimportant? 
        
Yes (in CIF).
      
That is very useful (and non-obvious from the spec). It then makes it
possible to confirm the identity of two sets of coordinates, symmetry
operations, etc.

It is also debatable. 
The very recent introduction of _symmetry_equiv_pos_site_id means that
the data integrity of the majority of prior archived CIFs containing tag 
values like:    _geom_bond_site_symmetry_1  "4_564"
would be seriously impaired by a change of order in the 
loop_  _symmetry_equiv_pos_as_xyz
This was a serious omission in the first version of CIF (you have to remember that this was produced before we even considered writing dictionaries in STAR format).  As you point out we have introduced the list reference _symmetry_equiv_pos_site_id (which incidentally has now been superseded by _space_group_symop_id taken from the symmetry_cif dictionary - a dictionary which takes a more systematic and forward-looking approach to symmetry).  Again Acta Cryst. should insist on the inclusion of these ids.
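
To make the dependence on loop order concrete, here is a minimal Python sketch. It is not taken from any CIF library: the function names and the four sample operators are assumptions made up for illustration. It decodes a site symmetry code such as 4_564 using the usual n_klm convention (operator number n, lattice translation k-5, l-5, m-5) and then looks the operator up by an explicit id rather than by its position in the loop.

```python
def parse_site_symmetry(code):
    """Split a code such as '4_564' into (operator number, (tx, ty, tz)).

    Assumes the n_klm convention in which the digit 5 means no translation.
    A bare operator number such as '4' is read as zero translation.
    """
    symop, _, shifts = code.partition("_")
    if not shifts:
        return int(symop), (0, 0, 0)
    if len(shifts) != 3 or not shifts.isdigit():
        raise ValueError(f"unrecognised site symmetry code: {code!r}")
    return int(symop), tuple(int(digit) - 5 for digit in shifts)


def symop_by_id(rows, symop_id):
    """Look up an operator by its explicit id; the row order is irrelevant."""
    return dict(rows)[symop_id]


def symop_by_position(rows, number):
    """Look up an operator by its 1-based row position; order-dependent."""
    return rows[number - 1][1]


# Hypothetical loop contents: (id, operator as xyz) pairs.
rows = [
    (1, "x,y,z"),
    (2, "-x,y+1/2,-z+1/2"),
    (3, "-x,-y,-z"),
    (4, "x,-y+1/2,z+1/2"),
]
shuffled = list(reversed(rows))

number, translation = parse_site_symmetry("4_564")
```

With explicit ids, symop_by_id(rows, number) and symop_by_id(shuffled, number) return the same operator, whereas symop_by_position gives different answers for the two orderings. That is exactly the hazard for archived CIFs that carry no id column.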
I had a hazy recollection that "this is a string" and this_is_a_string
were equally valid CIF constructs containing identical information
content, used for example in space group names. Would they be formally
identical in an infoset? Does the white space in all strings have to be
normalised (is that the right word?)?
We had a discussion of this point while preparing the symmetry_CIF dictionary and came to the decision that these two strings were not equivalent, i.e., underscore is not white space.  For that reason P_21/c is no longer regarded as a valid space group symbol, although there is a warning that some heritage CIFs may use that convention.  There is an enumeration list for _space_group_name_H-M_ref which explicitly allows only 'P 21/c'.  Other space group symbols are similarly defined.
Would 1.2(2) and 1.3(2) be equivalent in an infoset? Lexically they are 
different, but semantically they are the same value, within error.
They are not semantically the same, though they are not (scientifically) significantly different.  The distinction is important.
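
The three notions in play here (lexically equal, numerically equal, not significantly different) can be kept apart in code. The following Python sketch is only an illustration under stated assumptions: the function names are invented, the parser handles only plain decimals with a parenthesised standard uncertainty (no exponents), and the threshold of three combined standard uncertainties is a common rule of thumb, not something the CIF specification prescribes.

```python
import math
import re

# Assumption: a plain decimal followed by an s.u. in parentheses, e.g. 1.2(2).
_VALUE_SU = re.compile(r"^([+-]?\d*\.?\d+)\((\d+)\)$")


def parse_value_su(text):
    """Return (value, standard uncertainty) for a string such as '1.2(2)'.

    The s.u. digits apply to the last decimal places of the value,
    so '1.2(2)' means 1.2 with an s.u. of 0.2.
    """
    match = _VALUE_SU.match(text)
    if match is None:
        raise ValueError(f"not a value with s.u.: {text!r}")
    mantissa, su_digits = match.groups()
    decimals = len(mantissa.split(".")[1]) if "." in mantissa else 0
    return float(mantissa), int(su_digits) * 10 ** (-decimals)


def significantly_different(a, b, k=3.0):
    """True if the values differ by more than k combined s.u. (k is assumed)."""
    value_a, su_a = parse_value_su(a)
    value_b, su_b = parse_value_su(b)
    return abs(value_a - value_b) > k * math.hypot(su_a, su_b)


a, b = "1.2(2)", "1.3(2)"
same_text = a == b                                      # lexical comparison
same_value = parse_value_su(a) == parse_value_su(b)     # semantic comparison
distinguishable = significantly_different(a, b)         # scientific comparison
```

For 1.2(2) and 1.3(2) the first two comparisons say the items differ and the third says the difference is not significant, which is the distinction drawn above.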
The difficulty is not preserving the data type, but the semantics of
downstream decisions. If one author writes _my_phone "123-45678"
they are announcing this is not a number, while if another writes
_my_phone 123-45678 they are announcing it is a number. The
discussion so far seems to suggest that these statements overrule
the datatypes specified in the dictionary entries. There is a
particular problem in loop_s, where it is then possible to have
different data types within a column:

loop_ _atom_site_occupancy
1.0
0.3
"not refined"
"0.3"
"."

which makes the implementation very difficult. I believe that a
programmer should be able to look up the data type in the dictionary
entry and write a routine that relies on a value being of the
correct data type and throws an exception if not.
        
  
It is much worse than this.  There is a definition of what constitutes a number in DDL1 but it is given only as text and that only by way of examples (which incidentally do not include 123-45678).  The examples may not be intended as an exhaustive list, but no other guidance is given.  DDL2 is both better and worse since, although numbers are defined in terms of regular expressions, each dictionary defines its own set of data types and there appears to be no limit on how many data types are defined.  It sounds to me as if all values should be treated as data strings unless a dictionary is used and the appropriate data types defined in the infoset.  Then some means is needed to preserve these types (if possible) in any realization of the infoset, e.g., by writing them in XML or a different version of CIF.  In any case DDL1 certainly needs to tighten up its definition of a number if typing is going to be important.
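
A rough Python sketch of that policy follows: every value is held as a string (plus a note of whether it was quoted), and a type is applied only when a dictionary declares one. Everything here is an assumption for illustration: the class and function names are invented, the regular expression is one possible tightened definition of a number and not the DDL1 definition, and rejecting quoted values in a numeric column is a design choice, not an agreed rule.

```python
import re
from dataclasses import dataclass

# Assumed number syntax: optional sign, decimal, optional exponent, optional s.u.
_NUMBER = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?(\(\d+\))?$")


@dataclass(frozen=True)
class RawValue:
    text: str      # the characters of the value, without any quotes
    quoted: bool   # True if the value was written inside quotes


class CifTypeError(ValueError):
    """Raised when a value does not match the type declared in a dictionary."""


def interpret(raw, declared_type=None):
    """Apply a dictionary type to a raw value.

    With no dictionary type the value stays a string. For the DDL1 type
    code 'numb', unquoted '.' and '?' are read as missing (None), and
    quoted or non-numeric values raise CifTypeError (an assumed policy).
    """
    if declared_type != "numb":
        return raw.text
    if not raw.quoted and raw.text in (".", "?"):
        return None
    if raw.quoted or not _NUMBER.match(raw.text):
        raise CifTypeError(f"expected a number, got {raw.text!r}")
    return float(re.sub(r"\(\d+\)$", "", raw.text))   # drop the s.u., if any


def column_errors(values, declared_type):
    """Return the texts of the values that fail the declared type."""
    bad = []
    for raw in values:
        try:
            interpret(raw, declared_type)
        except CifTypeError:
            bad.append(raw.text)
    return bad


# The mixed _atom_site_occupancy column from the example above.
occupancy = [
    RawValue("1.0", False),
    RawValue("0.3", False),
    RawValue("not refined", True),
    RawValue("0.3", True),
    RawValue(".", True),
]
```

Here column_errors(occupancy, "numb") reports the three quoted entries, while interpret(value) with no declared type returns every entry unchanged as a string. An unquoted 123-45678 also fails the assumed number pattern, so under this sketch it would be rejected in a numeric column instead of being silently accepted.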

Good luck!

David

-- 
Dr. I.D.Brown, Professor Emeritus,
Department of Physics and Astronomy
McMaster University, Hamilton
Ontario, Canada
