Re: CIF Infoset
- To: "Discussion list of the IUCr Committee for the Maintenance of the CIF Standard (COMCIFS)" <firstname.lastname@example.org>
- Subject: Re: CIF Infoset
- From: David Brown <email@example.com>
- Date: Mon, 30 Aug 2004 15:15:17 -0400
- In-Reply-To: <Pine.LNX.firstname.lastname@example.org>
- References: <Pine.LNX.email@example.com>
Here are a few more comments from IDB:
> So how do you intend to get around this namespace issue? No CIFs that I have encountered have ever declared their conformance to any dictionary. Even if they did, there is something called the dictionary stacking protocol which allows those definitions to be overridden without declaring a namespace. On top of that there is the boundless capacity for making up your own data names on the fly for which there may never be any dictionary definition at all. How can you reliably assign anything but a generic namespace to an infoset? It's all just ad hoc guesswork.

The core dictionary defines three items which can be looped:
    _audit_conform_dict_name
    _audit_conform_dict_version
    _audit_conform_dict_location   # Contains the URL where the dictionary can be found
As far as I know these have not been widely used - Acta Cryst. should start insisting that these be included in submitted papers. There is no need to give the dictionary version in anything as ephemeral as a comment.
> # start Validation Reply Form
> _vrf_DIFF020_114
> ;
> PROBLEM: _diffrn_standards_interval_count and
> RESPONSE: ... We have used an image-plate system
> ;
>
> If intelligent software was ever intended to deal with such _vrf_s, why embed the only pointer to their purpose in supposedly non-parsable data names rather than in looped, discrete sets of tags such as
>
>     loop_
>     _vrf_suite
>     _vrf_subroutine
>     _vrf_error_code
>     _vrf_authors_response

This would tidy things up, but the parser must be able to handle ad hoc data names without choking.
> Q Is the order of "rows" in a loop_ unimportant?
>
> Yes (in CIF).
>
> That is very useful (and non-obvious from the spec). It then makes it possible to confirm the identity of two sets of coordinates, symmetry operations, etc.
>
> It is also debatable. The very recent introduction of _symmetry_equiv_pos_site_id means that the data integrity of the majority of prior archived CIFs containing tag values like:
>
>     _geom_bond_site_symmetry_1 "4_564"
>
> would be seriously impaired by a change of order in the loop_ _symmetry_equiv_pos_as_xyz

This was a serious omission in the first version of CIF (you have to remember that this was produced before we even considered writing dictionaries in STAR format). As you point out, we have introduced the list reference _symmetry_equiv_pos_site_id (which incidentally has now been superseded by _space_group_symop_id, taken from the symmetry_cif dictionary, a dictionary which takes a more systematic and forward-looking approach to symmetry). Again, Acta Cryst. should insist on the inclusion of these ids.
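To make the row-order point concrete, here is a minimal Python sketch. It assumes the usual n_klm reading of a symmetry code, in which n selects the symmetry operation and klm are lattice translations offset from 555. The operator list and the function names are invented for illustration and are not taken from any CIF library.

```python
# Illustrative sketch only: resolve a symmetry code such as "4_564" against a
# symmetry-operator loop, first by row position and then by an explicit id.

def parse_symmetry_code(code):
    """Split 'n_klm' into the operator number n and translations (k-5, l-5, m-5)."""
    n_text, _, klm = code.partition("_")
    if not klm:
        klm = "555"  # a bare operator number is taken to mean no translation
    if len(klm) != 3 or not klm.isdigit():
        raise ValueError(f"unsupported translation field in {code!r}")
    translation = tuple(int(digit) - 5 for digit in klm)
    return int(n_text), translation


def lookup_by_position(rows, n):
    """Heritage reading: operator n is simply the n-th row of the loop (1-based)."""
    return rows[n - 1][1]


def lookup_by_id(rows, n):
    """With an explicit id column the row order no longer matters."""
    return {row_id: xyz for row_id, xyz in rows}[n]


# Hypothetical operators as (id, xyz) pairs, plus the same rows in reverse order.
symops = [
    (1, "x,y,z"),
    (2, "-x,y+1/2,-z+1/2"),
    (3, "-x,-y,-z"),
    (4, "x,-y+1/2,z+1/2"),
]
shuffled = list(reversed(symops))

n, shift = parse_symmetry_code("4_564")
```

Looking the operator up by id gives the same answer for `symops` and `shuffled`, whereas `lookup_by_position` silently returns a different operator once the rows are reordered. That is the hazard for archived CIFs that carry no id column.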
> I had a hazy recollection that "this is a string" and this_is_a_string were equally valid CIF constructs containing identical information content, used for example in space group names. Would they be formally identical in an infoset? Does the white space in all strings have to be normalised (is that the right word?)?

We had a discussion of this point while preparing the symmetry_CIF dictionary and came to the decision that these two strings were not equivalent, i.e., underscore is not white space. For that reason P_21/c is no longer regarded as a valid space group symbol, although there is a warning that some heritage CIFs may use that convention. There is an enumeration list for _space_group_name_H-M_ref which explicitly allows only 'P 21/c'. Other space group symbols are similarly defined.
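A small Python sketch of what "underscore is not white space" implies for software that checks a value against an enumeration list. The one-entry set below is only a stand-in for the real (much longer) list for _space_group_name_H-M_ref, and the helper names are made up for this example.

```python
# Illustrative sketch only: exact matching against an enumeration list.

# Stand-in for the dictionary's enumeration list; the real list is far longer.
ALLOWED_HM_REF = {"P 21/c"}


def is_valid_hm_ref(symbol):
    """Exact comparison: no underscore-to-space folding, no whitespace normalisation."""
    return symbol in ALLOWED_HM_REF


def heritage_to_current(symbol):
    """Explicit, opt-in conversion of the heritage underscore convention."""
    return symbol.replace("_", " ")
```

Here `is_valid_hm_ref("P_21/c")` is false, while `is_valid_hm_ref(heritage_to_current("P_21/c"))` is true. Keeping the conversion as a separate, deliberate step means a reader of heritage CIFs can accept the old form without the validator ever treating the two strings as equivalent.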
> Would 1.2(2) and 1.3(2) be equivalent in an infoset? Lexically they are different, but semantically they are the same value, within error.

They are not semantically the same, though they are not (scientifically) significantly different. The distinction is important.
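The three notions can be kept apart in code. The following Python sketch is a simplification: it handles only plain decimal numbers with an optional standard uncertainty in parentheses (no exponents), and the "not significantly different" test, a difference no larger than the combined uncertainty, is an assumption chosen for illustration rather than anything laid down by CIF.

```python
# Illustrative sketch only: lexical identity, identical value, and agreement
# within the standard uncertainty are three different comparisons.
import math
import re

_NUMBER_WITH_SU = re.compile(r"^([+-]?\d*\.?\d+)(?:\((\d+)\))?$")


def parse_value(text):
    """Return (value, su) for strings such as '1.2(2)'; su is 0.0 when absent."""
    match = _NUMBER_WITH_SU.match(text)
    if match is None:
        raise ValueError(f"not a simple number: {text!r}")
    mantissa, su_digits = match.groups()
    value = float(mantissa)
    if su_digits is None:
        return value, 0.0
    decimals = len(mantissa.partition(".")[2])
    # The su applies to the last quoted digits of the mantissa.
    return value, int(su_digits) * 10 ** -decimals


def lexically_equal(a, b):
    return a == b


def same_value(a, b):
    return parse_value(a) == parse_value(b)


def agree_within_su(a, b):
    """Assumed criterion: |a - b| does not exceed the combined su."""
    (va, sa), (vb, sb) = parse_value(a), parse_value(b)
    return abs(va - vb) <= math.hypot(sa, sb)
```

For "1.2(2)" and "1.3(2)" the first two comparisons are false and only `agree_within_su` is true, so an infoset that collapsed them into one value would discard exactly the distinction made above.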
> The difficulty is not preserving the data type, but the semantics of downstream decisions. If one author writes _my_phone "123-45678" they are announcing this is not a number while if another writes _my_phone 123-45678 they are announcing it is a number.
> The discussion so far seems to suggest that these statements overrule the datatypes specified in the dictionary entries. There is a particular problem in loop_s, where it is then possible to have different data types within a column:
>
>     loop_
>     _atom_site_occupancy
>     1.0
>     0.3
>     "not refined"
>     "0.3"
>     "."
>
> which makes the implementation very difficult. I believe that a programmer should be able to look up the data type in the dictionary entry and write a routine that relies on a value being of the correct data type and throws an exception if not.

It is much worse than this. There is a definition of what constitutes a number in DDL1, but it is given only as text and that only by way of examples (which incidentally do not include 123-45678). The examples may not be intended as an exhaustive list, but no other guidance is given. DDL2 is both better and worse since, although numbers are defined in terms of regular expressions, each dictionary defines its own set of data types and there appears to be no limit on how many data types are defined. It sounds to me as if all values should be treated as data strings unless a dictionary is used and the appropriate data types defined in the infoset. Then some means is needed to preserve these types (if possible) in any realization of the infoset, e.g., by writing them in XML or a different version of CIF. In any case DDL1 certainly needs to tighten up its definition of a number if typing is going to be important.
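As a rough illustration of the kind of routine being asked for, the Python sketch below treats every value as a string until a dictionary type is applied, and raises an exception when a value does not fit. The regular expression is only my assumption of what a tightened definition of a number might look like, since DDL1 gives none. The strict policy of rejecting quoted values, and all the names used here, are likewise invented for the example.

```python
# Illustrative sketch only: apply a dictionary-declared 'numb' type to raw
# tokens and fail loudly on anything that does not conform.
import re

# Assumed pattern: optional sign, decimal digits, optional exponent, optional su.
NUMB = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?(\(\d+\))?$")


class CifTypeError(ValueError):
    """Raised when a value does not match the type declared in the dictionary."""


def to_number(text, quoted, name="value"):
    """Convert one raw token to a float, or None for an unquoted '.' or '?'."""
    if quoted:
        # Assumed policy: quoting announces a string, even for "0.3" or ".".
        raise CifTypeError(f"{name}: quoted string {text!r} where a number is required")
    if text in (".", "?"):
        return None
    if not NUMB.match(text):
        raise CifTypeError(f"{name}: {text!r} is not a number")
    return float(re.sub(r"\(\d+\)$", "", text))  # drop the su before converting


def check_column(name, tokens):
    """Return the list of error messages for one loop column of (text, quoted) tokens."""
    errors = []
    for text, quoted in tokens:
        try:
            to_number(text, quoted, name)
        except CifTypeError as exc:
            errors.append(str(exc))
    return errors


# The occupancy column from the example above, as (text, quoted) pairs.
occupancy = [("1.0", False), ("0.3", False), ("not refined", True),
             ("0.3", True), (".", True)]
problems = check_column("_atom_site_occupancy", occupancy)
```

With this policy `problems` holds three messages, one for each quoted entry, and an unquoted 123-45678 would also be rejected because it fails the pattern. A more lenient reader could instead accept a quoted "0.3", but that choice would need to be stated somewhere that both writer and reader can consult.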
-- Dr. I.D.Brown, Professor Emeritus, Department of Physics and Astronomy McMaster University, Hamilton Ontario, Canada