Discussion List Archives


Re: Opinions on comments as part of the content

At 18:53 06/03/2007, Joe Krahn wrote:
Thanks for this topic - it has concerned us in writing CIF parsers.

The first observation is that CIF does not define an abstract data 
model (e.g. the Infoset in XML) so it is difficult to agree on what a 
parser should do other than confirm validity to the CIF 
standard.  (An analogy was early XML parsers whose only required 
output was "valid" or "invalid").  I suspect that each parser writer 
has created their own data model. It would be extremely valuable to 
develop such a model for CIF.

We have written a CIF parser (CIFDOM) which parses CIFs into an 
abstract data model which can be exposed in XML syntax and conforms to 
the Document Object Model (DOM). In doing this we have had to make 
various interpretations of the standard, while trying to retain the 
goodwill of authors and readers.  We have parsed ca 80,000 CIFs 
(standard "small molecule", all DDL-1, core dictionary, no mmCIF, no 
images, etc.). We apply the following from the standard:

* within a CIF the order of the data blocks is arbitrary and changing 
that does not alter the data model
* within a data block the order of the items and loops is arbitrary 
and changing that does not affect the data model
* white space between CIF tokens (e.g. between item name and item 
value, between items and loops, and between loop names or loop values) 
can be normalised to a single space or any other conformant white 
space string. This may surprise and upset authors who expect the 
pretty printing to emerge from a parser but the standard does not 
require it and it is difficult.
* the quoting mechanism for values can be changed or normalised. For 
example 'foo' can be normalised to foo. It may not always be clear 
how many line-ends should be preserved in semi-colon values or 
whether a single-line semicolon value could be translated to a quoted string.
* duplicate item names are not allowed
* all CIF names can be case-normalised (e.g. H-M can become h-m)
* duplicate data block ids are not allowed

I would be grateful to know if any COMCIFer has a different view of these.
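
As an illustration of the order, case and duplicate rules listed above, 
here is a minimal Python sketch of an order-independent data model. It is 
not the CIFDOM implementation: the names build_model, DuplicateError and 
fold_block_ids are hypothetical, loops are left out, and the input is 
assumed to be already tokenised into (block id, items) pairs. Whether 
block ids should be case-folded is an open question, so it is left as a 
switch.

```python
class DuplicateError(ValueError):
    """Raised when a data block id or an item name occurs twice."""


def build_model(blocks, fold_block_ids=False):
    """Build a model from (block_id, [(name, value), ...]) pairs.

    Item names are case-normalised. Folding block ids is an assumption
    that the caller must opt into.
    """
    model = {}
    for block_id, items in blocks:
        key = block_id.lower() if fold_block_ids else block_id
        if key in model:
            raise DuplicateError(f"duplicate data block id: {block_id}")
        block = {}
        for name, value in items:
            norm = name.lower()  # e.g. ..._H-M becomes ..._h-m
            if norm in block:
                raise DuplicateError(f"duplicate item name: {name}")
            block[norm] = value
        model[key] = block
    return model


# Two files that differ only in block order, item order and name case.
first = [
    ("a", [("_cell_length_a", "5.0"),
           ("_symmetry_space_group_name_H-M", "P1")]),
    ("b", [("_foo", "bar")]),
]
second = [
    ("b", [("_foo", "bar")]),
    ("a", [("_symmetry_space_group_name_h-m", "P1"),
           ("_cell_length_a", "5.0")]),
]
assert build_model(first) == build_model(second)
```

Because the model is a dictionary of dictionaries, equality ignores the 
order in which blocks and items were read, which is the behaviour the 
first two rules ask for.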

If these are accepted then comments can be reordered within blocks. 
Many comments are created on the assumption that they attach to the 
following CIF item or loop but a parser need not (and in principle 
cannot) preserve this implicit semantic. There are no such things as 
inter-block comments except that any comments preceding the first 
block can be identified as not belonging to any block. These can be reordered.

It is therefore legitimate (if unpretty) to assemble all comments 
within a block together and sort them into arbitrary order; the same 
can be done for the non-block comments.
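
A short Python sketch of that idea, assuming the CIF has already been 
tokenised into (kind, text) pairs. The token kinds "block", "comment" 
and "item" and the function name collect_comments are hypothetical, 
chosen only for this example.

```python
def collect_comments(tokens):
    """Group comments by data block and sort each group.

    tokens: iterable of (kind, text) pairs. A "block" token carries the
    block id; a "comment" token carries the comment text; all other
    kinds are ignored here.
    """
    file_level = []        # comments that precede the first data block
    per_block = {}
    current = file_level
    for kind, text in tokens:
        if kind == "block":
            current = per_block.setdefault(text, [])
        elif kind == "comment":
            current.append(text)
    return sorted(file_level), {b: sorted(c) for b, c in per_block.items()}


tokens = [
    ("comment", "# written by some program"),
    ("block", "a"),
    ("comment", "# z comment"),
    ("item", "_foo 1"),
    ("comment", "# b comment"),
]
header, by_block = collect_comments(tokens)
```

The link between "# z comment" and the item that follows it is lost 
here, which is exactly the implicit semantic a parser need not preserve.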

>It seems that some CIF parsers retain comments.

Only if the parser has a data model which can be inspected or output.

>  Are there people using
>comments to hold pertinent information? If so, has there been any
>attempt to add a general purpose comment data items? My thinking is that
>the only comment that should have valid information is the CIF header
>comment,

Does this mean one or more comments before the first block? I don't 
think the standard defines a CIF header comment.

This is one of a small number of topics which could benefit from 
clarification (and in some cases an arbitrary ruling):

* data blocks. Is the data block id case-sensitive? Are 
data block ids which differ only in case identical and therefore 
illegal? Is it allowed to have an empty string as id, or any mixture 
of non-whitespace CIF chars (e.g. punctuation only)?
* data_global. This is so widespread that it would be useful to have 
at least an agreed heuristic for it.
* multi-data-block CIFs. Is it legitimate to split them? If so, 
can/should data_global be copied into each?

* What are the semantics of '?' and '.'? Is it legitimate to delete an 
item of the form:
_foo ?
or does it convey information?
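
To make the question concrete, here is a small Python sketch. It assumes 
the common reading that an unquoted ? means "unknown" and an unquoted . 
means "inapplicable"; the names Special, interpret and drop_unknown are 
hypothetical.

```python
from enum import Enum


class Special(Enum):
    UNKNOWN = "?"        # assumed meaning of an unquoted ?
    INAPPLICABLE = "."   # assumed meaning of an unquoted .


def interpret(raw):
    """Map an unquoted raw value to a sentinel, or return it unchanged."""
    for special in Special:
        if raw == special.value:
            return special
    return raw


def drop_unknown(block):
    """Return a copy of the block without items whose value is unknown."""
    return {k: v for k, v in block.items() if v is not Special.UNKNOWN}


block = {"_foo": interpret("?"), "_bar": interpret("1.5")}
stripped = drop_unknown(block)
```

In this model block and stripped are not equal, so deleting the item 
changes the data model unless it is ruled that "_foo ?" and an absent 
_foo mean the same thing.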

>and all the rest can be stripped. Are there any opinions that
>comments are important to retain?

P.


Peter Murray-Rust
Unilever Centre for Molecular Sciences Informatics
University of Cambridge,
Lensfield Road,  Cambridge CB2 1EW, UK
+44-1223-763069 


