Discussion List Archives


Re: Opinions on comments as part of the content

At 17:31 07/03/2007, Joe Krahn wrote:
Thanks Joe and Brian,

>peter murray-rust wrote:
> > At 18:53 06/03/2007, Joe Krahn wrote:
> > Thanks for this topic - it has concerned us in writing CIF parsers.
> >
> > The first observation is that CIF does not define an abstract data
> > model (e.g. the Infoset in XML) so it is difficult to agree on what a
> > parser should do other than confirm validity to the CIF
> > standard.  (An analogy was early XML parsers whose only required
> > output was "valid" or "invalid").  I suspect that each parser writer
> > has created their own data model. It would be extremely valuable to
> > develop such a model for CIF.
> >
> > We have written a CIF parser (CIFDOM) which parses CIFs into an
> > abstract data model which can be exposed in XML syntax and conforms to
> > the Document Object Model (DOM). In doing this we have had to make
> > various interpretations of the standard, while trying to retain the
> > goodwill of authors and readers.  We have parsed ca 80,000 CIFs
> > (standard "small molecule", all DDL-1, core dictionary, no mmCIF, no
> > images, etc.). We apply the following from the standard:
>My approach is quite different. One of the reasons that PDB-format users
>are not switching to mmCIF is that random shuffling of order assumes
>that it is really not intended to be directly read by humans. It is
>often easy to modify a PDB file with a plain text editor. PDB users
>would be much more willing to switch to CIF if the format of CIF files
>were more strictly defined.

I accept this. The problem is that it is very difficult to define a 
common presentational format - it is not part of CIF. Therefore it is 
difficult to have consistency between different implementers - what 
is pretty to one is ugly to another.

> >
> > * within a CIF the order of the data blocks is arbitrary and changing
> > that does not alter the data model
> > * within a data block the order of the items and loops is arbitrary
> > and changing that does not affect the data model
>My parser currently retains the original order, but I think a better
>approach is to have a 'preferred order' item defined in a dictionary.

Our approach allows the prettification and reordering through 
stylesheets. With CIFDOM it is trivial (using XSLT) to reorder the 
items and loops in whatever way you wish. It is possible, though more 
difficult, to reorder columns in tables. There is no obvious way in 
which rows of tables can be reordered (but note that they are
intrinsically unordered).
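
As an illustration only (this is plain Python, not CIFDOM or XSLT), reordering the items of one data block against a preferred order might be sketched as below. The function name reorder_items and the representation of a block as a dict of item name to value are assumptions made for the example.

```python
def reorder_items(items, preferred):
    """Return the items of one data block in a preferred order.

    items:     dict mapping item name -> value (hypothetical data model)
    preferred: list of item names in the order they should appear
    Names not listed in `preferred` follow, in their original order.
    """
    rank = {name: i for i, name in enumerate(preferred)}
    # sorted() is stable, so unlisted names keep their relative order.
    ordered = sorted(items.items(), key=lambda kv: rank.get(kv[0], len(rank)))
    return dict(ordered)


block = {"_cell_length_b": "11.2", "_cell_volume": "950.1", "_cell_length_a": "10.5"}
pretty = reorder_items(block, ["_cell_length_a", "_cell_length_b"])
print(list(pretty))
```

Since dict equality in Python ignores insertion order, `pretty == block` still holds, which matches the view that item order is not part of the data model.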

> > * white space between CIF tokens (e.g. between item name and item
> > value, between items and loops and between loop name or loop values)
> > can be normalised to a single space or any other conformant white
> > space string. This may surprise and upset authors who expect the
> > pretty printing to emerge from a parser but the standard does not
> > require it and it is difficult.
>Again, I prefer "pretty printing" because I view the CIF format as
>intended for human consumption. Trying to do this heuristically is a
>pain. A dictionary item defining the format (combined with the preferred
>order) would make "pretty printing" possible without all of the guess work.

I think this would be very difficult for some items and loop values. 
These can range from empty strings to huge paragraphs.

> > * the quoting mechanism for values can be changed or normalised. For
> > example 'foo' can be normalised to foo. It may not always be clear
> > how many line-ends should be preserved in semi-colon values or
> > whether a single-line semicolon value could be translated to a
> > quoted string.
> > * duplicate item names are not allowed
> > * all cif names can be case-normalised (e.g. H-M can become h-m)
> > * duplicate data block ids are not allowed
>I have been retaining the case, even though comparisons are case
>insensitive. However, the current convention is to use underscore-joined
>words, where case does not affect readability (as opposed to WikiWord
>style). So, converting everything to lower case is probably a good idea.
>
>Do the standards not state that names must be unique?

Indeed they do. This does not stop many authors from duplicating names.
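
A minimal sketch of how a parser might report such duplicates once names are case-normalised. The function name duplicate_names is hypothetical, and the item names are assumed to have been collected from a single data block already.

```python
from collections import Counter


def duplicate_names(names):
    """Return the lower-cased item names that occur more than once."""
    counts = Counter(name.lower() for name in names)
    return sorted(name for name, n in counts.items() if n > 1)


names = [
    "_cell_length_a",
    "_symmetry_space_group_name_H-M",
    "_symmetry_space_group_name_h-m",
]
print(duplicate_names(names))  # ['_symmetry_space_group_name_h-m']
```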

> >
> > I would be grateful to know if any COMCIFer has a different view of these.
> >
> > If these are accepted then comments can be reordered within blocks.
> > Many comments are created on the assumption that they attach to the
> > following CIF item or loop but a parser need not (and in principle
> > cannot) preserve this implicit semantic. There are no such things as
> > inter-block comments except that any comments preceding the first
> > block can be identified as not belonging to any block. These can
> > be reordered.
> >
> > It is therefore legitimate (if unpretty) to assemble all comments
> > within a block together and sort them into arbitrary order; the same
> > can be done for the non-block comments.
> >
> >> It seems that some CIF parsers retain comments.
> >
> > Only if the parser has a data model which can be inspected or output.
> >
> >>  Are there people using
> >> comments to hold pertinent information? If so, has there been any
> >> attempt to add a general-purpose comment data item? My thinking is that
> >> the only comment that should have valid information is the CIF header
> >> comment,
> >
> > Does this mean one or more comments before the first block? I don't
> > think the standard defines a CIF header comment.
>I am working under the assumption that all comments should be stripped.
>They should only be used for things like hints in a CIF example
>template, and pretty-printing.

CIFDOM with a stylesheet will allow most reasonable options - strip, 
preserve, even remove all but first, but clearly there is no unique reordering.

> >
> > This is one of a small number of topics which could benefit from
> > clarification (and in some cases an arbitrary ruling):
> >
> > * data blocks. Is the value of the data block case-sensitive? Are
> > data block ids which differ only in case identical and therefore
> > illegal? Is it allowed to have an empty string as id, or any mixture
> > of non-whitespace CIF chars (e.g. punctuation only)?
> > * data_global. This is so widespread that it would be useful to have
> > at least an agreed heuristic for it.
>Isn't data_global just a bad implementation of the unused 'global_' in
>STAR? If people want a standard global, then 'global_' should be used.

NO! global_ is part of STAR but not CIF. That is part of the problem. 
I don't know who invented data_global but it wasn't an agreed 
heuristic. My own belief is that in a file such as

data_global
   content_g
data_1
   content_1
data_2
   content_2

the heuristics are:
* this is semantically equivalent to two separate CIFs:

data_1
   content_g
   content_1

and

data_2
   content_g
   content_2

* this requires that no items in data_global have the same names as
any in data_1 or data_2. This is nowhere defined and should be.
* the two CIFs have no semantic relation other than any
that can be deduced from the common items in data_global


> > * multi-data-block CIFs. Is it legitimate to split them? If so,
> > can/should data_global be copied into each?
> >
> > * what are the semantics of '?' and '.' Is it legitimate to delete an
> > item of the form:
> > _foo ?
> > or does it convey information?
> >
>The difference is fairly well defined, unknown versus undefined, where
>undefined means not-applicable. In practice, this just adds complexity.
>The difference between them is almost always obvious from the context of
>related data items. To make things even worse, '.' can sometimes
>indicate the default value. Some uses of '.' would be more sensible to
>me as zero-length strings. Otherwise, I would rather just get rid of it,
>or redefine it as exclusively meaning "the default value".

My own heuristics are:
_foo '?'
carries no useful information other than that the author hasn't bothered
to remove it from the file
_foo '.'
is highly dangerous as the dictionary can contain default values 
which most users have no idea of. Thus the default extinction 
correction is (or certainly was) 'Zachariasen' and algorithmically
linking '.' to this value is certain to give misleading info.

loop_
_foo _bar
a .
b c

has a null value for one cell - this is required to make a rectangular table

loop_
_foo _bar
a .
b .

should be equivalent to
loop_
_foo
a
b

and this construct should be avoided

loop_
_foo _bar
a ?
b ?

is almost certainly an unedited template and should be replaced by:

loop_
_foo
a
b

and finally
loop_
_foo _bar
a ?
b c

is indistinguishable from

loop_
_foo _bar
a .
b c

All these issues come into very sharp focus when processing CIFs - it 
is not trivial to manage '.' in a column of otherwise real numbers.
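
A minimal sketch of the loop heuristics above, under stated assumptions: a loop is held as a list of column names plus a list of rows of strings, '?' and '.' appear as unquoted placeholders, and numeric values carry no standard uncertainty in parentheses. The names prune_empty_columns and to_number are hypothetical.

```python
PLACEHOLDERS = {"?", "."}


def prune_empty_columns(names, rows):
    """Drop loop columns whose every value is '?' or '.'."""
    keep = [
        i for i in range(len(names))
        if any(row[i] not in PLACEHOLDERS for row in rows)
    ]
    return [names[i] for i in keep], [[row[i] for i in keep] for row in rows]


def to_number(value):
    """Map a placeholder to None instead of a dictionary default."""
    return None if value in PLACEHOLDERS else float(value)


names, rows = prune_empty_columns(["_foo", "_bar"], [["a", "."], ["b", "."]])
print(names, rows)  # ['_foo'] [['a'], ['b']]

column = [to_number(v) for v in ["1.5", ".", "2.25"]]
print(column)  # [1.5, None, 2.25]
```

A column with at least one real value is kept whole, so the table stays rectangular; treating '?' and '.' identically there reflects the view that the two are indistinguishable in that position.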

P.



Peter Murray-Rust
Unilever Centre for Molecular Sciences Informatics
University of Cambridge,
Lensfield Road,  Cambridge CB2 1EW, UK
+44-1223-763069 


