
Re: Opinions on comments as part of the content

To: "Discussion list of the IUCr Committee for the Maintenance of the CIF Standard (COMCIFS)" <[email protected]>
Subject: Re: Opinions on comments as part of the content
From: Joe Krahn <[email protected]>
Date: Wed, 07 Mar 2007 12:31:17 -0500
In-Reply-To: <[email protected]>
References: <[email protected]><[email protected]>

peter murray-rust wrote:
> At 18:53 06/03/2007, Joe Krahn wrote:
> Thanks for this topic - it has concerned us in writing CIF parsers.
> 
> The first observation is that CIF does not define an abstract data 
> model (e.g. the Infoset in XML) so it is difficult to agree on what a 
> parser should do other than confirm validity to the CIF 
> standard.  (An analogy was early XML parsers whose only required 
> output was "valid" or "invalid").  I suspect that each parser writer 
> has created their own data model. It would be extremely valuable to 
> develop such a model for CIF.
> 
> We have written a CIF parser (CIFDOM) which parses CIFs into an 
> abstract data model which can be exposed in XML syntax and conforms to 
> Document Object models (DOM). In doing this we have had to make 
> various interpretations of the standard, while trying to retain the 
> goodwill of authors and readers.  We have parsed ca 80,000 CIFs 
> (standard "small molecule", all DDL-1, core dictionary, no mmCIF, no 
> images, etc.). We apply the following from the standard:
My approach is quite different. One of the reasons that PDB-format users
are not switching to mmCIF is that random shuffling of the item order
assumes the file is not really intended to be read directly by humans. It
is often easy to modify a PDB file with a plain text editor. PDB users
would be much more willing to switch to CIF if the format of CIF files
were more strictly defined.

> 
> * within a CIF the order of the data blocks is arbitrary and changing 
> that does not alter the data model
> * within a data block the order of the items and loops is arbitrary 
> and changing that does not affect the data model
My parser currently retains the original order, but I think a better
approach is to have a 'preferred order' item defined in a dictionary.
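
To make the 'preferred order' idea concrete, here is a minimal Python
sketch. The PREFERRED_ORDER list and the order_items function are
hypothetical; no current CIF dictionary defines such an attribute. Items
that the dictionary does not rank keep their original file order and are
placed after the ranked ones.

```python
# Hypothetical: a dictionary-supplied ranking of item names. This is an
# assumption for illustration, not an existing dictionary attribute.
PREFERRED_ORDER = ["_cell_length_a", "_cell_length_b", "_cell_length_c"]


def order_items(items, preferred=PREFERRED_ORDER):
    """Return (name, value) pairs sorted by their dictionary rank.

    Names missing from the ranking sort after all ranked names and keep
    their original relative order.
    """
    rank = {name.lower(): i for i, name in enumerate(preferred)}
    unranked = len(rank)
    # sorted() is stable, so all unranked names stay in file order.
    return sorted(items, key=lambda pair: rank.get(pair[0].lower(), unranked))


block = [
    ("_cell_length_c", "12.1"),
    ("_my_local_item", "x"),
    ("_cell_length_a", "5.4"),
    ("_cell_length_b", "7.9"),
]
ordered = order_items(block)
```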

> * white space between CIF tokens (e.g. between item name and item 
> value, between items and loops and between loop name or loop values) 
> can be normalised to a single space or any other conformant white 
> space string. This may surprise and upset authors who expect the 
> pretty printing to emerge from a parser but the standard does not 
> require it and it is difficult.
Again, I prefer "pretty printing" because I view the CIF format as
intended for human consumption. Trying to do this heuristically is a
pain. A dictionary item defining the format (combined with the preferred
order) would make "pretty printing" possible without all of the guesswork.
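
A dictionary-driven formatter could be as simple as the following Python
sketch. The FORMAT_HINTS table and its name_width/value_width fields are
invented for illustration and are not part of any existing DDL. Only the
white space between tokens changes, so the parsed content is unaffected.

```python
# Hypothetical per-item layout hints, as a dictionary might supply them.
FORMAT_HINTS = {
    "_cell_length_a": {"name_width": 20, "value_width": 10},
}
# Items without a hint get a single separating space and no padding.
DEFAULT_HINT = {"name_width": 0, "value_width": 0}


def format_item(name, value, hints=FORMAT_HINTS):
    """Lay out one name/value pair using the column widths from the hints."""
    hint = hints.get(name.lower(), DEFAULT_HINT)
    padded_name = name.ljust(hint["name_width"])
    padded_value = value.rjust(hint["value_width"])
    return padded_name + " " + padded_value


line = format_item("_cell_length_a", "5.4")
```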

> * the quoting mechanism for values can be changed or normalised. For 
> example 'foo' can be normalised to foo. It may not always be clear 
> how many line-ends should be preserved in semi-colon values or 
> whether a single-line semicolon value could be translated to a quoted string.
> * duplicate item names are not allowed
> * all cif names can be case-normalised (e.g. H-M can become h-m)
> * duplicate data block ids are not allowed
I have been retaining the case, even though comparisons are case
insensitive. However, the current convention is to use underscore-joined
words, where case does not affect readability (as opposed to WikiWord
style). So, converting everything to lower case is probably a good idea.

Do the standards not state that names must be unique?
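
Retaining the original case while comparing case-insensitively, and
rejecting duplicates, can be combined in one small container. This is
only a sketch: the DataBlock class is hypothetical, and it assumes ASCII
names so that str.lower() is an adequate normalisation.

```python
class DataBlock:
    """Items keyed case-insensitively; the original spelling is kept."""

    def __init__(self):
        # lower-cased name -> (name as written in the file, value)
        self._items = {}

    def add(self, name, value):
        key = name.lower()
        if key in self._items:
            # Names differing only in case count as duplicates.
            raise ValueError(f"duplicate item name: {name}")
        self._items[key] = (name, value)

    def get(self, name):
        return self._items[name.lower()][1]

    def names(self):
        """Names as originally written, in insertion order."""
        return [original for original, _ in self._items.values()]


block = DataBlock()
block.add("_symmetry_space_group_name_H-M", "P 21/c")
value = block.get("_symmetry_space_group_name_h-m")
```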

> 
> I would be grateful to know if any COMCIFer has a different view of these.
> 
> If these are accepted then comments can be reordered within blocks. 
> Many comments are created on the assumption that they attach to the 
> following CIF item or loop but a parser need not (and in principle 
> cannot) preserve this implicit semantic. There are no such things as 
> inter-block comments except that any comments preceding the first 
> block can be identified as not belonging to any block. These can be reordered.
> 
> It is therefore legitimate (if unpretty) to assemble all comments 
> within a block together and sort them into arbitrary order; the same 
> can be done for the non-block comments.
> 
>> It seems that some CIF parsers retain comments.
> 
> Only if the parser has a data model which can be inspected or output.
> 
>>  Are there people using
>> comments to hold pertinent information? If so, has there been any
>> attempt to add a general purpose comment data items? My thinking is that
>> the only comment that should have valid information is the CIF header
>> comment,
> 
> Does this mean one or more comments before the first block? I don't 
> think the standard defines a CIF header comment.
I am working under the assumption that all comments should be stripped.
They should only be used for things like hints in a CIF example
template, and pretty-printing.

> 
> This is one of a small number of topics which could benefit from 
> clarification (and in some cases an arbitrary ruling):
> 
> * data blocks. Is the value of the data block case-sensitive? Are 
> data block ids which differ only in case identical and therefore 
> illegal? Is it allowed to have an empty string as id, or any mixture 
> of non-whitespace CIF chars (e.g. punctuation only)?
> * data_global. This is so widespread that it would be useful to have 
> at least an agreed heuristic for it.
Isn't data_global just a bad implementation of the unused 'global_' in
STAR? If people want a standard global, then 'global_' should be used.

> * multi-data-block CIFs. Is it legitimate to split them? If so, 
> can/should data_global be copied into each?
> 
> * what are the semantics of '?' and '.' Is it legitimate to delete an 
> item of the form:
> _foo ?
> or does it convey information?
> 
The difference is fairly well defined: unknown versus undefined, where
undefined means not applicable. In practice, this just adds complexity.
The difference between them is almost always obvious from the context of
related data items. To make things even worse, '.' can sometimes
indicate the default value. Some uses of '.' would be more sensible to
me as zero-length strings. Otherwise, I would rather just get rid of it,
or redefine it as exclusively meaning "the default value".
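
For a parser that does keep the distinction, one option is to map the two
tokens to explicit markers rather than passing them on as strings. The
Python sketch below is illustrative only: the Special enum and interpret
function are hypothetical names, and it assumes that a quoted '?' or '.'
is an ordinary string rather than a special value.

```python
from enum import Enum


class Special(Enum):
    UNKNOWN = "?"        # value exists but is not known
    INAPPLICABLE = "."   # value is not applicable (or, in some uses, default)


def interpret(raw, quoted=False):
    """Map a raw value token to a Special marker or a plain string.

    Assumption: only unquoted '?' and '.' are special; quoted ones are
    treated as literal text.
    """
    if not quoted and raw in ("?", "."):
        return Special(raw)
    return raw


unknown = interpret("?")
literal = interpret("?", quoted=True)
```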

Joe Krahn
