
Re: Opinions on comments as part of the content

To: "Discussion list of the IUCr Committee for the Maintenance of the CIF Standard (COMCIFS)" <[email protected]>
Subject: Re: Opinions on comments as part of the content
From: Brian McMahon <[email protected]>
Date: Wed, 7 Mar 2007 10:32:01 +0000
In-Reply-To: <[email protected]>
References: <[email protected]><[email protected]>

JK> Are there people using comments to hold pertinent information?
JK> If so, has there been any attempt to add a general purpose
JK> comment data items? My thinking is that the only comment that
JK> should have valid information is the CIF header comment,

Comments must not be relied upon to carry "portable" information.
There are a number of applications where they are useful:
for example, Acta Cryst template CIFs make liberal use of comments
to indicate to a human reader the best way to complete data items,
but they don't embed any data that should be exposed in a purely
crystallographic application. Applications that don't re-order
content are at liberty to carry or discard comments, and it's
true that a number do carry them along as a convenience, or
courtesy; and even, in Peter's case, efforts are made to retain
them by applying sensible heuristics if content is re-ordered.
I think such applications have value, but there is no requirement
on them by the standard to do so, nor do I believe there should be.

PMR> The first observation is that CIF does not define an abstract data
PMR> model (e.g. the Infoset in XML) so it is difficult to agree on what a
PMR> parser should do other than confirm validity to the CIF standard.
PMR> ... We have written a CIF parser (CIFDOM) which parses CIFs into an 
PMR> abstract data model which can be exposed in XML syntax and conforms to
PMR> Document Object models (DOM).

This is a good point. In practice CIFs map to different document object
models: small-molecule CIFs submitted to Acta C/E represent a
scientific article reporting one or more discrete structures. msCIF
represents an aggregate of structural descriptions of one or more
compounds/phases, several of which may be overlaid to describe
modulated structures as superpositions of substructures. PDB mmCIFs
represent single-compound database records. symCIFs represent
tabulations of symmetry properties for different space groups. These
models aren't mutually exclusive; they will have significant overlaps.
But I think we need to work at formalising the abstract structures,
classifying different models and mapping to appropriate DOMs if it
turns out that it's necessary to do so.


PMR> In doing this we have had to make 
PMR> various interpretations of the standard, while trying to
PMR> retain the goodwill of authors and readers ... We apply
PMR> the following from the standard ...
PMR> I would be grateful to know if any COMCIFer has a different view of these.

I agree with Peter's interpretations. [I would like to see some
applications developed that did apply various styles of "pretty
printing" (and might one day find time to work on them myself),
for there is a certain aesthetic and usefulness in working with
pure-ASCII files in old-fashioned text-editing environments.
But I accept that these are cosmetic requirements only, and
any parser is at liberty to normalise whitespace that is used to
separate data items.]

PMR> Does this mean one or more comments before the first block? I don't 
PMR> think the standard defines a CIF header comment.

The 1.1 specification recommends that a CIF begin with a comment string
#\#CIF_1.1 to act as a version indicator, and incidentally as a magic
number to help filetype applications supported by an operating system
to identify the file type. Only a recommendation (since it was absent
from the initial spec), and I'd be interested in whether applications do
make use of this string when it is found.

Other than that, comments may occur before the first block, but
without any specific semantics.
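
As a rough illustration of how an application might make use of that
string, here is a minimal Python sketch. The function name cif_version
is hypothetical, and accepting version numbers other than 1.1 is an
assumption on my part, not something the specification text quoted
here requires.

```python
import re

# Assumed pattern: a first line beginning with the literal text "#\#CIF_"
# followed by a version number such as 1.1.
_MAGIC = re.compile(r"^#\\#CIF_(\d+\.\d+)")


def cif_version(text):
    """Return the version named in a leading magic comment, or None if absent."""
    first_line = text.split("\n", 1)[0]
    match = _MAGIC.match(first_line)
    return match.group(1) if match else None
```

Since the string is only a recommendation, a result of None says nothing
about whether the file is a valid CIF.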

PMR> This is one of a small number of topics which could benefit from 
PMR> clarification (and in some cases an arbitrary ruling):
PMR> 
PMR> * data blocks. Is the value of the data block case-sensitive? Are
PMR> data block ids which differ only in case identical and therefore
PMR> illegal? Is it allowed to have an empty string as id, or any mixture
PMR> of non-whitespace CIF chars (e.g. punctuation only)?

   "The file may be partitioned into multiple data blocks by the
    insertion of further data-block headers. Data-block headers
    are case-insensitive (that is, two headers differing only in
    whether corresponding letter characters are upper or lower
    case are considered identical). Within a single data file
    identical data-block headers are not permitted."
          (International Tables G, p.21)

An empty block code is not permitted in a datablock header (i.e.
data_ on its own is invalid); any mixture of non-whitespace
characters is allowed (probably unfortunate, but that's the way
it is).
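
The two rules above (no empty block code, no headers that differ only
in case) could be checked along the following lines. This is only a
sketch: the function name check_block_headers and the wording of the
messages are hypothetical, and it assumes the headers have already been
extracted from the file as strings such as "data_I".

```python
def check_block_headers(headers):
    """Return a list of problems found in data-block headers."""
    problems = []
    seen = {}  # lower-cased block code -> first header that used it
    for header in headers:
        if header[:5].lower() != "data_":
            problems.append(f"not a data-block header: {header!r}")
            continue
        code = header[5:]
        if not code:
            # "data_" on its own is invalid
            problems.append(f"empty block code: {header!r}")
            continue
        key = code.lower()  # headers are compared case-insensitively
        if key in seen:
            problems.append(f"duplicate block: {header!r} matches {seen[key]!r}")
        else:
            seen[key] = header
    return problems
```

For example, check_block_headers(["data_I", "data_i", "data_"]) reports
one duplicate and one empty block code.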

PMR> * data_global. This is so widespread that it would be useful to have 
PMR> at least an agreed heuristic for it.
PMR> * multi-data-block CIFs. Is it legitimate to split them? If so, 
PMR> can/should data_global be copied into each?

For Acta papers, the *heuristic* is that data_global contains information
that applies to all the following data blocks - in practice, the title,
authors, discursive text of the paper, while succeeding data blocks contain
the experimental and derived data for each structure. That's a reasonably
reliable heuristic for that particular document model, but need not apply,
say, to a modulated-structure DOM. For a long time I've thought that we need
to formalise the relationships between data blocks (see e.g.
http://www.iucr.org/iucr-top/lists/comcifs-l/msg00228.html).
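
To make the Acta heuristic concrete, here is a sketch of what copying
data_global into each structure block might look like when splitting a
multi-block CIF. It is an illustration of that one document model only,
not a ruling on whether such copying is legitimate. The function name
split_with_global and the representation of a parsed file as a dict of
block code -> dict of items are assumptions made for the example.

```python
def split_with_global(blocks):
    """Copy the items of a 'global' block into every other block.

    `blocks` maps block codes to dicts of data items. Items already
    present in a block take precedence over those from 'global'.
    """
    global_items = {}
    for code, items in blocks.items():
        if code.lower() == "global":  # block codes are case-insensitive
            global_items = items
    return {
        code: {**global_items, **items}
        for code, items in blocks.items()
        if code.lower() != "global"
    }
```

A modulated-structure file would need a different rule, since there the
relationships between blocks are not simply "global applies to all".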


PMR> * what are the semantics of '?' and '.'

   "The more important use of the null data type is its application
    to the meta characters ` ?' (query) and ` .' (full point) that
    may occur as values associated with any data name and therefore
    have no specific type. ...

    The substitution of the query character ` ?' in place of a data
    value is an explicit signal that an expected value is missing
    from a CIF. This `missing-value signal' may be used instead of
    omitting an item (i.e. its tag and value) entirely from the file,
    and serves as a reminder that the item would normally be present.

    The substitution of the full-point character ` .' in place of
    a CIF data value serves two similar, but not identical, purposes.
    If it is used in looped lists of data it is normally a signal that
    a value in a particular packet (i.e. a value in the row of the
    table) is `inapplicable' or `inappropriate'. In some CIF
    applications involving access to a data dictionary it is used to
    signal that the default value of the item is defined in its
    definition in the dictionary. Consequently, the interpretation
    of this signal is an application-specific matter and its use must
    be determined according to the application."
          (International Tables G, p.24)
     

PMR> Is it legitimate to delete an item of the form: _foo ?
PMR> or does it convey information?

It conveys information (though probably only the person who put it there
knows what). Of course, an application that is expressly validating
against a dictionary might choose to omit it.
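
One way a parser can avoid throwing that information away is to keep
'?' and '.' as distinct markers rather than dropping the item or turning
both into an empty string. A minimal sketch follows; the names Null and
interpret are hypothetical, and treating a quoted '?' or '.' as an
ordinary character string is my understanding of the syntax rather than
something stated in the passage quoted above.

```python
from enum import Enum


class Null(Enum):
    MISSING = "?"                   # an expected value is absent
    INAPPLICABLE_OR_DEFAULT = "."   # meaning depends on the application


def interpret(raw, quoted=False):
    """Map an unquoted '?' or '.' to a marker; return other values unchanged."""
    if not quoted:
        for null in Null:
            if raw == null.value:
                return null
    return raw
```

With this, an item written as "_foo ?" stays in the parsed data with the
value Null.MISSING, and an application can decide for itself whether to
omit it on output.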

Brian
