Discussion List Archives


Re: Opinions on comments as part of the content

Thanks Brian,
We are clearly in almost complete agreement. Minor comments below.

At 10:32 07/03/2007, Brian McMahon wrote:
>courtesy; and even, in Peter's case, efforts are made to retain
>them by applying sensible heuristics if content is re-ordered.
>I think such applications have value, but there is no requirement
>on them by the standard to do so, nor do I believe there should be.


>PMR> The first observation is that CIF does not define an abstract data
>PMR> model (e.g. the Infoset in XML) so it is difficult to agree on what a
>PMR> parser should do other than confirm validity to the CIF standard.
>PMR> ... We have written a CIF parser (CIFDOM) which parses CIFs into an
>PMR> abstract data model which can be exposed in XML syntax and conforms to
>PMR> Document Object models (DOM).
>This is a good point. In practice CIFs map to different document object
>models: small-molecule CIFs submitted to Acta C/E represent a
>scientific article reporting one or more discrete structures. msCIF
>represents an aggregate of structural descriptions of one or more
>compounds/phases, several of which may be overlaid to describe
>modulated structures as superpositions of substructures. PDB mmCIFs
>represent single-compound database records. symCIFs represent
>tabulations of symmetry properties for different space groups. These
>models aren't mutually exclusive; they will have significant overlaps.
>But I think we need to work at formalising the abstract structures,
>classifying different models and mapping to appropriate DOMs if it
>turns out that it's necessary to do so.

Fully agreed. I believe that it is possible to have a single DOM for
the non-STAR CIFs, based on DDL1. (Is it still called that?) The more
complex arrangements should be extensions of this. There will be a
need for a language to define the different conventions. (My own CML
has an attribute 'convention' precisely for this purpose.) Some
signal is required because it is non-trivial to work out what
document model is mandated by a given lexical CIF.

>PMR> In doing this we have had to make
>PMR> various interpretations of the standard, while trying to
>PMR> retain the goodwill of authors and readers ... We apply
>PMR> the following from the standard ...
>PMR> I would be grateful to know if any COMCIFer has a different
>PMR> view of these.
>I agree with Peter's interpretations. [I would like to see some
>applications developed that did apply various styles of "pretty
>printing" (and might one day find time to work on them myself),
>for there is a certain aesthetic and usefulness in working with
>pure-ASCII files in old-fashioned text-editing environments.
>But I accept that these are cosmetic requirements only, and
>any parser is at liberty to normalise whitespace that is used to
>separate data items.]
>PMR> Does this mean one or more comments before the first block? I don't
>PMR> think the standard defines a CIF header comment.
>The 1.1 specification recommends that a CIF begin with a comment string
>#\#CIF_1.1 to act as a version indicator, and incidentally as a magic
>number to help filetype applications supported by an operating system
>to identify the file type. Only a recommendation (since it was absent
>from the initial spec), and I'd be interested in whether applications do
>make use of this string when it is found.

I think this would be valuable. There is an increasing number of
applications which need to 'guess' filetype, and magic signals are valuable.
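
A minimal sketch of how an application might use the recommended
version comment as a magic signal. The function name and the choice
to inspect only the first line are assumptions made for illustration;
they are not taken from the specification.

```python
# Prefix of the version comment recommended by the 1.1 specification.
# The Python literal below is the seven characters  #\#CIF_
CIF_MAGIC = "#\\#CIF_"


def sniff_cif_version(first_line):
    """Return the version after the magic prefix, or None if absent.

    Hypothetical helper: only the first line of the file is examined.
    """
    line = first_line.rstrip("\r\n")
    if not line.startswith(CIF_MAGIC):
        return None
    version = line[len(CIF_MAGIC):].strip()
    return version or None
```

Since the comment is only a recommendation, a None result cannot be
taken to mean "not a CIF"; it only means that the signal is absent.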

>Other than that, comments may occur before the first block, but
>without any specific semantics.
>PMR> This is one of a small number of topics which could benefit from
>PMR> clarification (and in some cases an arbitrary ruling):
>PMR> * data blocks. Is the value of the data block case-sensitive? Are
>PMR> data block ids which differ only in case identical and therefore
>PMR> illegal? Is it allowed to have an empty string as id? Or any mixture
>PMR> of non-whitespace CIF chars (e.g. punctuation only)?
>    "The file may be partitioned into multiple data blocks by the
>     insertion of further data-block headers. Data-block headers
>     are case-insensitive (that is, two headers differing only in
>     whether corresponding letter characters are upper or lower
>     case are considered identical). Within a single data file
>     identical data-block headers are not permitted."
>           (International Tables G, p.21)
>An empty block code is not permitted in a datablock header (i.e.
>data_ on its own is invalid); any mixture of non-whitespace
>characters is allowed (probably unfortunate, but that's the way
>it is).

Thanks - I should have picked this up - I tend to refer to the CIF spec.
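
The rules quoted above could be checked roughly as follows. This is a
sketch only: it assumes the headers have already been tokenised into
strings such as 'data_foo', and the function name and message wording
are invented for the example.

```python
def check_block_codes(headers):
    """Return a list of problems found in data-block headers.

    Applies the two rules quoted from International Tables G:
    an empty block code is invalid, and headers that differ only
    in letter case count as identical (hence duplicates).
    """
    problems = []
    seen = {}  # lower-cased block code -> first header that used it
    for header in headers:
        if header[:5].lower() != "data_":
            problems.append(f"{header!r}: not a data-block header")
            continue
        code = header[5:]
        if not code:
            problems.append(f"{header!r}: empty block code")
            continue
        key = code.lower()
        if key in seen:
            problems.append(
                f"{header!r}: duplicates {seen[key]!r} (case-insensitive)"
            )
        else:
            seen[key] = header
    return problems
```

For example, the list data_global, data_I, data_i, data_ yields two
problems: data_i duplicates data_I, and data_ has an empty block code.
Punctuation-only codes pass, since any non-whitespace mixture is allowed.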

>PMR> * data_global. This is so widespread that it would be useful to have
>PMR> at least an agreed heuristic for it.
>PMR> * multi-data-block CIFs. Is it legitimate to split them? If so,
>PMR> can/should data_global be copied into each?
>For Acta papers, the *heuristic* is that data_global contains information
>that applies to all the following data blocks - in practice, the title,
>authors, discursive text of the paper, while succeeding data blocks contain
>the experimental and derived data for each structure. That's a reasonably
>reliable heuristic for that particular document model, but need not apply,
>say, to a modulated-structure DOM. For a long time I've thought that we need
>to formalise the relationships between data blocks (see e.g.

Yes - this is very useful - it had slipped my memory that we had 
preserved this discussion.

See also comments to Joe's mail
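
A sketch of the Acta-style heuristic when splitting a multi-block CIF,
assuming the blocks have already been parsed into a dict (in file
order) that maps block code to a dict of items. The function name, the
data layout, and the rule that a block's own items override those
copied from data_global are all assumptions for illustration; nothing
in the standard mandates them.

```python
def split_with_global(blocks, global_code="global"):
    """Copy data_global items into each data block that follows it.

    blocks: dict of block code -> dict of tag -> value, in file order.
    Returns a new dict without the global block. Blocks that appear
    before data_global receive nothing from it.
    """
    shared = {}
    result = {}
    for code, items in blocks.items():
        if code.lower() == global_code:
            shared = dict(items)
            continue
        merged = dict(shared)   # start from the shared items
        merged.update(items)    # assumed: the block's own items win
        result[code] = merged
    return result
```

This fits only the document model described for Acta papers; it should
not be applied blindly to, say, a modulated-structure CIF.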

>PMR> * what are the semantics of '?' and '.'
>    "The more important use of the null data type is its application
>     to the meta characters ` ?' (query) and ` .' (full point) that
>     may occur as values associated with any data name and therefore
>     have no specific type. ...
>     The substitution of the query character ` ?' in place of a data
>     value is an explicit signal that an expected value is missing
>     from a CIF. This `missing-value signal' may be used instead of
>     omitting an item (i.e. its tag and value) entirely from the file,
>     and serves as a reminder that the item would normally be present.
>     The substitution of the full-point character ` .' in place of
>     a CIF data value serves two similar, but not identical, purposes.
>     If it is used in looped lists of data it is normally a signal that
>     a value in a particular packet (i.e. a value in the row of the
>     table) is `inapplicable' or `inappropriate'. In some CIF
>     applications involving access to a data dictionary it is used to
>     signal that the default value of the item is defined in its
>     definition in the dictionary. Consequently, the interpretation
>     of this signal is an application-specific matter and its use must
>     be determined according to the application."
>           (International Tables G, p.24)
>PMR> Is it legitimate to delete an item of the form: _foo ?
>PMR> or does it convey information?
>It conveys information (though probably only the person who put it there
>knows what). Of course, an application that is expressly validating
>against a dictionary might choose to omit it.

Thanks - I have also noted some examples in an earlier reply.
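
One way a parser could keep the two null signals distinct, and avoid
silently dropping an item such as _foo ?, is sketched below. The enum
and function names are hypothetical, and the values are assumed to be
unquoted tokens already separated from their tags.

```python
from enum import Enum


class Null(Enum):
    MISSING = "?"       # an expected value is missing from the CIF
    INAPPLICABLE = "."  # inapplicable, or "use the dictionary default";
                        # which one is application-specific


def interpret_value(token):
    """Map an unquoted '?' or '.' to a sentinel; pass other values through."""
    if token == "?":
        return Null.MISSING
    if token == ".":
        return Null.INAPPLICABLE
    return token


def parse_items(pairs):
    """Build a tag -> value dict, retaining items whose value is a null signal."""
    return {tag: interpret_value(value) for tag, value in pairs}
```

Because the item stays in the dict with a sentinel value, the decision
to omit it is left to an application that knows the relevant dictionary.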


>comcifs mailing list

Peter Murray-Rust
Unilever Centre for Molecular Sciences Informatics
University of Cambridge,
Lensfield Road,  Cambridge CB2 1EW, UK
