Discussion List Archives


Re: Opinions on comments as part of the content

Thanks Brian,
We are clearly in almost complete agreement. Minor comments below.

At 10:32 07/03/2007, Brian McMahon wrote:
>courtesy; and even, in Peter's case, efforts are made to retain
>them by applying sensible heuristics if content is re-ordered.
>I think such applications have value, but there is no requirement
>on them by the standard to do so, nor do I believe there should be.


>PMR> The first observation is that CIF does not define an abstract data
>PMR> model (e.g. the Infoset in XML) so it is difficult to agree on what a
>PMR> parser should do other than confirm validity to the CIF standard.
>PMR> ... We have written a CIF parser (CIFDOM) which parses CIFs into an
>PMR> abstract data model which can be exposed in XML syntax and conforms to
>PMR> Document Object models (DOM).
>This is a good point. In practice CIFs map to different document object
>models: small-molecule CIFs submitted to Acta C/E represent a
>scientific article reporting one or more discrete structures. msCIF
>represents an aggregate of structural descriptions of one or more
>compounds/phases, several of which may be overlaid to describe
>modulated structures as superpositions of substructures. PDB mmCIFs
>represent single-compound database records. symCIFs represent
>tabulations of symmetry properties for different space groups. These
>models aren't mutually exclusive; they will have significant overlaps.
>But I think we need to work at formalising the abstract structures,
>classifying different models and mapping to appropriate DOMs if it
>turns out that it's necessary to do so.

Fully agreed. I believe that it is possible to have a single DOM for 
the non-STAR CIFs, based on DDL1. (Is it still called that?) The more 
complex arrangements should be extensions of this. There will be a 
need for a language to define the different conventions. (My own CML 
has an attribute 'convention' precisely for this purpose). Some 
signal is required because it is non-trivial to work out what 
document model is mandated by a given lexical CIF.

>PMR> In doing this we have had to make
>PMR> various interpretations of the standard, while trying to
>PMR> retain the goodwill of authors and readers ... We apply
>PMR> the following from the standard ...
>PMR> I would be grateful to know if any COMCIFer has a different
>PMR> view of these.
>I agree with Peter's interpretations. [I would like to see some
>applications developed that did apply various styles of "pretty
>printing" (and might one day find time to work on them myself),
>for there is a certain aesthetic and usefulness in working with
>pure-ASCII files in old-fashioned text-editing environments.
>But I accept that these are cosmetic requirements only, and
>any parser is at liberty to normalise whitespace that is used to
>separate data items.]
>PMR> Does this mean one or more comments before the first block? I don't
>PMR> think the standard defines a CIF header comment.
>The 1.1 specification recommends that a CIF begin with a comment string
>#\#CIF_1.1 to act as a version indicator, and incidentally as a magic
>number to help filetype applications supported by an operating system
>to identify the file type. Only a recommendation (since it was absent
>from the initial spec), and I'd be interested in whether applications do
>make use of this string when it is found.

I think this would be valuable. There are an increasing number of 
applications which need to 'guess' filetype and magic signals are valuable.
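
As an illustration of how an application might use the recommended header comment, here is a minimal sketch in Python. The function name `sniff_cif_version` is hypothetical and not part of any existing CIF library, and treating whatever follows `#\#CIF_` on the first line as the version string is an assumption of this sketch rather than something the specification text quoted above spells out.

```python
# The recommended magic string is the characters  #\#CIF_1.1  at the start
# of the file; the backslash is doubled here only for Python string syntax.
CIF_MAGIC_PREFIX = "#\\#CIF_"


def sniff_cif_version(text):
    """Return the version from a leading magic comment, or None if absent.

    Hypothetical helper: it only inspects the first line of the file.
    """
    first_line = text.split("\n", 1)[0].rstrip("\r")
    if first_line.startswith(CIF_MAGIC_PREFIX):
        version = first_line[len(CIF_MAGIC_PREFIX):].strip()
        return version or None
    return None
```

Because the comment is only a recommendation, a `None` result says nothing about validity: a file without the magic string may still be a perfectly good CIF.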

>Other than that, comments may occur before the first block, but
>without any specific semantics.
>PMR> This is one of a small number of topics which could benefit from
>PMR> clarification (and in some cases an arbitrary ruling):
>PMR> * data blocks. Is the value of the data block case-sensitive? Are
>PMR> data block ids which differ only in case identical and therefore
>PMR> illegal? Is it allowed to have an empty string as id, or any mixture
>PMR> of non-whitespace CIF chars (e.g. punctuation only)?
>    "The file may be partitioned into multiple data blocks by the
>     insertion of further data-block headers. Data-block headers
>     are case-insensitive (that is, two headers differing only in
>     whether corresponding letter characters are upper or lower
>     case are considered identical). Within a single data file
>     identical data-block headers are not permitted."
>           (International Tables G, p.21)
>An empty block code is not permitted in a datablock header (i.e.
>data_ on its own is invalid); any mixture of non-whitespace
>characters is allowed (probably unfortunate, but that's the way
>it is).

Thanks - I should have picked this up - I tend to refer to the CIF spec.
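
The three rules Brian quotes (case-insensitive headers, no duplicates within a file, no empty block code) can be sketched as a small check. This is only an illustration in Python: `check_block_headers` is a hypothetical name, the headers are assumed to have been extracted from the file already, and case folding is done with `str.lower()`, which is adequate for ASCII block codes.

```python
def check_block_headers(headers):
    """Return a list of problems found in headers such as 'data_foo'.

    Hypothetical helper; it does not tokenise a CIF, it only inspects
    header strings that some other code has already extracted.
    """
    problems = []
    seen = {}  # lower-cased block code -> first header that used it
    for header in headers:
        if header[:5].lower() != "data_":
            problems.append(f"{header!r}: not a data-block header")
            continue
        code = header[5:]
        if not code:
            # 'data_' on its own is invalid.
            problems.append(f"{header!r}: empty block code")
            continue
        key = code.lower()
        if key in seen:
            # Headers differing only in case count as identical.
            problems.append(f"{header!r}: duplicates {seen[key]!r}")
        else:
            seen[key] = header
    return problems
```

Note that a punctuation-only code such as `data_!!!` passes this check, in line with "any mixture of non-whitespace characters is allowed".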

>PMR> * data_global. This is so widespread that it would be useful to have
>PMR> at least an agreed heuristic for it.
>PMR> * multi-data-block CIFs. Is it legitimate to split them? If so,
>PMR> can/should data_global be copied into each?
>For Acta papers, the *heuristic* is that data_global contains information
>that applies to all the following data blocks - in practice, the title,
>authors, discursive text of the paper, while succeeding data blocks contain
>the experimental and derived data for each structure. That's a reasonably
>reliable heuristic for that particular document model, but need not apply,
>say, to a modulated-structure DOM. For a long time I've thought that we need
>to formalise the relationships between data blocks (see e.g.

Yes - this is very useful - it had slipped my memory that we had 
preserved this discussion.

See also comments to Joe's mail
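
To make the splitting question concrete, here is a rough Python sketch of the Acta-style heuristic, in which items from data_global are copied into every other block when a multi-block CIF is split. Everything here is an assumption for illustration: the function name `split_with_global`, the representation of a CIF as a dict of block code to dict of items, and the rule that a block's own value wins over the copied one. It would not be appropriate for other document models such as modulated structures.

```python
def split_with_global(blocks, global_name="global"):
    """Split parsed blocks, copying data_global items into each one.

    `blocks` maps block code -> {data name: value}. Hypothetical sketch
    of one heuristic only; it is not mandated by the CIF standard.
    """
    shared = {}
    for name, items in blocks.items():
        if name.lower() == global_name:  # block codes are case-insensitive
            shared = items
    return {
        # Items in the block itself override those copied from data_global.
        name: {**shared, **items}
        for name, items in blocks.items()
        if name.lower() != global_name
    }
```

For example, a file with blocks `global`, `I` and `II` would yield two stand-alone blocks, each carrying the title and author items from `global` alongside its own data.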

>PMR> * what are the semantics of '?' and '.'?
>    "The more important use of the null data type is its application
>     to the meta characters ` ?' (query) and ` .' (full point) that
>     may occur as values associated with any data name and therefore
>     have no specific type. ...
>     The substitution of the query character ` ?' in place of a data
>     value is an explicit signal that an expected value is missing
>     from a CIF. This `missing-value signal' may be used instead of
>     omitting an item (i.e. its tag and value) entirely from the file,
>     and serves as a reminder that the item would normally be present.
>     The substitution of the full-point character ` .' in place of
>     a CIF data value serves two similar, but not identical, purposes.
>     If it is used in looped lists of data it is normally a signal that
>     a value in a particular packet (i.e. a value in the row of the
>     table) is `inapplicable' or `inappropriate'. In some CIF
>     applications involving access to a data dictionary it is used to
>     signal that the default value of the item is defined in its
>     definition in the dictionary. Consequently, the interpretation
>     of this signal is an application-specific matter and its use must
>     be determined according to the application."
>           (International Tables G, p.24)
>PMR> Is it legitimate to delete an item of the form: _foo ?
>PMR> or does it convey information?
>It conveys information (though probably only the person who put it there
>knows what). Of course, an application that is expressly validating
>against a dictionary might choose to omit it.

Thanks - I have also noted some examples in an earlier reply.
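
One way a parser could keep the distinction between '?' and '.' visible to applications is to map them to distinct marker objects instead of plain strings or a single null. The Python sketch below is only an illustration under stated assumptions: the names `Null`, `interpret_value` and `strip_missing` are hypothetical, and the `quoted` flag assumes the tokeniser can tell an unquoted `?` from a quoted one, with only the unquoted form acting as a signal.

```python
from enum import Enum


class Null(Enum):
    MISSING = "?"                  # an expected value is missing
    INAPPLICABLE_OR_DEFAULT = "."  # meaning depends on the application


def interpret_value(raw, quoted=False):
    """Map a raw CIF value to a Null marker or return it unchanged."""
    if not quoted:
        if raw == "?":
            return Null.MISSING
        if raw == ".":
            return Null.INAPPLICABLE_OR_DEFAULT
    return raw


def strip_missing(items):
    """Drop items whose value is the '?' signal.

    This discards information, so it is a separate, explicit step that
    an application opts into; it is not part of parsing.
    """
    return {tag: v for tag, v in items.items() if v is not Null.MISSING}
```

Keeping `strip_missing` out of the parser reflects the point made above: `_foo ?` conveys information, so deleting it should be a decision of the application, for instance one validating against a dictionary.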


>comcifs mailing list

Peter Murray-Rust
Unilever Centre for Molecular Sciences Informatics
University of Cambridge,
Lensfield Road,  Cambridge CB2 1EW, UK
