Discussion List Archives


Re: CIF Infoset

Peter asks some interesting questions.  I do not propose to answer
them in detail here.  However, I should point out that interpretation
of a given CIF may require 4 sets of documents:

  1.  The CIF itself.
  2.  The dictionary or dictionaries defining the tags used in the CIF.
  3.  The relevant DDLs.
  4.  The CIF specification:
       http://www.iucr.org/iucr-top/cif/spec/version1.1/index.html

Many of Peter's questions are answered in the specification.

The infoset concept is useful, but be warned that the appropriate
handling of information depends on the context within which you are
working, regardless of whether you are using CIF or using XML or
the PDB format.  For an application intended to just get at the data,
comments may be discarded, while for an application intended to reformat
the presentation of the data, comments are highly significant
information.  Similarly, the particular form of quoting, the
distinction between "." and "?", etc. may or may not be
significant.  If the application in question is, say, a
refinement program that just needs to read CIFs to extract
expected crystallographic data, then construction of the "infoset"
from a CIF is particularly simple.  More demanding applications,
e.g. in CIF validation and publication suites, may need to deal
with more subtle data and metadata questions.
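
To make the context dependence concrete, here is a minimal Python sketch
(an illustration only, not taken from any existing CIF library).  It
assumes a drastically simplified CIF subset: no loop_, no semicolon text
fields, single-quoted strings only.  The names read_cif, keep_comments,
keep_placeholders, UNKNOWN and INAPPLICABLE are hypothetical.  The same
file yields a different "infoset" depending on what the application asks
the reader to keep.

```python
import re

# Simplified token pattern (an assumption, not the full CIF 1.1 grammar):
# a comment, a single-quoted string, or a run of non-whitespace characters.
TOKEN = re.compile(r"(?P<comment>#[^\n]*)|'(?P<quoted>[^'\n]*)'|(?P<bare>\S+)")

UNKNOWN = object()       # hypothetical sentinel for a bare "?"
INAPPLICABLE = object()  # hypothetical sentinel for a bare "."


def read_cif(text, keep_comments=False, keep_placeholders=False):
    """Return (blocks, comments) for a loop-free, simplified CIF.

    blocks maps a data block name to a dict of tag -> value.
    """
    blocks, comments = {}, []
    current, pending_tag = None, None
    for m in TOKEN.finditer(text):
        if m.group("comment") is not None:
            # A data-extraction program drops comments; a reformatter keeps them.
            if keep_comments:
                comments.append(m.group("comment"))
            continue
        quoted, bare = m.group("quoted"), m.group("bare")
        if pending_tag is None:
            # Expect a data block header or a tag here.
            if bare is not None and bare.lower().startswith("data_"):
                current = blocks.setdefault(bare[5:], {})
            elif bare is not None and bare.startswith("_") and current is not None:
                pending_tag = bare
            else:
                raise ValueError(f"unexpected token: {m.group(0)!r}")
            continue
        # Otherwise this token is the value for pending_tag.
        if quoted is not None:
            value = quoted  # a quoted '?' or '.' is an ordinary string
        elif bare == "?":
            value = UNKNOWN if keep_placeholders else None
        elif bare == ".":
            value = INAPPLICABLE if keep_placeholders else None
        else:
            value = bare
        if value is not None:
            current[pending_tag] = value
        pending_tag = None
    return blocks, comments
```

With the defaults, an item whose value is a bare "?" or "." is simply
omitted, which is one possible policy for a program that only wants the
data; treating "." this way is a simplification.  A validation or
publication tool would presumably call it with both flags set to True.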

  -- Herbert

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya@dowling.edu
=====================================================

On Tue, 17 Aug 2004, Peter Murray-Rust wrote:

> I have found it necessary to have a formal specification of the abstract
> data after parsing a CIF. In XML terminology this is called the Infoset,
> see http://www.w3.org/TR/xml-infoset. Essentially the infoset ignores
> lexical variants and can conveniently be thought of as an in-memory data
> model (though this is strictly an oversimplification).
>
> As an example consider the CIFs
>
> #start
> data_a
> _foo
> 1
> #end
>
> and
>
> #start
> data_a               _foo 1 #end
>
> They are lexical variants but have an identical infoset.  However unless
> the infoset concept and details are formally defined, program implementers
> cannot rely on the interpretation and CIF semantics are sufficiently fuzzy
> that it is not currently possible to build a consistent data model.
>
> There are many advantages of an infoset:
> - programs can be separated into parsers and applications. There is a clear
> semantic interface between them.
> - programmers have a consistent approach to data models
> - CIFs can be normalized and canonicalized so that it is possible to say
> whether two CIFs map onto the same infoset.
> - CIFs can be roundtripped to check the integrity of software
> - It is clear what information is acceptable to the infoset.
>
> I have encountered a number of semantic questions where I am unclear what
> the infoset would look like. I list some in detail and others briefly. I
> hope that COMCIFS will see this as an area which needs further definition.
> My concerns at present are restricted to DDL1.
>
> Q. Are comments part of the infoset? My current belief is no, but certain
> comments, e.g.
> #\#CIF_1.1
> convey important information. Also some comments such as
>
> # Supplementary Material (ESI) for Organic & Biomolecular Chemistry
> # This journal is © The Royal Society of Chemistry 2003
>
> may suffer by being lost.
>
> Q. Does the presence or absence of a dictionary affect the infoset? (it is
> formally impossible to deconvolute namespaces or categories without a
> dictionary) Moreover defaults, etc (see below) depend on a dictionary. If
> the presence of a dictionary is important, is it an error to have a CIF
> without a dictionary?
>
> Q Should the (a) fact (b) manner of quoting be preserved in the infoset?
> The specification suggests that '12' and 12 should be interpreted
> differently in certain circumstances, but I cannot work out which and how.
> (The type of a data item is defined by the dictionary entry char/numb -
> does the quoting overrule this? If not, what is its role?)
>
> Q Is the order of data items and loops in a data_ block unimportant? I
> believe so, but I cannot find it explicitly stated. Assuming this, the files:
>
> data_a
> _foo f
> _bar b
>
> and
>
> data_a
> _bar b
> _foo f
>
> should have identical infosets.
>
> Q is the order of names in a loop_ header important? Do
>
> data_a
> loop_ _foo _bar
> 1 2 3 4
>
> and
>
> data_a
> loop_  _bar _foo
> 2 1 4 3
>
> have identical infosets?
>
> Q Is the order of "rows" in a loop_ unimportant? Do
> data_a
> loop_ _foo _bar
> 1 2
> 3 4
>
> and
> data_a
> loop_ _foo _bar
> 3 4
> 1 2
>
> have identical infosets? (In a relational model they would).
>
> Q Does data_global have any semantics? I suspect that formally it does not,
> but it seems in widespread use:
>
> data_global
> _foo foo
>
> data_a
> _bar a
>
> data_b
> _bar b
>
> seems to have the semantics equivalent to:
>
> data_a
> _foo foo
> _bar a
>
> data_b
> _foo foo
> _bar b
>
> I would find it valuable to have a clear ruling from COMCIFS on this point.
> I believe that technically data_ blocks have no interblock semantics but
> this seems to be slipping.
>
> Q how should ? be treated in the infoset?
> "The value '?' represents an unknown value of the quantity. It appears
> typically in template files to indicate data items whose value should be
> supplied by an application or user; or it may appear in the output from an
> application extracting information from a CIF in response to a request list."
> I interpret this to mean that the infoset has to hold this as a special
> value of unknown. This is a considerable burden on the infoset implementer
> if it is never used. In practice I suspect many CIF practitioners use it as
> a lexical template for hand editing or a visual prompt that a data value
> should be entered. However strictly it is an indication to the reader of
> the information (machine as well as human) that the data value is
> "unknown". Unknown values are surprisingly difficult to implement and have
> caused problems in understanding XML Schema. Personally I would suggest
> there is no distinction between:
>
> data_a
> _foo ?
>
> and
>
> data_a
>
> Q how is '.' to be interpreted?
> "
> The value '.' represents an inapplicable value of the quantity. It may be
> inappropriate or meaningless (as in the case of
> _refine_ls_hydrogen_treatment reported for an inorganic salt containing no
> hydrogens). It may represent a value omitted intentionally, for some good
> reason. Or, in some circumstances (described in the _definition portion of
> the appropriate CIF dictionary entry), it may indicate the use of a default
> value."
>
> This is extremely difficult to interpret in the infoset. The first part
> suggests that the limitations come from a non-rectangular loop_ - it is
> simply there so the syntax is not violated. The default value cannot be
> applied without a program that understands and implements dictionary
> entries. How common is this? (I suspect fairly rare.) If so, I would argue
> that the default approach is dangerous and should be phased out.
>
> ====
>
> I have a number of other semantic concerns, but I hope that these show the
> value of an explicitly defined infoset, and the semantic clarifications
> that are necessary to adopt it.
>
> P.
>
>
>
> Peter Murray-Rust
> Unilever Centre for Molecular Informatics
> Chemistry Department, Cambridge University
> Lensfield Road, CAMBRIDGE, CB2 1EW, UK
> Tel: +44-1223-763069
>
> _______________________________________________
> comcifs mailing list
> comcifs@iucr.org
> http://scripts.iucr.org/mailman/listinfo/comcifs
>
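
The ordering questions in the quoted message can be made concrete with a
small Python sketch.  It is an illustration under stated assumptions, not
a ruling on CIF semantics: it assumes that item order, loop column order
and loop row order are all insignificant, that comments are discarded, and
that quoting is not preserved (so '12' and 12 compare equal).  It handles
only a simplified subset (no semicolon text fields, single quotes only).
The function names tokens, parse and canonical are hypothetical.

```python
import re

# Simplified token pattern (an assumption, not the full CIF 1.1 grammar).
TOKEN = re.compile(r"#[^\n]*|'([^'\n]*)'|(\S+)")


def tokens(text):
    """Yield (was_quoted, text) pairs, skipping comments."""
    for m in TOKEN.finditer(text):
        if m.group(0).startswith("#"):
            continue
        if m.group(1) is not None:
            yield True, m.group(1)
        else:
            yield False, m.group(2)


def is_tag(tok):
    return not tok[0] and tok[1].startswith("_")


def is_keyword(tok):
    word = tok[1].lower()
    return not tok[0] and (word.startswith("data_") or word == "loop_")


def parse(text):
    """Parse into {block: {"items": {tag: value}, "loops": [(names, rows)]}}."""
    toks = list(tokens(text))
    blocks, block, i = {}, None, 0
    while i < len(toks):
        quoted, word = toks[i]
        if not quoted and word.lower().startswith("data_"):
            block = blocks.setdefault(word[5:], {"items": {}, "loops": []})
            i += 1
        elif block is None:
            raise ValueError("content before the first data_ block")
        elif not quoted and word.lower() == "loop_":
            i += 1
            names, values = [], []
            while i < len(toks) and is_tag(toks[i]):
                names.append(toks[i][1])
                i += 1
            while i < len(toks) and not is_tag(toks[i]) and not is_keyword(toks[i]):
                values.append(toks[i][1])
                i += 1
            if not names or len(values) % len(names):
                raise ValueError("non-rectangular loop_")
            n = len(names)
            rows = [tuple(values[j:j + n]) for j in range(0, len(values), n)]
            block["loops"].append((names, rows))
        elif is_tag(toks[i]) and i + 1 < len(toks):
            block["items"][word] = toks[i + 1][1]
            i += 2
        else:
            raise ValueError(f"unexpected token: {word!r}")
    return blocks


def canonical(text):
    """Return an order-independent form suitable for equality comparison."""
    result = {}
    for name, block in parse(text).items():
        loops = []
        for names, rows in block["loops"]:
            # Sort (tag, value) pairs within each row, then sort the rows.
            loops.append(tuple(sorted(tuple(sorted(zip(names, r))) for r in rows)))
        items = tuple(sorted(block["items"].items()))
        result[name] = (items, tuple(sorted(loops)))
    return result
```

Under these assumptions, canonical(a) == canonical(b) answers "do these
two CIFs map onto the same infoset?" for the small examples quoted above.
If COMCIFS were to rule that, say, row order is significant, the inner
sorted() over rows would simply be dropped.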

