
CIF Infoset

I have found it necessary to have a formal specification of the abstract 
data after parsing a CIF. In XML terminology this is called the Infoset, 
see http://www.w3.org/TR/xml-infoset. Essentially the infoset ignores 
lexical variants and can conveniently be thought of as an in-memory data 
model (though this is strictly an oversimplification).

As an example consider the CIFs

#start
data_a
_foo
1
#end

and

#start
data_a               _foo 1 #end

They are lexical variants but have an identical infoset. However, unless 
the infoset concept and its details are formally defined, program 
implementers cannot rely on any one interpretation, and CIF semantics are 
at present too fuzzy to build a consistent data model.
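
As an illustration only, here is a minimal Python sketch of what "same 
infoset" could mean for the two files above. It is not a CIF parser: it 
assumes a tiny subset (data_ headers and _name/value pairs, no loop_, no 
quoting, no text fields), and the name parse_minimal_cif is hypothetical.

```python
def parse_minimal_cif(text):
    """Map a tiny CIF subset onto {block_name: {data_name: value}}.

    Assumptions of this sketch: no loop_, no quoted strings, no
    semicolon text fields; '#' always starts a comment.
    """
    tokens = []
    for line in text.splitlines():
        # Discard the comment part of the line, then split on whitespace.
        tokens.extend(line.split("#", 1)[0].split())

    infoset = {}
    block = None
    stream = iter(tokens)
    for token in stream:
        if token.lower().startswith("data_"):
            block = infoset.setdefault(token[5:], {})
        elif token.startswith("_"):
            if block is None:
                raise ValueError("data item outside a data_ block")
            value = next(stream, None)
            if value is None:
                raise ValueError(f"no value for {token}")
            block[token] = value
        else:
            raise ValueError(f"unexpected token {token!r}")
    return infoset


variant_1 = "#start\ndata_a\n_foo\n1\n#end\n"
variant_2 = "#start\ndata_a               _foo 1 #end\n"

# Both lexical variants collapse onto the same abstract data.
print(parse_minimal_cif(variant_1) == parse_minimal_cif(variant_2))
```

The sketch silently takes a position on several of the questions below 
(comments are dropped, item order is ignored because a dict is used), 
which is exactly why a formal definition is needed.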

There are many advantages of an infoset:
- programs can be separated into parsers and applications. There is a clear 
semantic interface between them.
- programmers have a consistent approach to data models
- CIFs can be normalized and canonicalized so that it is possible to say 
whether two CIFs map onto the same infoset.
- CIFs can be roundtripped to check the integrity of software
- It is clear what information is acceptable to the infoset.

I have encountered a number of semantic questions where I am unclear what 
the infoset would look like. I list some in detail and others briefly. I 
hope that COMCIFS will see this as an area which needs further definition. 
My concerns at present are restricted to DDL1.

Q. Are comments part of the infoset? My current belief is no, but certain 
comments, e.g.

#\#CIF_1.1

convey important information. Also, some comments such as

# Supplementary Material (ESI) for Organic & Biomolecular Chemistry
# This journal is © The Royal Society of Chemistry 2003

carry information that would suffer by being lost.

Q. Does the presence or absence of a dictionary affect the infoset? (It is 
formally impossible to deconvolute namespaces or categories without a 
dictionary.) Moreover, defaults etc. (see below) depend on a dictionary. If 
the presence of a dictionary is important, is it an error to have a CIF 
without a dictionary?

Q. Should (a) the fact and (b) the manner of quoting be preserved in the infoset? 
The specification suggests that '12' and 12 should be interpreted 
differently in certain circumstances, but I cannot work out which and how. 
(The type of a data item is defined by the dictionary entry char/numb - 
does the quoting overrule this? If not, what is its role?)
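
To make the choice concrete, here is a hedged Python sketch of an infoset 
value that records whether the lexical form was quoted. The names Value, 
read_value and same_ignoring_quotes are hypothetical, and the sketch 
assumes a single token delimited by matching ' or " characters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    text: str      # the characters between the delimiters, if any
    quoted: bool   # True if the token was written as '...' or "..."


def read_value(token):
    """Strip matching quote delimiters but remember that they were there."""
    if len(token) >= 2 and token[0] == token[-1] and token[0] in "'\"":
        return Value(token[1:-1], True)
    return Value(token, False)


def same_ignoring_quotes(a, b):
    """Comparison an infoset would use if quoting were declared irrelevant."""
    return a.text == b.text


quoted, bare = read_value("'12'"), read_value("12")
print(quoted == bare)                      # quoting preserved in the infoset
print(same_ignoring_quotes(quoted, bare))  # quoting discarded
```

Whether the infoset compares values the first way or the second is the 
ruling being asked for; the sketch does not settle it.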

Q. Is the order of data items and loops in a data_ block unimportant? I 
believe so, but I cannot find it explicitly stated. Assuming this, the files:

data_a
_foo f
_bar b

and

data_a
_bar b
_foo f

should have identical infosets.

Q. Is the order of names in a loop_ header important? Do

data_a
loop_ _foo _bar
1 2 3 4

and

data_a
loop_  _bar _foo
2 1 4 3

have identical infosets?

Q. Is the order of "rows" in a loop_ unimportant? Do

data_a
loop_ _foo _bar
1 2
3 4

and

data_a
loop_ _foo _bar
3 4
1 2

have identical infosets? (In a relational model they would).
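
A sketch of the relational reading in Python, assuming that neither the 
column order nor the row order carries meaning. The function name 
loop_relation is hypothetical; each row becomes an unordered set of 
(name, value) pairs and the loop becomes a multiset of such rows.

```python
from collections import Counter


def loop_relation(names, values):
    """Model a loop_ as a multiset of rows, each row a set of name/value pairs."""
    width = len(names)
    if width == 0 or len(values) % width != 0:
        raise ValueError("loop_ values do not fill a whole number of rows")
    rows = (values[i:i + width] for i in range(0, len(values), width))
    # frozenset forgets column order; Counter forgets row order but
    # still counts repeated rows.
    return Counter(frozenset(zip(names, row)) for row in rows)


original = loop_relation(["_foo", "_bar"], ["1", "2", "3", "4"])
columns_swapped = loop_relation(["_bar", "_foo"], ["2", "1", "4", "3"])
rows_swapped = loop_relation(["_foo", "_bar"], ["3", "4", "1", "2"])

print(original == columns_swapped)
print(original == rows_swapped)
```

If COMCIFS rules that row order is significant, the Counter would have to 
be replaced by a list, and the last comparison would then differ.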

Q. Does data_global have any semantics? I suspect that formally it does not, 
but it seems to be in widespread use:

data_global
_foo foo

data_a
_bar a

data_b
_bar b

seems to have semantics equivalent to:

data_a
_foo foo
_bar a

data_b
_foo foo
_bar b

I would find it valuable to have a clear ruling from COMCIFS on this point. 
I believe that technically data_ blocks have no interblock semantics but 
this seems to be slipping.
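
For concreteness, the apparent convention could be sketched in Python as 
follows. This is only my reading of common usage, not anything defined by 
the specification; apply_global is a hypothetical name, and letting an 
item in a block override the same item in data_global is an assumption.

```python
def apply_global(blocks):
    """Copy the items of data_global into every other data block.

    `blocks` maps block names (without the data_ prefix) to dicts of
    data items. Items in a block win over items in data_global.
    """
    shared = blocks.get("global", {})
    return {
        name: {**shared, **items}
        for name, items in blocks.items()
        if name != "global"
    }


with_global = {
    "global": {"_foo": "foo"},
    "a": {"_bar": "a"},
    "b": {"_bar": "b"},
}
expanded = {
    "a": {"_foo": "foo", "_bar": "a"},
    "b": {"_foo": "foo", "_bar": "b"},
}
print(apply_global(with_global) == expanded)
```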

Q. How should '?' be treated in the infoset?
"The value '?' represents an unknown value of the quantity. It appears 
typically in template files to indicate data items whose value should be 
supplied by an application or user; or it may appear in the output from an 
application extracting information from a CIF in response to a request list."
I interpret this to mean that the infoset has to hold this as a special 
value of unknown. This is a considerable burden on the infoset implementer 
if it is never used. In practice I suspect many CIF practitioners use it as 
a lexical template for hand editing or a visual prompt that a data value 
should be entered. However strictly it is an indication to the reader of 
the information (machine as well as human) that the data value is 
"unknown". Unknown values are surprisingly difficult to implement and have 
caused problems in understanding XML Schema. Personally I would suggest 
there is no distinction between:

data_a
_foo ?

and

data_a
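
Under that suggestion, the mapping to the infoset could be as simple as 
the following Python sketch. The name drop_unknown is hypothetical, and 
the sketch assumes the values are raw unquoted tokens, so that a quoted 
'?' (a literal question mark) has already been told apart elsewhere.

```python
def drop_unknown(block):
    """Treat an item whose value is the bare token ? as if it were absent."""
    return {name: value for name, value in block.items() if value != "?"}


with_placeholder = {"_foo": "?"}
empty_block = {}

print(drop_unknown(with_placeholder) == empty_block)
```

The alternative, holding an explicit "unknown" marker for every such 
item, is the burden on the implementer described above.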

Q. How is '.' to be interpreted?
"The value '.' represents an inapplicable value of the quantity. It may be 
inappropriate or meaningless (as in the case of 
_refine_ls_hydrogen_treatment reported for an inorganic salt containing no 
hydrogens). It may represent a value omitted intentionally, for some good 
reason. Or, in some circumstances (described in the _definition portion of 
the appropriate CIF dictionary entry), it may indicate the use of a default 
value."

This is extremely difficult to interpret in the infoset. The first part 
suggests that the limitations come from a non-rectangular loop_ - it is 
simply there so the syntax is not violated. The default value cannot be 
applied without a program that understands and implements dictionary 
entries. How common is this? (I suspect fairly rare.) If so, I would argue 
that the default approach is dangerous and should be phased out.
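
The dependency on a dictionary can be seen in a small Python sketch. The 
names resolve_dot, INAPPLICABLE, _example_item and _other_item are all 
hypothetical, as is the idea that a dictionary reduces to a simple table 
of defaults; the point is only that the same '.' yields different infoset 
values depending on what the reading program knows.

```python
INAPPLICABLE = object()  # marker for "no meaningful value"


def resolve_dot(name, raw, defaults):
    """Interpret a raw value, consulting dictionary defaults for '.'.

    `defaults` stands in for whatever a real dictionary-aware program
    would extract from the dictionary entries.
    """
    if raw != ".":
        return raw
    return defaults.get(name, INAPPLICABLE)


defaults = {"_example_item": "0"}  # invented default, for illustration

print(resolve_dot("_example_item", ".", defaults))            # the default
print(resolve_dot("_example_item", ".", {}) is INAPPLICABLE)  # no dictionary
print(resolve_dot("_other_item", ".", defaults) is INAPPLICABLE)
```

A program without the dictionary and a program with it would therefore 
build different infosets from the same file.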

====

I have a number of other semantic concerns, but I hope that these show the 
value of an explicitly defined infoset, and the semantic clarifications 
that are necessary to adopt it.

P.



Peter Murray-Rust
Unilever Centre for Molecular Informatics
Chemistry Department, Cambridge University
Lensfield Road, CAMBRIDGE, CB2 1EW, UK
Tel: +44-1223-763069

_______________________________________________
comcifs mailing list
comcifs@iucr.org
http://scripts.iucr.org/mailman/listinfo/comcifs

