Discussion List Archives


Re: CIF Infoset

On Tue, 17 Aug 2004, Peter Murray-Rust wrote:

> Q. Are comments part of the infoset? My current belief is no, but
> certain comments (e.g. #\#CIF_1.1) convey important information. Also
> some comments such as
> # Supplementary Material (ESI) for Organic & Biomolecular Chemistry
> # This journal is © The Royal Society of Chemistry 2003
> may suffer by being lost

This question is more interesting than any answer. If infosets define
lexically equivalent files why ask this question? If there is a comment in
the file, then there should exist an infoset that can handle it - isn't
that the idea? Whether at an application level one chooses to use the
comments is a different question.

StarBase (an application) *chooses* to interpret comments as lexical
whitespace and removes them in the tokenising phase.
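
A minimal sketch of that choice, in Python. This is not StarBase's code; the
pattern, the function name tokenise and the keep_comments flag are all
hypothetical, and quoted strings and text fields are deliberately ignored. It
only shows that dropping comments is a decision made in the tokeniser, and
that the same tokeniser could just as easily keep them.

```python
import re

# Hypothetical, simplified pattern: a comment starts at a '#' that begins a
# token and runs to the end of the line; anything else non-blank is a word.
_TOKEN = re.compile(r"(?P<comment>#[^\n]*)|(?P<word>\S+)")

def tokenise(text, keep_comments=False):
    """Split text into tokens, treating comments as whitespace by default."""
    tokens = []
    for m in _TOKEN.finditer(text):
        if m.lastgroup == "comment":
            if keep_comments:
                tokens.append(("COMMENT", m.group()))
        else:
            tokens.append(("WORD", m.group()))
    return tokens
```

With keep_comments=False the comment leaves no trace in the token stream;
with keep_comments=True it survives for whatever layer wants to look at it.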

Does an infoset for HTML say that

<b><!--interpret hello as goodbye-->hello</b> is equivalent to
<b>hello</b>? If so, wouldn't that be somewhat dangerous?

> Q. Does the presence or absence of a dictionary affect the infoset? (it
> is formally impossible to deconvolute namespaces or categories without a
> dictionary) Moreover defaults, etc (see below) depend on a dictionary.

Why is the deconvolution of namespaces and categories (in the STAR syntax)
a lexical issue? That is a higher order issue. The datanames would have to
be identical (up to case) in either file, though their placement could be
very different.

> the presence of a dictionary is important, is it an error to have a CIF
> without a dictionary?

At the lexical level I am trying to see why you need a dictionary. If it is
a question of a value like "?" in one file versus another file with the
default value substituted, then these are very different things, and the
infoset should highlight them as such.

> Q Should the (a) fact (b) manner of quoting be preserved in the infoset?  
> The specification suggests that '12' and 12 should be interpreted
> differently in certain circumstances, but I cannot work out which and
> how.  (The type of a data item is defined by the dictionary entry
> char/numb - does the quoting overrule this? If not, what is its role?)

This is a throwback from the very first versions of STAR. It was a weak
attempt at some type information (only char and numb - woefully
inadequate). However it seems to me the declaration as char or numb had to
do with its lexical appearance - not its actual type. So if something is
numb, you expect it to be a number, irrespective of the lexical eye candy
provided by a variety of delimited string forms. If _cell_length is
declared numb, then '12.1' and 12.1 are equivalent in interpretation (at
the application level).
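
A small sketch of the distinction, in Python. The Value class, the TYPES
table and the interpret function are hypothetical stand-ins (TYPES plays the
role of a dictionary lookup), and numbers with standard uncertainties such as
12.1(2) are ignored. The lexical record keeps the fact of quoting; only the
application-level step, armed with the declared type, treats the two forms
alike.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Value:
    text: str     # the characters of the value, without any delimiters
    quoted: bool  # True if the value was written as a delimited string

# Hypothetical stand-in for a dictionary: dataname -> declared type.
TYPES = {"_cell_length": "numb"}

def interpret(name, value):
    """Application-level reading of a value, using the declared type."""
    if TYPES.get(name) == "numb":
        return float(value.text)
    return value.text

quoted = Value("12.1", quoted=True)     # written as '12.1'
bare = Value("12.1", quoted=False)      # written as 12.1
```

Here quoted != bare at the lexical level, while interpret("_cell_length", ...)
gives the same number for both.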

Mmmmmm. Now I can see why you think you need dictionaries. However, if the
above is what you are supposed to do with infosets then I have
misunderstood what their intent is. I guess that an infoset states that the
following two XML elements are lexically equivalent, <blah></blah> and
<blah />, but this is a well defined operation - like order independence
in STAR. I wouldn't *expect* an infoset to deal with the semantic
equivalences of delimited versus non delimited strings.

> Q Is the order of data items and loops in a data_ block unimportant?

By definition.
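
As a rough Python sketch of what order independence means for comparison:
the function block_infoset is hypothetical, a data block is assumed to be a
list of (dataname, value) pairs in file order, loops are left out, and
repeated datanames (an error within a block) are not checked. Datanames are
folded to lower case because they are identical up to case.

```python
def block_infoset(items):
    """Map case-folded datanames to values, discarding file order."""
    return {name.lower(): value for name, value in items}

first = [("_foo", "1"), ("_Bar", "2")]
second = [("_bar", "2"), ("_foo", "1")]
```

block_infoset(first) == block_infoset(second) holds even though the two
lists differ in order and in the case of one dataname.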

> Q is the order of names in a loop_ header important? Do

At any single level yes, but not through a full nesting (STAR not a CIF

> Q Is the order of "rows" in a loop_ unimportant? Do

Yes (in CIF).

> have identical infosets? (In a relational model they would).
> Q Does data_global have any semantics? I suspect that formally it does
> not, but it seems in widespread use:

data_global doesn't exist. global_ does (in STAR and CIF?). Its semantics
are well defined.

> _foo foo
> data_a
> _bar a
> data_b
> _bar b
> seems to have the semantics equivalent to:
> data_a
> _foo foo
> _bar a
> data_b
> _foo foo
> _bar b

Yes, furthermore 

 _foo foo

 _bar a

 _foo foo2

 _bar b

 seems to have the semantics equivalent to:

 _foo foo
 _bar a

 _foo foo2
 _bar b

> Q how should ? be treated in the infoset?

Strictly it should be treated as ? at the lexical level, i.e. TOKEN(UNKNOWN).
What you do with that at the higher level may require the dictionary.
Similarly (at a lexical level) "." should be left as it is. It is up to
the application to deal with it.
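
Sketched in Python, with hypothetical names throughout (lex_value, and the
token kinds DOT and VALUE; only UNKNOWN comes from the text above). The
function takes an unquoted value token and classifies it without consulting
any dictionary, so a file containing ? and a file with a default substituted
produce visibly different tokens.

```python
def lex_value(raw):
    """Classify an unquoted value token; no dictionary is consulted."""
    if raw == "?":
        return ("UNKNOWN", raw)
    if raw == ".":
        return ("DOT", raw)  # hypothetical kind; its meaning is left to the application
    return ("VALUE", raw)
```

Whether UNKNOWN or DOT is later replaced by a default is a separate,
higher-level step that this function knows nothing about.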

> Q how is '.' to be interpreted?

Again (I believe) an application level problem, not to be handled at a
lexical level.

> This is extremely difficult to interpret in the infoset. The first part
> suggests that the limitations come from a non-rectangular loop_ - it is
> simply there so the syntax is not violated. The default value cannot be
> applied without a program that understands and implements dictionary
> entries. How common is this? (I suspect fairly rare.) If so, I would
> argue that the default approach is dangerous and be phased out.

I suspect apart from Syd and me, almost no one sucks in dictionaries to
validate STAR/CIF file contents. Most just assume they know what they need
to and hope the definition of the data item has never changed.

Good luck, Peter.



Dr N. Spadaccini                                      Head of School

School of Computer Science &                voice: +(61 8) 6488 3452
Software Engineering                          fax: +(61 8) 6488 1089
The University of Western Australia      email: nick@csse.uwa.edu.au            
35 Stirling Highway                    w3: www.csse.uwa.edu.au/~nick
CRAWLEY, Perth,  WA  6009             
AUSTRALIA                               CRICOS Provider Code: 00126G
