Discussion List Archives


Re: CIF Infoset

On Tue, 17 Aug 2004, Peter Murray-Rust wrote:

> Q. Are comments part of the infoset? My current belief is no, but
> certain comments (e.g. #\#CIF_1.1) convey important information. Also
> some comments such as
> 
> # Supplementary Material (ESI) for Organic & Biomolecular Chemistry
> # This journal is © The Royal Society of Chemistry 2003
> 
> may suffer by being lost

This question is more interesting than any answer. If infosets define
lexically equivalent files why ask this question? If there is a comment in
the file, then there should exist an infoset that can handle it - isn't
that the idea? Whether at an application level one chooses to use the
comments is a different question.

StarBase (an application) *chooses* to interpret comments as lexical
whitespace and removes them in the tokenising phase.
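
As a minimal sketch of that choice (not StarBase's actual code), a
tokeniser can treat a comment as whitespace unless asked to keep it. The
pattern and the names TOKEN_RE, tokenise and keep_comments are
hypothetical, and the real STAR/CIF quoting rules and semicolon text
fields are deliberately left out.

```python
import re

# Hypothetical, simplified token pattern. It does not implement the full
# STAR/CIF quoting rules or semicolon-delimited text fields.
TOKEN_RE = re.compile(r"""
    (?P<quoted>'[^'\n]*'|"[^"\n]*")   # delimited string on a single line
  | (?P<comment>\#[^\n]*)             # comment runs to the end of the line
  | (?P<bare>[^\s]+)                  # any other run of non-whitespace
""", re.VERBOSE)


def tokenise(text, keep_comments=False):
    """Return (kind, lexeme) pairs; comments are dropped unless requested."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup
        if kind == "comment" and not keep_comments:
            continue  # the application chooses to treat the comment as whitespace
        tokens.append((kind, match.group()))
    return tokens
```

With keep_comments left at False, two files that differ only in their
comments produce the same token stream; with keep_comments=True the
comment survives as a token and the decision is deferred to a later stage.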

Does an infoset for HTML say that

<b><!--interpret hello as goodbye-->hello</b> is equivalent to
<b>hello</b>? If so, wouldn't that be somewhat dangerous?

> Q. Does the presence or absence of a dictionary affect the infoset? (it
> is formally impossible to deconvolute namespaces or categories without a
> dictionary) Moreover defaults, etc (see below) depend on a dictionary.

Why is the deconvolution of namespaces and categories (in the STAR syntax)
a lexical issue? That is a higher order issue. The datanames would have to
be identical (up to case) in either file, though their placement could be
very different.

> the presence of a dictionary is important, is it an error to have a CIF
> without a dictionary?

At the lexical level I am trying to see why you need a dictionary. If it
is a question of a value like "?" versus another file with the default
value substituted, then these are very different things, and the infoset
should highlight them as such.

> Q Should the (a) fact (b) manner of quoting be preserved in the infoset?  
> The specification suggests that '12' and 12 should be interpreted
> differently in certain circumstances, but I cannot work out which and
> how.  (The type of a data item is defined by the dictionary entry
> char/numb - does the quoting overrule this? If not, what is its role?)

This is a throwback from the very first versions of STAR. It was a weak
attempt at some type information (only char and numb - woefully
inadequate). However it seems to me the declaration as char or numb had to
do with its lexical appearance - not its actual type. So if something is
numb, you expect it to be a number, irrespective of the lexical eye candy
provided by a variety of delimited string forms. If _cell_length is
declared numb, then '12.1' and 12.1 are equivalent in interpretation (at
the application level).

Mmmmmm. Now I can see why you think you need dictionaries. However, if the
above is what you are supposed to do with infosets then I have
misunderstood what their intent is. I guess that an infoset states that
the following two XML elements are lexically equivalent, <blah></blah> and
<blah />, but this is a well defined operation - like order independence
in STAR. I wouldn't *expect* an infoset to deal with the semantic
equivalences of delimited versus non delimited strings.
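
The split between the two levels can be sketched as follows. This is only
an illustration: strip_delimiters and interpret are hypothetical names, the
declared type is passed in by hand rather than read from a dictionary, and
CIF numbers with a standard uncertainty such as 12.1(2) are not handled.

```python
def strip_delimiters(token):
    """Remove one matching pair of single or double quotes, if present."""
    if len(token) >= 2 and token[0] == token[-1] and token[0] in "'\"":
        return token[1:-1]
    return token


def interpret(token, declared_type):
    """Application-level reading of a token, given a dictionary-declared type.

    Assumption: declared_type is "numb" or "char", as in the discussion above.
    """
    text = strip_delimiters(token)
    if declared_type == "numb":
        return float(text)  # the quoting is ignored once the type is known
    return text
```

Here "'12.1'" and "12.1" remain different strings at the lexical level, yet
interpret gives the same number for both when the item is declared numb.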

> Q Is the order of data items and loops in a data_ block unimportant?

By definition.

> Q is the order of names in a loop_ header important? Do

At any single level yes, but not through a full nesting (STAR not a CIF
issue).

> Q Is the order of "rows" in a loop_ unimportant? Do

Yes (in CIF).

> have identical infosets? (In a relational model they would).
> 
> Q Does data_global have any semantics? I suspect that formally it does
> not, but it seems in widespread use:


data_global doesn't exist. global_ does (in STAR and CIF?). Its semantics
are well defined.

> 
> global_
> _foo foo
> 
> data_a
> _bar a
> 
> data_b
> _bar b
> 
> seems to have the semantics equivalent to:
> 
> data_a
> _foo foo
> _bar a
> 
> data_b
> _foo foo
> _bar b

Yes, furthermore 

 global_
 _foo foo

 data_a
 _bar a

 global_
 _foo foo2

 data_b
 _bar b

 seems to have the semantics equivalent to:

 data_a
 _foo foo
 _bar a

 data_b
 _foo foo2
 _bar b
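
A minimal sketch of that expansion, assuming the file has already been
parsed into an ordered list of (kind, name, items) tuples. That
representation and the name expand_globals are hypothetical. It also
assumes that global_ values accumulate down the file and that a data
block's own items override inherited ones.

```python
def expand_globals(blocks):
    """Fold global_ items into every data block that follows them.

    blocks is an ordered list of (kind, name, items) tuples, where kind is
    "global" or "data" and items maps data names to values.
    """
    current_globals = {}
    expanded = {}
    for kind, name, items in blocks:
        if kind == "global":
            # A later global_ overrides earlier values of the same name.
            current_globals.update(items)
        else:
            merged = dict(current_globals)
            merged.update(items)  # assumption: local items take precedence
            expanded[name] = merged
    return expanded


blocks = [
    ("global", None, {"_foo": "foo"}),
    ("data", "a", {"_bar": "a"}),
    ("global", None, {"_foo": "foo2"}),
    ("data", "b", {"_bar": "b"}),
]
result = expand_globals(blocks)
```

For the example above, result holds _foo foo for data_a and _foo foo2 for
data_b.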

> Q how should ? be treated in the infoset?

Strictly it should be treated as ? at the lexical level, i.e.
TOKEN(UNKNOWN). What you do with that at the higher level may require the
dictionary. Similarly (at a lexical level) "." should be left as it is. It
is up to the application to deal with it.

> Q how is '.' to be interpreted?

Again (I believe) an application level problem, not to be handled at a
lexical level.
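
One way to picture this, as a sketch only: the lexer tags the bare tokens
? and . and passes them on untouched, with no default substituted. Kind
and classify are hypothetical names, and INAPPLICABLE is the label assumed
here for "." (only TOKEN(UNKNOWN) is named above).

```python
from enum import Enum


class Kind(Enum):
    UNKNOWN = "?"        # bare ? in the file
    INAPPLICABLE = "."   # bare . in the file (label assumed for this sketch)
    VALUE = "value"      # everything else, including quoted '?' and '.'


def classify(token):
    """Tag a lexical token without interpreting or replacing it."""
    if token == "?":
        return (Kind.UNKNOWN, token)
    if token == ".":
        return (Kind.INAPPLICABLE, token)
    return (Kind.VALUE, token)
```

Whether UNKNOWN is later replaced by a dictionary default is then a
separate, application-level step that never alters the token itself.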

> This is extremely difficult to interpret in the infoset. The first part
> suggests that the limitations come from a non-rectangular loop_ - it is
> simply there so the syntax is not violated. The default value cannot be
> applied without a program that understands and implements dictionary
> entries. How common is this? (I suspect fairly rare.) If so, I would
> argue that the default approach is dangerous and be phased out.

I suspect apart from Syd and me, almost no one sucks in dictionaries to
validate STAR/CIF file contents. Most just assume they know what they need
to and hope the definition of the data item has never changed.

Good luck, Peter.

cheers

Nick

--------------------------------
Dr N. Spadaccini                                      Head of School

School of Computer Science &                voice: +(61 8) 6488 3452
Software Engineering                          fax: +(61 8) 6488 1089
The University of Western Australia      email: nick@csse.uwa.edu.au            
35 Stirling Highway                    w3: www.csse.uwa.edu.au/~nick
CRAWLEY, Perth,  WA  6009             
AUSTRALIA                               CRICOS Provider Code: 00126G




