Discussion List Archives


Re: CIF Infoset

On Aug 18 2004, David Brown wrote:

> I am also finding this interchange interesting.

Thanks - it is a deep issue and resulted in many thousands of emails in the
XML community.

The issue as I see it is whether CIFs are seen as machine-understandable 
documents or whether they are primarily to produce material for humans to 
read. (They can do both, but it requires work).

By machine-understandable I mean that an application unknown to the creator 
of the document can interpret it in a way consistent with the creator's 
intentions. This requires consistent interpretation of CIF semantics over 
all applications (hence the requirement for an infoset).

  I have only a couple of 
> short comments to add:
> Technically the comments are not part of the CIF, 

This is what I had believed from conversations and emails, but it is not
explicitly stated in the specs, and clearly this view is not universal.

Q. If my CIF parser automatically strips all comments from the document 
and, say, deposits them in a public repository, does anyone feel this is 
a problem?

Q. Is the CIF version "comment" a special case, and should it be preserved?
(I believe yes)
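
To make the two questions concrete, here is a minimal Python sketch of such a comment-stripping pass. It is only an illustration under stated assumptions: the function names are mine, the version "comment" is assumed to be a first line starting with #\#CIF_, and text fields are handled in a simplified way (any line starting with ";" toggles the field).

```python
def _strip_line(line: str) -> str:
    """Drop a trailing comment from one line, respecting quoted strings."""
    quote = None  # the quote character we are currently inside, if any
    for pos, ch in enumerate(line):
        at_token_start = pos == 0 or line[pos - 1] in " \t"
        if quote:
            # A closing quote must be followed by whitespace or end of line.
            nxt = line[pos + 1] if pos + 1 < len(line) else " "
            if ch == quote and nxt in " \t":
                quote = None
        elif ch in "'\"" and at_token_start:
            quote = ch
        elif ch == "#" and at_token_start:
            return line[:pos].rstrip()
    return line


def strip_comments(cif_text: str) -> str:
    """Remove comments but keep a leading version line (assumed form)."""
    out = []
    in_text_field = False  # inside a ;...; multi-line text field
    for i, line in enumerate(cif_text.splitlines()):
        if line.startswith(";"):
            in_text_field = not in_text_field
            out.append(line)
        elif in_text_field:
            out.append(line)
        elif i == 0 and line.startswith("#\\#CIF_"):
            out.append(line)  # the version "comment" is preserved
        else:
            out.append(_strip_line(line))
    return "\n".join(out)
```

Anything an author placed in an ordinary comment (a copyright notice, say) is lost by this pass, which is exactly the concern raised here.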

and in practice the 
> CIFs I handle for Acta Cryst. only contain template comments that are 
> designed to direct the author to include the required information.   When 
> CIF editors become more widely used, these comments will not be needed.

I believe that in practice many authors use comments to convey information 
they would wish to remain in the document (e.g. copyright). If so it seems 
that new dictionary items may be required.

> >> So if something is numb, you expect it to be a number, irrespective 
> >> of the lexical eye candy provided by a variety of delimited string 
> >> forms. If _cell_length is declared numb, then '12.1' and 12.1 are 
> >> equivalent in interpretation (at the application level).
> >
> >
> > The CIF specification indicates that these have different semantics. 
> > If this is now obsolete or deprecated it would make implementations 
> > simpler. 
> The quotes are important.  The dictionary gives, I believe, the default 
> type, but this can be overridden by the actual type.  Thus in the 
> example given above '12.1' would be read as char and an application 
> would have to decide whether it could convert this to numb.

I am now unclear about the role of char and numb. I assumed they were for 
data validation and application programmers. The first would ensure that a 
data value was always a number - thus I would have believed that

_cell_length_a 'too large to measure'

was a validation error. The second aspect is now a nightmare for 
application programmers. Firstly the infoset (the result of the parse) has 
to retain knowledge of whether the value is quoted. Then the application 
has to take different action depending on whether the value is quoted. The 
author submits that _cell_length_a '12.1' and _cell_length_a 12.1 have 
different meanings. (I cannot see what - as a programmer - I can or have to 
do). Formally if I get _cell_length_a '12.1' I would have to throw an 
exception "Cell_length_a is not a number, cannot continue".

  Quoting is 
> important - for example in the dictionaries '_cell_length_a' is not a 
> dataname, though _cell_length_a is.  This might occur in a CIF if someone 
> wrote:
> _exptl_special_details   '_exptl_density_obs unobserverable'

This is a separate issue. The quoting is simply an escape mechanism (as 
also for whitespace and multiline text). Any compliant CIF parser should 
have no problem parsing the above but I would not expect the infoset to 
retain the quotes or the fact of quoting.
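
The two positions on quoted numbers in this thread can be contrasted in a minimal Python sketch. The `Value` record and the `strict_quoting` switch are hypothetical names of mine, not part of any CIF specification or library, and the sketch ignores standard uncertainties such as 12.1(2).

```python
from dataclasses import dataclass


@dataclass
class Value:
    text: str     # the characters of the value, without delimiters
    quoted: bool  # whether the source used quote delimiters


def as_number(value: Value, strict_quoting: bool) -> float:
    """Interpret a value whose dictionary type is numb.

    strict_quoting=True: a quoted value is char and is rejected.
    strict_quoting=False: quotes are only an escape and are ignored.
    """
    if strict_quoting and value.quoted:
        raise ValueError(f"{value.text!r} is quoted, so it is char, not numb")
    try:
        return float(value.text)
    except ValueError:
        raise ValueError(f"{value.text!r} is not a number") from None
```

Under the strict reading the infoset must carry the `quoted` flag and every application must branch on it; under the escape-only reading the flag can be dropped after parsing, and 'too large to measure' is rejected either way.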

Similarly I would not expect a CIF writer to output any quotes unless they 
were required to escape something. (The other extreme is that a writer could 
quote everything to be safe). Unless the semantic meaning is clear I would 
suggest that quotes are only used to escape values.

> >> > > Q Does data_global have any semantics? I suspect that formally it 
> >> does
> >> > not, but it seems in widespread use:
> >>
> >>
> >> data_global doesn't exist. 
> >
> >
> > It does (frequently). (I appreciate that global_ is different and 
> > irrelevant to CIF/DDL1). data_global is very frequently used as the 
> > first block in a multiblock CIF to indicate information that (I 
> > assume) the author wishes to apply to all blocks. I think it either 
> > needs deprecating or accepting and formalising. 
> One of the commonly used templates (I believe that supplied by SHELX) 
> starts with a datablock called data_global but this is not a reserved 
> dataname and has no significance beyond being a legitimate form of 
> data_xxxxx.

Fully agreed.

  In the template it introduces a datablock that contains the 
> text part of a paper, with the numerical information supplied in one or 
> more additional blocks depending on how many structures are being 
> described. Since formally each datablock in CIF is independent, there is 
> no formal linkage between the data_global datablock and any of the other 
> datablocks that follow. 

Also fully agreed. That means that formally a processor can split a 
multiblock CIF into individual files without loss of information. In 
practice, unfortunately, it cannot. I would suggest that the use of 
semantically independent data_ blocks is much safer than data_global. The 
extra information per block is trivial.
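
The formal independence of blocks can be illustrated with a minimal Python sketch that splits a multiblock CIF at its data_ headers. The function name is mine, and I assume each header starts its own line and that text fields are toggled by a leading ";"; anything before the first header is dropped.

```python
def split_blocks(cif_text: str) -> dict:
    """Map each block code to the text of its data block."""
    blocks = {}
    name = None
    in_text_field = False  # inside a ;...; multi-line text field
    for line in cif_text.splitlines():
        if line.startswith(";"):
            in_text_field = not in_text_field
        elif not in_text_field and line.lower().startswith("data_"):
            name = line.split()[0][5:]  # block code after "data_"
            blocks[name] = []
        if name is not None:
            blocks[name].append(line)
    return {code: "\n".join(lines) for code, lines in blocks.items()}
```

After such a split, a structure block no longer carries anything that the author placed in data_global, which is the practical loss of information referred to above.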

 As Nick points out, global_ is defined in STAR, 
> though not in the current version of CIF.  The name is currently 
> reserved in CIF in case we wish to use it later.

Yes. I have not confused global_ with data_global.

> > "." is worse because the spec can be interpreted as requiring the 
> > implementer to insert the default value from the dictionary. At one 
> > stage this would be interpreted to mean that unless specified all 
> > extinction corrections were, by default, Zachariasen. Defaults, and 
> > their insertion, have to be explicitly specified. 
> I agree there is a problem.  In working through dictionary definitions 
> we are trying to remove the default values and in my view "." should 
> never be used to indicate a default - it should only mean 'this item has 
> no physical meaning in the present context'.  One good example of where 
> defaults make sense is in  _atom_site_occupancy.  In a straightforward 
> structure report this item may not be given, but it certainly does not 
> imply that it is irrelevant or not known.  It would be assumed by any 
> application to have the value of 1.0 unless otherwise stated.  A value 
> of '.' for this item should not indicate the default - if the item is 
> present in a CIF the value should be given explicitly even if it is the 
> same as the default.  A value of '.' says that it makes no sense to talk 
> about the occupancy of this atom (it might occur if the atom in question 
> was a dummy atom, which is allowed).

This is very helpful.
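
As I understand the rule described above, an application would treat the three cases for _atom_site_occupancy roughly as in this Python sketch. The function name and the use of a plain dict for one atom are my assumptions; "?" is included as the usual CIF marker for an unknown value.

```python
from typing import Optional


def occupancy(atom: dict) -> Optional[float]:
    """Resolve the occupancy of one atom site.

    Absent item -> the default of 1.0 is assumed.
    '.'         -> no physical meaning here (e.g. a dummy atom): None.
    '?'         -> unknown: None. The default is never substituted.
    """
    raw = atom.get("_atom_site_occupancy")
    if raw is None:
        return 1.0
    if raw in (".", "?"):
        return None
    return float(raw)
```

The point of the sketch is that the default applies only when the item is absent, never as a replacement for an explicit ".".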

> >
> >
> >>
> >> I suspect apart from Syd and I, almost no one sucks in dictionaries to
> >> validate STAR/CIF file contents. Most just assume they know what they 
> >> need
> >> to and hope the definition of the data item has never changed.. 
> >
> There are two editor/browsers available and another that a couple of 
> students are writing for me that do read in the dictionary before 
> reading in the CIF and use the dictionary for validating.  The 
> validations are not complete, but at least they test the important items 
> such as enumeration lists, type etc.  It is a beginning, and I am trying 
> (as an Acta editor) to educate users to prepare CIFs that will be 
> accessible to the advanced software of the future.  

This is also excellent. It is (IMO) almost essential that editors are based 
on infosets, and that machine-dictionary validation becomes standard. We do 
not intend to develop editors but are developing some of the infrastructure 
(Parsers, infosets, and validators) and should correspond.
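
As a rough picture of what dictionary-driven validation of type and enumeration lists involves, here is a minimal Python sketch. The dictionary below is a hand-written stand-in, not read from a real DDL dictionary, and `_example_flag` is a hypothetical item invented for the illustration.

```python
# Hypothetical, hand-written stand-in for definitions read from a dictionary.
DICTIONARY = {
    "_cell_length_a": {"type": "numb"},
    "_example_flag": {"type": "char", "enum": {"yes", "no"}},
}


def validate(items: dict, dictionary: dict) -> list:
    """Return a list of problems found in a flat name -> value mapping."""
    problems = []
    for name, value in items.items():
        entry = dictionary.get(name)
        if entry is None:
            problems.append(f"{name}: not defined in dictionary")
            continue
        if value in (".", "?"):
            continue  # inapplicable or unknown values are not type-checked
        if entry["type"] == "numb":
            try:
                float(value)
            except ValueError:
                problems.append(f"{name}: {value!r} is not a number")
        if "enum" in entry and value not in entry["enum"]:
            problems.append(f"{name}: {value!r} is not an allowed value")
    return problems
```

Here validate({"_cell_length_a": "too large to measure"}, DICTIONARY) reports a single problem, which is the behaviour I would have expected from a numb declaration.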

However, most users 
> still see CIF as just a more complicated file structure that offers 
> little more than the old formatted output files produced by the 
> principal structure-solving packages.

Understood :-) but this should change gradually.

