Discussion List Archives

Re: CIF Infoset

  • To: "Discussion list of the IUCr Committee for the Maintenance of the CIF Standard (COMCIFS)" <comcifs@iucr.org>
  • Subject: Re: CIF Infoset
  • From: ddb@owari.msl.titech.ac.jp
  • Date: Mon, 23 Aug 2004 15:25:00 +0900 (JST)
Hello,

I am well off the pace here, but a few issues sprang to mind:

On Wednesday 18 August 2004 19:24, Dr P. Murray-Rust wrote:
> On Aug 18 2004, Nick Spadaccini wrote:
> > On Tue, 17 Aug 2004, Peter Murray-Rust wrote:

> > > Q. Does the presence or absence of a dictionary affect the infoset?
> > > (it is formally impossible to deconvolute namespaces or categories
> > > without a dictionary) Moreover defaults, etc (see below) depend on a
> > > dictionary.
> >
> > Why is the deconvolution of namespaces and categories (in the Star
> > syntax) a lexical issue? That is a higher order issue. The datanames
> > would have to be identical (up to case) in either file, though their
> > placement could be very different.

Besides case and the "." / "_" equivalence in DDL2, there is the issue of
data name aliases: equivalent names for the same data item, which arise as
the dictionaries are revised.
By definition the information content, and therefore (I would expect) the
infoset, should be the same. Recognising that would be impossible without
the dictionary (in addition to the requirements for typing and for
resolving default values).
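
To make the alias point concrete, here is a minimal sketch in Python. The alias table is a hand-written stand-in for what a dictionary would supply, the two pairs in it are only illustrative, and the function names are hypothetical. Without the table the two spellings cannot be recognised as the same item.

```python
# Hypothetical alias table: alias -> name chosen as canonical.
# In practice these pairs would be read from the dictionaries themselves.
ALIASES = {
    "_atom_site.fract_x": "_atom_site_fract_x",
    "_symmetry_equiv_pos_as_xyz": "_space_group_symop_operation_xyz",
}


def canonical_name(tag, aliases=ALIASES):
    """Fold case, then map an alias onto its canonical data name."""
    key = tag.lower()
    return aliases.get(key, key)


def same_item(tag_a, tag_b, aliases=ALIASES):
    """True if both tags resolve to the same canonical data name."""
    return canonical_name(tag_a, aliases) == canonical_name(tag_b, aliases)


# With the table the two spellings are recognised as one item ...
print(same_item("_ATOM_SITE.fract_x", "_atom_site_fract_x"))
# ... and with an empty table (no dictionary) they are not.
print(same_item("_atom_site.fract_x", "_atom_site_fract_x", aliases={}))
```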

> In CIF DDL1 namespaces are indicated by prefixes affixed through
> underscores. A tag such as _my_local_namespace_atom_xyz_angstrom_B12
>
> is unparsable unless there is a lookup table of what the allowable
> namespaces are, the dictionaries that belong to them, the allowed suffixes,
> etc.

> FWIW I have written my own CIFinfoset. I'm looking for communal feedback
> before publishing it...

So how do you intend to get around this namespace issue? No CIFs that I
have encountered have ever declared their conformance to any dictionary.
Even if they did, there is something called the dictionary stacking
protocol, which allows those definitions to be overridden without
declaring a namespace.
On top of that there is the boundless capacity for making up your own
data names on the fly, for which there may never be any dictionary
definition at all. How can you reliably assign anything but a generic
namespace to an infoset? It's all just ad hoc guesswork.
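
A small Python sketch of why this is guesswork: an underscore-delimited tag can be cut into prefix and remainder in many ways, and only an external list of known prefixes can pick one. The function names and the longest-prefix rule are assumptions made for illustration, not anything taken from a CIF specification.

```python
def candidate_splits(tag):
    """All ways to cut an underscore-delimited tag into (prefix, remainder)."""
    parts = tag.lstrip("_").split("_")
    return [("_".join(parts[:i]), "_".join(parts[i:]))
            for i in range(1, len(parts))]


def assign_namespace(tag, known_prefixes, default="generic"):
    """Longest known prefix wins (an assumed rule); otherwise use the default."""
    matches = [prefix for prefix, _ in candidate_splits(tag)
               if prefix in known_prefixes]
    return max(matches, key=len) if matches else default


tag = "_my_local_namespace_atom_xyz_angstrom_B12"
# Every split is lexically plausible; nothing in the tag marks the boundary.
for prefix, rest in candidate_splits(tag):
    print(prefix, "|", rest)
# With a lookup table the prefix can be resolved, without one it cannot.
print(assign_namespace(tag, {"my_local_namespace"}))
print(assign_namespace(tag, set()))
```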


> > > the presence of a dictionary is important, is it an error to have a CIF
> > > without a dictionary?
> >
> > At the lexical level I am trying to see how you need a dictionary.
>
> DDL1 semantics require that only tags of the same category are found
> within a given loop_. It is impossible to determine the category from
> the tag name without a dictionary.

That sounds like a strict limitation, but how can you prevent people
from adding their own tags within a category loop_ structure? If there is
no specified dictionary, you can't even police this.
Is it an error to use a data name without a definition?
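
The policing problem can be sketched as follows in Python. The category lookup is a hypothetical three-entry fragment standing in for a real dictionary, and `check_loop` is an illustrative name. The check can only report what the lookup knows about, so an empty lookup leaves every tag undefined and the one-category rule untestable.

```python
# Hypothetical fragment of a dictionary: data name -> category.
CATEGORY_OF = {
    "_atom_site_label": "atom_site",
    "_atom_site_fract_x": "atom_site",
    "_geom_bond_distance": "geom_bond",
}


def check_loop(tags, category_of):
    """Return (categories found, undefined tags) for the tags of one loop_."""
    categories = set()
    undefined = []
    for tag in tags:
        category = category_of.get(tag.lower())
        if category is None:
            undefined.append(tag)      # no definition: category unknown
        else:
            categories.add(category)
    return categories, undefined


loop_tags = ["_atom_site_label", "_atom_site_fract_x", "_atom_site_my_extra"]
print(check_loop(loop_tags, CATEGORY_OF))   # one category, one undefined tag
print(check_loop(loop_tags, {}))            # no dictionary: nothing can be said
```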

How about splitting a single category loop_ into two separate loop_s
with implied ordering? (I am not saying it's wise, I'm just wondering if
there are CIF restrictions that I am not aware of.) Will that affect the
infoset?

By the way, here are some tags that I can't  imagine will ever be defined 
anywhere:

loop_
_publ_manuscript_incl_extra_item

'_geom_extra_tableA_col_1'
'_geom_extra_tableA_col_2'
'_geom_extra_tableA_col_3'
'_geom_extra_tableA_col_4'
'_geom_extra_tableA_col_500'

loop_
_geom_extra_tableA_col_1
_geom_extra_tableA_col_2
_geom_extra_tableA_col_3
_geom_extra_tableA_col_4
...
_geom_extra_tableA_col_500
       
 
Presumably this is just a kludge to get around limitations of the CIF data
structure as a vehicle for manuscript submissions.
But should all 500 possible columns be defined somewhere?

_publ_vrn_code    VRNxxxxx


_vrf_validator_comments
; lets make up as many obscure undefined and possibly undefinable
_data_names_as_possible
;

# start Validation Reply Form
_vrf_DIFF020_114
;PROBLEM: _diffrn_standards_interval_count and
RESPONSE: ... We have used an image-plate system
;

If intelligent software was ever intended to deal with such _vrf_s, why
embed the only pointer to their purpose in supposedly non-parsable data
names rather than in looped, discrete sets of tags such as:

loop_
    _vrf_suite _vrf_subroutine _vrf_error_code _vrf_authors_response

And don't even get me started on 
_least_squares_correlation_matrix_element_732x351
(ok, I made that one up, but perhaps you get the point?)

> > > Q Is the order of "rows" in a loop_ unimportant? Do
> >
> > Yes (in CIF).
>
> That is very useful (and non-obvious from the spec). It then makes it
> possible to confirm the identity of two sets of coordinates, symmetry
> operations, etc.
>

It is also debatable. 
The very recent introduction of _symmetry_equiv_pos_site_id means that
the data integrity of the majority of prior archived CIFs containing tag 
values like:    _geom_bond_site_symmetry_1  "4_564"
would be seriously impaired by a change of order in the 
loop_  _symmetry_equiv_pos_as_xyz
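
To illustrate with a Python sketch: a code such as "4_564" is read here as operator number 4 in the symmetry loop plus lattice translations encoded relative to 555. The helper names are hypothetical, and the lookup by row position is the assumption being tested (it is what a CIF without explicit operator ids forces). The operator list is just an example set of four positions.

```python
def parse_symmetry_code(code):
    """Split 'n_klm' into (operator number, (ta, tb, tc)); 5 means no shift."""
    number, klm = code.split("_")
    return int(number), tuple(int(digit) - 5 for digit in klm)


def resolve(code, symops):
    """Look the operator up by its 1-based position in the loop_ rows."""
    number, shift = parse_symmetry_code(code)
    return symops[number - 1], shift


# Example rows of a loop_ _symmetry_equiv_pos_as_xyz (order as written).
symops = ["x,y,z", "-x,-y,-z", "-x,1/2+y,1/2-z", "x,1/2-y,1/2+z"]
print(resolve("4_564", symops))

# Swap the last two rows: the same code now points at a different operator.
reordered = [symops[0], symops[1], symops[3], symops[2]]
print(resolve("4_564", reordered))
```

If each row carried its own identifier, the lookup could use that instead of the position, and reordering would be harmless.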

If you want to reorder rows you are up against a CIF integrity/versioning 
problem. 
Both old and new style CIFs may be lexically and semantically valid and 
both may be conformant with the latest dictionary. Probably neither
would state that conformance in any readily accessible manner.

Maybe there are other examples, such as incremental powder profile scan data.

On Thursday 19 August 2004 17:29, Dr P. Murray-Rust wrote:
> On Aug 18 2004, David Brown wrote:

> > Quoting is an
> > important - for example in the dictionaries '_cell_length_a' is not a
> > dataname, though _cell_name_a is.  This might occur in a CIF if someone
> > wrote:
> > _exptl_special_details   '_exptl_density_obs unobserverable'
>
> This is a separate issue. The quoting is simply an escape mechanism (as
> also for whitespace and multiline text). Any compliant CIF parser should
> have no problem parsing the above but I would not expect the infoset to
> retain the quotes or the fact of quoting.
>
> Similarly I would not expect a CIF writer to output any quotes unless it
> was required to escape something. (The other extreme is that a writer
> could quote everything to be safe). Unless the semantic meaning is clear
> I would suggest that quotes are only used to escape values.

I had a hazy recollection that "this is a string" and this_is_a_string
were equally valid CIF constructs containing identical information
content, used for example in space group names. Would they be formally
identical in an infoset? Does the white space in all strings have to be
normalised (is that the right word?)?

Would 1.2(2) and 1.3(2) be equivalent in an infoset? Lexically they are 
different, but semantically they are the same value, within error.
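
A Python sketch of the distinction. The parenthesised digits are taken as the standard uncertainty in the last quoted decimal places. The agreement test (difference no larger than k combined uncertainties) is an assumed criterion chosen for illustration, as are the function names; an infoset specification would have to say which comparison, if any, it intends.

```python
import re
from decimal import Decimal

# value(su), e.g. 1.2(2) or 1.234(12)
_VALUE_SU = re.compile(r"^([+-]?\d*\.?\d+)\((\d+)\)$")


def parse_su(text):
    """Return (value, standard uncertainty) for a string such as '1.2(2)'."""
    match = _VALUE_SU.match(text)
    if match is None:
        raise ValueError(f"not a value with s.u.: {text!r}")
    value = Decimal(match.group(1))
    # The s.u. digits apply to the last decimal places of the value.
    scale = Decimal(10) ** value.as_tuple().exponent
    return value, Decimal(match.group(2)) * scale


def agree(text_a, text_b, k=1):
    """Assumed criterion: |a - b| <= k * sqrt(su_a**2 + su_b**2)."""
    a, su_a = parse_su(text_a)
    b, su_b = parse_su(text_b)
    return abs(a - b) <= k * (su_a ** 2 + su_b ** 2).sqrt()


print("1.2(2)" == "1.3(2)")        # lexical comparison
print(agree("1.2(2)", "1.3(2)"))   # comparison within uncertainty
```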

On Thursday 19 August 2004 22:14, Dr P. Murray-Rust wrote:
> On Aug 19 2004, Herbert J. Bernstein wrote:
> > At 2:26 PM +0100 8/19/04, Dr P. Murray-Rust wrote:
> > >On Aug 19 2004, Herbert J. Bernstein wrote:

> > >The difficulty is not preserving the data type, but the semantics of
> > >downstream decisions. If one author writes _my_phone "123-45678"
> > >they are announcing this is not a number while if another writes
> > >_my_phone 123-45678 they are announcing it is a number. The
> > >discussion so far seems to suggest that these statements overrule
> > >the datatypes specified in the dictionary entries. There is a
> > >particular problem in loop_s, where it is then possible to have
> > >different data types within a column:
> > >
> > >loop_ _atom_site_occupancy
> > >1.0
> > >0.3
> > >"not refined"
> > >"0.3"
> > >"."
> > >
> > >which makes the implementation very difficult. I believe that a
> > >programmer should be able to look up the data type in the dictionary
> > >entry and write a routine that relies on a value being of the
> > >correct data type and throws an exception if not.

(one of a standardised set of permissible parser exceptions
outlined in a rigorous specification perhaps?)
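
As a sketch of the routine described in the quoted paragraph, in Python: the declared type comes from a dictionary lookup (not shown), and a value that contradicts it raises a named exception. `CifTypeError`, `check_column` and the `strict_quotes` flag are hypothetical names, values are assumed to arrive from the lexer as (text, quoted) pairs, and whether a quoted "0.3" is acceptable in a numeric column is left as a switch because the thread does not settle it.

```python
import re

# Unquoted CIF-style number, optionally with exponent and s.u. in parentheses.
NUMBER = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?(\(\d+\))?$")


class CifTypeError(ValueError):
    """Hypothetical exception: a value contradicts its dictionary type."""


def check_column(tag, values, declared_type, strict_quotes=False):
    """Raise CifTypeError at the first value that is not of declared_type."""
    for row, (text, quoted) in enumerate(values, start=1):
        if not quoted and text in (".", "?"):
            continue                    # unquoted placeholders fit any type
        if declared_type != "numb":
            continue                    # every string is a valid 'char'
        if not NUMBER.match(text) or (quoted and strict_quotes):
            raise CifTypeError(f"{tag}, row {row}: {text!r} is not numb")


column = [("1.0", False), ("0.3", False), ("not refined", True),
          ("0.3", True), (".", True)]
try:
    check_column("_atom_site_occupancy", column, "numb")
except CifTypeError as error:
    print(error)
```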

> >
> > If there is a dictionary, so the type is known, there are no downstream
> > decisions to be made. If the data type is numeric, the non-numeric
> > strings are an error.
>
> Good. This makes things much easier.
>
> > If the data type is a character type, all the data
> > values are valid.
>
> Again no problem.
>
> > If there is no dictionary, then the parser designer has
> > to make some context-sensitive typing decisions. The choice in CIFtbx
> > is to infer the typing from the first instance of the data. Other
> > choices could be made, including postponing the typing decision until
> > an entire column is read, but whatever the decision, once it is made,
> > the right thing to do is to report to the user conflicts between the
> > type of the data and the type chosen for the tag.
>
> I understand the logic of this. It is probably manageable if there are
> only char and numb - but becomes impossible if there are many. I am happy
> to go along with any interpretation as long as it's general across the
> community. I understand your proposal as:


If I understand rightly, there could in future be many data types. From
Syd and Nick's work you could have binary, hexadecimal and octal
representations of numbers as well as tuples, lists, associative arrays,
vectors and matrix container types, presumably with composites. Ad hoc
guessing at the data-type assignment sounds ridiculous in this enlightened
so-called information age.
As does ad hoc guessing at the dictionary, category and namespace.
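
For comparison, the first-instance strategy quoted above can be sketched in Python as follows. This is not CIFtbx code. It assumes only the two types char and numb, treats a quoted value as char, and uses hypothetical function names. It shows how much rests on whichever value happens to come first.

```python
import re

NUMBER = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?(\(\d+\))?$")


def infer_type(text, quoted):
    """Guess 'numb' or 'char' for a single value (assumed two-type model)."""
    return "numb" if not quoted and NUMBER.match(text) else "char"


def first_instance_conflicts(values):
    """Fix the column type from the first typed value; list later conflicts."""
    chosen = None
    conflicts = []
    for row, (text, quoted) in enumerate(values, start=1):
        if not quoted and text in (".", "?"):
            continue                    # placeholders carry no type
        guessed = infer_type(text, quoted)
        if chosen is None:
            chosen = guessed
        elif guessed != chosen:
            conflicts.append((row, text, guessed))
    return chosen, conflicts


column = [("1.0", False), ("0.3", False), ("not refined", True),
          ("0.3", True), (".", True)]
print(first_instance_conflicts(column))
# Starting the column with the quoted string flips the chosen type.
print(first_instance_conflicts(column[2:] + column[:2]))
```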

If you can only create a rigorous infoset from CIFs claiming to be
conformant to the CIF 1.1 specification, does that not suggest that the
information retention and reusability of pre-1.1 CIF have not quite lived
up to expectations? Is it not worrying to entrust valuable data and
documentation to a file format that is allegedly deficient in its
information representation model, ambiguous in its semantic intent,
incompletely specified, non-unique in its data representation, and of
which even the primary architects may not be fully cognisant?


On Wednesday 18 August 2004 19:24, Dr P. Murray-Rust wrote:
> There is communal benefit in having all parsers produce the same infoset
> from a given xmlInstance and exposing it through a common API such as SAX
> and DOM. (see http://sax.sf.net for a discussion of the history and
> benefits of this approach).
>
> In writing a CML application I can use any XML parser in the knowledge
> that it will produce the same SAX callback content. I do not have to
> write a parser myself. (I also do not have to write a validator, a
> transformer, etc.) If, in CIF, everyone has to write all components in an
> application it reduces the communal benefit of having a shared lexical
> structure.

What are the actual benefits to end users of having this infoset technology?

And I wonder whether, after all the ad hoc decisions and effort needed to
convert current CIFs into an allegedly more versatile and reliable
XML/infoset, the reverse transformation to recover the original
information in a usable format would be very complicated.


$0.02
Doug
(observing)
