[ddlm-group] Ordering in CIFs

I've put this in a separate thread, which I think the discussion of
ordering deserves.

Ordering in CIF files
=====================

Some overlong ruminations follow regarding ordering in CIF files.

Executive summary: if we want to create DDL attributes relating to
ordering, then a parser cannot treat any ordering in the file as
semantically meaningless until a dictionary has been applied.

A number of proposals are starting to appear which in one way or
another deal with ordering in CIF files.  For clarity in discussing
these proposals, we should distinguish between semantically
meaningless ordering, which would include anything done purely for
presentational purposes, and semantically meaningful ordering.  An
example of the latter would be where the ordering relates to a
sequence of commands to be executed, or a sequence of residues in a
protein chain.  I wish to discuss here semantically meaningful order,
that is, a situation where a change in the order will potentially
change the meaning.  It follows that if a DDL attribute controls this
order, a conforming data file must follow this DDL requirement.
Contrast this with meaningless order, where the DDL attributes
concerned, such as those in Herb's PRESENTATION category, need not be
conformed to.

Nick asserts that order is not a lexical issue, which I take to mean that
we should confine our ordering shenanigans to the DDL.  However, as
soon as ordering becomes significant at the DDL level, it feeds back
to the parser: the parser can no longer blindly discard or change the
ordering it finds in a CIF file, and this has subtle implications.

CIF file input up until now could be separated into two logical
stages: the result of parsing is the "infoset", which in the CIF case
is roughly a set of sets of sets of key-value pairs, where both key
and value are the strings as parsed from the file.  The infoset
represents precisely the information that can be derived from the
syntax alone. At the next stage, the dictionary DDL information is
applied to the "infoset" to produce dictionary-specific
datastructures.  These two parsing stages correspond to two levels of
semantics; implicit semantics arising purely from syntax, and explicit
semantics arising from DDL attributes. Note that, even if your
application applies dictionary information as soon as the dataname is
read in, the process can still be modelled in this way as the
behaviour of the parser is not affected by the dictionary information.
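
To make the two-stage model concrete, here is a minimal Python sketch.
It is only an illustration under stated assumptions: the nested
frozenset layout, the TOY_DICTIONARY mapping and the apply_dictionary
function are all invented for this example and are not taken from any
real CIF library or DDL dictionary.

```python
# Stage 1: a toy "infoset" as it might come out of a purely syntactic parse.
# Every value is still a string, and the loop packets sit in a frozenset,
# so no packet order is recorded.
infoset = {
    "example_block": {
        "items": {"_cell_length_a": "5.43"},
        "loop": frozenset({
            frozenset({("_atom_site_label", "Si1"), ("_atom_site_fract_x", "0.0")}),
            frozenset({("_atom_site_label", "Si2"), ("_atom_site_fract_x", "0.25")}),
        }),
    }
}

# Hypothetical stand-in for the type information a DDL dictionary supplies.
TOY_DICTIONARY = {
    "_cell_length_a": float,
    "_atom_site_label": str,
    "_atom_site_fract_x": float,
}


def apply_dictionary(block, dictionary):
    """Stage 2: convert the parsed strings using dictionary information."""
    def convert(name, value):
        # Datanames unknown to the dictionary are left as strings.
        return dictionary.get(name, str)(value)

    return {
        "items": {n: convert(n, v) for n, v in block["items"].items()},
        "loop": frozenset(
            frozenset((n, convert(n, v)) for n, v in packet)
            for packet in block["loop"]
        ),
    }


typed = apply_dictionary(infoset["example_block"], TOY_DICTIONARY)
```

Nothing in stage 1 consults the dictionary, which is the separation
described above; that is exactly the property an ordering attribute
would put under pressure.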

In a formal sense this model now has to change: if there are ordering
requirements for loop packets, the parser must preserve the ordering
information when constructing that part of the infoset.  Therefore
parts of the infoset change based on what is in the DDL dictionary,
and so it is no longer exclusively representative of the
syntax-derived semantics.  If we want to preserve the division between
syntax-derived and DDL-derived semantics, we can either

(i) redefine the "infoset" as now being a "list of lists of lists of
key-value pairs", and then at the DDL level we say that "order is not
important, except for the following cases" or
(ii) define new syntax, such as 'loop_ordered' instead of 'loop_'
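
Option (i) could be sketched in Python roughly as follows. This is a
sketch under assumptions, not a proposal for an actual attribute: the
order_significant flag is a hypothetical stand-in for whatever a DDL
ordering attribute would say about a category, and the _command_*
datanames are invented.

```python
def loops_equivalent(packets_a, packets_b, order_significant):
    """Compare two loops held as lists of packets (option (i)).

    order_significant is a hypothetical flag standing in for whatever a
    DDL ordering attribute would tell us about this loop's category.
    """
    if order_significant:
        # Order carries meaning: compare packet by packet, in sequence.
        return list(packets_a) == list(packets_b)
    # Order is presentational only: compare as unordered sets.
    return set(packets_a) == set(packets_b)


# Hypothetical datanames; each packet is a tuple of (name, value) pairs.
commands = [
    (("_command_id", "1"), ("_command_text", "open_shutter")),
    (("_command_id", "2"), ("_command_text", "collect_frame")),
]
shuffled = list(reversed(commands))
```

With order_significant false the two loops compare equal; with it true
they do not. The point is that the parser has to hand over the list in
both cases, because it cannot know which answer the dictionary will
give.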

To tease out how important or otherwise this really is, we need to
consider the implications for a CIF reader.  I'm only going to discuss
generic CIF readers here, such as ciftbx or PyCIFRW; readers that are
tailored for a particular dictionary at software construction time are
not as relevant, although keeping clean divisions between the two
semantic levels is probably helpful for them as well.

A generic CIF reader does not a priori know which DDL dictionaries a
given CIF data file conforms to, and therefore must potentially parse
the entire file in order to find the _audit data items. Under CIF1, a
CIF reader can parse a CIF into a datastructure isomorphic to the
infoset, confident that all syntax-derived semantic information of
relevance to any DDL dictionary has been preserved.  I say "can parse", not
"will", because, while a set structure is a more fundamental
mathematical construct than a list, it is the other way around in
computing: a set is a list with duplicate removal and order-based
operations suppressed.  So it is pretty likely that all CIF readers
retain ordering information at the end of a parse, even though not
strictly necessary, simply because extra effort is required to remove
the ordering (in most programming languages - although Python recently
acquired a 'set' type). That is, choosing option (i) might merely bring us
into line with what a programmer is likely to do in any case.
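
As an illustration of why readers tend to keep order, here is a toy
loop reader in Python. It is a deliberately naive sketch: read_loop is
a made-up function, it splits on whitespace only, and it ignores
quoting, text fields and everything else a real CIF parser (ciftbx,
PyCIFRW) has to handle.

```python
def read_loop(text):
    """Read one whitespace-delimited loop_ into a list of packets."""
    tokens = text.split()
    if not tokens or tokens[0] != "loop_":
        raise ValueError("expected a loop_ header")
    body = tokens[1:]

    # Leading tokens that start with "_" are taken to be the datanames.
    n_names = 0
    while n_names < len(body) and body[n_names].startswith("_"):
        n_names += 1
    names, values = body[:n_names], body[n_names:]
    if n_names == 0 or len(values) % n_names:
        raise ValueError("malformed loop")

    # Appending packets as they are read keeps file order for free.
    return [
        tuple(zip(names, values[i:i + n_names]))
        for i in range(0, len(values), n_names)
    ]


text = """
loop_
_atom_site_label
_atom_site_fract_x
C1 0.1
O1 0.2
N1 0.3
"""
packets = read_loop(text)

# Discarding the order is the step that takes extra effort.
unordered = frozenset(packets)
```

The list falls out of the reading loop naturally, while the frozenset
is a separate, deliberate conversion.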

Nevertheless, I think we should all be aware that as soon as we start
using DDL attributes to control order in a CIF datablock, we are
requiring generic readers to preserve the relevant order during the
parse.



-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148