Discussion List Archives

Re: [Imgcif-l] proposed change in first line of imgcif files

Firstly, I need some clarification for a few statements:

On Wed, Sep 24, 2008 at 1:14 PM, Herbert J. Bernstein
<yaya@bernstein-plus-sons.com> wrote:
> The problem with the proposition
>
>  "that the
>>
>> information derived from the 'style' and 'version' parts of the header
>> would not contain anything that couldn't be derived from the CIF file
>> proper"
>
> is one of chickens versus eggs.  We well may need magic number or other
> file-type identifier information to determine the parse logic for a
> CIF.

How is this so?  I thought that the CIF 1.1 syntax specification is
sufficient to parse a data file into datablocks containing text
datanames and text data values.  You may then (selectively, if you
like) examine values that you are interested in to obtain further
information, convert to numbers, create headers etc.

> Once we know that, we certainly can have tags within a CIF to
> confirm those parse decisions, but even then, we may not be able
> to glean everything without some external dialect specifier, e.g.
> when dealing with the differences between mmCIF and pdbx CIF.

I thought that the differences between mmCIF and pdbx CIF occur above
the syntactical level, i.e. in the selection of datanames and data types
available.  Are there syntactical differences?

> At first I thought of solving this problem by simply cloning all
> style variant flags as the values of tags with the CIF, but what
> do we then do if the magic number information and the CIF tags
> disagree or one or the other is missing?  We need to specify the
> handling of those cases, and not simply by declaring them to
> be errors.  People still will want to know how to recover their
> data.

I don't think this is a problem at all if we adopt the philosophy that
a commentless CIF is semantically equivalent to one with comments (and
I thought this was the philosophy all along).  Therefore, the contents
of the datablocks take precedence over anything that might appear in a
comment.

Case (1): If the header information disagrees with the tags, the
header is wrong.
Case (2): If the header is missing, it can be regenerated from the
relevant tags.
Case (3): If the relevant data tags are missing but the header is
present, this particular data file is doing an end run around the
standard and we slap it on the wrists, but pragmatically it will
probably survive.

The only case in which people might have trouble accessing their data
is (3), if some idealistic program strips off all comments.  But (3)
should be explicitly rejected by us in any case.
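
As a rough sketch of how a reader might apply the three cases above
(Python; the function name, the status strings, and the assumption that
the relevant tags can be turned back into a single header string are
all illustrative, not part of any proposal):

```python
from typing import Optional, Tuple


def resolve_header(
    header: Optional[str], header_from_tags: Optional[str]
) -> Tuple[Optional[str], str]:
    """Return (effective header, status), letting datablock contents win.

    `header` is the comment header found in the file, if any.
    `header_from_tags` is the header rebuilt from the relevant tags, if
    those tags are present (how it is rebuilt is assumed, not specified).
    """
    if header_from_tags is not None:
        if header is None:
            # Case (2): header missing, regenerate it from the tags.
            return header_from_tags, "regenerated"
        if header != header_from_tags:
            # Case (1): disagreement, so the header is wrong.
            return header_from_tags, "header-overridden"
        return header_from_tags, "consistent"
    if header is not None:
        # Case (3): tags missing; tolerate the header but flag the file.
        return header, "tags-missing"
    return None, "no-information"
```

A caller would presumably warn on "header-overridden" and "tags-missing"
rather than refuse to read the file.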

> So, what I am proposing is:
>
> 1.  We make a clean, unambiguous statement of how to handle magic
> numbers, comments and whitespace in CIF.  I think what I have proposed
> will do that job, but someone may have a better approach

As you can probably tell, my clean, unambiguous statement is:

(1) A commented CIF must always be semantically equivalent to that CIF
with all comments removed
(2) A convenience header may be supplied, but must be recoverable from
information within the CIF itself.

I see no need for the standard to deal with whitespace or comments
anywhere apart from the header. What are the motivations to concern
ourselves with comments etc. in other parts of the CIF?  I appreciate
that particular implementations may want to preserve comments, but
does that have to be within the purview of a general standard?  Indeed,
it seems to me that the standard is failing if we are putting anything
worthwhile into comments (with the exception of a header for
programming convenience).

> 2.  For information that can be carried in the same CIF in multiple,
> perhaps conflicting ways, we specify a precedence of interpretations.
> As a practical matter, I think magic numbers have to take precedence
> over conflicting or missing tags values within the CIF.  The magic
> number will have been read and interpreted well before the tag value
> is encountered.  This may then call for a warning, but the users
> will expect a rational effort at completing the parse, and perhaps
> even an automatic correction to the CIF to remove the conflict.

See above - I go the reverse way, but that is based on my
understanding that any CIF-formatted file can be parsed based on the
standards documents, without reference to any supplementary header
information.

> That being said, I have no objection to encouraging the use of the
> tags James has proposed, but the alignment between that content
> and the magic number information needs to be explicitly stated, and
> the simplest way to do that unambiguously within a CIF, especially
> in a DDLm CIF, would be by stating that relationship in terms of
> the values of James' new tags and the value of _ws.prologue. In DDLm
> we could even include the parse algorithms for decomposing the
> magic number and for creating it.

This looks like a promising solution in that it keeps the
specification of _ws.prologue within the imgCIF dictionary, and gives
it an interpretation beyond "the first comment in the file".  From the
point of view of the standard, I would not specify that this *must* be
the first comment in the file, only that it may be output as the
first line, because I still think it should be possible for a
commentless CIF to be viable.  Likewise, I don't think it should be
automatically set to the value of the first comment in the file when
reading in: the two should simply match, with a resolution for
mismatches as set out above.  I hope this is seen as a formality which
does not have significant practical impact.

> This is not quite the same as James's prescription of an equivalence
> between a CIF with comments and the same CIF with those comments
> removed, but I think it is a pragmatic compromise and comes closer
> to that goal than we have been in the past.

If my modification above is acceptable, then we are all happy.  A
comment-stripping CIF program will not see any comments in the input
file, but will see _ws.prologue and may or may not include it as a
header when outputting (but will carry through the _ws.prologue data
item).  A comment-aware CIF program will see the header and happily
use it, perhaps checking that it matches _ws.prologue.  A program
which needs the header but finds a CIF without one can farm the CIF
off to a little utility which gets hold of _ws.prologue (perhaps even
using DDLm methods to generate it) and prepends it to the file.
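
A minimal sketch of such a utility, in Python.  The function names are
made up for illustration, and the lookup is deliberately not a real CIF
parser: it only recognises _ws.prologue given as a quoted or bare value
on the same line as the dataname, and it assumes the value holds the
complete header line including its leading '#'.

```python
import re
from typing import Optional

# Simplified lookup: dataname and value on one line, no trailing comment.
# Semicolon-delimited text fields and loops are not handled.
_PROLOGUE_RE = re.compile(
    r"""^[ \t]*_ws\.prologue[ \t]+(?:'(?P<sq>.*)'|"(?P<dq>.*)"|(?P<bare>\S+))[ \t]*$""",
    re.MULTILINE | re.IGNORECASE,
)


def find_prologue(cif_text: str) -> Optional[str]:
    """Return the first _ws.prologue value found, or None."""
    match = _PROLOGUE_RE.search(cif_text)
    if match is None:
        return None
    for group in ("sq", "dq", "bare"):
        if match.group(group) is not None:
            return match.group(group)
    return None


def ensure_header(cif_text: str) -> str:
    """Prepend the _ws.prologue value as the first line if it is not there.

    Only the missing-header case is handled; an existing but different
    first line is left in place below the new header.
    """
    prologue = find_prologue(cif_text)
    if prologue is None or not prologue.startswith("#"):
        # Nothing usable to prepend (assumed: the value must be a comment).
        return cif_text
    first_line = cif_text.split("\n", 1)[0].rstrip()
    if first_line == prologue:
        return cif_text
    return prologue + "\n" + cif_text
```

Running ensure_header twice gives the same result as running it once,
so it should be safe to apply to files that already carry the header.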

A technical issue: there is one header for a given file, but there are
possibly multiple datablocks.  What is the behaviour if multiple
versions/styles are present in different datablocks?  This is a corner
case which presumably doesn't relate to manufacturer-produced CIFs,
but we need to specify the behaviour.  I would suggest that the
definition of _ws.prologue state that 'The value of _ws.prologue may
be output as the first line of an output CIF file.  Where multiple
datablocks are present in a file, a value of _ws.prologue from any one
of those datablocks can be used'.
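
To make the suggested wording concrete, here is a small Python sketch.
It assumes the file has already been parsed into a mapping from
datablock name to a mapping of datanames to text values; that layout
and both function names are hypothetical.

```python
from typing import Dict, List, Optional

Block = Dict[str, str]  # dataname -> text value (assumed layout)


def prologue_values(blocks: Dict[str, Block]) -> List[str]:
    """Collect every _ws.prologue value, in datablock order."""
    return [
        tags["_ws.prologue"]
        for tags in blocks.values()
        if "_ws.prologue" in tags
    ]


def choose_prologue(blocks: Dict[str, Block]) -> Optional[str]:
    """Pick a value from any one datablock; here, simply the first found."""
    values = prologue_values(blocks)
    return values[0] if values else None


def prologues_conflict(blocks: Dict[str, Block]) -> bool:
    """True if different datablocks carry different _ws.prologue values."""
    return len(set(prologue_values(blocks))) > 1
```

Taking the first value is only one possible choice; the wording above
permits any of them, and a writer could use prologues_conflict to warn
when the datablocks disagree.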

James.

-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
imgcif-l mailing list
imgcif-l@iucr.org
http://scripts.iucr.org/mailman/listinfo/imgcif-l
