Re: CIF Infoset

Peter raises two questions about comments and one about data types
that call for a very clear response.

> Q. If my CIF parser automatically strips all comments from the document
> and, say, deposits them in a public repository, does anyone feel this is
> a problem?

This is not only a problem but, depending on who owns the intellectual
property rights in the document involved, it may well be a violation of
copyright law. It is common practice to put copyright statements and
references to licenses in the comments of documents, whether they be in
CIF, XML or some other language.  If you have created the document in
question, what you extract from it and deposit in a public repository is
your business. If the document was created by someone else, or you
surrendered your intellectual property rights to someone else, they get
to decide how derived works are handled.  So, if you are designing a CIF
parser to extract information from a CIF for some application to process
internally, stripping all comments may well be a good idea, but if you are
designing a CIF (or XML, PostScript, ASN.1, or other) parser to
reformat documents, then you need to be much more careful to preserve
comments.
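
As a rough illustration of the difference between the two kinds of
parser, here is a minimal Python sketch that separates whole-line
comments from the rest of a CIF so that a reformatting tool can carry
them along, while an extraction tool can drop them.  The function name
split_comments and its keep_comments flag are hypothetical, and the
sketch makes a simplifying assumption: it only recognizes comments that
occupy a whole line and ignores trailing comments after data.

```python
def split_comments(cif_text, keep_comments=True):
    """Separate whole-line comments from the other lines of a CIF.

    Simplifying assumption: only lines whose first non-blank character
    is '#' are treated as comments; trailing comments are not handled.
    """
    comments, body = [], []
    in_text_field = False
    for line in cif_text.splitlines():
        if line.startswith(";"):
            # A semicolon in column one opens or closes a text field.
            in_text_field = not in_text_field
            body.append(line)
        elif not in_text_field and line.lstrip().startswith("#"):
            # A reformatter keeps these (they may carry copyright or
            # license notices); an extractor may choose to drop them.
            if keep_comments:
                comments.append(line)
        else:
            body.append(line)
    return comments, body
```

A reformatting tool would call split_comments(text) and write the
returned comments back out; a tool that only extracts values for
internal use could pass keep_comments=False.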

> Q. Is the CIF version "comment" a special case and should it be preserved
> (I believe yes)

The handling of the CIF magic number comment depends on what you are
doing with the document.  If you are reading the document, it is a good
idea to read and parse the magic number to provide your parser with a hint
as to the intended syntax (e.g. 80 character vs. 2048 character line
length limit).  If you are writing a document, then rather than preserving
the magic number comment from some starting document, you want to generate
your own magic number comment that correctly identifies the syntax
specification being followed by your CIF writer.  This sensible practice
has been well established in the HTML/SGML/XML community, and proves very
helpful in dealing with the dizzying variety of HTML/SGML/XML syntax
versions.  Hopefully we will never have as many co-existing syntax
versions in the CIF community, but the practice is still a sound one to
follow.
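
The read-side and write-side handling described above might be sketched
in Python as follows.  This is only an illustration under stated
assumptions: the names read_magic, line_limit, write_cif and
WRITER_VERSION are hypothetical, the magic number is assumed to have the
form #\#CIF_1.1, and the table mapping version strings to the 80 and
2048 character limits (with 80 as the fallback when no magic number is
found) is my assumption, not a statement of the specification.

```python
import re

# Assumed form of the magic number comment: "#\#CIF_" plus a version.
MAGIC = re.compile(r"^#\\#CIF_(\d+\.\d+)")

# Assumed mapping of syntax versions to line-length limits.
LINE_LIMITS = {"1.0": 80, "1.1": 2048}
DEFAULT_LIMIT = 80  # assumed fallback when there is no magic number

WRITER_VERSION = "1.1"  # the syntax this hypothetical writer follows


def read_magic(first_line):
    """Return the version named by the magic number, or None."""
    match = MAGIC.match(first_line)
    return match.group(1) if match else None


def line_limit(version):
    """Use the version only as a hint for the line-length limit."""
    return LINE_LIMITS.get(version, DEFAULT_LIMIT)


def write_cif(lines):
    """Emit the writer's own magic number, not the one it was given."""
    limit = line_limit(WRITER_VERSION)
    kept = [line for line in lines if not MAGIC.match(line)]
    for line in kept:
        if len(line) > limit:
            raise ValueError("line exceeds %d characters" % limit)
    return "\n".join(["#\\#CIF_" + WRITER_VERSION] + kept) + "\n"
```

The point of write_cif is that the header always reflects what the
writer actually produces, whatever the starting document claimed.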

> I am now unclear about the role of char and numb. I assumed they were for
> data validation and application programmers. The first would ensure that a
> data value was always a number - thus I would have believed that
> _cell_length_a 'too large to measure'
> was a validation error. The second aspect is now a nightmare for
> application programmers. Firstly the infoset (the result of the parse) has
> to retain knowledge of whether the value is quoted. Then the application
> has to take different action on whether the value is quoted. The author
> submits that _cell_length_a '12.1' _cell_length_a 12.1 have different
> meanings. (I cannot see what - as a programmer - I can or have to do).
> Formally if I get _cell_length_a '12.1' I would have to throw an exception
> "Cell_length_a is not a number, cannot continue".

Like many languages, CIF has data types.  The number of types depends
on the DDL, but in all cases, there is a distinction between numeric data
and other, more string-oriented data types (e.g. char and text).  Just as
with most programming languages, a quoted "12" is not a number.  The
application does not need to preserve the quotes, but it does need to
recognize that the data type of the data that it just read is not a
numeric data type, and if the context within which it is being used
calls for a numeric data type (e.g. as a value of _cell_length_a) then
a good parser really should inform the user of the conflict.  This
does mean that the parser "has to take different action on whether the
value is quoted", but that is one of the services the parser is there
to perform for the user, if it can.  Yes, there may be justification
for writing a light-weight parser that does not catch such errors, but
that hardly makes it a "nightmare for application programmers" to
write parsers that do catch such errors.  Even when a dictionary is not
being used, you really do want to recognize the distinction between
numeric and non-numeric data.  For example, 1234-308 might well be
intended as the number 1234*10**(-308) while '1234-308' is clearly
intended to be the string of characters stated.
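
To make the point concrete, here is a minimal Python sketch of a parser
that drops the quotes but remembers the type, and reports a conflict
when a numeric item receives a non-numeric value.  All of the names
(Value, classify, check, EXPECTED_NUMERIC) are hypothetical, the
EXPECTED_NUMERIC set merely stands in for a real dictionary lookup, and
the number pattern is a simplified assumption that does not attempt to
cover every numeric form (it does not handle the 1234-308 style of
exponent, for instance).

```python
import re
from dataclasses import dataclass

# Simplified, assumed pattern for an unquoted number, allowing an
# optional exponent and an optional standard uncertainty such as "(3)".
NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?(\(\d+\))?")


@dataclass
class Value:
    text: str         # the value with any quotes removed
    is_numeric: bool  # the type the parser inferred from the syntax


def classify(token):
    """Strip quotes if present and record whether the value is numeric."""
    if len(token) >= 2 and token[0] == token[-1] and token[0] in "'\"":
        return Value(token[1:-1], False)  # quoted values are never numbers
    return Value(token, NUMBER.fullmatch(token) is not None)


# Stand-in for a dictionary: data names that require a numeric value.
EXPECTED_NUMERIC = {"_cell_length_a"}


def check(name, token):
    """Return a warning string on a type conflict, otherwise None."""
    value = classify(token)
    if name in EXPECTED_NUMERIC and not value.is_numeric:
        return "%s: expected a number, got %r" % (name, value.text)
    return None
```

With this sketch, check("_cell_length_a", "12.1") returns None, while
the quoted '12.1' and 'too large to measure' both produce a warning.
The application never sees the quotes themselves, only the inferred type.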

  -- Herbert
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769


On Thu, 19 Aug 2004, Dr P. Murray-Rust wrote:

> On Aug 18 2004, David Brown wrote:
> > I am also finding this interchange interesting.
> Thanks - it is a deep issue and resulted in many thousands of emails in the
> XML community.
> The issue as I see it is whether CIFs are seen as machine-understandable
> documents or whether they are primarily to produce material for humans to
> read. (They can do both, but it requires work).
comcifs mailing list
