Discussion List Archives


Re: CIF Infoset

On Aug 19 2004, Herbert J. Bernstein wrote:

> There are two questions that Peter raises relative to comments and one
> relative to data types that call for a very clear response
> 
> >
> > Q. If my CIF parser automatically strips all comments from the 
> > document and, say, deposits them in a public repository, does anyone 
> > feel this is a problem?
> 
> 
> This is not only a problem, but depending on who owns the intellectual
> property rights in the document involved, it may well be a violation of
> copyright law. It is common practice to put copyright statements and
> references to  licenses in the comments of documents, whether they be in
> CIF, XML or some other language.  If you have created the document in
> question, what you extract from it and deposit in a public repository is
> your business. If the document was created by someone else, or you
> surrendered your intellectual property rights to someone else, they get
> to decide how derived works are handled.  So, if you are designing a CIF
> parser to extract information from a CIF for some application to process
> internally, stripping all comments may well be a good idea, but if you are
> designing a CIF (or XML, or postscript, or ASN.1) or other parser to
> reformat  documents, then you need to be much more careful and inclusive
> of comments.

My own view is that comments should be preserved. Taking Herbert's view it 
then follows that comments are order-dependent.

# Here is a list of authors
# The first one is the lead author
# A.B.Foo
# D.E.Bar

It also suggests that we should have a "comment block" (since comments 
cannot span more than one line).
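
To make the order-dependence concrete, here is a minimal Python sketch of a reader that keeps comments in document order and groups runs of adjacent comment lines into "comment blocks". It is an illustration only, not an existing CIF API: the names split_comments and comment_blocks are hypothetical, and it assumes that a comment occupies a whole line, ignoring trailing comments and '#' characters inside quoted strings.

```python
def split_comments(lines):
    """Return (line_number, kind, text) events in document order.

    Assumption: a comment is a line whose first non-blank character is '#'.
    Trailing comments and '#' inside quoted values are not handled here.
    """
    events = []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines carry no content
        if stripped.startswith("#"):
            events.append((number, "comment", stripped[1:].strip()))
        else:
            events.append((number, "data", stripped))
    return events


def comment_blocks(events):
    """Group comments on consecutive line numbers into blocks."""
    blocks = []
    previous_line = None
    for number, kind, text in events:
        if kind != "comment":
            previous_line = None  # a data line ends the current block
            continue
        if previous_line is not None and number == previous_line + 1:
            blocks[-1].append(text)
        else:
            blocks.append([text])
        previous_line = number
    return blocks
```

Because each event keeps its line number, a writer could re-emit the comments at their original positions relative to the data rather than discarding or reordering them.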

However I think it would also be valuable to stress that any IPR, metadata, 
or other semantics are put in CIF items or loop_s and not in comments. I 
would prefer that authors are dissuaded from using comments for important 
information

- that the 
> 
> >
> > Q. Is the CIF version "comment" a special case and should it be 
> > preserved (I believe yes)
> 
> The handling of the CIF magic number comments depends on what you are
> doing with the document.  If you are reading the document, it is a good
> idea to read and parse the magic number to provide your parser with a hint
> as to the intended syntax (e.g. 80 character vs. 2048 character line
> length limit).  If you are writing a document, then rather than preserving
> the magic number comment from some starting document, you want to generate
> your own magic number comment that correctly specifies the syntax
> specification being followed by your CIF writer.  The sensible practice
> has been well established in the HTML/SGML/XML community, and proves very
> helpful in dealing with the dizzying variety of HTML/SGML/XML syntax
> versions.  Hopefully we will never have as many co-existing syntax
> versions in the CIF  community, but the practice is still a sound one to
> follow.
> 
There is currently only one syntax for XML (V1.0), though XML1.1 is under 
development. The XML declaration: <?xml version="1.0"?> is not mandatory but 
encouraged. I assume that the CIF magic comment is of that form and 
therefore not fundamentally a comment (the XML declaration is not a 
processing instruction).
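
As a rough illustration of the read-then-regenerate practice described above, the following Python sketch parses a magic line of the form #\#CIF_1.1, derives a line-length hint from it, and writes a fresh magic line for the writer's own syntax version. The function names are hypothetical. The 80 and 2048 limits are the figures quoted in this thread, and treating a file with no magic line as subject to the 80-character limit is an assumption of the sketch.

```python
import re

# Matches a leading magic comment such as  #\#CIF_1.1
MAGIC = re.compile(r"^#\\#CIF_(\d+)\.(\d+)")


def read_magic(first_line):
    """Return (major, minor) from a magic comment, or None if absent."""
    match = MAGIC.match(first_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def line_limit(version):
    """Line-length hint for the reader.

    Assumption: files without a magic comment, or older than 1.1,
    are held to the 80-character limit.
    """
    if version is not None and version >= (1, 1):
        return 2048
    return 80


def write_magic(version):
    """Magic comment for the syntax the writer itself follows."""
    return "#\\#CIF_{}.{}".format(*version)
```

On output, the writer calls write_magic with its own version rather than copying whatever the input file declared.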

> > I am now unclear about the role of char and numb. I assumed they were 
> > for data validation and application programmers. The first would ensure 
> > that a data value was always a number - thus I would have believed that
> >
> > _cell_length_a 'too large to measure'
> >
> > was a validation error. The second aspect is now a nightmare for 
> > application programmers. Firstly the infoset (the result of the parse) 
> > has to retain knowledge of whether the value is quoted. Then the 
> > application has to take different action on whether the value is 
> > quoted. The author submits that _cell_length_a '12.1' and _cell_length_a 
> > 12.1 have different meanings. (I cannot see what - as a programmer - I 
> > can or have to do). Formally if I get _cell_length_a '12.1' I would 
> > have to throw an exception "Cell_length_a is not a number, cannot 
> > continue".
> 
> As do many languages, CIF has data types.  The number of types depends
> on the DDL, but in all cases, there is a distinction between numeric data
> and other, more string-oriented data types (e.g. char and text).

Agreed.

> Just as
> with most programming languages, a quoted "12" is not a number.  The
> application does not need to preserve the quotes, but it does need to
> recognize that the data type of the data that it just read is not a
> numeric data type, and if the context within which it is being used
> calls for a numeric data type (e.g. as a value to _cell_length_a) then
> a good parser really should inform the user of the conflict.  This
> does mean that the parser "has to take different action on whether the
> value is quoted", but that is one of the services the parser is there
> to perform for the user, if it can.  Yes, there may be justification
> for writing a light-weight parser that does not catch such errors, but
> that hardly makes it a "nightmare for application programmers" to
> write parsers that do catch such errors.  Even when a dictionary is not
> being used, you really do want to recognize the distinction between
> number and non-numeric data.  For example, 1234-308 might well be
> intended as the number 1234*10**(-308) while '1234-308' is clearly
> intended to be the string of characters stated.

The difficulty is not preserving the data type, but the semantics of 
downstream decisions. If one author writes _my_phone "123-45678" they are 
announcing this is not a number while if another writes _my_phone 123-45678 
they are announcing it is a number. The discussion so far seems to suggest 
that these statements overrule the datatypes specified in the dictionary 
entries. There is a particular problem in loop_s, where it is then possible 
to have different data types within a column:

loop_ _atom_site_occupancy
1.0
0.3
"not refined"
"0.3"
"."

which makes the implementation very difficult. I believe that a programmer 
should be able to look up the data type in the dictionary entry and write a 
routine that relies on a value being of the correct data type and throws an 
exception if not.
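
A minimal Python sketch of such a routine follows, applied to the _atom_site_occupancy column above. Everything in it is illustrative: Value, CifTypeError, as_number and check_column are hypothetical names, the numeric pattern is a simplification, and the dictionary lookup is reduced to the assumption that the item is declared numeric. The quotes_matter switch exists only to show the two positions in this thread: with True a quoted value is never a number, with False only the text is tested.

```python
import re
from dataclasses import dataclass

# Simplified numeric form: sign, digits, optional exponent, and an optional
# standard uncertainty in parentheses, e.g. 12.1(3). Not the full CIF grammar.
NUMBER = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?(\(\d+\))?$")


@dataclass
class Value:
    text: str
    quoted: bool  # the parser records whether the token was quoted


class CifTypeError(ValueError):
    pass


def as_number(value, quotes_matter=True):
    """Convert a value for an item assumed to be declared numeric."""
    if value.quoted and quotes_matter:
        raise CifTypeError(f"quoted value {value.text!r} is a string, not a number")
    if value.text in (".", "?"):
        return None  # inapplicable / unknown placeholders
    if NUMBER.match(value.text) is None:
        raise CifTypeError(f"{value.text!r} is not a number")
    # Drop the standard uncertainty before converting.
    return float(re.sub(r"\(\d+\)$", "", value.text))


def check_column(values, quotes_matter=True):
    """Return (row, message) for every value that fails the numeric check."""
    problems = []
    for row, value in enumerate(values):
        try:
            as_number(value, quotes_matter)
        except CifTypeError as error:
            problems.append((row, str(error)))
    return problems
```

Under the strict reading, the last three rows of the example column are all rejected; under the lenient reading only "not refined" is.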



P.

