Discussion List Archives


Re: Request for approval of CIF version 1.1 specification

  • To: Multiple recipients of list <comcifs-l@iucr.org>
  • Subject: Re: Request for approval of CIF version 1.1 specification
  • From: "Ralf W. Grosse-Kunstleve" <rwgk@yahoo.com>
  • Date: Wed, 16 Oct 2002 02:27:09 +0100 (BST)

Attached are some comments regarding the 1.1 CIF specification
as posted by Brian.

I am wondering if it wouldn't make sense to officially announce these
documents as "Working Specification" with the express intent to adopt
them as "final" after being publicly available and actively used by
implementors for a while (e.g. a year). Hopefully the community will
see this as an invitation to participate in ironing out any kinks.



Version 1.1 Specification

I have to admit that I am still not entirely clear what the authoritative
Version 1.0 Specification is (Hall, Allen & Brown, 1991?). It would be
useful to clearly explain this in the introduction ("Revision
history"). It would also be useful to outline the boundaries between
this specification and the DDL specifications ("Scope").



Definition of terms: consolidate in one place (link).

Regarding quoting rules:

I am asking myself how to deal with a string like

contains both isolated ' and " and ends with a '

If I understand correctly, anything can be handled in a multi-line text
field. However, take the viewpoint of someone implementing a CIF
writer. If the goal is to make the output human-readable, one would
probably prefer quoted strings over multi-line text, in particular
inside a loop_ construct. But then it seems necessary to pre-scan the
text fields to be output to determine what kind of quoting is
applicable. I am under the impression that it could be quite hard to
devise an algorithm that generates both correct and "nice" output. A
human working on a CIF will face similar difficulties. Isn't this an
issue in practice? Could it be useful to include some practical
quoting guidelines?
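
To make the pre-scan concrete, here is a minimal sketch of how a writer
might choose the quoting. It rests on my reading of the rules (a quote
character only terminates a quoted string when it is followed by white
space) and is not taken from the specification. The function name, the
list of reserved words and the order of preference are my own
assumptions, and line-length limits are ignored.

```python
WHITESPACE = " \t"
# Characters that may not start an unquoted value (assumption: my reading of CIF 1.1).
BAD_FIRST = set("\"#$'_;[]")
# Reserved words/prefixes; treated conservatively as prefixes here (assumption).
RESERVED = ("data_", "save_", "loop_", "stop_", "global_")


def _can_delimit(value, quote):
    """True if `quote` never appears in `value` directly before white space."""
    return all(quote + ws not in value for ws in WHITESPACE)


def format_cif_value(value):
    """Return `value` in the least intrusive quoting that can hold it."""
    multiline = "\n" in value or "\r" in value
    if not multiline:
        bare_ok = (
            value not in ("", "?", ".")          # empty / special placeholders
            and not any(ws in value for ws in WHITESPACE)
            and value[0] not in BAD_FIRST
            and not value.lower().startswith(RESERVED)
        )
        if bare_ok:
            return value
        for quote in ("'", '"'):
            if _can_delimit(value, quote):
                return quote + value + quote
    # Fall back to a multi-line text field: <eol>; ... <eol>;
    if "\n;" in value:
        raise ValueError("value contains a line starting with ';'")
    return "\n;" + value + "\n;"


example = "contains both isolated ' and \" and ends with a '"
print(format_cif_value("two words"))   # 'two words'
print(format_cif_value("a' b"))        # "a' b"
print(format_cif_value(example))       # falls back to a text field
```

With this ordering the example string above ends up in a text field,
because both ' and " occur in front of a space. Whether that counts as
"nice" output inside a loop_ is exactly the question.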

I believe many people will expect constructs like
  'an embedded \' quote'
  "an embedded \" double quote"
to work as they do in many programming languages. To avoid this common
misunderstanding it will be useful to provide a link to the "Accented
letters" table in the semantics document.
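
A small sketch of what I think actually happens to the first construct.
It assumes, as above, that a single quote closes the string as soon as
it is followed by white space or the end of the line, and that the
backslash has no escaping role at the syntax level. The function name is
hypothetical.

```python
def read_single_quoted(line):
    """Split `line`, which must start with ', into (value, remainder)."""
    if not line.startswith("'"):
        raise ValueError("not a single-quoted string")
    i = 1
    while i < len(line):
        # A quote only closes the string when white space or end of line follows.
        if line[i] == "'" and (i + 1 == len(line) or line[i + 1] in " \t"):
            return line[1:i], line[i + 1:]
        i += 1
    raise ValueError("unterminated quoted string")


value, rest = read_single_quoted(r"'an embedded \' quote'")
print(repr(value))  # the string stops right after the backslash
print(repr(rest))   # the remainder is left over as stray input
```

Under these assumptions the backslash does not protect the quote: the
value ends after the backslash and the rest of the line is left
dangling.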

17. ... trailing white space on a line may however be elided.

In my opinion the specification should be unambiguous:
White space should not be elided by the parser. The data value should
be left untouched. Eliding is in the regime of semantics, not syntax.

20. ...
By contrast the value of the text field

; foo

is `foo\n bar' ...

Should this be `foo\n  bar' (two spaces before the bar)?

Also, in the semantics document the notation <eol> is used instead
of \n. I suggest using <eol> everywhere.


The ASCII characters at decimal positions 11 (VT or vertical tab) and
12 (FF or form feed), often included in library implementations as
white space characters, are explicitly excluded from the CIF character
set at this revision.


  1. I don't see the benefit of explicitly excluding these
     characters. In practice it means that parsing of old
     files might fail only because these characters are
     embedded. I know there was some discussion already,
     but I cannot remember the details. Is there something
     wrong with the following, more forgiving approach:
     Unquoted VT and FF are treated as white space,
     quoted VT and FF are "passed through" like any other character:

       <WhiteSpace> := { <SP> | <HT> | <VT> | <FF> | <eol>
                         | <TokenizedComments>}+
       <AnyPrintChar> := <OrdinaryChar> | <double_quote> | '#' | '$'
                         | <single_quote> | '_' | <SP> | <HT>  | <VT> | <FF>
                         | ';' | '[' | ']' 

  2. If it is decided to explicitly exclude VT and FF this deviation
     from STAR should (also) be listed under "Implementation restrictions."
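
To illustrate the difference for unquoted input, here is a rough sketch
that splits a chunk of text on white space with and without VT and FF in
the white space set. It deliberately ignores quoted strings, text fields
and comments, and the names are made up for this example.

```python
import re

# White space without VT/FF, as in the draft.
STRICT_WS = re.compile(r"[ \t\r\n]+")
# The more forgiving variant suggested above: VT and FF also separate tokens.
FORGIVING_WS = re.compile(r"[ \t\v\f\r\n]+")


def split_unquoted(text, forgiving=True):
    """Split unquoted text into tokens (no handling of quotes or comments)."""
    pattern = FORGIVING_WS if forgiving else STRICT_WS
    return [tok for tok in pattern.split(text) if tok]


old_file = "_a 1\f_b 2"
print(split_unquoted(old_file, forgiving=True))
print(split_unquoted(old_file, forgiving=False))
```

In the forgiving mode the form feed simply separates two tokens. In the
strict mode it stays glued into a token, and a conforming parser would
have to reject the file.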


How does the "Maximum line length" apply to <eol>\; quoted strings
as explained in the semantics document? For example, is the following

2000 characters ...\
2000 characters ...

Finally, in the post-Fortran and post-C era line length restrictions
seem very arbitrary and are ultimately a nuisance. I'd rather see this
restriction removed from the specification. Programs written in
languages without automatic dynamic memory management could simply
allocate a large buffer (e.g. 128k is perfectly reasonable these days)
and report a "Technical limitation" in the highly unlikely event
that the buffer is insufficient.



This sentence in the introduction leaves me puzzled:

  As computer techniques evolve, it becomes more appropriate to discuss
  the machine-accessible semantic content, or "meaning", of the data in
  such a file.

Again: Definition of terms: consolidate in one place (link).

10. The character string [local] is reserved for local use.
                         ^     ^
Is this [notation] used somewhere else? Are there alternatives?

Handling of long lines

  - I am a bit surprised that this is presented in the semantic
    features document rather than the syntax document.

  - Why do we need this for # comments?

Typographic style codes

  I don't see how these comments could make a significant difference in
  practice, but they significantly contribute to conveying the
  impression that the semantics features are a bit of a hodgepodge.
  I suggest deleting the entire "Typographic style codes" section.
