Re: Request for approval of CIF version 1.1 specification

  • To: Multiple recipients of list <comcifs-l@iucr.org>
  • Subject: Re: Request for approval of CIF version 1.1 specification
  • From: Brian McMahon <bm@iucr.org>
  • Date: Wed, 26 Feb 2003 16:36:47 GMT

Dear Ralf

I have just recently put up the final version of the 1.1 spec on the web:
  http://www.iucr.org/iucr-top/cif/spec/version1.1/
It differs in a few details from the last circulated version, and I realise
that I owe you a response to your useful comments of last October.

> I am wondering if it wouldn't make sense to officially announce these
> documents as "Working Specification" with the express intent to adopt
> them as "final" after being publicly available and actively used by
> implementors for a while (e.g. a year). Hopefully the community will
> see this as an invitation to participate in ironing out any kinks.

I have adopted the terminology "Working specification" and certainly welcome
feedback from the community. However, the intention is to abide by the
spirit of this document as closely as possible.

> I have to admit that I am still not entirely clear what the authoritative
> Version 1.0 Specification is (Hall, Allen & Brown, 1991?). It would be
> useful to clearly explain this in the introduction ("Revision
> history"). It would also be useful to outline the boundaries between
> this specification and the DDL specifications ("Scope").

There is now a short revision history. I think the "scope" is best handled
by extending as needed the subdocuments off the introductory page
  http://www.iucr.org/iucr-top/cif/spec/version1.1/index.html
but we can review that if people feel strongly about it.

> Syntax:
> Definition of terms: consolidate in one place (link).

However, it's helpful to have them within each document also, so I have 
duplicated the "Definition" section in a separate file which can be
consulted in a separate window alongside the current document.

> Regarding quoting rules:
> I am asking myself how to deal with a string like
> ;
> contains both isolated ' and " and ends with a '
> ;
> 
> If I understand correctly, anything can be handled in a multi-line text
> field. However, take the viewpoint of someone implementing a CIF
> writer. If the goal is to make the output human-readable, one would
> probably prefer quoted strings over multi-line text, in particular
> inside a loop_ construct. But then it seems necessary to pre-scan the
> text fields to be output to determine what kind of quoting is
> applicable. I am under the impression that it could be quite hard to
> devise an algorithm that generates both correct and "nice" output. A
> human working on a CIF will face similar difficulties. Isn't this an
> issue in practice? Could it be useful to include some practical
> quoting guidelines?

I leave this open as a topic for further discussion. As things stand, a
conscientious string formatting routine will need to pre-scan to guard
against embedded quotes. It's a shortcoming of the design; but it's
alleviated a little if you have a maximum line length, since anything longer
will automatically go into the semicolon-delimited fields :-)
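
To make the pre-scan concrete, here is a minimal sketch in Python of how a
writer might pick a delimiter. It is an illustration only, not part of the
specification. The names format_cif_value and can_be_bare are hypothetical,
and the rules encoded are my reading of the syntax document: a bare value has
no white space and no reserved leading character, a quote character ends a
quoted string only when white space follows it, and anything else falls back
to a semicolon-delimited text field. The default max_line of 2048 is assumed
to be the line-length limit in force.

```python
import re

# Words that a bare (unquoted) value must not be mistaken for.
RESERVED = ("loop_", "stop_", "global_")
RESERVED_PREFIXES = ("data_", "save_")
# Characters that may not begin a bare value (assumed list).
BAD_FIRST = "_#$'\"[];"


def can_be_bare(value: str) -> bool:
    """Return True if the value can be written without delimiters."""
    if value in ("", "?", "."):
        return False  # "?" and "." carry special meanings when unquoted
    if re.search(r"\s", value) or value[0] in BAD_FIRST:
        return False
    lowered = value.lower()
    return lowered not in RESERVED and not lowered.startswith(RESERVED_PREFIXES)


def format_cif_value(value: str, max_line: int = 2048) -> str:
    """Pre-scan a value and return it bare, quoted, or as a text field."""
    if "\n" not in value and len(value) + 2 <= max_line:
        if can_be_bare(value):
            return value
        for quote in ("'", '"'):
            # The delimiter is usable unless it occurs followed by white space.
            if re.search(re.escape(quote) + r"\s", value) is None:
                return quote + value + quote
    # Fall back to a text field; the caller must start it on a new line.
    if "\n;" in value:
        raise ValueError("a line beginning with ';' would end the field early")
    return ";" + value + "\n;"
```

With this sketch, the string quoted above (isolated ' and " and a trailing ')
fails both quote tests and ends up in a text field, while a value such as
dog's life can still be single-quoted. Wrapping or folding over-long lines
inside the text field is left out.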

> I believe many people will expect constructs like
>   'an embedded \' quote'
> or
>   "an embedded \" double quote"
> to work as they do in many programming languages. To avoid this common
> misunderstanding it will be useful to provide a link to the "Accented
> letters" table in the semantics document.

OK, I added an explanatory paragraph (new para 16).

> 17. ... trailing white space on a line may however be elided.
>                                        ^^^
> In my opinion the specification should be unambiguous:
> White space should not be elided by the parser. The data value should
> be left untouched. Eliding is in the regime of semantics, not syntax.

I have much sympathy with your view. I think this stanza was introduced
because of the difficulty of guaranteeing the preservation of whitespace on
operating systems that are not record-oriented.

> 20. ...
> By contrast the value of the text field ...  is `foo\n bar' ...
> 
> Should this be `foo\n  bar' (two spaces before the bar)?

Yes. Fixed.

> Also, in the semantics document the notation <eol> is used instead
> of \n. I suggest using <eol> everywhere.

OK, did that.

> 22.:
>   2. If it is decided to explicitly exclude VT and FF this deviation
>      from STAR should (also) be listed under "Implementation restrictions."

The exclusion of VT and FF came after several rounds of opinionated
discussion; I didn't feel like revisiting this topic, because I think it's
more a theoretical issue than a real-world problem. However, I have added it
to the list of STAR implementation restrictions as you advise.

> 27.
> How does the "Maximum line length" apply to <eol>\; quoted strings
> as explained in the semantics document? For example, is the following
> legal?
> ;\
> 2000 characters ...\
> 2000 characters ...
> ;

Yes. Each line (at the syntax level) is short. The logical "value" of the
string may run to a much longer extent if an application handler chooses to
reassemble according to the protocol. But that is considered a
semantic-level transformation (which also answers why the line-folding
protocol is treated in the second document).
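
As a rough illustration of that semantic-level reassembly, here is a minimal
sketch in Python. It is not taken from the specification. The function name
unfold_text_field is hypothetical, and the rules are a simplified reading of
the protocol: folding is signalled when the first line of the field holds
only a backslash, and any later line ending in a backslash is joined to the
line that follows. Ignoring trailing spaces and tabs after the backslash is
an assumption.

```python
def unfold_text_field(raw: str) -> str:
    """Reassemble a folded text field.

    `raw` is the field content after the opening semicolon and before the
    closing <eol>; pair, with lines separated by "\n".
    """
    lines = raw.split("\n")
    # Folding applies only if the first line is a lone backslash.
    if lines[0].rstrip(" \t") != "\\":
        return raw
    body = lines[1:]
    pieces = []
    for i, line in enumerate(body):
        stripped = line.rstrip(" \t")
        is_last = i == len(body) - 1
        if stripped.endswith("\\"):
            # Drop the backslash and join directly to the next line.
            pieces.append(stripped[:-1])
        else:
            pieces.append(line if is_last else line + "\n")
    return "".join(pieces)
```

Applied to the example quoted above, the two short physical lines come back
as one long logical value, and a field that does not open with the lone
backslash is returned unchanged. A general-purpose parser never needs to call
anything like this.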

> Finally, in the post-Fortran and post-C era line length restrictions
> seem very arbitrary and are ultimately a nuisance.

There is much sympathy for this view. However, the current revision is
judged to be appropriate to accommodate older and less flexible platforms
that are still in use to an appreciable extent. The expectation is that the
restriction will be dropped in a subsequent revision.

> Semantic:
> This sentence in the introduction leaves me puzzled:
>   As computer techniques evolve, it becomes more appropriate to discuss
>   the machine-accessible semantic content, or "meaning", of the data in
>   such a file.

What it was trying to say is that the STAR and derivative syntax provides an
adequate string-retrieval mechanism, but determining or assigning meaning to
the strings is at a higher level. Originally this higher level was simply
human interpretation of the information content of the strings;
increasingly, through encoding of relationships in the DDL, complex data
models can be assembled in software. But it's somewhat enigmatic, and is
stated rather more clearly in para 14, so I dropped it from here.

> 10. The character string [local] is reserved for local use.
>                          ^     ^
> Is this [notation] used somewhere else? Are there alternatives?

It's not a notational artefact - the bracket characters are literal. That's
now clarified in the new draft.

> Handling of long lines
>   - I am a bit surprised that this is presented in the semantic
>     features document rather than the syntax document.

As I mentioned above, it's classed as "semantic" in the sense that a
general-purpose parser need not concern itself with the interpretation of
the text field; it can just be treated as a dumb string. Specific
applications may need to do some reassembly, but only if the value is
relevant to that application's purpose.

> Typographic style codes
>   I don't see how these comments could make a significant difference in
>   practice, but they significantly contribute to conveying the
>   impression that the semantics features are a bit of a hodgepodge.
>   I suggest deleting the entire "Typographic style codes" section.

They've been introduced as part of the Acta markup, and so should be
documented somewhere.

I don't dispute the characterisation of these semantic features as
"hodgepodge". I think there are ways to clean this up a lot by
introducing the idea of content handlers which can be tailored to process
specific fields based on the content-type and content-encoding of those
fields (you will see strong borrowing from the MIME approach here). All this
is material for a subsequent revision, though.

Thanks again for your careful critique of these points.

Best wishes
Brian