
Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Nick's latest draft looks promising.  Remembering that standards
involve being pedantic, note the following comments.  Nick writes:

	However an appropriate separator is required between tokens to
	unambiguously parse a CIF2 document. The appropriate separator
	is defined by the context in which it is used. For example at
	the highest-level, a whitespace serves this purpose. In a List
	object the ASCII , serves this purpose. In the Associative
	Array object, the separators are ASCII : and ASCII ,. The
	absence of a separator or use of the incorrect separator will
	give rise to ambiguity and possible error. The coercion rules
	for these cases need to be argued by the “community”.

As a result of the character set restrictions, the first line would
more accurately read "However, an appropriate separator is *sometimes*
required between tokens to unambiguously parse a CIF2 document.", and
somewhere I would add something like: "For consistency, backwards
compatibility and transferability into non-CIF applications, a
separator must always appear between top-level tokens even when not
strictly required in order to successfully scan the tokens".

Regarding coercion rules, I would like option (1) "give error message
and die" to always be legal behaviour, and the only sanctioned
behaviour for a validating parser.  That said, discussion of recovery
strategies as per option (2) is appropriate in notes on the standard,
and somewhere it should be noted that these changes have made CIF2
somewhat more robust against certain types of file corruption.
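
To make the two options concrete, here is a minimal sketch in Python of a
top-level scanner with a strict mode (option 1) and a lenient mode
(option 2).  It is only an illustration, not proposed standard text: the
names scan_top_level and CifSyntaxError are hypothetical, and it assumes
that the first matching quote closes a quoted string and that
non-delimited strings cannot contain quote characters.

```python
class CifSyntaxError(Exception):
    """Raised by the strict scanner (hypothetical name)."""


def scan_top_level(text, strict=True):
    """Split text into top-level tokens: quoted strings and bare words.

    Assumptions for this sketch only: a quoted string ends at the first
    matching quote character, and a bare (non-delimited) word cannot
    contain a quote character.
    """
    tokens = []
    i, n = 0, len(text)
    last_end = -1  # index just past the previous token
    while i < n:
        if text[i].isspace():
            i += 1
            continue
        # A token starts here. If it begins exactly where the previous
        # one ended, no whitespace separated the two.
        if tokens and i == last_end and strict:
            raise CifSyntaxError(
                "missing whitespace between tokens at offset %d" % i
            )
        if text[i] in "'\"":
            close = text.find(text[i], i + 1)
            if close == -1:
                raise CifSyntaxError("unterminated string at offset %d" % i)
            tokens.append(text[i + 1:close])
            i = close + 1
        else:
            j = i
            while j < n and not text[j].isspace() and text[j] not in "'\"":
                j += 1
            tokens.append(text[i:j])
            i = j
        last_end = i
    return tokens
```

With strict=False the input 'abc''def' still scans as two tokens, which
is the kind of recovery option (2) describes; with strict=True the same
input is rejected, which is all a validating parser would need to do.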

My reason for insisting on (1) being legal is that this should be
sufficient to ensure that CIF2 writers always pad between tokens, and
the only times that approaches in (2) will be required are when files
have been corrupted.  I note also that STAR and CIF2 could diverge at
this point without undue problems: e.g. STAR could adopt (2) and CIF2
could adopt (1), CIF2 could require whitespace, STAR could be less
strict.

Moving on to eliding apostrophes and quotes: I have yet to see the need
for doing this at all, given that we will have triple-quoted and
semicolon-delimited strings for the pathological cases of single-line
strings which contain both quote and apostrophe characters. If we must
have them I agree with where Simon's original thinking was going, and
what Nick's latest email (as of this morning) mentioned.  The source
of the problem is that the elide character is overloaded: it fulfills
a function on the lexical level and arbitrary functions at higher
levels (IUCr, LaTeX, Unicode...).  To simplify things, you need to
decouple it as follows (as Nick wrote):

For single-quoted strings:
\'   -> '   delivered to the application
\\   -> \   delivered to the application
\\\' -> \'  delivered to the application
\x   -> \x  delivered to the application, for any other character x
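
As a sanity check on the rules above, here is a short sketch in Python of
how a lexer might apply them to the text between the delimiting
apostrophes.  The function name unelide_single_quoted is hypothetical;
the sketch assumes the lexer has already located the closing apostrophe
and only shows what would be delivered to the application.

```python
def unelide_single_quoted(body: str) -> str:
    """Apply the elide rules to the body of a single-quoted string.

    A backslash followed by an apostrophe or another backslash is
    dropped and the following character is kept. A backslash followed
    by any other character is passed through untouched, so higher-level
    conventions still see it.
    """
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body) and body[i + 1] in ("'", "\\"):
            out.append(body[i + 1])  # keep only the elided character
            i += 2
        else:
            out.append(ch)  # ordinary character, or backslash before "x"
            i += 1
    return "".join(out)
```

Note that the third rule needs no special case: \\\' is handled as \\
followed by \', which is why only two lexical-level escapes are needed.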

And why should the IUCr decide how things are done at this level?  We
give them a way to get an acute accent (or they can use Unicode).

But frankly, I fail to see the need for this eliding, as I failed to
see the need for optional whitespace.  Perhaps an example, however
artificial, where only this eliding can produce the required string?


-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group

