Discussion List Archives

Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Dear James,

   Nick has packed his computer and has asked me to reply.  The separator
is required.  The coercion rules exist to give the parser something sensible
to do when recovering from the error, so that the parse can move on and do
other things.  A strict parser would then report all of these as fatal errors.
This is more useful than reporting one error, dying, having the users fix
that error, running another pass to get the next error, etc.  Yes, a
liberal parser might treat these as warnings, but that does not
mean we are telling the users to write their CIF2 files that way.  Instead,
it gives us a framework for suggesting how they should rewrite
their files so as not to get these warnings or errors.
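
   As a rough illustration of reporting every problem in one pass, here is a
minimal sketch in Python.  It is not CIF2 syntax: the function name scan and
the token rules (single-quoted strings without escapes, or runs of characters
that are neither whitespace nor quotes) are assumptions made up for this
example.  When a separator is missing, the scanner records the problem and
carries on as if the separator had been there.

```python
def scan(text):
    """Split text into top-level tokens, recording every problem found.

    Hypothetical simplification: a token is either a single-quoted string
    (no escapes) or a run of non-whitespace, non-quote characters.
    """
    tokens, problems = [], []
    i, n = 0, len(text)
    while i < n:
        if text[i].isspace():
            i += 1
            continue
        start = i
        if text[i] == "'":
            end = text.find("'", i + 1)
            if end == -1:
                # Recover by treating the rest of the input as the string.
                problems.append((start, "unterminated string"))
                end = n - 1
            i = end + 1
        else:
            while i < n and not text[i].isspace() and text[i] != "'":
                i += 1
        tokens.append(text[start:i])
        # A token must be followed by whitespace or by the end of input.
        # If it is not, note the problem and keep scanning.
        if i < n and not text[i].isspace():
            problems.append((i, "missing separator"))
    return tokens, problems


tokens, problems = scan("'a''b' c'd'")
for offset, message in problems:
    print(f"offset {offset}: {message}")
```

A strict caller could treat a non-empty problems list as fatal after printing
all of it; a liberal caller could print the same list as warnings and use the
recovered tokens.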

   I repeat -- the standard would be to require a separator, but we are
clearly defining which separators are required where.

   To see a model, try two butted strings (two string literals with no
separator between them) in Python.
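
   For anyone without an interpreter to hand, the Python behaviour referred to
is that adjacent string literals are accepted and joined into one string.  The
variable names below are only for illustration.

```python
# Two string literals with no separator between them ("butted" strings).
# Python accepts this and concatenates them into a single string.
butted = 'abc''def'
mixed = "abc"'def'

print(butted)  # abcdef
print(mixed)   # abcdef
```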

  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769


On Fri, 16 Oct 2009, James Hester wrote:

> Nick's latest draft looks promising.  Remembering that standards
> involve being pedantic, note the following comments.  Nick writes:
> 	However an appropriate separator is required between tokens to
> 	unambiguously parse a CIF2 document. The appropriate separator
> 	is defined by the context in which it is used. For example at
> 	the highest-level, a whitespace serves this purpose. In a List
> 	object the ASCII , serves this purpose. In the Associative
> 	Array object, the separators are ASCII : and ASCII ,. The
> 	absence of a separator or use of the incorrect separator will
> 	give rise to ambiguity and possible error. The coercion rules
> 	for these cases need to be argued by the “community”.
> As a result of the character set restrictions, the first line would
> more accurately read "However, an appropriate separator is *sometimes*
> required between tokens to unambiguously parse a CIF2 document.", and
> somewhere I would add something like: "For consistency, backwards
> compatibility and transferability into non-CIF applications, a
> separator must always appear between top-level tokens even when not
> strictly required in order to successfully scan the tokens".
> Regarding coercion rules, I would like option (1) "give error message
> and die" to always be legal behaviour, and the only sanctioned
> behaviour for a validating parser.  That said, discussion of recovery
> strategies as per option (2) is appropriate in notes on the standard,
> and somewhere it should be noted that these changes have made CIF2
> somewhat more robust against certain types of file corruption.
> My reason for insisting on (1) being legal is that this should be
> sufficient to ensure that CIF2 writers always pad between tokens, and
> the only times that approaches in (2) will be required are when files
> have been corrupted.  I note also that STAR and CIF2 could diverge at
> this point without undue problems: e.g. STAR could adopt (2) and CIF2
> could adopt (1), CIF2 could require whitespace, STAR could be less
> strict.
> Moving on to eliding apostrophes and quotes: I have yet to see the need
> for doing this at all, given that we will have triple quoted and
> semicolon delimited strings for the pathological cases of single line
> strings which contain both quote and apostrophe characters. If we must
> have them I agree with where Simon's original thinking was going, and
> what Nick's latest email (as of this morning) mentioned.  The source
> of the problem is that the elide character is overloaded: it fulfills
> a function on the lexical level and arbitrary functions at higher
> levels (IUCr, LaTeX, Unicode...).  To simplify things, you need to
> decouple it as follows (as Nick wrote):
> For single quote strings:
> \' -> '  delivered to the application
> \\ -> \  delivered to the application
> \\\' -> \' delivered to the application
> \x for any other character -> \x
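
The four rules above can be sketched in Python as follows.  This is only one
reading of them, not an agreed CIF2 algorithm: the function name unelide is
hypothetical, it operates on the text between the delimiting single quotes,
and passing a lone trailing backslash through unchanged is an assumption the
rules do not cover.

```python
def unelide(body):
    """Apply the quoted elide rules to the body of a single-quoted string."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            if nxt in ("'", "\\"):
                # \' -> '   and   \\ -> \
                out.append(nxt)
            else:
                # \x -> \x : left intact for higher levels to interpret
                out.append(ch + nxt)
            i += 2
        else:
            # Ordinary character (or a lone trailing backslash: assumption)
            out.append(ch)
            i += 1
    return "".join(out)


print(unelide("don\\'t"))  # don't
```

Under this reading the third rule follows from the first two: the leading
pair of backslashes yields one backslash, and the remaining backslash plus
apostrophe yields the apostrophe.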
> And why should the IUCr decide how to do things on this level - we
> give them a way to get an acute accent (or use Unicode).
> But frankly, I fail to see the need for this eliding, as I failed to
> see the need for optional whitespace.  Perhaps an example, however
> artificial, where only this eliding can produce the required string?
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group