Re: [ddlm-group] Revised version of syntax change summary document
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Revised version of syntax change summary document
- From: Joe Krahn <krahn@niehs.nih.gov>
- Date: Wed, 09 Dec 2009 18:31:13 -0500
- In-Reply-To: <20091209140355.GA29341@emerald.iucr.org>
- References: <20091209100252.GA6642@emerald.iucr.org><20091209140355.GA29341@emerald.iucr.org>
Brian McMahon wrote:
> A few comments on the latest version of the CIF2 syntax changes
> summary document.
>
> I'm glad to see the explanation of tokens and separators. I was going
> to ask for something of the sort. The visual aid is quite a good way
> of doing this - and it does emphasise that the word "token" is a
> rather dangerous (i.e. potentially ambiguous) one, since it can
> apply promiscuously to a complete list or to lists contained in lists
> or - n'est-ce pas? - to the individual elements within a list.
>
> For the target audience for this document, this level of ambiguity,
> normally resolved by context, is probably OK, but we should be very
> careful in drafting the final complete specification document.
>
> In similar vein, a complete specification should probably define
> very carefully what is meant by phrases such as "lexical characters".
> Again, I don't think that degree of pedantry is necessary for the
> purposes of getting this out to the developer community.
>
> A few more specific points.
>
> 1. Permitted character set (under "Terminology" and/or "Encoding").
> CIF 1.1 explicitly EXCLUDES some of the characters in the ASCII set,
> usually thought of as 'control characters'. Specifically, the excluded
> characters are (decimal values) 00-08, 11, 12, 14-31 and 127. Should
> this be restated clearly in this document for clarity?

I would say no. This is just a change document. Once it is final, it
should include the full syntax.

>
> [Possibly relevant: what are the "additional 20 UNICODE characters
> that constitute whitespace" mentioned in the "Terminology section"?]

http://en.wikipedia.org/wiki/Whitespace_(computer_science)

It probably is less confusing just to say that "no UNICODE characters
are accepted as whitespace". The set of UNICODE whitespace may change
over time.

>
> 2. Encoding.
> "UTF-8 directly supports an extensive range of printable objects that
> are not accessible through ASCII."
> Not strictly true: acceptance of a
> \uNNNN encoding would give you access to all of these using the ASCII
> character set. Just drop this sentence. I suggest dropping the next
> also. We haven't yet revisited my suggestion that the IUCr markup
> conventions be disallowed in CIF 2 - which, of course, isn't a
> syntactic issue at this level of discourse.

Other than UTF-8 being widely used, this is similar to the issue of
escapes being handled by the application rather than the lexer. STAR and
CIF leave encoding as an implementation-specific procedure. UTF-8
standardizes the encoding, avoiding the mess of inconsistent
implementations.

As for the IUCr markup conventions, there is no reason to disallow them;
just deprecate them. An implementation should be able to apply any sort
of character processing for a specific purpose. Banning one specific
type is not useful. I would say that any sort of encoding that can
simply be handled by Unicode characters should be [strongly] discouraged.

----

In the Table data type description, what does this mean:

"This implies implicit line joining, and there is no newline token
between implicit continuation lines (as in the previous example)."

Is this referring to how the full list is presented as a standardized
string format? Why claim there is no newline character? A newline is
valid white space, and it would be equally valid with all newline
characters and no spaces.

---

Some rationales are still missing, and I am still looking forward to a
concise explanation of these:

1) Why define new quotation types.
2) Why disallow close-quote characters contained within a string by
   dropping the followed-by-whitespace rule.
3) Why require quotes in the key names for table data, and why only
   ' or " quotes.

The Reasoning paragraph sort of explains #2 with "...CIF2 we adopt a
simpler, more common approach...". But it does not explain how the
change solves any problem. Overall, simpler is not the goal, because
there are now 5 quote types.
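To make the question in #2 concrete, here is a minimal sketch of the two closing-quote rules being compared. It is only an illustration of my reading of the rules, not code from the summary document or from any existing parser; the function names read_quoted_cif1 and read_quoted_first_match are hypothetical, and whitespace is assumed to be the ASCII space, tab, CR and LF.

```python
WHITESPACE = " \t\r\n"  # assumption: ASCII whitespace only


def _check_open(text, start):
    quote = text[start]
    if quote not in "'\"":
        raise ValueError("not a quoted string")
    return quote


def read_quoted_cif1(text, start=0):
    """CIF 1.1 style: a quote closes the string only when it is
    followed by whitespace or by the end of the input."""
    quote = _check_open(text, start)
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch in "\r\n":
            break  # quoted strings may not span lines
        if ch == quote and (i + 1 == len(text) or text[i + 1] in WHITESPACE):
            return text[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated quoted string")


def read_quoted_first_match(text, start=0):
    """Proposed style (as I understand it): the first matching quote
    closes the string, so the delimiter cannot appear inside it."""
    quote = _check_open(text, start)
    end = text.find(quote, start + 1)
    if end == -1 or any(c in "\r\n" for c in text[start + 1:end]):
        raise ValueError("unterminated quoted string")
    return text[start + 1:end], end + 1


sample = "'it's fine' next"
print(read_quoted_cif1(sample))         # whole phrase is one value
print(read_quoted_first_match(sample))  # string ends at the apostrophe
```

With the first rule the sample yields the single value it's fine; with the second the string ends after "it", and a parser would then have to reject the text that follows because it is not separated by whitespace.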
The idea that the white space and quote on each end of the string are
part of the string token is confusing, but not related to the quoting
syntax at all. Is this implying that CIF2 does not include the quotes in
the string token? Aren't quotes always stripped off in the end, and
isn't this just an implementation detail about processing strings at
the lexer versus dictionary level, and not really relevant to the file
syntax?

Thanks,
Joe Krahn
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group