Re: [ddlm-group] Straw poll results
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Straw poll results
- From: Brian McMahon <bm@iucr.org>
- Date: Wed, 14 Oct 2009 10:34:13 +0100
- In-Reply-To: <279aad2a0910130827h34774cfey78af132620cf6f74@mail.gmail.com>
- References: <279aad2a0910130827h34774cfey78af132620cf6f74@mail.gmail.com>

1. UTF-8: "binary" - yes. Principle of maximal disruption; but also
it's a clean upgrade path. If any byte with high-order bit set occurs
in a CIF known to be conformant to the new 1.2 specs, a "validating"
CIF lexer analyses the value, and if appropriate scans ahead the next
one, two or three bytes to establish that they collectively represent
a valid Unicode character, UTF-8 encoded. If they do not, the lexer
throws an exception. A "non-validating" lexer presumably just accepts
any high-bit byte and leaves it to someone else to check that it's
part of a UTF-8 character. Maybe no-one will ever write such a
"non-validating" parser, but it does seem to me that checking the
validity of the UTF-8 encoded Unicode character set is a more
heavyweight process than checking that only "printable ASCII" plus a
few permitted control characters are present.

In terms of managing services where, let us say, Acta accepts CIF1.2
but will deliver CIF1.1 upon request, there are established
UTF-8 -> ASCII conversions based upon the \u or HTML character
references (\u27, λ) which could be used to accomplish this. (Simon
will perhaps discuss with me offline whether any of these would cause
particular difficulties with our software.) Setting up such a
conversion procedure in a way that's unambiguous and allows full
CIF1.2->1.1->1.2 cycling is not necessarily easy, and may require
additional conventions to be adopted at the application level (or by
particular communities); but I'm with Nick in saying that the
*specification* should be clear - full 1.2 compatibility requires that
UTF-8 Unicode characters be identified as such.

Let me also note that what is actually "done" with characters that are
not within the ASCII set is currently entirely an application issue.
So, even the vi editor will accept high-order 8-bit bytes without
mangling them.
It may not render them as character glyphs, but neither does it
corrupt them - and it does permit you to construct UTF-8 byte
sequences if you are clever enough to be able to import them from some
other application.

2. Trailing whitespace. Let me be specific about my concern (which
might or might not be what other people are thinking about in this
context). Since "token" has proved ambiguous, let me speak instead of
"data values", which might be quote-delimited strings. As I understood
the proposal, the following would now be legitimate:

    loop_ _colour 'red''blue''green'

as a loop of three values. If this is permitted, are the rules for
mixing with unquoted strings (a) unambiguous and (b) straightforward
enough that every lexer/parser will get them right? How does one
interpret the following examples:

                                            INTENDED
    loop_ _colour 'red'blue'green'        # 'red' blue 'green'
    loop_ _colour 'red' blue'green'       # 'red' blue 'green'
    loop_ _colour 'red'blue 'green'       # 'red' blue 'green'
    loop_ _colour 'red'''blue'green'      # 'red' '' blue 'green'
    loop_ _colour 'red''''blue'''green    # 'red' '''blue''' green

The last could also be

                                          # 'red' '' 'blue' '' green

(or has it already been ascertained that quoted strings cannot have
null values? I guess that's implicit in having a ''' delimiter at
all.)

Then one has other possible cases, some of which are legal, some not,
but it may not be immediately obvious:

    loop_ _colour'red' 'green' 'blue'
    loop_ _colour 'red' 'green' 'blue'_name Fred
    loop_ _colour 'red''green''blue'_name Fred
    loop_ _colour 'red''green''blue' _name Fred

I suspect that formal rules can be constructed that do prevent
ambiguity, but in this case it still seems that 'maximal disruption'
is seeping over into 'gratuitous disruption' - unless the pay-off is
really worthwhile. So, on that basis, my vote is currently 2(b).
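
To make the two readings of the last example in point 2 concrete, here
is a minimal sketch in Python of a value splitter that ends a quoted
string at the first occurrence of its delimiter. It is an illustration
only, not the proposed grammar: the name split_values and the
triple_quotes flag are hypothetical, the sketch assumes that a quote
character also ends an unquoted value (the restricted character set is
not spelled out in this message), and it handles data values only, not
data names or the loop_ keyword.

```python
from typing import List


def split_values(text: str, triple_quotes: bool = False) -> List[str]:
    """Split a run of data values with no whitespace required between them.

    Assumptions (not taken from any CIF specification):
    - a quoted string ends at the first occurrence of its delimiter;
    - an unquoted value ends at whitespace or at a quote character;
    - if triple_quotes is True, ''' is tried before ' as a delimiter.
    """
    values: List[str] = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch.isspace():
            i += 1
        elif ch == "'":
            # Pick the delimiter, then terminate on its first recurrence.
            use_triple = triple_quotes and text.startswith("'''", i)
            delim = "'''" if use_triple else "'"
            start = i + len(delim)
            end = text.find(delim, start)
            if end == -1:
                raise ValueError(f"unterminated quoted string at offset {i}")
            values.append(text[start:end])
            i = end + len(delim)
        else:
            # Unquoted value: runs until whitespace or a quote character.
            start = i
            while i < n and not text[i].isspace() and text[i] != "'":
                i += 1
            values.append(text[start:i])
    return values
```

Under these assumptions, split_values("'red'blue'green'") gives
['red', 'blue', 'green'], and "'red'''blue'green'" gives
['red', '', 'blue', 'green']. For "'red''''blue'''green" the result
depends on the flag: ['red', '', 'blue', '', 'green'] without it and
['red', 'blue', 'green'] with it, which are the two interpretations
listed above. The choice between them has to be made by the
specification, not by the lexer.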
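
Going back to point 1, the "validating" check described there (on a
byte with the high-order bit set, scan ahead one, two or three bytes
and confirm that together they encode a valid Unicode character) can
be sketched as follows. The function names find_invalid_utf8 and
validate_utf8 are hypothetical, and this is only one way to write the
check; Python's built-in bytes.decode("utf-8") enforces the same
rules.

```python
from typing import Optional

# (lowest lead byte, highest lead byte, continuation bytes,
#  lead-byte payload mask, smallest code point allowed for this length)
_LEAD_RANGES = (
    (0xC2, 0xDF, 1, 0x1F, 0x80),
    (0xE0, 0xEF, 2, 0x0F, 0x800),
    (0xF0, 0xF4, 3, 0x07, 0x10000),
)


def find_invalid_utf8(data: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks UTF-8, or None."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:  # high-order bit clear: plain ASCII
            i += 1
            continue
        for low, high, extra, mask, smallest in _LEAD_RANGES:
            if low <= lead <= high:
                break
        else:
            return i  # stray continuation byte or impossible lead byte
        tail = data[i + 1 : i + 1 + extra]
        if len(tail) < extra or any((b & 0xC0) != 0x80 for b in tail):
            return i  # truncated sequence or bad continuation byte
        code_point = lead & mask
        for b in tail:
            code_point = (code_point << 6) | (b & 0x3F)
        if code_point < smallest:  # overlong encoding
            return i
        if 0xD800 <= code_point <= 0xDFFF:  # surrogates are not characters
            return i
        if code_point > 0x10FFFF:  # beyond the Unicode range
            return i
        i += 1 + extra
    return None


def validate_utf8(data: bytes) -> None:
    """Raise ValueError, as a validating lexer might, on invalid input."""
    offset = find_invalid_utf8(data)
    if offset is not None:
        raise ValueError(f"invalid UTF-8 sequence at byte offset {offset}")
```

A "non-validating" lexer in the sense used above would skip this step
and pass any high-bit byte through untouched. The extra checks for
overlong forms, surrogates and the upper limit of the Unicode range
are what make this heavier than a test for printable ASCII.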

Pardon the verbosity of all this, but I'm trying to cover (in my own
mind) many of the issues that will need to be treated explicitly in
the next edition of International Tables G :-)

Brian

On Tue, Oct 13, 2009 at 06:27:20PM +0300, James Hester wrote:
> Here are the results of the straw poll. See the end of the email for
> detailed vote counts, and note the request for a further vote on
> certain issues.
>
> CONCLUSIONS
> ===========
>
> 1. UTF8 will be supported. Not clear on asciified version or
> binary. Therefore, please comment and vote on the following, given
> that UTF8 will be included in the new standard:
>
> (a) UTF8 should be supported in standard form only (i.e. 'binary'
> characters with values above 127 will appear in CIF files)
>
> (b) An asciified version only should be supported. An example would
> be the syntax \uxxxx, where xxxx refers to the Unicode code point of
> the character in hexadecimal notation. NB this is not strictly UTF8,
> but simply a Unicode representation.
>
> My vote: 1.a
>
> 2. Termination of quoted strings on first occurrence of quote delimiter
> and restriction of character set for non-delimited strings: Approved,
> but not clear whether to deprecate first or move immediately to
> requirement. Upon long consideration of Brian's email and Herbert's
> reservations, and two cups of tea, and some chocolate, I am happy to
> change my votes to 1.2 and 2.3 (and perhaps call the new CIF syntax
> 2.0 rather than 1.2), therefore I declare these proposals approved as
> a requirement in the new standard. I'll write a separate email on
> this.
>
> However: Brian and James want to require whitespace between tokens
> outside compound expressions regardless of it now becoming strictly
> unnecessary in several cases. Given that the above proposals have
> been passed, please vote again on the following options:
>
> (a) Whitespace is not required between tokens unless tokens could not
> otherwise be separated; writers are encouraged to pad between tokens
> (b) Whitespace must always appear between tokens outside compound expressions
> (c) Whitespace must always appear between tokens both in and outside
> compound expressions
>
> My vote: 2.b
>
> Detailed vote summary
> =====================
>
> Issue1: Removing the requirement for a trailing whitespace after
> quoted strings outside of bracketed constructs.
> Options: 1.1. Preserve the current convention as is
>          1.2. Terminate all quoted strings on the occurrence of the
>               trailing quoted delimiter without consideration of the
>               next character
>          1.3. Deprecate rather than require 1.2
>
> 1.1: Nobody (Herbert prefers if 1.3 not an option)
> 1.2: Brian (but whitespace required between tokens), Nick, Simon
> 1.3: Herbert, James (but whitespace required between tokens)
>
> Difficult to determine any clear preference from John W., but he seems
> happy to go along with the changes we are discussing so long as there
> is a clear fallback position.
>
> Issue2: Restriction of the character set for non-delimited strings
> outside of bracketed constructs
> Options: 2.1. Preserve the current convention as is
>          2.2. Modify the current convention to deprecate use of
>               any characters other than a strictly limited set
>               of characters, adding a warning on reads and
>               defaulting to add quote marks on write
>          2.3. Modify the current convention to forbid the use of
>               any characters other than a strictly limited set
>               of characters, making it an error to read a non-delimited
>               string that does not comply even if the intention
>               can be inferred from context
>
> 2.1: Nobody
> 2.2: Herbert, James
> 2.3: Nick, Simon, (John)
>
> UTF8:
>
> Do not use: Nobody
> Use: Simon, Brian, John
> Use, binary: Herbert, James
> Use, asciified: Nick
>
> A clear preference for binary or ascii can't be gleaned from Brian and
> Simon's and John's emails, so I've left them as simply 'Use'.
>
>
> --
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group