Discussion List Archives

Re: [ddlm-group] Straw poll results

1. UTF-8: "binary" - yes.

Principle of maximal disruption; but also it's a clean upgrade path.

If any byte with high-order bit set occurs in a CIF known to be
conformant to the new 1.2 specs, a "validating" CIF lexer analyses
the value, and if appropriate scans ahead over the next one, two or
three bytes to establish that they collectively represent a
valid Unicode character, UTF-8 encoded. If they do not, the
lexer throws an exception. A "non-validating" lexer presumably just
accepts any high-bit byte and leaves it to someone else to check that
it's part of a UTF-8 character. Maybe no-one will ever write such a
"non-validating" parser, but it does seem to me that checking the
validity of the UTF-8 encoded Unicode character set is a more
heavyweight process than checking that only "printable ASCII" plus
a few permitted control characters are present.
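
To make the comparison concrete, here is a minimal Python sketch of the
two checks. The function names are hypothetical, and the set of control
characters permitted alongside printable ASCII (HT, LF, CR) is an
assumption made for illustration, not a quotation from any CIF spec.

```python
def validate_utf8(data: bytes) -> None:
    """Raise ValueError at the first sequence that is not well-formed UTF-8."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                      # plain ASCII: no look-ahead needed
            i += 1
            continue
        # The lead byte fixes how many continuation bytes must follow and
        # the smallest code point that a sequence of this length may encode.
        if 0xC2 <= b <= 0xDF:
            need, cp, low = 1, b & 0x1F, 0x80
        elif 0xE0 <= b <= 0xEF:
            need, cp, low = 2, b & 0x0F, 0x800
        elif 0xF0 <= b <= 0xF4:
            need, cp, low = 3, b & 0x07, 0x10000
        else:
            raise ValueError(f"invalid lead byte 0x{b:02X} at offset {i}")
        if i + need >= n:
            raise ValueError(f"truncated sequence at offset {i}")
        for k in range(1, need + 1):      # scan ahead one, two or three bytes
            c = data[i + k]
            if (c & 0xC0) != 0x80:
                raise ValueError(f"bad continuation byte at offset {i + k}")
            cp = (cp << 6) | (c & 0x3F)
        # Reject overlong forms, surrogates and values beyond U+10FFFF.
        if cp < low or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"invalid code point U+{cp:04X} at offset {i}")
        i += need + 1


def is_plain_ascii(data: bytes) -> bool:
    """The lighter check: printable ASCII plus HT, LF, CR (assumed set)."""
    return all(32 <= b <= 126 or b in (9, 10, 13) for b in data)
```

A "non-validating" lexer in this picture would simply skip the call to
validate_utf8 and pass high-bit bytes through untouched.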

In terms of managing services where, let us say, Acta accepts CIF1.2
but will deliver CIF1.1 upon request, there are established
UTF-8 -> ASCII conversions based upon the \u or HTML character
references (\u27, &lambda;) which could be used to accomplish this.
(Simon will perhaps discuss with me offline whether any of these
would cause particular difficulties with our software.)
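
As a rough illustration of the HTML-character-reference route, the
following Python sketch uses only standard-library calls (the
"xmlcharrefreplace" error handler and html.unescape). The function
names are hypothetical, and nothing here is meant to describe how any
journal's software actually does it.

```python
import html


def to_ascii(text: str) -> str:
    """Replace every non-ASCII character by a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def from_ascii(text: str) -> str:
    """Expand character references back into Unicode characters."""
    return html.unescape(text)


sample = "λ = 1.5418 Å"
ascii_form = to_ascii(sample)     # "&#955; = 1.5418 &#197;"
restored = from_ascii(ascii_form)
```

Note that from_ascii also expands any reference that was already
present as literal text in the input, which is one source of the
round-trip difficulty.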

Setting up such a conversion procedure in a way that's unambiguous
and allows full CIF1.2->1.1->1.2 cycling is not necessarily easy,
and may require additional conventions to be adopted at the
application level (or by particular communities); but I'm with Nick
in saying that the *specification* should be clear - full 1.2
compatibility requires that UTF-8 Unicode characters be identified
as such.
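
A small Python sketch of why the cycling needs an extra convention,
assuming a \uxxxx notation restricted to the Basic Multilingual Plane.
All names are hypothetical, and the "escape the backslash" rule in
asciify_safe is just one possible convention, not an agreed one.

```python
import re

_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def asciify(text: str) -> str:
    """Naive conversion: non-ASCII characters become \\uxxxx."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04x}")
        else:
            raise ValueError("this sketch handles BMP characters only")
    return "".join(out)


def asciify_safe(text: str) -> str:
    """As asciify, but first protect any literal backslash-u in the input
    by writing its backslash as \\u005c (an assumed convention)."""
    return asciify(text.replace("\\u", "\\u005cu"))


def deasciify(text: str) -> str:
    """Expand every \\uxxxx escape in a single left-to-right pass."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# A value holding both a real lambda and the six literal characters \u03bb.
original = "real λ, literal \\u03bb"
naive_cycle = deasciify(asciify(original))
safe_cycle = deasciify(asciify_safe(original))
```

With the naive pair the literal text comes back as a second λ, so the
cycle is lossy; with the extra escaping rule the original is recovered.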

Let me also note that what is actually "done" with characters that
are not within the ASCII set is currently entirely an application issue.
So, even the vi editor will accept high-order 8-bit bytes without
mangling them.  It may not render them as character glyphs, but
neither does it corrupt them - and it does permit you to construct
UTF-8 byte sequences if you are clever enough to be able to import
them from some other application.


2. Trailing whitespace. Let me be specific about my concern (which
might or might not be what other people are thinking about in this
context).  Since "token" has proved ambiguous, let me speak instead
of "data values", which might be quote-delimited strings.

As I understood the proposal, the following would now be legitimate:

loop_  _colour   'red''blue''green'

as a loop of three values. If this is permitted, are the rules for
mixing with unquoted strings (a) unambiguous and (b) straightforward
enough that every lexer/parser will get them right? How does one
interpret the following examples:

                                                  INTENDED
loop_  _colour   'red'blue'green'     #  'red'     blue      'green'
loop_  _colour   'red' blue'green'    #  'red'     blue      'green'
loop_  _colour   'red'blue 'green'    #  'red'     blue      'green'
loop_  _colour   'red'''blue'green'   #  'red' ''  blue      'green'
loop_  _colour   'red''''blue'''green #  'red'    '''blue'''  green

The last could also be
                                      #  'red' '' 'blue'  ''  green

(or has it already been ascertained that quoted strings cannot
have null values? I guess that's implicit in having a ''' delimiter
at all.)
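
The two readings of the last example can be reproduced with a short
Python sketch. It assumes that a quoted string ends at the first
matching delimiter, that an unquoted value ends at whitespace or at a
quote, and that a ''' delimiter may or may not be recognised; the
function and its "triple" switch are hypothetical.

```python
def tokenize(values: str, triple: bool = False) -> list:
    """Split a run of data values; quoted strings end at the first
    matching delimiter, with no whitespace required afterwards."""
    tokens = []
    i, n = 0, len(values)
    while i < n:
        ch = values[i]
        if ch.isspace():
            i += 1
        elif ch == "'":
            # Optionally prefer the three-quote delimiter when it is present.
            delim = "'''" if triple and values.startswith("'''", i) else "'"
            start = i + len(delim)
            end = values.find(delim, start)
            if end < 0:
                raise ValueError(f"unterminated string starting at column {i}")
            tokens.append(values[start:end])
            i = end + len(delim)
        else:
            start = i
            while i < n and not values[i].isspace() and values[i] != "'":
                i += 1
            tokens.append(values[start:i])
    return tokens


with_triple = tokenize("'red''''blue'''green", triple=True)
without_triple = tokenize("'red''''blue'''green", triple=False)
```

Here with_triple is ['red', 'blue', 'green'] and without_triple is
['red', '', 'blue', '', 'green'], so the outcome depends entirely on
which delimiter rule the lexer applies first.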

Then one has other possible cases, some of which are legal, some not, but
it may not be immediately obvious:

loop_  _colour'red' 'green' 'blue'
loop_  _colour 'red' 'green' 'blue'_name Fred
loop_  _colour 'red''green''blue'_name Fred
loop_  _colour 'red''green''blue' _name Fred

I suspect that formal rules can be constructed that do prevent ambiguity,
but in this case it still seems that 'maximal disruption' is
seeping over into 'gratuitous disruption' - unless the pay-off is really
worthwhile.

So, on that basis, my vote is currently 2(b).
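
For comparison, a minimal Python sketch of what a lexer might do under
option 2(b). It assumes that a quoted string still ends at the first
closing quote, that whitespace (or end of line) must follow it, and
that a quote may not appear inside an unquoted value; the function name
is hypothetical.

```python
def tokenize_strict(line: str) -> list:
    """Tokenize one line, insisting on whitespace between tokens."""
    tokens = []
    i, n = 0, len(line)
    while i < n:
        if line[i].isspace():
            i += 1
        elif line[i] == "'":
            end = line.find("'", i + 1)
            if end < 0:
                raise ValueError(f"unterminated string starting at column {i}")
            tokens.append(line[i + 1:end])
            i = end + 1
            # The closing quote must be followed by whitespace or end of line.
            if i < n and not line[i].isspace():
                raise ValueError(f"whitespace required at column {i}")
        else:
            start = i
            while i < n and not line[i].isspace():
                if line[i] == "'":
                    raise ValueError(f"quote inside unquoted value at column {i}")
                i += 1
            tokens.append(line[start:i])
    return tokens


ok = tokenize_strict("loop_  _colour 'red' 'green' 'blue'")
```

Under these assumptions all four of the lines above are rejected with
an error, while the fully spaced form is accepted, so no one has to
reason about where one value ends and the next begins.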

Pardon the verbosity of all this, but I'm trying to cover (in my own mind)
many of the issues that will need to be treated explicitly in the next
edition of International Tables G :-)

Brian




On Tue, Oct 13, 2009 at 06:27:20PM +0300, James Hester wrote:
> Here are the results of the straw poll.  See the end of the email for
> detailed vote counts, and note the request for a further vote on
> certain issues.
> 
> CONCLUSIONS
> ===========
> 
> 1. UTF8 will be supported. Not clear on asciified version or
> binary. Therefore, please comment and vote on the following, given
> that UTF8 will be included in the new standard:
> 
> (a) UTF8 should be supported in standard form only (i.e. 'binary'
> characters with values above 127 will appear in CIF files)
> 
> (b) An asciified version only should be supported.  An example would
> be the syntax \uxxxx, where xxxx refers to the Unicode code point of
> the character in hexadecimal notation.  NB this is not strictly UTF8,
> but simply a Unicode representation.
> 
> My vote: 1.a
> 
> 2. Termination of quoted strings on first occurrence of quote delimiter
> and restriction of character set for non-delimited strings: Approved,
> but not clear whether to deprecate first or move immediately to
> requirement.  Upon long consideration of Brian's email and Herbert's
> reservations, and two cups of tea, and some chocolate, I am happy to
> change my votes to 1.2 and 2.3 (and perhaps call the new CIF syntax
> 2.0 rather than 1.2), therefore I declare these proposals approved as
> a requirement in the new standard.  I'll write a separate email on
> this.
> 
> However: Brian and James want to require whitespace between tokens
> outside compound expressions regardless of it now becoming strictly
> unnecessary in several cases.  Given that the above proposals have
> been passed, please vote again on the following options:
> 
> (a) Whitespace is not required between tokens unless tokens could not
> otherwise be separated; writers are encouraged to pad between tokens
> (b) Whitespace must always appear between tokens outside compound expressions
> (c) Whitespace must always appear between tokens both in and outside
> compound expressions
> 
> My vote: 2.b
> 
> Detailed vote summary
> =====================
> 
> Issue1:  Removing the requirement for a trailing whitespace after
> quoted strings outside of bracketed constructs.
>   Options:  1.1. Preserve the current convention as is
>             1.2. Terminate all quoted strings on the occurrence of the
> trailing quoted delimiter without consideration of the next character
>             1.3  Deprecate rather than require 1.2
> 
> 1.1: Nobody (Herbert prefers if 1.3 not an option)
> 1.2: Brian (but whitespace required between tokens), Nick, Simon
> 1.3: Herbert, James (but whitespace required between tokens)
> 
> Difficult to determine any clear preference from John W., but he seems
> happy to go along with the changes we are discussing so long as there
> is a clear fallback position.
> 
>   Issue2:  Restriction of the character set for non-delimited strings
> outside of bracketed constructs
>   Options  2.1.  Preserve the current convention as is
>            2.2.  Modify the current convention to deprecate use of
>                  any characters other than a strictly limited set
>                  of characters, adding a warning on reads and
>                  defaulting to add quote marks on write
>            2.3.  Modify the current convention to forbid the use of
>                  any characters other than a strictly limited set
>                  of characters, making it an error to read a non-delimited
>                  string that does not comply even if the intention
>                  can be inferred from context
> 
> 2.1: Nobody
> 2.2: Herbert, James
> 2.3: Nick, Simon, (John)
> 
> UTF8:
> 
> Do not use: Nobody
> Use: Simon, Brian, John
> Use, binary: Herbert, James
> Use, asciified: Nick
> 
> A clear preference for binary or ascii can't be gleaned from Brian and
> Simon's and John's emails, so I've left them as simply 'Use'.
> 
> 
> -- 
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
