
Re: [ddlm-group] Vote on BOM

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] Vote on BOM
From: James Hester <[email protected]>
Date: Fri, 18 Jun 2010 14:58:42 +1000
In-Reply-To: <[email protected]>
References: <[email protected]><[email protected]><[email protected]>

I see that by calling this a UTF8 BOM vote (in line with the
discussion) some have misinterpreted it as a vote on single or
multiple encodings.  It is not.  Parts (1) and (2) are a vote on what
to do when Unicode code point 0xFEFF arises in the decoded data
stream.  Part (3) was included in deference to Herbert's wish to switch
encodings midstream, and if he no longer wishes to do that inside the
CIF2 standard, we can call 0xFE 0xFF a syntax error as it would
normally be.
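
To make the distinction between the byte sequences and the decoded code
point concrete, here is a minimal Python sketch.  The function names are
illustrative assumptions only; they are not part of any CIF2 draft or
existing library.

```python
import codecs
from typing import List

# Byte sequences as defined in the original vote message.
UTF8_BOM = codecs.BOM_UTF8            # b'\xef\xbb\xbf'
UCS2_BOMS = (codecs.BOM_UTF16_BE,     # b'\xfe\xff'
             codecs.BOM_UTF16_LE)     # b'\xff\xfe' ("or the reverse")


def leading_bom(data: bytes) -> str:
    """Classify the BOM, if any, at the very start of a byte stream."""
    if data.startswith(UTF8_BOM):
        return "utf8"
    if data.startswith(UCS2_BOMS):
        return "ucs2"
    return "none"


def feff_offsets(text: str) -> List[int]:
    """Return every position of code point 0xFEFF in already-decoded text.

    Parts (1) and (2) of the vote concern what a parser does at these
    positions; this helper only locates them.
    """
    return [i for i, ch in enumerate(text) if ch == "\ufeff"]
```

Decoding the three-byte UTF8 BOM with a plain UTF-8 decoder yields the
single code point 0xFEFF, which is why parts (1) and (2) can be phrased
purely in terms of the decoded stream.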

Herbert (and possibly Simon), if you wish to vote such that the quest
for multiple possible CIF2 encodings is not compromised, I think 1(a),
2(anything) and 3(a) would be a consistent position for you to take
and would not in any way compromise the possibility of multiple
encodings.  Not 3(b), as that implies that decoding is the business of
the CIF2 standard itself, which I gather you are opposed to.

I would strongly urge others to vote ASAP, so that we can resolve the
issue of 0xFEFF.  0xFEFF is completely orthogonal to the question of
multiple encodings, which will *not* be resolved by this vote.

Note that I would be happy to switch my vote to 1(a) instead of 1(b),
so if anybody prefers 1(b), they should say so, as 1(a) is the leader
at the moment.
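
For anyone weighing 1(a) against 1(b), the practical difference in a
magic-number check is small.  The following is only a sketch under my own
assumptions: the function name and the allow_bom flag are hypothetical, and
the magic string is taken from option 1(b) as quoted below.

```python
UTF8_BOM = b"\xef\xbb\xbf"
MAGIC = b"#\\#CIF2.0"   # the characters  #\#CIF2.0


def has_cif2_magic(data: bytes, allow_bom: bool) -> bool:
    """Check the start of a file for the CIF2 magic number.

    allow_bom=False models option 1(a): a leading UTF8 BOM means the
    file is not CIF2.  allow_bom=True models option 1(b): a UTF8 BOM
    followed by the magic string is accepted.
    """
    if data.startswith(UTF8_BOM):
        if not allow_bom:
            return False
        data = data[len(UTF8_BOM):]
    return data.startswith(MAGIC)
```

Under either option a file that starts directly with the magic string is
accepted; only files with a leading UTF8 BOM are treated differently.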

On Wed, Jun 16, 2010 at 9:19 PM, Herbert J. Bernstein
<[email protected]> wrote:
> Dear Colleagues,
>
>  I vote for none of the false trichotomy presented.  I vote for
> a CIF2 to be a text file containing its information as a sequence of
> valid printable unicode code points, however encoded, and that a BOM be
> treated as part of the encoding/decoding process, not as part of the
> information that has been encoded.
>
>  This is similar to the original handling of nulls before C and the
> stdio got us all to become unclear about the distinction between
> text and binary, but even in the world of utf-8 streams, a null cannot
> be part of the text of a text file because it is the C-string terminator.
> I propose to treat the BOM with the same sort of caution.
>
>  Regards,
>    Herbert
>
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>   Dowling College, Kramer Science Center, KSC 121
>        Idle Hour Blvd, Oakdale, NY, 11769
>
>                 +1-631-244-3035
>                 [email protected]
> =====================================================
>
> On Wed, 16 Jun 2010, Brian McMahon wrote:
>
>> My vote, in line with my "keep it simple/blunt" approach:
>>
>>  1(a)
>>  2(a)
>>  3(a)
>>
>> I understand many of the counter-arguments, and think that most other
>> outcomes are also acceptable if properly documented. 2(c)(ii) and perhaps
>> 2(d) might give rise in many naive rendering programs (e.g. older versions
>> of "vi") to the appearance of whitespace in datanames, which would confuse
>> many users, so I would be least happy with these outcomes.
>>
>> One can see from examples such as the W3C Working Group Note of
>> Unicode in XML and other Markup Languages (section 3.5 of
>> http://www.w3.org/TR/unicode-xml/ ) that we are not the only group
>> struggling to express a clean formulation of this topic. The solution
>> in that document is suggestive, but not necessarily applicable to CIF,
>> which is not exactly a "markup" language.
>>
>> Regards
>> brian
>>
>> On Wed, Jun 16, 2010 at 11:31:59AM +1000, James Hester wrote:
>>>
>>> For clarity, by 'UTF8 BOM' I mean the byte sequence 0xEF,0xBB,0xBF,
>>> which corresponds to Unicode code point 0xFEFF.  A UCS2 BOM is the
>>> byte sequence 0xFE, 0xFF or the reverse.
>>>
>>> Please indicate your preferred behaviour below.  I have inserted mine
>>> already:
>>>
>>> 1. Treatment of UTF8 BOM as first three bytes of a CIF2 file
>>>    (a) Syntax error/Non CIF2 file
>>>    (b) UTF8-BOM followed by #\#CIF2.0 is a valid CIF2 magic number
>>>                 James
>>> 2. Treatment of UTF8 BOM in a CIF file, other than as the first three
>>> bytes:
>>>    (a) Always a syntax error
>>>    (b) Syntactic whitespace
>>>    (c) An ordinary character:
>>>          (i) May appear only in delimited data values and comments
>>>                      James
>>>          (ii) May appear anywhere other ordinary characters can
>>> appear (i.e. including datanames, datablock names etc.)
>>>    (d) Silently ignored
>>>
>>> 3. Treatment of UCS BOM in a CIF file
>>>   (a) Syntax error                                    James
>>>   (b) Encoding switch
>>
>> _______________________________________________
>> ddlm-group mailing list
>> [email protected]
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>



-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
