Re: [ddlm-group] Vote on BOM

Dear Colleagues,

   I vote for none of the options in the false trichotomy presented.  I vote
for a CIF2 file to be a text file containing its information as a sequence of
valid printable Unicode code points, however encoded, and for a BOM to be
treated as part of the encoding/decoding process, not as part of the
information that has been encoded.

   This is similar to the original handling of nulls, before C and
stdio left us all unclear about the distinction between
text and binary.  Even in the world of UTF-8 streams, a null cannot
be part of the text of a text file because it is the C-string terminator.
I propose to treat the BOM with the same sort of caution.
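
   To illustrate the idea, here is a minimal Python sketch in which the BOM
is consumed by the decoding step and never reaches the text handed to a
parser.  The function name decode_cif_text is hypothetical, and the choice to
recognise only UTF-8 and UTF-16 is an assumption made for brevity; none of
this is taken from a CIF2 draft.

```python
import codecs


def decode_cif_text(raw: bytes) -> str:
    """Decode raw bytes to code points, treating a leading BOM as encoding
    metadata rather than as content. Hypothetical helper for illustration."""
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and drops it.
        # Assumption: UTF-32 input (whose LE BOM shares this prefix) is not expected.
        return raw.decode("utf-16")
    # "utf-8-sig" drops a leading UTF-8 BOM if present and is plain UTF-8 otherwise.
    return raw.decode("utf-8-sig")
```

   With this approach the same sequence of code points results whether or
not the file was written with a BOM, so the question of BOM handling stays
in the decoder and out of the grammar.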

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Wed, 16 Jun 2010, Brian McMahon wrote:

> My vote, in line with my "keep it simple/blunt" approach:
>
>  1(a)
>  2(a)
>  3(a)
>
> I understand many of the counter-arguments, and think that most other
> outcomes are also acceptable if properly documented. 2(c)(ii) and perhaps
> 2(d) might give rise in many naive rendering programs (e.g. older versions
> of "vi") to the appearance of whitespace in datanames, which would confuse
> many users, so I would be least happy with these outcomes.
>
> One can see from examples such as the W3C Working Group Note of
> Unicode in XML and other Markup Languages (section 3.5 of
> http://www.w3.org/TR/unicode-xml/ ) that we are not the only group
> struggling to express a clean formulation of this topic. The solution
> in that document is suggestive, but not necessarily applicable to CIF,
> which is not exactly a "markup" language.
>
> Regards
> brian
>
> On Wed, Jun 16, 2010 at 11:31:59AM +1000, James Hester wrote:
>> For clarity, by 'UTF8 BOM' I mean the byte sequence 0xEF,0xBB,0xBF,
>> which corresponds to Unicode code point 0xFEFF.  A UCS2 BOM is the
>> byte sequence 0xFE, 0xFF or the reverse.
>>
>> Please indicate your preferred behaviour below.  I have inserted mine already:
>>
>> 1. Treatment of UTF8 BOM as first three bytes of a CIF2 file
>>     (a) Syntax error/Non CIF2 file
>>     (b) UTF8-BOM followed by #\#CIF2.0 is a valid CIF2 magic number
>>                 James
>> 2. Treatment of UTF8 BOM in a CIF file, other than as the first three bytes:
>>     (a) Always a syntax error
>>     (b) Syntactic whitespace
>>     (c) An ordinary character:
>>           (i) May appear only in delimited data values and comments
>>                       James
>>           (ii) May appear anywhere other ordinary characters can
>>                appear (i.e. including datanames, datablock names etc.)
>>     (d) Silently ignored
>>
>> 3. Treatment of UCS BOM in a CIF file
>>    (a) Syntax error                                    James
>>    (b) Encoding switch
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
