Discussion List Archives


Re: [ddlm-group] Vote on BOM

Dear Colleagues,

   I vote for none of the false trichotomy presented.  I vote for
a CIF2 to be a text file containing its information as a sequence of
valid printable Unicode code points, however encoded, and that a BOM be
treated as part of the encoding/decoding process, not as part of the
information that has been encoded.

   This is similar to the original handling of nulls, before C and
stdio blurred the distinction between text and binary for all of us.
Even in the world of UTF-8 streams, a null cannot be part of the text
of a text file, because it is the C-string terminator.  I propose to
treat the BOM with the same sort of caution.
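
   A minimal Python sketch of this position, under stated assumptions:
the helper name decode_cif_text is hypothetical, UTF-8 is assumed when
no BOM is present, and only the UTF-8 and UTF-16 BOMs are recognised.
It is an illustration of "BOM as part of decoding", not a CIF2
specification or anyone's actual implementation.

```python
import codecs

# Hypothetical table for illustration: BOM byte sequences and the
# encoding each one signals. UTF-32 is deliberately left out.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
)


def decode_cif_text(raw: bytes) -> str:
    """Decode raw bytes to text, consuming a leading BOM in the decoder.

    The BOM selects the encoding and is then discarded, so it never
    reaches the layer that parses the text itself.
    """
    encoding = "utf-8"  # assumption: default when no BOM is present
    for bom, name in _BOMS:
        if raw.startswith(bom):
            raw = raw[len(bom):]  # BOM belongs to the encoding layer
            encoding = name
            break
    text = raw.decode(encoding)
    # Same caution as for nulls: reject them rather than pass them on.
    if "\x00" in text:
        raise ValueError("NUL is not permitted in the text")
    return text


with_bom = codecs.BOM_UTF8 + b"#\\#CIF2.0\n"
without_bom = b"#\\#CIF2.0\n"
# Both inputs yield the same text; the BOM carried no information.
print(decode_cif_text(with_bom) == decode_cif_text(without_bom))
```

   With this separation, the question of what the parser should do
with a leading BOM does not arise, because the parser never sees one.
The sketch leaves open how a U+FEFF appearing later in the file should
be handled.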


  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769


On Wed, 16 Jun 2010, Brian McMahon wrote:

> My vote, in line with my "keep it simple/blunt" approach:
>  1(a)
>  2(a)
>  3(a)
> I understand many of the counter-arguments, and think that most other
> outcomes are also acceptable if properly documented. 2(c)(ii) and perhaps
> 2(d) might give rise in many naive rendering programs (e.g. older versions
> of "vi") to the appearance of whitespace in datanames, which would confuse
> many users, so I would be least happy with these outcomes.
> One can see from examples such as the W3C Working Group Note of
> Unicode in XML and other Markup Languages (section 3.5 of
> http://www.w3.org/TR/unicode-xml/ ) that we are not the only group
> struggling to express a clean formulation of this topic. The solution
> in that document is suggestive, but not necessarily applicable to CIF,
> which is not exactly a "markup" language.
> Regards
> brian
> On Wed, Jun 16, 2010 at 11:31:59AM +1000, James Hester wrote:
>> For clarity, by 'UTF8 BOM' I mean the byte sequence 0xEF,0xBB,0xBF,
>> which corresponds to Unicode code point 0xFEFF.  A UCS2 BOM is the
>> byte sequence 0xFE, 0xFF or the reverse.
>> Please indicate your preferred behaviour below.  I have inserted mine already:
>> 1. Treatment of UTF8 BOM as first three bytes of a CIF2 file
>>     (a) Syntax error/Non CIF2 file
>>     (b) UTF8-BOM followed by #\#CIF2.0 is a valid CIF2 magic number
>>                 James
>> 2. Treatment of UTF8 BOM in a CIF file, other than as the first three bytes:
>>     (a) Always a syntax error
>>     (b) Syntactic whitespace
>>     (c) An ordinary character:
>>           (i) May appear only in delimited data values and comments
>>                       James
>>           (ii) May appear anywhere other ordinary characters can
>> appear (i.e. including datanames, datablock names etc.)
>>     (d) Silently ignored
>> 3. Treatment of UCS BOM in a CIF file
>>    (a) Syntax error                                    James
>>    (b) Encoding switch
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group

International Union of Crystallography