
Re: [ddlm-group] [THREAD 4] UTF8

Dear Colleagues,

   How about the following:

   Any CIF dataset with no information about encoding is presumed to be 
in UTF-8 encoding, but CIF writers using UTF-8 should include 
one of the recommended UTF-8 identifiers.

   All CIF-2 parsers are required to handle UTF-8, and may reject other
encodings.

   Systems that are handling CIFs in encodings other than UTF-8 are 
required to include one of the recommended encoding identifiers to clearly 
identify the encoding they are using and must _not_ use a UTF-8 identifier 
if the encoding is something other than UTF-8.
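
   As a rough illustration only, the three rules above might look like
this on the reading side. The identifier strings, function names and the
utf8_only switch are placeholders invented for this sketch; the proposal
does not say what the recommended identifiers are or how they would be
carried in the file.

```python
from typing import Optional

# Hypothetical identifier strings; the proposal does not name the
# recommended identifiers, so these are placeholders only.
UTF8_IDENTIFIERS = {"utf-8", "utf8"}


def presumed_encoding(declared: Optional[str]) -> str:
    """Return the encoding a reader should assume for a CIF dataset."""
    if declared is None:
        return "utf-8"  # no encoding information: presume UTF-8
    name = declared.strip().lower()
    return "utf-8" if name in UTF8_IDENTIFIERS else name


def decode_cif(data: bytes, declared: Optional[str] = None,
               utf8_only: bool = True) -> str:
    """Decode CIF bytes, optionally rejecting anything but UTF-8."""
    encoding = presumed_encoding(declared)
    if encoding != "utf-8" and utf8_only:
        # A CIF-2 parser must handle UTF-8 and may reject the rest.
        raise ValueError(f"unsupported encoding: {encoding}")
    # Strict decoding: a UTF-8 label on non-UTF-8 bytes raises an error.
    return data.decode(encoding, errors="strict")
```

   With strict decoding, a file that carries a UTF-8 identifier but is
really in some other encoding fails as soon as it contains a byte
sequence that is not valid UTF-8, which is one way a reader could notice
a violation of the third rule.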


  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769


On Fri, 23 Oct 2009, James Hester wrote:

> To continue...
> In what follows, by UTF8 I mean the UTF8 standard as such, and not
> some asciified unicode like \u1212.
> First, to answer Nick's points:
> I wrote:
>> 1. Encoding can be automatically determined: If a given CIF1.2 file
>> contains any bytes with values >127 then it can/should only be UTF8.
> Nick commented:
>     Is it? Doesn't CBF/imgCIF or whatever have binary that is the "Hammersley"
>     coding algorithm?
> The images in imgCIF are encoded in pure ASCII.  CBF encodes the image
> only in binary, the other parts conform to CIF standards (Herbert may
> want to correct me on that), so a criterion similar to the one I
> propose above would apply to CBF, except that the detection of
> non-ASCII characters would need to occur during parsing, rather than
> possibly before.
> I wrote:
>> Would such dictionary-based regulation give the PDB and IUCr
>> sufficient control over UTF8 introduction (John/Brian/Simon?).
> Which prompted the following witticism:
>     OK Only one-byte UTF-8 is allowed. Voila. Problem solved.
> At the risk of being terribly earnest, well yes, that was my point.
> If the end consumer of a dataname can only handle pure ASCII, that can
> be stated in the dictionary and we need not worry ourselves about
> taking this problem into account when setting the syntax standard.  I
> would be interested in hearing from John, Brian or Simon regarding how
> satisfied they would be with a dictionary-based approach to
> restricting UTF8 use.
> Finally, I wrote:
>> 3. An additional UTF8 encoding magic number could complicate the
>> simple magic number scheme we currently have in place.
> Nick:
>  I don't think it does. It would simplify the case for those parsers not
>  supporting yet UTF-8. It would tell them to terminate the process.
> Firstly, note that for a parser to support UTF8 it simply needs to
> treat the UTF8 bytes in the same way as the non-syntax-special ASCII
> bytes, with no other effort required, so compared to the effort
> involved in coding bracket construct parsing, there's very little for
> the programmer to do in order to support UTF8.  In any case, if UTF8
> is not supported, the parser throws a syntax error (or applies some
> coercion rules) upon reaching the offending byte, which is arguably
> more productive than giving up without even starting - especially if
> there turns out to be only one-byte UTF8 (ASCII) in the file.
> Secondly, I envision UTF8 support as part of the CIF2.0 "package".  If
> you support CIF2.0, you support UTF8 and bracket constructs etc.  Why
> make a special case for the UTF8 part of CIF2.0?  I'm still not sure I
> understand the nature of the encoding problems that John alludes to if
> the only allowed encodings are ASCII and UTF8 - especially as a
> UTF8-compliant parser is also an ASCII-compliant parser.
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
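
The byte-level criterion in the quoted message (any byte above 127 means
the file can only be UTF8, and a parser without UTF8 support fails at the
offending byte rather than before starting) can be sketched roughly as
follows. The function names are made up for this illustration, and the
"invalid" result is an addition of the sketch for byte streams that are
neither ASCII nor well-formed UTF8.

```python
def classify_bytes(data: bytes) -> str:
    """Classify a byte stream as 'ascii', 'utf-8' or 'invalid'."""
    if all(b < 128 for b in data):
        return "ascii"  # one-byte UTF-8 only, i.e. plain ASCII
    try:
        data.decode("utf-8")  # strict by default
    except UnicodeDecodeError:
        return "invalid"  # high bytes present but not well-formed UTF-8
    return "utf-8"


def first_non_ascii(data: bytes) -> int:
    """Return the offset of the first byte above 127, or -1 if none."""
    for offset, value in enumerate(data):
        if value > 127:
            return offset
    return -1
```

An ASCII-only parser could use the second function to report where it
had to stop, instead of refusing the whole file up front.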