Re: [ddlm-group] [THREAD 4] UTF8

Dear Colleagues,

   How about the following:

   Any CIF dataset with no information about encoding is presumed to be 
in UTF-8 encoding, but CIF writers using UTF-8 should include 
one of the recommended UTF-8 identifiers.

   All CIF-2 parsers are required to handle UTF-8, and may reject other
encodings.

   Systems that are handling CIFs in encodings other than UTF-8 are 
required to include one of the recommended encoding identifiers to clearly 
identify the encoding they are using and must _not_ use a UTF-8 identifier 
if the encoding is something other than UTF-8.
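
   To make the three rules concrete, here is a rough Python sketch. It
is only an illustration under stated assumptions: the recommended
identifiers have not been fixed, so the function simply takes the
declared encoding as an optional string, and the names resolve_encoding,
declared and accept_other are hypothetical, not part of any CIF
specification or library.

```python
import codecs
from typing import Optional


def resolve_encoding(data: bytes, declared: Optional[str] = None,
                     accept_other: bool = False) -> str:
    """Return the codec name to use for `data`, or raise ValueError.

    `declared` stands in for whatever encoding identifier the file
    carries (hypothetical; None means no identifier was present).
    """
    if declared is None:
        # Rule 1: no encoding information means UTF-8 is presumed.
        name = "utf-8"
    else:
        # Normalise spellings such as "UTF8" or "utf_8" to "utf-8".
        name = codecs.lookup(declared).name

    if name != "utf-8" and not accept_other:
        # Rule 2: a parser must handle UTF-8 and may reject the rest.
        raise ValueError(f"encoding {name!r} is not supported")

    try:
        # Rule 3: the identifier must match the actual content, so a
        # UTF-8 label on non-UTF-8 bytes is reported as an error.
        data.decode(name)
    except UnicodeDecodeError as exc:
        raise ValueError(f"content is not valid {name}") from exc
    return name
```

   A parser that wishes to accept other encodings would pass
accept_other=True; a minimal parser would leave it False and still
conform.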

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Fri, 23 Oct 2009, James Hester wrote:

> To continue...
>
> In what follows, by UTF8 I mean the UTF8 standard as such, and not
> some asciified unicode like \u1212.
>
> First, to answer Nick's points:
>
> I wrote:
>> 1. Encoding can be automatically determined: If a given CIF1.2 file
>> contains any bytes with values >127 then it can/should only be UTF8.
>
> Nick commented:
>     Is it? Doesn't CBF/imgCIF or whatever have binary that is the "Hammersley"
>     coding algorithm?
>
> The images in imgCIF are encoded in pure ASCII.  CBF encodes the image
> only in binary, the other parts conform to CIF standards (Herbert may
> want to correct me on that), so a criterion similar to the one I
> propose above would apply to CBF, except that the detection of
> non-ASCII characters would need to occur during parsing, rather than
> possibly before.
>
> I wrote:
>> Would such dictionary-based regulation give the PDB and IUCr
>> sufficient control over UTF8 introduction (John/Brian/Simon?).
>
> Which prompted the following witticism:
>     OK Only one-byte UTF-8 is allowed. Voila. Problem solved.
>
> At the risk of being terribly earnest, well yes, that was my point.
> If the end consumer of a dataname can only handle pure ASCII, that can
> be stated in the dictionary and we need not worry ourselves about
> taking this problem into account when setting the syntax standard.  I
> would be interested in hearing from John, Brian or Simon regarding how
> satisfied they would be with a dictionary-based approach to
> restricting UTF8 use.
>
> Finally, I wrote:
>> 3. An additional UTF8 encoding magic number could complicate the
>> simple magic number scheme we currently have in place.
>
> Nick:
>  I don't think it does. It would simplify the case for those parsers not
>  supporting yet UTF-8. It would tell them to terminate the process.
>
> Firstly, note that for a parser to support UTF8 it simply needs to
> treat the UTF8 bytes in the same way as the non-syntax-special ASCII
> bytes, with no other effort required, so compared to the effort
> involved in coding bracket construct parsing, there's very little for
> the programmer to do in order to support UTF8.  In any case, if UTF8
> is not supported, the parser throws a syntax error (or applies some
> coercion rules) upon reaching the offending byte, which is arguably
> more productive than giving up without even starting - especially if
> there turns out to be only one-byte UTF8 (ASCII) in the file.
>
> Secondly, I envision UTF8 support as part of the CIF2.0 "package".  If
> you support CIF2.0, you support UTF8 and bracket constructs etc.  Why
> make a special case for the UTF8 part of CIF2.0?  I'm still not sure I
> understand the nature of the encoding problems that John alludes to if
> the only allowed encodings are ASCII and UTF8 - especially as a
> UTF8-compliant parser is also an ASCII-compliant parser.
>
>
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
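
   James's detection criterion (any byte above 127 means the file can
only be UTF8) and his remark that a UTF8-compliant parser is also an
ASCII-compliant parser can be sketched in Python as follows. This is an
illustration only; the function name classify_cif_bytes and its three
return labels are made up for the example.

```python
def classify_cif_bytes(data: bytes) -> str:
    """Classify raw file content as 'ascii', 'utf-8' or 'invalid'."""
    if all(b < 128 for b in data):
        # Only one-byte sequences: pure ASCII, which is also valid UTF-8.
        return "ascii"
    try:
        # At least one byte is above 127, so under the proposed rule
        # the content can only be UTF-8; a strict decode verifies it.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid"
    return "utf-8"


# ASCII text decodes identically under both codecs, which is why a
# UTF-8 capable reader needs no special case for ASCII-only files.
sample = b"_cell_length_a 5.43"
same_text = sample.decode("ascii") == sample.decode("utf-8")
```

   For CBF, where binary sections are embedded, a check like this would
have to run on the text portions during parsing rather than on the
whole file beforehand.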
