Re: [ddlm-group] [THREAD 4] UTF8

Dear Colleagues,

   It really is not possible to determine the encoding of a file with a few 
accented characters from context.  You lose nothing by including a place 
to flag the encoding used, and can always then say that a conformant 
parser need only accept the declared UTF-8 encoding and is free to declare
all other encodings to be an error, but you gain a lot (such as clean
access to the entire Java/browser world and many older Unicode operating
systems) by handling at least UCS-2 encoding.


  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769


On Fri, 23 Oct 2009, James Hester wrote:

> Dear All: Herbert's argument for at least UCS2 inclusion, if I have
> understood correctly, is that CIF2 users may inadvertently or
> deliberately save a CIF file containing multilingual characters in a
> non-UTF8 encoding.  This file may look fine until it escapes from the
> confines of that person's computer or lab, when suddenly other users
> find it difficult or impossible to decode. I agree that inadvertent
> saving in another encoding is entirely likely, given that virtually
> all internationalised applications support multiple encodings out of
> the box.  I would expect that this is more of a problem when a CIF
> file is directly edited, whereas a program is likely to do it properly
> - or at least get it wrong only once...
>
> I believe that John's concerns about encodings may also arise out of
> the multiple possible ways of encoding any given piece of non-ASCII
> text and having to deal with that incoming variety in a reliable way.
> This is a real problem e.g. a file containing Cyrillic characters could
> be encoded in any of at least 5 ways, and these encodings differ only
> in which bytes correspond to which letters - if you don't read
> Russian, you may not even know that there is an encoding error.  If
> the letters are in isolation, even a Russian speaker may have trouble.
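The Cyrillic ambiguity described above can be sketched in Python. This is an illustration only: the sample word and the function name `decode_all` are assumptions made for the example, and the five codecs are simply common single-byte Cyrillic encodings shipped with Python, not a list taken from this thread.

```python
# Sketch: one byte string, five single-byte Cyrillic codecs, no decode errors.
# The sample word and the helper name are hypothetical.
CODECS = ("koi8_r", "cp1251", "iso8859_5", "cp866", "mac_cyrillic")


def decode_all(raw: bytes) -> dict:
    """Decode the same bytes under each codec; none of them raises here."""
    return {codec: raw.decode(codec) for codec in CODECS}


text = "Привет"
raw = text.encode("koi8_r")          # what the writer actually saved
decoded = decode_all(raw)

for codec, result in decoded.items():
    # Only the koi8_r line shows the original word; the rest are
    # plausible-looking but wrong characters, with no error raised.
    print(f"{codec:13s} {result!r}")
```

Every codec accepts the bytes, so a reader has no mechanical way to tell which interpretation the writer intended.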
> But: despite agreeing with Herbert on the likelihood of CIF files
> being saved in the wrong encoding, I repeat the point that we are
> defining a standard for successfully communicating information between
> computers across time and space.  In that context 'optional' does not
> make sense.  If any conformant CIF reader cannot read any conformant
> CIF file, the standard has failed. It would therefore be better to
> completely ignore UCS2, so that when the CIF reader fails, it is
> because of non-conformance to the standard either at the writer's end
> or the reader's end, not because of lack of some option.  Those who
> have inadvertently or otherwise miscoded their multilingual characters
> will simply cause an error (I say 'will' and not 'may', see paragraph below
> about distinctiveness of UTF8 encoding), and get feedback to correct
> the problem, just as today when they mess up their hand-edited CIFs.
>
> One of the nice aspects of UTF8 is that the European non-ASCII
> characters require at least two bytes where the likely alternative
> non-Unicode encodings would have only one.  These two-byte pairs have
> a distinctive bit pattern which makes it very likely that a non-UTF8
> encoding would be detected almost immediately.  So in that sense I
> think that allowing only UTF8 encoding is a robust solution to the
> multiple encoding problem as a non-UTF8 encoding can be automatically
> detected.
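A minimal sketch of the detection argument above, assuming a reader that simply attempts a strict UTF-8 decode. The function name `is_valid_utf8` and the sample word are assumptions for illustration, not part of any CIF tool.

```python
# Sketch: strict UTF-8 decoding rejects typical single-byte-encoded text.
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


sample = "café"
utf8_bytes = sample.encode("utf-8")      # é becomes the pair 0xC3 0xA9
latin1_bytes = sample.encode("latin-1")  # é becomes the single byte 0xE9

# The UTF-8 pair has the bit patterns 110xxxxx 10xxxxxx.
print([format(b, "08b") for b in utf8_bytes[-2:]])

print(is_valid_utf8(utf8_bytes))    # True
print(is_valid_utf8(latin1_bytes))  # False: 0xE9 is a lead byte with no continuation bytes
```

Detection of this kind is very likely rather than guaranteed: a single-byte file whose high bytes happen to form valid lead and continuation patterns would still pass the check.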
> Finally, Herbert says:
> 	Even if we mandate UTF-8 as the archiving and file
> 	transmission standard, we really do need to deal with other
> 	encodings in a proper, self-identifying manner, just as
> 	emacs and vim do.
> I would suggest that specifying a standards violation whenever a
> different encoding is detected is sufficiently proper.  I don't think
> that the comparison with vim and emacs is valid: these applications
> aim to actually deal with text in different encodings, whereas we do
> not.
>
> On Sat, Oct 17, 2009 at 12:36 AM, Herbert J. Bernstein
> <yaya@bernstein-plus-sons.com> wrote:
>> Dear Colleagues,
>>   I think as a practical matter there are two encodings for which we need
>> to consider providing support:
>>   1.  UTF-8 -- I think we now all agree that this is the sensible default
>> encoding for CIF-2
>>   2.  UCS-2/UTF-16.  This is the encoding used in Java and in web
>> browsers.  It is also the encoding used in imgCIF base-32K binary
>> encoding.  This is where the BOM flag becomes important -- it tells you
>> when a switch to UCS-2/UTF-16 has occurred and whether what follows is
>> big-endian or little-endian.  It also gives you the capability of
>> switching back to UTF-8.  However, the major use is simply as a flag at
>> the start of a file, all of which is in one encoding.
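The start-of-file use of the BOM can be sketched as follows. This is a hedged illustration covering only the major use mentioned above (a flag at the start of a file); the function name `sniff_bom` and its return convention are assumptions, and UTF-32 BOMs are deliberately not handled.

```python
import codecs

# Sketch: identify the encoding from a leading byte-order mark, if present.
# Order matters only if UTF-32 were added, since its LE BOM begins with the
# UTF-16 LE BOM; UTF-32 is out of scope here.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
)


def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) when no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


big_endian = codecs.BOM_UTF16_BE + "data_x".encode("utf-16-be")
encoding, skip = sniff_bom(big_endian)
print(encoding, big_endian[skip:].decode(encoding))

print(sniff_bom(b"data_x"))  # no BOM: the caller must fall back to a default
```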
>> Certainly there are other encodings that people may use -- in a system
>> dependent manner -- e.g. EBCDIC (yes it is still around) or 7-bit ASCII
>> (what we have used in the past).  I am not proposing that we try to get
>> into the business of asking every parser to support every coding on every
>> legacy system, and certainly for interchange, we should be telling people
>> to stick to unicode, preferably as UTF-8, but I am certain that people
>> will still want to use CIF in other environments with other "native" (i.e.
>> system-dependent) encodings, and everybody gains from having a formalism
>> for properly marking what should only be system-internal files with the
>> encoding they are using, to avoid the disasters that can occur when such
>> files escape from their system cage without proper marking as to what they
>> are.  Think of the mess we could have if people using Java accidentally
>> shipped a UCS-2/UTF-16 file without a BOM.  Most text editors will _not_
>> show you the alternating 0 bytes on the ordinary ASCII characters in that
>> encoding, but it can produce very strange errors even there, and when we
>> get to embedded accented characters, there is likely to simply be a wrong
>> character with no indication of an error.
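The failure mode described above can be reproduced in a few lines of Python. This is a sketch under stated assumptions: the sample strings are invented, and the three misreadings shown (as UTF-8, as Latin-1, and with the wrong byte order) are just plausible guesses a reader might make when no BOM is present.

```python
# Sketch: UTF-16 written without a BOM, then read with the wrong assumption.
ascii_text = "data_block"
accented_text = "data_é"

# The explicit "-le" codec writes no BOM.
ascii_raw = ascii_text.encode("utf-16-le")
accented_raw = accented_text.encode("utf-16-le")

# Every ASCII character is followed by a zero byte.
print(ascii_raw)

# Pure-ASCII content is still valid UTF-8, so no error is raised, but the
# decoded string is riddled with NUL characters.
as_utf8 = ascii_raw.decode("utf-8")
print(repr(as_utf8))

# Latin-1 accepts any byte sequence, so the accented file also decodes silently.
as_latin1 = accented_raw.decode("latin-1")
print(repr(as_latin1))

# Guessing the wrong byte order yields unrelated characters, again without error.
swapped = accented_raw.decode("utf-16-be")
print(repr(swapped))
```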
>>   Even if we mandate UTF-8 as the archiving and file transmission
>> standard, we really do need to deal with other encodings in a proper,
>> self-identifying manner, just as emacs and vim do.
>>   Regards,
>>      Herbert
>> =====================================================
>>  Herbert J. Bernstein, Professor of Computer Science
>>    Dowling College, Kramer Science Center, KSC 121
>>         Idle Hour Blvd, Oakdale, NY, 11769
>>                  +1-631-244-3035
>>                  yaya@dowling.edu
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group