Re: [ddlm-group] [THREAD 4] UTF8
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] [THREAD 4] UTF8
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Thu, 22 Oct 2009 19:13:39 -0400 (EDT)
- In-Reply-To: <279aad2a0910221606k11e28de4la7582d9a85cadcf3@mail.gmail.com>
- References: <C6F976F1.1206C%nick@csse.uwa.edu.au><504270.84370.qm@web87013.mail.ird.yahoo.com><20091013055314.F86319@epsilon.pair.com><279aad2a0910160435x3876c24ev797e022adbc05529@mail.gmail.com><20091016091713.I65032@epsilon.pair.com><279aad2a0910221606k11e28de4la7582d9a85cadcf3@mail.gmail.com>
Dear Colleagues,

It really is not possible to determine the encoding of a file with a few
accented characters from context. You lose nothing by including a place
to flag the encoding used, and can always then say that a conformant
parser need only accept the declared UTF-8 encoding and is free to
declare all other encodings to be an error, but you gain a lot (such as
clean access to the entire Java/browser world and many older Unicode
operating systems) by handling at least UCS-2 encoding.

Regards,
Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Fri, 23 Oct 2009, James Hester wrote:

> Dear All:
>
> Herbert's argument for at least UCS2 inclusion, if I have understood
> correctly, is that CIF2 users may inadvertently or deliberately save a
> CIF file containing multilingual characters in a non-UTF8 encoding.
> This file may look fine until it escapes from the confines of that
> person's computer or lab, when suddenly other users find it difficult
> or impossible to decode. I agree that inadvertent saving in another
> encoding is entirely likely, given that virtually all internationalised
> applications support multiple encodings out of the box. I would expect
> that this is more of a problem when a CIF file is directly edited,
> whereas a program is likely to do it properly - or at least get it
> wrong only once...
>
> I believe that John's concerns about encodings may also arise out of
> the multiple possible ways of encoding any given piece of non-ASCII
> text and having to deal with that incoming variety in a reliable way.
> This is a real problem e.g.
> a file containing Cyrillic characters could be encoded in any of at
> least 5 ways, and these encodings differ only in which bytes correspond
> to which letters - if you don't read Russian, you may not even know
> that there is an encoding error. If the letters are in isolation, even
> a Russian speaker may have trouble.
>
> But: despite agreeing with Herbert on the likelihood of CIF files being
> saved in the wrong encoding, I repeat the point that we are defining a
> standard for successfully communicating information between computers
> across time and space. In that context 'optional' does not make sense.
> If any conformant CIF reader cannot read any conformant CIF file, the
> standard has failed. It would therefore be better to completely ignore
> UCS2, so that when the CIF reader fails, it is because of
> non-conformance to the standard either at the writer's end or the
> reader's end, not because of lack of some option. Those who have
> inadvertently or otherwise miscoded their multilingual characters will
> simply cause an error (I say 'will' and not 'may', see paragraph below
> about distinctiveness of UTF8 encoding), and get feedback to correct
> the problem, just as today when they mess up their hand-edited CIFs.
>
> One of the nice aspects of UTF8 is that the European non-ASCII
> characters require at least two bytes where the likely alternative
> non-Unicode encodings would have only one. These two-byte pairs have a
> distinctive bit pattern which makes it very likely that a non-UTF8
> encoding would be detected almost immediately. So in that sense I think
> that allowing only UTF8 encoding is a robust solution to the multiple
> encoding problem as a non-UTF8 encoding can be automatically detected.
>
> Finally, Herbert says:
>
>    Even if we mandate UTF-8 as the archiving and file transmission
>    standard, we really do need to deal with other encodings in a
>    proper, self-identifying manner, just as emacs and vim do.
>
> I would suggest that specifying a standards violation whenever a
> different encoding is detected is sufficiently proper. I don't think
> that the comparison with vim and emacs is valid: these applications aim
> to actually deal with text in different encodings, whereas we do not.
>
>
> On Sat, Oct 17, 2009 at 12:36 AM, Herbert J. Bernstein
> <yaya@bernstein-plus-sons.com> wrote:
>> Dear Colleagues,
>>
>> I think as a practical matter there are two encodings for which we
>> need to consider providing support:
>>
>> 1. UTF-8 -- I think we now all agree that this is the sensible
>> default encoding for CIF-2
>>
>> 2. UCS-2/UTF-16. This is the encoding used in Java and in web
>> browsers. It is also the encoding used in imgCIF base-32K binary
>> encoding. This is where the BOM flag becomes important -- it tells
>> you when a switch to UCS-2/UTF-16 has occurred and whether what
>> follows is big-endian or little-endian. It also gives you the
>> capability of switching back to UTF-8. However, the major use is
>> simply as a flag at the start of a file, all of which is in one
>> encoding.
>>
>> Certainly there are other encodings that people may use -- in a
>> system-dependent manner -- e.g. EBCDIC (yes it is still around) or
>> 7-bit ASCII (what we have used in the past). I am not proposing that
>> we try to get into the business of asking every parser to support
>> every coding on every legacy system, and certainly for interchange, we
>> should be telling people to stick to Unicode, preferably as UTF-8, but
>> I am certain that people will still want to use CIF in other
>> environments with other "native" (i.e. system-dependent) encodings,
>> and everybody gains from having a formalism for what should only be
>> system-internal files properly marked with the encoding they are using
>> to avoid the disasters that can occur when such files escape from
>> their system cage without proper marking as to what they are.
>> Think of the mess we could have if people using Java accidentally
>> shipped a UCS-2/UTF-16 file without a BOM. Most text editors will
>> _not_ show you the alternating 0 bytes on the ordinary ASCII
>> characters in that encoding, but it can produce very strange errors
>> even there, and when we get to embedded accented characters, there is
>> likely to simply be a wrong character with no indication of an error.
>>
>> Even if we mandate UTF-8 as the archiving and file transmission
>> standard, we really do need to deal with other encodings in a proper,
>> self-identifying manner, just as emacs and vim do.
>>
>> Regards,
>> Herbert
>>
>> =====================================================
>> Herbert J. Bernstein, Professor of Computer Science
>> Dowling College, Kramer Science Center, KSC 121
>> Idle Hour Blvd, Oakdale, NY, 11769
>>
>> +1-631-244-3035
>> yaya@dowling.edu
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
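
For readers who want to try the two positions in this thread against real bytes, here is a minimal Python sketch. It is not taken from any CIF software or from the CIF2 draft; the helper names sniff_encoding and is_rejected and the sample strings are made up for illustration. It assumes a reader that honours a BOM if one is present and otherwise insists on strictly valid UTF-8.

```python
import codecs

# BOMs checked by this sketch. UTF-32 is deliberately left out; its
# little-endian BOM begins with the UTF-16-LE BOM, so a fuller version
# would have to test for it first.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def sniff_encoding(data: bytes) -> str:
    """Hypothetical helper: name the encoding, or raise ValueError."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")  # strict decoding: malformed sequences raise
    except UnicodeDecodeError as exc:
        raise ValueError(f"no BOM and not valid UTF-8: {exc}") from exc
    return "utf-8"


def is_rejected(data: bytes) -> bool:
    """True if sniff_encoding refuses the bytes."""
    try:
        sniff_encoding(data)
    except ValueError:
        return True
    return False


# Accented text saved in single-byte encodings instead of UTF-8.
latin1_bytes = "café au lait".encode("latin-1")
koi8_bytes = "Привет".encode("koi8-r")

# Pure-ASCII text saved as UTF-16 little-endian with no BOM.
utf16_ascii = "data_block".encode("utf-16-le")

print("Latin-1 rejected:", is_rejected(latin1_bytes))
print("KOI8-R rejected:", is_rejected(koi8_bytes))
print("KOI8-R read as CP1251:", koi8_bytes.decode("cp1251"))
print("BOM-less UTF-16 ASCII rejected:", is_rejected(utf16_ascii))
print("NUL bytes present:", "\x00" in utf16_ascii.decode("utf-8"))
```

The first two samples are refused by the strict UTF-8 decode, which is the "distinctive bit pattern" argument. The KOI8-R bytes decode without complaint as CP1251 into different Cyrillic letters, which is the ambiguity between single-byte encodings. The last sample is the BOM-less UTF-16 case: because it contains only ASCII characters it passes a strict UTF-8 check, and only the interleaved zero bytes give it away, so a reader relying on this approach would probably also want to reject NUL characters.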