Re: [ddlm-group] [THREAD 4] UTF8
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] [THREAD 4] UTF8
- From: James Hester <jamesrhester@gmail.com>
- Date: Fri, 23 Oct 2009 10:06:13 +1100
- In-Reply-To: <20091016091713.I65032@epsilon.pair.com>
- References: <C6F976F1.1206C%nick@csse.uwa.edu.au><504270.84370.qm@web87013.mail.ird.yahoo.com><20091013055314.F86319@epsilon.pair.com><279aad2a0910160435x3876c24ev797e022adbc05529@mail.gmail.com><20091016091713.I65032@epsilon.pair.com>
Dear All:

Herbert's argument for at least UCS2 inclusion, if I have understood correctly, is that CIF2 users may inadvertently or deliberately save a CIF file containing multilingual characters in a non-UTF8 encoding. This file may look fine until it escapes from the confines of that person's computer or lab, when suddenly other users find it difficult or impossible to decode.

I agree that inadvertent saving in another encoding is entirely likely, given that virtually all internationalised applications support multiple encodings out of the box. I would expect this to be more of a problem when a CIF file is edited directly, whereas a program is likely to do it properly - or at least get it wrong only once...

I believe that John's concerns about encodings may also arise from the multiple possible ways of encoding any given piece of non-ASCII text, and from having to deal with that incoming variety in a reliable way. This is a real problem: e.g. a file containing Cyrillic characters could be encoded in any of at least 5 ways, and these encodings differ only in which bytes correspond to which letters. If you don't read Russian, you may not even know that there is an encoding error. If the letters are in isolation, even a Russian speaker may have trouble.

But: despite agreeing with Herbert on the likelihood of CIF files being saved in the wrong encoding, I repeat the point that we are defining a standard for successfully communicating information between computers across time and space. In that context 'optional' does not make sense. If a conformant CIF reader cannot read every conformant CIF file, the standard has failed. It would therefore be better to completely ignore UCS2, so that when the CIF reader fails, it is because of non-conformance to the standard either at the writer's end or the reader's end, not because of lack of some option.
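As a rough illustration of the Cyrillic point above, here is a minimal Python sketch. The sample word, the list of five legacy encodings and the helper names `decodings` and `is_valid_utf8` are assumptions chosen for the example, not anything proposed for CIF2. It decodes one byte string under several single-byte Cyrillic encodings, none of which reports an error, and then checks the same bytes with a strict UTF-8 decoder.

```python
# Illustrative sketch only: the encoding list and sample word are assumptions.
SINGLE_BYTE_CYRILLIC = ["koi8_r", "cp1251", "iso8859_5", "cp866", "mac_cyrillic"]


def decodings(raw: bytes) -> dict:
    """Decode the same bytes under each legacy single-byte encoding."""
    return {name: raw.decode(name) for name in SINGLE_BYTE_CYRILLIC}


def is_valid_utf8(raw: bytes) -> bool:
    """Return True only if the bytes form well-formed UTF-8."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


word = "кристалл"            # "crystal" in Russian
raw = word.encode("koi8_r")  # the bytes as a KOI8-R editor would save them

results = decodings(raw)
for name, text in results.items():
    # Every decoder succeeds; only one of them yields the intended word.
    print(f"{name:13s} {text}")

# A strict UTF-8 reader rejects the legacy bytes outright ...
print("valid UTF-8 (KOI8-R bytes):", is_valid_utf8(raw))
# ... while accepting the same word saved as UTF-8.
print("valid UTF-8 (UTF-8 bytes): ", is_valid_utf8(word.encode("utf-8")))
```

The single-byte decoders cannot tell a right answer from a wrong one, because each simply assigns a character to the bytes it is given. The strict UTF-8 check is the only step in the sketch that can fail, which is the property relied on when only UTF-8 is permitted.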
Those who have inadvertently or otherwise miscoded their multilingual characters will simply cause an error (I say 'will' and not 'may'; see the paragraph below about the distinctiveness of UTF8 encoding), and will get feedback to correct the problem, just as they do today when they mess up their hand-edited CIFs.

One of the nice aspects of UTF8 is that the European non-ASCII characters require at least two bytes where the likely alternative non-Unicode encodings would have only one. These two-byte sequences have a distinctive bit pattern, which makes it very likely that a non-UTF8 encoding would be detected almost immediately. So in that sense I think that allowing only UTF8 encoding is a robust solution to the multiple encoding problem, as a non-UTF8 encoding can be automatically detected.

Finally, Herbert says:

> Even if we mandate UTF-8 as the archiving and file transmission
> standard, we really do need to deal with other encodings in a properly,
> self-identifying manner, just as emacs and vim do.

I would suggest that specifying a standards violation whenever a different encoding is detected is sufficiently proper. I don't think that the comparison with vim and emacs is valid: these applications aim to actually deal with text in different encodings, whereas we do not.

On Sat, Oct 17, 2009 at 12:36 AM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
> Dear Colleagues,
>
> I think as a practical matter there are two encodings for which we need
> to consider providing support:
>
> 1. UTF-8 -- I think we now all agree that this is the sensible default
> encoding for CIF-2
>
> 2. UCS-2/UTF-16. This is the encoding used in Java and in web
> browsers. It is also the encoding used in imgCIF base-32K binary
> encoding. This is where the BOM flag becomes important -- it tells you
> when a switch to UCS-2/UTF-16 has occurred and whether what follows is
> big-endian or little-endian. It also gives you the capability of
> switching back to UTF-8.
> However, the major use is simply as a flag at
> the start of a file, all of which is in one encoding.
>
> Certainly there are other encodings that people may use -- in a system
> dependent manner -- e.g. EBCDIC (yes it is still around) or 7-bit ASCII
> (what we have used in the past). I am not proposing that we try to get
> into the business of asking every parser to support every coding on every
> legacy system, and certainly for interchange, we should be telling people
> to stick to Unicode, preferably as UTF-8, but I am certain that people
> will still want to use CIF in other environments with other "native" (i.e.
> system-dependent) encodings, and everybody gains from having a formalism
> for what should only be system-internal files properly marked with the
> encoding they are using to avoid the disasters that can occur when such
> files escape from their system cage without proper marking as to what they
> are. Think of the mess we could have if people using Java accidentally
> shipped a UCS-2/UTF-16 file without a BOM. Most text editors will _not_
> show you the alternating 0 bytes on the ordinary ASCII characters in that
> encoding, but it can produce very strange errors even there, and when we
> get to embedded accented characters, there is likely to simply be a wrong
> character with no indication of an error.
>
> Even if we mandate UTF-8 as the archiving and file transmission
> standard, we really do need to deal with other encodings in a properly,
> self-identifying manner, just as emacs and vim do.
>
> Regards,
> Herbert
>
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
> Dowling College, Kramer Science Center, KSC 121
> Idle Hour Blvd, Oakdale, NY, 11769
>
> +1-631-244-3035
> yaya@dowling.edu

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
- Follow-Ups:
- Re: [ddlm-group] [THREAD 4] UTF8 (Herbert J. Bernstein)
- References:
- Re: [ddlm-group] [THREAD 4] UTF8 (Nick Spadaccini)
- Re: [ddlm-group] [THREAD 4] UTF8 (SIMON WESTRIP)
- Re: [ddlm-group] [THREAD 4] UTF8 (Herbert J. Bernstein)
- Re: [ddlm-group] [THREAD 4] UTF8 (James Hester)
- Re: [ddlm-group] [THREAD 4] UTF8 (Herbert J. Bernstein)