Re: [ddlm-group] [THREAD 4] UTF8
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] [THREAD 4] UTF8
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Fri, 23 Oct 2009 15:47:40 -0400 (EDT)
- In-Reply-To: <4AE20173.9060700@mcmaster.ca>
- References: <279aad2a0910120838t5f400d71wf1f237d05338c08@mail.gmail.com><C6F976F1.1206C%nick@csse.uwa.edu.au><279aad2a0910221613m2a2a7891k4ae23476e50f98e4@mail.gmail.com><20091022214818.D61491@epsilon.pair.com><279aad2a0910222132t5c8297aao90914fa40c4fbd91@mail.gmail.com><4AE20173.9060700@mcmaster.ca>
Dear Colleagues,

I have only mild objections to saying that "UTF-8 is the only official encoding for CIF 2". My mild objection is that imgCIF will not be compliant in several of its variants, but it will always be able to provide at least one compliant translation of any file: 50-60% bigger than it has to be in, say, UCS-2, but compliant.

No, the real problem is not what is officially the "right" way to write CIFs, but what people will really do. People will do what they have always done: work with CIF on whatever system they have. That system may be modern and support UTF-8, but, even then, its "native" mode may be something different. If we are lucky, the differences will be sufficiently dramatic to allow the encoding used to be detected from context. If somebody decides they are still using EBCDIC, we will have no trouble figuring that out, but sometimes the differences are more subtle. I just took a French message catalog for RasMol and converted it to the Latin-1 encoding. Most of the text is absolutely the same. Just a few accented characters differ. In a large text with just a few accents, this could easily be missed, and lots of people in Europe use the Latin-1 encoding.

I am not saying that we should handle Latin-1 in all CIF-2 parsers. I am saying that it would be a very good idea to conform to the vim or emacs editor conventions in marking CIFs with their encoding, so that if somebody does make a mistake and sends a journal a Latin-1 CIF-2 file instead of a UTF-8 CIF-2 file, there will be some chance of spotting the error.

This is the same issue as having the magic number #\# CIF 2.0, so that we have a chance of spotting cases where somebody is trying to feed in a different CIF level. Just because somebody might, somewhere, sometime, decide to send a file with a magic number such as #\# CIF 2.0 to a CIF 1 parser does not mean that suddenly we have to tell the person with the CIF 1 parser that their parser is broken. It just means that the person with the CIF 1 parser or the person with the CIF 2 file has a better chance of quickly figuring out that they have a mismatch.

People will edit in different encodings, whether we approve of it or not. We lose nothing by flagging the UTF-8 encoding, and we can save people a lot of time in the future.

Regards,
Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Fri, 23 Oct 2009, David Brown wrote:

> I would just like to point out a philosophical principle which we tried to
> observe in the earlier CIFs, and which I think very important, namely that in
> a standard like CIF it is only necessary to define one convention for each
> feature in the standard. Writers are required to convert the input to this
> convention and readers can always be confident that they will only have to
> read this one convention. Every time you allow alternative ways of encoding
> a piece of information you *require* the reader to be able to read both
> alternatives. If you allow three different encodings, you require three
> different parsers. If you allow ten different codings, you require ten
> different parsers in every piece of reading software. With one standard, a
> single parser works everywhere.
>
> If a standard allows two different codings, it is no longer a standard, it is
> two standards, and that is something we have tried to avoid (not always
> successfully) in CIF. It should be a goal.
> David
> _______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
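
The Latin-1 versus UTF-8 concern raised in the message above can be illustrated with a minimal Python sketch. It is not part of any CIF specification or of the original message. The function names (`declared_encoding`, `is_valid_utf8`, `diagnose`), the two regular expressions, and the choice to look only at the first five lines are all assumptions made for illustration; the patterns only loosely follow the emacs `-*- coding: ... -*-` and vim `fileencoding=` conventions, and the thread does not say what marker syntax CIF 2 would adopt.

```python
import re
from typing import Optional

# Hypothetical marker patterns, loosely modelled on the emacs and vim
# conventions for declaring a file encoding inside a comment line.
EMACS_CODING = re.compile(rb"-\*-.*?coding:\s*([-\w.]+).*?-\*-")
VIM_CODING = re.compile(rb"vim:.*?(?:fileencoding|fenc)=([-\w.]+)")


def declared_encoding(data: bytes, max_lines: int = 5) -> Optional[str]:
    """Return the encoding named by an editor-style marker, if any.

    Only the first max_lines lines are searched (an assumption of this sketch).
    """
    for line in data.splitlines()[:max_lines]:
        for pattern in (EMACS_CODING, VIM_CODING):
            match = pattern.search(line)
            if match:
                return match.group(1).decode("ascii").lower()
    return None


def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def diagnose(data: bytes) -> str:
    """Combine the declared marker with a strict UTF-8 decoding check."""
    declared = declared_encoding(data)
    if declared not in (None, "utf-8", "utf8"):
        return f"declared encoding {declared!r} is not UTF-8"
    if not is_valid_utf8(data):
        return "not valid UTF-8; possibly Latin-1 or another legacy encoding"
    if declared is None:
        return "valid UTF-8 (or plain ASCII), but no encoding marker found"
    return "declared and valid UTF-8"


if __name__ == "__main__":
    text = "# -*- coding: utf-8 -*-\n_note 'données mesurées'\n"
    print(diagnose(text.encode("utf-8")))
    # Same text, wrongly saved as Latin-1 while still claiming UTF-8.
    print(diagnose(text.encode("latin-1")))
```

A strict decoding check catches most Latin-1 files that contain accented characters, because a lone byte such as 0xE9 followed by an ASCII letter is not a legal UTF-8 sequence. It cannot tell a pure-ASCII file from a UTF-8 one, and it says nothing about what the author intended, which is why an explicit marker adds information that byte inspection alone does not.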
- Follow-Ups:
- Re: [ddlm-group] [THREAD 4] UTF8 (SIMON WESTRIP)
- References:
- [ddlm-group] [THREAD 4] UTF8 (James Hester)
- Re: [ddlm-group] [THREAD 4] UTF8 (Nick Spadaccini)
- Re: [ddlm-group] [THREAD 4] UTF8 (James Hester)
- Re: [ddlm-group] [THREAD 4] UTF8 (Herbert J. Bernstein)
- Re: [ddlm-group] [THREAD 4] UTF8 (James Hester)
- Re: [ddlm-group] [THREAD 4] UTF8 (David Brown)