Re: [ddlm-group] [THREAD 4] UTF8
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] [THREAD 4] UTF8
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Mon, 26 Oct 2009 10:34:51 -0400 (EDT)
- In-Reply-To: <279aad2a0910260536l48dcc06bg1ffbcda0936b529e@mail.gmail.com>
- References: <279aad2a0910120838t5f400d71wf1f237d05338c08@mail.gmail.com><279aad2a0910222132t5c8297aao90914fa40c4fbd91@mail.gmail.com><4AE20173.9060700@mcmaster.ca> <20091023152244.U10188@epsilon.pair.com><715417.99025.qm@web87006.mail.ird.yahoo.com><20091024104627.N28064@epsilon.pair.com><165167.96476.qm@web87006.mail.ird.yahoo.com><279aad2a0910252228w40b94ab1hea895f257bb58059@mail.gmail.com><960215.80636.qm@web87008.mail.ird.yahoo.com><20091026055916.E21351@epsilon.pair.com><279aad2a0910260536l48dcc06bg1ffbcda0936b529e@mail.gmail.com>
Dear James,

We see the balance points differently.  I would suggest taking a straw
poll of those who are interested and moving on to other issues.

imgCIF will use UTF-8, UCS-2 and several true binary representations,
but I guess that all but the UTF-8 can go under the "binary" heading.

Regards,
Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Mon, 26 Oct 2009, James Hester wrote:

> Hi Herbert and others:
>
>> The heart of the problem is the person who submits a non-UTF-8 file to a
>> system, such as the IUCr, that chooses to consider anything other than UTF-8
>> an error.  If you have no explicit flag citing what the person thinks they
>> used as an encoding, the only way you can detect and flag this error is by
>> examining the text character by character, looking for two things:
>>
>>   keywords and tags that are invalid
>>   strings that contain invalid characters
>>
>> Keywords and tags are not likely to raise warning flags in, say, the
>> Latin-1 vs. UTF-8 encoding, because the keywords are all from the common
>> ASCII portion of both encodings, and the tag names from the official
>> dictionaries are also all from the common ASCII portion of both encodings.
>
> Agreed.
>
>> That leaves us only with the contents of the strings themselves to use to
>> spot the differences, a dubious proposition if the person has only a few
>> accented letters.
>
> As I have explained previously, this is actually a distinguishing
> feature of UTF-8: a single accented character in Latin-1 or any
> ISO-8859 encoding will be broken UTF-8 immediately.  Two accented
> characters side by side have an a priori chance of accidentally being
> correct UTF-8 of 1/32, and if even only two such combinations occur
> the chance of it being correct UTF-8 is <0.1%.
> The more accented
> characters, the less likely an encoding error will not be detected,
> and the easier it is for a machine to detect the error; the fewer
> accented characters, the easier it is for a human to check the
> particular word involved.  And if there are any single high-bit set
> characters, you immediately know that it is not UTF-8.  So, I would
> assert that in the particular case of UTF-8 you do not need to worry
> about not being able to detect non-UTF-8 files.
>
> The IUCr is not really a good use case for alternative encodings,
> because (1) the strings that are likely to have accented etc.
> characters in them are those that appear in print and so will be
> examined by human eyes (2) the author is available, so straightforward
> rejection of non UTF-8 files is perfectly OK (just as a non-ASCII file
> at the moment would be treated).
>
>> If we give people a standard place to flag their encoding, then, if they
>> ignore that option and the editors they use ignore that option, we are no
>> worse off than if the option was not made available, but if we
>> provide the option and they either pay attention to what they are doing
>> (very unlikely) or their software pays attention to what they are doing
>> (an increasing reality) then the chances of producing a journal article
>> with a mistransliterated accent are reduced.
>
> I strongly disagree, for the reasons outlined in point 1 of my
> previous email.  At the same time as introducing the possibility of
> catching a rogue encoding or two, you introduce the possibility of
> using the wrong encoding and not noticing, which is much less likely
> under UTF-8.  And if we do say 'UTF-8 only', why would anyone write
> software in the first place for CIF2.0 that uses a different encoding?
> (Not only a rhetorical question - perhaps you have a use case in
> mind).
>
>> I do not claim this is a huge benefit, but inasmuch as the cost is very
>> low for providing it, it seems worth having, as it is worth having in XML.
>>
>> That does leave a disagreement on the cost.  I see only the expense of a
>> few extra characters in each file.  You seem primarily concerned about
>>
>> "2. The very existence of an encoding tag will encourage people not to
>> use UTF-8 if their own favourite encoding is easier and the recipient
>> agrees (e.g. the IUCr might agree to additionally accept ISO-8859-15).
>> We are thus encouraging people to play fast and loose with the
>> standard."
>
> This is indeed an important concern, but I am also concerned about the
> other points in the previous email, including adding complexity for a
> doubtful gain, but to continue:
>
>> I do not see the problem here.  We are designing a tool to use, and, of
>> course, people will extend and adapt it, just as both the IUCr and the
>> PDB already "play fast and loose with the standard".  Aren't we better off
>> if we provide a clean documented way for people to flag their deviations
>> from the standard than to force them to secretly engage in deviant
>> practices?  CIF is a tool, not a religion.  If the IUCr or the PDB needs
>> to do something different from the standard to get their jobs done, we
>> should look at ways to document, not to conceal, those practices.
>
> We have a pretty clear datapoint here: CIF1 has been pure ASCII for 16
> years, and as far as I know all the CIFs submitted to the IUCr have
> managed to stick to ASCII.  Therefore, there is no great need or
> desire to use different encodings.  If you have some counterexamples
> e.g. from imgCIF, that would be good to hear about.  Pending that
> counterexample, I would assert that insofar as ASCII was already
> sufficient to cover all character encoding needs, UTF-8 is even more
> so.  If we design a tool that clearly does the job, where will this
> need to change it come from?
> I would have argued completely
> differently 10 years ago, when a variety of encodings was a fact of
> life and no encoding was obviously superior.  But I really think that
> UTF-8 does the job so well that there will be no need to move beyond
> it in the lifetime of the CIF2 standard.  And yes, I know that this
> email will be recorded forever on the IUCr website...but given how
> long ASCII has lasted I feel safe.
>
>> CIF stopped being simple when mmCIF was introduced.  As Frances says, it
>> is like dealing with PostScript.  Unlike core CIF, most people would be
>> well advised not to try to read mmCIF (and certainly not imgCIF) or do
>> hand editing of it, even though it looks like something you should be able
>> to read.  As much as possible, it should be handled by appropriate
>> syntax-aware tools, and the primary target for this proposal is to make it
>> easy for the programmers of those tools to have a way to deal with the
>> reality of varying character encoding and to be able to reliably deliver
>> the UTF-8 version for external transmission, even on systems for which
>> UTF-8 is _not_ the native encoding.
>
> My apologies, I was referring to simple syntax rather than the whole
> CIF shooting match.  mmCIF has perfectly legal, simple CIF syntax,
> upon which you can indeed build beautiful, complex semantic
> structures.  Or not.  But that is your choice.  Contrast that to XML,
> where the syntax is far more complex (I believe there are 11 different
> node types, for example), and they accept all those different
> encodings to boot.  Also, in what sense is a given encoding a 'native'
> encoding for an OS?  I always assumed that this referred to the
> encoding of the filenames.  Surely there are not OSs out there that
> magically detect a text file and convert it always to UCS-2?  If such
> systems don't exist, then why do we care what the native encoding is?
>
>> I disagree about whether we should be looking at python and at XML.
>> Both are successful tools that are, in fact, serving major communities
>> that have strong overlap with our community.  Both provide tools our
>> communities use.  We do not have to slavishly adopt every feature of
>> either one, but it certainly will pay to look at the choices that they
>> have made and to consider what lessons we may learn that will be of value
>> in the development of CIF.
>
> I do agree with this sentiment wholeheartedly.  My point was that
> using Emacs and vim tags for encoding is rather pointless in our
> context, whereas it is reasonable for a programming language.
> Likewise, adding features such as support for a variety of encodings
> is simple for them, because only one program has to change: the
> CPython interpreter, whereas in our case there are a couple of orders
> of magnitude more programs involved, and we need to think about
> transferability as well.
>
> all the best,
> James.
>
>> On Mon, 26 Oct 2009, SIMON WESTRIP wrote:
>>
>> Perhaps 'benefit' is the wrong word - I was reading Herbert's argument
>> as suggesting that it would be
>> 'good practice' to include some sort of flag so that if different
>> encodings are permitted in the future, a mechanism is already in place to
>> identify them?
>>
>> In practice, if the only permitted encoding is UTF-8, for Acta type CIFs I
>> suspect we would adopt a zero-tolerance policy with respect to other
>> encodings, though just as we will provide tools for converting between
>> CIF1 and CIF2, we may well also include tools for converting encodings.
>>
>> Cheers
>>
>> Simon
>>
>> ________________________________
>> From: James Hester <jamesrhester@gmail.com>
>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>> Sent: Monday, 26 October, 2009 5:28:30
>> Subject: Re: [ddlm-group] [THREAD 4] UTF8
>>
>> The more I think about the proposal to add an encoding header or tag,
>> the more it frightens me.
>> Before dealing with the specific email
>> below, let me make the following additional points:
>>
>> 1. We have no way of knowing whether or not the correct encoding has
>> been included in the header, as Simon points out.  This is not just a
>> case of 'oh well, it was worth a try'.  This is a case of making
>> things worse: if the character set for ISO-8859-1 is used instead of
>> ISO-8859-15, only attentive human reading of the text will turn up the
>> problem.  So, while helping to read in a few extra files, for the
>> majority of encodings this proposal opens up the possibility of
>> introducing hard-to-find errors in human-readable text.  For this
>> reason I would strongly, strongly recommend that only self-identifying
>> encodings are tolerated in the first place, and that no encoding
>> header is recognised.
>>
>> 2. The very existence of an encoding tag will encourage people not to
>> use UTF-8 if their own favourite encoding is easier and the recipient
>> agrees (e.g. the IUCr might agree to additionally accept ISO-8859-15).
>> We are thus encouraging people to play fast and loose with the
>> standard.
>>
>> 3. Emacs and Vim are not the preferred file editing platform for most
>> of the crystallographic world.  Of course Python can use vi/emacs
>> encoding tags, as considerably more programmers are likely to be
>> familiar with vi and Emacs.
>>
>> 4. Let's be cautious when adopting practices from Python or (e.g.)
>> XML.  We need to appreciate the differences between them and us.  For
>> example, Python is essentially a single program (CPython) and so
>> upgrade paths are easier to manage.
>>
>> 5. Don't forget the data archives - if the IUCr don't remediate the
>> ISO-8859-15 file to UTF-8, the archives will have to, as they have to
>> be able to deliver CIFs which are readable for all recipients.  So
>> there is guaranteed additional work and complexity involved as soon as
>> anybody starts agreeing to take other encodings.
>>
>> 6. A key virtue of CIF is its simplicity.  A single acceptable
>> encoding is simple.  Multiple optional encodings are not.
>>
>> Returning to Herbert's latest email, I'm glad Simon can see the
>> benefit, but I still fail to.  Let's go through this again.  There are
>> two proposals on the table: both Herbert and I are happy to state
>> that UTF-8 is the only official encoding for CIF2.0 files.  Herbert
>> further proposes describing encoding flags in the standard, whereas I
>> think this is a bad idea.
>>
>> Let's go through Herbert's cases, with two more added for completeness:
>> H1.  Somebody submits a UTF-8 file without the flag
>> H2.  Somebody submits a UTF-8 file with the UTF-8 flag
>> H3.  Somebody submits a non-UTF-8 file either with no flag
>>      or with a UTF-8 flag
>> H4.  (deleted, becomes H5 or H6)
>> H5.  Somebody submits a non-UTF-8 file with a flag correctly telling
>>      us that it is encoding xxx
>> H6.  Somebody submits a non-UTF-8 file with a flag incorrectly
>>      telling us that it is encoding xxx
>>
>> Under my proposal, this list degenerates to:
>>
>> J1.  Somebody submits a UTF-8 file
>> J2.  Somebody submits a non-UTF-8 file
>>
>> In case J2 (the equivalent of cases H3, H5 and H6 above), the file
>> is rejected as a syntactically incorrect CIF, just as incorrect CIFs
>> are rejected today.  I don't see anything wrong with this - in the
>> IUCr use case, the author is around to deal with the problem and
>> resubmit correctly.  Alternatively, under Herbert's proposal, a
>> further level of checking can be done by using the encoding flag to
>> see if a correct CIF can be produced - and will probably still fail.
>> 'Probably' because, as Simon points out, if the author thinks they've
>> sent a UTF-8 file and haven't, they are unlikely to get niceties like
>> an encoding flag correct, so H6 (and H3) are the most likely sorts of
>> files that will reach the IUCr.
>> Furthermore, for a large number of
>> encodings, it will fail in a way that cannot be detected automatically
>> (see point 1 at the beginning).
>>
>> It would be good to hear from those who haven't said anything yet,
>> even if only to hear if they are undecided.
>>
>> James.
>>
>> On Sun, Oct 25, 2009 at 3:18 AM, SIMON WESTRIP
>> <simonwestrip@btinternet.com> wrote:
>>> Dear Herbert
>>>
>>> thanks for clarifying this for me - I can now see the benefits of such flags
>>> (actually, if I'd stopped to think about it, I should have spotted an
>>> analogy with the use of <meta charset=UTF-8...> tags in html...)
>>>
>>> Cheers
>>>
>>> Simon
>>>
>>> ________________________________
>>> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>>> Sent: Saturday, 24 October, 2009 16:07:50
>>> Subject: Re: [ddlm-group] [THREAD 4] UTF8
>>>
>>> Dear Simon,
>>>
>>> The world is not a perfect place, but there is increasing use of clear
>>> flags for encoding.  If we provide a place for the flag in CIF there are
>>> four possibilities on a submission:
>>>
>>> 1.  Somebody submits a UTF-8 file without the flag
>>> 2.  Somebody submits a UTF-8 file with the UTF-8 flag
>>> 3.  Somebody submits a non-UTF-8 file either with no flag
>>>     or with a UTF-8 flag
>>> 4.  Somebody submits a non-UTF-8 file with a flag telling
>>>     us that it is a non-UTF-8 file
>>>
>>> Cases 1, 2 and 4 all allow for rational handling of the file.
>>> Case 3 can result in mishandling.
>>>
>>> If we do not have the flag, we cannot have case 4, and all non-UTF-8
>>> files are highly likely to cause trouble.  Yes, getting into case 4 right
>>> now depends on users who know what encoding they are using, but python
>>> programmers are already learning to be careful about it, and both vi and
>>> emacs are pretty good at recognizing mismatches so users are learning
>>> to fix the comment if it is wrong.
>>>
>>> What is the worst that happens if we include the identification of the
>>> encoding? -- everybody just leaves it set at UTF-8 no matter what they do.
>>> We will have lost nothing.  But if just one submission makes use of the
>>> identification properly for a non-UTF-8 encoding we will have gained, and
>>> over the next few years, as the editors and their supporting scripts get
>>> smarter, I expect you will start to see significant use of the encoding
>>> flags, especially to distinguish UTF-8 from other popular unicode
>>> encodings, such as UCS-2.
>>>
>>> vim supports both the comment and the BOM.  I personally prefer the BOM to
>>> other methods, but the comment is increasingly used.
>>>
>>> Regards,
>>> Herbert
>>> =====================================================
>>> Herbert J. Bernstein, Professor of Computer Science
>>> Dowling College, Kramer Science Center, KSC 121
>>> Idle Hour Blvd, Oakdale, NY, 11769
>>>
>>> +1-631-244-3035
>>> yaya@dowling.edu
>>> =====================================================
>>>
>>> On Sat, 24 Oct 2009, SIMON WESTRIP wrote:
>>>
>>>> Herbert wrote:
>>>
>>> "I am saying that it would be a very good idea to conform to the
>>> vim or emacs editor conventions in marking CIF with their encoding, so
>>> that if somebody does make a mistake and send a journal a Latin-1 CIF-2
>>> file instead of a UTF-8 CIF-2, there will be some chance of spotting the
>>> error."
>>>
>>> I'm not sure what you're getting at here.  Having a UTF-8 identifier would
>>> not help in this case?  Or if you mean that the actual encoding used should
>>> be tagged, it seems unlikely that having already mistakenly (and probably
>>> unknowingly) used the wrong encoding, anyone would include such a tag?  So
>>> unless the encoding can be determined from something inherent to the
>>> encoding, e.g. a UTF-16 BOM, I can't see that a comment-type tag is of any
>>> benefit?
>>>
>>> If the standard specifies UTF-8 there should be no reason to identify this
>>> in the CIF.
>>>
>>> However, I can see the advantages of such a tag if it's envisaged that
>>> other encodings will be allowed in the future, or even simply to reinforce
>>> that the CIF is CIF2 (especially if the magic number has been omitted)?
>>>
>>> I have to confess that I am starting to worry about all this slightly.  As
>>> much as in the work I do I can happily read/write UTF-8 and convert from
>>> other encodings, at this stage I would probably struggle to convert from
>>> an unrecognized encoding - which is fair enough because if it's CIF2 it
>>> should be UTF-8 and I shouldn't need to convert anyway (!), but it is a
>>> worry with respect to the issue of trying to make adoption of CIF-2 as
>>> painless as possible for the end users.  But then again, I'm having a bad
>>> day :-)
>>>
>>> Cheers
>>>
>>> Simon
>>>
>>> ________________________________
>>> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>>> Sent: Friday, 23 October, 2009 20:47:40
>>> Subject: Re: [ddlm-group] [THREAD 4] UTF8
>>>
>>> Dear Colleagues,
>>>
>>> I have only mild objections to saying that "UTF-8 is the only official
>>> encoding for CIF 2".  My mild objection is that imgCIF will not be
>>> compliant in several of its variants, but it certainly will always be able
>>> to provide at least one compliant translation of any file, 50-60% bigger
>>> than it has to be in, say, UCS-2, but compliant.
>>>
>>> No, the real problem is not what is officially the "right" way to write
>>> CIFs, but what people will really do.  People will do what they have
>>> always done -- work with CIF on whatever system they have.  That system
>>> may be modern and support UTF-8, but, even then, its "native" mode may be
>>> something different.
>>> If we are lucky, the differences will be
>>> sufficiently dramatic to allow the encoding used to be detected from
>>> context.  If somebody decides they are still using EBCDIC, we will have no
>>> trouble figuring that out, but sometimes the differences are more subtle.
>>> I just took a French message catalog for RasMol and converted it to the
>>> Latin-1 encoding.  Most of the text is absolutely the same.  Just a few
>>> accented characters differ.  In a large text with just a few accents, this
>>> could easily be missed, and lots of people in Europe use the Latin-1
>>> encoding.  I am not saying that we should handle Latin-1 in all CIF-2
>>> parsers.  I am saying that it would be a very good idea to conform to the
>>> vim or emacs editor conventions in marking CIF with their encoding, so
>>> that if somebody does make a mistake and send a journal a Latin-1 CIF-2
>>> file instead of a UTF-8 CIF-2, there will be some chance of spotting the
>>> error.
>>>
>>> This is the same issue as having the magic number #\# CIF 2.0 so we have a
>>> chance of spotting cases where somebody is trying to feed in a different
>>> CIF level.  Just because somebody might, somewhere, sometime, decide to
>>> send in a file to a CIF 1 parser with a magic number such as #\# CIF 2.0
>>> does not mean that suddenly we have to tell the person with the CIF 1
>>> parser that their parser is broken.  It just means the person with the CIF
>>> 1 parser or the person with the CIF 2 file have a better chance of quickly
>>> figuring out they have a mismatch.
>>>
>>> People will edit in different encodings, whether we approve of it or not.
>>>
>>> We lose nothing by flagging the UTF-8 encoding, and we can save people a
>>> lot of time in the future.
>>>
>>> Regards,
>>> Herbert
>>>
>>> =====================================================
>>> Herbert J. Bernstein, Professor of Computer Science
>>> Dowling College, Kramer Science Center, KSC 121
>>> Idle Hour Blvd, Oakdale, NY, 11769
>>>
>>> +1-631-244-3035
>>> yaya@dowling.edu
>>> =====================================================
>>>
>>> On Fri, 23 Oct 2009, David Brown wrote:
>>>
>>>> I would just like to point out a philosophical principle which we tried to
>>>> observe in the earlier CIFs, and which I think very important, namely that
>>>> in a standard like CIF it is only necessary to define one convention for
>>>> each feature in the standard.  Writers are required to convert the input
>>>> to this convention and readers can always be confident that they will only
>>>> have to read this one convention.  Every time you allow alternative ways
>>>> of encoding a piece of information you *require* the reader to be able to
>>>> read both alternatives.  If you allow three different encodings, you
>>>> require three different parsers.  If you allow ten different codings, you
>>>> require ten different parsers in every piece of reading software.  With
>>>> one standard, a single parser works everywhere.
>>>>
>>>> If a standard allows two different codings, it is no longer a standard, it
>>>> is two standards, and that is something we have tried to avoid (not always
>>>> successfully) in CIF.  It should be a goal.
>>>> David
>>>>
>>> _______________________________________________
>>> ddlm-group mailing list
>>> ddlm-group@iucr.org
>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>
>> --
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
_______________________________________________ ddlm-group mailing list ddlm-group@iucr.org http://scripts.iucr.org/mailman/listinfo/ddlm-group
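
Two technical claims in the thread above can be checked mechanically: that Latin-1 text containing an accented character is generally not valid UTF-8, and (point 1 of the 26 October message) that a mix-up between ISO-8859-1 and ISO-8859-15 cannot be caught by decoding alone. The following is a minimal Python sketch and not part of the original messages; the function name is_valid_utf8 and the sample strings are illustrative assumptions, and a real CIF2 reader would do considerably more than this.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# "é" is the single byte 0xE9 in Latin-1 but the two bytes 0xC3 0xA9 in UTF-8,
# so the Latin-1 form of this word is rejected by a strict UTF-8 decoder.
latin1_bytes = "Université".encode("latin-1")
utf8_bytes = "Université".encode("utf-8")

print(is_valid_utf8(utf8_bytes))    # accepted
print(is_valid_utf8(latin1_bytes))  # rejected

# Byte 0xA4 is legal in both legacy encodings but means different things:
# the currency sign in ISO-8859-1 and the euro sign in ISO-8859-15.
# Both decodes succeed, so a wrong flag between these two goes unnoticed.
ambiguous = b"\xa4"
as_8859_1 = ambiguous.decode("iso-8859-1")
as_8859_15 = ambiguous.decode("iso-8859-15")
print(as_8859_1 != as_8859_15)
```

The sketch only illustrates the asymmetry being argued about: UTF-8 is largely self-checking, whereas the single-byte ISO-8859 encodings accept nearly any byte and so cannot be validated against a declared flag.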