Re: [ddlm-group] [THREAD 4] UTF8
- To: Group finalising DDLm and associated dictionaries <[email protected]>
- Subject: Re: [ddlm-group] [THREAD 4] UTF8
- From: "Herbert J. Bernstein" <[email protected]>
- Date: Mon, 26 Oct 2009 10:34:51 -0400 (EDT)
- In-Reply-To: <[email protected]>
- References: <[email protected]> <[email protected]> <[email protected]> <[email protected]> <[email protected]> <[email protected]> <[email protected]> <[email protected]> <[email protected]> <[email protected]> <[email protected]>
Dear James,

We see the balance points differently. I would suggest taking a straw poll of those who are interested and moving on to other issues. imgCIF will use UTF-8, UCS-2 and several true binary representations, but I guess that all but the UTF-8 can go under the "binary" heading.

Regards,
Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
[email protected]
=====================================================

On Mon, 26 Oct 2009, James Hester wrote:
> Hi Herbert and others:
>
>> The heart of the problem is the person who submits a non-UTF-8 file to a system, such as the IUCr, that chooses to consider anything other than UTF-8 an error. If you have no explicit flag citing what the person thinks they used as an encoding, the only way you can detect and flag this error is by examining the text character by character, looking for two things:
>>
>> - keywords and tags that are invalid
>> - strings that contain invalid characters
>>
>> Keywords and tags are not likely to raise warning flags in, say, the Latin-1 vs. UTF-8 encoding, because the keywords are all from the common ASCII portion of both encodings, and the tag names from the official dictionaries are also all from the common ASCII portion of both encodings.
>
> Agreed.
>
>> That leaves us only with the contents of the strings themselves to use to spot the differences, a dubious proposition if the person has only a few accented letters.
>
> As I have explained previously, this is actually a distinguishing feature of UTF-8: a single accented character in Latin-1 or any ISO-8859 encoding will be broken UTF-8 immediately. Two accented characters side by side have an a priori chance of accidentally being correct UTF-8 of 1/32, and if even only two such combinations occur, the chance of the text being correct UTF-8 is <0.1%. The more accented characters, the less likely an encoding error will go undetected and the easier it is for a machine to detect it; the fewer accented characters, the easier it is for a human to check the particular word involved. And if there are any single high-bit-set characters, you immediately know that it is not UTF-8. So I would assert that in the particular case of UTF-8 you do not need to worry about not being able to detect non-UTF-8 files.
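A minimal Python sketch of the check James describes (the file name and messages below are illustrative assumptions, not anything defined by CIF): a strict UTF-8 decode either succeeds or fails loudly, so no declared flag is needed to detect a non-UTF-8 file.

    def looks_like_utf8(raw: bytes) -> bool:
        """Return True if the byte stream is valid UTF-8 (ASCII included)."""
        try:
            raw.decode('utf-8')        # strict decoding is the default
            return True
        except UnicodeDecodeError:
            # A lone Latin-1 accented byte (0xC0-0xFF not followed by a
            # continuation byte in 0x80-0xBF) fails immediately.
            return False

    with open('submission.cif', 'rb') as f:    # hypothetical file name
        if not looks_like_utf8(f.read()):
            print('Rejected: file is not valid UTF-8')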
> The IUCr is not really a good use case for alternative encodings, because (1) the strings that are likely to have accented etc. characters in them are those that appear in print and so will be examined by human eyes, and (2) the author is available, so straightforward rejection of non-UTF-8 files is perfectly OK (just as a non-ASCII file at the moment would be treated).
>
>> If we give people a standard place to flag their encoding, then, if they ignore that option and the editors they use ignore that option, we are no worse off than if the option was not made available; but if we provide the option and they either pay attention to what they are doing (very unlikely) or their software pays attention to what they are doing (an increasing reality), then the chances of producing a journal article with a mistransliterated accent are reduced.
>
> I strongly disagree, for the reasons outlined in point 1 of my previous email. At the same time as introducing the possibility of catching a rogue encoding or two, you introduce the possibility of using the wrong encoding and not noticing, which is much less likely under UTF-8. And if we do say 'UTF-8 only', why would anyone write software for CIF2.0 that uses a different encoding in the first place? (Not only a rhetorical question - perhaps you have a use case in mind.)
>
>> I do not claim this is a huge benefit, but inasmuch as the cost is very low for providing it, it seems worth having, as it is worth having in XML.
>>
>> That does leave a disagreement on the cost. I see only the expense of a few extra characters in each file. You seem primarily concerned about
>>
>> "2. The very existence of an encoding tag will encourage people not to use UTF-8 if their own favourite encoding is easier and the recipient agrees (e.g. the IUCr might agree to additionally accept ISO-8859-15). We are thus encouraging people to play fast and loose with the standard."
>
> This is indeed an important concern, but I am also concerned about the other points in the previous email, including adding complexity for a doubtful gain. But to continue:
>
>> I do not see the problem here. We are designing a tool to use, and, of course, people will extend and adapt it, just as both the IUCr and the PDB already "play fast and loose with the standard". Aren't we better off if we provide a clean, documented way for people to flag their deviations from the standard than to force them to secretly engage in deviant practices? CIF is a tool, not a religion. If the IUCr or the PDB needs to do something different from the standard to get their jobs done, we should look at ways to document, not to conceal, those practices.
>
> We have a pretty clear datapoint here: CIF1 has been pure ASCII for 16 years, and as far as I know all the CIFs submitted to the IUCr have managed to stick to ASCII. Therefore, there is no great need or desire to use different encodings. If you have some counterexamples, e.g. from imgCIF, that would be good to hear about. Pending that counterexample, I would assert that insofar as ASCII was already sufficient to cover all character encoding needs, UTF-8 is even more so. If we design a tool that clearly does the job, where will this need to change it come from? I would have argued completely differently 10 years ago, when a variety of encodings was a fact of life and no encoding was obviously superior. But I really think that UTF-8 does the job so well that there will be no need to move beyond it in the lifetime of the CIF2 standard. And yes, I know that this email will be recorded forever on the IUCr website...but given how long ASCII has lasted I feel safe.
>
>> CIF stopped being simple when mmCIF was introduced. As Frances says, it is like dealing with PostScript. Unlike core CIF, most people would be well advised not to try to read mmCIF (and certainly not imgCIF) or to hand-edit it, even though it looks like something you should be able to read. As much as possible, it should be handled by appropriate syntax-aware tools, and the primary target for this proposal is to make it easy for the programmers of those tools to have a way to deal with the reality of varying character encoding and to be able to reliably deliver the UTF-8 version for external transmission, even on systems for which UTF-8 is _not_ the native encoding.
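The "deliver the UTF-8 version for external transmission" step Herbert mentions is, in most languages, a single transcode. A Python sketch (the file paths and the assumed local encoding are illustrative; a real tool would obtain the local encoding from the system or the user):

    def export_as_utf8(src_path, dst_path, local_encoding='latin-1'):
        """Re-encode a locally produced text file to UTF-8 for sending."""
        with open(src_path, 'r', encoding=local_encoding) as src:
            text = src.read()
        with open(dst_path, 'w', encoding='utf-8') as dst:
            dst.write(text)

    export_as_utf8('local.cif', 'outgoing.cif')   # hypothetical file names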
> My apologies, I was referring to simple syntax rather than the whole CIF shooting match. mmCIF has perfectly legal, simple CIF syntax, upon which you can indeed build beautiful, complex semantic structures. Or not. But that is your choice. Contrast that to XML, where the syntax is far more complex (I believe there are 11 different node types, for example), and they accept all those different encodings to boot. Also, in what sense is a given encoding a 'native' encoding for an OS? I always assumed that this referred to the encoding of the filenames. Surely there are no OSs out there that magically detect a text file and always convert it to UCS-2? If such systems don't exist, then why do we care what the native encoding is?
>
>> I disagree about whether we should be looking at Python and at XML. Both are successful tools that are, in fact, serving major communities that have strong overlap with our community. Both provide tools our communities use. We do not have to slavishly adopt every feature of either one, but it certainly will pay to look at the choices that they have made and to consider what lessons we may learn that will be of value in the development of CIF.
>
> I do agree with this sentiment wholeheartedly. My point was that using Emacs and vim tags for encoding is rather pointless in our context, whereas it is reasonable for a programming language. Likewise, adding features such as support for a variety of encodings is simple for them, because only one program has to change: the CPython interpreter. In our case there are a couple of orders of magnitude more programs involved, and we need to think about transferability as well.
>
> all the best,
> James.
>
>> On Mon, 26 Oct 2009, SIMON WESTRIP wrote:
>>
>>> Perhaps 'benefit' is the wrong word - I was reading Herbert's argument as suggesting that it would be 'good practice' to include some sort of flag so that if different encodings are permitted in the future, a mechanism is already in place to identify them?
>>
>> In practice, if the only permitted encoding is UTF-8, for Acta type CIFs I suspect we would adopt a zero-tolerance policy with respect to other encodings, though just as we will provide tools for converting between CIF1 and CIF2, we may well also include tools for converting encodings.
>>
>> Cheers
>>
>> Simon
>>
>> ________________________________
>> From: James Hester <[email protected]>
>> To: Group finalising DDLm and associated dictionaries <[email protected]>
>> Sent: Monday, 26 October, 2009 5:28:30
>> Subject: Re: [ddlm-group] [THREAD 4] UTF8
>>
>> The more I think about the proposal to add an encoding header or tag, the more it frightens me. Before dealing with the specific email below, let me make the following additional points:
>>
>> 1. We have no way of knowing whether or not the correct encoding has been included in the header, as Simon points out. This is not just a case of 'oh well, it was worth a try'. This is a case of making things worse: if the character set for ISO-8859-1 is used instead of ISO-8859-15, only attentive human reading of the text will turn up the problem. So, while helping to read in a few extra files, for the majority of encodings this proposal opens up the possibility of introducing hard-to-find errors in human-readable text. For this reason I would strongly, strongly recommend that only self-identifying encodings are tolerated in the first place, and that no encoding header is recognised.
>> 2. The very existence of an encoding tag will encourage people not to use UTF-8 if their own favourite encoding is easier and the recipient agrees (e.g. the IUCr might agree to additionally accept ISO-8859-15). We are thus encouraging people to play fast and loose with the standard.
>>
>> 3. Emacs and Vim are not the preferred file editing platform for most of the crystallographic world. Of course Python can use vi/emacs encoding tags, as considerably more programmers are likely to be familiar with vi and Emacs.
>>
>> 4. Let's be cautious when adopting practices from Python or (e.g.) XML. We need to appreciate the differences between them and us. For example, Python is essentially a single program (CPython) and so upgrade paths are easier to manage.
>>
>> 5. Don't forget the data archives - if the IUCr don't remediate the ISO-8859-15 file to UTF-8, the archives will have to, as they have to be able to deliver CIFs which are readable for all recipients. So there is guaranteed additional work and complexity involved as soon as anybody starts agreeing to take other encodings.
>>
>> 6. A key virtue of CIF is its simplicity. A single acceptable encoding is simple. Multiple optional encodings are not.
>>
>> Returning to Herbert's latest email, I'm glad Simon can see the benefit, but I still fail to. Let's go through this again. There are two proposals on the table: both Herbert and I are happy to state that UTF-8 is the only official encoding for CIF2.0 files. Herbert further proposes describing encoding flags in the standard, whereas I think this is a bad idea.
>>
>> Let's go through Herbert's cases, with two more added for completeness:
>>
>> H1. Somebody submits a UTF-8 file without the flag
>> H2. Somebody submits a UTF-8 file with the UTF-8 flag
>> H3. Somebody submits a non-UTF-8 file either with no flag or with a UTF-8 flag
>> H4. (deleted, becomes H5 or H6)
>> H5. Somebody submits a non-UTF-8 file with a flag correctly telling us that it is encoding xxx
>> H6. Somebody submits a non-UTF-8 file with a flag incorrectly telling us that it is encoding xxx
>>
>> Under my proposal, this list degenerates to:
>>
>> J1. Somebody submits a UTF-8 file
>> J2. Somebody submits a non-UTF-8 file
>>
>> In case J2 (the equivalent of cases H3, H5 and H6 above), the file is rejected as a syntactically incorrect CIF, just as incorrect CIFs are rejected today. I don't see anything wrong with this - in the IUCr use case, the author is around to deal with the problem and resubmit correctly. Alternatively, under Herbert's proposal, a further level of checking can be done by using the encoding flag to see if a correct CIF can be produced - and it will probably still fail. 'Probably' because, as Simon points out, if the author thinks they've sent a UTF-8 file and haven't, they are unlikely to get niceties like an encoding flag correct, so H6 (and H3) are the most likely sorts of files that will reach the IUCr. Furthermore, for a large number of encodings, it will fail in a way that cannot be detected automatically (see point 1 at the beginning).
>>
>> It would be good to hear from those who haven't said anything yet, even if only to hear if they are undecided.
>>
>> James.
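James's point 1 above can be demonstrated in a few lines of Python: decoding with the wrong single-byte codec never raises an error, because every byte value maps to some character in each ISO-8859 variant, whereas the same bytes fail loudly as UTF-8. (The euro sign used here is one of only eight code points where ISO-8859-15 and ISO-8859-1 differ; the example is illustrative.)

    raw = '\u20acuro'.encode('iso-8859-15')   # "€uro"; euro sign -> byte 0xA4

    print(raw.decode('iso-8859-15'))   # '€uro' - correct
    print(raw.decode('iso-8859-1'))    # '¤uro' - silently wrong character

    try:
        raw.decode('utf-8')            # the same bytes fail loudly as UTF-8
    except UnicodeDecodeError as err:
        print('UTF-8 decode error:', err)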
>> On Sun, Oct 25, 2009 at 3:18 AM, SIMON WESTRIP <[email protected]> wrote:
>>> Dear Herbert
>>>
>>> thanks for clarifying this for me - I can now see the benefits of such flags (actually, if I'd stopped to think about it, I should have spotted an analogy with the use of <meta charset=UTF-8...> tags in html...)
>>>
>>> Cheers
>>>
>>> Simon
>>>
>>> ________________________________
>>> From: Herbert J. Bernstein <[email protected]>
>>> To: Group finalising DDLm and associated dictionaries <[email protected]>
>>> Sent: Saturday, 24 October, 2009 16:07:50
>>> Subject: Re: [ddlm-group] [THREAD 4] UTF8
>>>
>>> Dear Simon,
>>>
>>> The world is not a perfect place, but there is increasing use of clear flags for encoding. If we provide a place for the flag in CIF there are four possibilities on a submission:
>>>
>>> 1. Somebody submits a UTF-8 file without the flag
>>> 2. Somebody submits a UTF-8 file with the UTF-8 flag
>>> 3. Somebody submits a non-UTF-8 file either with no flag or with a UTF-8 flag
>>> 4. Somebody submits a non-UTF-8 file with a flag telling us that it is a non-UTF-8 file
>>>
>>> Cases 1, 2 and 4 all allow for rational handling of the file. Case 3 can result in mishandling.
>>>
>>> If we do not have the flag, we cannot have case 4, and all non-UTF-8 files are highly likely to cause trouble. Yes, getting into case 4 right now depends on users who know what encoding they are using, but Python programmers are already learning to be careful about it, and both vi and emacs are pretty good at recognizing mismatches, so users are learning to fix the comment if it is wrong.
>>>
>>> What is the worst that happens if we include the identification of the encoding? -- everybody just leaves it set at UTF-8 no matter what they do. We will have lost nothing. But if just one submission makes use of the identification properly for a non-UTF-8 encoding we will have gained, and over the next few years, as the editors and their supporting scripts get smarter, I expect you will start to see significant use of the encoding flags, especially to distinguish UTF-8 from other popular Unicode encodings, such as UCS-2.
>>>
>>> vim supports both the comment and the BOM. I personally prefer the BOM to other methods, but the comment is increasingly used.
>>>
>>> Regards,
>>> Herbert
>>> =====================================================
>>> Herbert J. Bernstein, Professor of Computer Science
>>> Dowling College, Kramer Science Center, KSC 121
>>> Idle Hour Blvd, Oakdale, NY, 11769
>>> +1-631-244-3035
>>> [email protected]
>>> =====================================================
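What such a flag might look like in practice: a sketch that honours a BOM first and otherwise scans the first two lines for an emacs-style "-*- coding: ... -*-" comment, the convention Herbert alludes to. Nothing like this is defined in any CIF standard, and the function name and two-line scan are assumptions (vim's modeline syntax differs and is not handled here).

    import re

    CODING_RE = re.compile(rb'-\*-.*?coding:\s*([-\w.]+)')

    def sniff_declared_encoding(raw):
        """Return a declared or BOM-implied encoding name, or None."""
        if raw.startswith(b'\xef\xbb\xbf'):
            return 'utf-8'                    # UTF-8 BOM
        if raw.startswith((b'\xfe\xff', b'\xff\xfe')):
            return 'utf-16'                   # UTF-16 BOM, either byte order
        for line in raw.splitlines()[:2]:     # emacs checks the first two lines
            match = CODING_RE.search(line)
            if match:
                return match.group(1).decode('ascii')
        return None

A zero-tolerance recipient of the kind Simon describes earlier in the thread could then simply reject any file whose sniffed value is not 'utf-8'.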
>>> On Sat, 24 Oct 2009, SIMON WESTRIP wrote:
>>>
>>> Herbert wrote:
>>>
>>> "I am saying that it would be a very good idea to conform to the vim or emacs editor conventions in marking CIF with their encoding, so that if somebody does make a mistake and send a journal a Latin-1 CIF-2 file instead of a UTF-8 CIF-2, there will be some chance of spotting the error."
>>>
>>> I'm not sure what you're getting at here. Having a UTF-8 identifier would not help in this case? Or if you mean that the actual encoding used should be tagged, it seems unlikely that having already mistakenly (and probably unknowingly) used the wrong encoding, anyone would include such a tag? So unless the encoding can be determined from something inherent to the encoding, e.g. a UTF-16 BOM, I can't see that a comment-type tag is of any benefit?
>>>
>>> If the standard specifies UTF-8 there should be no reason to identify this in the CIF.
>>>
>>> However, I can see the advantages of such a tag if it's envisaged that other encodings will be allowed in the future, or even simply to reinforce that the CIF is CIF2 (especially if the magic number has been omitted)?
>>>
>>> I have to confess that I am starting to worry about all this slightly. As much as in the work I do I can happily read/write UTF-8 and convert from other encodings, at this stage I would probably struggle to convert from an unrecognized encoding - which is fair enough, because if it's CIF2 it should be UTF-8 and I shouldn't need to convert anyway (!), but it is a worry with respect to the issue of trying to make adoption of CIF-2 as painless as possible for the end users. But then again, I'm having a bad day :-)
>>>
>>> Cheers
>>>
>>> Simon
>>>
>>> ________________________________
>>> From: Herbert J. Bernstein <[email protected]>
>>> To: Group finalising DDLm and associated dictionaries <[email protected]>
>>> Sent: Friday, 23 October, 2009 20:47:40
>>> Subject: Re: [ddlm-group] [THREAD 4] UTF8
>>>
>>> Dear Colleagues,
>>>
>>> I have only mild objections to saying that "UTF-8 is the only official encoding for CIF 2". My mild objection is that imgCIF will not be compliant in several of its variants, but it certainly will always be able to provide at least one compliant translation of any file - 50-60% bigger than it has to be in, say, UCS-2, but compliant.
>>>
>>> No, the real problem is not what is officially the "right" way to write CIFs, but what people will really do. People will do what they have always done -- work with CIF on whatever system they have. That system may be modern and support UTF-8, but, even then, its "native" mode may be something different. If we are lucky, the differences will be sufficiently dramatic to allow the encoding used to be detected from context. If somebody decides they are still using EBCDIC, we will have no trouble figuring that out, but sometimes the differences are more subtle. I just took a French message catalog for RasMol and converted it to the Latin-1 encoding. Most of the text is absolutely the same. Just a few accented characters differ. In a large text with just a few accents, this could easily be missed, and lots of people in Europe use the Latin-1 encoding. I am not saying that we should handle Latin-1 in all CIF-2 parsers. I am saying that it would be a very good idea to conform to the vim or emacs editor conventions in marking CIF with their encoding, so that if somebody does make a mistake and sends a journal a Latin-1 CIF-2 file instead of a UTF-8 CIF-2 file, there will be some chance of spotting the error.
>>>
>>> This is the same issue as having the magic number #\# CIF 2.0, so we have a chance of spotting cases where somebody is trying to feed in a different CIF level. Just because somebody might, somewhere, sometime, decide to send a file with a magic number such as #\# CIF 2.0 to a CIF 1 parser does not mean that suddenly we have to tell the person with the CIF 1 parser that their parser is broken. It just means the person with the CIF 1 parser or the person with the CIF 2 file has a better chance of quickly figuring out they have a mismatch.
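The magic-number check is equally cheap to act on. A sketch (the #\#CIF_2.0 and #\#CIF_1.1 tokens are written per the draft conventions discussed in this thread; consult the final specifications for the exact form):

    def cif_magic_version(path):
        """Return the CIF version named in the file's magic number, or None."""
        with open(path, 'rb') as f:
            first_line = f.readline()
        if first_line.startswith(b'\xef\xbb\xbf'):
            first_line = first_line[3:]        # tolerate a UTF-8 BOM
        if first_line.startswith(b'#\\#CIF_2.0'):
            return '2.0'
        if first_line.startswith(b'#\\#CIF_1.1'):
            return '1.1'
        return None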
>>> People will edit in different encodings, whether we approve of it or not.
>>>
>>> We lose nothing by flagging the UTF-8 encoding, and we can save people a lot of time in the future.
>>>
>>> Regards,
>>> Herbert
>>> =====================================================
>>> Herbert J. Bernstein, Professor of Computer Science
>>> Dowling College, Kramer Science Center, KSC 121
>>> Idle Hour Blvd, Oakdale, NY, 11769
>>> +1-631-244-3035
>>> [email protected]
>>> =====================================================
>>>
>>> On Fri, 23 Oct 2009, David Brown wrote:
>>>
>>>> I would just like to point out a philosophical principle which we tried to observe in the earlier CIFs, and which I think very important, namely that in a standard like CIF it is only necessary to define one convention for each feature in the standard. Writers are required to convert the input to this convention, and readers can always be confident that they will only have to read this one convention. Every time you allow alternative ways of encoding a piece of information you *require* the reader to be able to read both alternatives. If you allow three different encodings, you require three different parsers. If you allow ten different encodings, you require ten different parsers in every piece of reading software. With one standard, a single parser works everywhere.
>>>>
>>>> If a standard allows two different encodings, it is no longer a standard, it is two standards, and that is something we have tried to avoid (not always successfully) in CIF. It should be a goal.
>>>>
>>>> David
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
[email protected]
http://scripts.iucr.org/mailman/listinfo/ddlm-group
- References:
- [ddlm-group] [THREAD 4] UTF8 (James Hester)
- Re: [ddlm-group] [THREAD 4] UTF8 (James Hester)
- Re: [ddlm-group] [THREAD 4] UTF8 (David Brown)
- Re: [ddlm-group] [THREAD 4] UTF8 (Herbert J. Bernstein)
- Re: [ddlm-group] [THREAD 4] UTF8 (SIMON WESTRIP)
- Re: [ddlm-group] [THREAD 4] UTF8 (Herbert J. Bernstein)
- Re: [ddlm-group] [THREAD 4] UTF8 (SIMON WESTRIP)
- Re: [ddlm-group] [THREAD 4] UTF8 (James Hester)
- Re: [ddlm-group] [THREAD 4] UTF8 (SIMON WESTRIP)
- Re: [ddlm-group] [THREAD 4] UTF8 (Herbert J. Bernstein)
- Re: [ddlm-group] [THREAD 4] UTF8 (James Hester)