Re: [Cif2-encoding] A new(?) compromise position
- To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@xxxxxxxx>
- Subject: Re: [Cif2-encoding] A new(?) compromise position
- From: "Herbert J. Bernstein" <yaya@xxxxxxxxxxxxxxxxxxxxxxx>
- Date: Wed, 29 Sep 2010 22:20:06 -0400 (EDT)
- In-Reply-To: <AANLkTin=s98ZB_9nenmyhN9xnn4iQrcLr8LSQadaAztW@mail.gmail.com>
- References: <AANLkTin=991tdnoUif0J40hsKZ0iYTf506A4i-Z_rdy2@mail.gmail.com><alpine.BSF.2.00.1009291027320.12237@epsilon.pair.com><AANLkTimzVZw_VUFeGLMWpr_xAP9_hb95ysLBWF51eKx2@mail.gmail.com><alpine.BSF.2.00.1009291230210.30232@epsilon.pair.com><AANLkTin=s98ZB_9nenmyhN9xnn4iQrcLr8LSQadaAztW@mail.gmail.com>
Dear James,

You are mistaken: John said the opposite about determining UTF8 from
context. The place where he and I differ is that he thinks you can't do it
even with a BOM, and I am willing to accept UTF8 with a BOM as sufficiently
disambiguated. The Wikipedia says no such thing about being able to
reliably detect non-UTF8 files. Let us use the Wikipedia's very own UTF8
example as a test case:

The code point '¢' U+00A2 = 00000000 10100010 is encoded as
11000010 10100010, which is 0xC2 0xA2 as a UTF8 byte string.

Now let us look at the Latin-1 code page at

http://www.utoronto.ca/web/HTMLdocs/NewHTML/iso_table.htm

which tells us that 0xc2 is Â in Latin 1 and 0xa2 is ¢.

There is no evidence that I have seen anywhere to support your proposition
that "furthermore, as UTF8 files have a distinctive bit-pattern, non-UTF8
files can be reliably detected." All the evidence I have seen points the
other way.

James, please look at the facts -- UTF8 without a BOM does not fit your
characterization of being self-identifying.

Regards,
Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Thu, 30 Sep 2010, James Hester wrote:

> My simple objective (for files containing non-ASCII characters) is that an application is
> able to determine the encoding of an incoming file with a high degree of certainty with no
> information beyond the CIF standard, the encoding standard, and the file contents. If the
> only choices are UTF8 or UTF16 there is no danger of a misassignment of encoding.
> Furthermore, as UTF8 files have a distinctive bit-pattern, non-UTF8 files can be reliably
> detected(*). This appears to me to be an excellent state of affairs.
>
> Although I have apparently restricted encodings to UTF8 and UTF16 in the preceding
> paragraph, this is simply in an effort to get something workable on the table that we can
> move forward with. I have no particular agenda to limit in future the possible encodings
> for CIF files, provided that those encodings can be reliably identified subject to the above
> restrictions. Indeed, this particular group was formed in part to work out a system for
> including those other encodings.
>
> I realise my wordsmithing on the new proposal is somewhat lax, but if we are in agreement on
> the principle I hope we are able to polish it up to everybody's satisfaction.
>
> James.
>
> (*) John's email has addressed the UTF8 question in his post adequately, and the Wikipedia
> entry also contains useful discussion.
>
> On Thu, Sep 30, 2010 at 3:00 AM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
> Dear James,
>
> I know from long and painful experience that files with just a few accented
> characters are very, very difficult to clearly identify, and can look like valid
> UTF8 files. UTF8 is _not_ self-identifying without the BOM.
>
> The case that really convinced me that there was a problem was a
> French document with a lower case e with an accent acute on the E. I nearly
> missed a misencoding of a Mac native file that, because it was being misread as
> a capital E in a UTF8 file, showed the accent as grave.
>
> There are simply too many cases like that in which a file written in a non-UTF8
> encoding looks like something reasonable, but wrong, to say that UTF8 without the
> BOM is self-identifying.
>
> As for the question of standards and applications, many programming
> language standards specify the action of processors of the language.
> In our case, to have a meaningful standard, we need to specify what
> is a syntactically valid CIF2 file, to specify the semantics for
> a compliant CIF2 reader and specify the required actions for
> a compliant CIF2 writer.
> We need to do so in a way that breaks
> as few existing applications as possible.
>
> I believe that applications are highly relevant to what we are trying to do.
> In particular, I favor strict rules on writers and liberal rules
> on readers, so that files get processed when possible, but tend to get
> cleaned up when being processed.
>
> That same frame of mind is why a lot of text editors invisibly add
> a BOM at the start of all UTF8 files, but try to accept UTF8 files
> with or without the BOM.
>
> Regards,
> Herbert
>
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
> Dowling College, Kramer Science Center, KSC 121
> Idle Hour Blvd, Oakdale, NY, 11769
>
> +1-631-244-3035
> yaya@dowling.edu
> =====================================================
>
> On Thu, 30 Sep 2010, James Hester wrote:
>
> Hi Herbert (I should be in bed, but whatever): I do not think it is
> appropriate to require the *application* to unambiguously identify the
> encoding, as no widely-recognised standard procedure exists to do this. The
> means of identification should rather be based on the international standard
> describing the encoding. Only UTF16 and UTF8 currently meet this
> requirement, I believe. I will try to express this better after a sleep...
>
> Regarding UTF8: I'm glad to see such vigilance in the cause of correctly
> identifying file encoding. A UTF8 file, naturally, can also look like a file
> in a variety of single-byte encodings regardless of a BOM at the front.
> However, a file in a non-UTF8 encoding is highly unlikely to be mistaken for
> a UTF8 file. Therefore, providing an input file is first checked for UTF8
> encoding, I do not see any significant danger of a mistaken encoding. I'd
> be happy to include recommendations to use a UTF8 BOM and to check for UTF8
> encoding before any others that we may eventually add to the list.
>
> I'm curious to see what these files are that you have trouble identifying as
> UTF8, as they may represent obscure corner cases. Any chance you could dig
> one or two up?
>
> James.
>
> On Thu, Sep 30, 2010 at 12:45 AM, Herbert J. Bernstein
> <yaya@bernstein-plus-sons.com> wrote:
> Dear James,
>
> I respect the attempt to compromise, but the sentence "At
> present only UTF8 and UTF16 are considered to satisfy this
> constraint" is not quite right without some additional work on the spec.
> UTF16 with a BOM is self-identifying. UTF8 with a BOM is also
> self-identifying. However, UTF8 without a BOM and without some other
> disambiguator (e.g. the accented o's), is _not_ self identifying. I know,
> because my students and I hit this problem all the time in working with
> multi-language, multi-code-page message catalogs for RasMol. Sometimes the
> only way we can figure out whether a UTF8 file is really a UTF8 file is to
> start translating the actual strings and see if they make sense.
>
> Another problem is what the "ASCII range" means to various people.
> I suggest being much more restrictive and saying "the printable
> ASCII characters, code points 32-126 plus CR, LF and HT"
>
> Combined, the statement I would suggest is:
>
> If a CIF2 text stream contains only characters equivalent to the
> printable ASCII characters plus HT, LF and CR, i.e. decimal code
> points 32-126, 9, 10 and 13, then to ensure compatibility with
> CIF1, the CIF2 specification does not require any explicit
> specification of the particular encoding used, but recommends
> the use of UTF8. If a CIF2 text stream contains any characters
> equivalent to Unicode code points not in that range, then for
> any encoding other than UTF8 it is the responsibility of any
> application writing such a CIF to unambiguously specify the
> particular encoding used, preferably within the file itself.
> UTF16 with a BOM conforms to this requirement.
>
> Regards,
> Herbert
>
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
> Dowling College, Kramer Science Center, KSC 121
> Idle Hour Blvd, Oakdale, NY, 11769
>
> +1-631-244-3035
> yaya@dowling.edu
> =====================================================
>
> On Thu, 30 Sep 2010, James Hester wrote:
>
> Here is a newish compromise:
>
> Encoding: The encoding of CIF2 text streams containing only code points in the ASCII
> range is not specified. CIF2 text streams containing any code points outside the ASCII
> range must be encoded such that the encoding can be reliably identified from the file
> contents. At present only UTF8 and UTF16 are considered to satisfy this constraint.
>
> Commentary: this is intended to mean that encoding works 'as for CIF1' (Proposals 1,2)
> for files containing only ASCII text, and works as for Proposal 4 for any other files.
> I believe that this allows legacy workflows to operate smoothly on CIF2 files (legacy
> workflows do not process non ASCII text) but also avoids the tower of Babel effect that
> will ensue if non-ASCII codepoints are encoded using local conventions.
>
> To explain the thinking further, perhaps I could take another stab at Herbert's point of
> view in my own words. Herbert (I think correctly) surmises that all currently used CIF
> applications do not explicitly specify the encoding of their input and output files, and
> so therefore are conceptually working with CIFs in a variety of local encodings.
> Mandating any encoding for CIF2 would therefore force at least some and perhaps most of
> these applications to change the way they read and write text, which is disruptive and
> obtuse when the system works fine as it is. Proposals 1 and 2 are aimed at avoiding
> this disruption.
>
> On the other hand, I look at the same situation and see that all this software is in
> fact reading and writing ASCII, because all of these local encodings are actually
> equivalent to ASCII for characters used in CIFs, and I further assert that this happy
> coincidence between encodings is the single reason CIF files are easily transferable
> between different systems.
>
> These two points of view create two different results if the CIF character repertoire is
> extended beyond the ASCII range. If we allow the current approach to encoding to
> continue, the happy coincidence of encodings ceases to operate outside the ASCII range
> and CIF files are no longer easily interchangeable. If we make explicit the commonality
> of CIF1 encodings by mandating a common set of identifiable encodings, the use of
> default encodings has to be abandoned with accompanying effort from programmers.
>
> I believe that this latest proposal respects Herbert's concerns as well as mine, and is
> eminently workable as a starting point for going forward. I'm now off to do a sample
> change and expect unanimous support from all parties when I return in an hour's time :)
>
> On Wed, Sep 29, 2010 at 8:25 PM, Brian McMahon <bm@iucr.org> wrote:
> I think the crux of the issue is as follows:
>
> [But part of our difficulty is that we are all having separate
> epiphanies, and focusing on five different "cruxes". Clarifying
> the real divergence between our views would be a genuine benefit of
> a Skype conference, to which I have no personal objection.]
>
> In the real world, a need may arise to exchange CIFs constructed in
> non-canonical encodings. ("Canonical" probably means UTF-8 and/or
> UTF-16). Such a need would involve some transcoding strategy.
>
> What is the actual likelihood of that need arising?
>
> I would characterise James's position as "not very, and even less
> if the software written to generate CIFs is constrained to use
> canonical encodings within the standard".
>
> I would characterise the position of the rest of us as "reasonable to
> high, so that we wish to formulate the standard in a way that
> recognises non-canonical encodings and helps to establish or at
> least inform appropriate transcoding strategies". There appear to be
> strong disagreements among us, but in fact there's a lot of common
> ground, and a drafting exercise would probably move us towards a
> consensus.
>
> Do you agree that that is a fair assessment?
>
> If so, we can analyse further: what are the implications of mandating
> a canonical encoding or not if judgement (a) is wrong and if judgement
> (b) is wrong? My feeling is that the world will not end - or even
> change very much - in any case; but it could determine whether we
> need to formulate an optimal transcoding strategy now, or can defer
> it to a later date.
>
> However, if anyone thinks this is just another diversion, I'll drop
> this line of approach so as not to slow things down even more.
>
> Regards
> Brian
>
> On Tue, Sep 28, 2010 at 09:28:25PM -0400, Herbert J.
> Bernstein wrote:
> > John,
> >
> > Now I am totally confused about what you are proposing and agree with Simon
> > that what is needed is for you to state your proposal as the precise wording
> > that you propose to insert and/or change in the current CIF2 change document
> > "5 July 2010: draft of changes to the existing CIF 1.1 specification
> > for public discussion"
> >
> > If I understand your proposal correctly, the _only_ thing you are proposing
> > that differs in any way from my proposed motion is a mandate that a
> > CIF2 conformant reader must be able to read a UTF8 CIF2 file, but
> > that _no_ CIF application would actually be required to provide such
> > code, provided there was some mechanism available to transcode from
> > UTF8 to the local encoding,
> > which does not seem to be a mandate on the conformant CIF2 reader at
> > all, but a requirement for the provision of a portable utility to
> > do that external transcoding.
> >
> > If that is the case, wouldn't it make more sense to just provide that
> > utility than to argue about whether my motion requires somebody to write
> > their own? Having the utility in hand would avoid having multiple,
> > conflicting interpretations of this input transcoding requirement.
> >
> > If I have read your message correctly, please just write the utility you
> > are proposing. If I have read your message incorrectly, please
> > write the specification changes you propose for the draft changes
> > in place of the changes in my motion.
> >
> > _This_ is why it was, is, and will remain a good idea to simply have
> > a meeting and talk these things out.
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
>
> _______________________________________________
> cif2-encoding mailing list
> cif2-encoding@iucr.org
> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
_______________________________________________
cif2-encoding mailing list
cif2-encoding@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif2-encoding
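
The 0xC2 0xA2 test case discussed in this message, and the proposed "printable ASCII plus HT, LF and CR" rule quoted from earlier in the thread, can be checked mechanically. What follows is only a minimal Python sketch under stated assumptions, not code from any CIF library or from the participants: the helper names has_utf8_bom, is_valid_utf8 and is_cif1_safe are hypothetical and exist purely for illustration. It shows that the two bytes decode without error both as UTF8 and as Latin-1, so passing a UTF8 validity check does not by itself prove that a BOM-less file was written as UTF8.

```python
import codecs

# Code points named in the proposed wording: printable ASCII 32-126
# plus HT (9), LF (10) and CR (13).
CIF1_SAFE = frozenset(range(32, 127)) | {9, 10, 13}


def has_utf8_bom(data: bytes) -> bool:
    """True if the byte string starts with the UTF8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes form a well-formed UTF8 sequence.

    This says only that the bytes *could* be UTF8; it does not say that
    the writer actually intended UTF8.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def is_cif1_safe(data: bytes) -> bool:
    """True if every byte is in the restricted ASCII set defined above."""
    return all(byte in CIF1_SAFE for byte in data)


# The test case from the message: the bytes 0xC2 0xA2.
sample = bytes([0xC2, 0xA2])

as_utf8 = sample.decode("utf-8")       # one character, U+00A2 (cent sign)
as_latin1 = sample.decode("latin-1")   # two characters, U+00C2 then U+00A2

print(is_valid_utf8(sample))   # well-formed UTF8
print(len(as_utf8), len(as_latin1))
print(has_utf8_bom(sample))    # no BOM, so nothing marks the intended reading
print(is_cif1_safe(sample))    # outside the restricted ASCII set
```

A reader built along the "liberal readers" line could call has_utf8_bom first and fall back to is_valid_utf8 only as a heuristic. Many Latin-1 byte sequences do fail UTF8 validation (a lone 0xE9, for instance), but the sample above shows that some do not, which is the ambiguity at issue in this thread.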