Re: [Cif2-encoding] A new(?) compromise position
- To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@xxxxxxxx>
- Subject: Re: [Cif2-encoding] A new(?) compromise position
- From: "Herbert J. Bernstein" <yaya@xxxxxxxxxxxxxxxxxxxxxxx>
- Date: Wed, 29 Sep 2010 22:20:06 -0400 (EDT)
- In-Reply-To: <[email protected]>
- References: <[email protected]><[email protected]><[email protected]><[email protected]><[email protected]>
Dear James,

You are mistaken: John said the opposite about determining UTF8 from context. The place where he and I differ is that he thinks you can't do it even with a BOM, whereas I am willing to accept UTF8 with a BOM as sufficiently disambiguated.

Wikipedia says no such thing about being able to reliably detect non-UTF8 files. Let us use Wikipedia's very own UTF8 example as a test case:

  The code point '¢' U+00A2 = 00000000 10100010
  is encoded as               11000010 10100010
  which is 0xC2 0xA2 as a UTF8 byte string.

Now let us look at the Latin-1 code page at

  http://www.utoronto.ca/web/HTMLdocs/NewHTML/iso_table.htm

which tells us that 0xC2 is Â in Latin-1 and 0xA2 is ¢. That is, the same two bytes form a valid UTF8 string and a valid Latin-1 string.

I have seen no evidence anywhere to support your proposition that "furthermore, as UTF8 files have a distinctive bit-pattern, non-UTF8 files can be reliably detected." All the evidence I have seen points the other way.

James, please look at the facts: UTF8 without a BOM does not fit your characterization of being self-identifying.

Regards,
  Herbert

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 [email protected]
=====================================================

On Thu, 30 Sep 2010, James Hester wrote:

> My simple objective (for files containing non-ASCII characters) is that an application is
> able to determine the encoding of an incoming file with a high degree of certainty with no
> information beyond the CIF standard, the encoding standard, and the file contents. If the
> only choices are UTF8 or UTF16 there is no danger of a misassignment of encoding.
> Furthermore, as UTF8 files have a distinctive bit-pattern, non-UTF8 files can be reliably
> detected(*). This appears to me to be an excellent state of affairs.
>
> Although I have apparently restricted encodings to UTF8 and UTF16 in the preceding
> paragraph, this is simply in an effort to get something workable on the table that we can
> move forward with. I have no particular agenda to limit in future the possible encodings
> for CIF files, provided that those encodings can be reliably identified subject to the above
> restrictions. Indeed, this particular group was formed in part to work out a system for
> including those other encodings.
>
> I realise my wordsmithing on the new proposal is somewhat lax, but if we are in agreement on
> the principle I hope we are able to polish it up to everybody's satisfaction.
>
> James.
>
> (*) John's email has addressed the UTF8 question in his post adequately, and the wikipedia
> entry also contains useful discussion.
>
> On Thu, Sep 30, 2010 at 3:00 AM, Herbert J. Bernstein <[email protected]> wrote:
> Dear James,
>
>  I know from long and painful experience that files with just a few accented
>  characters are very, very difficult to clearly identify, and can look like valid
>  UTF8 files. UTF8 is _not_ self-identifying without the BOM.
>
>  The case that really convinced me that there was a problem was a
>  French document with a lower case e with an accent acute on the E. I nearly
>  missed a misencoding of a Mac native file that, because it was being misread as
>  a capital E in a UTF8 file, showed the accent as grave.
>
>  There are simply too many cases like that in which a file written in a non-UTF8
>  encoding looks like something reasonable, but wrong, to say that UTF8 without the
>  BOM is self-identifying.
>
>  As for the question of standards and applications, many programming
>  language standards specify the action of processors of the language.
>  In our case, to have a meaningful standard, we need to specify what
>  is a syntactically valid CIF2 file, to specify the semantics for
>  a compliant CIF2 reader and specify the required actions for
>  a compliant CIF2 writer.
>  We need to do so in a way that breaks
>  as few existing applications as possible.
>
>  I believe that applications are highly relevant to what we are trying to do.
>  In particular, I favor strict rules on writers and liberal rules
>  on readers, so that files get processed when possible, but tend to get
>  cleaned up when being processed.
>
>  That same frame of mind is why a lot of text editors invisibly add
>  a BOM at the start of all UTF8 files, but try to accept UTF8 files
>  with or without the BOM.
>
>
>  Regards,
>    Herbert
>
>
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  [email protected]
> =====================================================
>
> On Thu, 30 Sep 2010, James Hester wrote:
>
> Hi Herbert (I should be in bed, but whatever): I do not think it is
> appropriate to require the *application* to unambiguously identify the
> encoding, as no widely-recognised standard procedure exists to do this.
> The means of identification should rather be based on the international
> standard describing the encoding. Only UTF16 and UTF8 currently meet this
> requirement, I believe. I will try to express this better after a
> sleep...
>
> Regarding UTF8: I'm glad to see such vigilance in the cause of correctly
> identifying file encoding. A UTF8 file, naturally, can also look like a
> file in a variety of single-byte encodings regardless of a BOM at the front.
> However, a file in a non-UTF8 encoding is highly unlikely to be mistaken
> for a UTF8 file. Therefore, providing an input file is first checked for UTF8
> encoding, I do not see any significant danger of a mistaken encoding. I'd
> be happy to include recommendations to use a UTF8 BOM and to check for
> UTF8 encoding before any others that we may eventually add to the list.
>
> I'm curious to see what these files are that you have trouble identifying
> as UTF8, as they may represent obscure corner cases. Any chance you could
> dig one or two up?
>
> James.
> On Thu, Sep 30, 2010 at 12:45 AM, Herbert J. Bernstein
> <[email protected]> wrote:
>      Dear James,
>
>      I respect the attempt to compromise, but the sentence "At
>      present only UTF8 and UTF16 are considered to satisfy this
>      constraint" is not quite right without some additional work
>      on the spec. UTF16 with a BOM is self-identifying. UTF8 with
>      a BOM is also self-identifying. However, UTF8 without a BOM
>      and without some other disambiguator (e.g. the accented o's)
>      is _not_ self-identifying. I know, because my students
>      and I hit this problem all the time in working with
>      multi-language, multi-code-page message catalogs for RasMol.
>      Sometimes the only way we can figure out whether a UTF8 file
>      is really a UTF8 file is to start translating the actual
>      strings and see if they make sense.
>
>      Another problem is what the "ASCII range" means to various
>      people. I suggest being much more restrictive and saying "the
>      printable ASCII characters, code points 32-126 plus CR, LF and HT"
>
>      Combined, the statement I would suggest is:
>
>      If a CIF2 text stream contains only characters equivalent to the
>      printable ASCII characters plus HT, LF and CR, i.e. decimal code
>      points 32-126, 9, 10 and 13, then to ensure compatibility with
>      CIF1, the CIF2 specification does not require any explicit
>      specification of the particular encoding used, but recommends
>      the use of UTF8.
>      If a CIF2 text stream contains any characters
>      equivalent to Unicode code points not in that range, then for
>      any encoding other than UTF8 it is the responsibility of any
>      application writing such a CIF to unambiguously specify the
>      particular encoding used, preferably within the file itself.
>      UTF16 with a BOM conforms to this requirement.
>
>      Regards,
>        Herbert
>
>      =====================================================
>       Herbert J. Bernstein, Professor of Computer Science
>         Dowling College, Kramer Science Center, KSC 121
>              Idle Hour Blvd, Oakdale, NY, 11769
>
>                       +1-631-244-3035
>                       [email protected]
>      =====================================================
>
>
> On Thu, 30 Sep 2010, James Hester wrote:
>
>      Here is a newish compromise:
>
>      Encoding: The encoding of CIF2 text streams containing only code points in the ASCII
>      range is not specified. CIF2 text streams containing any code points outside the ASCII
>      range must be encoded such that the encoding can be reliably identified from the file
>      contents. At present only UTF8 and UTF16 are considered to satisfy this constraint.
>
>      Commentary: this is intended to mean that encoding works 'as for CIF1' (Proposals 1,2)
>      for files containing only ASCII text, and works as for Proposal 4 for any other files.
>      I believe that this allows legacy workflows to operate smoothly on CIF2 files (legacy
>      workflows do not process non ASCII text) but also avoids the tower of Babel effect that
>      will ensue if non-ASCII codepoints are encoded using local conventions.
>
>      To explain the thinking further, perhaps I could take another stab at Herbert's point of
>      view in my own words. Herbert (I think correctly) surmises that all currently used CIF
>      applications do not explicitly specify the encoding of their input and output files, and
>      so therefore are conceptually working with CIFs in a variety of local encodings.
>      Mandating any encoding for CIF2 would therefore force at least some and perhaps most of
>      these applications to change the way they read and write text, which is disruptive and
>      obtuse when the system works fine as it is. Proposals 1 and 2 are aimed at avoiding
>      this disruption.
>
>      On the other hand, I look at the same situation and see that all this software is in
>      fact reading and writing ASCII, because all of these local encodings are actually
>      equivalent to ASCII for characters used in CIFs, and I further assert that this happy
>      coincidence between encodings is the single reason CIF files are easily transferable
>      between different systems.
>
>      These two points of view create two different results if the CIF character repertoire is
>      extended beyond the ASCII range. If we allow the current approach to encoding to
>      continue, the happy coincidence of encodings ceases to operate outside the ASCII range
>      and CIF files are no longer easily interchangeable. If we make explicit the commonality
>      of CIF1 encodings by mandating a common set of identifiable encodings, the use of
>      default encodings has to be abandoned with accompanying effort from programmers.
>
>      I believe that this latest proposal respects Herbert's concerns as well as mine, and is
>      eminently workable as a starting point for going forward. I'm now off to do a sample
>      change and expect unanimous support from all parties when I return in an hour's time :)
>
>      On Wed, Sep 29, 2010 at 8:25 PM, Brian McMahon <[email protected]> wrote:
>           I think the crux of the issue is as follows:
>
>           [But part of our difficulty is that we are all having separate
>           epiphanies, and focusing on five different "cruxes". Clarifying
>           the real divergence between our views would be a genuine benefit of
>           a Skype conference, to which I have no personal objection.]
>
>           In the real world, a need may arise to exchange CIFs constructed in
>           non-canonical encodings. ("Canonical" probably means UTF-8 and/or
>           UTF-16). Such a need would involve some transcoding strategy.
>
>           What is the actual likelihood of that need arising?
>
>           I would characterise James's position as "not very, and even less
>           if the software written to generate CIFs is constrained to use
>           canonical encodings within the standard".
>
>           I would characterise the position of the rest of us as "reasonable to
>           high, so that we wish to formulate the standard in a way that
>           recognises non-canonical encodings and helps to establish or at
>           least inform appropriate transcoding strategies". There appear to be
>           strong disagreements among us, but in fact there's a lot of common
>           ground, and a drafting exercise would probably move us towards a
>           consensus.
>
>           Do you agree that that is a fair assessment?
>
>           If so, we can analyse further: what are the implications of mandating
>           a canonical encoding or not if judgement (a) is wrong and if judgement
>           (b) is wrong? My feeling is that the world will not end - or even
>           change very much - in any case; but it could determine whether we
>           need to formulate an optimal transcoding strategy now, or can defer
>           it to a later date.
>
>           However, if anyone thinks this is just another diversion, I'll drop
>           this line of approach so as not to slow things down even more.
>
>           Regards
>           Brian
>
>      On Tue, Sep 28, 2010 at 09:28:25PM -0400, Herbert J. Bernstein wrote:
>      > John,
>      >
>      > Now I am totally confused about what you are proposing and agree with Simon
>      > that what is needed is for you to state your proposal as the precise wording
>      > that you propose to insert and/or change in the current CIF2 change document
>      > "5 July 2010: draft of changes to the existing CIF 1.1 specification
>      > for public discussion"
>      >
>      > If I understand your proposal correctly, the _only_ thing you are proposing
>      > that differs in any way from my proposed motion is a mandate that a
>      > CIF2 conformant reader must be able to read a UTF8 CIF2 file, but
>      > that _no_ CIF application would actually be required to provide such
>      > code, provided there was some mechanism available to transcode from
>      > UTF8 to the local encoding,
>      > which does not seem to be a mandate on the conformant CIF2 reader at
>      > all, but a requirement for the provision of a portable utility to
>      > do that external transcoding.
>      >
>      > If that is the case, wouldn't it make more sense to just provide that
>      > utility than to argue about whether my motion requires somebody to write
>      > their own? Having the utility in hand would avoid having multiple,
>      > conflicting interpretations of this input transcoding requirement.
>      >
>      > If I have read your message correctly, please just write the utility you
>      > are proposing. If I have read your message incorrectly, please
>      > write the specification changes you propose for the draft changes
>      > in place of the changes in my motion.
>      >
>      > _This_ is why it was, is, and will remain a good idea to simply have
>      > a meeting and talk these things out.
>      >
>
>      --
>      T +61 (02) 9717 9907
>      F +61 (02) 9717 3145
>      M +61 (04) 0249 4148
>
> _______________________________________________
> cif2-encoding mailing list
> [email protected]
> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
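
The 0xC2 0xA2 test case from the top of this message can be checked directly. The following is only a minimal Python sketch, not part of the original exchange: the helper name `is_valid_utf8` and the sample word are assumptions introduced for illustration. It shows that the two bytes decode without error under both UTF8 and Latin-1, and also what a strict UTF8 check does with a short Latin-1 string holding one accented letter.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# The test case from the message: the two bytes 0xC2 0xA2.
sample = bytes([0xC2, 0xA2])
as_utf8 = sample.decode("utf-8")      # one character: U+00A2
as_latin1 = sample.decode("latin-1")  # two characters: U+00C2, U+00A2

# Assumed example: a Latin-1 encoded word ending in e-acute (byte 0xE9).
latin1_text = "caf\u00e9".encode("latin-1")

print([hex(ord(c)) for c in as_utf8])
print([hex(ord(c)) for c in as_latin1])
print(is_valid_utf8(sample), is_valid_utf8(latin1_text))
```

Passing the strict check only shows that the bytes are well-formed UTF8, not that the author wrote them as UTF8; failing it shows only that this particular input is not well-formed UTF8. The sketch does not settle how often real non-UTF8 files happen to pass.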
_______________________________________________
cif2-encoding mailing list
[email protected]
http://scripts.iucr.org/mailman/listinfo/cif2-encoding
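
The rule discussed in the quoted thread (no encoding declaration needed for code points 32-126 plus HT, LF and CR; a BOM counted as sufficient identification for UTF8 and UTF16) could be sketched roughly as below. This is an illustrative Python sketch under stated assumptions: the function name `classify_stream` and its return labels are hypothetical and do not come from any CIF2 draft.

```python
import codecs

# Code points allowed without declaring an encoding, following the wording
# suggested in the thread: printable ASCII 32-126 plus HT (9), LF (10), CR (13).
RESTRICTED_ASCII = frozenset(range(32, 127)) | {9, 10, 13}


def classify_stream(data: bytes) -> str:
    """Hypothetical classifier; the labels are illustrative only."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf8-with-bom"
    # Caveat: a UTF-32 little-endian BOM also begins with the UTF-16 LE BOM
    # bytes; this sketch does not try to tell the two apart.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf16-with-bom"
    if all(byte in RESTRICTED_ASCII for byte in data):
        return "restricted-ascii"
    return "needs-disambiguation"


print(classify_stream(b"data_example\n_cell.length_a 5.4\n"))
print(classify_stream(codecs.BOM_UTF8 + "\u00e9".encode("utf-8")))
print(classify_stream("\u00e9".encode("utf-16")))
print(classify_stream(bytes([0xC2, 0xA2])))
```

The last call is the BOM-less 0xC2 0xA2 case: nothing in the bytes themselves says which encoding was intended, so the sketch reports it as needing some other disambiguator rather than guessing.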