
Re: [Cif2-encoding] A new(?) compromise position

To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@xxxxxxxx>
Subject: Re: [Cif2-encoding] A new(?) compromise position
From: "Herbert J. Bernstein" <yaya@xxxxxxxxxxxxxxxxxxxxxxx>
Date: Wed, 29 Sep 2010 22:20:06 -0400 (EDT)
In-Reply-To: <[email protected]>
References: <[email protected]><[email protected]><[email protected]><[email protected]><[email protected]>

Dear James,

   You are mistaken. John said the opposite about determining UTF8 from 
context.  Where he and I differ is that he thinks you can't do it even 
with a BOM, whereas I am willing to accept UTF8 with a BOM as sufficiently 
disambiguated.

   The Wikipedia says no such thing about being able to reliably detect 
non-UTF8 files.

   Let us use the Wikipedia's very own UTF8 example as a test case:

The code point U+00A2 (the cent sign) = 00000000 10100010 is encoded as
11000010 10100010, which is 0xC2 0xA2 as a UTF8 byte string.

   Now let us look at the Latin-1 code page at 
http://www.utoronto.ca/web/HTMLdocs/NewHTML/iso_table.htm

which tells us that 0xC2 is &Acirc; in Latin-1 and 0xA2 is &cent;, so the 
same two bytes are also a perfectly valid Latin-1 string.
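
   To make that concrete, here is a minimal sketch in Python (my choice of 
language purely for illustration; nothing in this thread depends on it) that 
decodes those same two bytes under both encodings.  Both decodes succeed, so 
validity alone cannot tell the two readings apart.

```python
# The two bytes from the Wikipedia example above.
data = bytes([0xC2, 0xA2])

# Strict UTF8 decoding succeeds and yields the single code point U+00A2.
as_utf8 = data.decode("utf-8", errors="strict")

# Latin-1 decoding also succeeds, yielding U+00C2 followed by U+00A2.
as_latin1 = data.decode("latin-1")

print([hex(ord(c)) for c in as_utf8])    # ['0xa2']
print([hex(ord(c)) for c in as_latin1])  # ['0xc2', '0xa2']
```

   Neither call raises an error, which is the whole difficulty: a reader 
has to decide which of two legal interpretations was intended.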

   There is no evidence that I have seen anywhere to support your position 
that "furthermore, as UTF8 files have a distinctive bit-pattern, non-UTF8 
files can be reliably detected."  All the evidence I have seen points the 
other way.

   James, please look at the facts -- UTF8 without a BOM does not fit your 
characterization of being self-identifying.
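
   For comparison, the BOM check itself is trivial.  The following is only 
a sketch: the function name sniff_bom is hypothetical, it assumes that only 
UTF8 and UTF16 are in play (as in the proposal under discussion), and it is 
not taken from any CIF2 draft.  It returns None when no BOM is present, which 
is exactly the case in which the file does not identify itself.

```python
import codecs

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none.

    Hypothetical helper for illustration; assumes only UTF8/UTF16 candidates.
    """
    if data.startswith(codecs.BOM_UTF8):        # EF BB BF
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        return "utf-16-le"
    return None

print(sniff_bom(b"\xef\xbb\xbf\xc2\xa2"))  # utf-8
print(sniff_bom(b"\xc2\xa2"))              # None
```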

   Regards,
     Herbert 
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  [email protected]
=====================================================

On Thu, 30 Sep 2010, James Hester wrote:

> My simple objective (for files containing non-ASCII characters) is that an application is
> able to determine the encoding of an incoming file with a high degree of certainty with no
> information beyond the CIF standard, the encoding standard, and the file contents. If the
> only choices are UTF8 or UTF16 there is no danger of a misassignment of encoding.
> Furthermore, as UTF8 files have a distinctive bit-pattern, non-UTF8 files can be reliably
> detected(*). This appears to me to be an excellent state of affairs.
> 
> Although I have apparently restricted encodings to UTF8 and UTF16 in the preceding
> paragraph, this is simply in an effort to get something workable on the table that we can
> move forward with. I have no particular agenda to limit in future the possible encodings
> for CIF files, provided that those encodings can be reliably identified subject to the above
> restrictions. Indeed, this particular group was formed in part to work out a system for
> including those other encodings.
> 
> I realise my wordsmithing on the new proposal is somewhat lax, but if we are in agreement on
> the principle I hope we are able to polish it up to everybody's satisfaction.
> 
> James.
> 
> (*) John's email has addressed the UTF8 question in his post adequately, and the wikipedia
> entry also contains useful discussion.
> 
> On Thu, Sep 30, 2010 at 3:00 AM, Herbert J. Bernstein <[email protected]> wrote:
>       Dear James,
>
>       I know from long and painful experience that files with just a few accented
>       characters are very, very difficult to clearly identify, and can look like valid
>       UTF8 files.  UTF8 is _not_ self-identifying without the BOM.
>
>       The case that really convinced me that there was a problem was a
>       French document with a lower case e with an acute accent.  I nearly
>       missed a misencoding of a Mac native file that, because it was being misread as
>       a capital E in a UTF8 file, showed the accent as grave.
>
>       There are simply too many cases like that in which a file written in a non-UTF8
>       encoding looks like something reasonable, but wrong, to say that UTF8 without the
>       BOM is self-identifying.
>
>       As for the question of standards and applications, many programming
>       language standards specify the action of processors of the language.
>       In our case, to have a meaningful standard, we need to specify what
>       is a syntactically valid CIF2 file, to specify the semantics for
>       a compliant CIF2 reader and specify the required actions for
>       a compliant CIF2 writer.  We need to do so in a way that breaks
>       as few existing applications as possible.
>
>       I believe that applications are highly relevant to what we are trying to do.
>       In particular, I favor strict rules on writers and liberal rules
>       on readers, so that files get processed when possible, but tend to get
>       cleaned up when being processed.
>
>       That same frame of mind is why a lot of text editors invisibly add
>       a BOM at the start of all UTF8 files, but try to accept UTF8 files
>       with or without the BOM.
> 
> 
>  Regards,
>    Herbert
> 
> 
> 
> On Thu, 30 Sep 2010, James Hester wrote:
>
>       Hi Herbert (I should be in bed, but whatever): I do not think it is
>       appropriate to require the *application* to unambiguously identify the
>       encoding, as no widely-recognised standard procedure exists to do this.
>       The means of identification should rather be based on the international
>       standard describing the encoding. Only UTF16 and UTF8 currently meet this
>       requirement, I believe. I will try to express this better after a sleep...
>
>       Regarding UTF8: I'm glad to see such vigilance in the cause of correctly
>       identifying file encoding. A UTF8 file, naturally, can also look like a
>       file in a variety of single-byte encodings regardless of a BOM at the front.
>       However, a file in a non-UTF8 encoding is highly unlikely to be mistaken
>       for a UTF8 file. Therefore, providing an input file is first checked for
>       UTF8 encoding, I do not see any significant danger of a mistaken encoding.
>       I'd be happy to include recommendations to use a UTF8 BOM and to check for
>       UTF8 encoding before any others that we may eventually add to the list.
>
>       I'm curious to see what these files are that you have trouble identifying
>       as UTF8, as they may represent obscure corner cases. Any chance you could
>       dig one or two up?
>
>       James.
>       On Thu, Sep 30, 2010 at 12:45 AM, Herbert J. Bernstein
>       <[email protected]> wrote:
>            Dear James,
>
>            I respect the attempt to compromise, but the sentence "At
>            present only UTF8 and UTF16 are considered to satisfy this
>            constraint" is not quite right without some additional work
>            on the spec.  UTF16 with a BOM is self-identifying.  UTF8 with
>            a BOM is also self-identifying.  However, UTF8 without a BOM
>            and without some other disambiguator (e.g. the accented o's)
>            is _not_ self-identifying.  I know, because my students and I
>            hit this problem all the time in working with multi-language,
>            multi-code-page message catalogs for RasMol.  Sometimes the
>            only way we can figure out whether a UTF8 file is really a
>            UTF8 file is to start translating the actual strings and see
>            if they make sense.
>
>            Another problem is what the "ASCII range" means to various
>            people.  I suggest being much more restrictive and saying "the
>            printable ASCII characters, code points 32-126 plus CR, LF and
>            HT".
>
>            Combined, the statement I would suggest is:
>
>            If a CIF2 text stream contains only characters equivalent to the
>            printable ASCII characters plus HT, LF and CR, i.e. decimal code
>            points 32-126, 9, 10 and 13, then to ensure compatibility with
>            CIF1, the CIF2 specification does not require any explicit
>            specification of the particular encoding used, but recommends
>            the use of UTF8.  If a CIF2 text stream contains any characters
>            equivalent to Unicode code points not in that range, then for
>            any encoding other than UTF8 it is the responsibility of any
>            application writing such a CIF to unambiguously specify the
>            particular encoding used, preferably within the file itself.
>            UTF16 with a BOM conforms to this requirement.
>
>            Regards,
>              Herbert
> 
>
>       On Thu, 30 Sep 2010, James Hester wrote:
>
>            Here is a newish compromise:
>
>            Encoding: The encoding of CIF2 text streams containing only code
>            points in the ASCII range is not specified. CIF2 text streams
>            containing any code points outside the ASCII range must be encoded
>            such that the encoding can be reliably identified from the file
>            contents. At present only UTF8 and UTF16 are considered to satisfy
>            this constraint.
>
>            Commentary: this is intended to mean that encoding works 'as for
>            CIF1' (Proposals 1,2) for files containing only ASCII text, and
>            works as for Proposal 4 for any other files. I believe that this
>            allows legacy workflows to operate smoothly on CIF2 files (legacy
>            workflows do not process non ASCII text) but also avoids the tower
>            of Babel effect that will ensue if non-ASCII codepoints are encoded
>            using local conventions.
>
>            To explain the thinking further, perhaps I could take another stab
>            at Herbert's point of view in my own words. Herbert (I think
>            correctly) surmises that all currently used CIF applications do not
>            explicitly specify the encoding of their input and output files, and
>            so therefore are conceptually working with CIFs in a variety of
>            local encodings. Mandating any encoding for CIF2 would therefore
>            force at least some and perhaps most of these applications to change
>            the way they read and write text, which is disruptive and obtuse
>            when the system works fine as it is. Proposals 1 and 2 are aimed at
>            avoiding this disruption.
>
>            On the other hand, I look at the same situation and see that all
>            this software is in fact reading and writing ASCII, because all of
>            these local encodings are actually equivalent to ASCII for
>            characters used in CIFs, and I further assert that this happy
>            coincidence between encodings is the single reason CIF files are
>            easily transferable between different systems.
>
>            These two points of view create two different results if the CIF
>            character repertoire is extended beyond the ASCII range. If we allow
>            the current approach to encoding to continue, the happy coincidence
>            of encodings ceases to operate outside the ASCII range and CIF files
>            are no longer easily interchangeable. If we make explicit the
>            commonality of CIF1 encodings by mandating a common set of
>            identifiable encodings, the use of default encodings has to be
>            abandoned with accompanying effort from programmers.
>
>            I believe that this latest proposal respects Herbert's concerns as
>            well as mine, and is eminently workable as a starting point for
>            going forward. I'm now off to do a sample change and expect
>            unanimous support from all parties when I return in an hour's
>            time :)
>
>            On Wed, Sep 29, 2010 at 8:25 PM, Brian McMahon
>            <[email protected]> wrote:
>                 I think the crux of the issue is as follows:
>
>                 [But part of our difficulty is that we are all having separate
>                 epiphanies, and focusing on five different "cruxes". Clarifying
>                 the real divergence between our views would be a genuine benefit
>                 of a Skype conference, to which I have no personal objection.]
>
>                 In the real world, a need may arise to exchange CIFs constructed
>                 in non-canonical encodings. ("Canonical" probably means UTF-8
>                 and/or UTF-16). Such a need would involve some transcoding
>                 strategy.
>
>                 What is the actual likelihood of that need arising?
>
>                 I would characterise James's position as "not very, and even
>                 less if the software written to generate CIFs is constrained to
>                 use canonical encodings within the standard".
>
>                 I would characterise the position of the rest of us as
>                 "reasonable to high, so that we wish to formulate the standard
>                 in a way that recognises non-canonical encodings and helps to
>                 establish or at least inform appropriate transcoding
>                 strategies". There appear to be strong disagreements among us,
>                 but in fact there's a lot of common ground, and a drafting
>                 exercise would probably move us towards a consensus.
>
>                 Do you agree that that is a fair assessment?
>
>                 If so, we can analyse further: what are the implications of
>                 mandating a canonical encoding or not if judgement (a) is wrong
>                 and if judgement (b) is wrong? My feeling is that the world will
>                 not end - or even change very much - in any case; but it could
>                 determine whether we need to formulate an optimal transcoding
>                 strategy now, or can defer it to a later date.
>
>                 However, if anyone thinks this is just another diversion, I'll
>                 drop this line of approach so as not to slow things down even
>                 more.
>
>                 Regards
>                 Brian
>
>            On Tue, Sep 28, 2010 at 09:28:25PM -0400, Herbert J. Bernstein wrote:
>            > John,
>            >
>            > Now I am totally confused about what you are proposing and agree with Simon
>            > that what is needed is for you to state your proposal as the precise wording
>            > that you propose to insert and/or change in the current CIF2 change document
>            > "5 July 2010: draft of changes to the existing CIF 1.1 specification
>            > for public discussion"
>            >
>            > If I understand your proposal correctly, the _only_ thing you are proposing
>            > that differs in any way from my proposed motion is a mandate that a
>            > CIF2 conformant reader must be able to read a UTF8 CIF2 file, but
>            > that _no_ CIF application would actually be required to provide such
>            > code, provided there was some mechanism available to transcode from
>            > UTF8 to the local encoding,
>            > which does not seem to be a mandate on the conformant CIF2 reader at
>            > all, but a requirement for the provision of a portable utility to
>            > do that external transcoding.
>            >
>            > If that is the case, wouldn't it make more sense to just provide that
>            > utility than to argue about whether my motion requires somebody to write
>            > their own?  Having the utility in hand would avoid having multiple,
>            > conflicting interpretations of this input transcoding requirement.
>            >
>            > If I have read your message correctly, please just write the utility you
>            > are proposing.  If I have read your message incorrectly, please
>            > write the specification changes you propose for the draft changes
>            > in place of the changes in my motion.
>            >
>            > _This_ is why it was, is, and will remain a good idea to simply have
>            > a meeting and talk these things out.
>            >
>
>            --
>            T +61 (02) 9717 9907
>            F +61 (02) 9717 3145
>            M +61 (04) 0249 4148
> 
>
>       _______________________________________________
>       cif2-encoding mailing list
>       [email protected]
>       http://scripts.iucr.org/mailman/listinfo/cif2-encoding
> 
> 
> 
>

_______________________________________________
cif2-encoding mailing list
[email protected]
http://scripts.iucr.org/mailman/listinfo/cif2-encoding
