
Re: [Cif2-encoding] A new(?) compromise position

"UTF8 without a BOM does not fit your
characterization of being self-identifying"

I believe this is true, which is why UTF8 has to be specified as the default
when the encoding is not self-identifying. With their use prescribed for
'non-ASCII CIFs' in the spec, UTF8 or 'self-identifying' seems
quite satisfactory to me (obviously not worded like this :) - allowing 'ASCII' CIFs to
be prepared/used as they always have been, and allowing the encoding of
'non-ASCII CIFs' to be determined with minimal uncertainty.
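
As a rough illustration of the "UTF8 or self-identifying" rule, here is a minimal Python sketch. The function name guess_cif_encoding is hypothetical, and the rule it encodes (a BOM identifies the encoding, anything else is treated as UTF8) is only my reading of the position above, not wording from any CIF2 draft.

```python
import codecs


def guess_cif_encoding(data: bytes) -> str:
    """Return a Python codec name using a 'BOM, else default to UTF8' rule.

    Hypothetical helper: only UTF8 and UTF16 BOMs are considered, matching
    the two encodings named in this thread.
    """
    # A UTF8 BOM makes the file self-identifying; "utf-8-sig" strips it on decode.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # Either UTF16 byte order is identified by its BOM; the "utf-16" codec
    # reads the BOM to pick the byte order.
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    # No marker present: fall back to the specified default.
    return "utf-8"
```

A pure-ASCII CIF falls through to the default and decodes identically under UTF8, so such files can still be prepared and used as before.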

Cheers

Simon


From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@iucr.org>
Sent: Thursday, 30 September, 2010 3:20:06
Subject: Re: [Cif2-encoding] A new(?) compromise position

Dear James,

  You are mistaken.  John said the opposite about determining UTF8 from
context.  The place where he and I differ is that he thinks you can't do it
even with a BOM, and I am willing to accept UTF8 with a BOM as sufficiently
disambiguated.

  The Wikipedia says no such thing about being able to reliably detect
non-UTF8 files.

  Let us use the Wikipedia's very own UTF8 example as a test case:

The code point U+00A2 (cent sign) = 00000000 10100010 -> 11000010 10100010
which is  0xC2 0xA2 as a UTF8 byte string

  Now let us look at the Latin-1 code page at
http://www.utoronto.ca/web/HTMLdocs/NewHTML/iso_table.htm

which tells us that 0xc2 is &Acirc; in Latin-1 and 0xa2 is &cent;.
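
The two readings of that byte pair can be reproduced with a short Python sketch. The variable names are illustrative only; "latin-1" is Python's codec name for ISO 8859-1.

```python
# The two bytes from the example above.
raw = bytes([0xC2, 0xA2])

# Read as UTF8: a single character, U+00A2 (cent sign).
as_utf8 = raw.decode("utf-8")

# Read as Latin-1: two characters, U+00C2 (A circumflex) then U+00A2.
as_latin1 = raw.decode("latin-1")

print(len(as_utf8), [hex(ord(c)) for c in as_utf8])
print(len(as_latin1), [hex(ord(c)) for c in as_latin1])
```

Both decodes succeed without error, so the bytes alone do not say which reading was intended.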

  There is no evidence that I have seen anywhere to support your position
that "furthermore, as UTF8 files have a distinctive bit-pattern, non-UTF8
files can be reliably detected."  All the evidence I have seen points the
other way.

  James, please look at the facts -- UTF8 without a BOM does not fit your
characterization of being self-identifying.

  Regards,
    Herbert
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Thu, 30 Sep 2010, James Hester wrote:

> My simple objective (for files containing non-ASCII characters) is that an application is
> able to determine the encoding of an incoming file with a high degree of certainty with no
> information beyond the CIF standard, the encoding standard, and the file contents.  If the
> only choices are UTF8 or UTF16 there is no danger of a misassignment of encoding. 
> Furthermore, as UTF8 files have a distinctive bit-pattern, non-UTF8 files can be reliably
> detected(*).  This appears to me to be an excellent state of affairs.
>
> Although I have apparently restricted encodings to UTF8 and UTF16 in the preceding
> paragraph, this is simply in an effort to get something workable on the table that we can
> move forward with.  I have no particular agenda to limit in future the possible encodings
> for CIF files, provided that those encodings can be reliably identified subject to the above
> restrictions.  Indeed, this particular group was formed in part to work out a system for
> including those other encodings. 
>
> I realise my wordsmithing on the new proposal is somewhat lax, but if we are in agreement on
> the principle I hope we are able to polish it up to everybody's satisfaction.
>
> James.
>
> (*) John's email has addressed the UTF8 question in his post adequately, and the wikipedia
> entry also contains useful discussion.
>
> On Thu, Sep 30, 2010 at 3:00 AM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
>      Dear James,
>
>       I know from long and painful experience that files with just a few accented
>      characters are very, very difficult to clearly identify, and can look like valid
>      UTF8 files.  UTF8 is _not_ self-identifying without the BOM.
>
>       The case that really convinced me that there was a problem was a
>      French document with a lower-case e with an acute accent.  I nearly
>      missed a misencoding of a Mac-native file that, because it was being misread as
>      a capital E in a UTF8 file, showed the accent as grave.
>
>       There are simply too many cases like that in which a file written in a non-UTF8
>      encoding looks like something reasonable, but wrong, to say that UTF8 without the
>      BOM is self-identifying.
>
>       As for the question of standards and applications, many programming
>      language standards specify the action of processors of the language.
>      In our case, to have a meaningful standard, we need to specify what
>      is a syntactically valid CIF2 file, to specify the semantics for
>      a compliant CIF2 reader and specify the required actions for
>      a compliant CIF2 writer.  We need to do so in a way that breaks
>      as few existing applications as possible.
>
>       I believe that applications are highly relevant to what we are trying to do.
>       In particular, I favor strict rules on writers and liberal rules
>      on readers, so that files get processed when possible, but tend to get
>      cleaned up when being processed.
>
>       That same frame of mind is why a lot of text editors invisibly add
>      a BOM at the start of all UTF8 files, but try to accept UTF8 files
>      with or without the BOM.
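
A minimal Python sketch of the "strict writer, liberal reader" idea described above, using the standard "utf-8-sig" codec. The function names write_cif_text and read_cif_text are hypothetical, and nothing here is taken from an actual CIF2 implementation.

```python
import codecs


def write_cif_text(text: str) -> bytes:
    # Strict writer: always emit a UTF8 BOM. On encode, "utf-8-sig"
    # prepends the three BOM bytes to the UTF8 data.
    return text.encode("utf-8-sig")


def read_cif_text(data: bytes) -> str:
    # Liberal reader: on decode, "utf-8-sig" skips a leading BOM if one is
    # present and otherwise behaves like plain UTF8.
    return data.decode("utf-8-sig")
```

Under this sketch a file that arrives without a BOM is still read, and it gains a BOM the next time it is written, which is the "cleaned up when being processed" behaviour.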
>
>
>  Regards,
>    Herbert
>
>
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>   Dowling College, Kramer Science Center, KSC 121
>        Idle Hour Blvd, Oakdale, NY, 11769
>
>                 +1-631-244-3035
>                 yaya@dowling.edu
> =====================================================
>
> On Thu, 30 Sep 2010, James Hester wrote:
>
>      Hi Herbert (I should be in bed, but whatever): I do not think it is
>      appropriate to require the *application* to unambiguously identify the
>      encoding, as no widely-recognised standard procedure exists to do this.
>      The means of identification should rather be based on the international
>      standard describing the encoding.  Only UTF16 and UTF8 currently meet this
>      requirement, I believe.  I will try to express this better after a
>      sleep...
>
>      Regarding UTF8: I'm glad to see such vigilance in the cause of correctly
>      identifying file encoding. A UTF8 file, naturally, can also look like a
>      file in a variety of single-byte encodings regardless of a BOM at the front.
>      However, a file in a non-UTF8 encoding is highly unlikely to be mistaken
>      for a UTF8 file.  Therefore, providing an input file is first checked for UTF8
>      encoding, I do not see any significant danger of a mistaken encoding.  I'd
>      be happy to include recommendations to use a UTF8 BOM and to check for
>      UTF8 encoding before any others that we may eventually add to the list.
>
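
The "check for UTF8 first" step can be sketched in Python as a strict decode attempt. The helper name is_valid_utf8 and the sample strings are illustrative assumptions, not material from the thread; the sketch shows both a Latin-1 sample that the check rejects and one that it accepts.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF8 sequence."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# "cafe" with an acute accent on the e, written in Latin-1: the byte 0xE9
# starts a UTF8 multi-byte sequence that is never completed.
latin1_word = "caf\u00e9".encode("latin-1")

# A circumflex followed by the cent sign, written in Latin-1: the bytes
# 0xC2 0xA2 happen to be well-formed UTF8 as well.
latin1_pair = "\u00c2\u00a2".encode("latin-1")

print(is_valid_utf8(latin1_word), is_valid_utf8(latin1_pair))
```

A strict decode therefore rules out many non-UTF8 files but cannot by itself confirm that a file which passes was written as UTF8.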
>      I'm curious to see what these files are that you have trouble identifying
>      as UTF8, as they may represent obscure corner cases.  Any chance you could
>      dig one or two up?
>
>      James.
>      On Thu, Sep 30, 2010 at 12:45 AM, Herbert J. Bernstein
>      <yaya@bernstein-plus-sons.com> wrote:
>           Dear James,
>
>            I respect the attempt to compromise, but the sentence "At
>           present only UTF8 and UTF16 are considered to satisfy this
>           constraint" is not quite right without some additional work on
>           the spec.  UTF16 with a BOM is self-identifying.  UTF8 with a BOM
>           is also self-identifying.  However, UTF8 without a BOM and without
>           some other disambiguator (e.g. the accented o's), is _not_ self
>           identifying.  I know, because my students and I hit this problem
>           all the time in working with multi-language, multi-code-page
>           message catalogs for RasMol.  Sometimes the only way we can figure
>           out whether a UTF8 file is really a UTF8 file is to start
>           translating the actual strings and see if they make sense.
>
>            Another problem is what the "ASCII range" means to various people.
>           I suggest being much more restrictive and saying "the printable
>           ASCII characters, code points 32-126 plus CR, LF and HT".
>
>            Combined, the statement I would suggest is:
>
>           If a CIF2 text stream contains only characters equivalent to the
>           printable ASCII characters plus HT, LF and CR, i.e. decimal code
>           points 32-126, 9, 10 and 13, then to ensure compatibility with
>           CIF1, the CIF2 specification does not require any explicit
>           specification of the particular encoding used, but recommends
>           the use of UTF8.  If a CIF2 text stream contains any characters
>           equivalent to Unicode code points not in that range, then for
>           any encoding other than UTF8 it is the responsibility of any
>           application writing such a CIF to unambiguously specify the
>           particular encoding used, preferably within the file itself.
>           UTF16 with a BOM conforms to this requirement.
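
The character range in the suggested statement can be checked with a few lines of Python. This is only a sketch: the names ALLOWED_CODE_POINTS and is_restricted_ascii are hypothetical, and the set is taken directly from the numbers quoted above (32-126 plus 9, 10 and 13).

```python
# Printable ASCII (32-126) plus HT (9), LF (10) and CR (13).
ALLOWED_CODE_POINTS = set(range(32, 127)) | {9, 10, 13}


def is_restricted_ascii(text: str) -> bool:
    """Return True if every character lies in the restricted ASCII set."""
    return all(ord(ch) in ALLOWED_CODE_POINTS for ch in text)
```

Under the suggested wording, a stream for which this returns True would need no explicit encoding declaration, while any other stream would need to be UTF8 or carry an unambiguous encoding specification.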
>
>            Regards,
>              Herbert
>
>           =====================================================
>            Herbert J. Bernstein, Professor of Computer Science
>             Dowling College, Kramer Science Center, KSC 121
>                  Idle Hour Blvd, Oakdale, NY, 11769
>
>                           +1-631-244-3035
>                           yaya@dowling.edu
>           =====================================================
>
>
>      On Thu, 30 Sep 2010, James Hester wrote:
>
>           Here is a newish compromise:
>
>           Encoding: The encoding of CIF2 text streams containing
>           only code points in the ASCII
>           range is not specified. CIF2 text streams containing any
>           code points outside the ASCII
>           range must be encoded such that the encoding can be
>           reliably identified from the file
>           contents.  At present only UTF8 and UTF16 are considered
>           to satisfy this constraint.
>
>           Commentary: this is intended to mean that encoding works
>           'as for CIF1' (Proposals 1,2)
>           for files containing only ASCII text, and works as for
>           Proposal 4 for any other files. 
>           I believe that this allows legacy workflows to operate
>           smoothly on CIF2 files (legacy
>           workflows do not process non ASCII text) but also avoids
>           the tower of Babel effect that
>           will ensue if non-ASCII codepoints are encoded using local
>           conventions. 
>
>           To explain the thinking further, perhaps I could take
>           another stab at Herbert's point of
>           view in my own words.  Herbert (I think correctly)
>           surmises that all currently used CIF
>           applications do not explicitly specify the encoding of
>           their input and output files, and
>           so therefore are conceptually working with CIFs in a
>           variety of local encodings. 
>           Mandating any encoding for CIF2 would therefore force at
>           least some and perhaps most of
>           these applications to change the way they read and write
>           text, which is disruptive and
>           obtuse when the system works fine as it is.  Proposals 1
>           and 2 are aimed at avoiding
>           this disruption.
>
>           On the other hand, I look at the same situation and see
>           that all this software is in
>           fact reading and writing ASCII, because all of these local
>           encodings are actually
>           equivalent to ASCII for characters used in CIFs, and I
>           further assert that this happy
>           coincidence between encodings is the single reason CIF
>           files are easily transferable
>           between different systems.
>
>           These two points of view create two different results if
>           the CIF character repertoire is
>           extended beyond the ASCII range.  If we allow the current
>           approach to encoding to
>           continue, the happy coincidence of encodings ceases to
>           operate outside the ASCII range
>           and CIF files are no longer easily interchangeable.  If we
>           make explicit the commonality
>           of CIF1 encodings by mandating a common set of
>           identifiable encodings, the use of
>           default encodings has to be abandoned with accompanying
>           effort from programmers.
>
>           I believe that this latest proposal respects Herbert's
>           concerns as well as mine, and is
>           eminently workable as a starting point for going forward. 
>           I'm now off to do a sample
>           change and expect unanimous support from all parties when
>           I return in an hour's time :)
>
>           On Wed, Sep 29, 2010 at 8:25 PM, Brian McMahon
>           <bm@iucr.org> wrote:
>                I think the crux of issue is as follows:
>
>                [But part of our difficulty is that we are all having
>           separate
>                epiphanies, and focusing on five different "cruxes".
>           Clarifying
>                the real divergence between our views would be a
>           genuine benefit of
>                a Skype conference, to which I have no personal
>           objection.]
>
>                In the real world, a need may arise to exchange CIFs
>           constructed in
>                non-canonical encodings. ("Canonical" probably means
>           UTF-8 and/or
>                UTF-16). Such a need would involve some transcoding
>           strategy.
>
>                What is the actual likelihood of that need arising?
>
>                I would characterise James's position as "not very,
>           and even less
>                if the software written to generate CIFs is
>           constrained to use
>                canonical encodings within the standard".
>
>                I would characterise the position of the rest of us
>           as "reasonable to
>                high, so that we wish to formulate the standard in a
>           way that
>                recognises non-canonical encodings and helps to
>           establish or at
>                least inform appropriate transcoding strategies".
>           There appear to be
>                strong disagreements among us, but in fact there's a
>           lot of common
>                ground, and a drafting exercise would probably move
>           us towards a
>                consensus.
>
>                Do you agree that that is a fair assessment?
>
>                If so, we can analyse further: what are the
>           implications of mandating
>                a canonical encoding or not if judgement (a) is wrong
>           and if judgement
>                (b) is wrong? My feeling is that the world will not
>           end - or even
>                change very much - in any case; but it could
>           determine whether we
>                need to formulate an optimal transcoding strategy
>           now, or can defer
>                it to a later date.
>
>                However, if anyone thinks this is just another
>           diversion, I'll drop
>                this line of approach so as not to slow things down
>           even more.
>
>                Regards
>                Brian
>
>           On Tue, Sep 28, 2010 at 09:28:25PM -0400, Herbert J.
>           Bernstein wrote:
>           > John,
>           >
>           > Now I am totally confused about what you are proposing
>           and agree with Simon
>           > that what is needed is for you to state your proposal as
>           the precise wording
>           > that you propose to insert and/or change in the current
>           CIF2 change document
>           > "5 July 2010: draft of changes to the existing CIF 1.1
>           specification
>           > for public discussion"
>           >
>           > If I understand your proposal correctly, the _only_
>           thing you are proposing
>           > that differs in any way from my proposed motion is a
>           mandate that a
>           > CIF2 conformant reader must be able to read a UTF8 CIF2
>           file, but
>           > that _no_ CIF application would actually be required to
>           provide such
>           > code, provided there was some mechanism available to
>           transcode from
>           > UTF8 to the local encoding,
>           > which does not seem to be a mandate on the conformant
>           CIF2 reader at
>           > all, but a requirement for the provision of a portable
>           utility to
>           > do that external transcoding.
>           >
>           > If that is the case, wouldn't it make more sense to just
>           provide that
>           > utility than to argue about whether my motion requires
>           somebody to write
>           > their own?  Having the utility in hand would avoid
>           having multiple,
>           > conflicting interpretations of this input transcoding
>           requirement.
>           >
>           > If I have read your message correctly, please just write
>           the utility you
>           > are proposing.  If I have read your message incorrectly,
>           please
>           > write the specification changes you propose for the
>           draft changes
>           > in place of the changes in my motion.
>           >
>           > _This_ is why it was, is, and will remain a good idea to
>           simply have
>           > a meeting and talk these things out.
>           >
>           >
>           >
>
>           --
>           T +61 (02) 9717 9907
>           F +61 (02) 9717 3145
>           M +61 (04) 0249 4148
>
>
>      _______________________________________________
>      cif2-encoding mailing list
>      cif2-encoding@iucr.org
>      http://scripts.iucr.org/mailman/listinfo/cif2-encoding