Discussion List Archives


Re: [Cif2-encoding] A new(?) compromise position

My simple objective (for files containing non-ASCII characters) is that an application is able to determine the encoding of an incoming file with a high degree of certainty with no information beyond the CIF standard, the encoding standard, and the file contents.  If the only choices are UTF8 or UTF16 there is no danger of a misassignment of encoding.  Furthermore, as UTF8 files have a distinctive bit-pattern, non-UTF8 files can be reliably detected(*).  This appears to me to be an excellent state of affairs.

Although I have apparently restricted encodings to UTF8 and UTF16 in the preceding paragraph, this is simply in an effort to get something workable on the table that we can move forward with.  I have no particular agenda to limit in future the possible encodings for CIF files, provided that those encodings can be reliably identified subject to the above restrictions.  Indeed, this particular group was formed in part to work out a system for including those other encodings. 

I realise my wordsmithing on the new proposal is somewhat lax, but if we are in agreement on the principle I hope we are able to polish it up to everybody's satisfaction.

James.

(*) John's post has addressed the UTF8 question adequately, and the Wikipedia entry also contains useful discussion.

On Thu, Sep 30, 2010 at 3:00 AM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
Dear James,

 I know from long and painful experience that files with just a few accented characters are very, very difficult to clearly identify, and can look like valid UTF8 files.  UTF8 is _not_ self-identifying without the BOM.

 The case that really convinced me that there was a problem was a
French document with a lower-case e with an acute accent.  I nearly missed a misencoding of a Mac native file which, because it was being misread as a capital E in a UTF8 file, showed the accent as grave.

 There are simply too many cases like that in which a file written in a non-UTF8 encoding looks like something reasonable, but wrong, to say that UTF8 without the BOM is self-identifying.
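
A minimal sketch of both sides of this argument, assuming Python's built-in "latin-1" and "utf-8" codecs stand in for the local and Unicode encodings. The byte strings and the helper name decodes_as_utf8 are hypothetical, chosen for illustration and not taken from any file discussed in this thread.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

# A lone accented letter written in Latin-1 is a single byte >= 0x80
# followed by ASCII. That is not well-formed UTF-8, so it is caught.
lone_accent = "caf\u00e9 noir".encode("latin-1")   # b'caf\xe9 noir'
caught = not decodes_as_utf8(lone_accent)

# A Latin-1 text containing U+00C3 followed by U+00A9 yields the bytes
# C3 A9, which are also the well-formed UTF-8 encoding of U+00E9.
# A UTF-8 validity check accepts it and silently shows different text.
lookalike = "\u00c3\u00a9".encode("latin-1")       # b'\xc3\xa9'
accepted = decodes_as_utf8(lookalike)
misread_as = lookalike.decode("utf-8")             # e with acute accent
```

The first case is the common one and is rejected; the second shows that a validity check alone cannot rule out a plausible but wrong reading.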

 As for the question of standards and applications, many programming
language standards specify the action of processors of the language.
In our case, to have a meaningful standard, we need to specify what
is a syntactically valid CIF2 file, to specify the semantics for
a compliant CIF2 reader and specify the required actions for
a compliant CIF2 writer.  We need to do so in a way that breaks
as few existing applications as possible.

 I believe that applications are highly relevant to what we are trying to do.  In particular, I favor strict rules on writers and liberal rules
on readers, so that files get processed when possible, but tend to get
cleaned up when being processed.

 That same frame of mind is why a lot of text editors invisibly add
a BOM at the start of all UTF8 files, but try to accept UTF8 files
with or without the BOM.
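
One way to picture "strict writers, liberal readers" for UTF8, as a sketch only: Python's standard "utf-8-sig" codec strips a leading BOM when decoding and otherwise behaves like "utf-8". The function names below are hypothetical and nothing here is part of any CIF2 draft.

```python
import codecs

def write_utf8_strictly(text: str) -> bytes:
    # Strict writer (assumed policy): always emit the UTF-8 BOM.
    return codecs.BOM_UTF8 + text.encode("utf-8")

def read_utf8_liberally(data: bytes) -> str:
    # Liberal reader: "utf-8-sig" drops a leading BOM if one is present
    # and decodes BOM-less UTF-8 unchanged.
    return data.decode("utf-8-sig")

sample = "_name 'Andr\u00e9'\n"
with_bom = write_utf8_strictly(sample)
without_bom = sample.encode("utf-8")
```

Both with_bom and without_bom read back to the same text, and a file that passes through the reader and then the writer comes out with the BOM added.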


 Regards,
   Herbert


=====================================================
 Herbert J. Bernstein, Professor of Computer Science
  Dowling College, Kramer Science Center, KSC 121
       Idle Hour Blvd, Oakdale, NY, 11769

                +1-631-244-3035
                yaya@dowling.edu
=====================================================

On Thu, 30 Sep 2010, James Hester wrote:

Hi Herbert (I should be in bed, but whatever): I do not think it is
appropriate to require the *application* to unambiguously identify the
encoding, as no widely-recognised standard procedure exists to do this.  The
means of identification should rather be based on the international standard
describing the encoding.  Only UTF16 and UTF8 currently meet this
requirement, I believe.  I will try to express this better after a sleep...

Regarding UTF8: I'm glad to see such vigilance in the cause of correctly
identifying file encoding. A UTF8 file, naturally, can also look like a file
in a variety of single-byte encodings regardless of a BOM at the front. 
However, a file in a non-UTF8 encoding is highly unlikely to be mistaken for
a UTF8 file.  Therefore, providing an input file is first checked for UTF8
encoding, I do not see any significant danger of a mistaken encoding.  I'd
be happy to include recommendations to use a UTF8 BOM and to check for UTF8
encoding before any others that we may eventually add to the list.
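
The check order suggested here could be sketched as follows. This is an assumption-laden illustration, not a proposed reference implementation: guess_cif2_encoding is a hypothetical name, and only UTF8 and BOM-marked UTF16 are considered because those are the only encodings on the table.

```python
import codecs
from typing import Optional

def guess_cif2_encoding(data: bytes) -> Optional[str]:
    """Check for UTF8 first, then BOM-marked UTF16; None means unidentified."""
    try:
        # Pure ASCII also passes, since ASCII is a subset of UTF-8.
        data.decode("utf-8", errors="strict")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # The bytes FF and FE never occur in UTF-8, so a UTF-16 BOM cannot
    # have been accepted by the check above.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        try:
            data.decode("utf-16")
            return "utf-16"
        except UnicodeDecodeError:
            pass
    return None
```

A None result is where any later additions to the list of identifiable encodings would be tried.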

I'm curious to see what these files are that you have trouble identifying as
UTF8, as they may represent obscure corner cases.  Any chance you could dig
one or two up?

James.
On Thu, Sep 30, 2010 at 12:45 AM, Herbert J. Bernstein
<yaya@bernstein-plus-sons.com> wrote:
     Dear James,

      I respect the attempt to compromise, but the sentence "At present
     only UTF8 and UTF16 are considered to satisfy this constraint" is
     not quite right without some additional work on the spec.  UTF16
     with a BOM is self-identifying.  UTF8 with a BOM is also
     self-identifying.  However, UTF8 without a BOM and without some
     other disambiguator (e.g. the accented o's), is _not_ self
     identifying.  I know, because my students and I hit this problem
     all the time in working with multi-language, multi-code-page
     message catalogs for RasMol.  Sometimes the only way we can figure
     out whether a UTF8 file is really a UTF8 file is to start
     translating the actual strings and see if they make sense.
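
The BOM cases described above can be sketched with the BOM constants from Python's standard codecs module. The helper name encoding_from_bom is hypothetical, and UTF-32 is deliberately ignored because only UTF8 and UTF16 are under discussion (a UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 one).

```python
import codecs
from typing import Optional

# BOM byte sequences paired with the encoding each one announces.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)

def encoding_from_bom(data: bytes) -> Optional[str]:
    """Return the encoding announced by a leading BOM, or None if absent."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

A None result is exactly the BOM-less case at issue: the bytes carry no explicit marker, so identification has to fall back on inspecting the content.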

      Another problem is what the "ASCII range" means to various people.
     I suggest being much more restrictive and saying "the printable
     ASCII characters, code points 32-126 plus CR, LF and HT"

      Combined, the statement I would suggest is:

     If a CIF2 text stream contains only characters equivalent to the
     printable ASCII characters plus HT, LF and CR, i.e. decimal code
     points 32-126, 9, 10 and 13, then to ensure compatibility with
     CIF1, the CIF2 specification does not require any explicit
     specification of the particular encoding used, but recommends
     the use of UTF8.  If a CIF2 text stream contains any characters
     equivalent to Unicode code points not in that range, then for
     any encoding other than UTF8 it is the responsibility of any
     application writing such a CIF to unambiguously specify the
     particular encoding used, preferably within the file itself.
     particular encoding used, preferably within the file itself.
     UTF16 with a BOM conforms to this requirement.
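
The character test in the suggested wording is simple to state in code. A minimal sketch, assuming the text stream has already been decoded to a Python str; the names ALLOWED_CODE_POINTS and is_restricted_ascii are hypothetical.

```python
# Decimal code points 32-126 plus HT (9), LF (10) and CR (13).
ALLOWED_CODE_POINTS = frozenset(range(32, 127)) | {9, 10, 13}

def is_restricted_ascii(text: str) -> bool:
    """True if every character lies in the restricted range quoted above."""
    return all(ord(ch) in ALLOWED_CODE_POINTS for ch in text)
```

Under the suggested statement, a stream for which this returns True needs no explicit encoding specification; any other stream not written in UTF8 would have to identify its encoding.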

      Regards,
        Herbert



On Thu, 30 Sep 2010, James Hester wrote:

     Here is a newish compromise:

     Encoding: The encoding of CIF2 text streams containing only code
     points in the ASCII range is not specified. CIF2 text streams
     containing any code points outside the ASCII range must be encoded
     such that the encoding can be reliably identified from the file
     contents.  At present only UTF8 and UTF16 are considered to satisfy
     this constraint.

     Commentary: this is intended to mean that encoding works 'as for
     CIF1' (Proposals 1,2) for files containing only ASCII text, and
     works as for Proposal 4 for any other files.  I believe that this
     allows legacy workflows to operate smoothly on CIF2 files (legacy
     workflows do not process non ASCII text) but also avoids the tower
     of Babel effect that will ensue if non-ASCII codepoints are encoded
     using local conventions.

     To explain the thinking further, perhaps I could take another stab
     at Herbert's point of view in my own words.  Herbert (I think
     correctly) surmises that all currently used CIF applications do not
     explicitly specify the encoding of their input and output files,
     and so therefore are conceptually working with CIFs in a variety of
     local encodings.  Mandating any encoding for CIF2 would therefore
     force at least some and perhaps most of these applications to
     change the way they read and write text, which is disruptive and
     obtuse when the system works fine as it is.  Proposals 1 and 2 are
     aimed at avoiding this disruption.

     On the other hand, I look at the same situation and see that all
     this software is in fact reading and writing ASCII, because all of
     these local encodings are actually equivalent to ASCII for
     characters used in CIFs, and I further assert that this happy
     coincidence between encodings is the single reason CIF files are
     easily transferable between different systems.

     These two points of view create two different results if the CIF
     character repertoire is extended beyond the ASCII range.  If we
     allow the current approach to encoding to continue, the happy
     coincidence of encodings ceases to operate outside the ASCII range
     and CIF files are no longer easily interchangeable.  If we make
     explicit the commonality of CIF1 encodings by mandating a common
     set of identifiable encodings, the use of default encodings has to
     be abandoned with accompanying effort from programmers.

     I believe that this latest proposal respects Herbert's concerns as
     well as mine, and is eminently workable as a starting point for
     going forward.  I'm now off to do a sample change and expect
     unanimous support from all parties when I return in an hour's
     time :)

     On Wed, Sep 29, 2010 at 8:25 PM, Brian McMahon <bm@iucr.org> wrote:
          I think the crux of the issue is as follows:

          [But part of our difficulty is that we are all having separate
          epiphanies, and focusing on five different "cruxes".
          Clarifying the real divergence between our views would be a
          genuine benefit of a Skype conference, to which I have no
          personal objection.]

          In the real world, a need may arise to exchange CIFs
          constructed in non-canonical encodings. ("Canonical" probably
          means UTF-8 and/or UTF-16). Such a need would involve some
          transcoding strategy.

          What is the actual likelihood of that need arising?

          I would characterise James's position as "not very, and even
          less if the software written to generate CIFs is constrained
          to use canonical encodings within the standard".

          I would characterise the position of the rest of us as
          "reasonable to high, so that we wish to formulate the standard
          in a way that recognises non-canonical encodings and helps to
          establish or at least inform appropriate transcoding
          strategies".  There appear to be strong disagreements among
          us, but in fact there's a lot of common ground, and a drafting
          exercise would probably move us towards a consensus.

          Do you agree that that is a fair assessment?

          If so, we can analyse further: what are the implications of
          mandating a canonical encoding or not if judgement (a) is
          wrong and if judgement (b) is wrong? My feeling is that the
          world will not end - or even change very much - in any case;
          but it could determine whether we need to formulate an optimal
          transcoding strategy now, or can defer it to a later date.

          However, if anyone thinks this is just another diversion, I'll
          drop this line of approach so as not to slow things down even
          more.

          Regards
          Brian

     On Tue, Sep 28, 2010 at 09:28:25PM -0400, Herbert J. Bernstein wrote:
     > John,
     >
     > Now I am totally confused about what you are proposing and agree
     > with Simon that what is needed is for you to state your proposal
     > as the precise wording that you propose to insert and/or change
     > in the current CIF2 change document "5 July 2010: draft of
     > changes to the existing CIF 1.1 specification for public
     > discussion"
     >
     > If I understand your proposal correctly, the _only_ thing you are
     > proposing that differs in any way from my proposed motion is a
     > mandate that a CIF2 conformant reader must be able to read a UTF8
     > CIF2 file, but that _no_ CIF application would actually be
     > required to provide such code, provided there was some mechanism
     > available to transcode from UTF8 to the local encoding, which
     > does not seem to be a mandate on the conformant CIF2 reader at
     > all, but a requirement for the provision of a portable utility to
     > do that external transcoding.
     >
     > If that is the case, wouldn't it make more sense to just provide
     > that utility than to argue about whether my motion requires
     > somebody to write their own?  Having the utility in hand would
     > avoid having multiple, conflicting interpretations of this input
     > transcoding requirement.
     >
     > If I have read your message correctly, please just write the
     > utility you are proposing.  If I have read your message
     > incorrectly, please write the specification changes you propose
     > for the draft changes in place of the changes in my motion.
     >
     > _This_ is why it was, is, and will remain a good idea to simply
     > have a meeting and talk these things out.

     --
     T +61 (02) 9717 9907
     F +61 (02) 9717 3145
     M +61 (04) 0249 4148


_______________________________________________
cif2-encoding mailing list
cif2-encoding@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif2-encoding





International Union of Crystallography
