
Re: [Cif2-encoding] A new(?) compromise position

Hi Herbert (I should be in bed, but whatever): I do not think it is appropriate to require the *application* to unambiguously identify the encoding, as no widely-recognised standard procedure exists to do this.  The means of identification should rather be based on the international standard describing the encoding.  Only UTF16 and UTF8 currently meet this requirement, I believe.  I will try to express this better after a sleep...

Regarding UTF8: I'm glad to see such vigilance in the cause of correctly identifying file encoding. A UTF8 file, naturally, can also look like a file in a variety of single-byte encodings regardless of a BOM at the front.  However, a file in a non-UTF8 encoding is highly unlikely to be mistaken for a UTF8 file.  Therefore, providing an input file is first checked for UTF8 encoding, I do not see any significant danger of a mistaken encoding.  I'd be happy to include recommendations to use a UTF8 BOM and to check for UTF8 encoding before any others that we may eventually add to the list.
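A minimal Python sketch of the "check for UTF8 first" ordering described above. The function names and the fallback list are hypothetical, chosen only for illustration; nothing here is taken from a CIF2 draft or an existing CIF library.

```python
def looks_like_utf8(data):
    """Return True if data is valid UTF-8 under strict decoding.

    The 'utf-8-sig' codec accepts, and strips, an optional leading BOM.
    """
    try:
        data.decode("utf-8-sig")  # error handling is strict by default
    except UnicodeDecodeError:
        return False
    return True


def decode_text(data, fallbacks=("latin-1",)):
    """Try UTF-8 before any other encoding; return (text, encoding name).

    'fallbacks' is an assumed, site-specific list of local encodings.
    """
    if looks_like_utf8(data):
        return data.decode("utf-8-sig"), "utf-8"
    for name in fallbacks:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("could not identify the encoding")


# An accented character written in Latin-1 is not valid UTF-8,
# so the UTF-8 check rejects it and the fallback is used instead.
latin1_bytes = "caf\u00e9".encode("latin-1")
print(looks_like_utf8(latin1_bytes))
print(decode_text(latin1_bytes))
```

The check only works in this direction: a file that passes it may still be readable under a single-byte encoding, which is why the order of the tests matters.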

I'm curious to see what these files are that you have trouble identifying as UTF8, as they may represent obscure corner cases.  Any chance you could dig one or two up?

James.
On Thu, Sep 30, 2010 at 12:45 AM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
Dear James,

 I respect the attempt to compromise, but the sentence "At present only UTF8 and UTF16 are considered to satisfy this constraint" is not quite
right without some additional work on the spec.  UTF16 with a BOM is
self-identifying.  UTF8 with a BOM is also self-identifying.  However,
UTF8 without a BOM and without some other disambiguator (e.g. the
accented o's), is _not_ self identifying.  I know, because my students
and I hit this problem all the time in working with multi-language,
multi-code-page message catalogs for RasMol.  Sometimes the only way
we can figure out whether a UTF8 file is really a UTF8 file is to
start translating the actual strings and see if they make sense.
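 The distinction between BOM-marked and unmarked files can be sketched in Python as follows. This is an illustration only: sniff_bom is a hypothetical helper, and it deliberately ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16 one.

```python
import codecs


def sniff_bom(data):
    """Return the encoding implied by a leading BOM, or None if absent.

    Assumption: only UTF-8 and UTF-16 are of interest; UTF-32 is not handled.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None


# UTF-8 without a BOM: the same two bytes decode without error under
# both UTF-8 and Latin-1, giving different text each time.
sample = "\u00f3".encode("utf-8")      # accented o, bytes C3 B3
print(sniff_bom(sample))               # no BOM, so nothing is identified
print(ascii(sample.decode("utf-8")))   # one character
print(ascii(sample.decode("latin-1"))) # two characters
```

Without the BOM, nothing in the byte stream itself says which of the two readings was intended.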

 Another problem is what the "ASCII range" means to various people.
I suggest being much more restrictive and saying "the printable
ASCII characters, code points 32-126 plus CR, LF and HT".

 Combined, the statement I would suggest is:

If a CIF2 text stream contains only characters equivalent to the
printable ASCII characters plus HT, LF and CR, i.e. decimal code
points 32-126, 9, 10 and 13, then to ensure compatibility with
CIF1, the CIF2 specification does not require any explicit
specification of the particular encoding used, but recommends
the use of UTF8.  If a CIF2 text stream contains any characters
equivalent to Unicode code points not in that range, then for
any encoding other than UTF8 it is the responsibility of any
application writing such a CIF to unambiguously specify the
particular encoding used, preferably within the file itself.
UTF16 with a BOM conforms to this requirement.
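
 The character test in the statement above could be written in Python roughly as follows. The names are hypothetical and the code works on already-decoded text; it is a sketch of the rule as worded here, not part of any CIF2 specification.

```python
# Decimal code points 32-126 plus HT (9), LF (10) and CR (13).
ALLOWED = frozenset(range(32, 127)) | {9, 10, 13}


def is_restricted_ascii(text):
    """True if every character is printable ASCII, HT, LF or CR."""
    return all(ord(ch) in ALLOWED for ch in text)


def needs_declared_encoding(text):
    """True if the text has any code point outside the restricted set.

    Under the suggested wording, such a CIF must be UTF8 or must have
    its encoding specified unambiguously by the writing application.
    """
    return not is_restricted_ascii(text)


print(is_restricted_ascii("data_x\n_cell.length_a\t5.0\r\n"))
print(needs_declared_encoding("caf\u00e9"))
```

 Note that DEL (127) and the other control characters fall outside the set, which is the sense in which this is more restrictive than a loosely defined "ASCII range".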

 Regards,
   Herbert

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
  Dowling College, Kramer Science Center, KSC 121
       Idle Hour Blvd, Oakdale, NY, 11769

                +1-631-244-3035
                yaya@dowling.edu
=====================================================


On Thu, 30 Sep 2010, James Hester wrote:

Here is a newish compromise:

Encoding: The encoding of CIF2 text streams containing only code points in the ASCII
range is not specified. CIF2 text streams containing any code points outside the ASCII
range must be encoded such that the encoding can be reliably identified from the file
contents.  At present only UTF8 and UTF16 are considered to satisfy this constraint.

Commentary: this is intended to mean that encoding works 'as for CIF1' (Proposals 1,2)
for files containing only ASCII text, and works as for Proposal 4 for any other files. 
I believe that this allows legacy workflows to operate smoothly on CIF2 files (legacy
workflows do not process non ASCII text) but also avoids the tower of Babel effect that
will ensue if non-ASCII codepoints are encoded using local conventions. 

To explain the thinking further, perhaps I could take another stab at Herbert's point of
view in my own words.  Herbert (I think correctly) surmises that all currently used CIF
applications do not explicitly specify the encoding of their input and output files, and
are therefore conceptually working with CIFs in a variety of local encodings.
Mandating any encoding for CIF2 would therefore force at least some and perhaps most of
these applications to change the way they read and write text, which is disruptive and
obtuse when the system works fine as it is.  Proposals 1 and 2 are aimed at avoiding
this disruption.

On the other hand, I look at the same situation and see that all this software is in
fact reading and writing ASCII, because all of these local encodings are actually
equivalent to ASCII for characters used in CIFs, and I further assert that this happy
coincidence between encodings is the single reason CIF files are easily transferable
between different systems.

These two points of view create two different results if the CIF character repertoire is
extended beyond the ASCII range.  If we allow the current approach to encoding to
continue, the happy coincidence of encodings ceases to operate outside the ASCII range
and CIF files are no longer easily interchangeable.  If we make explicit the commonality
of CIF1 encodings by mandating a common set of identifiable encodings, the use of
default encodings has to be abandoned with accompanying effort from programmers.

I believe that this latest proposal respects Herbert's concerns as well as mine, and is
eminently workable as a starting point for going forward.  I'm now off to do a sample
change and expect unanimous support from all parties when I return in an hour's time :)

On Wed, Sep 29, 2010 at 8:25 PM, Brian McMahon <bm@iucr.org> wrote:
     I think the crux of the issue is as follows:

     [But part of our difficulty is that we are all having separate
     epiphanies, and focusing on five different "cruxes". Clarifying
     the real divergence between our views would be a genuine benefit of
     a Skype conference, to which I have no personal objection.]

     In the real world, a need may arise to exchange CIFs constructed in
     non-canonical encodings. ("Canonical" probably means UTF-8 and/or
     UTF-16). Such a need would involve some transcoding strategy.

     What is the actual likelihood of that need arising?

     I would characterise James's position as "not very, and even less
     if the software written to generate CIFs is constrained to use
     canonical encodings within the standard".

     I would characterise the position of the rest of us as "reasonable to
     high, so that we wish to formulate the standard in a way that
     recognises non-canonical encodings and helps to establish or at
     least inform appropriate transcoding strategies". There appear to be
     strong disagreements among us, but in fact there's a lot of common
     ground, and a drafting exercise would probably move us towards a
     consensus.

     Do you agree that that is a fair assessment?

     If so, we can analyse further: what are the implications of mandating
     a canonical encoding or not if judgement (a) is wrong and if judgement
     (b) is wrong? My feeling is that the world will not end - or even
     change very much - in any case; but it could determine whether we
     need to formulate an optimal transcoding strategy now, or can defer
     it to a later date.

     However, if anyone thinks this is just another diversion, I'll drop
     this line of approach so as not to slow things down even more.

     Regards
     Brian

On Tue, Sep 28, 2010 at 09:28:25PM -0400, Herbert J. Bernstein wrote:
> John,
>
> Now I am totally confused about what you are proposing and agree with Simon
> that what is needed is for you to state your proposal as the precise wording
> that you propose to insert and/or change in the current CIF2 change document
> "5 July 2010: draft of changes to the existing CIF 1.1 specification
> for public discussion"
>
> If I understand your proposal correctly, the _only_ thing you are proposing
> that differs in any way from my proposed motion is a mandate that a
> CIF2 conformant reader must be able to read a UTF8 CIF2 file, but
> that _no_ CIF application would actually be required to provide such
> code, provided there was some mechanism available to transcode from
> UTF8 to the local encoding,
> which does not seem to be a mandate on the conformant CIF2 reader at
> all, but a requirement for the provision of a portable utility to
> do that external transcoding.
>
> If that is the case, wouldn't it make more sense to just provide that
> utility than to argue about whether my motion requires somebody to write
> their own?  Having the utility in hand would avoid having multiple,
> conflicting interpretations of this input transcoding requirement.
>
> If I have read your message correctly, please just write the utility you
> are proposing.  If I have read your message incorrectly, please
> write the specification changes you propose for the draft changes
> in place of the changes in my motion.
>
> _This_ is why it was, is, and will remain a good idea to simply have
> a meeting and talk these things out.
>
>
>

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148


_______________________________________________
cif2-encoding mailing list
cif2-encoding@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif2-encoding




