Re: [Cif2-encoding] A new(?) compromise position

To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@xxxxxxxx>
Subject: Re: [Cif2-encoding] A new(?) compromise position
From: SIMON WESTRIP <simonwestrip@xxxxxxxxxxxxxx>
Date: Thu, 30 Sep 2010 09:01:15 +0000 (GMT)

"UTF8 without a BOM does not fit your
characterization of being self-identifying"

I believe this is true, which is why UTF8 has to be specified as the default
when the encoding is not self-identifying. With the prescription of
their use for 'non-ASCII CIFs' in the spec, UTF8 or 'self-identifying' seems
quite satisfactory to me (obviously not worded like this:) - allowing 'ASCII' CIFs to
be prepared/used as they always have been, and allowing the encoding of
'non-ASCII CIFs' to be determined with minimal uncertainty.
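
The "UTF8 or self-identifying" rule described above can be sketched in a few lines. This is only an illustration under stated assumptions, not wording or code from any CIF2 draft: the function name guess_cif2_encoding is hypothetical, "self-identifying" is taken to mean "starts with a BOM", and only UTF8 and UTF16 are considered.

```python
import codecs


def guess_cif2_encoding(data: bytes) -> str:
    """Pick a Python codec name for a CIF byte stream (hypothetical helper).

    A BOM is treated as the only self-identifying marker; anything
    without a BOM falls back to the UTF8 default, which also covers
    plain ASCII files.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # this codec strips the UTF8 BOM on decode
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"     # this codec reads the BOM to choose byte order
    return "utf-8"          # assumed default when nothing identifies the encoding
```

A file without a BOM that is not actually UTF8 is outside what this sketch can detect; it is simply decoded as UTF8 and may or may not raise an error.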

Cheers

Simon

From: Herbert J. Bernstein <[email protected]>
To: Group for discussing encoding and content validation schemes for CIF2 <[email protected]>
Sent: Thursday, 30 September, 2010 3:20:06
Subject: Re: [Cif2-encoding] A new(?) compromise position

Dear James,

You are mistaken: John said the opposite about determining UTF8 from
context. Where he and I differ is that he thinks you can't do it
even with a BOM, whereas I am willing to accept UTF8 with a BOM as sufficiently
disambiguated.

The Wikipedia says no such thing about being able to reliably detect
non-UTF8 files.

Let us use the Wikipedia's very own UTF8 example as a test case:

The code point '¢' U+00A2 = 00000000 10100010 is encoded as 11000010 10100010,
which is 0xC2 0xA2 as a UTF8 byte string.

Now let us look at the Latin-1 code page at
http://www.utoronto.ca/web/HTMLdocs/NewHTML/iso_table.htm

which tells us that 0xc2 is Â in Latin 1 and 0xa2 is ¢.
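
The two readings of that byte pair can be checked directly. The following is a minimal sketch using Python's built-in "utf-8" and "latin-1" codecs; the variable names are illustrative only.

```python
# The two bytes from the example: 0xC2 0xA2.
raw = bytes([0xC2, 0xA2])

# Read as UTF8, the pair is a single character, U+00A2 (cent sign).
as_utf8 = raw.decode("utf-8")

# Read as Latin-1, each byte is its own character: U+00C2 then U+00A2.
as_latin1 = raw.decode("latin-1")

print(len(as_utf8), len(as_latin1))
```

Both decodes succeed without error, so these two bytes on their own do not say which of the two encodings the writer intended.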

There is no evidence that I have seen anywhere to support your proposition
that "furthermore, as UTF8 files have a distinctive bit-pattern, non-UTF8
files can be reliably detected." All the evidence I have seen points the
other way.

James, please look at the facts -- UTF8 without a BOM does not fit your
characterization of being self-identifying.

Regards,
Herbert
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
[email protected]
=====================================================

On Thu, 30 Sep 2010, James Hester wrote:

> My simple objective (for files containing non-ASCII characters) is that an application is
> able to determine the encoding of an incoming file with a high degree of certainty with no
> information beyond the CIF standard, the encoding standard, and the file contents. If the
> only choices are UTF8 or UTF16 there is no danger of a misassignment of encoding.
> Furthermore, as UTF8 files have a distinctive bit-pattern, non-UTF8 files can be reliably
> detected(*). This appears to me to be an excellent state of affairs.
>
> Although I have apparently restricted encodings to UTF8 and UTF16 in the preceding
> paragraph, this is simply in an effort to get something workable on the table that we can
> move forward with. I have no particular agenda to limit in future the possible encodings
> for CIF files, provided that those encodings can be reliably identified subject to the above
> restrictions. Indeed, this particular group was formed in part to work out a system for
> including those other encodings.
>
> I realise my wordsmithing on the new proposal is somewhat lax, but if we are in agreement on
> the principle I hope we are able to polish it up to everybody's satisfaction.
>
> James.
>
> (*) John's email has addressed the UTF8 question in his post adequately, and the wikipedia
> entry also contains useful discussion.
>
> On Thu, Sep 30, 2010 at 3:00 AM, Herbert J. Bernstein <[email protected]> wrote:
> Dear James,
>
> I know from long and painful experience that files with just a few accented
> characters are very, very difficult to clearly identify, and can look like valid
> UTF8 files. UTF8 is _not_ self-identifying without the BOM.
>
> The case that really convinced me that there was a problem was a
> French document with a lower case e with an acute accent. I nearly
> missed a misencoding of a Mac native file that, because it was being misread as
> a capital E in a UTF8 file, showed the accent as grave.
>
> There are simply too many cases like that in which a file written in a non-UTF8
> encoding looks like something reasonable, but wrong, to say that UTF8 without the
> BOM is self-identifying.
>
> As for the question of standards and applications, many programming
> language standards specify the action of processors of the language.
> In our case, to have a meaningful standard, we need to specify what
> is a syntactically valid CIF2 file, to specify the semantics for
> a compliant CIF2 reader and specify the required actions for
> a compliant CIF2 writer. We need to do so in a way that breaks
> as few existing applications as possible.
>
> I believe that applications are highly relevant to what we are trying to do.
> In particular, I favor strict rules on writers and liberal rules
> on readers, so that files get processed when possible, but tend to get
> cleaned up when being processed.
>
> That same frame of mind is why a lot of text editors invisibly add
> a BOM at the start of all UTF8 files, but try to accept UTF8 files
> with or without the BOM.
>
>
> Regards,
> Herbert
>
>
>
> On Thu, 30 Sep 2010, James Hester wrote:
>
> Hi Herbert (I should be in bed, but whatever): I do not think it is
> appropriate to require the *application* to unambiguously identify the
> encoding, as no widely-recognised standard procedure exists to do this. The
> means of identification should rather be based on the international standard
> describing the encoding. Only UTF16 and UTF8 currently meet this
> requirement, I believe. I will try to express this better after a sleep...
>
> Regarding UTF8: I'm glad to see such vigilance in the cause of correctly
> identifying file encoding. A UTF8 file, naturally, can also look like a file
> in a variety of single-byte encodings regardless of a BOM at the front.
> However, a file in a non-UTF8 encoding is highly unlikely to be mistaken for
> a UTF8 file. Therefore, providing an input file is first checked for UTF8
> encoding, I do not see any significant danger of a mistaken encoding. I'd
> be happy to include recommendations to use a UTF8 BOM and to check for UTF8
> encoding before any others that we may eventually add to the list.
>
> I'm curious to see what these files are that you have trouble identifying as
> UTF8, as they may represent obscure corner cases. Any chance you could dig
> one or two up?
>
> James.
> On Thu, Sep 30, 2010 at 12:45 AM, Herbert J. Bernstein <[email protected]> wrote:
> Dear James,
>
>   I respect the attempt to compromise, but the sentence "At present only
> UTF8 and UTF16 are considered to satisfy this constraint" is not quite
> right without some additional work on the spec. UTF16 with a BOM is
> self-identifying. UTF8 with a BOM is also self-identifying. However,
> UTF8 without a BOM and without some other disambiguator (e.g. the
> accented o's), is _not_ self-identifying. I know, because my students
> and I hit this problem all the time in working with multi-language,
> multi-code-page message catalogs for RasMol. Sometimes the only way
> we can figure out whether a UTF8 file is really a UTF8 file is to
> start translating the actual strings and see if they make sense.
>
>   Another problem is what the "ASCII range" means to various people.
> I suggest being much more restrictive and saying "the printable
> ASCII characters, code points 32-126 plus CR, LF and HT".
>
>   Combined, the statement I would suggest is:
>
> If a CIF2 text stream contains only characters equivalent to the
> printable ASCII characters plus HT, LF and CR, i.e. decimal code
> points 32-126, 9, 10 and 13, then to ensure compatibility with
> CIF1, the CIF2 specification does not require any explicit
> specification of the particular encoding used, but recommends
> the use of UTF8. If a CIF2 text stream contains any characters
> equivalent to Unicode code points not in that range, then for
> any encoding other than UTF8 it is the responsibility of any
> application writing such a CIF to unambiguously specify the
> particular encoding used, preferably within the file itself.
> UTF16 with a BOM conforms to this requirement.
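
The restricted range in the suggested statement (code points 32-126 plus 9, 10 and 13) is simple to test for. The sketch below is an illustration only: the names ALLOWED and is_restricted_ascii are hypothetical, and it assumes the stream is held as bytes in an ASCII-compatible single-byte or UTF8 encoding, so that each byte value equals the code point for characters in this range.

```python
# Printable ASCII (32-126) plus HT (9), LF (10) and CR (13).
ALLOWED = frozenset(range(32, 127)) | {9, 10, 13}


def is_restricted_ascii(data: bytes) -> bool:
    """Return True if every byte falls in the restricted ASCII set.

    Assumes an ASCII-compatible byte encoding; a UTF16 stream would
    fail this test because of its zero bytes and BOM.
    """
    return all(byte in ALLOWED for byte in data)
```

Under the suggested wording, a stream for which this returns True would need no explicit encoding specification, while any other stream would be either UTF8 or explicitly labelled.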
>
>   Regards,
>    Herbert
>
>
>
> On Thu, 30 Sep 2010, James Hester wrote:
>
> Here is a newish compromise:
>
> Encoding: The encoding of CIF2 text streams containing only code points in the ASCII
> range is not specified. CIF2 text streams containing any code points outside the ASCII
> range must be encoded such that the encoding can be reliably identified from the file
> contents. At present only UTF8 and UTF16 are considered to satisfy this constraint.
>
> Commentary: this is intended to mean that encoding works 'as for CIF1' (Proposals 1,2)
> for files containing only ASCII text, and works as for Proposal 4 for any other files.
> I believe that this allows legacy workflows to operate smoothly on CIF2 files (legacy
> workflows do not process non ASCII text) but also avoids the tower of Babel effect that
> will ensue if non-ASCII codepoints are encoded using local conventions.
>
> To explain the thinking further, perhaps I could take another stab at Herbert's point of
> view in my own words. Herbert (I think correctly) surmises that all currently used CIF
> applications do not explicitly specify the encoding of their input and output files, and
> so therefore are conceptually working with CIFs in a variety of local encodings.
> Mandating any encoding for CIF2 would therefore force at least some and perhaps most of
> these applications to change the way they read and write text, which is disruptive and
> obtuse when the system works fine as it is. Proposals 1 and 2 are aimed at avoiding
> this disruption.
>
> On the other hand, I look at the same situation and see that all this software is in
> fact reading and writing ASCII, because all of these local encodings are actually
> equivalent to ASCII for characters used in CIFs, and I further assert that this happy
> coincidence between encodings is the single reason CIF files are easily transferable
> between different systems.
>
> These two points of view create two different results if the CIF character repertoire is
> extended beyond the ASCII range. If we allow the current approach to encoding to
> continue, the happy coincidence of encodings ceases to operate outside the ASCII range
> and CIF files are no longer easily interchangeable. If we make explicit the commonality
> of CIF1 encodings by mandating a common set of identifiable encodings, the use of
> default encodings has to be abandoned with accompanying effort from programmers.
>
> I believe that this latest proposal respects Herbert's concerns as well as mine, and is
> eminently workable as a starting point for going forward. I'm now off to do a sample
> change and expect unanimous support from all parties when I return in an hour's time :)
>
> On Wed, Sep 29, 2010 at 8:25 PM, Brian McMahon <[email protected]> wrote:
>    I think the crux of issue is as follows:
>
>    [But part of our difficulty is that we are all having separate
>    epiphanies, and focusing on five different "cruxes". Clarifying
>    the real divergence between our views would be a genuine benefit of
>    a Skype conference, to which I have no personal objection.]
>
>    In the real world, a need may arise to exchange CIFs constructed in
>    non-canonical encodings. ("Canonical" probably means UTF-8 and/or
>    UTF-16). Such a need would involve some transcoding strategy.
>
>    What is the actual likelihood of that need arising?
>
>    I would characterise James's position as "not very, and even less
>    if the software written to generate CIFs is constrained to use
>    canonical encodings within the standard".
>
>    I would characterise the position of the rest of us as "reasonable to
>    high, so that we wish to formulate the standard in a way that
>    recognises non-canonical encodings and helps to establish or at
>    least inform appropriate transcoding strategies". There appear to be
>    strong disagreements among us, but in fact there's a lot of common
>    ground, and a drafting exercise would probably move us towards a
>    consensus.
>
>    Do you agree that that is a fair assessment?
>
>    If so, we can analyse further: what are the implications of mandating
>    a canonical encoding or not if judgement (a) is wrong and if judgement
>    (b) is wrong? My feeling is that the world will not end - or even
>    change very much - in any case; but it could determine whether we
>    need to formulate an optimal transcoding strategy now, or can defer
>    it to a later date.
>
>    However, if anyone thinks this is just another diversion, I'll drop
>    this line of approach so as not to slow things down even more.
>
>    Regards
>    Brian
>
> On Tue, Sep 28, 2010 at 09:28:25PM -0400, Herbert J. Bernstein wrote:
> > John,
> >
> > Now I am totally confused about what you are proposing and agree with Simon
> > that what is needed is for you to state your proposal as the precise wording
> > that you propose to insert and/or change in the current CIF2 change document
> > "5 July 2010: draft of changes to the existing CIF 1.1 specification
> > for public discussion".
> >
> > If I understand your proposal correctly, the _only_ thing you are proposing
> > that differs in any way from my proposed motion is a mandate that a
> > CIF2 conformant reader must be able to read a UTF8 CIF2 file, but
> > that _no_ CIF application would actually be required to provide such
> > code, provided there was some mechanism available to transcode from
> > UTF8 to the local encoding,
> > which does not seem to be a mandate on the conformant CIF2 reader at
> > all, but a requirement for the provision of a portable utility to
> > do that external transcoding.
> >
> > If that is the case, wouldn't it make more sense to just provide that
> > utility than to argue about whether my motion requires somebody to write
> > their own? Having the utility in hand would avoid having multiple,
> > conflicting interpretations of this input transcoding requirement.
> >
> > If I have read your message correctly, please just write the utility you
> > are proposing. If I have read your message incorrectly, please
> > write the specification changes you propose for the draft changes
> > in place of the changes in my motion.
> >
> > _This_ is why it was, is, and will remain a good idea to simply have
> > a meeting and talk these things out.
> >
> >
> >
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
>
>
> _______________________________________________
> cif2-encoding mailing list
> [email protected]
> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>
