
Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. ...

Thank you John for your response.

I will state my position in due course (hopefully with more clarity than I usually employ!)
but in the meantime, I'll briefly answer your question regarding extending the UTF8/16 set:

Yes, I was thinking of the existing 'UTF family', while also allowing extension in the future to any encodings that fall within the same class of 'inherently identifiable' encodings. By 'inherently identifiable' I mean encodings that are identifiable by, e.g., a BOM; but as you explain, this is not appropriate for your proposal.



From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@iucr.org>
Sent: Tuesday, 14 September, 2010 15:46:10
Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .. .


On Tuesday, September 14, 2010 7:20 AM, SIMON WESTRIP wrote:
>I sense some common ground here with my previous post.

I hope so.  My proposal is intended as a compromise position, and I hope it will give all the participants in this discussion enough of what they want that we can finally come to an agreement.

>The UTF8/16 pair could possibly be extended to any unicode encoding that is unambiguously/inherently identifiable?

Did you have any particular other encodings you would put in that category?  The only one(s) I think would qualify are UTF-32 variants, and, to the extent it is distinct from UTF-16, perhaps UTF-16LE.  If we don't tag CIFs with encoding information (and that's not part of my proposal) then I don't think it safe to deem encodings that we do not explicitly enumerate as "inherently identifiable".  My proposal intentionally minimizes the list of allowed encodings (even inclusion of UTF-16 is left open to debate) because (i) having more than one allowed encoding already requires the UTF-8 only side to yield some ground, and (ii) having fewer alternatives makes for much simpler autodetection.
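To illustrate why a short, explicitly enumerated list keeps autodetection simple, here is a minimal Python sketch of byte-order-mark sniffing over the UTF family. It is not part of the proposal; the function name `sniff_bom` and the choice of returned codec names are assumptions made for this example.

```python
import codecs

# Check the longest signatures first: the UTF-32LE BOM (FF FE 00 00)
# begins with the same two bytes as the UTF-16LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes):
    """Return a Python codec name if data starts with a known BOM, else None.

    Hypothetical helper: a file without a BOM is simply reported as
    unidentified; no guessing is attempted here.
    """
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return codec
    return None
```

Every encoding added to the table is one more signature that must not collide with the others, which is one way to read point (ii) above.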

>The 'local' encodings then encompass everything else?

Sort of.  "local" is environment-specific.  It is what the system's text editors read and (especially) write by default, what the local Fortran I/O library expects of a 'formatted' file, what a Java InputStreamReader in that environment handles correctly when no encoding is explicitly specified to it, etc.
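For comparison with the Fortran and Java examples, the same notion of "local" can be queried in Python. This is only an illustrative sketch; the wrapper name `local_default_encoding` is an assumption for this example.

```python
import locale


def local_default_encoding() -> str:
    """Return the encoding this environment uses for text by default.

    Passing False asks for the current locale's preference without
    changing the process locale as a side effect.
    """
    return locale.getpreferredencoding(False)


# The result depends entirely on the machine and its configuration.
print(local_default_encoding())
```

The answer differs from one environment to another, which is exactly what makes the "local" encoding environment-specific rather than a fixed name.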

>However, I think we've yet to agree that anything but UTF8 is to be allowed at all. We have a draft spec that stipulates UTF8,
>but I infer from this thread that there is scope to relax that restriction.

Um, yes.  I think perhaps we've snuck one past you: this entire list (Cif2-encoding) was split off from the ddlm-group list for the purpose of discussing that topic, as there are strong opinions on both sides.  Brian administratively subscribed several of the ddlm-group members to this list when he created it, including you.

>The views seem to range from at least 'leaving the door open'
>in recognition of the variety of encodings available, to advocating that the encoding should not be part of the specification at all, and it will be down to developers to accommodate/influence user practice.

I think a better characterization of the views on the main CIF representation is that they range from 'no encoding but UTF-8 should be permitted' to 'all text conventions must be supported'.  We have also discussed a side issue or two, such as what to do about embedding CIF text in other files, but those seem not to be very contentious.  A central pillar of the multiple conventions camp's arguments is CIF1's position that CIFs are text files complying with local text conventions.  Many CIF1 users and programmers have relied on that, and therefore we would like to avoid throwing it out the window.  The essential position of the UTF-8-only camp is that CIF2 must be inherently resistant to misinterpretation, especially character encoding mismatches.

> I'm in favour of a default encoding or maybe any encoding that is inherently identifiable, and providing a means to declare other encodings (however untrustworthy the declaration may be, it would at least be available to conscientious users/developers), all documented in the spec.

My proposal comes close to making UTF-8 a default encoding, though if UTF-16 is allowed as well then it would be a viable candidate for that spot.  Inasmuch as these cannot be confused in a CIF context, I don't see the availability of both as a problem.

My proposal intentionally avoids requiring any kind of tagging, as:
(i) Proponents of the UTF-8-only position have been relatively unreceptive to tagging as a solution, mainly citing concerns about reliability of encoding tags;
(ii) Avoiding tagging avoids giving any impression that CIF processors are expected to handle non-native encodings other than UTF-8[/16];
(iii) Leaving out tags keeps it simpler.

There is room for some kind of tagging scheme as a supplementary convention or standard, and with input from James I have advanced 'Scheme B' for this purpose.  You will find discussion of Scheme B in the list archives, especially among the earliest messages on this (cif2-encoding) list.

>Please forgive me if this summary is off the mark; my conclusion is that there's a willingness to accommodate multiple encodings
>in this (albeit very small) group. Given that we are starting from the position of having a single encoding (agreed upon after much earlier debate), I cannot see us performing a complete U-turn to allow any (potentially unrecognizable) encoding as in CIF1, i.e. without some specification of a canonical encoding or mechanisms to identify/declare the encoding. On the other hand, I hope to see
>a revised spec that isn't UTF8 only.

Part of my thesis behind the present compromise proposal is that in the context of any particular computing environment, CIF1 in fact *does not* support every possible encoding.  It supports *only* the local default text conventions.  CIF1 allows all encodings only in the sense that for any given encoding there may be some computing environment, somewhere, for which that encoding is the default -- in that environment, CIF1 supports that encoding.

UTF-8-only would be a complete reversal of CIF1 in the sense that UTF-8 is generally not the default convention in current environments.  Thus, requiring UTF-8 would demand that CIF2 files comply with NON-native conventions instead of with native ones.  Under ASCII-compatible default conventions, the distinction appears only when non-ASCII characters appear in a CIF, but I have come to view that as more of a detriment than an advantage: it would provide fertile ground for bugs and mistakes.
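A small Python sketch of that distinction follows. It assumes Latin-1 as the stand-in for an ASCII-compatible local default; that choice, the helper name `encodings_agree`, and the sample strings are made up for this example.

```python
def encodings_agree(text: str) -> bool:
    """True if UTF-8 and Latin-1 produce identical bytes for text."""
    try:
        return text.encode("utf-8") == text.encode("latin-1")
    except UnicodeEncodeError:
        # The text contains characters Latin-1 cannot represent at all.
        return False


ascii_only = "data_example"
accented = "M\u00fcller"  # "Mueller" spelled with u-umlaut

# Pure ASCII content is byte-identical under both conventions,
# so a mismatch goes unnoticed.
same_for_ascii = encodings_agree(ascii_only)

# One non-ASCII character is enough to make the byte streams differ.
same_for_accented = encodings_agree(accented)

# Reading the UTF-8 bytes as Latin-1 raises no error; it silently
# yields different characters in place of the u-umlaut.
misread = accented.encode("utf-8").decode("latin-1")
```

Because the ASCII-only case hides the mismatch, a file can circulate for a long time before the first non-ASCII character exposes it.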

Instead of such a complete reversal, then, my compromise proposal basically adds UTF-8 and maybe UTF-16 as allowed encodings, and explicitly specifies that the only other supported encoding is the local default, whatever that happens to be.  This acknowledges that CIF2 users will have more exposure to text encoding concerns than CIF1 users do.  Herb argues that that is inevitable, and I agree.
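One way a reader might act on such a rule is sketched below in Python. The order of the checks (UTF-16 BOM, then strict UTF-8, then the local default) is an assumption of this example, not something the proposal specifies, and `decode_cif_bytes` is a hypothetical name.

```python
import codecs
import locale
from typing import Optional, Tuple


def decode_cif_bytes(data: bytes,
                     local_encoding: Optional[str] = None) -> Tuple[str, str]:
    """Decode data as UTF-16, UTF-8, or the local default, in that order.

    Returns (text, label), where label names the convention that was used.
    """
    # 1. A UTF-16 byte-order mark identifies UTF-16 unambiguously here.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16"), "utf-16"

    # 2. Strict UTF-8; "utf-8-sig" also accepts an optional leading BOM.
    try:
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        pass

    # 3. Otherwise fall back to this environment's default convention.
    enc = local_encoding or locale.getpreferredencoding(False)
    return data.decode(enc), enc
```

A limitation of this ordering is that a file written in a local encoding whose bytes happen to form valid UTF-8 would be read as UTF-8.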

>To get to the point - is there any hope of reaching a compromise?

Scheme B was an attempt to build a compromise, but it doesn't look likely to succeed in that capacity.  I think the proposal to which you just responded is the best hope for a compromise that so far has been presented.  If that or something like it is not accepted then I'm having trouble seeing where else to turn.


John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer:  www.stjude.org/emaildisclaimer
cif2-encoding mailing list