
Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .

One, hopefully relevant, aside -- ASCII files are not as
unambiguous as one might think.  Depending on the localization
of one's computer, the code point 0x5c (one of the
first 128 characters) will be shown as a reverse
solidus, a yen currency symbol or a won currency symbol.  This
is a holdover from the days of national variants of the ISO 646
character set, and shows no signs of going away any time soon.

This is _not_ the only such case, but it is one that impacts
most programming languages, including dREL, and existing CIF
files, including the PDB's mmCIF files.
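
As a rough illustration of the 0x5c point, here is a minimal Python sketch. The table and the function name `interpretations` are hypothetical, chosen only for this example. It models nothing but the single code point 0x5c as assigned in US-ASCII, in the Japanese 7-bit variant (JIS X 0201 Roman) and in the Korean 7-bit variant (KS X 1003). Real national variants differ in other positions too, so this is a sketch and not a conversion tool.

```python
# Illustrative only: how the single byte 0x5C reads under three 7-bit
# national variants. Other differing positions are deliberately ignored.
BYTE_5C_BY_VARIANT = {
    "US-ASCII": "\u005c",                       # REVERSE SOLIDUS
    "ISO 646-JP (JIS X 0201 Roman)": "\u00a5",  # YEN SIGN
    "ISO 646-KR (KS X 1003)": "\u20a9",         # WON SIGN
}


def interpretations(data: bytes) -> dict:
    """Return the text a reader would see for `data` under each variant.

    Only byte 0x5C is treated as variant-dependent; every other byte must
    be 7-bit and is mapped to the code point with the same number.
    """
    if any(b > 0x7F for b in data):
        raise ValueError("expected 7-bit input")
    result = {}
    for variant, char_5c in BYTE_5C_BY_VARIANT.items():
        result[variant] = "".join(
            char_5c if b == 0x5C else chr(b) for b in data
        )
    return result


if __name__ == "__main__":
    # The same three bytes, shown three ways.
    for variant, text in interpretations(b"a\\b").items():
        print(f"{variant}: {text}")
```

The stored bytes are identical in all three cases; only the reader's interpretation of 0x5c changes.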
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769


On Tue, 14 Sep 2010, Herbert J. Bernstein wrote:

> Dear Colleagues,
>  To avoid any misunderstandings, rather than worrying about how
> we got to where we are, let us each just state a clear position.
> Here is mine:
>  I favor CIF2 being stated in terms of UTF-8 for clarity, but
> not specifying any particular _mandatory_ encoding of a CIF2 file
> as long as there is a clearly agreed mechanism between the
> creator and consumer of a given CIF2 file as to how to faithfully
> transform the file between creator's and the consumer's encodings.
>  I favor UTF-8 being the default encoding that any CIF2 creator
> should feel free to use without having to establish any prior
> agreement with consumers, and that all consumers should try
> to make arrangements to be able to read, either directly or
> via some conversion utility or service.  If the consumers don't
> make such arrangements then there may be CIF2 files that they
> will not be able to read.  If a producer creates a CIF2 in any
> encoding other than UTF-8 then there may be consumers who have
> difficulty reading that CIF2.
>  I favor the IUCr taking responsibility for collecting and
> disseminating information on particularly useful ways to go
> to and from UTF-8 and/or other popular encodings.
>  Regards,
>    Herbert
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
>   Dowling College, Kramer Science Center, KSC 121
>        Idle Hour Blvd, Oakdale, NY, 11769
>                 +1-631-244-3035
>                 yaya@dowling.edu
> =====================================================
> On Tue, 14 Sep 2010, SIMON WESTRIP wrote:
>> I sense some common ground here with my previous post.
>> The UTF-8/16 pair could possibly be extended to any Unicode encoding that is
>> unambiguously/inherently identifiable?
>> The 'local' encodings then encompass everything else?
>> However, I think we've yet to agree that anything but UTF-8 is to be allowed
>> at all. We have a draft spec that stipulates UTF-8,
>> but I infer from this thread that there is scope to relax that restriction.
>> The views seem to range from at least 'leaving the door open'
>>  in recognition of the variety of encodings available, to advocating that
>> the encoding should not be part of the specification at all, and it will be
>> down to developers to accommodate/influence user practice. I'm in favour of
>> a default encoding or maybe any encoding that is inherently identifiable,
>> and providing a means to declare other encodings (however untrustworthy the
>> declaration may be, it would at least be available to conscientious
>> users/developers), all documented in the spec.
>> Please forgive me if this summary is off the mark; my conclusion is that
>> there's a willingness to accommodate multiple encodings
>> in this (albeit very small) group. Given that we are starting from the
>> position of having a single encoding (agreed upon after much earlier
>> debate), I cannot see us performing a complete U-turn to allow any
>> (potentially unrecognizable) encoding as in CIF1, i.e. without some
>> specification of a canonical encoding or mechanisms to identify/declare the
>> encoding. On the other hand, I hope to see
>> a revised spec that isn't UTF-8 only.
>> To get to the point - is there any hope of reaching a compromise?
>> Cheers
>> Simon
>> ____________________________________________________________________________
>> From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
>> To: Group for discussing encoding and content validation schemes for CIF2
>> <cif2-encoding@iucr.org>
>> Sent: Monday, 13 September, 2010 19:52:26
>> Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
>> On Sunday, September 12, 2010 11:26 PM, James Hester wrote:
>> [...]
>> >To my mind, the encoding of plain CIF files remains an open issue.  I
>> >do not view the mechanisms for managing file encoding that are
>> >provided by current OSs to be sufficiently robust, widespread or
>> >consistent that we can rely on developers or text editors respecting
>> >them [...].
>> I agree that the encoding of plain CIF files remains an open issue.
>> I confess I find your concerns there somewhat vague, especially to the
>> extent that they apply within the confines of a single machine.  Do your
>> concerns extend to that level?  If so, can you provide an example or two of
>> what you fear might go wrong in that context?
>> As Herb recently wrote, "Multiple encodings are a fact of life when working
>> with text."  CIF2 looks like text, it feels like text, and despite some
>> exotic spice, it tastes like text -- even in UTF-8 only form.  We cannot
>> pretend that we're dealing with anything other than text.  We need to
>> accept, therefore, that no matter what we do, authors and programmers will
>> need to account for multiple encodings, one way or another.  The format
>> specification cannot relieve either group of that responsibility.
>> That doesn't necessarily mean, however, that CIF must follow the XML model
>> of being self-defining with regard to text encoding.  Given CIF's various
>> uses, we gain little of practical value in this area by defining CIF2 as
>> UTF-8 only, and perhaps equally little by defining required decorations for
>> expressing random encodings.  Moreover, the best reading of CIF1 is that it
>> relies on the *local* text conventions, whatever they may be, which is quite
>> a different thing than handling all text conventions that might conceivably
>> be employed.
>> With that being the case, I don't think it needful for CIF2 in any given
>> environment to endorse foreign encoding conventions other than UTF-8.  CIF2
>> reasonably could endorse UTF-16 as well, though, as that cannot be confused
>> with any ASCII-compatible encoding.  Allowing UTF-16 would open up useful
>> possibilities both for imgCIF and for future uses not yet conceived. 
>> Additionally, since CIF is text I still think it important for CIF2 to
>> endorse the default text conventions of its operating environment.
>> Could we agree on those three as allowed encodings?  Consider, given that
>> combination of supported alternatives and no extra support from the spec,
>> how might various parties deal with the unavoidable encoding issue.  Here
>> are some of the more reasonable alternatives I see:
>> 1. Bulk CIF processors and/or repositories such as Chester, CCDC, and PDB:
>>         Option a) accept and provide only UTF-8 and/or UTF-16 CIFs.  The
>> responsibility to perform any needed transcoding is on the other party. 
>> This is just as it might be with UTF-8-only.
>>         Option b) in addition to supporting UTF-8 and/or UTF-16, support
>> other encodings by allowing users to explicitly specify them as part of the
>> submission/retrieval process.  The processor / repository would either
>> ensure the CIF is properly labeled, or, better, transcode it to UTF-8[/16].
>> This also is just as it might be with UTF-8 only.
>> 2. Programs and Libraries:
>>         Option a) On input, detect encoding by checking first for UTF-16,
>> assuming UTF-8 if not UTF-16, and falling back to default text conventions
>> if a UTF-8 decoding error is encountered.  On output, encode as directed by
>> the user (among the two/three options), defaulting to the input encoding
>> when that is available and feasible.  These would be desirable behaviors
>> even in the UTF-8 only case, especially in a mixed CIF1/CIF2 environment,
>> but they do exceed UTF-8-only requirements.
>>         Option b) Require input and produce output according to a fixed set
>> of conventions (whether local text conventions or UTF-8/16).  The program
>> user is responsible for any needed transcoding.  This would be sufficient
>> for the CIF2, UTF-8 only case, and is typical in the CIF1 case; those
>> differ, however, in which text conventions would be assumed.
>> 3. Users/Authors:
>> 3.1. Creating / editing CIFs
>>         No change from current practice is needed, but users might choose to
>> store CIFs in UTF-8[/16] form.  This is just as it would likely be under
>> UTF-8 only.
>> 3.2. Transferring CIFs
>>         Unless an alternative agreement on encoding can be reached by some
>> means, the transferor must ensure the CIF is encoded in UTF-8[/16].  This
>> differs from the UTF-8-only case only inasmuch as UTF-16 is (maybe) allowed.
>> 3.3. Receiving CIFs
>>         The receiver may reasonably demand that the CIF be provided in
>> UTF-8[/16] form.  He should *expect* that form unless some alternative
>> agreement is established.  Any desired transcoding from UTF-8[/16] to an
>> alternative encoding is the user's responsibility.  Again, this is not
>> significantly different from the UTF-8 only case.
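
The input side of option 2(a) above (check for UTF-16, otherwise assume UTF-8, and fall back to the default text conventions on a UTF-8 decoding error) could be sketched roughly as follows in Python. This is only a sketch under stated assumptions: the message does not say how UTF-16 is to be recognised, so the byte-order-mark test used here is an assumption, and the function name `decode_cif_bytes` and its `local_encoding` parameter are hypothetical.

```python
import codecs
import locale
from typing import Optional, Tuple


def decode_cif_bytes(data: bytes,
                     local_encoding: Optional[str] = None) -> Tuple[str, str]:
    """Decode raw CIF bytes, returning (text, name of the encoding used).

    Order of attempts, following option 2(a):
      1. UTF-16, recognised here by a leading byte-order mark (assumption);
      2. UTF-8;
      3. the local default encoding, only if UTF-8 decoding fails.
    """
    if local_encoding is None:
        # The platform's preferred text encoding stands in for
        # "default text conventions".
        local_encoding = locale.getpreferredencoding(False)

    # Step 1: a UTF-16 byte-order mark in either byte order.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16"), "utf-16"

    # Step 2: assume UTF-8; pure 7-bit ASCII input also succeeds here.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Step 3: fall back to the local conventions.
        return data.decode(local_encoding), local_encoding
```

A caller could keep the returned encoding name so that output defaults to the input encoding, as the same option suggests. Note that a file in a local 8-bit encoding that happens to form valid UTF-8 would be read as UTF-8 by this sketch; the fallback only triggers on a decoding error.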
>> A driving force in many of those cases is the well-understood (especially
>> here!) fact that different systems cannot be relied upon to share text
>> conventions, thus leaving UTF-8[/16] as the only available general-purpose
>> medium of exchange.  At the same time, local conventions are not forbidden
>> from use where they can be relied upon -- most notably, within the same
>> computer.  Even if end-users, as a group, do not appreciate those details,
>> we can ensure via the spec that CIF2 implementers do.  That's sufficient.
>> So, if pretty much all my expected behavior under UTF-8[/16]+local is the
>> same as it would be under UTF-8-only, then why prefer the former?  Because
>> under UTF-8[/16]+local, all the behavior described is conformant to the
>> spec, whereas under UTF-8 only, a significant proportion is not.  If the
>> standard adequately covers these behaviors then we can expect more uniform
>> support.  Moreover, this bears directly on community acceptance of the
>> spec.  If flouting the spec with respect to encoding becomes common, then
>> the spec will have failed, at least in that area.  Having failed in one
>> area, it is more likely to fail in others.
>> Regards,
>> John
>> --
>> John C. Bollinger, Ph.D.
>> Department of Structural Biology
>> St. Jude Children's Research Hospital
>> Email Disclaimer:  www.stjude.org/emaildisclaimer
>> _______________________________________________
>> cif2-encoding mailing list
>> cif2-encoding@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/cif2-encoding