
Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .

Dear all

My position regarding the encoding issue is based on the fundamental belief that CIF is 'text' - always has been.
It can be read by humans, and even has 'dictionaries' that define its content and can stand alone as human-readable documents (indeed, DDLm dictionaries will probably provide a wealth of human-readable information).

So the battle is against the machines - whereas pen and paper can convey text unambiguously, computers have to translate (encode/decode) that text for us humans to read...

OK - analogy abandoned - no more cliches - my thoughts on CIF2 encoding can be summarized in one initial sequence:

BOM (if required to identify the encoding) + declaration that the file is CIF2 + declaration of encoding (if not inherently identifiable)

This is based on a specification that allows any text encoding, requires the declaration of the encoding if it is not unambiguously identifiable without such a declaration, and defines a default encoding that should be assumed in the absence of any pointers to the contrary and that should be considered as the base 'language' that all CIF readers should understand.
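
As a rough illustration only, the initial sequence could be checked along the following lines. The magic code #\#CIF_2.0 is the CIF2 version declaration; everything else here is an assumption made for the sketch: the function name sniff_header, UTF-8 as the default encoding, and above all the "encoding=" declaration syntax, which is hypothetical and not part of any agreed specification.

```python
import codecs

# BOMs, longest first so that UTF-32 LE is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

MAGIC = "#\\#CIF_2.0"        # CIF2 version declaration
DEFAULT_ENCODING = "utf-8"   # assumed default for this sketch
ENCODING_TAG = "encoding="   # hypothetical declaration syntax


def sniff_header(data):
    """Return (encoding, is_cif2) judged from the start of a byte string."""
    encoding, bom_len = None, 0
    for bom, name in BOMS:
        if data.startswith(bom):
            encoding, bom_len = name, len(bom)
            break

    # Decode only the start of the file, using the BOM's encoding or the default.
    codec = encoding or DEFAULT_ENCODING
    text = data[bom_len:bom_len + 256].decode(codec, errors="replace")
    first_line = text.splitlines()[0] if text else ""
    is_cif2 = first_line.startswith(MAGIC)

    # Without a BOM, look for an explicit declaration after the magic code.
    # This can only work for encodings that agree with the default on the
    # characters of the first line (ASCII-compatible ones, for example).
    if encoding is None:
        encoding = DEFAULT_ENCODING
        rest = first_line[len(MAGIC):].strip() if is_cif2 else ""
        if rest.startswith(ENCODING_TAG):
            encoding = rest[len(ENCODING_TAG):].strip().lower()
    return encoding, is_cif2
```

A reader built this way would treat the returned encoding as a hint, since a declaration can be wrong.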

Though brief and lacking in specifics, I hope this sums up my current thinking with respect to a CIF2 'standard' (I recognize that the reality of dealing with multiple encodings will involve flexibility to accommodate user practice, whatever is specified).

Cheers

Simon

PS Within this framework, I would allow the encoding of CIF data values to be 'switched' according to dictionary definitions... :-)







From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@iucr.org>
Sent: Tuesday, 14 September, 2010 15:46:43
Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .

Dear Colleagues,

  To avoid any misunderstandings, rather than worrying about how
we got to where we are, let us each just state a clear position.
Here is mine:

  I favor CIF2 being stated in terms of UTF-8 for clarity, but
not specifying any particular _mandatory_ encoding of a CIF2 file
as long as there is a clearly agreed mechanism between the
creator and consumer of a given CIF2 file as to how to faithfully
transform the file between the creator's and the consumer's encodings.

  I favor UTF-8 being the default encoding that any CIF2 creator
should feel free to use without having to establish any prior
agreement with consumers, and that all consumers should try
to make arrangements to be able to read, either directly or
via some conversion utility or service.  If the consumers don't
make such arrangements then there may be CIF2 files that they
will not be able to read.  If a producer creates a CIF2 in any
encoding other than UTF-8 then there may be consumers who have
difficulty reading that CIF2.
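
A conversion utility of the kind mentioned above might look something like the following minimal sketch. The function name transcode and the round-trip check are assumptions for illustration; "faithful" is interpreted here as "decodes back to exactly the same text".

```python
def transcode(data, source, target="utf-8"):
    """Re-encode bytes from source to target, failing loudly on any loss."""
    text = data.decode(source)        # strict: raises on invalid input bytes
    converted = text.encode(target)   # strict: raises if target lacks a character
    # Round-trip check: the converted bytes must decode back to the same text.
    if converted.decode(target) != text:
        raise ValueError("transcoding was not faithful")
    return converted
```

Because both codec calls use strict error handling, a character that the target encoding cannot represent stops the conversion instead of being silently replaced.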

  I favor the IUCr taking responsibility for collecting and
disseminating information on particularly useful ways to go
to and from UTF-8 and/or other popular encodings.

  Regards,
    Herbert
=====================================================
Herbert J. Bernstein, Professor of Computer Science
  Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                +1-631-244-3035
                yaya@dowling.edu
=====================================================

On Tue, 14 Sep 2010, SIMON WESTRIP wrote:

> I sense some common ground here with my previous post.
>
> The UTF8/16 pair could possibly be extended to any unicode encoding that is
> unambiguously/inherently identifiable?
> The 'local' encodings then encompass everything else?
>
> However, I think we've yet to agree that anything but UTF8 is to be allowed
> at all. We have a draft spec that stipulates UTF8,
> but I infer from this thread that there is scope to relax that restriction.
> The views seem to range from at least 'leaving the door open'
>  in recognition of the variety of encodings available, to advocating that
> the encoding should not be part of the specification at all, and it will be
> down to developers to accommodate/influence user practice. I'm in favour of
> a default encoding or maybe any encoding that is inherently identifiable,
> and providing a means to declare other encodings (however untrustworthy the
> declaration may be, it would at least be available to conscientious
> users/developers), all documented in the spec.
>
> Please forgive me if this summary is off the mark; my conclusion is that
> there's a willingness to accommodate multiple encodings
> in this (albeit very small) group. Given that we are starting from the
> position of having a single encoding (agreed upon after much earlier
> debate), I cannot see us performing a complete U-turn to allow any
> (potentially unrecognizable) encoding as in CIF1, i.e. without some
> specification of a canonical encoding or mechanisms to identify/declare the
> encoding. On the other hand, I hope to see
> a revised spec that isn't UTF-8 only.
>
> To get to the point - is there any hope of reaching a compromise?
>
> Cheers
>
> Simon
>
>
> ____________________________________________________________________________
> From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
> To: Group for discussing encoding and content validation schemes for CIF2
> <cif2-encoding@iucr.org>
> Sent: Monday, 13 September, 2010 19:52:26
> Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
>
>
> On Sunday, September 12, 2010 11:26 PM, James Hester wrote:
> [...]
> >To my mind, the encoding of plain CIF files remains an open issue.  I
> >do not view the mechanisms for managing file encoding that are
> >provided by current OSs to be sufficiently robust, widespread or
> >consistent that we can rely on developers or text editors respecting
> >them [...].
>
> I agree that the encoding of plain CIF files remains an open issue.
>
> I confess I find your concerns there somewhat vague, especially to the
> extent that they apply within the confines of a single machine.  Do your
> concerns extend to that level?  If so, can you provide an example or two of
> what you fear might go wrong in that context?
>
> As Herb recently wrote, "Multiple encodings are a fact of life when working
> with text."  CIF2 looks like text, it feels like text, and despite some
> exotic spice, it tastes like text -- even in UTF-8 only form.  We cannot
> pretend that we're dealing with anything other than text.  We need to
> accept, therefore, that no matter what we do, authors and programmers will
> need to account for multiple encodings, one way or another.  The format
> specification cannot relieve either group of that responsibility.
>
> That doesn't necessarily mean, however, that CIF must follow the XML model
> of being self-defining with regard to text encoding.  Given CIF's various
> uses, we gain little of practical value in this area by defining CIF2 as
> UTF-8 only, and perhaps equally little by defining required decorations for
> expressing random encodings.  Moreover, the best reading of CIF1 is that it
> relies on the *local* text conventions, whatever they may be, which is quite
> a different thing than handling all text conventions that might conceivably
> be employed.
>
> With that being the case, I don't think it needful for CIF2 in any given
> environment to endorse foreign encoding conventions other than UTF-8.  CIF2
> reasonably could endorse UTF-16 as well, though, as that cannot be confused
> with any ASCII-compatible encoding.  Allowing UTF-16 would open up useful
> possibilities both for imgCIF and for future uses not yet conceived. 
> Additionally, since CIF is text I still think it important for CIF2 to
> endorse the default text conventions of its operating environment.
>
> Could we agree on those three as allowed encodings?  Consider, given that
> combination of supported alternatives and no extra support from the spec,
> how might various parties deal with the unavoidable encoding issue.  Here
> are some of the more reasonable alternatives I see:
>
> 1. Bulk CIF processors and/or repositories such as Chester, CCDC, and PDB:
>
>         Option a) accept and provide only UTF-8 and/or UTF-16 CIFs.  The
> responsibility to perform any needed transcoding is on the other party. 
> This is just as it might be with UTF-8-only.
>
>         Option b) in addition to supporting UTF-8 and/or UTF-16, support
> other encodings by allowing users to explicitly specify them as part of the
> submission/retrieval process.  The processor / repository would either
> ensure the CIF is properly labeled, or, better, transcode it to UTF-8[/16]. 
> This also is just as it might be with UTF-8 only.
>
> 2. Programs and Libraries:
>
>         Option a) On input, detect encoding by checking first for UTF-16,
> assuming UTF-8 if not UTF-16, and falling back to default text conventions
> if a UTF-8 decoding error is encountered.  On output, encode as directed by
> the user (among the two/three options), defaulting to the input encoding
> when that is available and feasible.  These would be desirable behaviors
> even in the UTF-8 only case, especially in a mixed CIF1/CIF2 environment,
> but they do exceed UTF-8-only requirements.
>
>         Option b) Require input and produce output according to a fixed set
> of conventions (whether local text conventions or UTF-8/16).  The program
> user is responsible for any needed transcoding.  This would be sufficient
> for the CIF2, UTF-8 only case, and is typical in the CIF1 case; those
> differ, however, in which text conventions would be assumed.
>
> 3. Users/Authors:
> 3.1. Creating / editing CIFs
>         No change from current practice is needed, but users might choose to
> store CIFs in UTF-8[/16] form.  This is just as it would likely be under
> UTF-8 only.
>
> 3.2. Transferring CIFs
>         Unless an alternative agreement on encoding can be reached by some
> means, the transferor must ensure the CIF is encoded in UTF-8[/16].  This
> differs from the UTF-8-only case only inasmuch as UTF-16 is (maybe) allowed.
>
> 3.3. Receiving CIFs
>         The receiver may reasonably demand that the CIF be provided in
> UTF-8[/16] form.  He should *expect* that form unless some alternative
> agreement is established.  Any desired transcoding from UTF-8[/16] to an
> alternative encoding is the user's responsibility.  Again, this is not
> significantly different from the UTF-8 only case.
>
>
> A driving force in many of those cases is the well-understood (especially
> here!) fact that different systems cannot be relied upon to share text
> conventions, thus leaving UTF-8[/16] as the only available general-purpose
> medium of exchange.  At the same time, local conventions are not forbidden
> from use where they can be relied upon -- most notably, within the same
> computer.  Even if end-users, as a group, do not appreciate those details,
> we can ensure via the spec that CIF2 implementers do.  That's sufficient.
>
> So, if pretty much all my expected behavior under UTF-8[/16]+local is the
> same as it would be under UTF-8-only, then why prefer the former?  Because
> under UTF-8[/16]+local, all the behavior described is conformant to the
> spec, whereas under UTF-8 only, a significant proportion is not.  If the
> standard adequately covers these behaviors then we can expect more uniform
> support.  Moreover, this bears directly on community acceptance of the
> spec.  If flouting the spec with respect to encoding becomes common, then
> the spec will have failed, at least in that area.  Having failed in one
> area, it is more likely to fail in others.
>
>
> Regards,
>
> John
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
>
>
>
>
> Email Disclaimer:  www.stjude.org/emaildisclaimer
>
> _______________________________________________
> cif2-encoding mailing list
> cif2-encoding@iucr.org
> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>
>
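
The input-detection order in option 2a of the quoted message (UTF-16 first, then UTF-8, then the default text conventions) can be sketched as follows. This is only an illustration under stated assumptions: the function name decode_cif and the local_encoding parameter are hypothetical, and recognising UTF-16 by its byte-order mark is an assumption, since the message does not say how UTF-16 would be detected.

```python
import codecs
import locale


def decode_cif(data, local_encoding=None):
    """Return (text, encoding_used) using the UTF-16 / UTF-8 / local order."""
    # Step 1: UTF-16, recognised here by its byte-order mark (an assumption).
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16"), "utf-16"

    # Step 2: assume UTF-8. Strict decoding raises on invalid byte sequences;
    # "utf-8-sig" also drops a leading UTF-8 BOM if one is present.
    try:
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        pass

    # Step 3: fall back to the default text conventions of the environment.
    # If that default is itself UTF-8, this raises UnicodeDecodeError again.
    fallback = local_encoding or locale.getpreferredencoding(False)
    return data.decode(fallback), fallback
```

Note that step 3 can succeed and still give the wrong text: most single-byte encodings accept any byte sequence, so a successful decode does not prove the guess was right.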
