Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .

Hi John: good to see further constructive suggestions.

Regarding your UTF8/16 + local proposal:  I think I'd be willing to
accept UTF16 in addition to UTF8 (see below).  Regarding local
encoding, note this blog posting from a Microsoft .Net developer,
entitled "Don't Use Encoding.Default"
http://blogs.msdn.com/b/shawnste/archive/2005/03/15/don-t-use-encoding-default.aspx

Indeed, all of the developer-oriented material that I have looked at
concerning Microsoft platforms recommends that the developer
consciously *choose* a Unicode-based encoding where possible, that
is, ignore any local defaults.  In fact, it is rather difficult to
find any instructions as to how to determine the platform's "local"
encoding.  By reading Python source code, I found two Microsoft API
functions, "GetACP" and "GetOEMCP", mentioned above, that can be used
to determine the default/preferred encoding as an ANSI code page (see
http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx).
The online documentation for both functions contains the following
bland comment:

" The ANSI code pages can be different on different computers, or can
be changed for a single computer, leading to data corruption. For the
most consistent results, applications should use UTF-8 or UTF-16 when
possible."

My concern precisely.  And: these files with local encoding still need
some sort of mechanism to allow reliable transmission. And what about
remote filesystem mounts for shared files?  If one computer has a
different local encoding and stores a file on its "local" filesystem,
the next computer to access that "local" file may have a different
"local" encoding and get it wrong.  And so on. Frankly, I still see no
merit in including local encodings in CIF2 at all.   If the rest of
you disagree, I won't argue about it further, but instead will attempt
to mitigate the damage by supporting the following moves:

(i) compliant CIF processors are *not* required to accept files in
local encoding;
(ii) CIF developer documentation outlines the reasons that "local"
encoding is a bad idea;
(iii) the IUCr and databases are urged to make submitters check
round-trip files if they have received files in non-UTF8/UTF16 form;
(iv) the IUCr and databases encourage UTF8 submission;
(v) CIF developer documentation outlines the preferred method of
determining local encoding in a variety of languages and platforms.

(I have added an addendum on local encodings with more information if
anybody is interested)

On Tue, Sep 14, 2010 at 4:52 AM, Bollinger, John C
<John.Bollinger@stjude.org> wrote:
>
> On Sunday, September 12, 2010 11:26 PM, James Hester wrote:
> [...]
>>To my mind, the encoding of plain CIF files remains an open issue.  I
>>do not view the mechanisms for managing file encoding that are
>>provided by current OSs to be sufficiently robust, widespread or
>>consistent that we can rely on developers or text editors respecting
>>them [...].
>
> I agree that the encoding of plain CIF files remains an open issue.
>
> I confess I find your concerns there somewhat vague, especially to the extent that they apply within the confines of a single
> machine.  Do your concerns extend to that level?  If so, can you provide an example or two of what you fear might go wrong in that
> context?

A concrete example: a scientist in a multilingual country (e.g.
Ukrainian/Russian/English in Ukraine) is used to switching locales to
get legacy programs (i.e. those that rely on "default" encoding!) to
display and/or input text properly.  CIF files written in "local"
encoding using one locale will not be read correctly in a different
locale on the same machine.
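
A minimal Python sketch of this failure, assuming for illustration
that the locales in question imply Windows code page 1251, code page
1252 and KOI8-R (the data value is made up for the example).  Nothing
here raises an error, which is the problem:

```python
# Hypothetical CIF line containing a non-ASCII value.
line = "_publ_author_address 'Київ'\n"

# Written while the machine's default encoding was Windows Cyrillic (cp1251)...
stored = line.encode("cp1251")

# ...then read back after the default has changed.  Neither decode fails.
as_western = stored.decode("cp1252")
as_koi8 = stored.decode("koi8_r")


def ascii_part(text):
    """Keep only the ASCII characters of a string."""
    return "".join(ch for ch in text if ord(ch) < 128)


for label, reading in (("cp1252", as_western), ("koi8_r", as_koi8)):
    # The ASCII skeleton is intact, but the value no longer matches.
    print(label, repr(reading), reading == line,
          ascii_part(reading) == ascii_part(line))
```

The ASCII skeleton of the line survives in each reading, so a parser
has nothing to object to; only the value is silently wrong.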

I note the following sentence in Microsoft's guide to encodings at
http://msdn.microsoft.com/en-us/library/ms404377.aspx: "However, when
you have the opportunity to choose an encoding, you are strongly
recommended to use a Unicode encoding, typically either UTF8Encoding
or UnicodeEncoding".  I am simply following this recommendation,
except that I think we can save our developers some angst by making
the appropriate choice for them, so that they don't have to contend
with those developers that haven't thought about the issues.

> As Herb recently wrote, "Multiple encodings are a fact of life when working with text."  CIF2 looks like text, it feels like text, and
> despite some exotic spice, it tastes like text -- even in UTF-8 only form.  We cannot pretend that we're dealing with anything other
> than text.  We need to accept, therefore, that no matter what we do, authors and programmers will need to account for multiple
> encodings, one way or another.  The format specification cannot relieve either group of that responsibility.

And multiple encodings will continue to be a fact of life if we
actively encourage their proliferation. We can at least reduce the
amount that programmers need to consider multiple encodings by not
building the problem into the specification.  Then programmers only
need to contend with non-conformant behaviour, to which a reasonable
approach is gentle, informative rejection of the file.  I acknowledge
that there seems to be a difference in perceptions as to how
widespread non-conformance will be (I think it will be negligible and
manageable with a little education).
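
As a sketch of what "gentle, informative rejection" might look like in
Python, assuming the UTF-8-only rule argued for here.  The function
and exception names are hypothetical, and the message wording is only
a suggestion:

```python
class NotUtf8Error(ValueError):
    """Raised when a file offered as CIF2 is not valid UTF-8."""


def read_cif2_text(data: bytes) -> str:
    """Decode CIF2 bytes strictly as UTF-8, or explain what went wrong."""
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that failed to decode.
        raise NotUtf8Error(
            f"Byte 0x{data[exc.start]:02X} at offset {exc.start} is not "
            "valid UTF-8. CIF2 files must be UTF-8 encoded; please re-save "
            "the file as UTF-8 in your editor and try again."
        ) from exc
```

The point of the dedicated exception is that a calling program can
catch it and show the user the message (or a link to further help)
instead of a raw decoding traceback.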

> That doesn't necessarily mean, however, that CIF must follow the XML model of being self-defining with regard to text encoding.
> Given CIF's various uses, we gain little of practical value in this area by defining CIF2 as UTF-8 only, and perhaps equally little by
> defining required decorations for expressing random encodings.  Moreover, the best reading of CIF1 is that it relies on the *local*
> text conventions, whatever they may be, which is quite a different thing than handling all text conventions that might conceivably
> be employed.
>
> With that being the case, I don't think it needful for CIF2 in any given environment to endorse foreign encoding conventions other
> than UTF-8.  CIF2 reasonably could endorse UTF-16 as well, though, as that cannot be confused with any ASCII-compatible
> encoding.  Allowing UTF-16 would open up useful possibilities both for imgCIF and for future uses not yet conceived.  Additionally,
> since CIF is text I still think it important for CIF2 to endorse the default text conventions of its operating environment.

If Microsoft documents are to be believed, they would rather
developers *didn't* try to figure out what the default encoding is.
Perhaps CIF2 should instead endorse the position of just about
everybody writing about encodings, including the producers of
the operating environment: "choose UTF8 if you have a choice"?

> Could we agree on those three as allowed encodings?  Consider, given that combination of supported alternatives and no extra
> support from the spec, how might various parties deal with the unavoidable encoding issue.  Here are some of the more reasonable
> alternatives I see:
>
> 1. Bulk CIF processors and/or repositories such as Chester, CCDC, and PDB:
>
>        Option a) accept and provide only UTF-8 and/or UTF-16 CIFs.  The responsibility to perform any needed transcoding is on the other party.  This is just as it might be with UTF-8-only.
>
>        Option b) in addition to supporting UTF-8 and/or UTF-16, support other encodings by allowing users to explicitly specify them
> as part of the submission/retrieval process.  The processor / repository would either ensure the CIF is properly labeled, or, better,
> transcode it to UTF-8[/16].  This also is just as it might be with UTF-8 only.

As discussed before, users are not necessarily going to know what
their local encoding is, making the selection untrustworthy.  Only
option (a) is viable.

> 2. Programs and Libraries:
>
>        Option a) On input, detect encoding by checking first for UTF-16, assuming UTF-8 if not UTF-16, and falling back to default
> text conventions if a UTF-8 decoding error is encountered.  On output, encode as directed by the user (among the two/three
> options), defaulting to the input encoding when that is available and feasible.  These would be desirable behaviors even in the
> UTF-8 only case, especially in a mixed CIF1/CIF2 environment, but they do exceed UTF-8-only requirements.

I don't think the user would necessarily know which encoding to prefer
if offered a choice.  I believe the safest route is to output in the
same encoding as the input.  That at least avoids the case where the
local encoding differs from what the previous program thought it was,
and the resulting errors are then preserved when transcoding to
UTF8/16.  So option (a) is not viable.
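
For concreteness, the input half of option (a) might be sketched in
Python as below.  Recognising UTF-16 by its byte order mark is an
assumption of this sketch (the proposal does not say how UTF-16 is to
be detected), and the function name is hypothetical:

```python
import codecs
import locale


def guess_encoding(data: bytes) -> str:
    """Guess in the order of option (a): UTF-16, then UTF-8, then local."""
    # Assumption: UTF-16 files start with a byte order mark.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        data.decode("utf-8")
        return "utf-8"  # pure ASCII also lands here
    except UnicodeDecodeError:
        # Last resort: whatever this process believes the local encoding is.
        return locale.getpreferredencoding(False)
```

The first two branches are checkable against the bytes themselves.
The last is not: it returns whatever the running process believes the
local encoding to be, whether or not the file was written that way.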

>        Option b) Require input and produce output according to a fixed set of conventions (whether local text conventions or
> UTF-8/16).  The program user is responsible for any needed transcoding.  This would be sufficient for the CIF2, UTF-8 only case,
> and is typical in the CIF1 case; those differ, however, in which text conventions would be assumed.

This is acceptable in that it doesn't make anything worse by producing
incorrect UTF8/16 text due to use of incorrect local encoding.  When
the time comes to transcode to UTF8, some user interaction is needed
to check the encoding, so the transcoding should not be done silently.

> 3. Users/Authors:
> 3.1. Creating / editing CIFs
>        No change from current practice is needed, but users might choose to store CIFs in UTF-8[/16] form.  This is just as it would
> likely be under UTF-8 only.

I assume by "current practice" you mean editing files in "local" encoding?

> 3.2. Transferring CIFs
>        Unless an alternative agreement on encoding can be reached by some means, the transferor must ensure the CIF is encoded in UTF-8[/16].  This differs from the UTF-8-only case only inasmuch as UTF-16 is (maybe) allowed.

Note of course that I consider that a CIF is transferred every time it
is written to a filesystem, under which definition local encoding
would not be allowed.  In any case, I would tighten up this
requirement to be UTF8 unless both parties agree on UTF16.

> 3.3. Receiving CIFs
>        The receiver may reasonably demand that the CIF be provided in UTF-8[/16] form.  He should *expect* that form unless some
> alternative agreement is established.  Any desired transcoding from UTF-8[/16] to an alternative encoding is the user's
> responsibility.  Again, this is not significantly different from the UTF-8 only case.
>
>
> A driving force in many of those cases is the well-understood (especially here!) fact that different systems cannot be relied upon to
> share text conventions, thus leaving UTF-8[/16] as the only available general-purpose medium of exchange.  At the same time,
> local conventions are not forbidden from use where they can be relied upon -- most notably, within the same computer.  Even if
> end-users, as a group, do not appreciate those details, we can ensure via the spec that CIF2 implementers do.  That's sufficient.

As I've said in my addendum, with guidance, most CIF2 programs
could probably come up with consistent identification of the local
encoding on any given day.  Whether that corresponds to the
encoding used for any given CIF file on the "local" filesystem is
another thing, depending on what the code page was on the day it was
written and whether it was even written by the same system (i.e. shared
mounts).  So, saying that local text conventions can be relied upon
within the one computer is a bit of a stretch, as I've discussed above.
I agree that we only care about the implementers in this case.

> So, if pretty much all my expected behavior under UTF-8[/16]+local is the same as it would be under UTF-8-only, then why prefer
> the former?  Because under UTF-8[/16]+local, all the behavior described is conformant to the spec, whereas under UTF-8 only, a
> significant proportion is not.  If the standard adequately covers these behaviors then we can expect more uniform support.
> Moreover, this bears directly on community acceptance of the spec.  If flaunting the spec with respect to encoding becomes
> common, then the spec will have failed, at least in that area.  Having failed in one area, it is more likely to fail in others.

We disagree on the "significant proportion".  I think (with perhaps as
little hard evidence as you?  Or do you know something I don't?) that
very few CIF2 programmers will want to support the default encoding,
especially given the difficulties described above, and those users
with a penchant for editing CIF files will learn very quickly how to
choose UTF8 in a drop-down menu if said programs provide an error
message pointing to an IUCr webpage (for example).

I have few objections (now) to including UTF16, provided that any
files in UTF16 encoding are explicitly negotiated as such.  My
original objection to UTF16 was based on users with an
ASCII-compatible workflow opening a CIF2 file for viewing or editing
and seeing junk.  If such files only appear on these users' systems by
deliberate request, this is not such a big deal.  In all other aspects
UTF16 satisfies my original requirements, most obviously
identifiability. I would still stack the dice in favour of UTF8,
however.

James
========================================

Addendum on local encoding (not germane to above argument in the end):

Before accepting "local", we need to be sure that "local encoding" is
a well-defined concept.  For that, programmers using different
programming languages must be able to determine independently, from
within their various programs, which encoding is the local encoding.
If different local programs do not agree on what the local encoding
is, one program will write files in one "local" encoding, which are
then input by another program assuming a different "local" encoding,
and all sorts of confusion ensues, especially after the second program
thoughtfully transcodes to UTF8.  (Note that programs will usually
have no way of telling if they have correctly determined what the
"local" encoding is, as the CIF file itself will parse fine in any
ASCII-compatible encoding).

My preliminary investigations suggest that even Windows manages to be
more or less consistent on the "single local encoding" front, via use
of the GetACP() function (used by at least CPython and GNU Java).
MacOS has a system default encoding, and Unix variants use the LANG
variable.  Fortran 2003 has an ENCODING=DEFAULT option which in
gfortran simply does nothing (i.e. passes the bytes in a character
string directly to disk as is), so a Fortran program wishing to offer
the local option would need to implement the encoding machinery
itself.  Anyway, I would not immediately exclude "local encoding"
for being ill-defined.
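
A rough Python sketch of the lookups mentioned above, to show how
little code is involved.  The function names are hypothetical.  The
POSIX part also consults LC_ALL and LC_CTYPE, which take precedence
over LANG; a real program would normally ask the C library (for
example through Python's locale module, as in the last function)
rather than parse the environment itself:

```python
import locale
import os
import sys


def codeset_from_posix_env(env=None):
    """Return the codeset named by LC_ALL, LC_CTYPE or LANG, or None."""
    env = os.environ if env is None else env
    # POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            # Locale names look like language[_territory][.codeset][@modifier].
            head = value.split("@", 1)[0]
            if "." in head:
                return head.split(".", 1)[1]
            return None  # e.g. "C" or "uk_UA": no codeset stated
    return None


def windows_ansi_code_page():
    """Return the ANSI code page as a name like 'cp1251', or None off Windows."""
    if sys.platform != "win32":
        return None
    import ctypes  # imported lazily; ctypes.windll exists only on Windows

    return "cp%d" % ctypes.windll.kernel32.GetACP()


def local_encoding():
    """The encoding name Python's locale module reports for this process."""
    # False: report the current setting without calling setlocale().
    return locale.getpreferredencoding(False)
```

For example, `codeset_from_posix_env({"LANG": "uk_UA.KOI8-U"})`
returns `"KOI8-U"`, while a bare `"C"` or `"uk_UA"` names no codeset
at all and yields `None`.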


-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

