Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. ...

In this email I try to pin down what supporting local encoding might imply.

I think it is fair to say that John is advocating including "local"
encoding in the list of CIF2 encodings because:

(i) this will be the default encoding assumed by text editors
(ii) there will be a significant tendency among programmers not to
specify encoding when reading/writing CIF files

I was surprised to read that we were worried about programmers not
getting the message, having assumed up until now that we were
concerned only about ordinary users not coming to grips with non-local
encoding.  Anyway, let's put ourselves in the programmer's shoes on a
system for which local encoding is not UTF8/16:

Programmer A wants to support UTF8, UTF16 and local.  When reading a
CIF file, she *must* first try UTF8, then UTF16, and only then local,
because a UTF8 file will most probably read in without error as a file
in local encoding.  However, this programmer is not one of those
identified in (ii), because she is actively setting UTF8 and 16 as
input encodings.
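
As a rough illustration of what Programmer A has to write, here is a
minimal Python sketch of that ordering.  The function name
decode_cif_bytes is made up for this example, and accepting UTF16 only
when a byte order mark is present is my own assumption (without a BOM
nearly any even-length byte string decodes as UTF16), not something
the draft specifies.

```python
import codecs
import locale


def decode_cif_bytes(raw, local_encoding=None):
    """Decode raw CIF bytes, trying UTF-8, then UTF-16, then local.

    Returns (text, encoding_used). Hypothetical helper, not a real
    CIF library API.
    """
    if local_encoding is None:
        # Whatever the platform considers its default text encoding.
        local_encoding = locale.getpreferredencoding(False)

    # 1. UTF-8 first: non-ASCII text in a local 8-bit encoding will
    #    usually fail here, whereas the reverse test rarely fails.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass

    # 2. UTF-16, but only with a byte order mark (assumption, see above).
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        try:
            return raw.decode("utf-16"), "utf-16"
        except UnicodeDecodeError:
            pass

    # 3. Local encoding as the last resort.
    return raw.decode(local_encoding), local_encoding


# Why the order matters: UTF-8 bytes decode "successfully" as Latin-1,
# but the result is not the original text.
original = "5.43 \u00c5"
misread = original.encode("utf-8").decode("latin-1")
```

Here misread comes back without any exception but differs from
original, which is exactly the silent failure that trying local first
would produce.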

Programmer B wants no business with setting encodings, and so supports
only reading/writing local encoding.  His program will unfortunately
also read in UTF8 files assuming local encoding. The program thus
behaves correctly only if the user always remembers to either produce
or transcode CIF files to local encoding, assuming that the user has
read the documentation for the program sufficiently to know that this
is even an issue.  As an added bonus, this user has to know what the
local encoding is, as the programmer is presumably not making any
effort to find out and communicate it (as this would actually be more
work than just specifying the damn encoding already). I believe that
this is an unworkable situation, and not one that we should

My point being that reason (ii) (lazy programmers) is not a good
justification for keeping local encoding on the list of acceptable
encodings.  I have never seen reason (i) as sufficient justification.

In any case, I do not think that there will be many Programmer Bs.
Note the following points:
(a) Dealing with encoding in most common languages is simple.  Note
that even modern Fortran can handle UTF8 - see the code snippets at
and http://gcc.gnu.org/onlinedocs/gfortran/SELECTED_005fCHAR_005fKIND.html
(b) If UTF8 is to be supported for reading, files have to be opened
explicitly in UTF8, so the programmer is already explicitly specifying
an encoding.
(c) The audience of programmers for CIF is (unfortunately) rather
small. They can all be reached very easily for active education on how
and why UTF8 encoding is specified.
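
To give a feel for how little is involved in points (a) and (b), here
is a minimal Python sketch.  The helper names read_cif and write_cif
are invented for this example and do not come from any existing CIF
package; the only thing that matters is the explicit encoding argument.

```python
def write_cif(path, text):
    """Write text to path as UTF-8, whatever the platform default is."""
    # newline="\n" stops the platform from rewriting line endings.
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(text)


def read_cif(path):
    """Read path as UTF-8; malformed bytes raise UnicodeDecodeError."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()
```

Leaving out encoding="utf-8" is what turns this into Programmer B's
program: the same calls then silently use the local encoding.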

And the lack of local encoding can be managed simply: UTF8 encoding
and "local" will almost always coincide in the ASCII space, so the
absence of local encoding in the acceptable encoding list is invisible
on day one of CIF2.  Introduction of non-ASCII characters into CIFs
can then be managed from Chester through gradual introduction of
non-ASCII dataname values, first in non-critical places.  Chester can
monitor the proportion of incorrectly encoded files received and
calibrate a response.
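
The "coincide in the ASCII space" claim is easy to check.  The sketch
below uses a made-up two-line CIF fragment and a handful of 8-bit
encodings that I picked as examples of plausible local encodings; an
EBCDIC code page is included to show why I say "almost always" rather
than "always".

```python
# A pure-ASCII CIF fragment (illustrative content only).
ascii_cif = "data_example\n_cell_length_a 5.431\n"
raw = ascii_cif.encode("utf-8")

# Example ASCII-compatible local encodings (an arbitrary selection).
local_candidates = ["latin-1", "cp1252", "koi8_r", "iso8859_7"]

# Each of these reads the UTF-8 bytes back as the identical text.
all_agree = all(raw.decode(enc) == ascii_cif for enc in local_candidates)

# EBCDIC (cp037) is one of the rare exceptions: same bytes, other text.
ebcdic_text = raw.decode("cp037")

# Once a non-ASCII character appears, the byte sequences diverge.
angstrom = "\u00c5"
utf8_bytes = angstrom.encode("utf-8")      # two bytes
latin1_bytes = angstrom.encode("latin-1")  # one byte
```

So a CIF2 file restricted to ASCII is byte-for-byte the same whether
the writer thought of it as UTF8 or as one of these local encodings,
and the difference only surfaces once non-ASCII values are introduced.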

However you all assess the above points, I think it is clear that John
and I will have to agree to disagree on the value of local encodings.
The root cause, I think, is differing perceptions of programmer
responsiveness to the standard.  I appreciate John's efforts to find a
compromise, but I believe we have exhausted our avenues in this

Well, I'm ready to vote.   Would anybody else like to make any final
points before we call for a vote?

On Sat, Sep 18, 2010 at 1:33 AM, Bollinger, John C
<John.Bollinger@stjude.org> wrote:

> Unfortunately, that train has long since left us behind on the platform.  New standard notwithstanding, I don't see an opportunity to
> effect an abrupt shift in program and user behavior -- specifically, the behavior of using default text conventions implicitly and
> routinely.  If we formally require UTF-8/16, it can only be with the understanding that many users and programs will ignore that
> requirement altogether.  I don't find that at all appealing or useful, and I do not support it.
> I think we will achieve more consistent CIF2 software, and we will better influence programmers and users, by standardizing the
> use of default text conventions with CIF2.  I would be content to deprecate such use.  I would favor non-normative commentary in
> the spec that explains the issue and discourages reliance on default text encoding.  I would also favor publicizing resources
> describing how to convert local text to UTF-8 (or -16), and creating such resources if necessary.  I want to see people using
> UTF-8/16 for their CIFs, but I don't want to cut them off, standards-wise, when they don't.
> [...]
>>In fact, it is rather difficult to
>>find any instructions as to how to determine the platform's "local"
> The point of default conventions is that you don't have to determine what they are, you just use them.  In fact, in some
> programming environments, there is no easy way to do otherwise.  For example, to the best of my knowledge, there is no way to
> write a standard-conformant Fortran 95 program that portably reads text from a file in anything but the default encoding.

If you don't know what your input encoding is, how do you transcode to UTF8?

> The mechanism for reliable transmission is to transcode, if necessary, to UTF-8/16, and transmit the result.  This is exactly the
> same mechanism that would be available for reliable transmission if UTF-8 were the only standardized encoding (under which case
> I include transmission of non-UTF-8 almost-CIFs).  The mechanism is the same for reliably sharing CIFs among environments
> where compatibility of default conventions is uncertain.  I see no reason to believe that users' decisions whether to employ that
> mechanism will be driven by anything other than practical considerations, the standard's position notwithstanding.  I would expect
> some programmers to be more influenced by the standard, but in the end they are faced with the same practical considerations.
>>  And so on. Frankly, I still see no
>>merit in including local encodings in CIF2 at all.
> I value standardizing behavior that we all (I think) expect will be common, even though that behavior isn't ideal.  In that way I expect
> to support well-defined and consistent responses to that behavior (mainly in software).  Given that I have said so before without
> persuading you, we will have to agree to disagree here.

>> but instead will attempt
>>to mitigate the damage by supporting the following moves:
>>(i) compliant CIF processors are *not* required to accept files in
>>local encoding;
> It is inconsistent to allow local text conventions in the file format definition, but to permit conformant processors to reject them.
> Additionally, I oppose inclusion of any explicit requirements on CIF processors, preferring instead to rely on the format
> specification to define what conformant processors must do.  I could, however, accept defining separate flavors of CIF
> distinguished by these encoding distinctions, so that programs could conform to one, the other, or both.  I'm not sure I like that,
> but I think I could agree to it if it helps us wrap this up.

> _______________________________________________
> cif2-encoding mailing list
> cif2-encoding@iucr.org
> http://scripts.iucr.org/mailman/listinfo/cif2-encoding

T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148