Discussion List Archives


Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .... .

Hi James,

On Friday, September 17, 2010 2:42 AM, James Hester wrote:
>Regarding your UTF8/16 + local proposal:  I think I'd be willing to
>accept UTF16 in addition to UTF8 (see below).

I do favor supporting UTF-16 in addition to UTF-8, so I'm pleased you're willing to agree to that, but that's not the central theme of the proposal.  Nevertheless, it feels like we're coming close to a resolution.

My apologies if the rest of my response is long-winded; the key points are:
(i)  Are we ready to / do we need to vote on the local encodings question?
(ii) With some caveats, I support your mitigating responses to allowing local encodings.

And so...

>  Regarding local
>encoding, note this blog posting from a Microsoft .Net developer,
>entitled "Don't Use Encoding.Default"

I hadn't seen that particular post before, but many Java people, too, regard explicitly specifying text encodings as the best practice.  Partly from that background sprang my support for various tagging proposals that have been floated here.

Unfortunately, that train has long since left us behind on the platform.  New standard notwithstanding, I don't see an opportunity to effect an abrupt shift in program and user behavior -- specifically, the behavior of using default text conventions implicitly and routinely.  If we formally require UTF-8/16, it can only be with the understanding that many users and programs will ignore that requirement altogether.  I don't find that at all appealing or useful, and I do not support it.

I think we will achieve more consistent CIF2 software, and we will better influence programmers and users, by standardizing the use of default text conventions with CIF2.  I would be content to deprecate such use.  I would favor non-normative commentary in the spec that explains the issue and discourages reliance on default text encoding.  I would also favor publicizing resources describing how to convert local text to UTF-8 (or -16), and creating such resources if necessary.  I want to see people using UTF-8/16 for their CIFs, but I don't want to cut them off, standards-wise, when they don't.

>In fact, it is rather difficult to
>find any instructions as to how to determine the platform's "local"

The point of default conventions is that you don't have to determine what they are; you just use them.  In fact, in some programming environments, there is no easy way to do otherwise.  For example, to the best of my knowledge, there is no way to write a standard-conformant Fortran 95 program that portably reads text from a file in anything but the default encoding.

>" The ANSI code pages can be different on different computers, or can
>be changed for a single computer, leading to data corruption. For the
>most consistent results, applications should use UTF-8 or UTF-16 *when

(Emphasis added.)  I second that advice, and I would be happy to have non-normative comments to that effect in the CIF2 standard.  The situation for the standard, however, is not the same as for a program.  It is valuable to standardize even practices that we frown upon when we have every reason to expect that such practices will continue.

>My concern precisely.  And: these files with local encoding still need
>some sort of mechanism to allow reliable transmission. And what about
>remote filesystem mounts for shared files?  If one computer has a
>different local encoding and stores a file on its "local" filesystem,
>the next computer to access that "local" file may have a different
>"local" encoding and get it wrong.

The mechanism for reliable transmission is to transcode, if necessary, to UTF-8/16, and transmit the result.  This is exactly the same mechanism that would be available for reliable transmission if UTF-8 were the only standardized encoding (a case in which I include transmission of non-UTF-8 almost-CIFs).  The same mechanism serves for reliably sharing CIFs among environments where compatibility of default conventions is uncertain.  I see no reason to believe that users' decisions about whether to employ that mechanism will be driven by anything other than practical considerations, the standard's position notwithstanding.  I would expect some programmers to be more influenced by the standard, but in the end they are faced with the same practical considerations.
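A minimal sketch of that transcoding step, in Python.  The function name `transcode_to_utf8` and its parameters are hypothetical, chosen only for illustration.  When no source encoding is given, the sketch assumes the file was written in the platform's default (locale) encoding, which is exactly the assumption that can fail when files move between machines.

```python
import locale
from pathlib import Path


def transcode_to_utf8(src, dst, source_encoding=None):
    """Rewrite the text file `src` as UTF-8 at `dst` (hypothetical helper).

    If `source_encoding` is None, the platform's default locale encoding
    is assumed. Returns the name of the encoding that was used to decode.
    """
    if source_encoding is None:
        # Assumption: the file was written with this machine's default.
        source_encoding = locale.getpreferredencoding(False)

    # Work on bytes so that no newline translation takes place, and decode
    # strictly so that bytes invalid in the assumed encoding raise
    # UnicodeDecodeError instead of being silently replaced.
    raw = Path(src).read_bytes()
    text = raw.decode(source_encoding, errors="strict")

    Path(dst).write_bytes(text.encode("utf-8"))
    return source_encoding
```

A call such as `transcode_to_utf8("local.cif", "portable.cif")` would then be run on the machine where the file was created, before the file is transmitted; the file names here are placeholders.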

>  And so on. Frankly, I still see no
>merit in including local encodings in CIF2 at all.

I value standardizing behavior that we all (I think) expect will be common, even though that behavior isn't ideal.  In that way I expect to support well-defined and consistent responses to that behavior (mainly in software).  Given that I have said so before without persuading you, we will have to agree to disagree here.

>If the rest of
>you disagree, I won't argue about it further,

Is that a call for a vote?

> but instead will attempt
>to mitigate the damage by supporting the following moves:
>(i) compliant CIF processors are *not* required to accept files in
>local encoding;

It is inconsistent to allow local text conventions in the file format definition, but to permit conformant processors to reject them.  Additionally, I oppose inclusion of any explicit requirements on CIF processors, preferring instead to rely on the format specification to define what conformant processors must do.  I could, however, accept defining separate flavors of CIF distinguished by these encoding distinctions, so that programs could conform to one, the other, or both.  I'm not sure I like that, but I think I could agree to it if it helps us wrap this up.

>(ii) CIF developer documentation outlines the reasons that "local"
>encoding is a bad idea

I support that fully.

>(iii) the IUCr and databases are urged to make submitters check
>round-trip files if they have received files in non UTF8/UTF16 form

I think that's a good idea.
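
One way a recipient might pre-screen such files before sending them back for checking is sketched below in Python.  The function name `round_trips` is hypothetical.  Note the limitation: a byte-level round trip can show that an assumed encoding is impossible for the received bytes, but it cannot show that the assumed encoding is the right one, which is why the submitter's own check of the returned file remains necessary.

```python
def round_trips(raw: bytes, assumed_encoding: str) -> bool:
    """Return True if `raw` survives assumed_encoding -> UTF-8 -> back.

    Hypothetical helper. A False result means the assumed encoding cannot
    be correct (or is not reversible for these bytes); a True result does
    NOT prove that the assumption is correct.
    """
    try:
        text = raw.decode(assumed_encoding)           # received form -> text
        utf8 = text.encode("utf-8")                   # text -> UTF-8 form
        back = utf8.decode("utf-8").encode(assumed_encoding)
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False
    return back == raw


# The single byte 0xC5 is a valid character in both Latin-1 and
# Windows-1252, so both assumptions pass, whereas it is not valid UTF-8.
sample = b"\xc5"
results = {enc: round_trips(sample, enc) for enc in ("latin-1", "cp1252", "utf-8")}
```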

>(iv) the IUCr and databases encourage UTF8 submission.

Absolutely.  As I have written before, I think it would be an even better idea for IUCr and databases to *require* that CIFs be submitted to them in UTF-8/16 form (or even in UTF-8 form exclusively), but there are legitimate reasons why they might not want to adopt such a policy.

>(v) CIF developer documentation outlines the techniques for
>ascertaining the preferred method of determining local encoding in a
>variety of languages and platforms.

Ok.  As I wrote above, the whole point of default encodings is that you don't need to figure out what they are.  By definition, a default encoding is what you get when you don't specify one (or when you generically ask for defaults).  On the other hand, there might be special cases (or ordinary ones that I have not considered) where you do have to figure it out after all, so information about how to do so is relevant.
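
As one example of what such documentation might contain, here is a short Python sketch that asks the runtime which defaults it would use.  The function name `default_text_encodings` is hypothetical; the calls it makes are from the standard `locale` and `sys` modules.  The reported values depend on the platform and on interpreter settings (for instance, whether Python's UTF-8 mode is enabled), so the sketch reports what this process would use rather than any fixed property of the machine.

```python
import locale
import sys


def default_text_encodings():
    """Report the text encodings this Python process uses by default.

    Hypothetical helper. The results vary with platform, locale settings
    and interpreter options.
    """
    return {
        # Encoding the locale machinery prefers for text data.
        "locale": locale.getpreferredencoding(False),
        # Encoding used to convert file names to and from bytes.
        "filesystem": sys.getfilesystemencoding(),
        # Encoding of standard output, if stdout is a text stream.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, encoding in default_text_encodings().items():
        print(f"{name}: {encoding}")
```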


John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital


cif2-encoding mailing list
