Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
- To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@xxxxxxxx>
- Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
- From: James Hester <jamesrhester@xxxxxxxxx>
- Date: Fri, 17 Sep 2010 17:41:48 +1000
- In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local>
- References: <AANLkTilyJE2mCxprlBYaSkysu1OBjY7otWrXDWm3oOT9@mail.gmail.com><8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local><AANLkTikTee4PicHKjnnbAdipegyELQ6UWLXz9Zm08aVL@mail.gmail.com><8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local><AANLkTinZ4KNsnREOOU6sVFdGYR_aQHcjdWr_ko648NGm@mail.gmail.com><8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local><AANLkTintziXhwVCEFD0yUtTDo9KG8ut=oL4OgmkjmEBe@mail.gmail.com><alpine.BSF.2.00.1008240629120.23114@epsilon.pair.com><AANLkTi=+qZQrWJ3duOzWyPq5H=w1GOVbeKRfFLTR8u5a@mail.gmail.com><alpine.BSF.2.00.1008240920580.23114@epsilon.pair.com><AANLkTikRLKp6oREvD4KcgUd-H-Cu6xoOrGWgQE1zUyx7@mail.gmail.com><alpine.BSF.2.00.1009022333190.52468@epsilon.pair.com><AANLkTimLUnUjNuS9EmMbtTurxB3MGtGvM6gWxZw6aRLE@mail.gmail.com><alpine.BSF.2.00.1009030735110.95035@epsilon.pair.com><AANLkTinxkquC5cY0m23yzBVgm7afmYYfh6+2yMz=Hr_w@mail.gmail.com><alpine.BSF.2.00.1009100711070.59446@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local><AANLkTikuoQEU-rv9GkTqqc0u0qgd1ugf+cGTfqF77j-E@mail.gmail.com><8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local>
Hi John: good to see further constructive suggestions. Regarding your UTF8/16 + local proposal: I think I'd be willing to accept UTF16 in addition to UTF8 (see below).

Regarding local encoding, note this blog posting from a Microsoft .Net developer, entitled "Don't Use Encoding.Default":

http://blogs.msdn.com/b/shawnste/archive/2005/03/15/don-t-use-encoding-default.aspx

Indeed, all of the developer-oriented material that I have looked at concerning Microsoft platforms recommends that the developer consciously *chooses* a Unicode-based encoding where possible, that is, ignores any local defaults. In fact, it is rather difficult to find any instructions as to how to determine the platform's "local" encoding. By reading Python source code, I found two Microsoft API functions, "GetACP" and "GetOEMCP", mentioned above, that can be used to determine the default/preferred encoding as an ANSI code page (see http://msdn.microsoft.com/en-us/library/dd318070%28VS.85%29.aspx). The online documentation for both functions contains the following bland comment:

"The ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use UTF-8 or UTF-16 when possible."

My concern precisely.

And: these files with local encoding still need some sort of mechanism to allow reliable transmission. And what about remote filesystem mounts for shared files? If one computer has a different local encoding and stores a file on its "local" filesystem, the next computer to access that "local" file may have a different "local" encoding and get it wrong. And so on.

Frankly, I still see no merit in including local encodings in CIF2 at all.
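For anyone who wants to see what a program actually reports as the "local" encoding, here is a minimal Python 3 sketch. The helper name local_encoding_report is hypothetical, and reaching GetACP/GetOEMCP through ctypes is an assumption about one convenient route to those Win32 functions, not something taken from the documentation quoted above.

```python
import locale
import sys


def local_encoding_report():
    """Collect the encodings this interpreter treats as 'local'.

    Hypothetical helper for illustration only.
    """
    report = {
        # Encoding the C library / OS locale settings suggest for text data.
        "locale_preferred": locale.getpreferredencoding(False),
        # Encoding Python uses for file names, which may differ from the above.
        "filesystem": sys.getfilesystemencoding(),
    }
    if sys.platform == "win32":
        # Assumption: call the Win32 functions directly via ctypes.
        import ctypes

        kernel32 = ctypes.windll.kernel32
        report["ansi_code_page"] = "cp%d" % kernel32.GetACP()
        report["oem_code_page"] = "cp%d" % kernel32.GetOEMCP()
    return report


if __name__ == "__main__":
    for name, value in local_encoding_report().items():
        print(name, value)
```

The report only describes the machine at the moment the program runs; it says nothing about the code page that was in force when a given file was written.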
If the rest of you disagree, I won't argue about it further, but instead will attempt to mitigate the damage by supporting the following moves:

(i) compliant CIF processors are *not* required to accept files in local encoding;
(ii) CIF developer documentation outlines the reasons that "local" encoding is a bad idea;
(iii) the IUCr and databases are urged to make submitters check round-trip files if they have received files in non UTF8/UTF16 form;
(iv) the IUCr and databases encourage UTF8 submission;
(v) CIF developer documentation outlines the techniques for ascertaining the preferred method of determining local encoding in a variety of languages and platforms.

(I have added an addendum on local encodings with more information if anybody is interested)

On Tue, Sep 14, 2010 at 4:52 AM, Bollinger, John C <John.Bollinger@stjude.org> wrote:

> On Sunday, September 12, 2010 11:26 PM, James Hester wrote:
> [...]
>> To my mind, the encoding of plain CIF files remains an open issue. I do not view the mechanisms for managing file encoding that are provided by current OSs to be sufficiently robust, widespread or consistent that we can rely on developers or text editors respecting them [...].
>
> I agree that the encoding of plain CIF files remains an open issue.
>
> I confess I find your concerns there somewhat vague, especially to the extent that they apply within the confines of a single machine. Do your concerns extend to that level? If so, can you provide an example or two of what you fear might go wrong in that context?

A concrete example: a scientist in a multilingual country (e.g. Ukrainian/Russian/English in Ukraine) is used to switching locales to get legacy programs (ie those that rely on "default" encoding!) to display and/or input text properly. CIF files written in "local" encoding using one locale will not be read correctly in a different locale on the same machine.
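The locale-switching scenario can be reproduced in a few lines of Python 3. This is only a sketch: the author name is an invented value, and the choice of Windows-1251, KOI8-U and Windows-1252 as the competing "local" encodings is an assumption made for illustration.

```python
# Hypothetical CIF line; the author name is invented for this example.
line = "_publ_author_name 'Шевченко'"

# Written to disk while the machine's "local" encoding is Windows-1251.
stored = line.encode("cp1251")


def read_under(stored_bytes, assumed_encoding):
    """Decode bytes the way a program trusting the current locale would."""
    return stored_bytes.decode(assumed_encoding)


# Read back later under two other plausible "local" encodings.
as_koi8u = read_under(stored, "koi8_u")
as_cp1252 = read_under(stored, "cp1252")

# Neither decode raises an error, and the ASCII data name is intact,
# yet the quoted value no longer matches what the author typed.
print(as_koi8u == line, as_cp1252 == line)
print(read_under(stored, "cp1251") == line)
```

Because every byte is valid in all three encodings, nothing signals that the value was misread; the file still parses as a CIF.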
I note the following sentence in Microsoft's guide to encodings at http://msdn.microsoft.com/en-us/library/ms404377.aspx:

"However, when you have the opportunity to choose an encoding, you are strongly recommended to use a Unicode encoding, typically either UTF8Encoding or UnicodeEncoding".

I am simply following this recommendation, except that I think we can save our developers some angst by making the appropriate choice for them, so that they don't have to contend with those developers that haven't thought about the issues.

> As Herb recently wrote, "Multiple encodings are a fact of life when working with text." CIF2 looks like text, it feels like text, and despite some exotic spice, it tastes like text -- even in UTF-8 only form. We cannot pretend that we're dealing with anything other than text. We need to accept, therefore, that no matter what we do, authors and programmers will need to account for multiple encodings, one way or another. The format specification cannot relieve either group of that responsibility.

And multiple encodings will continue to be a fact of life if we actively encourage their proliferation. We can at least reduce the amount that programmers need to consider multiple encodings by not building the problem into the specification. Then programmers only need to contend with non-conformant behaviour, to which a reasonable approach is gentle, informative rejection of the file. I acknowledge that there seems to be a difference in perceptions as to how widespread non-conformance will be (I think it will be negligible and manageable with a little education).

> That doesn't necessarily mean, however, that CIF must follow the XML model of being self-defining with regard to text encoding. Given CIF's various uses, we gain little of practical value in this area by defining CIF2 as UTF-8 only, and perhaps equally little by defining required decorations for expressing random encodings. Moreover, the best reading of CIF1 is that it relies on the *local* text conventions, whatever they may be, which is quite a different thing than handling all text conventions that might conceivably be employed.
>
> With that being the case, I don't think it needful for CIF2 in any given environment to endorse foreign encoding conventions other than UTF-8. CIF2 reasonably could endorse UTF-16 as well, though, as that cannot be confused with any ASCII-compatible encoding. Allowing UTF-16 would open up useful possibilities both for imgCIF and for future uses not yet conceived. Additionally, since CIF is text I still think it important for CIF2 to endorse the default text conventions of its operating environment.

If Microsoft documents are to be believed, they would rather developers *didn't* try to figure out what the default encoding is. Perhaps CIF2 should instead endorse the position of just about everybody writing about encodings, including the producers of the operating environment: "choose UTF8 if you have a choice"?

> Could we agree on those three as allowed encodings? Consider, given that combination of supported alternatives and no extra support from the spec, how might various parties deal with the unavoidable encoding issue. Here are some of the more reasonable alternatives I see:
>
> 1. Bulk CIF processors and/or repositories such as Chester, CCDC, and PDB:
>
> Option a) accept and provide only UTF-8 and/or UTF-16 CIFs. The responsibility to perform any needed transcoding is on the other party. This is just as it might be with UTF-8-only.
>
> Option b) in addition to supporting UTF-8 and/or UTF-16, support other encodings by allowing users to explicitly specify them as part of the submission/retrieval process. The processor / repository would either ensure the CIF is properly labeled, or, better, transcode it to UTF-8[/16]. This also is just as it might be with UTF-8 only.
As discussed before, users are not necessarily going to know what their local encoding is, making the selection untrustworthy. Only option (a) is viable.

> 2. Programs and Libraries:
>
> Option a) On input, detect encoding by checking first for UTF-16, assuming UTF-8 if not UTF-16, and falling back to default text conventions if a UTF-8 decoding error is encountered. On output, encode as directed by the user (among the two/three options), defaulting to the input encoding when that is available and feasible. These would be desirable behaviors even in the UTF-8 only case, especially in a mixed CIF1/CIF2 environment, but they do exceed UTF-8-only requirements.

I don't think the user would necessarily know which encoding to prefer if offered a choice. I believe the safest route is to output in the same encoding as the input, which at least avoids introducing errors if the local encoding is different to what the previous program thought it was and then the resulting errors are preserved when transcoding to UTF8/16. So option (a) is not viable.

> Option b) Require input and produce output according to a fixed set of conventions (whether local text conventions or UTF-8/16). The program user is responsible for any needed transcoding. This would be sufficient for the CIF2, UTF-8 only case, and is typical in the CIF1 case; those differ, however, in which text conventions would be assumed.

This is acceptable in that it doesn't make anything worse by producing incorrect UTF8/16 text due to use of incorrect local encoding. When the time comes to transcode to UTF8, some user interaction for checking of the encoding is necessary, so it should not be done silently.

> 3. Users/Authors:
> 3.1. Creating / editing CIFs
> No change from current practice is needed, but users might choose to store CIFs in UTF-8[/16] form. This is just as it would likely be under UTF-8 only.

I assume by "current practice" you mean editing files in "local" encoding?

> 3.2. Transferring CIFs
> Unless an alternative agreement on encoding can be reached by some means, the transferor must ensure the CIF is encoded in UTF-8[/16]. This differs from the UTF-8-only case only inasmuch as UTF-16 is (maybe) allowed.

Note of course that I consider that a CIF is transferred every time it is written to a filesystem, under which definition local encoding would not be allowed. In any case, I would tighten up this requirement to be UTF8 unless both parties agree on UTF16.

> 3.3. Receiving CIFs
> The receiver may reasonably demand that the CIF be provided in UTF-8[/16] form. He should *expect* that form unless some alternative agreement is established. Any desired transcoding from UTF-8[/16] to an alternative encoding is the user's responsibility. Again, this is not significantly different from the UTF-8 only case.
>
> A driving force in many of those cases is the well-understood (especially here!) fact that different systems cannot be relied upon to share text conventions, thus leaving UTF-8[/16] as the only available general-purpose medium of exchange. At the same time, local conventions are not forbidden from use where they can be relied upon -- most notably, within the same computer. Even if end-users, as a group, do not appreciate those details, we can ensure via the spec that CIF2 implementers do. That's sufficient.

As I've said in my addendum, with guidance, most CIF2 programs could probably come up with consistent identification of the local encoding on any given day. Whether that corresponds to the same encoding used for any given CIF file on the "local" filesystem is another thing, depending on what the code page was on the day it was written and whether it was even written by the same system (ie shared mounts). So, saying that local text conventions can be relied upon within the one computer is a bit of a stretch, as I've discussed above. I agree that we only care about the implementers in this case.
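For reference, the input-side detection order in option 2(a) above (UTF-16 first, then UTF-8, then the local default) might look roughly like this in Python 3. It is a sketch under assumptions: the function name decode_cif_bytes is hypothetical, UTF-16 is recognised only by its byte-order mark (the proposal does not say how UTF-16 is to be detected), and the local default is taken from locale.getpreferredencoding unless the caller supplies one.

```python
import codecs
import locale


def decode_cif_bytes(data, local_encoding=None):
    """Return (text, encoding_used) following the 2(a) detection order.

    Hypothetical helper; BOM-based UTF-16 detection is an assumption.
    """
    # Step 1: UTF-16, recognised here only by a leading byte-order mark.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16"), "utf-16"
    # Step 2: strict UTF-8 ("utf-8-sig" also strips an optional UTF-8 BOM).
    try:
        return data.decode("utf-8-sig"), "utf-8"
    except UnicodeDecodeError:
        # Step 3: fall back to whatever is considered the local encoding.
        fallback = local_encoding or locale.getpreferredencoding(False)
        return data.decode(fallback), fallback


if __name__ == "__main__":
    sample = "data_example".encode("utf-16")
    print(decode_cif_bytes(sample))
```

Steps 1 and 2 can fail loudly, but step 3 generally cannot: a single-byte local encoding will accept almost any byte sequence, so a wrong guess about the local encoding passes without error.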
> So, if pretty much all my expected behavior under UTF-8[/16]+local is the same as it would be under UTF-8-only, then why prefer the former? Because under UTF-8[/16]+local, all the behavior described is conformant to the spec, whereas under UTF-8 only, a significant proportion is not. If the standard adequately covers these behaviors then we can expect more uniform support.
>
> Moreover, this bears directly on community acceptance of the spec. If flaunting the spec with respect to encoding becomes common, then the spec will have failed, at least in that area. Having failed in one area, it is more likely to fail in others.

We disagree on the "significant proportion". I think (with perhaps as little hard evidence as you? Or do you know something I don't?) that very few CIF2 programmers will want to support the default encoding, especially given the difficulties described above, and those users with a penchant for editing CIF files will learn very quickly how to choose UTF8 in a drop-down menu if said programs provide an error message pointing to an IUCr webpage (for example).

I have few objections (now) to including UTF16, provided that any files in UTF16 encoding are explicitly negotiated as such. My original objection to UTF16 was based on users with an ASCII-compatible workflow opening a CIF2 file for viewing or editing and seeing junk. If such files only appear on these users' systems by deliberate request, this is not such a big deal. In all other aspects UTF16 satisfies my original requirements, most obviously identifiability. I would still stack the dice in favour of UTF8, however.

James

========================================
Addendum on local encoding (not germane to above argument in the end):

Before accepting "local", we need to be sure that we know that "local encoding" is a well-defined concept.
For "local encoding" to be a well-defined concept, we would require that programmers using different programming languages will be able to independently determine which encoding is the local encoding from within their various programs. If different local programs do not agree on what the local encoding is, one program will write files in one "local" encoding, which is then input by another program assuming a different "local" encoding, and all sorts of confusion ensues, especially after the second program thoughtfully transcodes to UTF8. (Note that programs will usually have no way of telling if they have correctly determined what the "local" encoding is, as the CIF file itself will parse fine in any ASCII-compatible encoding).

My preliminary investigations suggest that even Windows manages to be more or less consistent on the "single local encoding" front, via use of the GetACP() function (used by at least CPython and Gnu Java). MacOS has a system default encoding, and Unix variants use the LANG variable. Fortran 2003 has an ENCODING=DEFAULT option which in gfortran simply does nothing (ie passes the bytes in a character string directly as is to disk), so a Fortran program wishing to offer the local option would need to implement the encoding machinery itself.

Anyway, I would not immediately exclude "local encoding" for being ill-defined.

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148