Reply to: [list | sender only]
Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
- To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@xxxxxxxx>
- Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
- From: "Herbert J. Bernstein" <yaya@xxxxxxxxxxxxxxxxxxxxxxx>
- Date: Wed, 15 Sep 2010 09:42:19 -0400 (EDT)
- In-Reply-To: <20100915123927.GA26246@emerald.iucr.org>
- References: <AANLkTimLUnUjNuS9EmMbtTurxB3MGtGvM6gWxZw6aRLE@mail.gmail.com> <alpine.BSF.2.00.1009030735110.95035@epsilon.pair.com> <AANLkTinxkquC5cY0m23yzBVgm7afmYYfh6+2yMz=Hr_w@mail.gmail.com> <alpine.BSF.2.00.1009100711070.59446@epsilon.pair.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <AANLkTikuoQEU-rv9GkTqqc0u0qgd1ugf+cGTfqF77j-E@mail.gmail.com> <8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> <930138.36485.qm@web87008.mail.ird.yahoo.com> <alpine.BSF.2.00.1009141032080.26597@epsilon.pair.com> <alpine.BSF.2.00.1009141050260.26597@epsilon.pair.com> <20100915123927.GA26246@emerald.iucr.org>
Dear Colleagues,

1. For a Mac under OS X, I use cyclone for conversion of encodings.

2. No hash scheme will survive random trips through random editors or random systems.

3. Embedded strings of characters (e.g. the 5 accented o's or more) will also undergo strange transformations, but they will be easier to deal with without a lot of external software support.

4. There is no way to make a "pure ASCII version" of a general UTF-8 file without adopting some reserved character strings at the lexical level -- \U... or &#...; or somesuch, as used in many other systems -- but with such an extension, it is easy.

5. We can keep going on this forever -- we need to make some decisions.

Regards,
Herbert
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
yaya@dowling.edu
=====================================================

On Wed, 15 Sep 2010, Brian McMahon wrote:
> I have said little or nothing on this list so far, because I'm not
> sure that I can add anything that's of concrete use. I've read the
> many contributions, all of them carefully thought through, and I
> still see both sides (actually, all sides) of the arguments. I am
> disinterested in the eventual outcome (but not "uninterested").
>
> But, whatever the outcome, the IUCr will undoubtedly receive files
> *intended* by the authors as CIF submissions, that come in a variety
> of character-set encodings. For the most part, we will want to accept
> these without asking the author what the encoding was, not least
> because the typical author will have no idea (and increasingly, our
> typical author will struggle to understand the questions we are
> posing since English is not his or her native language -- or perhaps
> we will struggle to understand the reply).
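[Herbert's point 4 can be illustrated with a short Python sketch. The \x / \u escaping below uses Python's own conventions merely as a stand-in for the "\U... or &#...; or somesuch" reserved strings he mentions; it is not part of any CIF draft. Note that a real scheme would also have to escape literal backslashes, which this toy round-trip does not.]

```python
# Round-trip non-ASCII text through a "pure ASCII version" using
# reserved escape strings (Python's backslash escapes, for illustration).

text = "5 accented o's: \u00f3\u00f2\u00f4\u00f6\u00f5"

# Escape: every non-ASCII character becomes a reserved ASCII string
# such as \xf3, leaving a file safe for ASCII-only tools.
ascii_form = text.encode("ascii", errors="backslashreplace").decode("ascii")

# Unescape: recover the original Unicode text from the ASCII form.
restored = ascii_form.encode("ascii").decode("unicode_escape")

assert all(ord(c) < 128 for c in ascii_form)  # pure ASCII
assert restored == text                       # lossless round trip
```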
>
> So my concerns are:
>
> (1) how easily can we determine the correct encoding with which the
> file was generated;
>
> (2) how easily can we convert it into our canonical encoding(s) for
> in-house production, archiving and delivery?
>
> First a few comments on that "canonical encoding(s)". Simon and I
> have both been happy enough to consider UTF-8 as a lingua franca,
> since we perceive it as a reasonably widespread vehicle for carrying
> a large (multilingual) character set, and one that is widely
> supported by many generic text processors and platforms. However,
> many of our existing CIF applications may choke on a UTF-8 file, and
> we may need to create working formats that are pure ASCII. I would
> also prefer to retain a single archival version of a CIF (well,
> ideally several identical copies for redundancy, but nonetheless a
> single *version*), from which alternative encodings that we choose to
> support for delivery from the archive can be generated on the fly.
>
> So, really, the desire would be to have standalone applications that
> can convert between character encodings on the fly. Does anyone know
> of the general availability of such tools? The more reliable
> conversions that can be made, the more relaxed we are about accepting
> multiple input encodings. I have to say that a very quick Google
> search hasn't yet thrown up much encouragement here.
>
> Now, back to (1). In similar vein, do you know of any standalone
> utilities that help in determining a text-file character encoding?
>
> [I'm happy to be educated, ideally off-list, in whether
> Content-Encoding negotiation in web forms can help here, since many
> of our CIF submissions come by that route, but I'm more interested in
> the general question of how you determine the encoding of a text file
> that you just happen to find sitting on the filesystem.]
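[On the availability of standalone converters: the Unix tool iconv(1) does exactly this, e.g. `iconv -f latin1 -t utf-8 in.cif > out.cif`. The core of such a converter is small enough to sketch in a few lines of Python; the function name `transcode` here is our own, not from any existing library.]

```python
# Minimal transcoder of the kind Brian asks about: decode the input
# bytes under the source encoding, re-encode under the target.
def transcode(data: bytes, src: str, dst: str = "utf-8") -> bytes:
    """Convert bytes from encoding `src` to encoding `dst`.

    Raises UnicodeDecodeError if `data` is not valid in `src`,
    rather than silently producing mojibake.
    """
    text = data.decode(src)   # bytes -> abstract Unicode text
    return text.encode(dst)   # Unicode text -> target bytes

# Example: a Latin-1 file containing an o-umlaut becomes UTF-8.
latin1_bytes = "K\u00f6hler".encode("latin-1")
utf8_bytes = transcode(latin1_bytes, "latin-1", "utf-8")
assert utf8_bytes.decode("utf-8") == "K\u00f6hler"
```

The hard part, of course, is not the conversion but knowing `src` in the first place, which is Brian's question (1).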
>
> One utility we use heavily in the submission system is "file"
> (http://freshmeat.net/projects/file - we currently use version 4.26
> with an augmented and slightly modified magic file). This is rather
> quiet about different character encodings, though I notice the magic
> file distributed with the more recent version 5.04 does have a
> "Unicode" section, namely:
>
> #------------------------------------------------------------------------------
> # $File: unicode,v 1.5 2009/09/19 16:28:13 christos Exp $
> # Unicode: BOM prefixed text files - Adrian Havill <havill@turbolinux.co.jp>
> # GRR: These types should be recognised in file_ascmagic so these
> # encodings can be treated by text patterns.
> # Missing types are already dealt with internally.
> #
> 0 string +/v8 Unicode text, UTF-7
> 0 string +/v9 Unicode text, UTF-7
> 0 string +/v+ Unicode text, UTF-7
> 0 string +/v/ Unicode text, UTF-7
> 0 string \335\163\146\163 Unicode text, UTF-8-EBCDIC
> 0 string \376\377\000\000 Unicode text, UTF-32, big-endian
> 0 string \377\376\000\000 Unicode text, UTF-32, little-endian
> 0 string \016\376\377 Unicode text, SCSU (Standard Compression Scheme for Unicode)
>
> Interestingly, the "animation" module of this new magic file
> conflicts with other possible UTF encodings:
>
> # MPA, M1A
> # updated by Joerg Jenderek
> # GRR the original test are too common for many DOS files, so test 32 <= kbits <= 448
> # GRR this test is still too general as it catches a BOM of UTF-16 files (0xFFFE)
> # FIXME: Almost all little endian UTF-16 text with BOM are clobbered by these entries
>
> And, by the way, the "augmented" magic file we use (the one
> distributed as part of the KDE desktop distribution) already includes
> this section:
>
> # chemical/x-cif 50
> 0 string #\#CIF_1.1
> >10 byte 9 chemical/x-cif
> >10 byte 10 chemical/x-cif
> >10 byte 13 chemical/x-cif
>
> It seems to me that without some reasonably reliable discriminator,
> John's endorsement of support for "local"
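[The BOM-sniffing these magic(5) entries perform is easy to reproduce directly. The sketch below uses the standard Unicode BOM byte sequences rather than copying the quoted fragment verbatim (its UTF-32 big-endian pattern, FE FF 00 00, appears garbled; the standard BOM is 00 00 FE FF), and collapses the four UTF-7 signatures to their common "+/v" prefix. Absence of a BOM proves nothing about the encoding, which is precisely why file(1) stays "rather quiet" here.]

```python
from typing import Optional

# Standard Unicode byte-order-mark signatures, longest first so that a
# UTF-32 LE BOM (FF FE 00 00) is not mistaken for a UTF-16 LE one (FF FE).
BOMS = [
    (b"\xff\xfe\x00\x00", "UTF-32, little-endian"),
    (b"\x00\x00\xfe\xff", "UTF-32, big-endian"),
    (b"\xef\xbb\xbf",     "UTF-8"),
    (b"\x0e\xfe\xff",     "SCSU"),
    (b"\xfe\xff",         "UTF-16, big-endian"),
    (b"\xff\xfe",         "UTF-16, little-endian"),
    (b"+/v",              "UTF-7"),  # common prefix of the four +/v* forms
]

def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding named by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None

assert sniff_bom(b"\xef\xbb\xbfdata_x") == "UTF-8"
assert sniff_bom(b"data_test\n") is None  # no BOM: encoding unknown
```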
encodings will allow files
> to leak out into the wider world where they can't at all easily be
> handled or even properly identified. (Though, as many have argued
> persuasively, "forbidding" them is not going to prevent such files
> from being created, and possibly even used fruitfully within local
> environments.)
>
> Remember that many CIFs will come to us in the end after passage
> across many heterogeneous systems. I referred in a previous post to
> my own daily working environment - Solaris, Linux and Windows systems
> linked by a variety of X servers, X emulators, NFS and SMB
> cross-mounted filesystems, clipboards communicating with diverse
> applications and OSes running different default locales...
> [Incidentally, hasn't SMB now been superseded by "CIFS"!]
>
> Perhaps I'm just perverse; but I doubt that I'm quite unique. We'll
> also see files shuttled between co-authors with different languages,
> locales, OSes, and exchanged via email, ftp, USB stick etc.
> "Corruptions" will inevitably be introduced in these interchanges -
> sometimes subtle ones. For example, outside the CIF world altogether,
> we see Greek characters change their identity when we run some files
> through a PDF -> PostScript -> PDF cycle (all using software from the
> same software house, Adobe). The reason has to do with differences in
> Windows and Mac encodings, and the failure of the Acrobat software to
> track and maintain the character mappings through such a cycle.
>
> Well, I'll stop here, because in spite of my best intentions I don't
> think I'm moving the debate along very much, and I apologise if
> everything here has already been so obvious as not to need saying.
>
> I'll defer further comment until I've learned if there are already
> standard text-encoding identifiers and transcoders.
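[One workable answer to Brian's question (1) is the detection cascade described later in this thread: BOM first, then strict UTF-8 validation, then the local default as a last resort. A sketch under those assumptions; heuristics like this can misidentify files, and dedicated detectors (e.g. the chardet library, or ICU's charset detector) are more thorough.]

```python
import locale

def guess_encoding(data: bytes) -> str:
    """Guess a text file's encoding: BOM, then UTF-8, then local default."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"                 # UTF-8 with BOM
    if data.startswith(b"\xfe\xff") or data.startswith(b"\xff\xfe"):
        return "utf-16"                    # the codec consumes the BOM
    try:
        data.decode("utf-8", errors="strict")
        return "utf-8"                     # valid UTF-8 (includes pure ASCII)
    except UnicodeDecodeError:
        # Not valid UTF-8: fall back to the local text conventions.
        return locale.getpreferredencoding(False)

assert guess_encoding(b"data_example\n_cell_length_a 5.4\n") == "utf-8"
assert guess_encoding("\ufeffdata_x".encode("utf-16-le")) == "utf-16"
```

The fallback branch is exactly where the guess becomes unreliable: bytes that are invalid UTF-8 are merely *assumed* to be in the local encoding.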
>
> Regards
> Brian
> _________________________________________________________________________
> Brian McMahon                                tel: +44 1244 342878
> Research and Development Officer             fax: +44 1244 314888
> International Union of Crystallography       e-mail: bm@iucr.org
> 5 Abbey Square, Chester CH1 2HU, England
>
>
> On Tue, Sep 14, 2010 at 10:58:39AM -0400, Herbert J. Bernstein wrote:
>> One, hopefully relevant, aside -- ASCII files are not as unambiguous
>> as one might think. Depending on what localization one has on one's
>> computer, the code point 0x5c (one of the characters in the first
>> 127) will be shown as a reverse solidus, a yen currency symbol or a
>> won currency symbol. This is a holdover from the days of national
>> variants of the ISO character set, and shows no signs of going away
>> any time soon.
>>
>> This is _not_ the only such case, but it is one that impacts most
>> programming languages, including dREL, and existing CIF files,
>> including the PDB's mmCIF files.
>> =====================================================
>> Herbert J. Bernstein, Professor of Computer Science
>> Dowling College, Kramer Science Center, KSC 121
>> Idle Hour Blvd, Oakdale, NY, 11769
>>
>> +1-631-244-3035
>> yaya@dowling.edu
>> =====================================================
>>
>> On Tue, 14 Sep 2010, Herbert J. Bernstein wrote:
>>
>>> Dear Colleagues,
>>>
>>> To avoid any misunderstandings, rather than worrying about how
>>> we got to where we are, let us each just state a clear position.
>>> Here is mine:
>>>
>>> I favor CIF2 being stated in terms of UTF-8 for clarity, but
>>> not specifying any particular _mandatory_ encoding of a CIF2 file
>>> as long as there is a clearly agreed mechanism between the
>>> creator and consumer of a given CIF2 file as to how to faithfully
>>> transform the file between creator's and the consumer's encodings.
>>>
>>> I favor UTF-8 being the default encoding that any CIF2 creator
>>> should feel free to use without having to establish any prior
>>> agreement with consumers, and that all consumers should try
>>> to make arrangements to be able to read, either directly or
>>> via some conversion utility or service. If the consumers don't
>>> make such arrangements then there may be CIF2 files that they
>>> will not be able to read. If a producer creates a CIF2 in any
>>> encoding other than UTF8 then there may be consumers who have
>>> difficulty reading that CIF2.
>>>
>>> I favor the IUCr taking responsibility for collecting and
>>> disseminating information on particularly useful ways to go
>>> to and from UTF8 and/or other popular encodings.
>>>
>>> Regards,
>>> Herbert
>>> =====================================================
>>> Herbert J. Bernstein, Professor of Computer Science
>>> Dowling College, Kramer Science Center, KSC 121
>>> Idle Hour Blvd, Oakdale, NY, 11769
>>>
>>> +1-631-244-3035
>>> yaya@dowling.edu
>>> =====================================================
>>>
>>> On Tue, 14 Sep 2010, SIMON WESTRIP wrote:
>>>
>>>> I sense some common ground here with my previous post.
>>>>
>>>> The UTF8/16 pair could possibly be extended to any unicode
>>>> encoding that is unambiguously/inherently identifiable? The
>>>> 'local' encodings then encompass everything else?
>>>>
>>>> However, I think we've yet to agree that anything but UTF8 is to
>>>> be allowed at all. We have a draft spec that stipulates UTF8, but
>>>> I infer from this thread that there is scope to relax that
>>>> restriction. The views seem to range from at least 'leaving the
>>>> door open' in recognition of the variety of encodings available,
>>>> to advocating that the encoding should not be part of the
>>>> specification at all, and it will be down to developers to
>>>> accommodate/influence user practice.
I'm in favour
>>>> of a default encoding or maybe any encoding that is inherently
>>>> identifiable, and providing a means to declare other encodings
>>>> (however untrustworthy the declaration may be, it would at least
>>>> be available to conscientious users/developers), all documented in
>>>> the spec.
>>>>
>>>> Please forgive me if this summary is off the mark; my conclusion
>>>> is that there's a willingness to accommodate multiple encodings in
>>>> this (albeit very small) group. Given that we are starting from
>>>> the position of having a single encoding (agreed upon after much
>>>> earlier debate), I cannot see us performing a complete U-turn to
>>>> allow any (potentially unrecognizable) encoding as in CIF1, i.e.
>>>> without some specification of a canonical encoding or mechanisms
>>>> to identify/declare the encoding. On the other hand, I hope to see
>>>> a revised spec that isn't UTF8-only.
>>>>
>>>> To get to the point - is there any hope of reaching a compromise?
>>>>
>>>> Cheers
>>>>
>>>> Simon
>>>>
>>>> ____________________________________________________________________________
>>>> From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
>>>> To: Group for discussing encoding and content validation schemes
>>>> for CIF2 <cif2-encoding@iucr.org>
>>>> Sent: Monday, 13 September, 2010 19:52:26
>>>> Subject: Re: [Cif2-encoding] Splitting of imgCIF and other
>>>> sub-topics. .. .
>>>>
>>>> On Sunday, September 12, 2010 11:26 PM, James Hester wrote:
>>>> [...]
>>>>> To my mind, the encoding of plain CIF files remains an open
>>>>> issue. I do not view the mechanisms for managing file encoding
>>>>> that are provided by current OSs to be sufficiently robust,
>>>>> widespread or consistent that we can rely on developers or text
>>>>> editors respecting them [...].
>>>>
>>>> I agree that the encoding of plain CIF files remains an open issue.
>>>>
>>>> I confess I find your concerns there somewhat vague, especially to
>>>> the extent that they apply within the confines of a single
>>>> machine. Do your concerns extend to that level? If so, can you
>>>> provide an example or two of what you fear might go wrong in that
>>>> context?
>>>>
>>>> As Herb recently wrote, "Multiple encodings are a fact of life
>>>> when working with text." CIF2 looks like text, it feels like text,
>>>> and despite some exotic spice, it tastes like text -- even in
>>>> UTF-8 only form. We cannot pretend that we're dealing with
>>>> anything other than text. We need to accept, therefore, that no
>>>> matter what we do, authors and programmers will need to account
>>>> for multiple encodings, one way or another. The format
>>>> specification cannot relieve either group of that responsibility.
>>>>
>>>> That doesn't necessarily mean, however, that CIF must follow the
>>>> XML model of being self-defining with regard to text encoding.
>>>> Given CIF's various uses, we gain little of practical value in
>>>> this area by defining CIF2 as UTF-8 only, and perhaps equally
>>>> little by defining required decorations for expressing random
>>>> encodings. Moreover, the best reading of CIF1 is that it relies on
>>>> the *local* text conventions, whatever they may be, which is quite
>>>> a different thing than handling all text conventions that might
>>>> conceivably be employed.
>>>>
>>>> With that being the case, I don't think it needful for CIF2 in any
>>>> given environment to endorse foreign encoding conventions other
>>>> than UTF-8. CIF2 reasonably could endorse UTF-16 as well, though,
>>>> as that cannot be confused with any ASCII-compatible encoding.
>>>> Allowing UTF-16 would open up useful possibilities both for imgCIF
>>>> and for future uses not yet conceived. Additionally, since CIF is
>>>> text I still think it important for CIF2 to endorse the default
>>>> text conventions of its operating environment.
>>>>
>>>> Could we agree on those three as allowed encodings? Consider,
>>>> given that combination of supported alternatives and no extra
>>>> support from the spec, how various parties might deal with the
>>>> unavoidable encoding issue. Here are some of the more reasonable
>>>> alternatives I see:
>>>>
>>>> 1. Bulk CIF processors and/or repositories such as Chester, CCDC,
>>>> and PDB:
>>>>
>>>>   Option a) accept and provide only UTF-8 and/or UTF-16 CIFs. The
>>>> responsibility to perform any needed transcoding is on the other
>>>> party. This is just as it might be with UTF-8-only.
>>>>
>>>>   Option b) in addition to supporting UTF-8 and/or UTF-16, support
>>>> other encodings by allowing users to explicitly specify them as
>>>> part of the submission/retrieval process. The processor/repository
>>>> would either ensure the CIF is properly labeled, or, better,
>>>> transcode it to UTF-8[/16]. This also is just as it might be with
>>>> UTF-8 only.
>>>>
>>>> 2. Programs and Libraries:
>>>>
>>>>   Option a) On input, detect encoding by checking first for
>>>> UTF-16, assuming UTF-8 if not UTF-16, and falling back to default
>>>> text conventions if a UTF-8 decoding error is encountered. On
>>>> output, encode as directed by the user (among the two/three
>>>> options), defaulting to the input encoding when that is available
>>>> and feasible. These would be desirable behaviors even in the UTF-8
>>>> only case, especially in a mixed CIF1/CIF2 environment, but they
>>>> do exceed UTF-8-only requirements.
>>>>
>>>>   Option b) Require input and produce output according to a fixed
>>>> set of conventions (whether local text conventions or UTF-8/16).
>>>> The program user is responsible for any needed transcoding. This
>>>> would be sufficient for the CIF2, UTF-8 only case, and is typical
>>>> in the CIF1 case; those differ, however, in which text conventions
>>>> would be assumed.
>>>>
>>>> 3. Users/Authors:
>>>>
>>>> 3.1. Creating / editing CIFs
>>>>   No change from current practice is needed, but users might
>>>> choose to store CIFs in UTF-8[/16] form. This is just as it would
>>>> likely be under UTF-8 only.
>>>>
>>>> 3.2. Transferring CIFs
>>>>   Unless an alternative agreement on encoding can be reached by
>>>> some means, the transferor must ensure the CIF is encoded in
>>>> UTF-8[/16]. This differs from the UTF-8-only case only inasmuch as
>>>> UTF-16 is (maybe) allowed.
>>>>
>>>> 3.3. Receiving CIFs
>>>>   The receiver may reasonably demand that the CIF be provided in
>>>> UTF-8[/16] form. He should *expect* that form unless some
>>>> alternative agreement is established. Any desired transcoding from
>>>> UTF-8[/16] to an alternative encoding is the user's
>>>> responsibility. Again, this is not significantly different from
>>>> the UTF-8 only case.
>>>>
>>>> A driving force in many of those cases is the well-understood
>>>> (especially here!) fact that different systems cannot be relied
>>>> upon to share text conventions, thus leaving UTF-8[/16] as the
>>>> only available general-purpose medium of exchange. At the same
>>>> time, local conventions are not forbidden from use where they can
>>>> be relied upon -- most notably, within the same computer. Even if
>>>> end-users, as a group, do not appreciate those details, we can
>>>> ensure via the spec that CIF2 implementers do. That's sufficient.
>>>>
>>>> So, if pretty much all my expected behavior under UTF-8[/16]+local
>>>> is the same as it would be under UTF-8-only, then why prefer the
>>>> former? Because under UTF-8[/16]+local, all the behavior described
>>>> is conformant to the spec, whereas under UTF-8 only, a significant
>>>> proportion is not. If the standard adequately covers these
>>>> behaviors then we can expect more uniform support. Moreover, this
>>>> bears directly on community acceptance of the spec.
If flaunting the spec with respect to encoding becomes common, then
>>>> the spec will have failed, at least in that area. Having failed
>>>> in one area, it is more likely to fail in others.
>>>>
>>>> Regards,
>>>>
>>>> John
>>>> --
>>>> John C. Bollinger, Ph.D.
>>>> Department of Structural Biology
>>>> St. Jude Children's Research Hospital
>>>>
>>>> Email Disclaimer: www.stjude.org/emaildisclaimer
>
> _______________________________________________
> cif2-encoding mailing list
> cif2-encoding@iucr.org
> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>
_______________________________________________
cif2-encoding mailing list
cif2-encoding@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif2-encoding
- References:
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics (James Hester)
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics (Herbert J. Bernstein)
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics (James Hester)
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics (Herbert J. Bernstein)
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. . (Bollinger, John C)
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. . (James Hester)
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . (Bollinger, John C)
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . (SIMON WESTRIP)
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . (Herbert J. Bernstein)
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . (Herbert J. Bernstein)
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . (Brian McMahon)