
Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .

Hi Brian: I think that John B and Simon have answered your questions
adequately (i.e. you can forget about reliable autodetection of
non-Unicode encodings).  Chester is somewhat shielded from encoding
mixups by virtue of the fact that you are in contact with the author
of the CIF, who will have opportunities to catch encoding errors at
some stage prior to the manuscript being finalised.  That said, the
less potential for mixup between multiple authors prior to submission,
the better.

For my part, I think the IUCr could handle manuscript submissions as follows:

(i) CheckCIF should report non-UTF8 encoding as a top-level warning,
with the warning message pointing to an IUCr-maintained webpage which
describes how to save/convert files to UTF8 encoding for a range of
popular editors.
(ii) The standard should give as little encouragement to non-UTF8
encodings as possible, to reduce the number of non-UTF8 submissions in
the first place.
(iii) UTF8 introduction can be staged relatively slowly, starting by
allowing it in a few non-essential datanames (e.g. defining
_author_name_native_script or some such).  Let's remember that on day 1
everything can still be ASCII, as the dictionaries will be able to
restrict character sets to ASCII.
(iv) Authors are not required to choose an encoding upon submission,
but if non-UTF8 is detected, the authors will automatically be
presented with a PDF version of their CIF manuscript and advised to
check it carefully, especially non-ASCII characters (Greek symbols!).
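
As a rough illustration of the check behind (i) and (iv), here is a minimal Python sketch. The function names are made up for this example and nothing here is CheckCIF's actual code. It relies only on strict UTF-8 decoding; note that a pure-ASCII file is also valid in most legacy encodings, so "ascii" means only that no encoding question arises.

```python
def classify_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8' or 'non-utf-8' for the raw bytes of a file."""
    try:
        text = data.decode("utf-8")  # strict: any invalid sequence raises
    except UnicodeDecodeError:
        return "non-utf-8"
    return "ascii" if text.isascii() else "utf-8"


def non_ascii_characters(text: str) -> list:
    """List (line number, character) pairs an author should double-check."""
    return [
        (lineno, ch)
        for lineno, line in enumerate(text.splitlines(), start=1)
        for ch in line
        if ord(ch) > 127
    ]
```

A "non-utf-8" result would trigger the warning in (i); the second helper gives the list of characters worth pointing out to the author in (iv).
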

On Thu, Sep 16, 2010 at 12:11 AM, SIMON WESTRIP
<simonwestrip@btinternet.com> wrote:
> Hi Brian: I don't know of any standard text-encoding identifiers and
> transcoders.
>
> There are certainly SDKs out there that provide text codecs to
> read/write data; the trick is identifying the original encoding in
> order to select the codec.  Interactive applications might resort to
> prompting the user to confirm the encoding by presenting them with a
> view of the text and a list of encodings - the user can toggle through
> the encodings until the document is rendered correctly (e.g. MS Word
> does this).  Obviously this is not ideal, but is something I've been
> thinking about as part of the web upload process.
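
The interactive confirmation described here can be sketched in a few lines of Python. The candidate list and the function name are assumptions chosen for illustration; a real upload form would need its own list and a way to display the previews.

```python
# Hypothetical shortlist of encodings to offer the user.
CANDIDATE_ENCODINGS = ["utf-8", "latin-1", "cp1252", "iso8859_7", "mac_greek"]


def preview_decodings(data: bytes, encodings=CANDIDATE_ENCODINGS) -> dict:
    """Decode the same bytes under each candidate encoding.

    Encodings that cannot decode the bytes map to None, so they can be
    dropped from the list shown to the user.
    """
    previews = {}
    for name in encodings:
        try:
            previews[name] = data.decode(name)
        except UnicodeDecodeError:
            previews[name] = None
    return previews
```

For the bytes of "α = 90" saved as ISO 8859-7, the "iso8859_7" preview shows the Greek letter, the "latin-1" preview shows "á = 90", and "utf-8" is ruled out. Only the author can say which rendering was intended.
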
>
> In addition, there's documentation on heuristic approaches to detecting
> encoding (as employed by browsers - indeed Mozilla makes its source
> available).  I don't think this sort of autodetection will prove useful
> though, and actually may scupper an interactive encoding confirmation
> mechanism as described above!
>
> Cheers
>
> Simon
>
>
> ________________________________
> From: Brian McMahon <bm@iucr.org>
> To: Group for discussing encoding and content validation schemes for CIF2
> <cif2-encoding@iucr.org>
> Sent: Wednesday, 15 September, 2010 13:39:27
> Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
>
> I have said little or nothing on this list so far, because I'm
> not sure that I can add anything that's of concrete use. I've read
> the many contributions, all of them carefully thought through, and
> I still see both sides (actually, all sides) of the arguments. I
> am disinterested in the eventual outcome (but not "uninterested").
>
> But, whatever the outcome, the IUCr will undoubtedly receive files
> *intended* by the authors as CIF submissions, that come in a variety of
> character-set encodings. For the most part, we will want to accept
> these without asking the author what the encoding was, not least
> because the typical author will have no idea (and increasingly,
> our typical author will struggle to understand the questions we are
> posing since English is not his or her native language - or perhaps we
> will struggle to understand the reply).
>
> So my concerns are:
>
> (1) how easily can we determine the correct encoding with which the
> file was generated;
>
> (2) how easily can we convert it into our canonical encoding(s) for
> in-house production, archiving and delivery?
>
> First a few comments on that "canonical encoding(s)". Simon and I have
> both been happy enough to consider UTF-8 as a lingua franca, since we
> perceive it as a reasonably widespread vehicle for carrying a large
> (multilingual) character set, and that is widely supported by many
> generic text processors and platforms. However, many of our existing
> CIF applications may choke on a UTF-8 file, and we may need to
> create working formats that are pure ASCII. I would also prefer to
> retain a single archival version of a CIF (well, ideally several
> identical copies for redundancy, but nonetheless a single *version*),
> from which alternative encodings that we choose to support for
> delivery from the archive can be generated on the fly.
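
One way to picture a pure-ASCII working copy generated on the fly from a UTF-8 archival file is sketched below. The backslash escape form is simply what Python's "backslashreplace" error handler produces. It is an assumption for this example and not a CIF convention.

```python
def ascii_working_copy(utf8_bytes: bytes) -> bytes:
    """Derive a pure-ASCII working copy from UTF-8 archival bytes.

    Non-ASCII characters become backslash escapes such as \\u03b2.
    """
    text = utf8_bytes.decode("utf-8")  # the archive copy must be valid UTF-8
    return text.encode("ascii", errors="backslashreplace")
```

Files that are already pure ASCII pass through unchanged, so the archival version stays the single source for every derived format.
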
>
> So, really, the desire would be to have standalone applications that
> can convert between character encodings on the fly. Does anyone know
> of the general availability of such tools? The more reliable
> conversions that can be made, the more relaxed we are about accepting
> multiple input encodings. I have to say that a very quick Google
> search hasn't yet thrown up much encouragement here.
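
Once the source encoding is known, the conversion itself is short in most languages. Here is a minimal Python sketch with a made-up function name; it is strict in both directions, so it fails loudly instead of silently dropping characters.

```python
def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Re-encode bytes from one named encoding to another."""
    text = data.decode(source)   # raises if bytes are invalid in `source`
    return text.encode(target)   # raises if `target` lacks a character
```

The difficult part remains choosing the right `source`; the conversion cannot detect a wrong guess that happens to decode without error.
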
>
> Now, back to (1). In similar vein, do you know of any standalone
> utilities that help in determining a text-file character encoding?
>
> [I'm happy to be educated, ideally off-list, in whether
> Content-Encoding negotiation in web forms can help here, since many
> of our CIF submissions come by that route, but I'm more interested in
> the general question of how you determine the encoding of a text file
> that you just happen to find sitting on the filesystem.]
>
> One utility we use heavily in the submission system is "file"
> (http://freshmeat.net/projects/file - we currently use version 4.26
> with an augmented and slightly modified magic file). This is rather
> quiet about different character encodings, though I notice the magic
> file distributed with the more recent version 5.04 does have a
> "Unicode" section, namely:
>
>
>     #------------------------------------------------------------------------------
>     # $File: unicode,v 1.5 2009/09/19 16:28:13 christos Exp $
>     # Unicode:  BOM prefixed text files - Adrian Havill <havill@turbolinux.co.jp>
>     # GRR: These types should be recognised in file_ascmagic so these
>     # encodings can be treated by text patterns.
>     # Missing types are already dealt with internally.
>     #
>     0      string  +/v8                    Unicode text, UTF-7
>     0      string  +/v9                    Unicode text, UTF-7
>     0      string  +/v+                    Unicode text, UTF-7
>     0      string  +/v/                    Unicode text, UTF-7
>     0      string  \335\163\146\163        Unicode text, UTF-8-EBCDIC
>     0      string  \376\377\000\000        Unicode text, UTF-32, big-endian
>     0      string  \377\376\000\000        Unicode text, UTF-32, little-endian
>     0      string  \016\376\377            Unicode text, SCSU (Standard Compression Scheme for Unicode)
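
The same kind of signature test is easy to sketch in Python. This version uses the byte-order-mark constants from the standard `codecs` module and does not transcribe the magic entries above; in particular, `codecs.BOM_UTF32_BE` is 00 00 FE FF. The function name is an assumption for this example.

```python
import codecs

# UTF-32 entries come first: the UTF-32 little-endian BOM (FF FE 00 00)
# begins with the UTF-16 little-endian BOM (FF FE), so order matters.
BOM_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if absent."""
    for bom, name in BOM_SIGNATURES:
        if data.startswith(bom):
            return name
    return None
```

A result of None says nothing about the encoding: BOM-less UTF-8 and every legacy 8-bit encoding fall into that case.
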
>
> Interestingly, the "animation" module of this new magic file
> conflicts with other possible UTF encodings:
>
>     # MPA, M1A
>     # updated by Joerg Jenderek
>     # GRR the original test are too common for many DOS files, so test 32 <= kbits <= 448
>     # GRR this test is still too general as it catches a BOM of UTF-16 files (0xFFFE)
>     # FIXME: Almost all little endian UTF-16 text with BOM are clobbered by these entries
>
>
> And, by the way, the "augmented" magic file we use (the one distributed as
> part of the KDE desktop distribution) already includes this section:
>
>     # chemical/x-cif 50
>     0    string    #\#CIF_1.1
>     >10    byte    9    chemical/x-cif
>     >10    byte    10    chemical/x-cif
>     >10    byte    13    chemical/x-cif
>
>
>
> It seems to me that without some reasonably reliable discriminator,
> John's endorsement of support for "local" encodings will allow files
> to leak out into the wider world where they can't at all easily be
> handled or even properly identified. (Though, as many have argued
> persuasively, "forbidding" them is not going to prevent such files
> from being created, and possibly even used fruitfully within local
> environments.)
>
> Remember that many CIFs will come to us in the end after passage across
> many heterogeneous systems. I referred in a previous post to my own
> daily working environment - Solaris, Linux and Windows systems linked
> by a variety of X servers, X emulators, NFS and SMB cross-mounted
> filesystems, clipboards communicating with diverse applications
> and OSes running different default locales...
> [Incidentally, hasn't SMB now been superseded by "CIFS" !]
>
> Perhaps I'm just perverse; but I doubt that I'm quite unique. We'll
> also see files shuttled between co-authors with different languages,
> locales, OSes, and exchanged via email, ftp, USB stick etc.
> "Corruptions" will inevitably be introduced in these interchanges -
> sometimes subtle ones. For example, outside the CIF world altogether,
> we see Greek characters change their identity when we run some files
> through a PDF -> PostScript -> PDF cycle (all using software from the
> same software house, Adobe). The reason has to do with differences in
> Windows and Mac encodings, and the failure of the Acrobat software to
> track and maintain the character mappings through such a cycle.
>
> Well, I'll stop here, because in spite of my best intentions I don't
> think I'm moving the debate along very much, and I apologise if
> everything here has already been so obvious as not to need saying.
>
> I'll defer further comment until I've learned if there are already
> standard text-encoding identifiers and transcoders.
>
> Regards
> Brian
> _________________________________________________________________________
> Brian McMahon                                      tel: +44 1244 342878
> Research and Development Officer                    fax: +44 1244 314888
> International Union of Crystallography            e-mail:  bm@iucr.org
> 5 Abbey Square, Chester CH1 2HU, England
>
>
> On Tue, Sep 14, 2010 at 10:58:39AM -0400, Herbert J. Bernstein wrote:
>> One, hopefully relevant, aside -- ascii files are not as
>> unambiguous as one might think.  Depending on what localization
>> one has on one's computer, the code point 0x5c (one of the
>> characters in the first 127) will be shown as a reverse
>> solidus, a yen currency symbol or a won currency symbol.  This
>> is a holdover from the days of national variants of the ISO
>> character set, and shows no signs of going away any time soon.
>>
>> This is _not_ the only such case, but it is one that impacts
>> most programming languages, including dREL, and existing CIF
>> files, including the PDB's mmCIF files.
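
The effect can be shown with a small hand-built table. The variant labels and the function below are illustrative only and are not a real codec; the point is that the stored byte never changes, only the glyph drawn for it.

```python
# Glyph shown for code point 0x5C under three national variants
# (labels are informal names chosen for this example).
GLYPH_FOR_0X5C = {
    "US-ASCII": "\\",        # reverse solidus
    "ISO646-JP": "\u00a5",   # yen sign
    "ISO646-KR": "\u20a9",   # won sign
}


def render(data: bytes, variant: str) -> str:
    """Show how 7-bit text would appear under the given variant."""
    text = data.decode("ascii")  # the bytes themselves are plain ASCII
    return text.replace("\\", GLYPH_FOR_0X5C[variant])
```

The same bytes for `a\b` display as `a¥b` or `a₩b` depending on the variant, which matters wherever the backslash has syntactic meaning.
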
>> =====================================================
>>  Herbert J. Bernstein, Professor of Computer Science
>>    Dowling College, Kramer Science Center, KSC 121
>>        Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                  +1-631-244-3035
>>                  yaya@dowling.edu
>> =====================================================
>>
>> On Tue, 14 Sep 2010, Herbert J. Bernstein wrote:
>>
>>> Dear Colleagues,
>>>
>>>  To avoid any misunderstandings, rather than worrying about how
>>> we got to where we are, let us each just state a clear position.
>>> Here is mine:
>>>
>>>  I favor CIF2 being stated in terms of UTF-8 for clarity, but
>>> not specifying any particular _mandatory_ encoding of a CIF2 file
>>> as long as there is a clearly agreed mechanism between the
>>> creator and consumer of a given CIF2 file as to how to faithfully
>>> transform the file between creator's and the consumer's encodings.
>>>
>>>  I favor UTF-8 being the default encoding that any CIF2 creator
>>> should feel free to use without having to establish any prior
>>> agreement with consumers, and that all consumers should try
>>> to make arrangements to be able to read, either directly or
>>> via some conversion utility or service.  If the consumers don't
>>> make such arrangements then there may be CIF2 files that they
>>> will not be able to read.  If a producer creates a CIF2 in any
>>> encoding other than UTF8 then there may be consumers who have
>>> difficulty reading that CIF2.
>>>
>>>  I favor the IUCr taking responsibility for collecting and
>>> disseminating information on particularly useful ways to go
>>> to and from UTF8 and/or other popular encodings.
>>>
>>>  Regards,
>>>    Herbert
>>> =====================================================
>>> Herbert J. Bernstein, Professor of Computer Science
>>>  Dowling College, Kramer Science Center, KSC 121
>>>        Idle Hour Blvd, Oakdale, NY, 11769
>>>
>>>                +1-631-244-3035
>>>                yaya@dowling.edu
>>> =====================================================
>>>
>>> On Tue, 14 Sep 2010, SIMON WESTRIP wrote:
>>>
>>>> I sense some common ground here with my previous post.
>>>>
>>>> The UTF8/16 pair could possibly be extended to any Unicode encoding
>>>> that is unambiguously/inherently identifiable?
>>>> The 'local' encodings then encompass everything else?
>>>>
>>>> However, I think we've yet to agree that anything but UTF8 is to be
>>>> allowed at all. We have a draft spec that stipulates UTF8, but I infer
>>>> from this thread that there is scope to relax that restriction.
>>>> The views seem to range from at least 'leaving the door open' in
>>>> recognition of the variety of encodings available, to advocating that
>>>> the encoding should not be part of the specification at all, and it
>>>> will be down to developers to accommodate/influence user practice. I'm
>>>> in favour of a default encoding or maybe any encoding that is
>>>> inherently identifiable, and providing a means to declare other
>>>> encodings (however untrustworthy the declaration may be, it would at
>>>> least be available to conscientious users/developers), all documented
>>>> in the spec.
>>>>
>>>> Please forgive me if this summary is off the mark; my conclusion is
>>>> that there's a willingness to accommodate multiple encodings in this
>>>> (albeit very small) group. Given that we are starting from the position
>>>> of having a single encoding (agreed upon after much earlier debate), I
>>>> cannot see us performing a complete U-turn to allow any (potentially
>>>> unrecognizable) encoding as in CIF1, i.e. without some specification of
>>>> a canonical encoding or mechanisms to identify/declare the encoding. On
>>>> the other hand, I hope to see a revised spec that isn't UTF8 only.
>>>>
>>>> To get to the point - is there any hope of reaching a compromise?
>>>>
>>>> Cheers
>>>>
>>>> Simon
>>>>
>>>>
>>>>
>>>> ____________________________________________________________________________
>>>> From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
>>>> To: Group for discussing encoding and content validation schemes for
>>>> CIF2 <cif2-encoding@iucr.org>
>>>> Sent: Monday, 13 September, 2010 19:52:26
>>>> Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .
>>>>
>>>>
>>>> On Sunday, September 12, 2010 11:26 PM, James Hester wrote:
>>>> [...]
>>>>> To my mind, the encoding of plain CIF files remains an open issue.  I
>>>>> do not view the mechanisms for managing file encoding that are
>>>>> provided by current OSs to be sufficiently robust, widespread or
>>>>> consistent that we can rely on developers or text editors respecting
>>>>> them [...].
>>>>
>>>> I agree that the encoding of plain CIF files remains an open issue.
>>>>
>>>> I confess I find your concerns there somewhat vague, especially to the
>>>> extent that they apply within the confines of a single machine.  Do
>>>> your concerns extend to that level?  If so, can you provide an example
>>>> or two of what you fear might go wrong in that context?
>>>>
>>>> As Herb recently wrote, "Multiple encodings are a fact of life when
>>>> working with text."  CIF2 looks like text, it feels like text, and
>>>> despite some exotic spice, it tastes like text -- even in UTF-8 only
>>>> form.  We cannot pretend that we're dealing with anything other than
>>>> text.  We need to accept, therefore, that no matter what we do, authors
>>>> and programmers will need to account for multiple encodings, one way or
>>>> another.  The format specification cannot relieve either group of that
>>>> responsibility.
>>>>
>>>> That doesn't necessarily mean, however, that CIF must follow the XML
>>>> model of being self-defining with regard to text encoding.  Given CIF's
>>>> various uses, we gain little of practical value in this area by
>>>> defining CIF2 as UTF-8 only, and perhaps equally little by defining
>>>> required decorations for expressing random encodings.  Moreover, the
>>>> best reading of CIF1 is that it relies on the *local* text conventions,
>>>> whatever they may be, which is quite a different thing than handling
>>>> all text conventions that might conceivably be employed.
>>>>
>>>> With that being the case, I don't think it needful for CIF2 in any
>>>> given environment to endorse foreign encoding conventions other than
>>>> UTF-8.  CIF2 reasonably could endorse UTF-16 as well, though, as that
>>>> cannot be confused with any ASCII-compatible encoding.  Allowing UTF-16
>>>> would open up useful possibilities both for imgCIF and for future uses
>>>> not yet conceived.  Additionally, since CIF is text I still think it
>>>> important for CIF2 to endorse the default text conventions of its
>>>> operating environment.
>>>>
>>>> Could we agree on those three as allowed encodings?  Consider, given
>>>> that combination of supported alternatives and no extra support from
>>>> the spec, how might various parties deal with the unavoidable encoding
>>>> issue.  Here are some of the more reasonable alternatives I see:
>>>>
>>>> 1. Bulk CIF processors and/or repositories such as Chester, CCDC, and PDB:
>>>>
>>>> Option a) accept and provide only UTF-8 and/or UTF-16 CIFs.  The
>>>> responsibility to perform any needed transcoding is on the other
>>>> party.  This is just as it might be with UTF-8-only.
>>>>
>>>> Option b) in addition to supporting UTF-8 and/or UTF-16, support other
>>>> encodings by allowing users to explicitly specify them as part of the
>>>> submission/retrieval process.  The processor / repository would either
>>>> ensure the CIF is properly labeled, or, better, transcode it to
>>>> UTF-8[/16].  This also is just as it might be with UTF-8 only.
>>>>
>>>> 2. Programs and Libraries:
>>>>
>>>> Option a) On input, detect encoding by checking first for UTF-16,
>>>> assuming UTF-8 if not UTF-16, and falling back to default text
>>>> conventions if a UTF-8 decoding error is encountered.  On output,
>>>> encode as directed by the user (among the two/three options),
>>>> defaulting to the input encoding when that is available and feasible.
>>>> These would be desirable behaviors even in the UTF-8 only case,
>>>> especially in a mixed CIF1/CIF2 environment, but they do exceed
>>>> UTF-8-only requirements.
>>>>
>>>> Option b) Require input and produce output according to a fixed set of
>>>> conventions (whether local text conventions or UTF-8/16).  The program
>>>> user is responsible for any needed transcoding.  This would be
>>>> sufficient for the CIF2, UTF-8 only case, and is typical in the CIF1
>>>> case; those differ, however, in which text conventions would be
>>>> assumed.
>>>>
>>>> 3. Users/Authors:
>>>> 3.1. Creating / editing CIFs
>>>> No change from current practice is needed, but users might choose to
>>>> store CIFs in UTF-8[/16] form.  This is just as it would likely be
>>>> under UTF-8 only.
>>>>
>>>> 3.2. Transferring CIFs
>>>> Unless an alternative agreement on encoding can be reached by some
>>>> means, the transferor must ensure the CIF is encoded in UTF-8[/16].
>>>> This differs from the UTF-8-only case only inasmuch as UTF-16 is
>>>> (maybe) allowed.
>>>>
>>>> 3.3. Receiving CIFs
>>>> The receiver may reasonably demand that the CIF be provided in
>>>> UTF-8[/16] form.  He should *expect* that form unless some alternative
>>>> agreement is established.  Any desired transcoding from UTF-8[/16] to
>>>> an alternative encoding is the user's responsibility.  Again, this is
>>>> not significantly different from the UTF-8 only case.
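
The input side of option 2(a) above (UTF-16 first, then UTF-8, then the local default) can be sketched as follows. Recognising UTF-16 by its byte-order mark is an assumption of this sketch, since the proposal does not say how UTF-16 is to be detected, and the function and parameter names are made up.

```python
import codecs
import locale


def decode_cif(data: bytes, local_encoding=None):
    """Return (text, encoding used), trying UTF-16, then UTF-8, then local."""
    # Assumption: a UTF-16 file announces itself with a byte-order mark.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return data.decode("utf-16"), "utf-16"  # the codec consumes the BOM
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Fall back to the default text convention of this environment.
        fallback = local_encoding or locale.getpreferredencoding(False)
        return data.decode(fallback), fallback
```

Where the local default is itself UTF-8, the fallback raises the same decoding error again, which is probably the right outcome: the file is then in neither an endorsed nor a local encoding.
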
>>>>
>>>>
>>>> A driving force in many of those cases is the well-understood
>>>> (especially here!) fact that different systems cannot be relied upon
>>>> to share text conventions, thus leaving UTF-8[/16] as the only
>>>> available general-purpose medium of exchange.  At the same time, local
>>>> conventions are not forbidden from use where they can be relied upon
>>>> -- most notably, within the same computer.  Even if end-users, as a
>>>> group, do not appreciate those details, we can ensure via the spec
>>>> that CIF2 implementers do.  That's sufficient.
>>>>
>>>> So, if pretty much all my expected behavior under UTF-8[/16]+local is
>>>> the same as it would be under UTF-8-only, then why prefer the former?
>>>> Because under UTF-8[/16]+local, all the behavior described is
>>>> conformant to the spec, whereas under UTF-8 only, a significant
>>>> proportion is not.  If the standard adequately covers these behaviors
>>>> then we can expect more uniform support.  Moreover, this bears
>>>> directly on community acceptance of the spec.  If flouting the spec
>>>> with respect to encoding becomes common, then the spec will have
>>>> failed, at least in that area.  Having failed in one area, it is more
>>>> likely to fail in others.
>>>>
>>>>
>>>> Regards,
>>>>
>>>> John
>>>> --
>>>> John C. Bollinger, Ph.D.
>>>> Department of Structural Biology
>>>> St. Jude Children's Research Hospital
>>>>
>>>> Email Disclaimer:  www.stjude.org/emaildisclaimer
> _______________________________________________
> cif2-encoding mailing list
> cif2-encoding@iucr.org
> http://scripts.iucr.org/mailman/listinfo/cif2-encoding



-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148