
Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics

Hello Brian,

On Wednesday, September 15, 2010 7:39 AM, Brian McMahon wrote:

[...]

>But, whatever the outcome, the IUCr will undoubtedly receive files
>*intended* by the authors as CIF submissions, that come in a variety of
>character-set encodings. For the most part, we will want to accept
>these without asking the author what the encoding was, not least
>because the typical author will have no idea (and increasingly,
>our typical author will struggle to understand the questions we are
>posing since English is not his or her native language - or perhaps we
>will struggle to understand the reply).
>
>So my concerns are:
>
>(1) how easily can we determine the correct encoding with which the
>file was generated;

In general, it is not possible to do this.  These practical matters bear on the issue:

a) If the authors use only ASCII characters, then in most cases the actual encoding either (i) coincides with UTF-8 for the file's contents (ASCII is a proper subset of UTF-8), or (ii) is autodetectable

b) If the authors put literal non-ASCII characters in their CIF, then UTF-8 and UTF-16 variants (and UTF-32 variants, though these are rarely used) could be autodetected with excellent reliability, but these are typically *not* the default encoding in current computing environments.  Other encodings cannot reliably be distinguished, though one might attempt to guess based on geographic origin of the CIF and/or natural language text in certain data items.  That's not very satisfactory.
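The autodetection described in (b) can be sketched roughly as follows.  This is a minimal illustration (the function name is mine, not part of any CIF tooling): check for Unicode byte-order marks first, then test whether the bytes decode as strict UTF-8; anything else falls into the unreliable "guess" territory discussed above.

```python
def sniff_encoding(data: bytes) -> str:
    """Detect UTF-8/UTF-16/UTF-32 with high reliability; report anything else as unknown."""
    # Longer BOMs must be tested before their prefixes (UTF-32-LE begins with the UTF-16-LE BOM).
    boms = [
        (b"\xef\xbb\xbf", "utf-8-sig"),
        (b"\xff\xfe\x00\x00", "utf-32-le"),
        (b"\x00\x00\xfe\xff", "utf-32-be"),
        (b"\xff\xfe", "utf-16-le"),
        (b"\xfe\xff", "utf-16-be"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8")  # strict decoding: fails on any invalid byte sequence
        return "utf-8"        # also covers pure-ASCII files, case (a) above
    except UnicodeDecodeError:
        return "unknown"      # legacy 8-bit encodings are indistinguishable from one another

print(sniff_encoding("résolution".encode("utf-8")))   # utf-8
print(sniff_encoding("résolution".encode("latin-1"))) # unknown
```

Note that a Latin-1 file containing only ASCII is reported as UTF-8, which is harmless: the bytes are identical either way.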

>(2) how easily can we convert it into our canonical encoding(s) for
>in-house production, archiving and delivery?

If a file's encoding is known, then transcoding it is easy.  The only potential issue is if the result encoding does not have codes for some of the input characters, but in practice, this is not an issue for UTF-8 (or UTF-16 or UTF-32) as the result encoding.

If a file's encoding cannot reliably be determined, then correctly transcoding it is impossible.
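To illustrate how easy the known-encoding case is (a sketch; the function name is mine): decode with the known source encoding and re-encode in the target.  Re-encoding to UTF-8 cannot lose data, because UTF-8 covers the entire Unicode repertoire; only a narrower target such as Latin-1 can fail.

```python
def transcode(data: bytes, src_encoding: str, dst_encoding: str = "utf-8") -> bytes:
    """Convert bytes from one known text encoding to another."""
    text = data.decode(src_encoding)   # raises UnicodeDecodeError if src_encoding is wrong
    return text.encode(dst_encoding)   # raises UnicodeEncodeError if dst cannot represent a character

latin1_cif = "_publ_author_name  'Café, J.'".encode("latin-1")
utf8_cif = transcode(latin1_cif, "latin-1")  # now valid UTF-8
```

The whole difficulty, as noted above, lies in knowing `src_encoding` in the first place.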

>First a few comments on that "canonical encoding(s)". Simon and I have
>both been happy enough to consider UTF-8 as a lingua franca
[...]
> However, many of our existing
>CIF applications may choke on a UTF-8 file, and we may need to
>create working formats that are pure ASCII.

If you need support for pure ASCII then you need some kind of general escape mechanism by which to represent non-ASCII characters in ASCII.  Something like Python's "\uXXXX" / "\UXXXXXXXX" escape syntax, perhaps.  Such a scheme could work equally well for non-ASCII characters in data names as for those in values, but there may be secondary considerations for data names.
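A hypothetical escape scheme along those lines might look like the following sketch (the braced `\u{XXXX}` form and both function names are mine, chosen to sidestep the length ambiguity of bare `\uXXXX`; this is an illustration, not a CIF2 proposal):

```python
import re

def escape_non_ascii(text: str) -> str:
    """Replace each non-ASCII character with a \\u{XXXX} escape."""
    return "".join(
        ch if ord(ch) < 128 else "\\u{%04X}" % ord(ch)
        for ch in text
    )

def unescape(text: str) -> str:
    """Invert escape_non_ascii, restoring the original characters."""
    return re.sub(
        r"\\u\{([0-9A-Fa-f]{4,6})\}",
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

print(escape_non_ascii("Ångström"))  # \u{00C5}ngstr\u{00F6}m
```

Any such scheme must also define how a literal backslash in the original text is represented, so that escaping remains reversible; that detail is omitted here.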

> I would also prefer to
>retain a single archival version of a CIF
[...]
>from which alternative encodings that we choose to support for
>delivery from the archive can be generated on the fly.
>
>So, really, the desire would be to have standalone applications that
>can convert between character encodings on the fly. Does anyone know
>of the general availability of such tools? The more reliable
>conversions that can be made, the more relaxed we are about accepting
>multiple input encodings. I have to say that a very quick Google
>search hasn't yet thrown up much encouragement here.

I'm not sure about specific commercially / openly available transcoders, but it's a relatively easy problem.  I can write a simple one in under half an hour that would handle a large proportion of what you want -- with the exception of the problem of representing non-ASCII characters in ASCII without data loss.  That's not really a hard problem either, once a specific solution is chosen, but it will require a custom program.

>Now, back to (1). In similar vein, do you know of any standalone
>utilities that help in determining a text-file character encoding?

The most prominent contender in this space appears to be Mozilla's encoding detection algorithm, which is available in library form and in a few programs.  I do not have personal experience with any of them.  All the algorithms and utilities I have researched are focused on HTML pages and rely on the input containing text in a natural language associated with the encoding -- the more natural-language text, the better.  I don't think any of them are well suited to CIF, and especially not to Acta Cryst submissions (the text must be English, which is mostly ASCII and therefore offers few encoding-specific cues).

>One utility we use heavily in the submission system is "file"
>(http://freshmeat.net/projects/file - we currently use version 4.26
>with an augmented and slightly modified magic file).

'File' implements a heuristic approach based on characteristic signatures associated with many file types.  It is much more reliable for some file types than for others.  Without going into detail, 'file' is simply not up to the task of discerning among various text encodings, notwithstanding its recognition of signatures for some varieties of Unicode text.

>It seems to me that without some reasonably reliable discriminator,
>John's endorsement of support for "local" encodings will allow files
>to leak out into the wider world where they can't at all easily be
>handled or even properly identified. (Though, as many have argued
>persuasively, "forbidding" them is not going to prevent such files
>from being created, and possibly even used fruitfully within local
>environments.)

Indeed, I claim that there is *no* way to prevent authors from sending "CIFs" encoded in any particular way out into the world.  It already happens.  Nothing the standard can say will make it stop.  The most the standard can do is to declare that such files aren't actually CIFs, but that wouldn't help anybody very much.  What *can* happen, however, is that recipients of such files can reject them if the encoding is ambiguous.

Were I setting policy for Acta Crystallographica with respect to CIF2, I would require CIF2 submissions to be encoded in UTF-8, or perhaps alternatively in UTF-16 if that ends up allowed by the standard.  If the IUCr wishes to be relaxed about _enforcement_ of such a policy in order to better serve authors, then fine, but that's a tricky proposition.  I expect that in this area it will be much easier to tell authors "do this" than to determine after the fact "what did you do?".  Chester will face that policy decision regardless of the standard's ultimate position on encoding.

>Remember that many CIFs will come to us in the end after passage across
>many heterogeneous systems.
[...]
>We'll
>also see files shuttled between co-authors with different languages,
>locales, OSes, and exchanged via email, ftp, USB stick etc.
>"Corruptions" will inevitably be introduced in these interchanges -
>sometimes subtle ones.
[...]

In no way do I discount those issues, but none of them can be solved by CIF2.  James and I differ on how much the standard can even influence those areas, but I think little.  Whatever influence it can exert, however, is at least as great under my UTF-8[/16]+local proposal as under UTF-8 only: neither grants blanket acceptance to encodings other than UTF-8 (and perhaps UTF-16), and UTF-8[/16]+local additionally focuses attention on the fact that local conventions do differ.


Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital




Email Disclaimer:  www.stjude.org/emaildisclaimer

_______________________________________________
cif2-encoding mailing list
cif2-encoding@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif2-encoding
