Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .

Hi Brian: I don't know of any standard text-encoding identifiers and transcoders.

There are certainly SDKs out there that provide text codecs to read/write data;
the trick is identifying the original encoding in order to select the codec.
Interactive applications might resort to prompting the user to confirm the
encoding by presenting a view of the text alongside a list of encodings: the
user toggles through the encodings until the document renders correctly
(MS Word does this, for example). Obviously this is not ideal, but it is
something I've been thinking about as part of the web upload process.
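
For illustration, a rough Python sketch of that "toggle until it looks
right" approach (the candidate list and the file name are placeholders,
not a recommendation):

    CANDIDATES = ['utf-8', 'utf-16', 'latin-1', 'cp1252']

    def previews(raw, limit=200):
        # Decode the same bytes under each candidate encoding and yield
        # a short preview; errors='replace' keeps undecodable bytes
        # visible rather than raising, so every candidate shows something.
        for enc in CANDIDATES:
            yield enc, raw[:limit].decode(enc, errors='replace')

    with open('upload.cif', 'rb') as f:    # hypothetical uploaded file
        raw = f.read()
    for enc, text in previews(raw):
        print('---', enc, '---')
        print(text)

The user then picks whichever preview renders correctly.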

In addition, there's documentation on heuristic approaches to detecting
encodings (as employed by browsers; indeed, Mozilla makes its detector
source available). I don't think this sort of autodetection will prove
useful though, and it may actually scupper an interactive encoding
confirmation mechanism of the kind described above!
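
Should anyone want to experiment anyway: Mozilla's detector has a
third-party Python port, chardet, wrapping the same heuristics. A
minimal sketch (the file name is hypothetical):

    import chardet    # third-party package: pip install chardet

    with open('example.cif', 'rb') as f:
        raw = f.read()
    # detect() returns a best guess plus a confidence score, e.g.
    # {'encoding': 'utf-8', 'confidence': 0.99}
    guess = chardet.detect(raw)
    print(guess['encoding'], guess['confidence'])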

Cheers

Simon



From: Brian McMahon <bm@iucr.org>
To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@iucr.org>
Sent: Wednesday, 15 September, 2010 13:39:27
Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .

I have said little or nothing on this list so far, because I'm
not sure that I can add anything that's of concrete use. I've read
the many contributions, all of them carefully thought through, and
I still see both sides (actually, all sides) of the arguments. I
am disinterested in the eventual outcome (but not "uninterested").

But, whatever the outcome, the IUCr will undoubtedly receive files
*intended* by the authors as CIF submissions that come in a variety of
character-set encodings. For the most part, we will want to accept
these without asking the author what the encoding was, not least
because the typical author will have no idea (and increasingly,
our typical author will struggle to understand the questions we are
posing since English is not his or her native language - or perhaps we
will struggle to understand the reply).

So my concerns are:

(1) how easily can we determine the correct encoding with which the
file was generated;

(2) how easily can we convert it into our canonical encoding(s) for
in-house production, archiving and delivery?

First a few comments on that "canonical encoding(s)". Simon and I have
both been happy enough to consider UTF-8 as a lingua franca, since we
perceive it as a reasonably widespread vehicle for carrying a large
(multilingual) character set, and one that is widely supported by many
generic text processors and platforms. However, many of our existing
CIF applications may choke on a UTF-8 file, and we may need to
create working formats that are pure ASCII. I would also prefer to
retain a single archival version of a CIF (well, ideally several
identical copies for redundancy, but nonetheless a single *version*),
from which alternative encodings that we choose to support for
delivery from the archive can be generated on the fly.
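
That on-the-fly re-encoding is straightforward wherever a codec library
is available; as a minimal Python sketch (the file name and the target
encoding are only placeholders):

    def deliver(archived_path, requested_encoding):
        # Read the single archival (UTF-8) version and re-encode it for
        # delivery; raises UnicodeEncodeError if the requested encoding
        # cannot represent every character in the file.
        with open(archived_path, encoding='utf-8') as f:
            text = f.read()
        return text.encode(requested_encoding)

    payload = deliver('archive/ab1234.cif', 'iso-8859-1')

The awkward cases are targets that cannot represent the full character
repertoire, where some substitution or escaping policy would be needed.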

So, really, the desire would be to have standalone applications that
can convert between character encodings on the fly. Does anyone know
of the general availability of such tools? The more reliable conversions
that can be made, the more relaxed we can be about accepting
multiple input encodings. I have to say that a very quick Google
search hasn't yet thrown up much encouragement here.

Now, back to (1). In similar vein, do you know of any standalone
utilities that help in determining a text-file character encoding?

[I'm happy to be educated, ideally off-list, in whether
Content-Encoding negotiation in web forms can help here, since many
of our CIF submissions come by that route, but I'm more interested in
the general question of how you determine the encoding of a text file
that you just happen to find sitting on the filesystem.]

One utility we use heavily in the submission system is "file"
(http://freshmeat.net/projects/file - we currently use version 4.26
with an augmented and slightly modified magic file). This is rather
quiet about different character encodings, though I notice the magic
file distributed with the more recent version 5.04 does have a
"Unicode" section, namely:

    #------------------------------------------------------------------------------
    # $File: unicode,v 1.5 2009/09/19 16:28:13 christos Exp $
    # Unicode:  BOM prefixed text files - Adrian Havill <havill@turbolinux.co.jp>
    # GRR: These types should be recognised in file_ascmagic so these
    # encodings can be treated by text patterns.
    # Missing types are already dealt with internally.
    #
    0      string  +/v8                    Unicode text, UTF-7
    0      string  +/v9                    Unicode text, UTF-7
    0      string  +/v+                    Unicode text, UTF-7
    0      string  +/v/                    Unicode text, UTF-7
    0      string  \335\163\146\163        Unicode text, UTF-8-EBCDIC
    0      string  \376\377\000\000        Unicode text, UTF-32, big-endian
    0      string  \377\376\000\000        Unicode text, UTF-32, little-endian
    0      string  \016\376\377            Unicode text, SCSU (Standard Compression Scheme for Unicode)
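
Such BOM-based rules are easy to replicate directly in code. A minimal
Python sketch (note that the UTF-32 little-endian BOM begins with the
UTF-16 little-endian BOM, so the longer patterns must be tested first):

    import codecs

    BOMS = [
        (codecs.BOM_UTF32_BE, 'utf-32-be'),
        (codecs.BOM_UTF32_LE, 'utf-32-le'),   # \377\376\000\000 above
        (codecs.BOM_UTF16_BE, 'utf-16-be'),
        (codecs.BOM_UTF16_LE, 'utf-16-le'),
        (codecs.BOM_UTF8,     'utf-8-sig'),   # the UTF-8 "signature"
    ]

    def sniff_bom(path):
        with open(path, 'rb') as f:
            head = f.read(4)
        for bom, name in BOMS:
            if head.startswith(bom):
                return name
        return None    # no BOM: the encoding must be determined otherwise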
   
Interestingly, the "animation" module of this new magic file
conflicts with other possible UTF encodings:

    # MPA, M1A
    # updated by Joerg Jenderek
    # GRR the original test are too common for many DOS files, so test 32 <= kbits <= 448
    # GRR this test is still too general as it catches a BOM of UTF-16 files (0xFFFE)
    # FIXME: Almost all little endian UTF-16 text with BOM are clobbered by these entries


And, by the way, the "augmented" magic file we use (the one distributed as
part of the KDE desktop distribution) already includes this section:

    # chemical/x-cif 50
    0    string    #\#CIF_1.1
    >10    byte    9    chemical/x-cif
    >10    byte    10    chemical/x-cif
    >10    byte    13    chemical/x-cif



It seems to me that without some reasonably reliable discriminator,
John's endorsement of support for "local" encodings will allow files
to leak out into the wider world where they can't at all easily be
handled or even properly identified. (Though, as many have argued
persuasively, "forbidding" them is not going to prevent such files
from being created, and possibly even used fruitfully within local
environments.)

Remember that many CIFs will come to us in the end after passage across
many heterogeneous systems. I referred in a previous post to my own
daily working environment - Solaris, Linux and Windows systems linked
by a variety of X servers, X emulators, NFS and SMB cross-mounted
filesystems, clipboards communicating with diverse applications
and OSes running different default locales...
[Incidentally, hasn't SMB now been superseded by "CIFS"!]

Perhaps I'm just perverse, but I doubt that I'm quite unique. We'll
also see files shuttled between co-authors with different languages,
locales, OSes, and exchanged via email, ftp, USB stick etc.
"Corruptions" will inevitably be introduced in these interchanges -
sometimes subtle ones. For example, outside the CIF world altogether,
we see Greek characters change their identity when we run some files
through a PDF -> PostScript -> PDF cycle (all using software from the
same software house, Adobe). The reason has to do with differences in
Windows and Mac encodings, and the failure of the Acrobat software to
track and maintain the character mappings through such a cycle.

Well, I'll stop here, because in spite of my best intentions I don't
think I'm moving the debate along very much, and I apologise if
everything here has already been so obvious as not to need saying.

I'll defer further comment until I've learned if there are already
standard text-encoding identifiers and transcoders.

Regards
Brian
_________________________________________________________________________
Brian McMahon                                      tel: +44 1244 342878
Research and Development Officer                    fax: +44 1244 314888
International Union of Crystallography            e-mail:  bm@iucr.org
5 Abbey Square, Chester CH1 2HU, England


On Tue, Sep 14, 2010 at 10:58:39AM -0400, Herbert J. Bernstein wrote:
> One, hopefully relevant, aside -- ascii files are not as
> unambiguous as one might think.  Depending on what localization
> one has on one's computer, the code point 0x5c (one of the
> characters in the first 127) will be shown as a reverse
> solidus, a yen currency symbol or a won currency symbol.  This
> is a holdover from the days of national variants of the ISO
> character set, and shows no signs of going away any time soon.
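>
> For reference, the glyphs involved correspond to distinct Unicode
> characters, even though those national variants all assign them to
> byte 0x5c. A small illustrative Python snippet:
>
>     import unicodedata
>     for ch in ('\u005c', '\u00a5', '\u20a9'):
>         print('U+%04X' % ord(ch), unicodedata.name(ch))
>     # prints: U+005C REVERSE SOLIDUS
>     #         U+00A5 YEN SIGN
>     #         U+20A9 WON SIGN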
>
> This is _not_ the only such case, but it is one that impacts
> most programming languages, including dREL, and existing CIF
> files, including the PDB's mmCIF files.
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>        Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  yaya@dowling.edu
> =====================================================
>
> On Tue, 14 Sep 2010, Herbert J. Bernstein wrote:
>
>> Dear Colleagues,
>>
>>  To avoid any misunderstandings, rather than worrying about how
>> we got to where we are, let us each just state a clear position.
>> Here is mine:
>>
>>  I favor CIF2 being stated in terms of UTF-8 for clarity, but
>> not specifying any particular _mandatory_ encoding of a CIF2 file
>> as long as there is a clearly agreed mechanism between the
>> creator and consumer of a given CIF2 file as to how to faithfully
>> transform the file between the creator's and the consumer's encodings.
>>
>>  I favor UTF-8 being the default encoding that any CIF2 creator
>> should feel free to use without having to establish any prior
>> agreement with consumers, and that all consumers should try
>> to make arrangements to be able to read, either directly or
>> via some conversion utility or service.  If the consumers don't
>> make such arrangements then there may be CIF2 files that they
>> will not be able to read.  If a producer creates a CIF2 in any
>> encoding other than UTF8 then there may be consumers who have
>> difficulty reading that CIF2.
>>
>>  I favor the IUCr taking responsibility for collecting and
>> disseminating information on particularly useful ways to go
>> to and from UTF8 and/or other popular encodings.
>>
>>  Regards,
>>    Herbert
>> =====================================================
>> Herbert J. Bernstein, Professor of Computer Science
>>  Dowling College, Kramer Science Center, KSC 121
>>        Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                +1-631-244-3035
>>                yaya@dowling.edu
>> =====================================================
>>
>> On Tue, 14 Sep 2010, SIMON WESTRIP wrote:
>>
>>> I sense some common ground here with my previous post.
>>>
>>> The UTF8/16 pair could possibly be extended to any unicode encoding that is
>>> unambiguously/inherently identifiable?
>>> The 'local' encodings then encompass everything else?
>>>
>>> However, I think we've yet to agree that anything but UTF8 is to be allowed
>>> at all. We have a draft spec that stipulates UTF8,
>>> but I infer from this thread that there is scope to relax that
>>> restriction.
>>> The views seem to range from at least 'leaving the door open'
>>> in recognition of the variety of encodings available, to advocating that
>>> the encoding should not be part of the specification at all, and it will be
>>> down to developers to accommodate/influence user practice. I'm in favour of
>>> a default encoding or maybe any encoding that is inherently identifiable,
>>> and providing a means to declare other encodings (however untrustworthy the
>>> declaration may be, it would at least be available to conscientious
>>> users/developers), all documented in the spec.
>>>
>>> Please forgive me if this summary is off the mark; my conclusion is that
>>> there's a willingness to accommodate multiple encodings
>>> in this (albeit very small) group. Given that we are starting from the
>>> position of having a single encoding (agreed upon after much earlier
>>> debate), I cannot see us performing a complete U-turn to allow any
>>> (potentially unrecognizable) encoding as in CIF1, i.e. without some
>>> specification of a canonical encoding or mechanisms to identify/declare the
>>> encoding. On the other hand, I hope to see
>>> a revised spec that isn't UTF8 only.
>>>
>>> To get to the point - is there any hope of reaching a compromise?
>>>
>>> Cheers
>>>
>>> Simon
>>>
>>>
>>> ____________________________________________________________________________
>>> From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
>>> To: Group for discussing encoding and content validation schemes for CIF2
>>> <cif2-encoding@iucr.org>
>>> Sent: Monday, 13 September, 2010 19:52:26
>>> Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. ..
>>> .
>>>
>>>
>>> On Sunday, September 12, 2010 11:26 PM, James Hester wrote:
>>> [...]
>>>> To my mind, the encoding of plain CIF files remains an open issue.  I
>>>> do not view the mechanisms for managing file encoding that are
>>>> provided by current OSs to be sufficiently robust, widespread or
>>>> consistent that we can rely on developers or text editors respecting
>>>> them [...].
>>>
>>> I agree that the encoding of plain CIF files remains an open issue.
>>>
>>> I confess I find your concerns there somewhat vague, especially to the
>>> extent that they apply within the confines of a single machine.  Do your
>>> concerns extend to that level?  If so, can you provide an example or two of
>>> what you fear might go wrong in that context?
>>>
>>> As Herb recently wrote, "Multiple encodings are a fact of life when working
>>> with text."  CIF2 looks like text, it feels like text, and despite some
>>> exotic spice, it tastes like text -- even in UTF-8 only form.  We cannot
>>> pretend that we're dealing with anything other than text.  We need to
>>> accept, therefore, that no matter what we do, authors and programmers will
>>> need to account for multiple encodings, one way or another.  The format
>>> specification cannot relieve either group of that responsibility.
>>>
>>> That doesn't necessarily mean, however, that CIF must follow the XML model
>>> of being self-defining with regard to text encoding.  Given CIF's various
>>> uses, we gain little of practical value in this area by defining CIF2 as
>>> UTF-8 only, and perhaps equally little by defining required decorations for
>>> expressing random encodings.  Moreover, the best reading of CIF1 is that it
>>> relies on the *local* text conventions, whatever they may be, which is quite
>>> a different thing than handling all text conventions that might conceivably
>>> be employed.
>>>
>>> With that being the case, I don't think it needful for CIF2 in any given
>>> environment to endorse foreign encoding conventions other than UTF-8.  CIF2
>>> reasonably could endorse UTF-16 as well, though, as that cannot be confused
>>> with any ASCII-compatible encoding.  Allowing UTF-16 would open up useful
>>> possibilities both for imgCIF and for future uses not yet conceived. 
>>> Additionally, since CIF is text I still think it important for CIF2 to
>>> endorse the default text conventions of its operating environment.
>>>
>>> Could we agree on those three as allowed encodings?  Consider, given that
>>> combination of supported alternatives and no extra support from the spec,
>>> how various parties might deal with the unavoidable encoding issue.  Here
>>> are some of the more reasonable alternatives I see:
>>>
>>> 1. Bulk CIF processors and/or repositories such as Chester, CCDC, and PDB:
>>>
>>> Option a) accept and provide only UTF-8 and/or UTF-16 CIFs.  The
>>> responsibility to perform any needed transcoding is on the other party. 
>>> This is just as it might be with UTF-8-only.
>>>
>>> Option b) in addition to supporting UTF-8 and/or UTF-16, support
>>> other encodings by allowing users to explicitly specify them as part of the
>>> submission/retrieval process.  The processor / repository would either
>>> ensure the CIF is properly labeled, or, better, transcode it to
>>> UTF-8[/16]. 
>>> This also is just as it might be with UTF-8 only.
>>>
>>> 2. Programs and Libraries:
>>>
>>> Option a) On input, detect encoding by checking first for UTF-16,
>>> assuming UTF-8 if not UTF-16, and falling back to default text conventions
>>> if a UTF-8 decoding error is encountered.  On output, encode as directed by
>>> the user (among the two/three options), defaulting to the input encoding
>>> when that is available and feasible.  These would be desirable behaviors
>>> even in the UTF-8 only case, especially in a mixed CIF1/CIF2 environment,
>>> but they do exceed UTF-8-only requirements.
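>>>
>>> A minimal Python sketch of that input-detection order (assuming
>>> BOM-based UTF-16 detection, and the platform's preferred encoding
>>> standing in for the local default text conventions) might be:
>>>
>>>     import codecs, locale
>>>
>>>     def read_cif(path):
>>>         with open(path, 'rb') as f:
>>>             raw = f.read()
>>>         # UTF-16 first, identified by its BOM (the utf-16 codec
>>>         # consumes the BOM while decoding)
>>>         if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
>>>             return raw.decode('utf-16')
>>>         try:
>>>             return raw.decode('utf-8')    # then assume UTF-8
>>>         except UnicodeDecodeError:
>>>             # fall back to default text conventions
>>>             return raw.decode(locale.getpreferredencoding(False))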
>>>
>>> Option b) Require input and produce output according to a fixed set
>>> of conventions (whether local text conventions or UTF-8/16).  The program
>>> user is responsible for any needed transcoding.  This would be sufficient
>>> for the CIF2, UTF-8 only case, and is typical in the CIF1 case; those
>>> differ, however, in which text conventions would be assumed.
>>>
>>> 3. Users/Authors:
>>> 3.1. Creating / editing CIFs
>>> No change from current practice is needed, but users might choose to
>>> store CIFs in UTF-8[/16] form.  This is just as it would likely be under
>>> UTF-8 only.
>>>
>>> 3.2. Transferring CIFs
>>> Unless an alternative agreement on encoding can be reached by some
>>> means, the transferor must ensure the CIF is encoded in UTF-8[/16].  This
>>> differs from the UTF-8-only case only inasmuch as UTF-16 is (maybe)
>>> allowed.
>>>
>>> 3.3. Receiving CIFs
>>> The receiver may reasonably demand that the CIF be provided in
>>> UTF-8[/16] form.  He should *expect* that form unless some alternative
>>> agreement is established.  Any desired transcoding from UTF-8[/16] to an
>>> alternative encoding is the user's responsibility.  Again, this is not
>>> significantly different from the UTF-8 only case.
>>>
>>>
>>> A driving force in many of those cases is the well-understood (especially
>>> here!) fact that different systems cannot be relied upon to share text
>>> conventions, thus leaving UTF-8[/16] as the only available general-purpose
>>> medium of exchange.  At the same time, local conventions are not forbidden
>>> from use where they can be relied upon -- most notably, within the same
>>> computer.  Even if end-users, as a group, do not appreciate those details,
>>> we can ensure via the spec that CIF2 implementers do.  That's sufficient.
>>>
>>> So, if pretty much all my expected behavior under UTF-8[/16]+local is the
>>> same as it would be under UTF-8-only, then why prefer the former?  Because
>>> under UTF-8[/16]+local, all the behavior described is conformant to the
>>> spec, whereas under UTF-8 only, a significant proportion is not.  If the
>>> standard adequately covers these behaviors then we can expect more uniform
>>> support.  Moreover, this bears directly on community acceptance of the
>>> spec.  If flouting the spec with respect to encoding becomes common, then
>>> the spec will have failed, at least in that area.  Having failed in one
>>> area, it is more likely to fail in others.
>>>
>>>
>>> Regards,
>>>
>>> John
>>> --
>>> John C. Bollinger, Ph.D.
>>> Department of Structural Biology
>>> St. Jude Children's Research Hospital
>>>
>>> Email Disclaimer:  www.stjude.org/emaildisclaimer
_______________________________________________
cif2-encoding mailing list
cif2-encoding@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif2-encoding
