[Cif2-encoding] Addressing Brian's concerns

  • To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@xxxxxxxx>
  • Subject: [Cif2-encoding] Addressing Brian's concerns
  • From: James Hester <jamesrhester@xxxxxxxxx>
  • Date: Tue, 28 Sep 2010 17:05:49 +1000
In this email I address Brian's comments; I have reproduced his email in full at the end for reference.  To save reading the whole thing: download 'JuffEd', a Qt-based editor, and cut and paste between foreign-language web pages in your browser and JuffEd to see multiple-encoding cut and paste in action.  There is thus no impediment to publCIF operating in a multiple-encoding environment.

Brian writes:

I sympathise greatly with James's desire for a prescriptive, "binary"
approach, but its corollary is that a CIF application must take full
responsibility for expressing any supported extended character set (I
mean accented Latin letters, Greek characters, Cyrillic or Chinese
alphabets).

This is not correct.  A typical application at minimum need only parse the CIF file so that each tag has an associated string value.  What higher levels of the application do with those tags and strings is application-dependent.  The only applications that need to worry about actually displaying glyphs are those concerned with text display.  Most Unicode-aware software (e.g. web browsers) simply displays a default character when it cannot display a particular glyph.  That does not mean the code point disappears, only that it is not displayed.  So I do not see a requirement to display all of Unicode as a valid criticism of the 'binary' approach, because displaying all of Unicode is not in fact a requirement.
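To make this concrete, here is a toy sketch (Qt/C++, and emphatically not a real CIF2 parser; the function name is mine) of a parse layer handing back tag/value pairs as Unicode strings, with no glyph rendering in sight:

    #include <QHash>
    #include <QRegExp>
    #include <QString>
    #include <QStringList>

    // Toy sketch only, not a real CIF2 parser: the parse layer maps each
    // tag to a Unicode string value; nothing at this level needs to render
    // a single glyph.
    QHash<QString, QString> parseSimpleItems(const QString &cifText)
    {
        QHash<QString, QString> items;
        foreach (const QString &line, cifText.split('\n')) {
            const QString t = line.trimmed();
            if (t.startsWith('_')) {                      // a data name
                const int sp = t.indexOf(QRegExp("\\s")); // end of the tag
                if (sp > 0)
                    items.insert(t.left(sp), t.mid(sp).trimmed());
            }
        }
        return items;
    }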

First off, I don't know how difficult that is technically. I would
guess that rather than trying to handle arbitrary keyboard mappings,
the natural approach would be to pick from a graphical character
grid. (What are the implications for this of glyph rendering - does
a CIF editor have to be compiled with its own large font library?)

If I understand correctly, this paragraph is not relevant if there is no implicit requirement to display all of Unicode.  I will just add that there is no need to choose a less comprehensive encoding just because you don't want to display characters outside a certain range.  The various mappings involved in text display are all decoupled.  Just to list them for clarity:

(1) Mapping from keyboard code to code point;
(2) Mapping from code point to on-disk binary representation (this is the encoding we have been discussing);
(3) Mapping from code point to font coordinate (display glyph).

You can restrict the set of display glyphs without changing the underlying file encoding.  For example, a CIF-aware application for handling CIF2 text could bundle different language packs as Adobe does for PDF.
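As a minimal sketch of that decoupling (assuming Qt; QFontMetrics::inFont reports whether a font supplies a glyph, and the fallback behaviour in the comment is up to the application):

    #include <QFont>
    #include <QFontMetrics>
    #include <QString>

    // Mapping (3) is a property of the chosen font; mapping (2) is
    // untouched by it.
    bool canDisplay(const QFont &font, const QString &text)
    {
        const QFontMetrics fm(font);
        for (int i = 0; i < text.size(); ++i)
            if (!fm.inFont(text.at(i)))
                return false;   // a real editor might draw a fallback box
        return true;
    }

Whether or not canDisplay() succeeds, text.toUtf8() produces exactly the same on-disk bytes: restricting the display glyphs does not restrict the file encoding.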

But that's a laborious method of authoring if relatively large amounts
of "non-standard" text are involved, and the way that authors would
prefer to work, surely, is by copying and pasting text from Word or
some other tool of choice. Permitting that necessarily pollutes the
"binary" approach with byte streams delivered by text-oriented
applications.

I agree that authors would probably prefer to cut and paste.  But does this pollute the 'binary' approach any more than the 'any encoding' approach?  How much of the cut-and-pastability of CIF1 text is due to the near-universality of ASCII encoding for the CIF1 character set, and how much to magic CIF1 pixie dust that translates the encoding of the cut text into that of the document it is pasted into?  More on cutting and pasting below.

If I could be sure that publCIF, say, can be compiled with libraries
that reliably transcode byte streams imported from clipboards and
file import (across the mess of SMB/NFS mounts etc. that exist in
the real world) - and equally reliably transcode its UTF8 encoded text
to the author's locale-based clipboard, then I'd be more willing to
promote option 3 to the top as the starting point at least for CIF
2.0 (but its "enforcement" does depend on the availability of such a
robust CIF-editing tool).

Short form of my answer: publCIF will be able to work well under both proposals as it is interactive and uses Qt.  Long form:

You don't even need any new libraries for this! Qt (upon which publCIF is built) aims to do it all transparently for you. Let me be your software architect.  Assume publCIF always handles UTF8 text as per my preferred option. 

Cut and paste (clipboard import): When text is imported from the QClipboard (an abstraction of the system clipboard) into publCIF, publCIF should always request the QMimeData object, from which it obtains the text together with the encoding declared by the source application.  Qt's standard transcoding functions can then convert the text to the target encoding.  I estimate about 10 lines of new code.  The other direction is even more of a no-brainer: publCIF need simply set the encoding of its source text in the QMimeData object it passes to the clipboard, and the clipboard will transparently handle transcoding as needed.  Note that this description applies equally well to the 'as for CIF1' proposal, with the potential simplification that, if source and target texts are known to be in the same encoding, no transcoding is necessary.
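For what it is worth, those ten-odd lines might look something like this (a sketch under my assumptions, not publCIF's actual code; in current Qt the QMimeData::text() call already hands back a decoded Unicode QString, so the transcoding happens before you even see the text):

    #include <QApplication>
    #include <QClipboard>
    #include <QMimeData>
    #include <QString>

    // Import: Qt decodes the clipboard bytes (whatever charset the source
    // application declared) into a Unicode QString before we touch them.
    QString importFromClipboard()
    {
        const QMimeData *mime = QApplication::clipboard()->mimeData();
        return (mime && mime->hasText()) ? mime->text() : QString();
    }

    // Export: hand Qt a Unicode QString; the clipboard transcodes on
    // demand into whatever representation the pasting application asks for.
    void exportToClipboard(const QString &text)
    {
        QApplication::clipboard()->setText(text);
    }

Writing the imported text to a CIF2 file is then just text.toUtf8().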

I suggest you download the free Qt-based editor JuffEd and play with cutting and pasting from international web pages.  I have just pasted bits of text encoded in euc-jp, utf-8 and windows-1251 into JuffEd and all displayed correctly.  As JuffEd is built on the same libraries and technology as publCIF, I think your worries are unfounded.

Note also that cutting and pasting is a user-mediated operation, so the user sees both the input text and the output text.  This means that transcoding errors (which, others have reported, occasionally occur for single characters) are more likely to be caught than when transcoding is done silently in the background.

Import of a CIF file from some undefined location: under my UTF8/16 proposal there is no issue, as the file is supposed to be UTF8/16.  Under the 'as for CIF1' proposal (which Brian paradoxically supports?), or even the 'local + UTF8/16' proposal, you are *on your own* as far as figuring out the source file encoding goes, and I know of no automated solution.  As a practical matter, because publCIF is interactive, you could prompt the user to specify the encoding when no UTF8/16 signature is found, in the same way that browsers allow the encoding to be set.  But that latter behaviour is entirely your decision.
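As a sketch of that interactive fallback (the function name and dialog wording are hypothetical; QTextCodec::codecForUtfText only sniffs the UTF byte-order marks, so a BOM-less UTF8 file would additionally need a validity check not shown here):

    #include <QFile>
    #include <QInputDialog>
    #include <QString>
    #include <QTextCodec>

    // Hypothetical loader for the UTF8/16 proposal: trust the BOM if
    // present, otherwise fall back to asking the user, as a browser's
    // encoding menu does.
    QString loadCifText(const QString &path, QWidget *parent)
    {
        QFile f(path);
        if (!f.open(QIODevice::ReadOnly))
            return QString();
        const QByteArray raw = f.readAll();
        QTextCodec *codec = QTextCodec::codecForUtfText(raw, 0); // BOM sniff
        if (!codec) {
            // No UTF signature: strictly invalid under the UTF8/16-only
            // proposal, but an interactive tool can let the user name the
            // encoding instead of rejecting the file outright.
            const QString name = QInputDialog::getText(parent,
                "Unknown encoding",
                "No UTF8/16 signature found; enter the source encoding:");
            codec = QTextCodec::codecForName(name.toLatin1());
        }
        return codec ? codec->toUnicode(raw) : QString();
    }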

While I'm rewriting publCIF in my head, I will note in passing that in terms of fonts publCIF is already well set up.  The Linux version allows me to choose a Unicode font, for example, which displays the Greek symbols perfectly, and Windows has its own Unicode fonts available.
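If one wanted to force the issue, pointing the editing widget at a wide-coverage font is a one-liner (the font name is just an example):

    #include <QFont>
    #include <QTextEdit>

    // Illustrative only: any font with broad Unicode coverage will do;
    // 'DejaVu Sans' simply ships with most Linux distributions.
    void useUnicodeFont(QTextEdit *editor)
    {
        editor->setFont(QFont("DejaVu Sans"));
    }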

So, I do not think that publCIF can be used as a way to distinguish between the competing proposals on the table.

all the best,
James.
==========================
Brian's email in full for reference


My vote:

Preference  Option
 1        2. Herbert's 'as for CIF1 proposal with UTF8 in place of
             ASCII', together with Brian's *recommendations*
 2        1. Herbert's 'as for CIF1 proposal with UTF8 in place of
             ASCII' recently posted here and to COMCIFS.
 3        4. UTF8 + UTF16
 4        3. UTF8-only as in the original draft
 5        5. UTF8, UTF16 + "local"

Rationale: I still feel this argument is at heart a "binary/text"
dichotomy, where "binary" implies that one can prescribe specific
byte-level representations of every distinct character; "text"
implies that you're at the mercy of external libraries and mappings
between encoding conventions - and those mappings are not always
explicit or easy to identify.

I sympathise greatly with James's desire for a prescriptive, "binary"
approach, but its corollary is that a CIF application must take full
responsibility for expressing any supported extended character set (I
mean accented Latin letters, Greek characters, Cyrillic or Chinese
alphabets).

First off, I don't know how difficult that is technically. I would
guess that rather than trying to handle arbitrary keyboard mappings,
the natural approach would be to pick from a graphical character
grid. (What are the implications for this of glyph rendering - does
a CIF editor have to be compiled with its own large font library?)

But that's a laborious method of authoring if relatively large amounts
of "non-standard" text are involved, and the way that authors would
prefer to work, surely, is by copying and pasting text from Word or
some other tool of choice. Permitting that necessarily pollutes the
"binary" approach with byte streams delivered by text-oriented
applications.

If I could be sure that publCIF, say, can be compiled with libraries
that reliably transcode byte streams imported from clipboards and
file import (across the mess of SMB/NFS mounts etc. that exist in
the real world) - and equally reliably transcode its UTF8 encoded text
to the author's locale-based clipboard, then I'd be more willing to
promote option 3 to the top as the starting point at least for CIF
2.0 (but its "enforcement" does depend on the availability of such a
robust CIF-editing tool).

I prefer the UTF8 + UTF16 option over UTF8-only because of the
real-world use case that Herbert has described before; and in
existing imgCIF applications the UTF16 encoding is being done
rather carefully and for a specific purpose.

I put option 5 at the bottom because of the non-portability of a
"local" encoding.

Note, though, that whatever the outcome I would still favour the
discussion of character set encodings to be presented as a Part 3
to the complete CIF2 spec.

Best wishes
Brian
_________________________________________________________________________
Brian McMahon                                       tel: +44 1244 342878
Research and Development Officer                    fax: +44 1244 314888
International Union of Crystallography            e-mail:  bm@iucr.org
5 Abbey Square, Chester CH1 2HU, England

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

