Re: [ddlm-group] Summary of encoding discussion so far

See comments in-line, responding both to the original summary and to James's follow-up comments:

On Monday, June 28, 2010 12:35 AM, James Hester wrote:

>See inserted comments:
>
>On Mon, Jun 28, 2010 at 2:30 PM, James Hester <jamesrhester@gmail.com> wrote:

[...]

>> There are approximately two points of view regarding the encoding to
>> be used for CIF2: allow only one encoding, which would be UTF-8; or
>> allow multiple encodings to be used. The multiple encoding option
>> comes with several alternative approaches:

I don't see most of these being alternatives, in the sense that they mostly are not mutually exclusive.

>> 1) Mandate that all CIF2 processors must support UTF-8, and are not
>> required to support any other encoding.  Non-UTF-8 encoded input CIFs
>> would first need to be transcoded by a separate tool to UTF-8

Let's please remove the second sentence.  It does not follow.  Although transcoding to UTF-8 would be a universally viable approach under this option, other alternatives would be available depending on the file and the processor.

Overall, I would just rephrase it something like

1') Mandate that all CIF2 processors must support UTF-8, and permit, but not require, them to support any other encoding.

>> 2) Remain silent on the encoding issue (as for CIF1)

I guess that is an option, but I don't recall anyone seriously pushing for it, at least in this later round of discussion.

>> 3) Specify separately a 'CIF interchange format', which would strongly
>> encourage use of UTF-8 for transmission and storage but not prohibit
>> use of different encodings among agreeing parties.
>>
>> 4) Specify UTF-8 as a 'canonical' encoding in terms of which trip
>> tests and test suites would be written.

For what it's worth, I don't see (3) and (4) as meaningfully distinct.  They are different, non-exclusive formalisms for the same end: promoting UTF-8 as a preferred -- perhaps default -- encoding without forbidding the use of alternatives.  They go hand in glove with (1), but either one or both could be adopted separately.

Only (2) is a distinct (and not much supported) alternative here; the rest are various compatible options under the general category of allowing multiple encodings.

>> Following is a list of the arguments presented for and against the
>> above two choices.
>>
>> Restrict CIF2 to UTF-8:
>> =======================
>>
>> Arguments in favour:
>>
>> * Implementation of the standard is simpler as CIF processors are not
>>  required to negotiate encoding or support more than one encoding

Sorry, I reject that one.  None of the options summarized above for multiple encoding, as I understand them, requires a CIF processor to support anything other than UTF-8 or to perform any explicit encoding negotiation.  Many of the proposals in fact aim specifically to allow a processor to support just UTF-8 as a minimally compliant implementation.  This argument therefore does not distinguish the single-encoding approach from the multiple-encoding ones.

>> * UTF-8 is a unique encoding in that a non-UTF-8 encoded file is
>>  detectable with high probability due to the specific bit-patterns
>>  required for UTF-8 encoding

[Technical details accepted and elided]

This is true, and it is indeed an advantage of UTF-8 over most, if not all, other encodings in general use.  No one has contested that UTF-8 is an encoding with many strengths.  It is a good choice for use with CIF2, but that does not support the contention that CIF2 should be restricted to that encoding only.
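
To make the detection property concrete, here is a minimal sketch (Python 3, purely illustrative; the function name is mine) of how the bit-pattern constraints let a reader reject most non-UTF-8 input automatically:

    def looks_like_utf8(raw_bytes):
        # Multi-byte UTF-8 sequences must follow strict bit patterns
        # (a lead byte of the form 110xxxxx, 1110xxxx, or 11110xxx
        # followed by 10xxxxxx continuation bytes), so text in most
        # other encodings that uses bytes above 0x7F will fail to
        # decode with high probability.
        try:
            raw_bytes.decode('utf-8')
            return True
        except UnicodeDecodeError:
            return False

    # A Latin-1 e-acute (0xE9) followed by a space is not a legal
    # UTF-8 sequence, so the check correctly rejects it:
    assert not looks_like_utf8(b'caf\xe9 ')
    assert looks_like_utf8('caf\u00e9'.encode('utf-8'))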

>> * A single encoding conforms to the philosophical principle observed
>>  in earlier CIF standards, that it is only necessary to define one
>>  convention for each feature in the standard

That pretty much assumes the conclusion, doesn't it?

Moreover, CIF1 expressly permits multiple encodings, so however much that principle may have been applied in other areas of earlier CIF standards, the premise apparently was accepted that it needn't apply to the area of character encoding.  The issue is not mooted by the fact that many encodings are congruent with ASCII over the entire CIF1 character set, for there are some widely used encodings that aren't (UTF-16 variants, EBCDIC variants).

>> * A key virtue of CIF is simplicity.  Multiple optional encodings are
>>  not simple.

I think that puts the focus in the wrong place.  It is text handling in general, and especially the storage and interchange of non-ASCII characters, that is not simple.  The group has already decided to adopt Unicode as the character set for CIF2, so the inevitable complications are now arising; among them, CIF2 requires all concerned to devote greater attention to character encoding.

In particular, requiring authors to encode their CIFs in UTF-8 is not simple either, except perhaps where that is the contextual default.  The simple alternative would be to allow authors to continue, in general, to neglect the issue altogether, but that isn't feasible.  This argument therefore does not distinguish the single-encoding approach from the multiple-encoding ones.

>> Arguments against:
>>
>> * Choosing a specific encoding unduly restricts user freedom or shows
>>  a lack of respect for the way others do science
>
>I believe this is not a significant concern, as scientists are used to
>compromising how they do things in order to communicate
>internationally (most obviously, learning English and having a
>restricted choice of word processor for publication submission)

I'm not personally much swayed by the respect angle, but I am considerably more moved when the question is couched in terms of freedom.  I am sometimes prepared to yield some of my freedom in fair exchange for something of value, but I don't see much upside here for most authors.  Nor do I really see upside for CIF consumers, software developers, or archivers, inasmuch as each of these will still need to choose and implement a policy for handling CIFs encoded via various schemes, even if that policy is to detect and reject all non-UTF-8 CIFs (which would be an acceptable option anyway).

>>
>> * We are premature in settling on Unicode and/or UTF-8; by doing so we
>>  risk alienating important user groups and/or backing the wrong horse
>
>The link cited in support of this statement was at least 12 years old.
> UTF-8 is supported on all major operating systems and applications
>exist that will run on 90s era platforms, so it seems that UTF-8 is
>here for the long run.  Dare I say it, UTF-8 is the default encoding
>of the Web (i.e. HTML and XML).

I don't think adopting Unicode as a character repertoire and model is any risk, and anyway, the only alternative I see is no common character model at all, which is worse.

Having chosen Unicode, it is perfectly reasonable to support the UTF-8 encoding scheme that's part of the Unicode standard.  It is harder, however, to then reject the several variants of UTF-16 and UTF-32 that are also part of the Unicode standard.

Anyway, UTF-8 is widely supported and accepted, and in that sense I don't think there need be any fear that it's the wrong horse.  On the other hand, I think it unwise to rely on UTF-8 being a suitable encoding into the indefinite future.  The problem is not so much that UTF-8 may be the wrong horse, but rather that it shouldn't be the only one, when the herd overall will thrive long after UTF-8 plods off to the dog food factory.
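
To illustrate that point (a Python 3 sketch, for exposition only): even staying entirely within the Unicode standard, the same character yields several distinct byte sequences, so choosing Unicode does not by itself fix the bytes on disk:

    text = '\u00c5'  # LATIN CAPITAL LETTER A WITH RING ABOVE

    # All of these encoding forms are part of the Unicode standard,
    # yet none of the resulting byte sequences agree:
    print(text.encode('utf-8'))      # b'\xc3\x85'
    print(text.encode('utf-16-le'))  # b'\xc5\x00'
    print(text.encode('utf-16-be'))  # b'\x00\xc5'
    print(text.encode('utf-32-le'))  # b'\xc5\x00\x00\x00'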


Also:

* Standards should place the minimum constraints necessary to achieve their purpose, and designation of a single required character encoding is not necessary and/or does not meaningfully advance CIF's purpose.


>>
>> Allow multiple CIF2 encodings always including UTF-8:
>> =====================================================
>>
>> Arguments in favour
>>
>> * CIF has always been a 'text' standard, with no encoding mandated.
>>  This has worked out OK so far
>
>I think this is purely because any non-ASCII encodings that were used
>to generate IUCr submissions coincided with ASCII in the ASCII byte
>range, and all important information was expressed using ASCII
>characters, so the encoding issue was not a big deal.  When we
>formally accept non-ASCII characters into the standard, the encoding
>issue becomes far more significant, as we can no longer rely on such
>neat coincidence between encodings.

This point is put a bit weakly.  We all recognize that the characters of the CIF1 set are among those that are well supported as text across a wide variety of contexts, and that we are adding others for which encoding is a much larger issue.  But the IUCr has actively promoted CIF1 as a text format, amenable to manipulation with text tools, not only for searching (as mentioned below) but also for editing, text extraction, etc.  There seems no strong will to change that stance, and considerable interest in maintaining it.  'Text' has a locale-specific meaning, however, with associated implications for encoding.  Recognizing only UTF-8 undermines the position that CIF is a text format.
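
The 'neat coincidence' James describes is easy to see in a sketch (Python 3, expository only): the common 8-bit encodings agree with ASCII over the CIF1 character set, while the UTF-16 variants do not:

    line = '_cell_length_a  10.234'  # plain ASCII, as in a CIF1 file
    ascii_bytes = line.encode('ascii')

    # Many widely deployed encodings coincide with ASCII over this
    # range, which is why CIF1 files moved between systems unscathed:
    for enc in ('utf-8', 'iso8859-1', 'iso8859-5', 'cp1252'):
        assert line.encode(enc) == ascii_bytes

    # UTF-16 interleaves zero bytes (and typically prepends a
    # byte-order mark), so the coincidence breaks down entirely:
    assert line.encode('utf-16-le') != ascii_bytes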

>>
>> * Provided sender and receiver system understand that a file is a
>>  'text' file, encodings are manipulated automatically to produce a
>>  correct file after transmission
>
>This statement is not supported by the facts.  Simon's initial
>investigations did not produce a situation where the transmitted bytes
>were altered, and I would challenge anyone who believes that bytes are
>altered during transmission to produce evidence of this.  What is true
>is that the text that is rendered on screen from a series of bytes may
>not correspond to the text that was sent, due to an inability to
>properly match the input and output encodings.  The more encodings in
>play, the more likely this is to happen.

Then let's drop this point.  I am not convinced that such automatic conversions never occur, but I also never believed that they consistently occurred.  Either way, I don't consider it very persuasive.

>> * If a user anticipates the need to specify encoding (because none is
>>  mandated and the documents remind them of this need) then they are
>>  more likely to include information about the encoding they are
>>  using.  If no encoding information is thought necessary, then a
>>  non-UTF-8 encoded file mistakenly sent as a UTF-8 file would be
>>  difficult to decode.
>
>We can know almost unambiguously that the mistakenly encoded file is
>*not* a UTF-8 file.

Agreed.  If I ever promoted that specific argument then I was in error.

>  Can you say with such certainty that the
>supposedly iso-8859-5 file you have received is not iso-8859-5, but
>actually iso-8859-15?

No, and that's exactly the sort of case that concerns me here.  The problem is in no way minimized by the CIF specification refusing to sanction ISO-8859-5 or ISO-8859-15 as an encoding for CIF2.
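
The hazard is easy to demonstrate (Python 3, for illustration): the same byte is perfectly legal in both encodings, so nothing in the byte stream itself reveals a mislabeling:

    # The byte 0xD0 is valid in both ISO-8859-5 and ISO-8859-15,
    # but decodes to entirely different characters:
    byte = b'\xd0'
    print(byte.decode('iso8859-5'))   # U+0430, CYRILLIC SMALL LETTER A
    print(byte.decode('iso8859-15'))  # U+00D0, LATIN CAPITAL LETTER ETH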

>  Can you even say that your idea of iso-8859-5
>is the same as the sender's idea?  Herbert seems to imply that in some
>cases different OSs can disagree on the mapping for the same nominal
>encoding, which is one more reason to avoid other encodings.

I must have missed that one.  The ISO-8859 series are international standards.  Different OSs do not have the luxury of disagreeing on the character mappings of those standards.  We cannot control OS bugs that may cause characters to be mapped incorrectly, and requiring use of UTF-8 does not in any case protect us from them.

>>
>> * Binary formats are bad
>
>Not a lot of supporting argument has been provided for this one.  I
>would agree that undocumented, unsupported binary formats are awful
>and require a lot of work.  However, 'Text' formats are just a
>specific type of binary format, where the mapping to code points is
>(a) known and (b) supported by readily available system tools.  Where
>the mapping to code points is not known, or the system tools do not
>support that mapping, a 'text' file in the wrong encoding is just as
>bad as a binary file.  Ever tried opening a CJK-encoded pdf using a
>standard Western Adobe Acrobat?  You'll know what I mean, because
>unless you are able to download and install the Adobe CJK-kit, that
>pdf file is useless.

I concede this point.

>>
>> * Labelling is normal practice, and so there is nothing contentious
>>  about labelling the encoding used in a file
>
>Labelling is wonderful (the more metadata the better) except that
>someone has to do it, and get it right.  We go to considerable lengths
>at my lab (a reactor source) to make sure that users need only print
>out a barcode to label their samples, but it requires constant urging
>from the staff to make sure this procedure is followed.  What then of
>CIF?  Only rejection of files lacking an encoding label would do the
>trick, but if you are going to do this, you might as well just specify
>utf-8 and reject non-utf-8 files.

I argued, and Simon accepted, that the possibility of incorrect labeling is no worse than the inability to label at all.  How to handle the situation then returns to the issue of which decisions should be individual policy choices and which should be afforded a place in the international standard.  Regardless of what the spec says about encodings, any CIF consumer is always free to accept, for example, only UTF-8 CIFs.  I see no advantage in specifying that it is not CIF2-conformant to adopt an alternative strategy.
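
For what labelling might look like in practice (everything below is hypothetical: no label syntax has been agreed, and the 'encoding=' form here is invented for the sketch), note that any such scheme quietly assumes the label itself occupies an ASCII-compatible byte range:

    import re

    # Hypothetical label on the first line of a file, e.g.:
    #   #\#CIF_2.0 encoding=iso-8859-5
    LABEL_RE = re.compile(rb'encoding=([A-Za-z0-9_\-]+)')

    def sniff_declared_encoding(path, default='utf-8'):
        # Scan the first line as raw bytes; a usable label must live in
        # the ASCII-compatible subset of the candidate encodings, so
        # this scheme silently fails for UTF-16 or EBCDIC input.
        with open(path, 'rb') as f:
            first_line = f.readline()
        match = LABEL_RE.search(first_line)
        if match:
            return match.group(1).decode('ascii')
        return default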

>>
>> * Saving CIF files in the native text format allows system text tools
>>  (e.g. for searching) to be useful
>
>True, I think this is a point in favour of native text formats, if the
>available tools are unable to search for text in a different encoding.
>
>> * Users are going to produce CIFs in multiple encodings anyway, so we
>>  might as well try to manage this use by specifying standards

There are two separate arguments there:
1) Users are going to produce CIFs in multiple encodings anyway, and
2) we might as well try to manage this use by specifying standards

There has been little or no dispute on point (1), and that point by itself is a significant argument against requiring UTF-8.  It avails nothing to put it in the standard if you can rely on that part of the standard being violated by a substantial fraction of all CIF2 CIFs.

As for specifying standards for use of alternative encodings,

>Assuming that virtually all users will have access to utf-8 capable
>tools, there is a contradiction in supposing that a user is incapable
>of working out how to output a UTF-8 file, but is able to correctly
>output some other encoding, which they are also willing and able to
>insert in the file header correctly.

I do not suppose that any user we care about is incapable of outputting a UTF-8 file.  I do suppose, however, that many manually edited CIFs are prepared by people who are uninterested in figuring out how to output UTF-8, and who have bright graduate students or other assistants who know or can figure out what the encoding for the relevant environment is, which need be done only once, and which may already have been done for other purposes.

>  If the default encoding is not
>UTF-8, they need to realise that this is the case, find out what the
>default actually is, and find out how to specify that encoding in
>CIF2.  And hope that when they click 'save' they are actually using
>the default encoding, and not some other encoding that was specified
>in some setup file somewhere.  If only utf-8 were acceptable, that
>same user will still have to find out what their default encoding is,
>attempt to change it to utf-8 if necessary, and click 'save'.  With
>the important distinction that the recipient is almost certain to
>detect if the file is not really utf-8 encoded, due to the properties
>of utf-8.  I don't see that the availability of multiple encodings has
>reduced the scope for error; if anything, there are more ways for
>things to go wrong.  At least with utf-8 it is easy to automatically
>detect when a file is not correctly encoded, unlike any other
>encoding.

The person or organization that wants to accept only UTF-8 (or perhaps other discernible encodings) in order to reduce the perceived or real likelihood of misidentified encodings is under any circumstances free to do so.  The standard neither helps them by requiring UTF-8 alone, nor hinders them by permitting other encodings.
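
A minimal sketch of such a policy (Python 3; the function name and error handling are mine, not from any specification):

    def read_cif2_utf8_only(path):
        # Local policy choice: strictly validate UTF-8 on input and
        # reject everything else, regardless of what encodings the
        # specification ultimately permits.
        with open(path, 'rb') as f:
            raw = f.read()
        try:
            return raw.decode('utf-8', errors='strict')
        except UnicodeDecodeError as exc:
            raise ValueError('%s rejected: not valid UTF-8 (%s)' % (path, exc))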

>> Arguments against multiple encodings:
>>
>> * There is no way to reliably detect which encoding has been used in a
>>  file, and it is not reasonable to assume that a human editor has
>>  gotten an embedded encoding declaration correct, requiring that all
>>  files are therefore read through by a human after transmission to
>>  check for incorrect letters, accents etc.

That describes a policy decision that individual CIF consumers can and should make for themselves.  Such consumers would not be prevented from requiring CIFs to be submitted to them in UTF-8 form, should they wish to rely on that as a means to convey the text accurately.  They will need to make a decision on how to handle non-UTF-8 CIFs in any event, regardless of the official stance on permitted encodings.  There is no gain here from the spec forbidding encodings other than UTF-8.

>> * Facilitating use of multiple encodings encourages them to be used,
>>  which increases the scale of the multiple encoding problem

Forbidding encodings other than UTF-8 does not effectively prevent them from being used, especially given the historical context.  The scale of the problem is little affected by the spec's ultimate stance on this issue.  The problem is not, in any case, that a variety of encodings will be in use; it is rather that CIFs may be stored or exchanged without correct encoding information.  That is a problem where the spec may be able to exert more influence.


Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital


