Discussion List Archives

Re: [ddlm-group] Summary of encoding discussion so far

See inserted comments:

On Mon, Jun 28, 2010 at 2:30 PM, James Hester <jamesrhester@gmail.com> wrote:
> The following is a summary of the encoding discussion so far.  It
> incorporates material from the previous discussion in Oct/Nov 2009. I
> have refrained from commenting on the validity of the various
> arguments, but I will be posting a subsequent message with my
> thoughts.
> There are approximately two points of view regarding the encoding to
> be used for CIF2: allow only one encoding, which would be UTF-8; or
> allow multiple encodings to be used. The multiple encoding option
> comes with several alternative approaches:
> 1) Mandate that all CIF2 processors must support UTF-8, and are not
> required to support any other encoding.  Non-UTF-8 encoded input CIFs
> would first need to be transcoded by a separate tool to UTF-8
> 2) Remain silent on the encoding issue (as for CIF1)
> 3) Specify separately a 'CIF interchange format', which would strongly
> encourage use of UTF-8 for transmission and storage but not prohibit
> use of different encodings among agreeing parties.
> 4) Specify UTF-8 as a 'canonical' encoding in terms of which trip
> tests and test suites would be written.
> Following is a list of the arguments presented for and against the
> above two choices.
> Restrict CIF2 to UTF-8:
> =======================
> Arguments in favour:
> * Implementation of the standard is simpler as CIF processors are not
>  required to negotiate encoding or support more than one encoding
> * UTF8 is a unique encoding in that a non UTF-8 encoded file is
>  detectable with high probability due to the specific bit-patterns
>  required for UTF-8 encoding

Note specifically that a byte with the high bit set sandwiched between
bytes without the high bit set is never valid UTF-8.  This immediately
catches any wrongly-encoded isolated (e.g. accented or Greek)
characters.  Given two neighbouring bytes with the high bit set, the
chance that they are valid UTF-8 is a priori 1/8 (1/4 that the first
is a two-byte lead, 110xxxxx, times 1/2 that the second is a
continuation byte, 10xxxxxx), falling to 1/32 for three high-bit-set
bytes, and about 1/64 for four high bytes in a row.
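
A minimal Python sketch of this check, for anyone who wants to see the
numbers.  It assumes Python's strict "utf-8" codec as the definition
of validity; the names is_valid_utf8 and valid_fraction are
illustrative only.  Because the strict codec also rejects overlong
forms, the measured figure for two bytes comes out slightly below the
a priori 1/8.

```python
from itertools import product

HIGH_BYTES = range(0x80, 0x100)  # the 128 byte values with the high bit set


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def valid_fraction(run_length: int) -> float:
    """Fraction of all high-bit-set byte runs of this length that are valid UTF-8."""
    runs = [bytes(run) for run in product(HIGH_BYTES, repeat=run_length)]
    return sum(is_valid_utf8(run) for run in runs) / len(runs)


# An isolated high-bit byte between two ASCII bytes is never valid UTF-8.
assert not any(is_valid_utf8(b"a" + bytes([b]) + b"b") for b in HIGH_BYTES)

# "café" saved as ISO-8859-1 is rejected; the UTF-8 form is accepted.
assert not is_valid_utf8("café".encode("iso8859_1"))
assert is_valid_utf8("café".encode("utf-8"))

print(valid_fraction(2))  # 0.1171875, i.e. 1920 of 16384 pairs
```

The same function can be called with a run length of 3, at the cost of
about two million decode attempts.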

> * A single encoding conforms to the philosophical principle observed
>  in earlier CIF standards, that it is only necessary to define one
>  convention for each feature in the standard
> * A key virtue of CIF is simplicity.  Multiple optional encodings is
>  not simple.
> Arguments against:
> * Choosing a specific encoding unduly restricts user freedom or shows
>  a lack of respect for the way others do science

I believe this is not a significant concern, as scientists are used to
compromising on how they do things in order to communicate
internationally (most obviously, learning English and having a
restricted choice of word processor for publication submission).
> * We are premature in settling on Unicode and/or UTF-8; by doing so we
>  risk alienating important user groups and/or backing the wrong horse

The link cited in support of this statement was at least 12 years old.
UTF-8 is supported on all major operating systems, and applications
exist that will run on 1990s-era platforms, so it seems that UTF-8 is
here for the long run.  Dare I say it, UTF-8 is the default encoding
of the Web (i.e. HTML and XML).
> Allow multiple CIF2 encodings always including UTF-8:
> =====================================================
> Arguments in favour
> * CIF has always been a 'text' standard, with no encoding mandated.
>  This has worked out OK so far

I think this is purely because any non-ASCII encodings that were used
to generate IUCr submissions coincided with ASCII in the ASCII byte
range, and all important information was expressed using ASCII
characters, so the encoding issue was not a big deal.  When we
formally accept non-ASCII characters into the standard, the encoding
issue becomes far more significant, as we can no longer rely on such
neat coincidence between encodings.
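
A small Python sketch of that coincidence.  The list of encodings and
the sample data line are my own choices for illustration, not
something taken from actual submissions: an ASCII-only line produces
the same bytes under every encoding listed, while a single non-ASCII
character does not.

```python
# Encodings that agree with ASCII in the 0x00-0x7F range (illustrative selection).
ENCODINGS = ["ascii", "utf-8", "iso8859_1", "iso8859_5", "iso8859_15", "cp1252"]


def encoded_forms(text: str, encodings=ENCODINGS) -> dict:
    """Map each encoding name to the bytes it produces, or None if it cannot encode text."""
    forms = {}
    for name in encodings:
        try:
            forms[name] = text.encode(name)
        except UnicodeEncodeError:
            forms[name] = None
    return forms


def distinct_forms(text: str) -> set:
    """The set of different results obtained across ENCODINGS."""
    return set(encoded_forms(text).values())


# A CIF1-style ASCII line: every encoding yields identical bytes.
assert len(distinct_forms("_cell_length_a 5.4309")) == 1

# One non-ASCII character: the encodings now disagree (or fail outright).
angstrom = encoded_forms("Å")
assert angstrom["utf-8"] == b"\xc3\x85"
assert angstrom["iso8859_1"] == b"\xc5"
assert angstrom["ascii"] is None
```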

> * Provided sender and receiver system understand that a file is a
>  'text' file, encodings are manipulated automatically to produce a
>  correct file after transmission

This statement is not supported by the facts.  Simon's initial
investigations did not produce a situation where the transmitted bytes
were altered, and I would challenge anyone who believes that bytes are
altered during transmission to produce evidence of this.  What is true
is that the text that is rendered on screen from a series of bytes may
not correspond to the text that was sent, due to an inability to
properly match the input and output encodings.  The more encodings in
play, the more likely this is to happen.
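
To make the distinction concrete, here is a minimal Python sketch (the
function name render is hypothetical).  The bytes arrive exactly as
sent; only the encoding assumed by the receiver changes what appears
on screen.

```python
def render(received: bytes, assumed_encoding: str) -> str:
    """Turn received bytes into text the way a viewer would, given its assumed encoding."""
    return received.decode(assumed_encoding)


sent_text = "é"
wire_bytes = sent_text.encode("utf-8")  # b"\xc3\xa9"
received = wire_bytes                    # transmission leaves the bytes untouched

assert received == wire_bytes
assert render(received, "utf-8") == "é"       # matching encodings: correct text
assert render(received, "iso8859_1") == "Ã©"  # mismatched: same bytes, wrong text
```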

> * If a user anticipates the need to specify encoding (because none is
>  mandated and the documents remind them of this need) then they are
>  more likely to include information about the encoding they are
>  using.  If no encoding information is thought necessary, then a
>  non-UTF-8 encoded file mistakenly sent as a UTF-8 file would be
>  difficult to decode.

We can know almost unambiguously that the mistakenly encoded file is
*not* a UTF-8 file.  Can you say with such certainty that the
supposedly iso-8859-5 file you have received is not iso-8859-5, but
actually iso-8859-15?  Can you even say that your idea of iso-8859-5
is the same as the sender's idea?  Herbert seems to imply that in some
cases different OS's can disagree on the mapping for the same nominal
encoding, which is one more reason to avoid other encodings.
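
A Python sketch of why the two cases differ, assuming Python's bundled
iso8859_5 and iso8859_15 codecs as stand-ins for those encodings (the
helper name decodes_cleanly is illustrative).  Every possible byte
decodes without complaint under both single-byte encodings, so a
decoder alone cannot tell them apart, whereas the UTF-8 decoder
refuses the same data.

```python
def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if data decodes under the given encoding without error."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


ALL_BYTES = bytes(range(256))

# Both single-byte encodings accept any byte sequence whatsoever.
assert decodes_cleanly(ALL_BYTES, "iso8859_5")
assert decodes_cleanly(ALL_BYTES, "iso8859_15")

# Cyrillic text saved as iso-8859-5 ...
cyrillic = "структура".encode("iso8859_5")
# ... is silently accepted as iso-8859-15, yielding the wrong characters,
assert decodes_cleanly(cyrillic, "iso8859_15")
assert cyrillic.decode("iso8859_15") != "структура"
# ... but is rejected outright as UTF-8.
assert not decodes_cleanly(cyrillic, "utf-8")
```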

> * Binary formats are bad

Not a lot of supporting argument has been provided for this one.  I
would agree that undocumented, unsupported binary formats are awful
and require a lot of work.  However, 'text' formats are just a
specific type of binary format, where the mapping to code points is
(a) known and (b) supported by readily available system tools.  Where
the mapping to code points is not known, or the system tools do not
support that mapping, a 'text' file in the wrong encoding is just as
bad as a binary file.  Ever tried opening a CJK-encoded PDF using a
standard Western Adobe Acrobat?  You'll know what I mean, because
unless you are able to download and install the Adobe CJK-kit, that
PDF file is useless.

> * Labelling is normal practice, and so there is nothing contentious
>  about labelling the encoding used in a file

Labelling is wonderful (the more metadata the better) except that
someone has to do it, and get it right.  We go to considerable lengths
at my lab (a reactor source) to make sure that users need only print
out a barcode to label their samples, but it requires constant urging
from the staff to make sure this procedure is followed.  What then of
CIF?  Only rejection of files lacking an encoding label would do the
trick, but if you are going to do this, you might as well just specify
UTF-8 and reject non-UTF-8 files.
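
Rejecting non-UTF-8 input is a few lines in most languages.  A Python
sketch follows; the function name read_cif2_text and the wording of
the error message are hypothetical, not part of any CIF2 draft.

```python
def read_cif2_text(data: bytes) -> str:
    """Decode file contents as UTF-8, rejecting anything else with a pointer to the problem."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte the decoder could not accept.
        raise ValueError(
            f"not valid UTF-8: byte 0x{data[err.start]:02x} at offset {err.start}"
        ) from err


# A correctly encoded value passes straight through.
assert read_cif2_text("_x 'Å'".encode("utf-8")) == "_x 'Å'"

# The same value saved as ISO-8859-1 is refused, with the offending offset reported.
try:
    read_cif2_text("_x 'Å'".encode("iso8859_1"))
except ValueError as exc:
    print(exc)  # not valid UTF-8: byte 0xc5 at offset 4
```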
> * Saving CIF files in the native text format allows system text tools
>  (e.g. for searching) to be useful

True.  I think this is a point in favour of native text formats, if
the available tools are unable to search for text in a different
encoding.

> * Users are going to produce CIFs in multiple encodings anyway, so we
>  might as well try to manage this use by specifying standards

Assuming that virtually all users will have access to utf-8 capable
tools, there is a contradiction in supposing that a user is incapable
of working out how to output a UTF-8 file, but is able to correctly
output some other encoding, which they are also willing and able to
insert in the file header correctly.  If the default encoding is not
UTF-8, they need to realise that this is the case, find out what the
default actually is, and find out how to specify that encoding in
CIF2.  And hope that when they click 'save' they are actually using
the default encoding, and not some other encoding that was specified
in some setup file somewhere.  If only UTF-8 were acceptable, that
same user would still have to find out what their default encoding is,
attempt to change it to UTF-8 if necessary, and click 'save', with
the important distinction that the recipient is almost certain to
detect if the file is not really UTF-8 encoded, due to the properties
of UTF-8.  I don't see that the availability of multiple encodings has
reduced the scope for error; if anything, there are more ways for
things to go wrong.  At least with UTF-8 it is easy to automatically
detect when a file is not correctly encoded, unlike any other
encoding.

> Arguments against multiple encodings:
> * There is no way to reliably detect which encoding has been used in a
>  file, and it is not reasonable to assume that a human editor has
>  gotten an embedded encoding declaration correct, requiring that all
>  files are therefore read through by a human after transmission to
>  check for incorrect letters, accents etc.
> * Facilitating use of multiple encodings encourages them to be used,
>  which increases the scale of the multiple encoding problem
