
[ddlm-group] Summary of encoding discussion so far

The following is a summary of the encoding discussion so far.  It
incorporates material from the previous discussion in Oct/Nov 2009. I
have refrained from commenting on the validity of the various
arguments, but I will be posting a subsequent message with my
thoughts.

There are broadly two points of view regarding the encoding to
be used for CIF2: allow only one encoding, which would be UTF-8; or
allow multiple encodings to be used. The multiple-encoding option
comes with several alternative approaches:

1) Mandate that all CIF2 processors support UTF-8, without requiring
them to support any other encoding.  Non-UTF-8 encoded input CIFs
would first need to be transcoded to UTF-8 by a separate tool.

2) Remain silent on the encoding issue (as for CIF1)

3) Specify separately a 'CIF interchange format', which would strongly
encourage use of UTF-8 for transmission and storage but not prohibit
use of different encodings among agreeing parties.

4) Specify UTF-8 as a 'canonical' encoding in terms of which trip
tests and test suites would be written.
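
As an illustration of the 'separate tool' in approach (1), here is a
minimal sketch in Python. It assumes the source encoding is already
known to whoever runs the tool; the function name transcode_to_utf8 is
hypothetical and is not part of any CIF software.

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding into UTF-8.

    Assumption: the caller knows the source encoding. Raises
    UnicodeDecodeError if the bytes are not valid in that encoding.
    """
    text = data.decode(source_encoding)
    return text.encode("utf-8")


# One character (A with ring above) stored as a single Latin-1 byte.
latin1_bytes = "\u00c5".encode("latin-1")
utf8_bytes = transcode_to_utf8(latin1_bytes, "latin-1")
```

The sketch only works when the source encoding is supplied correctly,
which is the point on which the arguments below turn.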

Following is a list of the arguments presented for and against the
above two choices.

Restrict CIF2 to UTF-8:
=======================

Arguments in favour:

* Implementation of the standard is simpler as CIF processors are not
  required to negotiate encoding or support more than one encoding

* UTF-8 is unusual among encodings in that a non-UTF-8 encoded file
  is detectable with high probability, due to the specific bit
  patterns required for UTF-8 encoding

* A single encoding conforms to the philosophical principle observed
  in earlier CIF standards, that it is only necessary to define one
  convention for each feature in the standard

* A key virtue of CIF is simplicity.  Multiple optional encodings are
  not simple.
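
The detectability argument above can be sketched in Python by simply
attempting a strict UTF-8 decode. The helper name looks_like_utf8 is
hypothetical. Note that a successful decode shows only that the bytes
are well-formed UTF-8, not that UTF-8 was the encoding intended, which
is why the claim is 'high probability' rather than certainty.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")  # strict decoding rejects bad bit patterns
    except UnicodeDecodeError:
        return False
    return True


genuine = "\u00c5".encode("utf-8")       # two bytes: 0xC3 0x85
latin1 = "\u00c5".encode("latin-1")      # one byte: 0xC5, a lone lead byte
# Two Windows-1252 characters whose bytes happen to be valid UTF-8.
coincidence = "\u00c3\u2026".encode("cp1252")

results = (
    looks_like_utf8(genuine),
    looks_like_utf8(latin1),
    looks_like_utf8(coincidence),
)
```

The third case is the rare false positive: a Windows-1252 file that
passes the check by accident.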

Arguments against:

* Choosing a specific encoding unduly restricts user freedom or shows
  a lack of respect for the way others do science

* We are premature in settling on Unicode and/or UTF-8; by doing so we
  risk alienating important user groups and/or backing the wrong horse

Allow multiple CIF2 encodings always including UTF-8:
=====================================================

Arguments in favour:

* CIF has always been a 'text' standard, with no encoding mandated.
  This has worked out OK so far.

* Provided the sender and receiver systems understand that a file is
  a 'text' file, encodings are manipulated automatically to produce a
  correct file after transmission

* If a user anticipates the need to specify encoding (because none is
  mandated and the documents remind them of this need) then they are
  more likely to include information about the encoding they are
  using.  If no encoding information is thought necessary, then a
  non-UTF-8 encoded file mistakenly sent as a UTF-8 file would be
  difficult to decode.

* Binary formats are bad

* Labelling is normal practice, and so there is nothing contentious
  about labelling the encoding used in a file

* Saving CIF files in the native text format allows system text tools
  (e.g. for searching) to be useful

* Users are going to produce CIFs in multiple encodings anyway, so we
  might as well try to manage this use by specifying standards

Arguments against multiple encodings:

* There is no way to reliably detect which encoding has been used in a
  file, and it is not reasonable to assume that a human editor has
  gotten an embedded encoding declaration correct.  All files would
  therefore need to be read through by a human after transmission to
  check for incorrect letters, accents etc.

* Facilitating use of multiple encodings encourages them to be used,
  which increases the scale of the multiple encoding problem
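
To illustrate the detection problem raised in the first argument
above, here is a minimal Python sketch. The three candidate encodings
are chosen arbitrarily for illustration; the same byte decodes without
error under each of them, but to a different character each time, so
nothing in the data itself identifies the right one.

```python
raw = b"\xe9"  # a single byte above the ASCII range

# Arbitrary single-byte encodings, picked only for illustration.
candidates = ["latin-1", "koi8-r", "iso-8859-5"]

# Every decode succeeds; none raises an error that would flag a
# wrong guess.
decoded = {name: raw.decode(name) for name in candidates}

for name, char in decoded.items():
    print(f"{name}: U+{ord(char):04X}")
```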


-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148