
Re: [ddlm-group] Summary of encoding discussion so far

Dear Colleagues,

   I would suggest posting this summary to the wider community and
soliciting their comments.  While I strongly disagree with James'
comments in his marked-up version, I have no objection to his
also posting his views to the wider community _after_ the unmarked-up
version has been posted, and I will wait a few days after that before
posting any rebuttal.  It will be very interesting to see whether this
community is already prepared for a transition to pure UTF-8.

   I would suggest starting with the ccp4-dev, ccp4bb and pdb-l lists.

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Mon, 28 Jun 2010, James Hester wrote:

> The following is a summary of the encoding discussion so far.  It
> incorporates material from the previous discussion in Oct/Nov 2009. I
> have refrained from commenting on the validity of the various
> arguments, but I will be posting a subsequent message with my
> thoughts.
>
> There are broadly two points of view regarding the encoding to
> be used for CIF2: allow only one encoding, which would be UTF-8; or
> allow multiple encodings to be used. The multiple-encoding option
> comes with several alternative approaches:
>
> 1) Mandate that all CIF2 processors must support UTF-8, but are not
> required to support any other encoding.  Non-UTF-8 encoded input CIFs
> would first need to be transcoded to UTF-8 by a separate tool.
>
> 2) Remain silent on the encoding issue (as for CIF1).
>
> 3) Specify separately a 'CIF interchange format', which would strongly
> encourage use of UTF-8 for transmission and storage but not prohibit
> use of different encodings among agreeing parties.
>
> 4) Specify UTF-8 as a 'canonical' encoding in terms of which trip
> tests and test suites would be written.
>
> Following is a list of the arguments presented for and against the
> above two choices.
>
> Restrict CIF2 to UTF-8:
> =======================
>
> Arguments in favour:
>
> * Implementation of the standard is simpler as CIF processors are not
>  required to negotiate encoding or support more than one encoding
>
> * UTF-8 is unusual among encodings in that a file that is not
>  UTF-8 encoded is detectable with high probability, due to the
>  specific bit-patterns required for UTF-8 encoding
>
> * A single encoding conforms to the philosophical principle observed
>  in earlier CIF standards, that it is only necessary to define one
>  convention for each feature in the standard
>
> * A key virtue of CIF is simplicity.  Supporting multiple optional
>  encodings is not simple.
>
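As an aside on the detectability argument above, here is a minimal Python sketch of what "detectable with high probability" can mean in practice. It assumes that strict decoding is an adequate stand-in for checking the UTF-8 bit-patterns; the function name looks_like_utf8 is hypothetical and introduced only for illustration. Note the last case: some byte sequences from other encodings do happen to be valid UTF-8, which is why the claim is one of high probability rather than certainty.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if `data` is a well-formed UTF-8 byte sequence.

    Hypothetical helper for illustration only: it relies on Python's
    strict UTF-8 decoder to reject invalid lead/continuation bytes.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Text that really is UTF-8 passes the check.
utf8_ok = looks_like_utf8("Ångström".encode("utf-8"))

# The same text saved as Latin-1 contains a lone 0xC5 lead byte and fails.
latin1_rejected = not looks_like_utf8("Ångström".encode("latin-1"))

# A false positive: the Latin-1 bytes for "Ã©" are 0xC3 0xA9, which is
# also the valid UTF-8 encoding of "é", so this file is not detected.
false_positive = looks_like_utf8("Ã©".encode("latin-1"))
```

Plain ASCII input always passes, since ASCII is a subset of UTF-8.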
> Arguments against:
>
> * Choosing a specific encoding unduly restricts user freedom or shows
>  a lack of respect for the way others do science
>
> * We are premature in settling on Unicode and/or UTF-8; by doing so we
>  risk alienating important user groups and/or backing the wrong horse
>
> Allow multiple CIF2 encodings always including UTF-8:
> =====================================================
>
> Arguments in favour:
>
> * CIF has always been a 'text' standard, with no encoding mandated.
>  This has worked out OK so far.
>
> * Provided the sending and receiving systems understand that a file
>  is a 'text' file, encodings are manipulated automatically to
>  produce a correct file after transmission
>
> * If a user anticipates the need to specify encoding (because none is
>  mandated and the documents remind them of this need) then they are
>  more likely to include information about the encoding they are
>  using.  If no encoding information is thought necessary, then a
>  non-UTF-8 encoded file mistakenly sent as a UTF-8 file would be
>  difficult to decode.
>
> * Binary formats are bad
>
> * Labelling is normal practice, and so there is nothing contentious
>  about labelling the encoding used in a file
>
> * Saving CIF files in the native text format allows system text tools
>  (e.g. for searching) to be useful
>
> * Users are going to produce CIFs in multiple encodings anyway, so we
>  might as well try to manage this use by specifying standards
>
> Arguments against multiple encodings:
>
> * There is no way to reliably detect which encoding has been used in a
>  file, and it is not reasonable to assume that a human editor has
>  gotten an embedded encoding declaration correct, so every file
>  would have to be read through by a human after transmission to
>  check for incorrect letters, accents, etc.
>
> * Facilitating use of multiple encodings encourages them to be used,
>  which increases the scale of the multiple encoding problem
>
>
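To illustrate the argument that the encoding of a file cannot be reliably detected, the following Python sketch decodes one byte string under several candidate encodings. The helper name decode_under and the choice of candidate encodings are assumptions made for this example, not anything proposed in the discussion. The point is that the single-byte encodings all accept the same bytes without error and simply yield different letters, so nothing in the file itself reveals which one was intended.

```python
from typing import Dict, List, Optional


def decode_under(data: bytes, encodings: List[str]) -> Dict[str, Optional[str]]:
    """Decode `data` under each named encoding.

    Hypothetical helper for illustration: maps each encoding name to the
    decoded text, or to None if the bytes are invalid in that encoding.
    """
    results: Dict[str, Optional[str]] = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# "café" as written by a Latin-1 editor: the final byte is 0xE9.
sample = b"caf\xe9"

readings = decode_under(sample, ["utf-8", "latin-1", "cp1251", "iso-8859-7"])
for name, text in readings.items():
    print(name, repr(text))
```

Only the UTF-8 attempt reports a failure; the other three succeed silently with three different final letters, which is the situation that would require a human to read the file through.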
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
