Discussion List Archives

Re: [ddlm-group] Summary of encoding discussion so far

Dear Colleagues,

   I would suggest posting this summary to the wider community and
soliciting their comments.  While I strongly disagree with James'
comments in his marked-up version, I have no objection to his
also posting his views to the wider community _after_ posting
the unmarked-up version, and I will wait a few days after that before
posting any rebuttal.  It will be very interesting to see whether this
community is already prepared for a transition to pure UTF-8.

   I would suggest starting with the ccp4-dev, ccp4bb and pdb-l lists.


  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769


On Mon, 28 Jun 2010, James Hester wrote:

> The following is a summary of the encoding discussion so far.  It
> incorporates material from the previous discussion in Oct/Nov 2009. I
> have refrained from commenting on the validity of the various
> arguments, but I will be posting a subsequent message with my
> thoughts.
> There are broadly two points of view regarding the encoding to
> be used for CIF2: allow only one encoding, which would be UTF-8; or
> allow multiple encodings to be used. The multiple encoding option
> comes with several alternative approaches:
> 1) Mandate that all CIF2 processors must support UTF-8, and are not
> required to support any other encoding.  Non-UTF-8 encoded input CIFs
> would first need to be transcoded by a separate tool to UTF-8
> 2) Remain silent on the encoding issue (as for CIF1)
> 3) Specify separately a 'CIF interchange format', which would strongly
> encourage use of UTF-8 for transmission and storage but not prohibit
> use of different encodings among agreeing parties.
> 4) Specify UTF-8 as a 'canonical' encoding in terms of which trip
> tests and test suites would be written.
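
Option 1 above relies on a separate transcoding step before a CIF2
processor sees the file.  A minimal sketch of such a step in Python,
assuming the source encoding is already known to the caller (the
function name is hypothetical and not part of any existing CIF tool):

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding as UTF-8.

    Hypothetical helper: the caller must already know source_encoding.
    Raises UnicodeDecodeError if data is not valid in that encoding.
    """
    text = data.decode(source_encoding)  # strict error handling by default
    return text.encode("utf-8")


# "Angstrom" with A-ring and o-umlaut, written with escapes so the
# example does not depend on the encoding of this message.
word = "\u00c5ngstr\u00f6m"
latin1_bytes = word.encode("latin-1")
utf8_bytes = transcode_to_utf8(latin1_bytes, "latin-1")
```

The sketch only works when the source encoding is supplied correctly;
it does nothing to discover that encoding.
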
> Following is a list of the arguments presented for and against the
> above two choices.
> Restrict CIF2 to UTF-8:
> =======================
> Arguments in favour:
> * Implementation of the standard is simpler as CIF processors are not
>  required to negotiate encoding or support more than one encoding
> * UTF-8 is unusual in that a non-UTF-8 encoded file is
>  detectable with high probability due to the specific bit-patterns
>  required for UTF-8 encoding
> * A single encoding conforms to the philosophical principle observed
>  in earlier CIF standards, that it is only necessary to define one
>  convention for each feature in the standard
> * A key virtue of CIF is simplicity.  Supporting multiple optional
>  encodings is not simple.
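
The detectability argument above can be illustrated with a strict
decode.  This is a minimal sketch, assuming Python's built-in UTF-8
codec as the validity check; the function name is hypothetical:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")  # strict: malformed byte sequences raise
    except UnicodeDecodeError:
        return False
    return True


ascii_only = is_valid_utf8(b"data_example")                  # pure ASCII
utf8_text = is_valid_utf8("\u00c5".encode("utf-8"))          # bytes C3 85
latin1_text = is_valid_utf8("\u00c5".encode("latin-1"))      # lone byte C5
coincidence = is_valid_utf8("\u00c3\u00a9".encode("latin-1"))  # bytes C3 A9
```

Pure ASCII input always passes, so the check only says something about
files containing non-ASCII bytes.  The last case shows why the claim is
"high probability" rather than certainty: a two-character Latin-1
sequence can happen to form a well-formed UTF-8 sequence.
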
> Arguments against:
> * Choosing a specific encoding unduly restricts user freedom or shows
>  a lack of respect for the way others do science
> * We are premature in settling on Unicode and/or UTF-8; by doing so we
>  risk alienating important user groups and/or backing the wrong horse
> Allow multiple CIF2 encodings always including UTF-8:
> =====================================================
> Arguments in favour:
> * CIF has always been a 'text' standard, with no encoding mandated.
>  This has worked out OK so far
> * Provided the sending and receiving systems understand that a file
>  is a 'text' file, encodings are manipulated automatically to produce
>  a correct file after transmission
> * If a user anticipates the need to specify encoding (because none is
>  mandated and the documents remind them of this need) then they are
>  more likely to include information about the encoding they are
>  using.  If no encoding information is thought necessary, then a
>  non-UTF-8 encoded file mistakenly sent as a UTF-8 file would be
>  difficult to decode.
> * Binary formats are bad
> * Labelling is normal practice, and so there is nothing contentious
>  about labelling the encoding used in a file
> * Saving CIF files in the native text format allows system text tools
>  (e.g. for searching) to be useful
> * Users are going to produce CIFs in multiple encodings anyway, so we
>  might as well try to manage this use by specifying standards
> Arguments against multiple encodings:
> * There is no way to reliably detect which encoding has been used in a
>  file, and it is not reasonable to assume that a human editor has
>  gotten an embedded encoding declaration correct.  All files would
>  therefore have to be read through by a human after transmission to
>  check for incorrect letters, accents etc.
> * Facilitating use of multiple encodings encourages them to be used,
>  which increases the scale of the multiple encoding problem
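
The detection problem described above can be seen with a single byte.
The following is a sketch only, assuming three single-byte encodings
chosen for illustration; the function name is hypothetical:

```python
def decode_candidates(data: bytes, encodings: list[str]) -> dict[str, str]:
    """Map each encoding that decodes data without error to its text."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, so skip it
    return results


# One byte, three plausible readings: none of them raises an error.
results = decode_candidates(b"\xe9", ["latin-1", "cp1251", "iso8859_7"])
```

All three decodes succeed and each yields a different letter (Latin,
Cyrillic and Greek respectively), so the bytes alone do not identify
the encoding that the author used.
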
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
International Union of Crystallography