Re: [ddlm-group] Summary of encoding discussion so far
- To: Group finalising DDLm and associated dictionaries <[email protected]>
- Subject: Re: [ddlm-group] Summary of encoding discussion so far
- From: "Herbert J. Bernstein" <[email protected]>
- Date: Mon, 28 Jun 2010 07:06:11 -0400 (EDT)
- In-Reply-To: <[email protected]>
- References: <[email protected]>
Dear Colleagues,
I would suggest posting this summary to the wider community and
soliciting their comments. While I strongly disagree with James'
comments in his marked up version, I have no objection to his
also posting his views to the wider community _after_ posting
the unmarked-up version, and I will wait a few days after that before
posting any rebuttal. It will be very interesting to see if this
community is ready for a transition to pure UTF-8 already.
I would suggest starting with the ccp4-dev, ccp4bb and pdb-l lists.
Regards,
Herbert
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
[email protected]
=====================================================
On Mon, 28 Jun 2010, James Hester wrote:
> The following is a summary of the encoding discussion so far. It
> incorporates material from the previous discussion in Oct/Nov 2009. I
> have refrained from commenting on the validity of the various
> arguments, but I will be posting a subsequent message with my
> thoughts.
>
> There are broadly two points of view regarding the encoding to
> be used for CIF2: allow only one encoding, which would be UTF-8; or
> allow multiple encodings to be used. The multiple encoding option
> comes with several alternative approaches:
>
> 1) Mandate that all CIF2 processors must support UTF-8, and are not
> required to support any other encoding. Non-UTF-8 encoded input CIFs
> would first need to be transcoded to UTF-8 by a separate tool.
>
> 2) Remain silent on the encoding issue (as for CIF1).
>
> 3) Specify separately a 'CIF interchange format', which would strongly
> encourage use of UTF-8 for transmission and storage but not prohibit
> use of different encodings among agreeing parties.
>
> 4) Specify UTF-8 as a 'canonical' encoding in terms of which trip
> tests and test suites would be written.
>
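The transcoding step mentioned in option 1 can be sketched as follows.
This is only a minimal illustration, assuming Python 3 and assuming the
source encoding is already known from some outside channel; the function
name transcode_to_utf8 is hypothetical and not part of any CIF tool.

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a declared source encoding and re-encode as UTF-8.

    Hypothetical helper for illustration only. Strict decoding is used,
    so bytes that are invalid in the declared encoding raise an error
    instead of being silently replaced.
    """
    text = data.decode(source_encoding)
    return text.encode("utf-8")


# Example: the angstrom letter is one byte in Latin-1, two bytes in UTF-8.
latin1_bytes = "Å".encode("latin-1")
utf8_bytes = transcode_to_utf8(latin1_bytes, "latin-1")
```

The sketch only works when the declared source encoding is correct,
which is the very point under discussion.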
> The following is a list of the arguments presented for and against
> the above two choices.
>
> Restrict CIF2 to UTF-8:
> =======================
>
> Arguments in favour:
>
> * Implementation of the standard is simpler as CIF processors are not
> required to negotiate encoding or support more than one encoding
>
> * UTF-8 is unique among encodings in that a non-UTF-8 encoded file is
> detectable with high probability due to the specific bit patterns
> required for UTF-8 encoding
>
> * A single encoding conforms to the philosophical principle observed
> in earlier CIF standards, that it is only necessary to define one
> convention for each feature in the standard
>
> * A key virtue of CIF is simplicity. Supporting multiple optional
> encodings is not simple.
>
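The detectability argument above can be illustrated with a strict
decode. This is a minimal sketch, assuming Python 3; the function name
looks_like_utf8 is hypothetical. It detects most non-UTF-8 input that
contains bytes above 0x7F, but pure ASCII always passes, so the check
gives high probability, not certainty.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes form a valid UTF-8 sequence.

    Hypothetical helper for illustration only. Bytes above 0x7F must
    follow the UTF-8 lead/continuation patterns, so text in most
    single-byte encodings fails as soon as it has an accented letter.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


utf8_sample = "café".encode("utf-8")      # valid UTF-8
latin1_sample = "café".encode("latin-1")  # lone 0xE9 byte, invalid UTF-8
ascii_sample = b"data_block"              # ASCII is also valid UTF-8
```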
> Arguments against:
>
> * Choosing a specific encoding unduly restricts user freedom or shows
> a lack of respect for the way others do science
>
> * We are premature in settling on Unicode and/or UTF-8; by doing so we
> risk alienating important user groups and/or backing the wrong horse
>
> Allow multiple CIF2 encodings, always including UTF-8:
> ======================================================
>
> Arguments in favour:
>
> * CIF has always been a 'text' standard, with no encoding mandated.
> This has worked out OK so far.
>
> * Provided the sender and receiver systems both understand that a
> file is a 'text' file, encodings are manipulated automatically to
> produce a correct file after transmission
>
> * If a user anticipates the need to specify encoding (because none is
> mandated and the documents remind them of this need) then they are
> more likely to include information about the encoding they are
> using. If no encoding information is thought necessary, then a
> non-UTF-8 encoded file mistakenly sent as a UTF-8 file would be
> difficult to decode.
>
> * Binary formats are bad
>
> * Labelling is normal practice, and so there is nothing contentious
> about labelling the encoding used in a file
>
> * Saving CIF files in the native text format allows system text tools
> (e.g. for searching) to be useful
>
> * Users are going to produce CIFs in multiple encodings anyway, so we
> might as well try to manage this use by specifying standards
>
> Arguments against multiple encodings:
>
> * There is no way to reliably detect which encoding has been used in a
> file, and it is not reasonable to assume that a human editor has
> gotten an embedded encoding declaration correct. Every file would
> therefore have to be read through by a human after transmission to
> check for incorrect letters, accents etc.
>
> * Facilitating use of multiple encodings encourages them to be used,
> which increases the scale of the multiple encoding problem
>
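The detection problem described above can be seen with two single-byte
encodings. This is a minimal sketch, assuming Python 3 and using
Latin-1 and Windows-1251 purely as example encodings: the same bytes
decode without error under both, but give different letters, so nothing
in the file itself shows which reading was intended.

```python
# Bytes written by a sender who used Latin-1.
raw = "café".encode("latin-1")

# Both decodes succeed; neither raises an error.
as_latin1 = raw.decode("latin-1")  # the intended text
as_cp1251 = raw.decode("cp1251")   # same bytes read as Cyrillic
```

Only a human reader, or outside knowledge of the sender's system, can
tell that the second result is wrong.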
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> [email protected]
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
- Follow-Ups:
- [ddlm-group] A useful web page (Herbert J. Bernstein)
- References:
- [ddlm-group] Summary of encoding discussion so far (James Hester)