Re: [ddlm-group] Summary of encoding discussion so far
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Summary of encoding discussion so far
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Mon, 28 Jun 2010 07:06:11 -0400 (EDT)
- In-Reply-To: <AANLkTiljuKDk9I-6GkQ_gnIPJRk8lv7JjHDARdi6tAwv@mail.gmail.com>
- References: <AANLkTiljuKDk9I-6GkQ_gnIPJRk8lv7JjHDARdi6tAwv@mail.gmail.com>
Dear Colleagues,

I would suggest posting this summary to the wider community and
soliciting their comments. While I strongly disagree with James'
comments in his marked up version, I have no objection to his also
posting his views to the wider community _after_ posting the
unmarked-up version, and I will wait a few days after that before
posting any rebuttal. It will be very interesting to see if this
community is ready for a transition to pure UTF8 already.

I would suggest starting with the ccp4-dev, ccp4bb and pdb-l lists.

Regards,
Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Mon, 28 Jun 2010, James Hester wrote:

> The following is a summary of the encoding discussion so far. It
> incorporates material from the previous discussion in Oct/Nov 2009. I
> have refrained from commenting on the validity of the various
> arguments, but I will be posting a subsequent message with my
> thoughts.
>
> There are approximately two points of view regarding the encoding to
> be used for CIF2: allow only one encoding, which would be UTF-8; or
> allow multiple encodings to be used. The multiple encoding option
> comes with several alternative approaches:
>
> 1) Mandate that all CIF2 processors must support UTF-8, and are not
> required to support any other encoding. Non-UTF-8 encoded input CIFs
> would first need to be transcoded by a separate tool to UTF-8
>
> 2) Remain silent on the encoding issue (as for CIF1)
>
> 3) Specify separately a 'CIF interchange format', which would strongly
> encourage use of UTF-8 for transmission and storage but not prohibit
> use of different encodings among agreeing parties.
>
> 4) Specify UTF-8 as a 'canonical' encoding in terms of which trip
> tests and test suites would be written.
>
> Following is a list of the arguments presented for and against the
> above two choices.
>
> Restrict CIF2 to UTF-8:
> =======================
>
> Arguments in favour:
>
> * Implementation of the standard is simpler as CIF processors are not
> required to negotiate encoding or support more than one encoding
>
> * UTF8 is a unique encoding in that a non UTF-8 encoded file is
> detectable with high probability due to the specific bit-patterns
> required for UTF-8 encoding
>
> * A single encoding conforms to the philosophical principle observed
> in earlier CIF standards, that it is only necessary to define one
> convention for each feature in the standard
>
> * A key virtue of CIF is simplicity. Multiple optional encodings is
> not simple.
>
> Arguments against:
>
> * Choosing a specific encoding unduly restricts user freedom or shows
> a lack of respect for the way others do science
>
> * We are premature in settling on Unicode and/or UTF-8; by doing so we
> risk alienating important user groups and/or backing the wrong horse
>
> Allow multiple CIF2 encodings always including UTF-8:
> =====================================================
>
> Arguments in favour
>
> * CIF has always been a 'text' standard, with no encoding mandated.
> This has worked out OK so far
>
> * Provided sender and receiver system understand that a file is a
> 'text' file, encodings are manipulated automatically to produce a
> correct file after transmission
>
> * If a user anticipates the need to specify encoding (because none is
> mandated and the documents remind them of this need) then they are
> more likely to include information about the encoding they are
> using. If no encoding information is thought necessary, then a
> non-UTF-8 encoded file mistakenly sent as a UTF-8 file would be
> difficult to decode.
>
> * Binary formats are bad
>
> * Labelling is normal practice, and so there is nothing contentious
> about labelling the encoding used in a file
>
> * Saving CIF files in the native text format allows system text tools
> (e.g. for searching) to be useful
>
> * Users are going to produce CIFs in multiple encodings anyway, so we
> might as well try to manage this use by specifying standards
>
> Arguments against multiple encodings:
>
> * There is no way to reliably detect which encoding has been used in a
> file, and it is not reasonable to assume that a human editor has
> gotten an embedded encoding declaration correct, requiring that all
> files are therefore read through by a human after transmission to
> check for incorrect letters, accents etc.
>
> * Facilitating use of multiple encodings encourages them to be used,
> which increases the scale of the multiple encoding problem
>
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
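
As an illustration of two points in the quoted summary (that a non-UTF-8
file is detectable "with high probability", and that option 1 would rely
on a separate transcoding tool), here is a minimal Python sketch. The
function names looks_like_utf8 and transcode_to_utf8 are hypothetical and
are not part of any CIF software or of the discussion itself; the sketch
assumes the source encoding is already known when transcoding.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding to UTF-8 (hypothetical helper).

    The caller must supply the correct source encoding; nothing here guesses it.
    """
    return data.decode(source_encoding).encode("utf-8")


if __name__ == "__main__":
    text = "Å = ångström"
    as_utf8 = text.encode("utf-8")
    as_latin1 = text.encode("latin-1")

    print(looks_like_utf8(as_utf8))    # True
    print(looks_like_utf8(as_latin1))  # False: 0xC5 is not followed by a continuation byte
    print(looks_like_utf8(transcode_to_utf8(as_latin1, "latin-1")))  # True
```

The check is only probabilistic in one direction: pure ASCII, or a byte
pair such as 0xC3 0xA9, is valid UTF-8 and also valid Latin-1, so passing
the check does not prove the author intended UTF-8.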