[ddlm-group] A useful web page
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: [ddlm-group] A useful web page
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Mon, 28 Jun 2010 08:05:32 -0400 (EDT)
- In-Reply-To: <alpine.BSF.2.00.1006280654470.96218@epsilon.pair.com>
- References: <AANLkTiljuKDk9I-6GkQ_gnIPJRk8lv7JjHDARdi6tAwv@mail.gmail.com> <alpine.BSF.2.00.1006280654470.96218@epsilon.pair.com>
Many of you may find the discussion of character encodings used for
HTML4 helpful:

http://www.w3.org/TR/REC-html40/charset.html

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Mon, 28 Jun 2010, Herbert J. Bernstein wrote:

> Dear Colleagues,
>
> I would suggest posting this summary to the wider community and
> soliciting their comments. While I strongly disagree with James'
> comments in his marked up version, I have no objection to his
> also posting his views to the wider community _after_ posting
> the unmarked-up version, and I will wait a few days after that before
> posting any rebuttal. It will be very interesting to see if this
> community is ready for a transition to pure UTF8 already.
>
> I would suggest starting with the ccp4-dev, ccp4bb and pdb-l lists.
>
> Regards,
> Herbert
>
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
> Dowling College, Kramer Science Center, KSC 121
> Idle Hour Blvd, Oakdale, NY, 11769
>
> +1-631-244-3035
> yaya@dowling.edu
> =====================================================
>
> On Mon, 28 Jun 2010, James Hester wrote:
>
>> The following is a summary of the encoding discussion so far. It
>> incorporates material from the previous discussion in Oct/Nov 2009. I
>> have refrained from commenting on the validity of the various
>> arguments, but I will be posting a subsequent message with my
>> thoughts.
>>
>> There are approximately two points of view regarding the encoding to
>> be used for CIF2: allow only one encoding, which would be UTF-8; or
>> allow multiple encodings to be used. The multiple encoding option
>> comes with several alternative approaches:
>>
>> 1) Mandate that all CIF2 processors must support UTF-8, and are not
>> required to support any other encoding. Non-UTF-8 encoded input CIFs
>> would first need to be transcoded by a separate tool to UTF-8
>>
>> 2) Remain silent on the encoding issue (as for CIF1)
>>
>> 3) Specify separately a 'CIF interchange format', which would strongly
>> encourage use of UTF-8 for transmission and storage but not prohibit
>> use of different encodings among agreeing parties.
>>
>> 4) Specify UTF-8 as a 'canonical' encoding in terms of which trip
>> tests and test suites would be written.
>>
>> Following is a list of the arguments presented for and against the
>> above two choices.
>>
>> Restrict CIF2 to UTF-8:
>> =======================
>>
>> Arguments in favour:
>>
>> * Implementation of the standard is simpler as CIF processors are not
>> required to negotiate encoding or support more than one encoding
>>
>> * UTF8 is a unique encoding in that a non UTF-8 encoded file is
>> detectable with high probability due to the specific bit-patterns
>> required for UTF-8 encoding
>>
>> * A single encoding conforms to the philosophical principle observed
>> in earlier CIF standards, that it is only necessary to define one
>> convention for each feature in the standard
>>
>> * A key virtue of CIF is simplicity. Multiple optional encodings is
>> not simple.
>>
>> Arguments against:
>>
>> * Choosing a specific encoding unduly restricts user freedom or shows
>> a lack of respect for the way others do science
>>
>> * We are premature in settling on Unicode and/or UTF-8; by doing so we
>> risk alienating important user groups and/or backing the wrong horse
>>
>> Allow multiple CIF2 encodings always including UTF-8:
>> =====================================================
>>
>> Arguments in favour
>>
>> * CIF has always been a 'text' standard, with no encoding mandated.
>> This has worked out OK so far
>>
>> * Provided sender and receiver system understand that a file is a
>> 'text' file, encodings are manipulated automatically to produce a
>> correct file after transmission
>>
>> * If a user anticipates the need to specify encoding (because none is
>> mandated and the documents remind them of this need) then they are
>> more likely to include information about the encoding they are
>> using. If no encoding information is thought necessary, then a
>> non-UTF-8 encoded file mistakenly sent as a UTF-8 file would be
>> difficult to decode.
>>
>> * Binary formats are bad
>>
>> * Labelling is normal practice, and so there is nothing contentious
>> about labelling the encoding used in a file
>>
>> * Saving CIF files in the native text format allows system text tools
>> (e.g. for searching) to be useful
>>
>> * Users are going to produce CIFs in multiple encodings anyway, so we
>> might as well try to manage this use by specifying standards
>>
>> Arguments against multiple encodings:
>>
>> * There is no way to reliably detect which encoding has been used in a
>> file, and it is not reasonable to assume that a human editor has
>> gotten an embedded encoding declaration correct, requiring that all
>> files are therefore read through by a human after transmission to
>> check for incorrect letters, accents etc.
>>
>> * Facilitating use of multiple encodings encourages them to be used,
>> which increases the scale of the multiple encoding problem
>>
>>
>> --
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
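
Two points in the summary above lend themselves to a short illustration:
the claim that a non-UTF-8 file is detectable with high probability, and
option 1's separate transcoding step. The following is a minimal Python
sketch only. The function names are hypothetical, and ISO-8859-1 is used
purely as an example legacy encoding; neither comes from the thread or
from any CIF2 specification.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 sequence.

    A strict decode rejects stray continuation bytes, truncated
    multi-byte sequences and overlong forms, which is why most text in
    a legacy 8-bit encoding fails this check once it contains
    non-ASCII characters.
    """
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding as UTF-8.

    The caller must already know source_encoding; nothing here tries
    to guess it.
    """
    return raw.decode(source_encoding).encode("utf-8")


# Hypothetical example value: "café" stored as ISO-8859-1 (assumption).
legacy = b"caf\xe9"
if not is_valid_utf8(legacy):
    legacy = transcode_to_utf8(legacy, "iso-8859-1")
```

The check is probabilistic rather than certain: pure ASCII input passes
it under almost any encoding, and a legacy byte sequence can happen to be
well-formed UTF-8 by coincidence, in which case it is silently accepted
with the wrong characters.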
- References:
- [ddlm-group] Summary of encoding discussion so far (James Hester)
- Re: [ddlm-group] Summary of encoding discussion so far (Herbert J. Bernstein)