Re: [ddlm-group] Summary of encoding discussion so far
- To: ddlm-group <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Summary of encoding discussion so far
- From: James Hester <jamesrhester@gmail.com>
- Date: Mon, 28 Jun 2010 15:35:00 +1000
- In-Reply-To: <AANLkTiljuKDk9I-6GkQ_gnIPJRk8lv7JjHDARdi6tAwv@mail.gmail.com>
- References: <AANLkTiljuKDk9I-6GkQ_gnIPJRk8lv7JjHDARdi6tAwv@mail.gmail.com>
See inserted comments: On Mon, Jun 28, 2010 at 2:30 PM, James Hester <jamesrhester@gmail.com> wrote:

> The following is a summary of the encoding discussion so far. It
> incorporates material from the previous discussion in Oct/Nov 2009. I
> have refrained from commenting on the validity of the various
> arguments, but I will be posting a subsequent message with my
> thoughts.
>
> There are approximately two points of view regarding the encoding to
> be used for CIF2: allow only one encoding, which would be UTF-8; or
> allow multiple encodings to be used. The multiple encoding option
> comes with several alternative approaches:
>
> 1) Mandate that all CIF2 processors must support UTF-8, and are not
> required to support any other encoding. Non-UTF-8 encoded input CIFs
> would first need to be transcoded by a separate tool to UTF-8.
>
> 2) Remain silent on the encoding issue (as for CIF1).
>
> 3) Specify separately a 'CIF interchange format', which would strongly
> encourage use of UTF-8 for transmission and storage but not prohibit
> use of different encodings among agreeing parties.
>
> 4) Specify UTF-8 as a 'canonical' encoding in terms of which trip
> tests and test suites would be written.
>
> Following is a list of the arguments presented for and against the
> above two choices.
>
> Restrict CIF2 to UTF-8:
> =======================
>
> Arguments in favour:
>
> * Implementation of the standard is simpler, as CIF processors are not
> required to negotiate encoding or support more than one encoding.
>
> * UTF-8 is a unique encoding in that a non-UTF-8 encoded file is
> detectable with high probability, due to the specific bit patterns
> required for UTF-8 encoding.

Note specifically that a byte with the high bit set sandwiched between bytes without the high bit set is never valid UTF-8. This immediately catches any wrongly-encoded isolated (e.g. accented or Greek) characters.
Given two neighbouring bytes with the high bit set, the chance that they are valid UTF-8 is a priori 1/8, falling to 1/32 for three high-bit-set bytes, and to about 1/64 for four high bytes in a row.

> * A single encoding conforms to the philosophical principle observed
> in earlier CIF standards, that it is only necessary to define one
> convention for each feature in the standard.
>
> * A key virtue of CIF is simplicity. Multiple optional encodings are
> not simple.
>
> Arguments against:
>
> * Choosing a specific encoding unduly restricts user freedom or shows
> a lack of respect for the way others do science.

I believe this is not a significant concern, as scientists are used to compromising how they do things in order to communicate internationally (most obviously, learning English and having a restricted choice of word processor for publication submission).

> * We are premature in settling on Unicode and/or UTF-8; by doing so we
> risk alienating important user groups and/or backing the wrong horse.

The link cited in support of this statement was at least 12 years old. UTF-8 is supported on all major operating systems, and applications exist that will run on 90s-era platforms, so it seems that UTF-8 is here for the long run. Dare I say it, UTF-8 is the default encoding of the Web (i.e. HTML and XML).

> Allow multiple CIF2 encodings, always including UTF-8:
> ======================================================
>
> Arguments in favour:
>
> * CIF has always been a 'text' standard, with no encoding mandated.
> This has worked out OK so far.

I think this is purely because any non-ASCII encodings that were used to generate IUCr submissions coincided with ASCII in the ASCII byte range, and all important information was expressed using ASCII characters, so the encoding issue was not a big deal. When we formally accept non-ASCII characters into the standard, the encoding issue becomes far more significant, as we can no longer rely on such neat coincidences between encodings.
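The UTF-8 self-detection property described above can be sketched in a few lines of Python (my illustration, not part of the original message): an isolated high-bit byte from a legacy encoding is rejected immediately, and we can count exactly how often a random pair of high-bit bytes happens to form valid UTF-8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# 'café' written in Latin-1: the isolated 0xE9 byte sits between
# plain-ASCII bytes, so UTF-8 validation rejects it immediately.
latin1_bytes = "café".encode("latin-1")       # b'caf\xe9'
print(is_valid_utf8(latin1_bytes))            # False

# The same text encoded as UTF-8 passes, of course.
print(is_valid_utf8("café".encode("utf-8")))  # True

# Count how many pairs of high-bit bytes are valid UTF-8, to check
# the "a priori 1/8" estimate for two neighbouring high bytes.
valid = total = 0
for b1 in range(0x80, 0x100):
    for b2 in range(0x80, 0x100):
        total += 1
        if is_valid_utf8(bytes([b1, b2])):
            valid += 1
print(valid, total)  # 1920 16384, i.e. about 1 in 8.5
```

The exact count (1920 of 16384 pairs, roughly 1 in 8.5) comes out close to the 1/8 figure quoted above; the small difference is because lead bytes 0xC0 and 0xC1 would be overlong encodings and are also rejected.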
> * Provided sender and receiver systems understand that a file is a
> 'text' file, encodings are manipulated automatically to produce a
> correct file after transmission.

This statement is not supported by the facts. Simon's initial investigations did not produce a situation where the transmitted bytes were altered, and I would challenge anyone who believes that bytes are altered during transmission to produce evidence of this. What is true is that the text rendered on screen from a series of bytes may not correspond to the text that was sent, due to an inability to properly match the input and output encodings. The more encodings in play, the more likely this is to happen.

> * If a user anticipates the need to specify encoding (because none is
> mandated and the documents remind them of this need), then they are
> more likely to include information about the encoding they are
> using. If no encoding information is thought necessary, then a
> non-UTF-8 encoded file mistakenly sent as a UTF-8 file would be
> difficult to decode.

We can know almost unambiguously that the mistakenly encoded file is *not* a UTF-8 file. Can you say with such certainty that the supposedly iso-8859-5 file you have received is not iso-8859-5 but actually iso-8859-15? Can you even say that your idea of iso-8859-5 is the same as the sender's idea? Herbert seems to imply that in some cases different OSs can disagree on the mapping for the same nominal encoding, which is one more reason to avoid other encodings.

> * Binary formats are bad.

Not a lot of supporting argument has been provided for this one. I would agree that undocumented, unsupported binary formats are awful and require a lot of work. However, 'text' formats are just a specific type of binary format, where the mapping to code points is (a) known and (b) supported by readily available system tools.
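As an aside, the iso-8859-5 / iso-8859-15 ambiguity raised above is easy to demonstrate (my example, not part of the original message): legacy single-byte encodings assign a character to nearly every byte value, so decoding under the wrong legacy encoding "succeeds" silently, while UTF-8 validation fails loudly on the same bytes.

```python
# Cyrillic text encoded as iso-8859-5.
data = "Привет".encode("iso-8859-5")

# Both legacy decodings succeed without any error;
# only one of them is what the sender meant.
print(data.decode("iso-8859-5"))   # Привет (correct)
print(data.decode("iso-8859-15"))  # mojibake, but no error is raised

# UTF-8 validation, by contrast, rejects the bytes outright.
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected as UTF-8:", exc.reason)
```

Nothing in the bytes themselves distinguishes the two legacy interpretations; only the (easily wrong or missing) label does.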
Where the mapping to code points is not known, or the system tools do not support that mapping, a 'text' file in the wrong encoding is just as bad as a binary file. Ever tried opening a CJK-encoded PDF using a standard Western Adobe Acrobat? You'll know what I mean, because unless you are able to download and install the Adobe CJK kit, that PDF file is useless.

> * Labelling is normal practice, and so there is nothing contentious
> about labelling the encoding used in a file.

Labelling is wonderful (the more metadata the better), except that someone has to do it, and get it right. We go to considerable lengths at my lab (a reactor source) to make sure that users need only print out a barcode to label their samples, but it requires constant urging from the staff to make sure this procedure is followed. What then of CIF? Only rejection of files lacking an encoding label would do the trick, but if you are going to do this, you might as well just specify UTF-8 and reject non-UTF-8 files.

> * Saving CIF files in the native text format allows system text tools
> (e.g. for searching) to be useful.

True, I think this is a point in favour of native text formats, if the available tools are unable to search for text in a different encoding.

> * Users are going to produce CIFs in multiple encodings anyway, so we
> might as well try to manage this use by specifying standards.

Assuming that virtually all users will have access to UTF-8-capable tools, there is a contradiction in supposing that a user is incapable of working out how to output a UTF-8 file, but is able to correctly output some other encoding, which they are also willing and able to insert in the file header correctly. If the default encoding is not UTF-8, they need to realise that this is the case, find out what the default actually is, and find out how to specify that encoding in CIF2.
And hope that when they click 'save' they are actually using the default encoding, and not some other encoding that was specified in a setup file somewhere. If only UTF-8 were acceptable, that same user would still have to find out what their default encoding is, attempt to change it to UTF-8 if necessary, and click 'save', with the important distinction that the recipient is almost certain to detect if the file is not really UTF-8 encoded, due to the properties of UTF-8. I don't see that the availability of multiple encodings has reduced the scope for error; if anything, there are more ways for things to go wrong. At least with UTF-8 it is easy to automatically detect when a file is not correctly encoded, unlike with any other encoding.

> Arguments against multiple encodings:
>
> * There is no way to reliably detect which encoding has been used in a
> file, and it is not reasonable to assume that a human editor has
> gotten an embedded encoding declaration correct, requiring that all
> files therefore be read through by a human after transmission to
> check for incorrect letters, accents etc.
>
> * Facilitating the use of multiple encodings encourages them to be used,
> which increases the scale of the multiple-encoding problem.

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
- Follow-Ups:
- Re: [ddlm-group] Summary of encoding discussion so far (Bollinger, John C)
- References:
- [ddlm-group] Summary of encoding discussion so far (James Hester)