
Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .... .. .

To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .... .. .
From: David Brown <idbrown@mcmaster.ca>
Date: Thu, 24 Jun 2010 10:29:58 -0400
In-Reply-To: <381469.52475.qm@web87004.mail.ird.yahoo.com>
References: <AANLkTilyJE2mCxprlBYaSkysu1OBjY7otWrXDWm3oOT9@mail.gmail.com> <alpine.BSF.2.00.1006212241210.4105@epsilon.pair.com> <AANLkTilACXxnPRtJXEjGD39eleDl9dxlAcwar8j9MBPr@mail.gmail.com> <alpine.BSF.2.00.1006220753471.87930@epsilon.pair.com> <8F77913624F7524AACD2A92EAF3BFA54166122951E@SJMEMXMBS11.stjude.sjcrh.local ><AANLkTikih0j6-vyLDPMOqcTkoiK545yE28y4fU9JTUa2@mail.gmail.com> <20100623103310.GD15883@emerald.iucr.org> <8F77913624F7524AACD2A92EAF3BFA541661229521@SJMEMXMBS11.stjude.sjcrh.local ><alpine.BSF.2.00.1006231033360.56372@epsilon.pair.com> <8F77913624F7524AACD2A92EAF3BFA541661229523@SJMEMXMBS11.stjude.sjcrh.local ><alpine.BSF.2.00.1006231406010.30894@epsilon.pair.com> <8F77913624F7524AACD2A92EAF3BFA541661229526@SJMEMXMBS11.stjude.sjcrh.local ><alpine.BSF.2.00.1006231550410.30894@epsilon.pair.com> <8F77913624F7524AACD2A92EAF3BFA541661229527@SJMEMXMBS11.stjude.sjcrh.local ><a06240802c848414681ef@[192.168.2.104]><381469.52475.qm@web87004.mail.ird.yahoo.com>


I would like to endorse Simon's view. If it ain't broke, don't fix it. We have managed well with ASCII, and ASCII will continue to be used for all but the text fields for a long time even if we allow Unicode. Firstly we have to get dictionaries written, then we have to have CIF2-compliant user programs (to take advantage of the real virtue of CIF2, namely methods). Then we have to persuade CIF writers to produce CIF2 files. They will not be keen to do this until their own programs are able to read CIF2. How many years down the road have we now gone, 10? 20? Even when programs are available for writing CIF2, all the business end of the CIF (numerical tables) will continue to be written in ASCII and most of the Comment and Abstract fields will be as well, though there may be some who will find Unicode useful for subscripts and superscripts and names with accents.

I agree with Simon that we should be looking ahead, but we should be looking ahead to the time when encoding is not the mess that it is at present, and when the choice will be obvious. Surely extending the character set is something that can be added later. DDLm will continue to be written in ASCII as will most private dictionaries. I think we are planning a large airport before we have a plane that can fly.

My vote is to stay with ASCII but keep extended codings in mind so that we can move when we know which way everyone else is going to move. If we choose UTF-8 now we might be backing the wrong horse and then we will be in real trouble.

David

James Hester wrote:

Before I engage with this latest proposal, I need to pick over the statements in your first paragraph carefully, so bear with me:

On Thu, Jun 24, 2010 at 12:47 AM, Herbert J.
Bernstein <yaya@bernstein-plus-sons.com> wrote:

Here is an issue to consider: If we impose a non-text canonical UTF-8 encoding that does not contain an internal encoding signature, and that file is transmitted as text and not binary from a machine for which, say, ASCII with code pages for, say, western Europe, is the native encoding, and the transmission converts the UTF-8 characters as if they were accented characters in Latin-1, then what is received may appear plausible at the receiving end, just wrong.

1. 'That file is transmitted as text': what does this mean? How do I transmit a file as text as opposed to just sending the file contents with no change? What protocol am I using? Email attachment? HTTP upload? HTTP downloading a .tgz file from a website? FTP with 'text' mode?

2. 'The transmission converts the UTF-8 characters': why would it do this? What is this advanced text transmission protocol that is so confident about altering file contents?

3. 'Native encoding': what does this mean? What would the native encoding of my computer be, with one shell window having 'LANG=ru_RU' and another 'LANG=POSIX'? Does the concept of native encoding make any sense at all at an OS level? I'm aware of filenames in filesystems being expressed in standard encodings, but not the file contents themselves.

Just so you know what my mental model of this whole file transmission issue is:

1. Files in the modern computing world are virtually always transmitted without alteration of any bytes at all. Call this binary transmission if you like. I am aware that email protocols may encode to base64 etc., but this is of course to make sure every single byte is identical when it is unpacked at the end.

2. How a file is *displayed* will be application (not OS) dependent. The application may take into account environment variables, any metadata about the file, and user selections. How a file is *displayed* does not change how it is stored on disk.

3.
Utilities exist to interconvert between encodings. Modern text editors do not need these tools as they come with a reasonable range of character mapping tables to enable them to *display* the correct character if they are told the correct encoding. There is no such thing as the *correct* encoding for such an application, only a default encoding.

Therefore, I would suggest that we be very careful to make such a canonical UTF-8 cif self identifying, by including not only a BOM, but by adding some text in the range of #x128-#x254 to the magic number to help in detecting such unintended transmission conversions. In addition, I would suggest that, just as the first line of an XML document specifies its encoding in plain text, that we add the same information to our magic number.

I would suggest carefully reading the XML specification on this subject and that we try to follow the approach taken. It is well-supported by a great deal of existing software. If we follow a similar approach, we should avoid any offense in what is clearly a very touchy issue.

4.3.3 Character Encoding in Entities

Each external parsed entity in an XML document may use a different encoding for its characters. All XML processors MUST be able to read entities in both the UTF-8 and UTF-16 encodings. The terms "UTF-8" and "UTF-16" in this specification do not apply to related character encodings, including but not limited to UTF-16BE, UTF-16LE, or CESU-8.

Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000], section 16.8 of [Unicode] (the ZERO WIDTH NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not part of either the markup or the character data of the XML document. XML processors MUST be able to use this character to differentiate between UTF-8 and UTF-16 encoded documents.
If the replacement text of an external entity is to begin with the character U+FEFF, and no text declaration is present, then a Byte Order Mark MUST be present, whether the entity is encoded in UTF-8 or UTF-16.

Although an XML processor is required to read only entities in the UTF-8 and UTF-16 encodings, it is recognized that other encodings are used around the world, and it may be desired for XML processors to read entities that use them. In the absence of external character encoding information (such as MIME headers), parsed entities which are stored in an encoding other than UTF-8 or UTF-16 MUST begin with a text declaration (see 4.3.1 The Text Declaration) containing an encoding declaration:

Encoding Declaration

[80] EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'" )
[81] EncName ::= [A-Za-z] ([A-Za-z0-9._] | '-')* /* Encoding name contains only Latin characters */

In the document entity, the encoding declaration is part of the XML declaration. The EncName is the name of the encoding used.

In an encoding declaration, the values "UTF-8", "UTF-16", "ISO-10646-UCS-2", and "ISO-10646-UCS-4" SHOULD be used for the various encodings and transformations of Unicode / ISO/IEC 10646, the values "ISO-8859-1", "ISO-8859-2", ... "ISO-8859-n" (where n is the part number) SHOULD be used for the parts of ISO 8859, and the values "ISO-2022-JP", "Shift_JIS", and "EUC-JP" SHOULD be used for the various encoded forms of JIS X-0208-1997. It is RECOMMENDED that character encodings registered (as charsets) with the Internet Assigned Numbers Authority [IANA-CHARSETS], other than those just listed, be referred to using their registered names; other encodings SHOULD use names starting with an "x-" prefix.
XML processors SHOULD match character encoding names in a case-insensitive way and SHOULD either interpret an IANA-registered name as the encoding registered at IANA for that name or treat it as unknown (processors are, of course, not required to support all IANA-registered encodings).

In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration.

It is a fatal error for a TextDecl to occur other than at the beginning of an external entity.

It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding. Specifically, it is a fatal error if an entity encoded in UTF-8 contains any ill-formed code unit sequences, as defined in section 3.9 of Unicode [Unicode]. Unless an encoding is determined by a higher-level protocol, it is also a fatal error if an XML entity contains no encoding declaration and its content is not legal UTF-8 or UTF-16.

Examples of text declarations containing encoding declarations:

<?xml encoding='UTF-8'?>
<?xml encoding='EUC-JP'?>

=====================================================
Herbert J.
Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
yaya@dowling.edu
=====================================================

On Wed, 23 Jun 2010, Bollinger, John C wrote:

On Wednesday, June 23, 2010 5:33 AM, Brian McMahon wrote:

[...]

Expecting every CIF application to be robustly able to handle every conceivable - or even every reasonable - encoding is (what's the word?) "optimistic", and places a heavy burden on application developers. I thought you were an optimist? :-)

Indeed, I agree that such an expectation would be optimistic in the extreme, and I don't think anyone has been advocating such a requirement.

Consider instead the approach of defining the CIF standard as a text file and using UTF-8 for a "canonical" description of low-level representations. Supply a set of such canonical CIFs in the next-generation trip test suite. Require a "compliant" CIF application to handle the trip tests with the canonical encoding. Permit - indeed encourage - applications developers to accommodate other encodings to the extent they can easily do with their standard text-processing libraries/utilities/tools. Encourage or perhaps commission a "canonicalisation" suite for use in contexts where an application cannot natively handle a submitted encoding.

[...]

This isn't a radical new suggestion; it seems to me to encapsulate many of the points of common ground around which we're still negotiating our points of principle or philosophy, but I would hope it can help us to move forward.

That satisfactorily captures the key points I have been pursuing. With only a bit of tweaking, the "CIF Interchange Format" proposal I floated would serve this end nicely. Alternatively, the same end could be reached by couching the requirement in terms of a "canonical" encoding, more along the lines of Brian's text above:

1.
In "TERMINOLOGY", insert a new first paragraph:

====
Reference to characters means numeric code points in the Unicode code space. Where Unicode has assigned 'abstract characters' to specific code points, those code points may sometimes be referred to by the Unicode-assigned name or a colloquial equivalent. Otherwise, they are referred to according to Unicode convention, U+[[x]x]xxxx, where [[x]x]xxxx is the four- to six-digit hexadecimal representation of the code point value.
====

2. Change the heading "CHANGE 2 - NEW (ENCODING)" to "CHANGE 2 - NEW (CHARACTER SET)".

3. Replace the first paragraph in the CHANGE 2 section with:

====
CIF2 files are variable-length Unicode text files, but for historical reasons will have a maximum record length of 2048 characters. As described in detail below, CIF2 imposes restrictions on the characters allowed in data names, block codes, and save frame codes, and it disregards the Unicode-defined separating and delimiting functions of all but a few characters.
====

4. Change the format of the explicit included character set to use Unicode convention. (A few weeks ago I provided James a proposed draft update that does this.)

5. Delete all remaining appearances of the text "UTF-8" in that section and those following, without replacement (the definition of "character(s)" obviates these).

6. Add a new section at the end:

====
CHANGE 10 - NEW (ENCODING)

Many alternative encodings are available for recording and exchanging Unicode text (such as CIF2 data) via byte-oriented media. This specification does not forbid the use of any particular encoding for storing and exchanging CIF2 data, but UTF-8 is the canonical encoding for CIF2. All CIF2 readers conformant with this specification are prepared to accept CIF2 input encoded in UTF-8. They may in addition accept CIF2 input encoded via other schemes, but they are not required to do so.
CIF2 writers may produce output in any encoding, but they are strongly encouraged to use UTF-8 unless environment- or purpose-specific circumstances direct otherwise.

As used with CIF2, UTF-8 encoding includes an optional initial UTF-8 encoded byte-order mark (character U+FEFF). Such a code is accepted and ignored if present, but it is considered part of the encoding, not part of the encoded CIF2 data.

Reasoning: A canonical encoding is chosen to standardize one means of exchanging CIF data without data corruption or loss. UTF-8 in particular is chosen because of its widespread and growing acceptance and implementation, its coverage of the entire Unicode code space, and its congruence with 7-bit ASCII over the entire ASCII range.
====

Regards,

John

--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer: www.stjude.org/emaildisclaimer
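
The reader behaviour in the proposed CHANGE 10 text, and the failure mode Herbert describes (UTF-8 bytes read as Latin-1 giving plausible but wrong text), can be illustrated with a minimal Python sketch. The function name `decode_cif2` and the sample value are hypothetical and not part of any CIF specification or library; the sketch only assumes that a reader accepts UTF-8, ignores one optional leading U+FEFF, and rejects ill-formed byte sequences.

```python
import codecs


def decode_cif2(raw: bytes) -> str:
    """Decode CIF2 bytes as UTF-8, ignoring one optional leading BOM.

    Hypothetical helper for illustration only. Ill-formed UTF-8 raises
    UnicodeDecodeError instead of being silently accepted.
    """
    if raw.startswith(codecs.BOM_UTF8):
        # The BOM is treated as part of the encoding, not of the data.
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode("utf-8", errors="strict")


# A made-up text value containing an accented name.
original = "_publ_author_name 'Müller'"
encoded = original.encode("utf-8")

# With or without the BOM, the decoded text is identical.
assert decode_cif2(encoded) == original
assert decode_cif2(codecs.BOM_UTF8 + encoded) == original

# The same bytes read as Latin-1 decode without any error, but the
# two-byte UTF-8 sequence for U+00FC turns into two separate characters.
misread = encoded.decode("latin-1")
print(misread)
```

The reverse mistake is easier to catch: a Latin-1 file containing the single byte 0xFC is not well-formed UTF-8, so `decode_cif2` raises `UnicodeDecodeError` rather than returning wrong text.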

begin:vcard
fn:I.David Brown
n:Brown;I.David
org:McMaster University;Brockhouse Institute for Materials Research
adr:;;King St. W;Hamilton;Ontario;L8S 4M1;Canada
email;internet:idbrown@mcmaster.ca
title:Professor Emeritus
tel;work:+905 525 9140 x 24710
tel;fax:+905 521 2773
version:2.1
end:vcard

_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
