[ddlm-group] Unicode string equivalence in CIF2.0
- To: "ddlm-group@iucr.org" <ddlm-group@iucr.org>
- Subject: [ddlm-group] Unicode string equivalence in CIF2.0
- From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
- Date: Mon, 18 Mar 2013 14:54:38 -0500
- Accept-Language: en-US
Dear DDLm group,

As I work on a model C implementation of a full CIF2 API, I have run into a couple of issues revolving around equivalence of Unicode strings, as it applies to Unicode-aware CIF 2.0. James suggested that it might not be too late for CIF 2.0 to address these questions, though he didn’t advise me about where to raise them. My apologies, therefore, to anyone who would rather I had taken this directly to COMCIFS instead of to this group.

Basically, the two questions for which I seek answers are

1) Taking into account Unicode's separate accommodation of pre-composed and decomposed (and partially-decomposed) characters, including the possibility of different permutations of the same combining marks being applied to the same base character, and different scripts' and cultures' varying case conversion conventions, under what circumstances are pairs of block codes, frame codes, or data names "the same" for CIF's purposes?

2) Taking into account Unicode's separate accommodation of pre-composed and decomposed (and partially-decomposed) characters, including the possibility of different permutations of the same combining marks being applied to the same base character, under what circumstances are table indices "unique", as CIF 2 requires them to be?

These are non-trivial because the simplest approach to Unicode string comparison -- character-by-character matching -- leads to potential problems for CIF (and everyone else). Unicode defines pre-composed and decomposed canonical forms, and directs that compliant text processing systems handle canonically-equivalent Unicode text equivalently. That does not apply directly to CIF, but in addition to setting expectations for Unicode users, it sets the stage for subtle bugs if CIF takes a different view of equivalence. For example, if I pipe a CIF document through a general-purpose Unicode filter (including a manual filter such as a text editor), that filter might consider it perfectly acceptable to, say, convert everything to Unicode normalization form NFC (or NFD, etc.). If CIF does not consider the resulting item names "the same" as the original, then they might no longer correspond to definitions in the applicable dictionary. Furthermore, such an issue could be extremely difficult to diagnose by eye, because Unicode insists that compliant text processors display the two sets of names identically.

In addition, where CIF specifies case-insensitive comparison, we need to recognize that that is ambiguous outside the walled garden of 7-bit ASCII. Moreover, case mapping / folding is not orthogonal to normalization, so we cannot even address this aspect separately.

I suggest that CIF 2.0 adopt the position that table indices are different if and only if they are not canonically equivalent, as judged by their representations in Unicode normalization form NFC.

I suggest that CIF 2.0 adopt the position that pairs of block codes, frame codes, data names, or anything else that is "case insensitive" are "the same" for CIF's purposes if the following procedure produces the same sequence of characters
for each:

a) normalize the input to Unicode normalization form NFD
b) apply the Unicode case folding algorithm (without the Turkic dotless-i special option) to the form NFD result
c) normalize the case-folded output (which is no longer guaranteed to be normalized) to form NFC

More detailed information on the Unicode issues involved, including that particular approach to caseless matching, is available in chapter 5 of the Unicode specification (section 5.18 in Unicode 6.2).

Best Regards,

John

--
John C. Bollinger, Ph.D.
Computing and X-Ray Scientist
Department of Structural Biology
St. Jude Children's Research Hospital
(901) 595-3166 [office]

Email Disclaimer: www.stjude.org/emaildisclaimer
Consultation Disclaimer: www.stjude.org/consultationdisclaimer
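
For illustration only, the two rules suggested in the message above (NFC comparison for table indices, and NFD, then case folding, then NFC for caseless names) might be sketched as follows. This is a minimal sketch, not part of the original message and not the author's C implementation. It assumes Python's standard `unicodedata` module and `str.casefold()`, which applies Unicode full case folding with no Turkic special-casing. The function names `table_index_key`, `caseless_key` and `same_name` are hypothetical, and the results depend on the Unicode database version bundled with the Python interpreter in use.

```python
import unicodedata


def table_index_key(index: str) -> str:
    """Key for table-index uniqueness: the NFC form of the index.

    Two indices count as the same index exactly when their NFC forms
    match, i.e. when they are canonically equivalent.
    """
    return unicodedata.normalize("NFC", index)


def caseless_key(name: str) -> str:
    """Key for caseless comparison of block codes, frame codes, data names."""
    # a) normalize the input to NFD
    decomposed = unicodedata.normalize("NFD", name)
    # b) apply case folding (str.casefold has no Turkic dotless-i option)
    folded = decomposed.casefold()
    # c) the folded text may no longer be normalized, so normalize to NFC
    return unicodedata.normalize("NFC", folded)


def same_name(a: str, b: str) -> bool:
    """True if a and b are 'the same' under the procedure sketched above."""
    return caseless_key(a) == caseless_key(b)


if __name__ == "__main__":
    precomposed = "_caf\u00e9"      # e with acute as one code point
    decomposed = "_CAFE\u0301"      # upper-case E plus combining acute
    print(same_name(precomposed, decomposed))
    # Same combining marks applied to the same base in a different order
    print(table_index_key("a\u0323\u0301") == table_index_key("a\u0301\u0323"))
```

Under this sketch, a pre-composed lower-case name and its decomposed upper-case counterpart produce the same key, while a plain code-point comparison of the two strings would report them as different.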
_______________________________________________ ddlm-group mailing list ddlm-group@iucr.org http://mailman.iucr.org/mailman/listinfo/ddlm-group