Re: [ddlm-group] Unicode string equivalence in CIF2.0
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Unicode string equivalence in CIF2.0
- From: yayahjb <yayahjb@gmail.com>
- Date: Mon, 18 Mar 2013 18:29:44 -0400
- In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA544EFBBF3BFE@11.stjude.org>
- References: <8F77913624F7524AACD2A92EAF3BFA544EFBBF3BFE@11.stjude.org>
There is nothing new about this problem simply because of the use of Unicode. CIF1 has allowed arbitrary system-dependent text encodings, e.g. EBCDIC, ASCII, and CDC display code. The comparison of files there came down to comparison of the printable text. I would suggest continuing the same approach for CIF2 -- two CIFs are equivalent if they present the same printable text -- making all the pre-composed and decomposed presentations of the same printed information equivalent. I cannot see any other way to have portable meaningful category keys.

On 3/18/13 3:54 PM, Bollinger, John C wrote:
>
> Dear DDLm group,
>
> As I work on a model C implementation of a full CIF2 API, I have run
> into a couple of issues revolving around equivalence of Unicode
> strings, as it applies to Unicode-aware CIF 2.0. James suggested that
> it might not be too late for CIF 2.0 to address these questions,
> though he didn’t advise me about where to raise them. My apologies,
> therefore, to anyone who would rather I had taken this directly to
> COMCIFS instead of to this group.
>
> Basically, the two questions for which I seek answers are
>
> 1) Taking into account Unicode's separate accommodation of
> pre-composed and decomposed (and partially-decomposed) characters,
> including the possibility of different permutations of the same
> combining marks being applied to the same base character, and
> different scripts' and cultures' varying case conversion conventions,
> under what circumstances are pairs of block codes, frame codes, or
> data names "the same" for CIF's purposes?
>
> 2) Taking into account Unicode's separate accommodation of
> pre-composed and decomposed (and partially-decomposed) characters,
> including the possibility of different permutations of the same
> combining marks being applied to the same base character, under what
> circumstances are table indices "unique", as CIF 2 requires them to be?
>
> These are non-trivial because the simplest approach to Unicode string
> comparison -- character-by-character matching -- leads to potential
> problems for CIF (and everyone else). Unicode defines pre-composed
> and decomposed canonical forms, and directs that compliant text
> processing systems handle canonically-equivalent Unicode text
> equivalently. That does not apply directly to CIF, but in addition to
> setting expectations for Unicode users, it sets the stage for subtle
> bugs if CIF takes a different view of equivalence. For example, if I
> pipe a CIF document through a general-purpose Unicode filter
> (including a manual filter such as a text editor), that filter might
> consider it perfectly acceptable to, say, convert everything to
> Unicode normalization form NFC (or NFD, etc.). If CIF does not
> consider the resulting item names "the same" as the original, then
> they might no longer correspond to definitions in the applicable
> dictionary. Furthermore, such an issue could be extremely difficult
> to diagnose by eye, because Unicode insists that compliant text
> processors display the two sets of names identically.
>
> In addition, where CIF specifies case-insensitive comparison, we need
> to recognize that that is ambiguous outside the walled garden of 7-bit
> ASCII. Moreover, case mapping / folding is not orthogonal to
> normalization, so we cannot even address this aspect separately.
>
> I suggest that CIF 2.0 adopt the position that table indices are
> different if and only if they are not canonically equivalent, as
> judged by their representations in Unicode normalization form NFC.
>
> I suggest that CIF 2.0 adopt the position that pairs of block codes,
> frame codes, data names, or anything else that is "case insensitive"
> are "the same" for CIF's purposes if the following procedure produces
> the same sequence of characters for each:
>
> a) normalize the input to Unicode normalization form NFD
>
> b) apply the Unicode case folding algorithm (without Turkic dotless-i
> special option) to the form NFD result
>
> c) normalize the case-folded output (which is not guaranteed to be
> normalized any longer) to form NFC
>
> More detailed information on the Unicode issues involved, including
> that particular approach to caseless matching, is available in
> chapter 5 of the Unicode specification (section 5.18 in Unicode 6.2).
>
> Best Regards,
>
> John
>
> --
> John C. Bollinger, Ph.D.
> Computing and X-Ray Scientist
> Department of Structural Biology
> St. Jude Children's Research Hospital
> John.Bollinger@StJude.org
> (901) 595-3166 [office]
> www.stjude.org
>
> ------------------------------------------------------------------------
> Email Disclaimer: www.stjude.org/emaildisclaimer
> Consultation Disclaimer: www.stjude.org/consultationdisclaimer
>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://mailman.iucr.org/mailman/listinfo/ddlm-group
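
For readers who want to experiment with the two rules proposed in the quoted message, here is a minimal Python sketch. It is an illustration only and not part of any CIF 2.0 specification or of the C implementation mentioned above. The function names (table_index_key, caseless_key, same_name) are hypothetical. It assumes Python's str.casefold() is an acceptable stand-in for Unicode full case folding without the Turkic option, and results depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def table_index_key(index: str) -> str:
    """Hypothetical helper: two table indices count as duplicates
    exactly when their NFC forms match (canonical equivalence)."""
    return unicodedata.normalize("NFC", index)


def caseless_key(name: str) -> str:
    """Hypothetical helper following steps (a)-(c) of the proposal."""
    decomposed = unicodedata.normalize("NFD", name)   # (a) normalize to NFD
    folded = decomposed.casefold()                    # (b) full case folding, no Turkic special-casing
    return unicodedata.normalize("NFC", folded)       # (c) re-normalize to NFC


def same_name(a: str, b: str) -> bool:
    """True if two block codes, frame codes, or data names compare
    equal under the caseless matching sketched above."""
    return caseless_key(a) == caseless_key(b)


if __name__ == "__main__":
    precomposed = "_caf\u00e9"      # e-acute as a single code point
    decomposed = "_cafe\u0301"      # 'e' followed by combining acute accent
    print(precomposed == decomposed)                                     # plain code-point comparison
    print(table_index_key(precomposed) == table_index_key(decomposed))   # NFC comparison
    print(same_name("_Caf\u00c9", decomposed))                           # caseless comparison
```

Comparing the raw strings code point by code point treats the two spellings as different, while both helper-based comparisons treat them as the same, which is the behaviour the proposal asks for.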