
Re: [ddlm-group] Unicode string equivalence in CIF2.0

To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] Unicode string equivalence in CIF2.0
From: yayahjb <yayahjb@gmail.com>
Date: Mon, 18 Mar 2013 18:29:44 -0400
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA544EFBBF3BFE@11.stjude.org>
References: <8F77913624F7524AACD2A92EAF3BFA544EFBBF3BFE@11.stjude.org>

This problem is not new, and it does not arise simply because of the
use of Unicode.  CIF1 has allowed arbitrary system-dependent text
encodings, e.g. EBCDIC, ASCII and CDC display code.  The comparison of
files there came down to comparison of the printable text.  I would
suggest continuing the same approach for CIF2: two CIFs are equivalent
if they present the same printable text, which makes all the
pre-composed and decomposed presentations of the same printed
information equivalent.  I cannot see any other way to have portable,
meaningful category keys.
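
A minimal sketch of that idea in Python, assuming "the same printable
text" is read as Unicode canonical equivalence (that reading is an
assumption; the message does not define the term).  The helper name
canonically_equivalent is hypothetical; unicodedata.normalize is part
of the Python standard library.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Return True if a and b are canonically equivalent Unicode strings.

    Both strings are brought to normalization form NFC and then compared
    code point by code point.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Pre-composed versus decomposed spelling of the same printed character.
precomposed = "\u00e9"    # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

# Two orderings of the same combining marks on one base character.
marks_a = "a\u0323\u0301"  # dot below, then acute
marks_b = "a\u0301\u0323"  # acute, then dot below

print(precomposed == decomposed)                        # raw comparison
print(canonically_equivalent(precomposed, decomposed))  # normalized comparison
print(canonically_equivalent(marks_a, marks_b))
```

The raw comparison treats the two spellings as different strings, while
the normalized comparison treats them as the same, which is the
behaviour a portable category key would need under this reading.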


On 3/18/13 3:54 PM, Bollinger, John C wrote:
>
> Dear DDLm group,
>
> As I work on a model C implementation of a full CIF2 API, I have run 
> into a couple of issues revolving around equivalence of Unicode 
> strings, as it applies to Unicode-aware CIF 2.0.  James suggested that 
> it might not be too late for CIF 2.0 to address these questions, 
> though he didn't advise me about where to raise them.  My apologies, 
> therefore, to anyone who would rather I had taken this directly to 
> COMCIFS instead of to this group.
>
> Basically, the two questions for which I seek answers are
>
> 1) Taking into account Unicode's separate accommodation of 
> pre-composed and decomposed (and partially-decomposed) characters, 
> including the possibility of different permutations of the same 
> combining marks being applied to the same base character, and 
> different scripts' and cultures' varying case conversion conventions, 
> under what circumstances are pairs of block codes, frame codes, or 
> data names "the same" for CIF's purposes?
>
> 2) Taking into account Unicode's separate accommodation of 
> pre-composed and decomposed (and partially-decomposed) characters, 
> including the possibility of different permutations of the same 
> combining marks being applied to the same base character, under what 
> circumstances are table indices "unique", as CIF 2 requires them to be?
>
> These are non-trivial because the simplest approach to Unicode string 
> comparison -- character-by-character matching -- leads to potential 
> problems for CIF (and everyone else).  Unicode defines pre-composed 
> and decomposed canonical forms, and directs that compliant text 
> processing systems handle canonically-equivalent Unicode text 
> equivalently.  That does not apply directly to CIF, but in addition to 
> setting expectations for Unicode users, it sets the stage for subtle 
> bugs if CIF takes a different view of equivalence.  For example, if I 
> pipe a CIF document through a general-purpose Unicode filter 
> (including a manual filter such as a text editor), that filter might 
> consider it perfectly acceptable to, say, convert everything to 
> Unicode normalization form NFC (or NFD, etc.).  If CIF does not 
> consider the resulting item names "the same" as the original, then 
> they might no longer correspond to definitions in the applicable 
> dictionary.  Furthermore, such an issue could be extremely difficult 
> to diagnose by eye, because Unicode insists that compliant text 
> processors display the two sets of names identically.
>
> In addition, where CIF specifies case-insensitive comparison, we need 
> to recognize that that is ambiguous outside the walled garden of 7-bit 
> ASCII.  Moreover, case mapping / folding is not orthogonal to 
> normalization, so we cannot even address this aspect separately.
>
> I suggest that CIF 2.0 adopt the position that table indices are 
> different if and only if they are not canonically equivalent, as 
> judged by their representations in Unicode normalization form NFC.
>
> I suggest that CIF 2.0 adopt the position that pairs of block codes, 
> frame codes, data names, or anything else that is "case insensitive" 
> are "the same" for CIF's purposes if the following procedure produces 
> the same sequence of characters for each:
>
> a) normalize the input to Unicode normalization form NFD
>
> b) apply the Unicode case folding algorithm (without Turkic dotless-i 
> special option) to the form NFD result
>
> c) normalize the case-folded output (which is not guaranteed to be 
> normalized any longer) to form NFC
>
> More detailed information on the Unicode issues involved, including 
> that particular approach to caseless matching, is available in 
> chapter 5 of the Unicode specification (section 5.18 in Unicode 6.2).
>
> Best Regards,
>
> John
>
> -- 
>
> John C. Bollinger, Ph.D.
>
> Computing and X-Ray Scientist
>
> Department of Structural Biology
>
> St. Jude Children's Research Hospital
>
> John.Bollinger@StJude.org <mailto:John.Bollinger@StJude.org>
>
> (901) 595-3166 [office]
>
> www.stjude.org <http://www.stjude.org/>
>
>
> ------------------------------------------------------------------------
> Email Disclaimer: www.stjude.org/emaildisclaimer
> Consultation Disclaimer: www.stjude.org/consultationdisclaimer
>
>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://mailman.iucr.org/mailman/listinfo/ddlm-group
>    
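
The caseless-matching procedure proposed in the quoted message (steps a
to c) could be sketched in Python as follows.  This is an illustration
under stated assumptions, not part of any CIF specification: the
function names cif_caseless_key and same_cif_name are hypothetical, and
Python's str.casefold() is assumed to be an acceptable stand-in for
"the Unicode case folding algorithm" (it applies full case folding and
does not apply the Turkic dotless-i mappings).

```python
import unicodedata


def cif_caseless_key(name: str) -> str:
    """Build a comparison key following steps (a) to (c) of the proposal."""
    # (a) normalize the input to form NFD
    decomposed = unicodedata.normalize("NFD", name)
    # (b) apply case folding (no Turkic special-casing in str.casefold)
    folded = decomposed.casefold()
    # (c) case folding can leave the text un-normalized, so normalize to NFC
    return unicodedata.normalize("NFC", folded)


def same_cif_name(a: str, b: str) -> bool:
    """Return True if two names produce the same caseless comparison key."""
    return cif_caseless_key(a) == cif_caseless_key(b)


# Upper-case pre-composed versus lower-case decomposed spelling.
print(same_cif_name("_Caf\u00c9", "_cafe\u0301"))
# Names that differ by more than case or composition stay distinct.
print(same_cif_name("_cafe", "_caf\u00e9"))
```

Comparing keys rather than raw strings means that a file passed through
a filter that renormalizes its text would still match the same
dictionary definitions.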

_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://mailman.iucr.org/mailman/listinfo/ddlm-group
