Re: [ddlm-group] Unicode string equivalence in CIF2.0

  • To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
  • Subject: Re: [ddlm-group] Unicode string equivalence in CIF2.0
  • From: yayahjb <yayahjb@gmail.com>
  • Date: Mon, 18 Mar 2013 18:29:44 -0400
  • In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA544EFBBF3BFE@11.stjude.org>
  • References: <8F77913624F7524AACD2A92EAF3BFA544EFBBF3BFE@11.stjude.org>

The use of Unicode does not make this a new problem.  CIF1 has allowed
arbitrary system-dependent text encodings, e.g. EBCDIC, ASCII and CDC
display code.  The comparison of files there came down to comparison of
the printable text.  I would suggest continuing the same approach for
CIF2: two CIFs are equivalent if they present the same printable text,
which makes all the pre-composed and decomposed presentations of the
same printed information equivalent.  I cannot see any other way to
have portable, meaningful category keys.


On 3/18/13 3:54 PM, Bollinger, John C wrote:
>
> Dear DDLm group,
>
> As I work on a model C implementation of a full CIF2 API, I have run 
> into a couple of issues revolving around equivalence of Unicode 
> strings, as it applies to Unicode-aware CIF 2.0.  James suggested that 
> it might not be too late for CIF 2.0 to address these questions, 
> though he didn't advise me about where to raise them.  My apologies, 
> therefore, to anyone who would rather I had taken this directly to 
> COMCIFS instead of to this group.
>
> Basically, the two questions for which I seek answers are
>
> 1) Taking into account Unicode's separate accommodation of 
> pre-composed and decomposed (and partially-decomposed) characters, 
> including the possibility of different permutations of the same 
> combining marks being applied to the same base character, and 
> different scripts' and cultures' varying case conversion conventions, 
> under what circumstances are pairs of block codes, frame codes, or 
> data names "the same" for CIF's purposes?
>
> 2) Taking into account Unicode's separate accommodation of 
> pre-composed and decomposed (and partially-decomposed) characters, 
> including the possibility of different permutations of the same 
> combining marks being applied to the same base character, under what 
> circumstances are table indices "unique", as CIF 2 requires them to be?
>
> These are non-trivial because the simplest approach to Unicode string 
> comparison -- character-by-character matching -- leads to potential 
> problems for CIF (and everyone else).  Unicode defines pre-composed 
> and decomposed canonical forms, and directs that compliant text 
> processing systems handle canonically-equivalent Unicode text 
> equivalently.  That does not apply directly to CIF, but in addition to 
> setting expectations for Unicode users, it sets the stage for subtle 
> bugs if CIF takes a different view of equivalence.  For example, if I 
> pipe a CIF document through a general-purpose Unicode filter 
> (including a manual filter such as a text editor), that filter might 
> consider it perfectly acceptable to, say, convert everything to 
> Unicode normalization form NFC (or NFD, etc.).  If CIF does not 
> consider the resulting item names "the same" as the original, then 
> they might no longer correspond to definitions in the applicable 
> dictionary.  Furthermore, such an issue could be extremely difficult 
> to diagnose by eye, because Unicode insists that compliant text 
> processors display the two sets of names identically.
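
To make the failure mode described above concrete, here is a minimal
Python sketch.  Python is used only because its standard library ships
the unicodedata module; the data name below is hypothetical and is not
taken from any dictionary.  It shows one visible name comparing unequal
code point by code point, and equal once both sides are normalized.

```python
import unicodedata

# Hypothetical data name containing e-acute, spelled two ways.
precomposed = "_caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "_cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Code-point-by-code-point comparison treats these as different names.
naive_equal = precomposed == decomposed

# Normalizing both sides to one form (NFC here) makes them compare equal.
normalized_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)

print(naive_equal, normalized_equal)
```

A filter that rewrites a file into NFD would turn the first spelling
into the second, which is exactly the silent change discussed above.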
>
> In addition, where CIF specifies case-insensitive comparison, we need 
> to recognize that that is ambiguous outside the walled garden of 7-bit 
> ASCII.  Moreover, case mapping / folding is not orthogonal to 
> normalization, so we cannot even address this aspect separately.
>
> I suggest that CIF 2.0 adopt the position that table indices are 
> different if and only if they are not canonically equivalent, as 
> judged by their representations in Unicode normalization form NFC.
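
A rough sketch of how a parser might enforce the uniqueness rule
suggested above, assuming table indices are already available as
Python strings.  The function name nfc_key_collisions and the sample
keys are illustrative assumptions, not part of any CIF API.

```python
import unicodedata

def nfc_key_collisions(keys):
    """Group keys by their NFC form and return only the groups that
    contain more than one spelling (hypothetical helper)."""
    groups = {}
    for key in keys:
        groups.setdefault(unicodedata.normalize("NFC", key), []).append(key)
    return {nfc: spellings for nfc, spellings in groups.items()
            if len(spellings) > 1}

# Illustrative keys only.
sample_keys = [
    "\u212b",          # ANGSTROM SIGN
    "\u00c5",          # LATIN CAPITAL LETTER A WITH RING ABOVE
    "A\u030a",         # 'A' + COMBINING RING ABOVE
    "q\u0307\u0323",   # 'q' + dot above + dot below
    "q\u0323\u0307",   # 'q' + dot below + dot above
    "x",
]
collisions = nfc_key_collisions(sample_keys)
```

With these keys, the three spellings of the Angstrom/A-ring character
fall into one group and the two orderings of the combining marks on
"q" fall into another, so a table containing them would be rejected
under the proposed rule.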
>
> I suggest that CIF 2.0 adopt the position that pairs of block codes, 
> frame codes, data names, or anything else that is "case insensitive" 
> are "the same" for CIF's purposes if the following procedure produces 
> the same sequence of characters for each:
>
> a) normalize the input to Unicode normalization form NFD
>
> b) apply the Unicode case folding algorithm (without Turkic dotless-i 
> special option) to the form NFD result
>
> c) normalize the case-folded output (which is not guaranteed to be 
> normalized any longer) to form NFC
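
Steps a) to c) above can be sketched in a few lines of Python.  This
is only an illustration under stated assumptions: the function names
are hypothetical, and str.casefold() is used as the case folding step
(it applies full Unicode case folding with no Turkic special-casing,
using whatever Unicode data ships with the interpreter, which need not
be version 6.2).

```python
import unicodedata

def cif_caseless_key(name):
    """Return a comparison key following steps a) to c) (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", name)   # a) normalize to NFD
    folded = decomposed.casefold()                    # b) case fold, no Turkic option
    return unicodedata.normalize("NFC", folded)       # c) renormalize to NFC

def same_name(a, b):
    """True if two names are 'the same' under the procedure above."""
    return cif_caseless_key(a) == cif_caseless_key(b)

# Illustrative names only; neither comes from a real dictionary.
print(same_name("_Caf\u00c9", "_cafe\u0301"))
print(same_name("_abc", "_abd"))
```

Keys produced this way could also serve as dictionary or hash-table
keys, so the normalization cost is paid once per name rather than once
per comparison.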
>
> More detailed information on the Unicode issues involved, including 
> that particular approach to caseless matching, is available in 
> chapter 5 of the Unicode specification (section 5.18 in Unicode 6.2).
>
> Best Regards,
>
> John
>
> -- 
>
> John C. Bollinger, Ph.D.
>
> Computing and X-Ray Scientist
>
> Department of Structural Biology
>
> St. Jude Children's Research Hospital
>
> John.Bollinger@StJude.org <mailto:John.Bollinger@StJude.org>
>
> (901) 595-3166 [office]
>
> www.stjude.org <http://www.stjude.org/>
>
>
> ------------------------------------------------------------------------
> Email Disclaimer: www.stjude.org/emaildisclaimer
> Consultation Disclaimer: www.stjude.org/consultationdisclaimer
>
>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://mailman.iucr.org/mailman/listinfo/ddlm-group
>    


