[ddlm-group] Unicode string equivalence in CIF2.0

Dear DDLm group,

 

As I work on a model C implementation of a full CIF2 API, I have run into a couple of issues revolving around equivalence of Unicode strings, as it applies to Unicode-aware CIF 2.0.  James suggested that it might not be too late for CIF 2.0 to address these questions, though he didn’t advise me about where to raise them.  My apologies, therefore, to anyone who would rather I had taken this directly to COMCIFS instead of to this group.

 

Basically, the two questions for which I seek answers are

 

1) Taking into account Unicode's separate accommodation of pre-composed and decomposed (and partially-decomposed) characters, including the possibility of different permutations of the same combining marks being applied to the same base character, and different scripts' and cultures' varying case conversion conventions, under what circumstances are pairs of block codes, frame codes, or data names "the same" for CIF's purposes?

 

2) Taking into account Unicode's separate accommodation of pre-composed and decomposed (and partially-decomposed) characters, including the possibility of different permutations of the same combining marks being applied to the same base character, under what circumstances are table indices "unique", as CIF 2 requires them to be?

 

 

These are non-trivial because the simplest approach to Unicode string comparison -- character-by-character matching -- leads to potential problems for CIF (and everyone else).  Unicode defines pre-composed and decomposed canonical forms, and directs that compliant text processing systems handle canonically-equivalent Unicode text equivalently.  That does not apply directly to CIF, but in addition to setting expectations for Unicode users, it sets the stage for subtle bugs if CIF takes a different view of equivalence.  For example, if I pipe a CIF document through a general-purpose Unicode filter (including a manual filter such as a text editor), that filter might consider it perfectly acceptable to, say, convert everything to Unicode normalization form NFC (or NFD, etc.).  If CIF does not consider the resulting item names "the same" as the original, then they might no longer correspond to definitions in the applicable dictionary.  Furthermore, such an issue could be extremely difficult to diagnose by eye, because Unicode insists that compliant text processors display the two sets of names identically.
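To make the problem concrete, here is a minimal Python sketch using the standard `unicodedata` module. The data name `_café` is hypothetical and chosen only for illustration, and the helper `canonically_equivalent` is my own name, not part of any CIF API. It shows that code-point-by-code-point comparison distinguishes strings that Unicode treats as canonically equivalent, while comparing the NFC forms does not.

```python
import unicodedata

# Hypothetical data name written two ways (illustration only).
composed = "_caf\u00e9"      # 'é' as one pre-composed code point
decomposed = "_cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# The same two combining marks applied to the same base in different orders.
marks_order_1 = "a\u0323\u0301"  # dot below, then acute
marks_order_2 = "a\u0301\u0323"  # acute, then dot below


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings by their NFC forms rather than raw code points."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Raw comparison sees different code-point sequences.
naive_match = composed == decomposed
# Comparison after normalization treats them as the same name.
normalized_match = canonically_equivalent(composed, decomposed)
```

Under these assumptions, `naive_match` is false and `normalized_match` is true, which is exactly the mismatch a normalizing text filter could introduce between a data file and its dictionary.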

 

In addition, where CIF specifies case-insensitive comparison, we need to recognize that such comparison is ambiguous outside the walled garden of 7-bit ASCII.  Moreover, case mapping / folding is not orthogonal to normalization, so we cannot even address this aspect separately.
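A small Python sketch may help illustrate both points. It relies on `str.casefold()`, which applies Unicode full case folding without the Turkic mappings, and on `unicodedata.normalize`. The strings are illustrative only and are not taken from any CIF dictionary.

```python
import unicodedata

# Lowercasing is not enough outside ASCII: 'ß' has no single-character
# uppercase partner in common use, but case folding maps it to "ss".
lower_match = "stra\u00dfe".lower() == "STRASSE".lower()
fold_match = "stra\u00dfe".casefold() == "STRASSE".casefold()

# Case folding alone does not remove normalization differences.
angstrom_sign = "\u212b"   # ANGSTROM SIGN
a_with_ring = "A\u030a"    # 'A' followed by COMBINING RING ABOVE

folded_sign = angstrom_sign.casefold()   # a single pre-composed 'å'
folded_pair = a_with_ring.casefold()     # 'a' followed by the combining ring

fold_only_match = folded_sign == folded_pair
fold_then_nfc_match = (
    unicodedata.normalize("NFC", folded_sign)
    == unicodedata.normalize("NFC", folded_pair)
)
```

Here `lower_match` and `fold_only_match` are false, while `fold_match` and `fold_then_nfc_match` are true: the two inputs in the second example are canonically equivalent, yet folding them without normalizing leaves different code-point sequences.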

 

 

I suggest that CIF 2.0 adopt the position that table indices are different if and only if they are not canonically equivalent, as judged by their representations in Unicode normalization form NFC.
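As a sketch of how a parser might enforce that rule, the following Python function reports table indices that collide once normalized to NFC. The function name `find_equivalent_indices` and the sample indices are assumptions made for illustration, not part of any existing CIF API.

```python
import unicodedata


def find_equivalent_indices(indices):
    """Return pairs (first_seen, duplicate) of indices whose NFC forms match."""
    seen = {}      # NFC form -> first index spelled that way
    clashes = []
    for index in indices:
        key = unicodedata.normalize("NFC", index)
        if key in seen:
            clashes.append((seen[key], index))
        else:
            seen[key] = index
    return clashes


# 'å' pre-composed versus 'a' + COMBINING RING ABOVE: the same index under
# the proposed rule. 'Å' differs, because table indices are compared
# without case folding here.
clashes = find_equivalent_indices(["\u00e5", "a\u030a", "\u00c5", "b"])
```

With this input, `clashes` holds a single pair, the pre-composed and decomposed spellings of 'å'.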

 

I suggest that CIF 2.0 adopt the position that pairs of block codes, frame codes, data names, or anything else that is "case insensitive" are "the same" for CIF's purposes if the following procedure produces the same sequence of characters for each:

 

a) normalize the input to Unicode normalization form NFD

b) apply the Unicode case folding algorithm (without the special Turkic dotless-i option) to the NFD result

c) normalize the case-folded output (which is not guaranteed to be normalized any longer) to form NFC
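Steps a) through c) can be sketched in Python as follows. This is only an illustration under stated assumptions: `str.casefold()` stands in for the Unicode case folding algorithm (it performs full case folding and does not apply the Turkic mappings), the function names `cif_caseless_key` and `same_name` are hypothetical, and the data names are invented for the example.

```python
import unicodedata


def cif_caseless_key(name: str) -> str:
    """Map a name to the form proposed for caseless comparison."""
    decomposed = unicodedata.normalize("NFD", name)   # step a
    folded = decomposed.casefold()                    # step b
    return unicodedata.normalize("NFC", folded)       # step c


def same_name(a: str, b: str) -> bool:
    """Two names are 'the same' if their comparison keys are identical."""
    return cif_caseless_key(a) == cif_caseless_key(b)


# Hypothetical names differing in case and in composition of 'Å'/'å'.
match_ring = same_name("_Length_\u212b", "_length_a\u030a")
# Full case folding maps 'ß' to "ss".
match_sharp_s = same_name("_STRASSE", "_stra\u00dfe")
# Genuinely different names stay different.
match_other = same_name("_cell_a", "_cell_b")
```

Under these assumptions `match_ring` and `match_sharp_s` are true and `match_other` is false. An implementation could store `cif_caseless_key(name)` as the lookup key while retaining the original spelling for output.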

 

 

More detailed information on the Unicode issues involved, including that particular approach to caseless matching, is available in chapter 5 of the Unicode specification (section 5.18 in Unicode 6.2).

 

 

Best Regards,

 

John

 

--

John C. Bollinger, Ph.D.

Computing and X-Ray Scientist

Department of Structural Biology

St. Jude Children's Research Hospital

John.Bollinger@StJude.org

(901) 595-3166 [office]

www.stjude.org

 

 


