
Re: [ddlm-group] Unicode string equivalence in CIF2.0. .

 

Do we need to vote on this?  If so, then what, exactly, should the motion be, and to which group should it be directed?

 

Thanks,

 

John

 

From: ddlm-group-bounces@iucr.org [mailto:ddlm-group-bounces@iucr.org] On Behalf Of James Hester
Sent: Tuesday, March 19, 2013 6:07 PM
To: Group finalising DDLm and associated dictionaries
Subject: Re: [ddlm-group] Unicode string equivalence in CIF2.0. .

 

I agree with John's proposal.  A quick search of the web suggests that regular expression libraries aspire to conform to the Unicode case matching algorithm, which means that we can expect less trouble implementing our specification if we follow suit.  There are no legacy issues to worry about either.

On Wed, Mar 20, 2013 at 1:40 AM, Bollinger, John C <John.Bollinger@stjude.org> wrote:


I am pleased that you support my proposal, Herbert, at least in a general sense, even if you don't think it's anything new.

I would like to emphasize, however, that the issues I raise are not the same as the issue of different text encodings.  The issue does not arise from whether a CIF is encoded in UTF-8, UTF-16, or some other Unicode encoding, or whether it is encoded in some national encoding or whatever, either on disk or in memory or over the wire.  Rather, it is an issue with the logical Unicode characters from which the CIF is constructed, regardless of the system's digital representation (encoding) of them: different sequences of Unicode characters can represent the same printable text.  That such sequences should be considered "the same" for the purposes of string comparison in CIF indeed seems appropriate, I agree, but not necessarily obvious -- especially to people who may have little familiarity with the details of Unicode.  To put it a different way, it's not about encoding, it's about the vastly expanded character repertoire of CIF2.
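A minimal sketch of that distinction, assuming Python and its standard `unicodedata` module (neither is mentioned in this thread; the helper name `canonically_equivalent` and the sample strings are illustrative only). It shows that the same printable text can be built from different code point sequences, and that the difference is independent of the byte encoding chosen:

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: True if a and b have the same NFD form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Two spellings of the same printable text, e with an acute accent.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# A code-point-by-code-point comparison treats them as different strings.
assert precomposed != decomposed

# The difference is present in every encoding, so it is not an encoding issue.
for encoding in ("utf-8", "utf-16-le"):
    assert precomposed.encode(encoding) != decomposed.encode(encoding)

# Canonical normalization maps both spellings to one sequence.
assert canonically_equivalent(precomposed, decomposed)

# Combining marks with different combining classes can also be permuted:
# dot above (U+0307) and dot below (U+0323) applied to the same base letter.
above_then_below = "q\u0307\u0323"
below_then_above = "q\u0323\u0307"
assert above_then_below != below_then_above
assert canonically_equivalent(above_then_below, below_then_above)
```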

Anyway, the more complex issue is caseless matching, as required for data names, frame codes, and block codes.  The adoption of Unicode does add new complication to this issue, because we now need to be prepared to perform caseless matching on character sequences containing diacriticals and other combining marks, ligatures, and certain other characters with odd rules.  We must furthermore, we seem to agree, do it in a way that produces the same results for different sequences of Unicode characters that represent the same printable text.  We must also do it consistently, regardless of the varying case conversion conventions that prevail in different languages and cultures, even though different case-conversion procedures can produce different caseless-matching results.  No accommodation for any of that was needed in CIF 1, because at the CIF level, the allowed characters were explicitly restricted to a subset of those representable in 7-bit ASCII (regardless of the actual encoding used to represent them).
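Some of these complications can be illustrated with a short sketch, assuming Python's built-in `str.lower()` and `str.casefold()` (the latter applies Unicode full case folding). The function names and sample strings are made up for illustration and do not come from any CIF software:

```python
import unicodedata


def naive_caseless_equal(a: str, b: str) -> bool:
    """ASCII-era habit: lower-case both sides and compare."""
    return a.lower() == b.lower()


def folded_equal(a: str, b: str) -> bool:
    """Compare after Unicode full case folding, with no normalization."""
    return a.casefold() == b.casefold()


# Sharp s (U+00DF) folds to "ss", so lower-casing alone misses the match.
assert not naive_caseless_equal("STRASSE", "stra\u00dfe")
assert folded_equal("STRASSE", "stra\u00dfe")

# The "fi" ligature (U+FB01) folds to the two letters "fi".
assert folded_equal("\ufb01le", "FILE")

# Case folding by itself does not reconcile precomposed and decomposed forms:
# U+00C9 (E with acute) versus "e" + COMBINING ACUTE ACCENT.
assert not folded_equal("\u00c9", "e\u0301")

# Folding can also produce text that is no longer in NFC:
# U+01F0 (j with caron) folds to "j" + COMBINING CARON.
folded = "\u01f0".casefold()
assert folded == "j\u030c"
assert unicodedata.normalize("NFC", folded) != folded
```

The last two checks are the reason case folding cannot be treated separately from normalization.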

To ensure that, for example, data names read from a CIF can be correctly matched to definitions in a dictionary, without regard to case, precomposition / decomposition / combining mark sequence, or encoding, CIF 2 should be explicit about what it means by "case insensitive".  The term is not inherently well-defined in a Unicode context, and the most straightforward way to complete its definition is to provide a reference procedure, such as the one I suggested (which I drew from Unicode's own discussion of the issue).  To be clear: I am not suggesting a requirement directly on programs to perform caseless matching a certain way; rather, I am suggesting clarifying the CIF 2.0 specifications so that it is possible to tell whether whatever approach a given program may take produces the correct results.
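For concreteness, the three steps proposed in the original message quoted further down (NFD, then case folding without the Turkic option, then NFC) might be sketched as follows. This assumes Python, whose `str.casefold()` has no locale-specific mode and so never applies the Turkic dotless-i mappings. The names `caseless_key`, `names_match`, and `lookup`, and the data names in the toy dictionary, are hypothetical:

```python
import unicodedata


def caseless_key(name: str) -> str:
    """Hypothetical helper: NFD, then case fold, then NFC."""
    decomposed = unicodedata.normalize("NFD", name)
    folded = decomposed.casefold()  # full case folding, no Turkic mappings
    return unicodedata.normalize("NFC", folded)


def names_match(a: str, b: str) -> bool:
    """True if two names are "the same" under the sketched procedure."""
    return caseless_key(a) == caseless_key(b)


# A toy dictionary keyed by the comparison form of each (made-up) data name.
definitions = {
    caseless_key(name): text
    for name, text in {
        "_caf\u00e9.label": "illustrative definition 1",
        "_stra\u00dfe.id": "illustrative definition 2",
    }.items()
}


def lookup(data_name: str):
    """Return the definition matching data_name, or None if there is none."""
    return definitions.get(caseless_key(data_name))


# Case, precomposition, and decomposition no longer affect the result.
assert names_match("_CAF\u00c9.Label", "_cafe\u0301.label")
assert lookup("_CAFE\u0301.LABEL") == "illustrative definition 1"
assert lookup("_STRASSE.ID") == "illustrative definition 2"
```

A program is free to compare names some other way; a function like this only serves as a yardstick for checking that its results agree.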


Best,

John



-----Original Message-----
From: yayahjb [mailto:yayahjb@gmail.com]
Sent: Monday, March 18, 2013 5:30 PM
To: Group finalising DDLm and associated dictionaries
Cc: Bollinger, John C
Subject: Re: [ddlm-group] Unicode string equivalence in CIF2.0

There is nothing new about this problem simply because of the use of Unicode.  CIF1 has allowed arbitrary system-dependent text encodings, e.g. EBCDIC and ASCII and CDC display code.
The comparison of files there came down to comparison of the printable text.  I would suggest continuing the same approach for CIF2 -- two CIFs are equivalent if they present the same printable text -- making all the pre-composed and decomposed presentations of the same printed information equivalent.  I cannot see any other way to have portable meaningful category keys.


On 3/18/13 3:54 PM, Bollinger, John C wrote:
>
> Dear DDLm group,
>
> As I work on a model C implementation of a full CIF2 API, I have run
> into a couple of issues revolving around equivalence of Unicode
> strings, as it applies to Unicode-aware CIF 2.0.  James suggested that
> it might not be too late for CIF 2.0 to address these questions,
> though he didn't advise me about where to raise them.  My apologies,
> therefore, to anyone who would rather I had taken this directly to
> COMCIFS instead of to this group.
>
> Basically, the two questions for which I seek answers are
>
> 1) Taking into account Unicode's separate accommodation of
> pre-composed and decomposed (and partially-decomposed) characters,
> including the possibility of different permutations of the same
> combining marks being applied to the same base character, and
> different scripts' and cultures' varying case conversion conventions,
> under what circumstances are pairs of block codes, frame codes, or
> data names "the same" for CIF's purposes?
>
> 2) Taking into account Unicode's separate accommodation of
> pre-composed and decomposed (and partially-decomposed) characters,
> including the possibility of different permutations of the same
> combining marks being applied to the same base character, under what
> circumstances are table indices "unique", as CIF 2 requires them to be?
>
> These are non-trivial because the simplest approach to Unicode string
> comparison -- character-by-character matching -- leads to potential
> problems for CIF (and everyone else).  Unicode defines pre-composed
> and decomposed canonical forms, and directs that compliant text
> processing systems handle canonically-equivalent Unicode text
> equivalently.  That does not apply directly to CIF, but in addition to
> setting expectations for Unicode users, it sets the stage for subtle
> bugs if CIF takes a different view of equivalence.  For example, if I
> pipe a CIF document through a general-purpose Unicode filter
> (including a manual filter such as a text editor), that filter might
> consider it perfectly acceptable to, say, convert everything to
> Unicode normalization form NFC (or NFD, etc.).  If CIF does not
> consider the resulting item names "the same" as the original, then
> they might no longer correspond to definitions in the applicable
> dictionary.  Furthermore, such an issue could be extremely difficult
> to diagnose by eye, because Unicode insists that compliant text
> processors display the two sets of names identically.
>
> In addition, where CIF specifies case-insensitive comparison, we need
> to recognize that that is ambiguous outside the walled garden of 7-bit
> ASCII.  Moreover, case mapping / folding is not orthogonal to
> normalization, so we cannot even address this aspect separately.
>
> I suggest that CIF 2.0 adopt the position that table indices are
> different if and only if they are not canonically equivalent, as
> judged by their representations in Unicode normalization form NFC.
>
> I suggest that CIF 2.0 adopt the position that pairs of block codes,
> frame codes, data names, or anything else that is "case insensitive"
> are "the same" for CIF's purposes if the following procedure produces
> the same sequence of characters for each:
>
> a) normalize the input to Unicode normalization form NFD
>
> b) apply the Unicode case folding algorithm (without Turkic dotless-i
> special option) to the form NFD result
>
> c) normalize the case-folded output (which is not guaranteed to be
> normalized any longer) to form NFC
>
> More detailed information on the Unicode issues involved, including
> that particular approach to caseless matching, is available in
> chapter 5 of the Unicode specification (section 5.18 in Unicode 6.2).
>
> Best Regards,
>
> John
>
> --
>
> John C. Bollinger, Ph.D.
>
> Computing and X-Ray Scientist
>
> Department of Structural Biology
>
> St. Jude Children's Research Hospital
>
> John.Bollinger@StJude.org <mailto:John.Bollinger@StJude.org>
>
> (901) 595-3166 [office]
>
> www.stjude.org <http://www.stjude.org/>
>
>
>
>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://mailman.iucr.org/mailman/listinfo/ddlm-group
>
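The table-index rule suggested in the quoted message (indices are different if and only if their NFC forms differ) could be checked along these lines. This is only a sketch assuming Python's `unicodedata` module; the function name `colliding_indices` and the sample indices are hypothetical, and no case folding is applied because the suggestion for table indices mentions canonical equivalence only:

```python
import unicodedata
from collections import defaultdict


def colliding_indices(indices):
    """Group table indices that are canonically equivalent (same NFC form).

    Returns a list of groups; each group holds two or more indices, in input
    order, that would violate uniqueness under the sketched rule.
    """
    groups = defaultdict(list)
    for index in indices:
        groups[unicodedata.normalize("NFC", index)].append(index)
    return [group for group in groups.values() if len(group) > 1]


# The first two indices differ only in precomposed vs. decomposed form,
# so they collide; the capitalized one stays distinct.
sample = ["caf\u00e9", "cafe\u0301", "Caf\u00e9"]
assert colliding_indices(sample) == [["caf\u00e9", "cafe\u0301"]]
```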







--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

