
Re: [ddlm-group] Case sensitivity

Dear All,

First, my apologies to David for sticking to CIF2 syntax issues as he steers the discussion toward other DDLm questions.  CIF2 syntax seems settled enough to support the discussion he wants to have, but there remain a few syntax issues that need to be settled or at least clarified.  It would be better to handle those *before* CIF2 is released.  I hope that it will not distract too much from the discussion of other DDLm issues if they share the floor with some questions of syntax detail.

Now,

On 2/08/10 11:53:07 AM, "Nick Spadaccini" <nick@csse.uwa.edu.au> wrote:

[...]

> at an application level as one
>stored the data name it would dataname.toLower() first so that we were
>consistent.
>
>That is a solution and one we use in our system. Since it is case
>insensitive we choose a particular case to be consistent internally. You
>choose toUpper if you want it doesn't matter, so long as you trap the fact
>that  _atom_site_frac_x and _Atom_Site_Frac_x are the same.

[...]

This is well and good, but it glosses over the complexities of Unicode case mapping.  As Joe originally pointed out, case mapping is *locale-sensitive*.  Moreover, Unicode case mappings are not all 1:1, so it makes a difference whether you convert to lowercase or uppercase for comparisons.  For any who are interested, there is a Unicode standard annex (TR-21) that discusses these matters in some detail:
http://unicode.org/reports/tr21/tr21-5.html.

>On 12/01/10 1:11 AM, "Joe Krahn" <krahn@niehs.nih.gov> wrote:
>
>> If the consensus is lower-case data names, why not make this part of the
>> CIF2 standard?

[...]

I don't think it is necessary or useful to require CIF data names to be lowercase, but I do think it would be prudent to define more precisely what "case-insensitive" means for CIF2.  It would be shamefully optimistic to assume that every string library implementation for every relevant language behaves identically for all inputs.  (For example, I was once painfully bitten by differing implementations of String.trim() in Java and C#.)  There *will* be uncertainty about the well-formedness of certain potential CIFs if CIF2 is not precise in this regard, and various processors *will* produce different results when fed such CIFs unless they all implement equivalent case folding semantics.

Apparently, Turkish dotted and dotless 'i' provide a canonical example of the difficulties here.  The Unicode uppercase mapping for U+0131 is 'I', and of course the Unicode uppercase mapping for 'i' is also 'I'.  The Unicode lowercase mapping for U+0130 is 'i'.  Consider these data tags:

_I
_i
_<U+0130>  # U+0130 is a Latin capital I with a dot above, used in Turkish
_<U+0131>  # U+0131 is a Latin lowercase i without a dot, also used in Turkish

Which of them should be considered equivalent?  If you compare the Unicode uppercase conversions then all of them are the same except _<U+0130>.  If you compare the Unicode lowercase conversions then all are the same except _<U+0131>.  If, however, your string library implements case mapping only for ASCII or maybe Latin-1, then it will tell you that only _i and _I are equivalent.  On the other hand, if you use the conventions of the Turkish locale then uppercasing and lowercasing provide consistent results, but they are different from what you get using Unicode convention (case folding according to Turkish yields the equivalent pairs (I, U+0131) and (U+0130, i)).
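
For anyone who wants to try the comparison, here is a minimal Python sketch that groups the four tags under three different comparison keys.  The helper names (ascii_lower, equivalence_classes) are made up for illustration and come from no CIF library.  One caveat: Python's str.lower() applies Unicode's full, possibly multi-character, case mappings, so it turns U+0130 into 'i' followed by U+0307 (combining dot above) rather than a plain 'i'.  Its lowercase grouping therefore differs from the single-character mapping described above, which is itself an instance of libraries disagreeing.

```python
from collections import defaultdict

# The four data tags from the example above.
TAGS = ["_I", "_i", "_\u0130", "_\u0131"]


def ascii_lower(s):
    """Lowercase A-Z only, leaving every other character untouched."""
    return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)


def equivalence_classes(names, key):
    """Group names whose keys compare equal, preserving first-seen order."""
    groups = defaultdict(list)
    for name in names:
        groups[key(name)].append(name)
    return list(groups.values())


# Three candidate definitions of "case-insensitive" for the same tags.
by_upper = equivalence_classes(TAGS, str.upper)
by_lower = equivalence_classes(TAGS, str.lower)
by_ascii = equivalence_classes(TAGS, ascii_lower)

if __name__ == "__main__":
    for label, classes in (("upper", by_upper),
                           ("lower", by_lower),
                           ("ascii", by_ascii)):
        # ascii() escapes the non-ASCII code points so they stay visible.
        print(label, [[ascii(n) for n in c] for c in classes])
```

With this interpreter's mappings, uppercasing merges three of the tags into one class, while both lowercasing and the ASCII-only key keep U+0130 and U+0131 apart from _i and _I.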

It gets worse when you consider that some pre-composed characters have case mappings to decomposed character sequences.  This provides another way in which lowercase vs. uppercase comparison can yield different results about name equivalence. Section 1.4 of TR-21 discusses this issue, albeit more with an eye to the possibility of combining Unicode normalization with case folding.
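
A small Python sketch of that effect, using U+01F0 (Latin small letter j with caron) as the pre-composed character.  This character has no single-code-point uppercase form, so its uppercase mapping is the sequence 'J' plus U+030C.  The tag names are invented for illustration, and the result assumes a string library that applies Unicode's full case mappings, as Python 3 does.

```python
# One visible character, two encodings of the same (hypothetical) data tag.
precomposed = "_\u01f0"   # U+01F0, j with caron, as a single code point
decomposed = "_j\u030c"   # 'j' followed by U+030C, combining caron

# Uppercasing maps U+01F0 to the two-code-point sequence 'J' + U+030C,
# so both spellings end up as the same string.
same_when_uppercased = precomposed.upper() == decomposed.upper()

# Lowercasing leaves both spellings as they are, so they still differ.
same_when_lowercased = precomposed.lower() == decomposed.lower()

if __name__ == "__main__":
    print("equal after upper():", same_when_uppercased)
    print("equal after lower():", same_when_lowercased)
```

The two spellings compare equal after uppercasing but not after lowercasing, so the choice of direction alone changes the answer about name equivalence.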

Indeed, normalization should not be ignored in this discussion.  Mainly because of the existence of Unicode combining characters in addition to many pre-composed characters, there are multiple, "canonically equivalent", representations of many Unicode strings.  A Unicode-aware presentation system will often render these equivalents identically, even though they differ at the string level (by use of pre-composed vs. decomposed characters, for example, or by differing order of combining characters).  Unicode defines several normalization forms on which CIF could rely for ignoring the distinctions among these canonical equivalents, if that were desirable.  Unicode's recommendation (TR-21, sections 1.4 and 2.5) is to use normalization in conjunction with case folding for case-insensitive matching.
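
As a sketch of what "normalization in conjunction with case folding" could look like in practice, the Python below normalizes to NFD, applies the locale-independent default case folding, and normalizes again, since folding can produce strings that are no longer normalized.  This is one reading of Unicode's canonical caseless matching, not anything CIF2 currently specifies, and the function names caseless_key and same_identifier are hypothetical.

```python
import unicodedata


def caseless_key(identifier):
    """Return a comparison key: NFD, then default case folding, then NFD."""
    decomposed = unicodedata.normalize("NFD", identifier)
    # str.casefold() applies Unicode default (locale-independent) folding.
    return unicodedata.normalize("NFD", decomposed.casefold())


def same_identifier(a, b):
    """True if two identifiers match under the key defined above."""
    return caseless_key(a) == caseless_key(b)


if __name__ == "__main__":
    # Plain ASCII case differences.
    print(same_identifier("_atom_site_frac_x", "_Atom_Site_Frac_x"))
    # Pre-composed capital E-acute vs. 'e' plus combining acute accent.
    print(same_identifier("_\u00c9", "_e\u0301"))
    # Default folding does not merge dotless i with 'i'.
    print(same_identifier("_i", "_\u0131"))
```

Because default case folding is locale-independent, this key treats _i and _<U+0131> as distinct names; a specification that wanted the Turkish pairings would have to say so explicitly.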


TO SUMMARIZE, the CIF2 specification needs to define the details of what it means for data names (and block and frame codes) to be "case insensitive", in light of the added complexities of case mapping in a general Unicode context.  My recommendation would be to use Unicode's recommended procedure for caseless matching as the basis for judging whether CIF identifiers are equivalent, but this is not the only viable alternative.  It may also be worth expanding the concept of case insensitivity to encompass Unicode canonical (or compatibility) equivalence as well.  My personal inclination would be to do so, but that is certainly debatable, and mine is not a strong position on that.


Best Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital



