Discussion List Archives


Re: [ddlm-group] Case sensitivity

Dear All,

I believe that Nick and Herbert's preference for choosing a particular tolower() implementation disregards the complexities of the situation as outlined in the document that John references, particularly the locale-dependence of the Turkish/Azeri mappings for the letter 'I' and the need for normalisation.
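
To make the difficulty concrete, here is a minimal Python sketch (the helper names same_by_upper and same_by_lower are mine, purely for illustration) showing that the choice between uppercasing and lowercasing changes the answer for the Turkish dotless 'i' (U+0131). It relies only on the built-in str.upper() and str.lower(), which follow Unicode's default, locale-independent mappings; other string libraries may well behave differently, which is rather the point.

```python
def same_by_upper(a: str, b: str) -> bool:
    """Compare two names after uppercasing both."""
    return a.upper() == b.upper()


def same_by_lower(a: str, b: str) -> bool:
    """Compare two names after lowercasing both."""
    return a.lower() == b.lower()


dotless = "_\u0131"  # data name using LATIN SMALL LETTER DOTLESS I
plain = "_i"

# Uppercasing maps both 'i' and dotless 'i' to 'I', so the names match.
print(same_by_upper(dotless, plain))

# Lowercasing leaves dotless 'i' unchanged, so the names do not match.
print(same_by_lower(dotless, plain))
```

So two programs that each implement "case-insensitive" comparison in good faith can disagree about whether these two data names collide.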

My reading of TR21 indicates that the most correct way to proceed would be to specify that two datanames are equivalent if their normalised, case-folded equivalents are identical, as John B suggests.  In addition to TR21, RFC3454 gives a solid specification as to how comparison should be done (http://tools.ietf.org/html/rfc3454.html) and has been implemented in the Python 'stringprep' module.

The alternative approach of abandoning case-insensitivity is not as big a win as it may appear, because Unicode also introduces the issue of normalisation.  The need for character string normalisation arises because a number of Unicode character sequences may be used to represent the same displayed character (e.g. multiple accents added to a base character, and in addition the fully or partially accented character may have its own code point).  So even if we were to abandon case-insensitivity, we would still have to address normalisation.
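
As a small illustration of the normalisation issue (a sketch only; the data names here are invented for the example), Python's standard unicodedata module shows two spellings of the same displayed name comparing unequal until they are normalised:

```python
import unicodedata

# The same displayed name spelled two ways (hypothetical data names).
precomposed = "_caf\u00e9"   # 'e with acute' as a single code point
decomposed = "_cafe\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# A raw code-point comparison treats them as different names.
print(precomposed == decomposed)

# After normalisation to NFC (composed form) they are identical.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", decomposed)
print(nfc_a == nfc_b)

# Combining marks typed in a different order are also reconciled:
# dot below (U+0323) and acute (U+0301) attached to 'a'.
one_order = "a\u0323\u0301"
other_order = "a\u0301\u0323"
print(unicodedata.normalize("NFC", one_order)
      == unicodedata.normalize("NFC", other_order))
```

None of this involves case at all, which is why dropping case-insensitivity does not make the problem go away.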

So I conclude that the best way forward is to specify that datanames are equivalent if their normalised, case-folded versions are identical as per RFC3454.  In the ASCII domain, this is equivalent to using tolower(), so I don't believe we are imposing a huge immediate burden on programmers.  A programmer should check for non-ASCII characters in datanames, and either abandon processing (if they rely on tolower() and dataname comparisons are important in their context) or do a full check as per RFC3454.
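
A rough sketch of what such a check might look like in Python follows. Please treat it as one possible reading rather than a specification: RFC3454 is a framework, and CIF2 would still have to say which of its tables apply. Here I have assumed the "map to nothing" table B.1, the case-folding table B.2 and NFKC normalisation. The function names fold_dataname and same_dataname are my own inventions; the stringprep and unicodedata calls are from the Python standard library.

```python
import stringprep
import unicodedata


def fold_dataname(name: str) -> str:
    """Return a comparison key for a data name (illustrative, not normative)."""
    if name.isascii():
        # Pure ASCII: plain lowercasing, matching existing CIF1 behaviour.
        return name.lower()
    # Drop characters that RFC 3454 table B.1 maps to nothing, then
    # apply the table B.2 case folding one character at a time.
    mapped = "".join(
        stringprep.map_table_b2(ch)
        for ch in name
        if not stringprep.in_table_b1(ch)
    )
    # Normalise so that canonically equivalent sequences become identical.
    return unicodedata.normalize("NFKC", mapped)


def same_dataname(a: str, b: str) -> bool:
    """Two data names are treated as equivalent if their folded keys match."""
    return fold_dataname(a) == fold_dataname(b)


# ASCII names take the cheap path.
print(same_dataname("_atom_site_frac_x", "_Atom_Site_Frac_x"))

# Non-ASCII names get the full treatment: upper-case precomposed
# 'E with acute' versus lower-case 'e' plus a combining accent.
print(same_dataname("_caf\u00c9", "_cafe\u0301"))
```

A program that cannot afford the full check could keep the ASCII branch and simply reject any data name for which isascii() is false.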



On Sat, May 22, 2010 at 4:29 AM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
Dear Colleagues,

  While everything John says about case-insensitivity is true,
I have to agree with the concept Nick has proposed, which
is to specify the algorithm to be used to achieve case
insensitivity.

  Nick has proposed the use of a tolower routine to disambiguate.
What remains is to specify a particular tolower.  There are many
of them, and they differ in their behavior.  We should pick one.
The major constraint on such a choice is that it should agree
with current CIF1 behavior on ASCII characters.
=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya@dowling.edu
=====================================================

On Fri, 21 May 2010, Bollinger, John C wrote:

> Dear All,
>
> First, my apologies to David for sticking to CIF2 syntax issues as he steers the discussion toward other DDLm questions.  CIF2 syntax seems settled enough to support the discussion he wants to have, but there remain a few syntax issues that need to be settled or at least clarified.  It would be better to handle those *before* CIF2 is released.  I hope that it will not distract too much from the discussion of other DDLm issues if they share the floor with some questions of syntax detail.
>
> Now,
>
> On 2/08/10 11:53:07 AM, "Nick Spadaccini" <nick@csse.uwa.edu.au> wrote:
>
> [...]
>
>> at an application level as one
>> stored the data name it would dataname.toLower() first so that we were
>> consistent.
>>
>> That is a solution and one we use in our system. Since it is case
>> insensitive we choose a particular case to be consistent internally. You
>> choose toUpper if you want, it doesn't matter, so long as you trap the fact
>> that  _atom_site_frac_x and _Atom_Site_Frac_x are the same.
>
> [...]
>
> This is well and good, but it glosses over the complexities of Unicode case mapping.  As Joe originally pointed out, case mapping is *locale-sensitive*.  Moreover, Unicode case mappings are not all 1:1, therefore it makes a difference whether you convert to lowercase or uppercase for comparisons. For any who are interested, there is a Unicode standard annex (TR-21) that discusses these matters in some detail:
> http://unicode.org/reports/tr21/tr21-5.html.
>
>> On 12/01/10 1:11 AM, "Joe Krahn" <krahn@niehs.nih.gov> wrote:
>>
>>> If the consensus is lower-case data names, why not make this part of the
>>> CIF2 standard?
>
> [...]
>
> I don't think it is necessary or useful to require CIF data names to be lowercase, but I do think it would be prudent to define more precisely what "case-insensitive" means for CIF2.  It would be shamefully optimistic to assume that every string library implementation for every relevant language behaves identically for all inputs.  (For example, I was once painfully bitten by differing implementations of String.trim() in Java and C#.)  There *will* be uncertainty about the well-formedness of certain potential CIFs if CIF2 is not precise in this regard, and various processors *will* produce different results when fed such CIFs unless they all implement equivalent case folding semantics.
>
> Apparently, Turkish dotted and dotless 'i' provide a canonical example of the difficulties here.  The Unicode uppercase mapping for U+0131 is 'I', and of course the Unicode uppercase mapping for 'i' is also 'I'.  The Unicode lowercase mapping for U+0130 is 'i'.  Consider these data tags:
>
> _I
> _i
> _<U+0130>  # U+0130 is a Latin capital I with a dot above, used in Turkish
> _<U+0131>  # U+0131 is a Latin lowercase i without a dot, also used in Turkish
>
> Which of them should be considered equivalent?  If you compare the Unicode uppercase conversions then all of them are the same except _<U+0130>.  If you compare the Unicode lowercase conversions then all are the same except _<U+0131>.  That is, unless your string library implements case mapping only for ASCII or maybe Latin-1, in which case it would tell you that only _i and _I were equivalent.  On the other hand, if you use the conventions of the Turkish locale then uppercasing and lowercasing provide consistent results, but they are different from what you get using Unicode convention (case folding according to Turkish yields the equivalent pairs (I, U+0131) and (U+0130, i)).
>
> It gets worse when you consider that some pre-composed characters have case mappings to decomposed character sequences.  This provides another way in which lowercase vs. uppercase comparison can yield different results about name equivalence. Section 1.4 of TR-21 discusses this issue, albeit more with an eye to the possibility of combining Unicode normalization with case folding.
>
> Indeed, normalization should not be ignored in this discussion.  Mainly because of the existence of Unicode combining characters in addition to many pre-composed characters, there are multiple, "canonically equivalent", representations of many Unicode strings.  A Unicode-aware presentation system will often render these equivalents identically, even though they differ at the string level (by use of pre-composed vs. decomposed characters, for example, or by differing order of combining characters).  Unicode defines several normalization formats on which CIF could rely for ignoring the distinctions among these canonical equivalents, if that were desirable.  Unicode's recommendation (TR-21, sections 1.4 and 2.5) is to use normalization in conjunction with case folding for case-insensitive matching.
>
>
> TO SUMMARIZE, the CIF2 specification needs to define the details of what it means for data names (and block and frame codes) to be "case insensitive", in light of the added complexities of case mapping in a general Unicode context.  My recommendation would be to use Unicode's recommended procedure for caseless matching as the basis for judging whether CIF identifiers are equivalent, but this is not the only viable alternative.  It may also be worth expanding the concept of case insensitivity to encompass Unicode canonical (or compatibility) equivalence as well.  My personal inclination would be to do so, but that is certainly debatable, and mine is not a strong position on that.
>
>
> Best Regards,
>
> John
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
>
>
>
> Email Disclaimer:  www.stjude.org/emaildisclaimer
>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>



--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
