
Re: [ddlm-group] Case sensitivity

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] Case sensitivity
From: James Hester <[email protected]>
Date: Mon, 24 May 2010 14:32:51 +1000
In-Reply-To: <[email protected]>
References: <8F77913624F7524AACD2A92EAF3BFA54165DF337E7@SJMEMXMBS11.stjude.sjcrh.local><[email protected]><[email protected]><[email protected]>

I'm glad we agree. Yes, I think NFKC normalisation is the appropriate normalisation for our purposes.

Does anybody disagree with this proposal?

On Mon, May 24, 2010 at 1:32 PM, Herbert J. Bernstein <[email protected]> wrote:

Dear James,

  I think we are saying the same thing conceptually. You say to
use normalization and case folding. Nick says to use tolower. I say
specify precisely what the "tolower" we are to use is to do, and using
the RFC3454 normalization and casefolding sounds reasonable to me.
To avoid any misunderstanding, I believe you are proposing to do
an NFKC normalization, i.e. decompose and then recombine.
Please confirm.
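
[Editorial illustration, not part of the original message.] The "decompose and then recombine" description of NFKC can be sketched with Python's standard unicodedata module. The helper name nfkc and the sample strings are assumptions chosen for this sketch only.

```python
import unicodedata

def nfkc(text: str) -> str:
    """Return the NFKC form: compatibility decomposition, then canonical composition."""
    return unicodedata.normalize("NFKC", text)

# Illustrative inputs (chosen for this sketch, not taken from any CIF file).
ligature = "\ufb01"       # LATIN SMALL LIGATURE FI, a compatibility character
decomposed = "e\u0301"    # 'e' followed by COMBINING ACUTE ACCENT

# Step 1 (decompose): NFKD expands compatibility characters and splits accents.
assert unicodedata.normalize("NFKD", ligature) == "fi"
assert unicodedata.normalize("NFKD", "\u00e9") == "e\u0301"

# Step 2 (recombine): canonical composition rebuilds pre-composed characters.
assert nfkc(decomposed) == "\u00e9"
assert nfkc(ligature) == "fi"

# NFKC gives the same result as composing (NFC) the NFKD form.
sample = ligature + decomposed
two_step = unicodedata.normalize("NFC", unicodedata.normalize("NFKD", sample))
assert two_step == nfkc(sample)
```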

  We should make some test validation sets to help people to verify
their mappings and give them to Brian to post. We should check, but
I believe that libidn has a stringprep for C users.

  Regards,
    Herbert

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 [email protected]
=====================================================

On Mon, 24 May 2010, James Hester wrote:

Dear All,

I believe that Nick and Herbert's preference for choosing a particular
tolower() implementation disregards the complexities of the situation as
outlined in the document that John references, particularly the
locale-dependence of the Turkish/Azeri mappings for the letter 'I' and the need
for normalisation.

My reading of TR21 indicates that the most correct way to proceed would be to
specify that two datanames are equivalent if their normalised, case-folded
equivalents are identical, as John B suggests. In addition to TR21, RFC3454
gives a solid specification as to how comparison should be done
(http://tools.ietf.org/html/rfc3454.html) and has been implemented in the
Python 'stringprep' module.
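
[Editorial illustration, not part of the original message.] A minimal sketch of how such a comparison might look with the Python stringprep tables. It applies only the RFC 3454 mapping tables B.1 and B.2 followed by NFKC, and omits the prohibited-character and bidirectional checks that a full profile would add. The names rfc3454_key and datanames_equivalent are hypothetical, and nothing here is a decided CIF2 procedure.

```python
import stringprep
from unicodedata import ucd_3_2_0  # RFC 3454 is defined against Unicode 3.2

def rfc3454_key(dataname: str) -> str:
    """Hypothetical helper: map a dataname to an RFC 3454-style comparison key."""
    mapped = []
    for ch in dataname:
        if stringprep.in_table_b1(ch):
            continue  # table B.1: characters that are mapped to nothing
        mapped.append(stringprep.map_table_b2(ch))  # table B.2: case folding for use with NFKC
    return ucd_3_2_0.normalize("NFKC", "".join(mapped))

def datanames_equivalent(a: str, b: str) -> bool:
    """Two datanames are treated as equivalent if their keys are identical."""
    return rfc3454_key(a) == rfc3454_key(b)

print(datanames_equivalent("_atom_site_frac_x", "_Atom_Site_Frac_x"))
```

Because the tables are frozen at Unicode 3.2, characters assigned later pass through the mapping step unchanged; whether that is acceptable would need to be settled separately.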

The alternative approach of abandoning case-insensitivity is not as big a win
as it may appear, because Unicode also introduces the issue of normalisation.
The need for character string normalisation arises because a number of Unicode
character sequences may be used to represent the same displayed character (e.g.
multiple accents added to a base character, and in addition the fully or
partially accented character may have its own code point). So even if we were
to abandon case-insensitivity, we have to address normalisation.
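
[Editorial illustration, not part of the original message.] The point about multiple accents and partially accented code points can be made concrete with one displayed character written four ways. The character is an arbitrary choice for this sketch.

```python
import unicodedata

# Four ways to write LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW.
spellings = [
    "\u1ec7",          # fully pre-composed code point
    "\u00ea\u0323",    # e-circumflex + COMBINING DOT BELOW
    "\u1eb9\u0302",    # e-dot-below + COMBINING CIRCUMFLEX ACCENT
    "e\u0302\u0323",   # base letter + two combining accents
]

# Compared code point by code point, every spelling is a different string.
distinct_raw = set(spellings)

# After canonical normalisation (NFC here) they collapse to a single string.
distinct_nfc = {unicodedata.normalize("NFC", s) for s in spellings}

print(len(distinct_raw), len(distinct_nfc))
```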

So I conclude that the best way forward is to specify that datanames are
equivalent if their normalised, case-folded versions are identical as per
RFC3454. In the ASCII domain, this is equivalent to using tolower(), so I
don't believe we are imposing a huge immediate burden on programmers. A
programmer should check for non-ASCII characters in datanames, and either
abandon their processing (if they rely on tolower() and dataname comparisons are
important in their context) or do a fully-fledged check as per RFC3454.
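
[Editorial illustration, not part of the original message.] The two options for a programmer might be sketched as follows. The function name comparison_key and the allow_non_ascii flag are hypothetical, and the non-ASCII branch uses Python's built-in NFKC normalisation and str.casefold() as a stand-in for a full RFC3454 check, which is an assumption rather than an exact equivalent.

```python
import unicodedata

def comparison_key(dataname: str, allow_non_ascii: bool = False) -> str:
    """Hypothetical helper returning a key for case-insensitive dataname comparison."""
    if dataname.isascii():
        # Pure ASCII: plain lowercasing matches CIF1 behaviour.
        return dataname.lower()
    if not allow_non_ascii:
        # Option 1: abandon processing rather than risk a wrong comparison.
        raise ValueError(f"non-ASCII dataname needs a full comparison: {dataname!r}")
    # Option 2: normalise, case-fold, and normalise again (stand-in for RFC3454).
    folded = unicodedata.normalize("NFKC", dataname).casefold()
    return unicodedata.normalize("NFKC", folded)

print(comparison_key("_Atom_Site_Frac_x"))
```

For ASCII input the second branch would give the same result as lower(), so the fast path does not change which names compare equal.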

On Sat, May 22, 2010 at 4:29 AM, Herbert J. Bernstein
<[email protected]> wrote:
     Dear Colleagues,

       While everything John says about case-insensitivity is true,
     I have to agree with the concept Nick has proposed, which
     is to specify the algorithm to be used to achieve case
     insensitivity.

       Nick has proposed the use of a tolower routine to disambiguate.
     What remains is to specify a particular tolower. There are many
     of them, and they differ in their behavior. We should pick one.
     The major constraint on such a choice is that it should agree
     with current CIF1 behavior on ASCII characters.
     =====================================================
      Herbert J. Bernstein, Professor of Computer Science
        Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

        +1-631-244-3035
        [email protected]
     =====================================================

On Fri, 21 May 2010, Bollinger, John C wrote:

> Dear All,
>
> First, my apologies to David for sticking to CIF2 syntax issues as he
> steers the discussion toward other DDLm questions. CIF2 syntax seems
> settled enough to support the discussion he wants to have, but there
> remain a few syntax issues that need to be settled or at least clarified.
> It would be better to handle those *before* CIF2 is released. I hope
> that it will not distract too much from the discussion of other DDLm
> issues if they share the floor with some questions of syntax detail.
>
> Now,
>
> On 2/08/10 11:53:07 AM, "Nick Spadaccini" <[email protected]> wrote:
>
> [...]
>
>> at an application level as one
>> stored the data name it would dataname.toLower() first so that we were
>> consistent.
>>
>> That is a solution and one we use in our system. Since it is case
>> insensitive we choose a particular case to be consistent internally. You
>> choose toUpper if you want it doesn't matter, so long as you trap the fact
>> that _atom_site_frac_x and _Atom_Site_Frac_x are the same.
>
> [...]
>
> This is well and good, but it glosses over the complexities of Unicode
> case mapping. As Joe originally pointed out, case mapping is
> *locale-sensitive*. Moreover, Unicode case mappings are not all 1:1,
> therefore it makes a difference whether you convert to lowercase or
> uppercase for comparisons. For any who are interested, there is a Unicode
> standard annex (TR-21) that discusses these matters in some detail:
> http://unicode.org/reports/tr21/tr21-5.html.
>
>> On 12/01/10 1:11 AM, "Joe Krahn" <[email protected]> wrote:
>>
>>> If the consensus is lower-case data names, why not make this part of the
>>> CIF2 standard?
>
> [...]
>
> I don't think it is necessary or useful to require CIF data names to be
> lowercase, but I do think it would be prudent to define more precisely
> what "case-insensitive" means for CIF2. It would be shamefully
> optimistic to assume that every string library implementation for every
> relevant language behaves identically for all inputs. (For example, I
> was once painfully bitten by differing implementations of String.trim()
> in Java and C#.) There *will* be uncertainty about the well-formedness
> of certain potential CIFs if CIF2 is not precise in this regard, and
> various processors *will* produce different results when fed such CIFs
> unless they all implement equivalent case folding semantics.
>
> Apparently, Turkish dotted and dotless 'i' provide a canonical example
> of the difficulties here. The Unicode uppercase mapping for U+0131 is
> 'I', and of course the Unicode uppercase mapping for 'i' is also 'I'.
> The Unicode lowercase mapping for U+0130 is 'i'. Consider these data
> tags:
>
> _I
> _i
> _<U+0130>  # U+0130 is a Latin capital I with a dot above, used in Turkish
> _<U+0131>  # U+0131 is a Latin lowercase i without a dot, also used in Turkish
>
> Which of them should be considered equivalent? If you compare the
> Unicode uppercase conversions then all of them are the same except
> _<U+0130>. If you compare the Unicode lowercase conversions then all are
> the same except _<U+0131>. Unless your string library implements case
> mapping only for ASCII or maybe Latin-1, in which case it would tell you
> that only _i and _I were equivalent. On the other hand, if you use the
> conventions of the Turkish locale then uppercasing and lowercasing
> provide consistent results, but they are different from what you get
> using Unicode convention (case folding according to Turkish yields the
> equivalent pairs (I, U+0131) and (U+0130, i)).
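
[Editorial illustration, not part of the original message.] The four data tags can be grouped by different case conversions in Python to see how the answer depends on the conversion chosen. The helper name partition is hypothetical. Note that Python's str.lower() applies the full Unicode case mapping, under which U+0130 lowercases to 'i' followed by U+0307 (combining dot above) rather than to a bare 'i', so its lowercase grouping differs from the simple-mapping result described above. That difference is itself an instance of string libraries disagreeing.

```python
from collections import defaultdict

def partition(names, key):
    """Group names whose key(name) values coincide; return sorted groups."""
    groups = defaultdict(list)
    for name in names:
        groups[key(name)].append(name)
    return sorted(sorted(group) for group in groups.values())

# _I, _i, _<U+0130> (capital I with dot above), _<U+0131> (dotless i)
tags = ["_I", "_i", "_\u0130", "_\u0131"]

by_upper = partition(tags, str.upper)   # locale-independent Unicode uppercasing
by_lower = partition(tags, str.lower)   # full Unicode lowercasing, as Python does it

# A conversion that only knows about ASCII letters.
by_ascii = partition(
    tags, lambda s: "".join(c.lower() if c.isascii() else c for c in s)
)

for label, groups in (("upper", by_upper), ("lower", by_lower), ("ascii", by_ascii)):
    print(label, [[ascii(name) for name in group] for group in groups])
```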
>
> It gets worse when you consider that some pre-composed characters have
> case mappings to decomposed character sequences. This provides another
> way in which lowercase vs. uppercase comparison can yield different
> results about name equivalence. Section 1.4 of TR-21 discusses this
> issue, albeit more with an eye to the possibility of combining Unicode
> normalization with case folding.
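
[Editorial illustration, not part of the original message.] One pre-composed character whose uppercase mapping is a decomposed sequence is U+01F0 (small j with caron). The sketch below, using Python's built-in full case mappings, shows uppercase and lowercase comparison disagreeing about the same pair of strings, and normalisation removing the disagreement. The choice of character is an assumption made for the example.

```python
import unicodedata

precomposed = "\u01f0"    # LATIN SMALL LETTER J WITH CARON
upper_form = "J\u030c"    # 'J' + COMBINING CARON (there is no pre-composed capital)

# Uppercasing the pre-composed letter yields a two-code-point sequence.
assert precomposed.upper() == upper_form

# Comparing via uppercase says the two names match...
same_by_upper = precomposed.upper() == upper_form.upper()

# ...but comparing via lowercase does not, because 'j' + caron stays decomposed.
same_by_lower = precomposed.lower() == upper_form.lower()

# Normalising after the case mapping removes the discrepancy.
same_by_lower_nfc = (
    unicodedata.normalize("NFC", precomposed.lower())
    == unicodedata.normalize("NFC", upper_form.lower())
)

print(same_by_upper, same_by_lower, same_by_lower_nfc)
```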
>
> Indeed, normalization should not be ignored in this discussion. Mainly
> because of the existence of Unicode combining characters in addition to
> many pre-composed characters, there are multiple, "canonically
> equivalent", representations of many Unicode strings. A Unicode-aware
> presentation system will often render these equivalents identically, even
> though they differ at the string level (by use of pre-composed vs.
> decomposed characters, for example, or by differing order of combining
> characters). Unicode defines several normalization formats on which CIF
> could rely for ignoring the distinctions among these canonical
> equivalents, if that were desirable. Unicode's recommendation (TR-21,
> sections 1.4 and 2.5) is to use normalization in conjunction with case
> folding for case-insensitive matching.
>
>
> TO SUMMARIZE, the CIF2 specification needs to define the details of
> what it means for data names (and block and frame codes) to be "case
> insensitive", in light of the added complexities of case mapping in a
> general Unicode context. My recommendation would be to use Unicode's
> recommended procedure for caseless matching as the basis for judging
> whether CIF identifiers are equivalent, but this is not the only viable
> alternative. It may also be worth expanding the concept of case
> insensitivity to encompass Unicode canonical (or compatibility)
> equivalence as well. My personal inclination would be to do so, but that
> is certainly debatable, and mine is not a strong position on that.
>
>
> Best Regards,
>
> John
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
>
>
>
> Email Disclaimer: www.stjude.org/emaildisclaimer
>
> _______________________________________________
> ddlm-group mailing list
> [email protected]
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

