Re: [ddlm-group] Case sensitivity

On Sunday, May 23, 2010 11:33 PM, James Hester wrote:

>I'm glad we agree.  Yes, I think NFKC normalisation is the appropriate normalisation for our purposes.
>Does anybody disagree with this proposal?

Not I, but RFC 3454 offers several options among which we must select, and about which it behooves us to be explicit.  Here's a go at it: two CIF data names will be considered equivalent if they map to the same string under

(1) a character-by-character mapping according to section 3 and appendices *B.1 and B.2* of RFC 3454

(2) *followed by normalization* according to Unicode normalization form KC, per section 4.

(3) There are *no prohibited output characters* (atypical for a Stringprep profile, but required by the current form of CIF2 change 3).

(4) Unassigned characters are *allowed*, and

(5) bidirectional text is *not* subject to the optional analysis in section 6.

I think that captures the proposed algorithm, and it's ok with me.
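
For concreteness, the steps listed above could be sketched in Python roughly as follows. This is only an illustration, not a reference implementation: the function names dataname_key and datanames_equivalent are made up for this example, and the final step uses the interpreter's current Unicode tables via unicodedata.normalize, whereas RFC 3454 itself is defined against Unicode 3.2. No prohibited-character, unassigned-character or bidirectional checks are applied, matching the last three points.

```python
import stringprep
import unicodedata


def dataname_key(name: str) -> str:
    """Hypothetical helper: map a data name to its comparison key."""
    mapped = []
    for ch in name:
        # Appendix B.1: characters that are "mapped to nothing" are dropped.
        if stringprep.in_table_b1(ch):
            continue
        # Appendix B.2: case folding intended for use with NFKC.
        mapped.append(stringprep.map_table_b2(ch))
    # Section 4: normalization form KC.
    # Assumption: current Unicode data, not the Unicode 3.2 tables.
    return unicodedata.normalize("NFKC", "".join(mapped))


def datanames_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: two names are equivalent if their keys match."""
    return dataname_key(a) == dataname_key(b)
```

Under this sketch, "_Cell_Length_A" and "_cell_length_a" compare as equivalent, as do "_stra\u00dfe" and "_STRASSE", since B.2 folds U+00DF to "ss".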

>On Mon, May 24, 2010 at 1:32 PM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
>Dear James,
> I think we are saying the same thing conceptually.  You say to
>use normalization and case folding.  Nic says to use tolower.  I say specify precisely what the "tolower" we are to use is to do, and using the RFC3454 normalization and casefolding sounds reasonable to me.
>To avoid any misunderstanding, I believe you are proposing to do
>an NFKC normalization, i.e. decompose and then recombine.


>On Mon, 24 May 2010, James Hester wrote:
>>Dear All,
>>I believe that Nick and Herbert's preference for choosing a particular
>>tolower() implementation disregards the complexities of the situation as
>>outlined in the document that John references, particularly the
>>locale-dependence of the Turkish/Azeri mappings for the letter 'I' and the need
>>for normalisation.
>>My reading of TR21 indicates that the most correct way to proceed would be to
>>specify that two datanames are equivalent if their normalised, case-folded
>>equivalents are identical, as John B suggests.  In addition to TR21, RFC3454
>>gives a solid specification as to how comparison should be done
>>(http://tools.ietf.org/html/rfc3454.html) and has been implemented in the
>>Python 'stringprep' module.
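
To illustrate the need for normalisation raised above, here is a small Python sketch comparing lowercasing alone with lowercasing followed by NFKC. The helper names naive_key and normalised_key are invented for this example, and lower() followed by NFKC is a simplification rather than the RFC 3454 procedure. Python's str.lower() applies the default Unicode mappings, so this sketch does not show the locale-dependent Turkish/Azeri behaviour.

```python
import unicodedata


def naive_key(name: str) -> str:
    """Hypothetical helper: lowercase only, no normalisation."""
    return name.lower()


def normalised_key(name: str) -> str:
    """Hypothetical helper: lowercase, then apply NFKC."""
    return unicodedata.normalize("NFKC", name.lower())


composed = "_\u00c9tat"     # U+00C9, E with acute as a single code point
decomposed = "_E\u0301tat"  # 'E' followed by U+0301 combining acute accent

# The two spellings render identically but differ as code point sequences.
print(naive_key(composed) == naive_key(decomposed))            # False
print(normalised_key(composed) == normalised_key(decomposed))  # True
```

Lowercasing alone leaves the two spellings distinct, while adding the normalisation step brings them to the same string.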



