Re: [ddlm-group] Case sensitivity


On Sunday, May 23, 2010 11:33 PM, James Hester wrote:

>I'm glad we agree.  Yes, I think NFKC normalisation is the appropriate normalisation for our purposes.
>
>Does anybody disagree with this proposal?

Not I, but RFC 3454 offers several options among which we must select, and about which it behooves us to be explicit.  Here's a go at it: two CIF data names will be considered equivalent if they map to the same string under

() a character-by-character mapping according to section 3 and appendices *B.1 and B.2* of RFC 3454

() *followed by normalization* according to Unicode normalization form KC, per section 4.

() There are *no prohibited output characters* (atypical for a Stringprep profile, but required by the current form of CIF2 change 3).

() Unassigned characters are *allowed*, and

() bidirectional text is *not* subject to the optional analysis in section 6.

I think that captures the proposed algorithm, and it's ok with me.
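For concreteness, the comparison listed above could be sketched in Python with the standard-library `stringprep` module mentioned elsewhere in this thread. This is only an illustration under some assumptions: the helper names `canonical_name` and `names_equivalent` are hypothetical, and the sketch normalises with whatever Unicode version the interpreter ships rather than the Unicode 3.2 tables that RFC 3454 is tied to. The last three points (no prohibited output, unassigned characters allowed, no bidirectional check) are reflected simply by performing no such checks.

```python
import stringprep
import unicodedata


def canonical_name(name: str) -> str:
    """Hypothetical helper: reduce a data name to its comparison form."""
    mapped = []
    for ch in name:
        # Appendix B.1: characters that are commonly mapped to nothing.
        if stringprep.in_table_b1(ch):
            continue
        # Appendix B.2: case folding intended for use with NFKC.
        mapped.append(stringprep.map_table_b2(ch))
    # Section 4: normalization form KC. Assumption: the interpreter's
    # current Unicode data is used here, not the Unicode 3.2 tables.
    return unicodedata.normalize("NFKC", "".join(mapped))


def names_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: True if both names map to the same string."""
    # No prohibited-output, unassigned-character or bidi checks are made.
    return canonical_name(a) == canonical_name(b)
```

With this sketch, `names_equivalent("_Cell_Length_A", "_cell_length_a")` is true, and so is a comparison between a name containing U+00DF and the same name spelled with "SS".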

>On Mon, May 24, 2010 at 1:32 PM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
>Dear James,
>
> I think we are saying the same thing conceptually.  You say to
>use normalization and case folding.  Nic says to use tolower.  I say specify precisely what the "tolower" we are to use is to do, and using the RFC3454 normalization and casefolding sounds reasonable to me.
>To avoid any misunderstanding, I believe you are proposing to do
>an NFKC normalization, i.e. decompose and then recombine.

[...]

>On Mon, 24 May 2010, James Hester wrote:
>>Dear All,
>>
>>I believe that Nick and Herbert's preference for choosing a particular
>>tolower() implementation disregards the complexities of the situation as
>>outlined in the document that John references, particularly the
>>locale-dependence of the Turkish/Azeri mappings for the letter 'I' and the need
>>for normalisation.
>>
>>My reading of TR21 indicates that the most correct way to proceed would be to
>>specify that two datanames are equivalent if their normalised, case-folded
>>equivalents are identical, as John B suggests.  In addition to TR21, RFC3454
>>gives a solid specification as to how comparison should be done
>>(http://tools.ietf.org/html/rfc3454.html) and has been implemented in the
>>Python 'stringprep' module.
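A small illustration of the point being made here, namely that a plain lowercasing step is not the same as normalisation plus case folding. It is a sketch using Python's built-in string methods and `unicodedata`, not anything prescribed by TR21 or RFC 3454, and the variable names are made up for the example.

```python
import unicodedata

# The same visible letter spelled two ways: one code point, or a base
# letter followed by a combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"

# Lowercasing alone does not reconcile the two spellings ...
equal_after_lower = composed.lower() == decomposed.lower()

# ... but normalising both to NFKC does.
equal_after_nfkc = (
    unicodedata.normalize("NFKC", composed)
    == unicodedata.normalize("NFKC", decomposed)
)

# Lowercasing and case folding can also disagree: U+00DF (sharp s)
# is unchanged by lower() but folds to "ss".
sharp_s_lower = "\u00df".lower()
sharp_s_folded = "\u00df".casefold()

# The dotted capital I (U+0130) used in Turkish and Azeri does not
# lowercase to a plain "i" in a locale-independent mapping.
dotted_i_lower = "\u0130".lower()
```

Here `equal_after_lower` is false while `equal_after_nfkc` is true, and `dotted_i_lower` is a two-code-point string rather than "i".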

[...]

Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital





_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
