Re: [ddlm-group] Case sensitivity
- To: "'Group finalising DDLm and associated dictionaries'" <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Case sensitivity
- From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
- Date: Mon, 24 May 2010 12:13:57 -0500
- Accept-Language: en-US
- In-Reply-To: <AANLkTilP7Hv7KlBPRCNZoaFVHBsIjZn5qFo6Ai6BWcRu@mail.gmail.com>
- References: <8F77913624F7524AACD2A92EAF3BFA54165DF337E7@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005211411300.25789@epsilon.pair.com><AANLkTinyGeOjv9hu7nhzsNSeLHmlyyNPtXXIzalWuVep@mail.gmail.com><alpine.BSF.2.00.1005232313130.63670@epsilon.pair.com><AANLkTilP7Hv7KlBPRCNZoaFVHBsIjZn5qFo6Ai6BWcRu@mail.gmail.com>
On Sunday, May 23, 2010 11:33 PM, James Hester wrote:
>I'm glad we agree. Yes, I think NFKC normalisation is the appropriate normalisation for our purposes.
>
>Does anybody disagree with this proposal?

Not I, but RFC 3454 offers several options among which we must select, and about which it behooves us to be explicit. Here's a go at it: two CIF data names will be considered equivalent if they map to the same string under

- a character-by-character mapping according to section 3 and appendices *B.1 and B.2* of RFC 3454
- *followed by normalization* according to Unicode normalization form KC, per section 4.
- There are *no prohibited output characters* (atypical for a Stringprep profile, but required by the current form of CIF2 change 3).
- Unassigned characters are *allowed*, and
- bidirectional text is *not* subject to the optional analysis in section 6.

I think that captures the proposed algorithm, and it's ok with me.

>On Mon, May 24, 2010 at 1:32 PM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
>Dear James,
>
> I think we are saying the same thing conceptually. You say to
>use normalization and case folding. Nic says to use tolower. I say specify precisely what the "tolower" we
>are to use is to do, and using the RFC3454 normalization and casefolding sounds reasonable to me.
>To avoid any misunderstanding, I believe you are proposing to do
>an NFKC normalization, i.e. decompose and then recombine.

[...]

>On Mon, 24 May 2010, James Hester wrote:
>>Dear All,
>>
>>I believe that Nick and Herbert's preference for choosing a particular
>>tolower() implementation disregards the complexities of the situation as
>>outlined in the document that John references, particularly the
>>locale-dependence of the Turkish/Azeri mappings for the letter 'I' and the need
>>for normalisation.
>>
>>My reading of TR21 indicates that the most correct way to proceed would be to
>>specify that two datanames are equivalent if their normalised, case-folded
>>equivalents are identical, as John B suggests. In addition to TR21, RFC3454
>>gives a solid specification as to how comparison should be done
>>(http://tools.ietf.org/html/rfc3454.html) and has been implemented in the
>>Python 'stringprep' module.

[...]

Regards,

John

--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer: www.stjude.org/emaildisclaimer
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
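
For illustration only, the equivalence rule proposed in the message above might be sketched with the Python 'stringprep' module that the thread mentions. This is a minimal sketch, not an agreed CIF2 implementation: the function names `dataname_key` and `datanames_equivalent` are hypothetical, and the use of the Unicode 3.2 database for the NFKC step is an assumption based on RFC 3454 being defined against Unicode 3.2.

```python
import stringprep
from unicodedata import ucd_3_2_0  # assumption: normalize against Unicode 3.2, as RFC 3454 does


def dataname_key(name: str) -> str:
    """Return a comparison key for a data name (hypothetical helper).

    Follows the five points of the proposal:
      - map each character via RFC 3454 tables B.1 and B.2,
      - then apply NFKC normalization,
      - no prohibited-output check,
      - no unassigned-code-point check,
      - no bidirectional-text check.
    """
    mapped = []
    for ch in name:
        if stringprep.in_table_b1(ch):
            # Table B.1: characters "commonly mapped to nothing" are dropped.
            continue
        # Table B.2: case folding intended for use with NFKC.
        # The result may be more than one character (e.g. U+00DF maps to "ss").
        mapped.append(stringprep.map_table_b2(ch))
    return ucd_3_2_0.normalize("NFKC", "".join(mapped))


def datanames_equivalent(a: str, b: str) -> bool:
    """Two data names are equivalent if they map to the same key."""
    return dataname_key(a) == dataname_key(b)


if __name__ == "__main__":
    print(datanames_equivalent("_Cell_Length_A", "_cell_length_a"))  # True
```

Because the key function is locale-independent, it avoids the Turkish/Azeri 'I' problem raised in the quoted discussion of `tolower()`; the mapping is fixed by the RFC 3454 tables rather than by the runtime locale.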