Discussion List Archives

Re: [ddlm-group] Case sensitivity

I'm glad we agree.  Yes, I think NFKC is the appropriate normalisation for our purposes.

Does anybody disagree with this proposal?

On Mon, May 24, 2010 at 1:32 PM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
Dear James,

 I think we are saying the same thing conceptually.  You say to
use normalization and case folding.  Nick says to use tolower.  I say we
should specify precisely what the "tolower" we use is to do, and using
the RFC3454 normalization and case folding sounds reasonable to me.
To avoid any misunderstanding, I believe you are proposing to do
an NFKC normalization, i.e. decompose and then recombine.
Please confirm.
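
A minimal sketch of what "decompose and then recombine" means in practice, assuming Python's standard unicodedata module. The helper names and the sample data names are hypothetical and not part of any CIF specification. NFKC is a compatibility decomposition followed by canonical composition, so the one-step and two-step functions should agree.

```python
import unicodedata


def nfkc(text: str) -> str:
    """Normalise in one step with the library's NFKC form."""
    return unicodedata.normalize("NFKC", text)


def nfkc_two_step(text: str) -> str:
    """Spell out NFKC as 'decompose (NFKD), then recombine (NFC)'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return unicodedata.normalize("NFC", decomposed)


# Hypothetical data names chosen only to exercise the normalisation.
samples = [
    "_length_\u212b",  # U+212B ANGSTROM SIGN
    "_e\u0301",        # 'e' followed by U+0301 COMBINING ACUTE ACCENT
    "_\ufb01t",        # U+FB01 LATIN SMALL LIGATURE FI
]

for s in samples:
    # Show code points before and after so the change is visible.
    before = " ".join(f"U+{ord(c):04X}" for c in s)
    after = " ".join(f"U+{ord(c):04X}" for c in nfkc(s))
    print(f"{before}  ->  {after}")
```

The Angstrom sign and the ligature are replaced by their ordinary equivalents, and the base letter plus combining accent is recombined into a single pre-composed code point.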

 We should make some test validation sets to help people verify
their mappings and give them to Brian to post.  We should check, but
I believe that libidn has a stringprep for C users.


 Herbert J. Bernstein, Professor of Computer Science
  Dowling College, Kramer Science Center, KSC 121
       Idle Hour Blvd, Oakdale, NY, 11769


On Mon, 24 May 2010, James Hester wrote:

Dear All,

I believe that Nick and Herbert's preference for choosing a particular
tolower() implementation disregards the complexities of the situation as
outlined in the document that John references, particularly the
locale-dependence of the Turkish/Azeri mappings for the letter 'I' and the need
for normalisation.

My reading of TR21 indicates that the most correct way to proceed would be to
specify that two datanames are equivalent if their normalised, case-folded
equivalents are identical, as John B suggests.  In addition to TR21, RFC3454
gives a solid specification as to how comparison should be done
(http://tools.ietf.org/html/rfc3454.html) and has been implemented in the
Python 'stringprep' module.

The alternative approach of abandoning case-insensitivity is not as big a win
as it may appear, because Unicode also introduces the issue of normalisation. 
The need for character string normalisation arises because a number of Unicode
character sequences may be used to represent the same displayed character (e.g.
multiple accents added to a base character, and in addition the fully or
partially accented character may have its own code point).  So even if we were
to abandon case-insensitivity, we have to address normalisation.

So I conclude that the best way forward is to specify that datanames are
equivalent if their normalised, case-folded versions are identical as per
RFC3454.  In the ASCII domain, this is equivalent to using tolower(), so I
don't believe we are imposing a huge immediate burden on programmers.  A
programmer should check for non-ASCII characters in datanames, and either
abandon their processing (if they rely on tolower() and dataname comparisons are
important in their context) or do a fully-fledged check as per RFC3454.
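
The comparison described above might be sketched as follows, using the Python 'stringprep' module for the RFC3454 mapping tables. This is an illustration under stated assumptions, not a complete stringprep profile: the function names are hypothetical, only the mapping step (tables B.1 and B.2) and NFKC normalisation are applied, the prohibited-character and bidirectional checks of RFC3454 are omitted, and normalisation uses the interpreter's current Unicode data rather than Unicode 3.2.

```python
import stringprep
import unicodedata


def fold_dataname(name: str) -> str:
    """Return a comparison key for a data name (hypothetical helper)."""
    if name.isascii():
        # In the ASCII domain the full procedure reduces to tolower().
        return name.lower()
    mapped = []
    for ch in name:
        if stringprep.in_table_b1(ch):
            # Table B.1: characters commonly mapped to nothing.
            continue
        # Table B.2: case folding intended for use with NFKC.
        mapped.append(stringprep.map_table_b2(ch))
    return unicodedata.normalize("NFKC", "".join(mapped))


def datanames_equivalent(a: str, b: str) -> bool:
    """Two data names are equivalent if their folded keys are identical."""
    return fold_dataname(a) == fold_dataname(b)


print(datanames_equivalent("_atom_site_frac_x", "_Atom_Site_Frac_x"))
print(datanames_equivalent("_\u00c9", "_e\u0301"))
```

A program that cannot do the non-ASCII branch could instead raise an error there, which corresponds to the "abandon processing" option.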

On Sat, May 22, 2010 at 4:29 AM, Herbert J. Bernstein
<yaya@bernstein-plus-sons.com> wrote:
     Dear Colleagues,

       While everything John says about case-insensitivity is true,
     I have to agree with the concept Nick has proposed, which
     is to specify the algorithm to be used to achieve case

       Nick has proposed the use of a tolower routine to disambiguate.
     What remains is to specify a particular tolower.  There are many
     of them, and they differ in their behavior.  We should pick one.
     The major constraint on such a choice is that it should agree
     with current CIF1 behavior on ASCII characters.

      Herbert J. Bernstein, Professor of Computer Science
        Dowling College, Kramer Science Center, KSC 121
             Idle Hour Blvd, Oakdale, NY, 11769


On Fri, 21 May 2010, Bollinger, John C wrote:

> Dear All,
> First, my apologies to David for sticking to CIF2 syntax issues as he
steers the discussion toward other DDLm questions.  CIF2 syntax seems
settled enough to support the discussion he wants to have, but there
remain a few syntax issues that need to be settled or at least clarified.
 It would be better to handle those *before* CIF2 is released.  I hope
that it will not distract too much from the discussion of other DDLm
issues if they share the floor with some questions of syntax detail.
> Now,
> On 2/08/10 11:53:07 AM, "Nick Spadaccini" <nick@csse.uwa.edu.au> wrote:
> [...]
>> at an application level as one
>> stored the data name it would dataname.toLower() first so that we were
>> consistent.
>> That is a solution and one we use in our system. Since it is case
>> insensitive we choose a particular case to be consistent internally.
>> choose toUpper if you want it doesn't matter, so long as you trap the
>> that _atom_site_frac_x and _Atom_Site_Frac_x are the same.
> [...]
> This is well and good, but it glosses over the complexities of Unicode
case mapping.  As Joe originally pointed out, case mapping is
*locale-sensitive*.  Moreover, Unicode case mappings are not all 1:1,
therefore it makes a difference whether you convert to lowercase or
uppercase for comparisons. For any who are interested, there is a Unicode
standard annex (TR-21) that discusses these matters in some detail:
> http://unicode.org/reports/tr21/tr21-5.html.
>> On 12/01/10 1:11 AM, "Joe Krahn" <krahn@niehs.nih.gov> wrote:
>>> If the consensus is lower-case data names, why not make this part of
>>> CIF2 standard?
> [...]
> I don't think it is necessary or useful to require CIF data names to be
lowercase, but I do think it would be prudent to define more precisely
what "case-insensitive" means for CIF2.  It would be shamefully
optimistic to assume that every string library implementation for every
relevant language behaves identically for all inputs.  (For example, I
was once painfully bitten by differing implementations of String.trim()
in Java and C#.)  There *will* be uncertainty about the well-formedness
of certain potential CIFs if CIF2 is not precise in this regard, and
various processors *will* produce different results when fed such CIFs
unless they all implement equivalent case folding semantics.
> Apparently, Turkish dotted and dotless 'i' provide a canonical example
of the difficulties here.  The Unicode uppercase mapping for U+0131 is
'I', and of course the Unicode uppercase mapping for 'i' is also 'I'.
 The Unicode lowercase mapping for U+0130 is 'i'.  Consider these data
> _I
> _i
> _<U+0130>  # U+0130 is a Latin capital I with a dot above, used in
> _<U+0131>  # U+0131 is a Latin lowercase i without a dot, also used in
> Which of them should be considered equivalent?  If you compare the
Unicode uppercase conversions then all of them are the same except
_<U+0130>.  If you compare the Unicode lowercase conversions then all are
the same except _<U+0131>.  Unless your string library implements case
mapping only for ASCII or maybe Latin-1, in which case it would tell you
that only _i and _I were equivalent.  On the other hand, if you use the
conventions of the Turkish locale then uppercasing and lowercasing
provide consistent results, but they are different from what you get
using Unicode convention (case folding according to Turkish yields the
equivalent pairs (I, U+0131) and (U+0130, i)).
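
The groupings described above can be reproduced with a short sketch. Assumptions: Python's str.lower() applies the full Unicode mapping, under which U+0130 becomes 'i' followed by U+0307 rather than plain 'i', so the 1:1 lowercase mapping quoted in this message is hard-coded in a hypothetical helper. The Turkish folding table is hand-written for these four letters only and is not taken from any locale library.

```python
# The four one-letter names from the example (leading underscore omitted).
NAMES = ["I", "i", "\u0130", "\u0131"]


def simple_lower(ch: str) -> str:
    """1:1 lowercase mapping as described in the message (U+0130 -> 'i')."""
    return "i" if ch == "\u0130" else ch.lower()


# Illustrative Turkish-style folding: I -> dotless i, dotted capital I -> i.
TURKISH_FOLD = {"I": "\u0131", "\u0130": "i"}


def turkish_fold(ch: str) -> str:
    return TURKISH_FOLD.get(ch, ch)


def equivalence_classes(names, key):
    """Group names whose key values are identical, in first-seen order."""
    groups = {}
    for name in names:
        groups.setdefault(key(name), []).append(name)
    return list(groups.values())


for label, key in [
    ("uppercase", str.upper),
    ("simple lowercase", simple_lower),
    ("full lowercase (Python)", str.lower),
    ("Turkish folding", turkish_fold),
]:
    print(label, equivalence_classes(NAMES, key))
```

Each choice of mapping yields a different partition of the same four names, which is the ambiguity the specification would need to remove.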
> It gets worse when you consider that some pre-composed characters have
case mappings to decomposed character sequences.  This provides another
way in which lowercase vs. uppercase comparison can yield different
results about name equivalence. Section 1.4 of TR-21 discusses this
issue, albeit more with an eye to the possibility of combining Unicode
normalization with case folding.
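
One concrete instance of this, sketched with Python's built-in case mappings (the two data names are hypothetical): U+01F0, small j with caron, has no pre-composed capital, so its uppercase form is the two-character sequence 'J' plus U+030C.

```python
import unicodedata

precomposed = "_\u01f0"   # LATIN SMALL LETTER J WITH CARON
decomposed = "_J\u030c"   # 'J' followed by COMBINING CARON

# Uppercasing maps the pre-composed letter to the decomposed sequence,
# so the two names compare equal.
same_upper = precomposed.upper() == decomposed.upper()

# Lowercasing leaves one name pre-composed and the other decomposed,
# so a plain string comparison reports them as different.
same_lower = precomposed.lower() == decomposed.lower()

# Normalising after lowercasing removes the discrepancy.
same_lower_nfc = (
    unicodedata.normalize("NFC", precomposed.lower())
    == unicodedata.normalize("NFC", decomposed.lower())
)

print(same_upper, same_lower, same_lower_nfc)
```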
> Indeed, normalization should not be ignored in this discussion.  Mainly
because of the existence of Unicode combining characters in addition to
many pre-composed characters, there are multiple, "canonically
equivalent", representations of many Unicode strings.  A Unicode-aware
presentation system will often render these equivalents identically, even
though they differ at the string level (by use of pre-composed vs.
decomposed characters, for example, or by differing order of combining
characters).  Unicode defines several normalization formats on which CIF
could rely for ignoring the distinctions among these canonical
equivalents, if that were desirable.  Unicode's recommendation (TR-21,
sections 1.4 and 2.5) is to use normalization in conjunction with case
folding for case-insensitive matching.
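
A small sketch of canonical equivalence and of normalisation combined with case folding, assuming Python's unicodedata module. The key function is a hypothetical helper, and the order of operations (normalise, fold, normalise again) is one common formulation rather than a quotation from TR-21.

```python
import unicodedata


def canonical_caseless_key(text: str) -> str:
    """Normalise to NFD, case-fold, then normalise to NFD again."""
    nfd = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", nfd.casefold())


# Pre-composed versus decomposed spellings of the same name.
precomposed = "_\u00c5ngstr\u00f6m"
decomposed = "_A\u030angstro\u0308m"

# The same combining marks (dot above, dot below) in different order.
order_1 = "_q\u0307\u0323"
order_2 = "_q\u0323\u0307"

# The raw strings differ at the code-point level ...
print(precomposed == decomposed, order_1 == order_2)

# ... but their normalised, case-folded keys are identical.
print(canonical_caseless_key(precomposed) == canonical_caseless_key(decomposed.lower()))
print(canonical_caseless_key(order_1) == canonical_caseless_key(order_2))
```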
> TO SUMMARIZE, the CIF2 specification needs to define the details of
what it means for data names (and block and frame codes) to be "case
insensitive", in light of the added complexities of case mapping in a
general Unicode context.  My recommendation would be to use Unicode's
recommended procedure for caseless matching as the basis for judging
whether CIF identifiers are equivalent, but this is not the only viable
alternative.  It may also be worth expanding the concept of case
insensitivity to encompass Unicode canonical (or compatibility)
equivalence as well.  My personal inclination would be to do so, but that
is certainly debatable, and mine is not a strong position on that.
> Best Regards,
> John
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
> Email Disclaimer:  www.stjude.org/emaildisclaimer
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group

T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

