Discussion List Archives


Re: [ddlm-group] Case sensitivity

Dear James,

   I think we are saying the same thing conceptually.  You say to
use normalization and case folding.  Nick says to use tolower.  I say
we should specify precisely what the "tolower" we use is to do, and
the RFC3454 normalization and case folding sounds reasonable to me.
To avoid any misunderstanding, I believe you are proposing an NFKC
normalization, i.e. decompose (by compatibility) and then recompose.
Please confirm.
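
To make "decompose and then recompose" concrete, here is a minimal
sketch using Python's standard unicodedata module.  It only illustrates
what NFKC does to a string; it is not a statement of what RFC3454
requires, and the helper name nfkc is made up for this example.

```python
import unicodedata

def nfkc(text):
    """Compatibility-decompose, then canonically recompose (NFKC)."""
    return unicodedata.normalize("NFKC", text)

# 'e' followed by a combining acute accent recomposes to the single
# code point U+00E9, so the length drops from 2 to 1.
decomposed = "e\u0301"
print(len(decomposed), len(nfkc(decomposed)))

# The compatibility ligature U+FB01 is replaced by the plain letters "fi".
print(nfkc("\ufb01"))

# NFKC gives the same result as NFKD (decompose) followed by NFC (recompose).
sample = "_caf\u00e9_\ufb01eld"
two_step = unicodedata.normalize("NFC", unicodedata.normalize("NFKD", sample))
print(nfkc(sample) == two_step)
```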

   We should make some validation test sets to help people verify
their mappings, and give them to Brian to post.  We should check, but
I believe that libidn has a stringprep for C users.
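
A validation set could be as simple as a list of name pairs with the
expected answer.  The sketch below is only a suggestion of the shape
such a set might take: the three vectors, and the names VECTORS,
reference_fold and failed_vectors, are hypothetical and not an agreed
test suite.

```python
import unicodedata

# Hypothetical test vectors: (name_a, name_b, should_match).
VECTORS = [
    ("_atom_site_frac_x", "_Atom_Site_Frac_x", True),   # ASCII case only
    ("_caf\u00e9", "_cafe\u0301", True),                # precomposed vs combining accent
    ("_atom_site_frac_x", "_atom_site_frac_y", False),  # genuinely different names
]

def reference_fold(name):
    """One candidate mapping: NFKC, case-fold, then NFKC again."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return unicodedata.normalize("NFKC", folded)

def failed_vectors(fold):
    """Return the vectors on which the mapping `fold` gives the wrong answer."""
    return [
        (a, b, expected)
        for a, b, expected in VECTORS
        if (fold(a) == fold(b)) != expected
    ]

print(failed_vectors(reference_fold))   # candidate Unicode-aware mapping
print(failed_vectors(str.lower))        # plain lowercasing, no normalisation
```

An implementer would pass in their own mapping function in place of
reference_fold and look for an empty list.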

   Regards,
     Herbert



=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Mon, 24 May 2010, James Hester wrote:

> Dear All,
> 
> I believe that Nick and Herbert's preference for choosing a particular
> tolower() implementation disregards the complexities of the situation as
> outlined in the document that John references, particularly the
> locale-dependence of the Turkish/Azeri mappings for the letter 'I' and the need
> for normalisation.
> 
> My reading of TR21 indicates that the most correct way to proceed would be to
> specify that two datanames are equivalent if their normalised, case-folded
> equivalents are identical, as John B suggests.  In addition to TR21, RFC3454
> gives a solid specification as to how comparison should be done
> (http://tools.ietf.org/html/rfc3454.html) and has been implemented in the
> Python 'stringprep' module.
> 
> The alternative approach of abandoning case-insensitivity is not as big a win
> as it may appear, because Unicode also introduces the issue of normalisation. 
> The need for character string normalisation arises because a number of Unicode
> character sequences may be used to represent the same displayed character (e.g.
> multiple accents added to a base character, and in addition the fully or
> partially accented character may have its own code point).  So even if we were
> to abandon case-insensitivity, we have to address normalisation.
> 
> So I conclude that the best way forward is to specify that datanames are
> equivalent if their normalised, case-folded versions are identical as per
> RFC3454.  In the ASCII domain, this is equivalent to using tolower(), so I
> don't believe we are imposing a huge immediate burden on programmers.  A
> programmer should check for non-ASCII characters in datanames, and either
> abandon their processing (if they rely on tolower() and dataname comparisons are
> important in their context) or do a fully-fledged check as per RFC3454.
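
The two-path approach described above might be sketched as follows
with the Python 'stringprep' module.  This is an illustration under
assumptions, not a reference implementation: the choice of tables B.1
and B.2 followed by NFKC is one reading of RFC3454, and the function
names rfc3454_fold, dataname_key and same_dataname are invented here.

```python
import stringprep
import unicodedata

def rfc3454_fold(name):
    """Apply stringprep tables B.1 and B.2, then NFKC, to a name."""
    mapped = []
    for ch in name:
        if stringprep.in_table_b1(ch):      # characters "mapped to nothing"
            continue
        mapped.append(stringprep.map_table_b2(ch))  # case folding for use with NFKC
    return unicodedata.normalize("NFKC", "".join(mapped))

def dataname_key(name):
    """Comparison key: cheap lowercasing for ASCII, full mapping otherwise."""
    if name.isascii():
        return name.lower()
    return rfc3454_fold(name)

def same_dataname(a, b):
    """True if the two data names have identical comparison keys."""
    return dataname_key(a) == dataname_key(b)

print(same_dataname("_atom_site_frac_x", "_Atom_Site_Frac_x"))
print(same_dataname("_CAF\u00c9", "_cafe\u0301"))
```

A program that cannot do the full mapping could instead raise an error
in the non-ASCII branch rather than compare incorrectly.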
> 
> 
> 
> On Sat, May 22, 2010 at 4:29 AM, Herbert J. Bernstein
> <yaya@bernstein-plus-sons.com> wrote:
>       Dear Colleagues,
>
>         While everything John says about case-insensitivity is true,
>       I have to agree with the concept Nick has proposed, which
>       is to specify the algorithm to be used to achieve case
>       insensitivity.
>
>         Nick has proposed the use of a tolower routine to disambiguate.
>       What remains is to specify a particular tolower.  There are many
>       of them, and they differ in their behavior.  We should pick one.
>       The major constraint on such a choice is that it should agree
>       with current CIF1 behavior on ASCII characters.
>       =====================================================
>        Herbert J. Bernstein, Professor of Computer Science
>          Dowling College, Kramer Science Center, KSC 121
>               Idle Hour Blvd, Oakdale, NY, 11769
>
>                        +1-631-244-3035
>                        yaya@dowling.edu
>       =====================================================
> 
> On Fri, 21 May 2010, Bollinger, John C wrote:
> 
> > Dear All,
> >
> > First, my apologies to David for sticking to CIF2 syntax issues as he
> > steers the discussion toward other DDLm questions.  CIF2 syntax seems
> > settled enough to support the discussion he wants to have, but there
> > remain a few syntax issues that need to be settled or at least
> > clarified.  It would be better to handle those *before* CIF2 is
> > released.  I hope that it will not distract too much from the
> > discussion of other DDLm issues if they share the floor with some
> > questions of syntax detail.
> >
> > Now,
> >
> > On 2/08/10 11:53:07 AM, "Nick Spadaccini" <nick@csse.uwa.edu.au> wrote:
> >
> > [...]
> >
> >> at an application level as one
> >> stored the data name it would dataname.toLower() first so that we were
> >> consistent.
> >>
> >> That is a solution and one we use in our system. Since it is case
> >> insensitive we choose a particular case to be consistent internally.
> >> You choose toUpper if you want, it doesn't matter, so long as you trap
> >> the fact that _atom_site_frac_x and _Atom_Site_Frac_x are the same.
> >
> > [...]
> >
> > This is well and good, but it glosses over the complexities of Unicode
> > case mapping.  As Joe originally pointed out, case mapping is
> > *locale-sensitive*.  Moreover, Unicode case mappings are not all 1:1,
> > therefore it makes a difference whether you convert to lowercase or
> > uppercase for comparisons.  For any who are interested, there is a
> > Unicode standard annex (TR-21) that discusses these matters in some
> > detail:
> > http://unicode.org/reports/tr21/tr21-5.html.
> >
> >> On 12/01/10 1:11 AM, "Joe Krahn" <krahn@niehs.nih.gov> wrote:
> >>
> >>> If the consensus is lower-case data names, why not make this part of
> >>> the CIF2 standard?
> >
> > [...]
> >
> > I don't think it is necessary or useful to require CIF data names to be
> > lowercase, but I do think it would be prudent to define more precisely
> > what "case-insensitive" means for CIF2.  It would be shamefully
> > optimistic to assume that every string library implementation for every
> > relevant language behaves identically for all inputs.  (For example, I
> > was once painfully bitten by differing implementations of String.trim()
> > in Java and C#.)  There *will* be uncertainty about the well-formedness
> > of certain potential CIFs if CIF2 is not precise in this regard, and
> > various processors *will* produce different results when fed such CIFs
> > unless they all implement equivalent case folding semantics.
> >
> > Apparently, Turkish dotted and dotless 'i' provide a canonical example
> > of the difficulties here.  The Unicode uppercase mapping for U+0131 is
> > 'I', and of course the Unicode uppercase mapping for 'i' is also 'I'.
> > The Unicode lowercase mapping for U+0130 is 'i'.  Consider these data
> > tags:
> >
> > _I
> > _i
> > _<U+0130>  # U+0130 is a Latin capital I with a dot above, used in Turkish
> > _<U+0131>  # U+0131 is a Latin lowercase i without a dot, also used in Turkish
> >
> > Which of them should be considered equivalent?  If you compare the
> > Unicode uppercase conversions then all of them are the same except
> > _<U+0130>.  If you compare the Unicode lowercase conversions then all
> > are the same except _<U+0131>.  That is, unless your string library
> > implements case mapping only for ASCII or maybe Latin-1, in which case
> > it would tell you that only _i and _I were equivalent.  On the other
> > hand, if you use the conventions of the Turkish locale then uppercasing
> > and lowercasing provide consistent results, but they are different from
> > what you get using the Unicode convention (case folding according to
> > Turkish yields the equivalent pairs (I, U+0131) and (U+0130, i)).
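
The effect on the four tags can be tried directly.  The sketch below
groups them by the result of Python's str.upper, str.lower and
str.casefold; the helper equivalence_classes is a name invented for
this example.  Note one assumption that matters: Python's string
methods are locale-independent and apply the full (not one-to-one)
Unicode mappings, so U+0130 lowercases to 'i' followed by U+0307 and
does not end up grouped with _i, unlike the simple mapping described
above.

```python
def equivalence_classes(names, key):
    """Group names whose `key` values are identical, preserving input order."""
    groups = {}
    for name in names:
        groups.setdefault(key(name), []).append(name)
    return list(groups.values())

# The four data tags from the message: I, i, U+0130 and U+0131.
TAGS = ["_I", "_i", "_\u0130", "_\u0131"]

for label, key in [("upper", str.upper), ("lower", str.lower), ("casefold", str.casefold)]:
    classes = equivalence_classes(TAGS, key)
    # Show escaped code points so the dotted and dotless forms are unambiguous.
    print(label, [[ascii(n) for n in group] for group in classes])
```

Whichever mapping is used, the grouping depends on the direction of the
conversion, which is the point being made.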
> >
> > It gets worse when you consider that some pre-composed characters have
> > case mappings to decomposed character sequences.  This provides another
> > way in which lowercase vs. uppercase comparison can yield different
> > results about name equivalence.  Section 1.4 of TR-21 discusses this
> > issue, albeit more with an eye to the possibility of combining Unicode
> > normalization with case folding.
> >
> > Indeed, normalization should not be ignored in this discussion.  Mainly
> > because of the existence of Unicode combining characters in addition to
> > many pre-composed characters, there are multiple, "canonically
> > equivalent", representations of many Unicode strings.  A Unicode-aware
> > presentation system will often render these equivalents identically,
> > even though they differ at the string level (by use of pre-composed vs.
> > decomposed characters, for example, or by differing order of combining
> > characters).  Unicode defines several normalization formats on which
> > CIF could rely for ignoring the distinctions among these canonical
> > equivalents, if that were desirable.  Unicode's recommendation (TR-21,
> > sections 1.4 and 2.5) is to use normalization in conjunction with case
> > folding for case-insensitive matching.
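
As a rough illustration of normalization combined with case folding,
the sketch below builds a comparison key as NFD, case fold, NFD again,
which is my understanding of the Unicode Standard's definition of a
canonical caseless match.  The function name canonical_caseless_key and
the sample strings are made up for the example.

```python
import unicodedata

def canonical_caseless_key(text):
    """NFD, case-fold, NFD again: a key for canonical caseless matching."""
    nfd = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", nfd.casefold())

# Same letter with the same two marks, entered in a different order
# and with a different case for the base letter.
a = "_a\u0323\u0307"   # a + dot below + dot above
b = "_A\u0307\u0323"   # A + dot above + dot below

# Case folding alone does not reorder the combining marks.
print(a.casefold() == b.casefold())
# With normalization, the two names compare equal.
print(canonical_caseless_key(a) == canonical_caseless_key(b))
```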
> >
> >
> > TO SUMMARIZE, the CIF2 specification needs to define the details of
> > what it means for data names (and block and frame codes) to be "case
> > insensitive", in light of the added complexities of case mapping in a
> > general Unicode context.  My recommendation would be to use Unicode's
> > recommended procedure for caseless matching as the basis for judging
> > whether CIF identifiers are equivalent, but this is not the only viable
> > alternative.  It may also be worth expanding the concept of case
> > insensitivity to encompass Unicode canonical (or compatibility)
> > equivalence as well.  My personal inclination would be to do so, but
> > that is certainly debatable, and mine is not a strong position on that.
> >
> >
> > Best Regards,
> >
> > John
> > --
> > John C. Bollinger, Ph.D.
> > Department of Structural Biology
> > St. Jude Children's Research Hospital
> >
> >
> >
> > Email Disclaimer:  www.stjude.org/emaildisclaimer
> >
> > _______________________________________________
> > ddlm-group mailing list
> > ddlm-group@iucr.org
> > http://scripts.iucr.org/mailman/listinfo/ddlm-group
> >
> 
> 
> 
> 
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> 
>

International Union of Crystallography
