Re: [ddlm-group] Searching for a compromise on eliding

I just checked the Unicode 5.2.0 names "database" that Python 2.7 uses.
It has 21829 names.  There is a well-documented Python reference 
implementation of an API for translation at:

http://docs.python.org/library/unicodedata.html

If nobody has done it yet, at first glance it does not look
too difficult to make matching LGPL'd C/C++/Java APIs.  I
am not saying it is a trivial task, but it does look doable
as part of making a full UTF-8 support package for CIF2.
Would having that make a Python 3 version of proposal
P-prime acceptable?
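
For reference, the translation API in question is small. The following is a minimal sketch using only the standard unicodedata module; the wrapper names name_to_char and char_to_name are hypothetical and simply mark the two directions a matching C/C++/Java API would need to cover.

```python
import unicodedata


def name_to_char(name: str) -> str:
    """Translate a Unicode character name to the character itself.

    unicodedata.lookup raises KeyError when the name is not in the database.
    """
    return unicodedata.lookup(name)


def char_to_name(ch: str) -> str:
    """Translate a character to its Unicode name, or '' if it has none."""
    return unicodedata.name(ch, "")


# The version of the Unicode database bundled with the running interpreter.
database_version = unicodedata.unidata_version
```

The database version depends on the interpreter, so a file using names added in a later Unicode release would fail the lookup on an older installation.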

-- Herbert
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Fri, 25 Feb 2011, Herbert J. Bernstein wrote:

> Dear Colleagues,
>
>  I support both of Simon's suggestions:
>
>  1. Add the elides of P-prime to the strings delimited by
> the single quote and the strings delimited by the double quote;
> and
>
>  2. Review all currently proposed changes to ensure things have
> not become "messy"
>
> To help in understanding P-prime and Simon's first suggestion, and
> thereby to help in executing Simon's second suggestion, here
> is where to find the Python 2.7 lexical analysis and elides:
>
> http://docs.python.org/reference/lexical_analysis.html
>
> Please note a very important difference between the Python semantics
> and those of C:
>
> Unlike Standard C, all unrecognized escape sequences are left in the string 
> unchanged, i.e., the backslash is left in the string. (This behavior is 
> useful when debugging: if an escape sequence is mistyped, the resulting 
> output is more easily recognized as broken.)
>
> So under P-prime and Simon's proposal 1, the full list of recognized
> elides is:
>
> \newline                ignored
> \\                      backslash
> \'                      single quote
> \"                      double quote
> \a                      ASCII Bell (BEL)
> \b                      ASCII Backspace (BS)
> \f                      ASCII Formfeed (FF)
> \n                      ASCII Linefeed (LF)
> \r                      ASCII Carriage Return (CR)
> \t                      ASCII Horizontal Tab (TAB)
> \v                      ASCII Vertical Tab (VT)
> \ooo                    Character with octal value ooo (1-3 octal digits)
> \xhh                    Character with hex value hh (2 hex digits)
>
> Note that hexadecimal and octal escapes denote the byte with the given value; 
> it is not necessary that the byte encodes a character in the source character 
> set.
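
The table above can be sketched as a small decoder. This is a minimal illustration, not part of any proposal: the name decode_elides is hypothetical, and it works on bytes because under the 2.7 rules \xhh and \ooo denote byte values. Two simplifications are assumed: a malformed \x escape is left untouched (Python 2.7 itself rejects it), and octal values above 377 are masked to one byte.

```python
import re

# Single-character elides from the table, mapped to the byte they denote.
SIMPLE = {
    b"\\": b"\\", b"'": b"'", b'"': b'"',
    b"a": b"\x07", b"b": b"\x08", b"f": b"\x0c",
    b"n": b"\n", b"r": b"\r", b"t": b"\t", b"v": b"\x0b",
    b"\n": b"",  # backslash-newline is ignored
}

# One alternative per kind of row in the table. Anything else following a
# backslash does not match, so the backslash stays in the output unchanged.
ELIDE = re.compile(rb"\\(?:([0-7]{1,3})|x([0-9a-fA-F]{2})|([\\'\"abfnrtv\n]))")


def decode_elides(raw: bytes) -> bytes:
    """Replace the recognized elides in raw; leave everything else as-is."""
    def repl(match):
        octal, hexa, simple = match.groups()
        if octal is not None:
            return bytes([int(octal, 8) & 0xFF])  # assumption: mask to a byte
        if hexa is not None:
            return bytes([int(hexa, 16)])
        return SIMPLE[simple]
    return ELIDE.sub(repl, raw)
```

With this sketch, decode_elides(rb"C\"") gives the two bytes C and ", while an unrecognized sequence such as rb"\q" comes back with its backslash intact.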
>
> In deference to Simon's second suggestion, please note that this differs from 
> Python 3 handling of un-prefixed triple-quoted strings in two ways:
>
> 1.  Python 3 adds \N{name} referencing names in the Unicode database, as well 
> as adding \uxxxx and \Uxxxxxxxx giving hex values for Unicode code points.
> 2.  The hexadecimal and octal escapes encode the Unicode character at
> the code point.
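
The two differences can be seen directly in a Python 3 interpreter. This is only an illustrative sketch; the variable names are arbitrary, and the bytes literals stand in for the 2.7 behaviour, where hexadecimal and octal escapes denote byte values.

```python
# Difference 1: Python 3 text literals accept \N{name}, \uxxxx and \Uxxxxxxxx.
named = "\N{ANGSTROM SIGN}"
short = "\u212b"
long_form = "\U0000212b"
same_char = named == short == long_form

# Difference 2: in a Python 3 text literal, \xhh and \ooo give the character
# at that code point; in a bytes literal they give the byte with that value.
as_text = "\xe9\351"    # two characters, both U+00E9
as_bytes = b"\xe9\351"  # two bytes, both 0xE9

# Once the text is written out as UTF-8, each U+00E9 occupies two bytes.
utf8_len = len(as_text.encode("utf-8"))
```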
>
> I suggest we stay with the 2.7 version.
>
>  Regards,
>    Herbert
>
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
>   Dowling College, Kramer Science Center, KSC 121
>        Idle Hour Blvd, Oakdale, NY, 11769
>
>                 +1-631-244-3035
>                 yaya@dowling.edu
> =====================================================
>
> On Fri, 25 Feb 2011, SIMON WESTRIP wrote:
>
>> If there is acceptance of P', logic suggests that the same
>> approach should be taken towards the single-quoted strings,
>> e.g. a user might question:
>> 
>> I can do this """C\"""", so why can't I do this: "C\"" ?
>> 
>> This would then just leave the semicolon-delimited fields as
>> the means to store 'raw' strings.
>> 
>> This may be 'maximally disruptive', but CIF2 is already distinct
>> from CIF1 and will require conversion of e.g. "C\"" in any case
>> (the latter is valid CIF1).
>> 
>> Basically, I worry that the compromise is starting to make CIF look
>> a bit 'messy'; perhaps all the changes should be reviewed...
>> 
>> Cheers
>> 
>> Simon
>> 
>> 
>> ____________________________________________________________________________
>> From: James Hester <jamesrhester@gmail.com>
>> To: ddlm-group <ddlm-group@iucr.org>
>> Sent: Friday, 25 February, 2011 2:50:59
>> Subject: [ddlm-group] Searching for a compromise on eliding
>> 
>> Dear DDLm-group,
>> 
>> I think we have all had a decent chance to argue our case for
>> Proposals P, F and F'.  I have also been in small side discussions
>> with Ralf and John W.  Their points of view can be summarised as
>> follows:
>> (i) Behaviour of triple-quoted strings will be too confusing unless
>> Python behaviour is followed (Ralf)
>> (ii) There is considerable criticism of CIF in the macromolecular
>> community because of idiosyncratic behaviour, particularly concerning
>> quoting.  We should therefore stick to accepted standards as much as
>> possible (John W)
>> 
>> For John W and Ralf these points outweigh any of the disadvantages of
>> Proposal P, and so Proposal P remains their first choice.  Proposal P
>> is therefore the first choice of 3 out of 5 COMCIFS voters, and the
>> last choice of the other two (I would rank it worse than doing
>> nothing, actually).  I note that non-voting members are uniformly
>> opposed to Proposal P.
>> 
>> I therefore want to try to seek some common middle ground in the hope
>> that I can find a proposal that could be at least as acceptable as
>> Proposal P to Ralf and/or Herbert and/or John W.
>> 
>> Consider the following four new proposals - P-prime, Q, G and null:
>> 
>> * Proposal P-prime: triple-quoted strings are treated as for Python
>> 2.7.  No Unicode or raw strings are defined (i.e. no strings starting
>> u""" or r""").
>> 
>> I interpret John W and Ralf's position to be that they would be able
>> to support this proposal as the preferred choice, as our syntax would
>> still be entirely consistent with Python.  This proposal is a
>> considerable improvement on Proposal P, because the dangers of raw
>> strings are taken out of the equation, and the Unicode database is no
>> longer a dependency.  We are still left with a whole bunch of (frankly
>> pointless) elides, leading to Proposal Q:
>> 
>> * Proposal Q: As for Proposal P-prime, with the following changes:
>> (1) Only <backslash><delimiter> and <backslash><backslash> when it
>> precedes <backslash><delimiter> are recognised escape sequences at the
>> syntactical level
>> (2) A DDLm string type, e.g. "CText", is defined in com_val.dic for
>> which the remaining escape sequences have the meaning assigned to them
>> by the Python 2.7 standard.  mmCIF and related domains can standardise
>> their definitions on this string type and derivatives, making the
>> above division between syntax and dictionary invisible to users and
>> programmers in their domain.
>> 
>> * Proposal G: Proposal F', but with a different delimiter
>> 
>> Ralf has indicated that he actually thinks Proposal F' is best, but
>> only if the delimiters are not going to be confused with Python
>> delimiters.  I interpret John W's position to be that he would not
>> support such a change in delimiters as that would simply make CIF even
>> more idiosyncratic.  Anyway, any such replacement delimiter would need
>> to be multi-character, easy to type and unlikely to occur as the first
>> characters in CIF1 datavalues.  We would also need to reduce the
>> character set of non-delimited CIF2 strings to exclude any such
>> delimiters.  Ideas?
>> 
>> * Null proposal: do nothing as we can't agree
>> 
>> I think I could support Proposal Q as an acceptable fallback from F',
>> and if somebody can find sensible delimiters for Proposal G that works
>> for me as well.  The preferred treatment for backslash-rich text for
>> Proposals P, P' and Q will necessarily be semicolon-delimited strings.
>> 
>> James.
>> --
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>> 
>