
Re: [ddlm-group] Searching for a compromise on eliding

I just checked the Unicode 5.2.0 names "database" that Python 2.7 uses.
It has 21829 names.  There is a well-documented Python reference 
implementation of an API for translation at:


If nobody has done it yet, at first glance it does not look
too difficult to make matching LGPL'd C/C++/Java APIs.  I
am not saying it is a trivial task, but it does look doable
as part of making a full UTF8 support package for CIF2.
Would having that make a Python 3 version of proposal
P-prime acceptable?
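
For illustration only, here is a minimal sketch of the kind of name
lookup a \N{name} elide would need. It uses the standard-library
unicodedata module; whether that is the reference implementation
mentioned above is an assumption, and expand_named is a hypothetical
helper name.

```python
import unicodedata

def expand_named(name):
    """Return the character whose Unicode name is `name` (hypothetical helper)."""
    # unicodedata.lookup raises KeyError if the name is not in the database.
    return unicodedata.lookup(name)

ch = expand_named("ANGSTROM SIGN")
print(repr(ch), unicodedata.name(ch))
```

A C, C++ or Java port would need the same name table plus a lookup
routine of this shape.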

-- Herbert
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769


On Fri, 25 Feb 2011, Herbert J. Bernstein wrote:

> Dear Colleagues,
>  I support both of Simon's suggestions:
>  1. Add the elides of P-prime to the strings delimited by
> the single quote and the strings delimited by the double quote;
> and
>  2. Review all currently proposed changes to ensure things have
> not become "messy"
> To help in understanding P-prime and Simon's first suggestion, and
> thereby to help in executing Simon's second suggestion, here
> is where to find the Python 2.7 lexical analysis and elides:
> http://docs.python.org/reference/lexical_analysis.html
> Please note a very important difference between the Python semantics
> and those of C:
> "Unlike Standard C, all unrecognized escape sequences are left in the string
> unchanged, i.e., the backslash is left in the string. (This behavior is
> useful when debugging: if an escape sequence is mistyped, the resulting
> output is more easily recognized as broken.)"
> So, under P-prime and Simon's proposal 1, the full list of recognized
> elides is:
> \newline                ignored
> \\                      backslash
> \'                      single quote
> \"                      double quote
> \a                      ASCII Bell (BEL)
> \b                      ASCII Backspace (BS)
> \f                      ASCII Formfeed (FF)
> \n                      ASCII Linefeed (LF)
> \r                      ASCII Carriage Return (CR)
> \t                      ASCII Horizontal Tab (TAB)
> \v                      ASCII Vertical Tab (VT)
> \ooo                    Character with octal value ooo (1-3 octal digits)
> \xhh                    Character with hex value hh (2 hex digits)
> Note that hexadecimal and octal escapes denote the byte with the given value; 
> it is not necessary that the byte encodes a character in the source character 
> set.
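
The list above can be sketched as a small decoder. This is an
illustrative sketch, not the Python tokenizer: decode_elides is a
hypothetical name, byte values are represented as code points 0-255,
octal values are masked to one byte, and a malformed \x elide is left
unchanged. The last three choices are assumptions made for the sketch.

```python
import re

# Single-character elides from the list above.
_SIMPLE = {
    "\n": "",      # backslash-newline is ignored
    "\\": "\\",
    "'": "'",
    '"': '"',
    "a": "\a",
    "b": "\b",
    "f": "\f",
    "n": "\n",
    "r": "\r",
    "t": "\t",
    "v": "\v",
}

# Alternatives are tried in order: octal (1-3 digits), \xhh, then any one character.
_ELIDE = re.compile(r"\\([0-7]{1,3}|x[0-9A-Fa-f]{2}|.)", re.DOTALL)

def decode_elides(text):
    """Expand recognized elides; leave unrecognized ones unchanged."""
    def replace(match):
        body = match.group(1)
        if body[0] in "01234567":
            # Assumption: mask to a single byte value.
            return chr(int(body, 8) & 0xFF)
        if body[0] == "x" and len(body) == 3:
            return chr(int(body[1:], 16))
        # Unknown elides fall through with the backslash kept.
        return _SIMPLE.get(body, match.group(0))
    return _ELIDE.sub(replace, text)

print(repr(decode_elides(r"\q stays, \t expands")))
```

The point to notice is the final fallback: an unrecognized sequence
such as \q keeps its backslash rather than being dropped as in C.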
> In deference to Simon's second suggestion, please note that this differs from 
> Python 3 handling of un-prefixed treble quotes in 2 ways:
> 1.  Python 3 adds \N{name} referencing names in the Unicode database, as well 
> as adding \uxxxx and \Uxxxxxxxx giving hex values for unicode code points
> 2.  The hexadecimal and octal escapes encode the unicode character at
> the code point.
> I suggest we stay with the 2.7 version.
>  Regards,
>    Herbert
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
>   Dowling College, Kramer Science Center, KSC 121
>        Idle Hour Blvd, Oakdale, NY, 11769
>                 +1-631-244-3035
>                 yaya@dowling.edu
> =====================================================
> On Fri, 25 Feb 2011, SIMON WESTRIP wrote:
>> If there is acceptance of P', logic suggests that the same
>> approach should be taken towards the single-quoted strings,
>> e.g. a user might question:
>> I can do this """C\"""", so why can't I do this: "C\"" ?
>> This would then just leave the semicolon-delimited fields as
>> the means to store 'raw' strings.
>> This may be 'maximally disruptive', but CIF2 is already distinct
>> from CIF1 and will require conversion of e.g. "C\"" in any case
>> (the latter is valid CIF1).
>> Basically, I worry that the compromise is starting to make CIF look
>> a bit 'messy'; perhaps all the changes should be reviewed...
>> Cheers
>> Simon
>> ____________________________________________________________________________
>> From: James Hester <jamesrhester@gmail.com>
>> To: ddlm-group <ddlm-group@iucr.org>
>> Sent: Friday, 25 February, 2011 2:50:59
>> Subject: [ddlm-group] Searching for a compromise on eliding
>> Dear DDLm-group,
>> I think we have all had a decent chance to argue our case for
>> Proposals P, F and F'.  I have also been in small side discussions
>> with Ralf and John W.  Their points of view can be summarised as
>> follows:
>> (i) Behaviour of triple-quoted strings will be too confusing unless
>> Python behaviour is followed (Ralf)
>> (ii) There is considerable criticism of CIF in the macromolecular
>> community because of idiosyncratic behaviour, particularly concerning
>> quoting.  We should therefore stick to accepted standards as much as
>> possible (John W)
>> For John W and Ralf these points outweigh any of the disadvantages of
>> Proposal P, and so Proposal P remains their first choice.  Proposal P
>> is therefore the first choice of 3 out of 5 COMCIFS voters, and the
>> last choice of the other two (I would rank it worse than doing
>> nothing, actually).  I note that non-voting members are uniformly
>> opposed to Proposal P.
>> I therefore want to try to seek some common middle ground in the hope
>> that I can find a proposal that could be at least as acceptable as
>> Proposal P to Ralf and/or Herbert and/or John W.
>> Consider the following four new proposals - P-prime, Q, G and null:
>> * Proposal P-prime: triple-quoted strings are treated as for Python
>> 2.7.  No Unicode or raw strings are defined (ie no strings starting
>> u""" or r""").
>> I interpret John W and Ralf's position to be that they would be able
>> to support this proposal as the preferred choice, as our syntax would
>> still be entirely consistent with Python.  This proposal is a
>> considerable improvement on Proposal P, because the dangers of raw
>> strings are taken out of the equation, and the Unicode database is no
>> longer a dependency.  We are still left with a whole bunch of (frankly
>> pointless) elides, leading to Proposal Q:
>> * Proposal Q: As for Proposal P-prime, with the following changes:
>> (1) Only <backslash><delimiter> and <backslash><backslash> when it
>> precedes <backslash><delimiter> are recognised escape sequences at the
>> syntactical level
>> (2) A DDLm string type, e.g. "CText", is defined in com_val.dic for
>> which the remaining escape sequences have the meaning assigned to them
>> by the Python 2.7 standard.  mmCIF and related domains can standardise
>> their definitions on this string type and derivatives, making the
>> above division between syntax and dictionary invisible to users and
>> programmers in their domain.
>> * Proposal G: Proposal F', but with a different delimiter
>> Ralf has indicated that he actually thinks Proposal F' is best, but
>> only if the delimiters are not going to be confused with Python
>> delimiters.  I interpret John W's position to be that he would not
>> support such a change in delimiters as that would simply make CIF even
>> more idiosyncratic.  Anyway, any such replacement delimiter would need
>> to be multi-character, easy to type and unlikely to occur as the first
>> characters in CIF1 datavalues.  We would also need to reduce the
>> characterset of non-delimited CIF2 strings to exclude any such
>> delimiters.  Ideas?
>> * Null proposal: do nothing as we can't agree
>> I think I could support Proposal Q as an acceptable fallback from F',
>> and if somebody can find sensible delimiters for Proposal G that works
>> for me as well.  The preferred treatment for backslash-rich text for
>> Proposals P, P' and Q will necessarily be semicolon-delimited strings.
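
As a rough sketch of rule (1) of Proposal Q above, the following shows
one possible reading: a doubled backslash is an elide only when it
directly precedes backslash-delimiter, and every other backslash
sequence is passed through untouched. The function name unescape_q,
the single-character delimiter, and this reading of the rule are all
assumptions made for illustration.

```python
import re

def unescape_q(text, delimiter='"'):
    """Apply only the two syntax-level elides of Proposal Q, rule (1)."""
    d = re.escape(delimiter)
    # Either a doubled backslash directly before backslash-delimiter,
    # or backslash-delimiter itself. Everything else is left untouched.
    pattern = re.compile(r"\\\\(?=\\" + d + r")|\\" + d)

    def replace(match):
        return "\\" if match.group(0) == "\\\\" else delimiter

    return pattern.sub(replace, text)

print(unescape_q(r'C\"'))
print(unescape_q(r'a\\b\n'))
```

Under this reading, sequences such as \n survive the syntax level and
would only gain meaning through a dictionary-defined string type as in
rule (2).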
>> James.
>> --
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group