Re: [ddlm-group] Use of elides in strings

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] Use of elides in strings
From: John Westbrook <[email protected]>
Date: Thu, 19 Nov 2009 07:51:03 -0500
In-Reply-To: <[email protected]>
References: <[email protected]> <C72B1C8D.124EF%[email protected]> <[email protected]><[email protected]>
Hi all,

In a previous posting I registered the PDB's preference to leave the interpretation
of elided characters to the application.  I am not presently aware of cases
from our work where it would be useful to introduce handling of elides within
the CIF syntax.  I am particularly worried about any lexical interpretation
of '\' that may interfere with its use in regular expressions, on which we
have a significant dependency in our dictionaries.  I continue to vote that the
particular interpretation of '\' and other special characters be defined
by the dictionary definition of the item.  I think this provides the greatest
flexibility for all applications.
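
To make the regular expression concern concrete, here is a minimal Python
sketch. It assumes a hypothetical parser policy in which every eliding
backslash is removed and the character after it is kept; the function name
strip_first_backslash and the sample pattern are illustrative only and are
not taken from any CIF software or dictionary.

```python
import re


def strip_first_backslash(text: str) -> str:
    """Hypothetical policy: drop each eliding backslash, keep the next character."""
    out = []
    i = 0
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text):
            out.append(text[i + 1])  # keep only the elided character
            i += 2
        else:
            out.append(text[i])      # ordinary character (or a trailing backslash)
            i += 1
    return "".join(out)


# An illustrative regular expression as it might be written in a dictionary file.
on_disk = r"\d+\.\d+"
delivered = strip_first_backslash(on_disk)

print(delivered)                              # d+.d+
print(bool(re.fullmatch(on_disk, "12.5")))    # True
print(bool(re.fullmatch(delivered, "12.5")))  # False
```

Under that assumed policy the application receives a different pattern from
the one the dictionary author wrote, unless every backslash in the file is
doubled.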

Regards,

John


Herbert J. Bernstein wrote:
> Dear Colleagues,
> 
>   My personal preference would be to keep what seems to me the
> simpler approach of passing all reverse solidus glyphs to the application.
> However, the pragmatics of achieving a consensus and getting on with coding
> are more important than my personal taste.
> 
>   The major impact of a change in the handling of the reverse solidus,
> with some of them being absorbed by the CIF2 parsers, would be in the
> handling of legacy CIFs at the IUCr and at the PDB.  James is right
> that what we are discussing is the difference between raw and cooked
> Python strings.  Inasmuch as CIF2 is now going to forbid the use of
> quote marks within non-delimited strings, in order to make the
> conversion of legacy CIFs from CIF1 to CIF2 as easy as possible,
> may I suggest that we adopt both cooked and raw quoted strings
> from Python, so that r" and r' can be used to introduce any raw,
> unconverted string taken from a CIF1, in which almost all existing
> CIF1 reverse solidus behavior could be left untouched, and that
> we accept James's cooked approach for quoted strings not marked with
> the r' or r".
> 
>   What say the IUCr journal operation and the PDB?  It is their ox we 
> are goring here.
> 
>   Regards,
>     Herbert
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
> 
>                  +1-631-244-3035
>                  [email protected]
> =====================================================
> 
> On Thu, 19 Nov 2009, James Hester wrote:
> 
>> OK, fair enough.  Just to clarify, I am not advocating the full
>> repertoire of backslash elides, only two specific ones:
>> <backslash><terminator> and <backslash><backslash>.  Any other use of
>> backslash would simply leave that backslash untouched.
>>
>> Would suggesting that the cut-and-pasters restrict themselves to
>> semicolon-delimited strings or triple-quote delimited strings help
>> with legacy issues?
>>
>> Anyway, let us await the opinions of our Western Hemisphere colleagues...
>>
>> On Thu, Nov 19, 2009 at 7:02 PM, Nick Spadaccini 
>> <[email protected]> wrote:
>>>
>>>
>>>
>>> On 19/11/09 12:58 PM, "James Hester" <[email protected]> wrote:
>>>
>>>> We need to figure out the behaviour of elides.  This was previously
>>>> discussed in a thread entitled "The alphabet of non-delimited
>>>> strings", especially in messages around Oct 16th.  The behaviour
>>>> advocated by Nick is for both the eliding and elided character to be
>>>> returned from the parser.  The behaviour I would prefer is for the
>>>> eliding character to disappear; it should itself be elided if it is to
>>>> remain in the string.
>>>>
>>>> To summarize Nick's and Herbert's arguments from the emails dated Fri
>>>> Oct 16, 2009 at 6:22AM and subsequently
>>>>
>>>> 1. We don't interpret elides because we don't know what algorithm to
>>>> use (i.e. it might be a greek character sequence)
>>>>
>>>> 2. The elide simply signals that the lexer should not interpret the
>>>> following character
>>>>
>>>> My counter-proposal is similar to Simon's original expectation: if the
>>>> elide character is really eliding a syntactically significant
>>>> character (i.e. a terminator character or an elide character), the
>>>> elide sequence is replaced by the single character.  I counter the
>>>> above arguments as follows:
>>>>
>>>> (a) The profusion of algorithms for backslash processing is
>>>> irrelevant. We can interpret the elides because the only algorithm
>>>> that has any relevance at the parser level is the simple
>>>> <backslash><character> -> <character>.  All other potential uses
>>>> belong to higher levels.  If the higher levels require a
>>>> <backslash><quote>, that is created by writing
>>>> <backslash><backslash><backslash><quote> in the on-disk string.
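
The two-elide rule described in this message can be sketched in Python as
follows. This is only an illustration under stated assumptions: the function
name unelide_minimal and its terminator parameter are hypothetical, and the
sketch treats a single character as the string terminator.

```python
def unelide_minimal(raw: str, terminator: str = '"') -> str:
    """Replace <backslash><terminator> and <backslash><backslash> by the
    second character; leave every other backslash untouched."""
    out = []
    i = 0
    while i < len(raw):
        if (
            raw[i] == "\\"
            and i + 1 < len(raw)
            and raw[i + 1] in (terminator, "\\")
        ):
            out.append(raw[i + 1])  # the elide sequence becomes one character
            i += 2
        else:
            out.append(raw[i])      # any other character passes through as is
            i += 1
    return "".join(out)


# <backslash><backslash><backslash><quote> on disk ...
print(unelide_minimal(r'\\\"'))        # \"
# ... while a backslash before an ordinary character is not touched.
print(unelide_minimal(r"\a"))          # \a
print(unelide_minimal(r'say \"hi\"'))  # say "hi"
```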
>>>
>>> Couldn't agree with you more, and you are preaching to the converted, who
>>> was converted away by others. This is what I was arguing months ago for how
>>> to interpret the """ strings. That is, \n (EXPLICITLY THE ASCII REVERSE
>>> SOLIDUS) is always a newline, \t is always a tab, etc. The parser should
>>> always substitute the single binary character for these character doublets,
>>> à la unix/python/C etc. And you quite rightly argue that if you want \n to
>>> really mean the IUCr Greek nu then it will have to be \\n, and the same
>>> parser will give the downstream application \n (having removed the leading
>>> elide). Beautiful, that's what the computer scientist in me argues.
>>>
>>> However, others argued that many users vim/emacs the file and cut and paste
>>> the text content. So if you have a LaTeX string "{\\em I am italicised}"
>>> that you cut and paste, then it fails.  And the blasted backward
>>> compatibility argument comes in with existing CIF1 files that are not
>>> doubly elided.
>>>
>>> What we can do is push the idea that a CIF2 string is a COMPLETELY different
>>> beast to a CIF1 string. We know that with CIF1 data names and data values we
>>> have to push our CIF2 parser into a different grammar to handle things
>>> correctly. At that level elides in a string will have a strict CIF1 meaning
>>> (i.e. IUCr Greek markup).
>>>
>>> In CIF2 an elide in a string protects the following character from being
>>> interpreted as a delimiter. There is special meaning for \n, \t etc., which
>>> are replaced by their single character. \u123456 (up to 6 hex digits)
>>> indicates a Unicode character, which should be replaced by the correct byte
>>> sequence. Every other first reverse solidus should be removed, and the
>>> immediately following character passed on as part of the string. Characters
>>> can be (multibyte) UTF-8.
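
The fuller "cooked" rule in the paragraph above could look roughly like this
in Python. It is a sketch, not a proposed implementation: the name cook is
hypothetical, only \n and \t are handled out of the "etc." set, and the
\u escape is assumed to take greedily as many hex digits as are present, up
to six (so a hex letter that follows the intended code would be swallowed).

```python
import re

# Assumed subset of the single-character escapes mentioned in the text.
SIMPLE = {"n": "\n", "t": "\t"}

# A backslash followed by either u + 1-6 hex digits, or any one character.
ELIDE = re.compile(r"\\(u[0-9A-Fa-f]{1,6}|.)", re.DOTALL)


def cook(raw: str) -> str:
    """Apply the sketched CIF2 elide rule to the on-disk text of a string."""

    def repl(match: re.Match) -> str:
        body = match.group(1)
        if len(body) > 1:
            # \u escape; chr() raises ValueError above U+10FFFF.
            return chr(int(body[1:], 16))
        # Known escape, or else just the elided character itself.
        return SIMPLE.get(body, body)

    return ELIDE.sub(repl, raw)


print(repr(cook(r"a\tb")))     # 'a\tb'  (a real tab)
print(cook(r"\\n"))            # \n      (backslash followed by n)
print(cook(r"\u3b1"))          # α
print(cook(r"{\\em text}"))    # {\em text}
print(cook(r"{\em text}"))     # {em text}
```

The last two lines show the cut-and-paste cost discussed in this message:
LaTeX markup survives only if its backslashes are doubled in the file.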
>>>
>>> If you want to encode LaTeX (or IUCr-speak or something similar) then you
>>> are going to have to double all your reverse solidi. You can't cut and paste
>>> from an editor - bad luck.
>>>
>>> I will wait for Herb's response to this because he was an advocate of
>>> leaving things as they were (I think). I am happy to move forward with your
>>> suggested interpretation.
>>>
>>>> (b) The profusion of algorithms for backslash processing means that
>>>> we *must* remove ambiguity by removing the eliding character during
>>>> processing; otherwise, an application can't tell if it is e.g. looking
>>>> at an escaped prime or an acute accent without applying ugly
>>>> heuristics.  Note also that a caller of a CIF reading program doesn't
>>>> currently need to know what the particular string delimiting character
>>>> was for a given string value; in order to make a guess at what
>>>> the backslash might mean, it would often need to know this.
>>>>
>>>> It appears that Nick is describing Python raw string behaviour,
>>>> and I am describing Python 'cooked' string behaviour.  Note the
>>>> following paragraph from
>>>> docs.python.org/reference/lexical_analysis.html#strings:
>>>>
>>>> When an 'r' or 'R' prefix is present, a character following a
>>>> backslash is included in the string without change, and all
>>>> backslashes are left in the string. For example, the string
>>>> literal r"\n" consists of two characters: a backslash and a
>>>> lowercase 'n'. String quotes can be escaped with a backslash,
>>>> but the backslash remains in the string; for example, r"\"" is
>>>> a valid string literal consisting of two characters: a
>>>> backslash and a double quote; r"\" is not a valid string
>>>> literal (even a raw string cannot end in an odd number of
>>>> backslashes). Specifically, a raw string cannot end in a
>>>> single backslash (since the backslash would escape the
>>>> following quote character). Note also that a single backslash
>>>> followed by a newline is interpreted as those two characters
>>>> as part of the string, not as a line continuation.
>>>>
>>>> Note that raw strings cannot end in a backslash, so I would consider
>>>> them slightly less expressive than cooked strings, which can express
>>>> everything.
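
The Python behaviour quoted above can be checked directly. The short sketch
below uses only built-in Python; the helper name compiles is introduced here
purely for illustration.

```python
raw = r"\n"            # two characters: backslash, 'n'
cooked = "\n"          # one character: newline
raw_quote = r"\""      # two characters: backslash, double quote
cooked_tail = "abc\\"  # a cooked string can end in a single backslash


def compiles(source: str) -> bool:
    """Return True if the given source text is a valid Python expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


print(len(raw), len(cooked), len(raw_quote))  # 2 1 2
print(compiles('r"\\"'))                      # False: the source text is r"\"
print(cooked_tail[-1] == "\\")                # True
```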
>>>>
>>>> I would challenge Nick et al. to explain what the advantage
>>>> of keeping the eliding character in the datavalue is, keeping in mind
>>>> that programs like CIFtbx and PyCIFRW and several others aim to hide
>>>> CIF syntax from their users (as a service), and this proposal appears
>>>> to want to expose a confusing part of it to them.  Some questions we
>>>
>>> The original "advantage" (if you could call it that) was to keep others
>>> happy and to support backwards compatibility.
>>>
>>>> toolbox maintainers will need to ask if this goes through: Do you
>>>> handle escaping any strings passed to you for output?  How do you know
>>>> if the caller has done the escaping already, or not?  Do you really expect
>>>> the calling software to work out whether it wants a single or double
>>>> or triple quote delimited string?  Isn't that the service provided by
>>>> your software?  What are they (not) paying you for, anyway?
>>>
>>> When they pay, I'll answer that question!
>>>
>>> cheers
>>>
>>> Nick
>>>
>>> --------------------------------
>>> Associate Professor N. Spadaccini, PhD
>>> School of Computer Science & Software Engineering
>>>
>>> The University of Western Australia    t: +61 (0)8 6488 3452
>>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>>> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
>>> MBDP  M002
>>>
>>> CRICOS Provider Code: 00126G
>>>
>>> e: [email protected]
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> ddlm-group mailing list
>>> [email protected]
>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>
>>
>>
>>
>> -- 
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>>
> 