# Re: [ddlm-group] Use of elides in strings

Hi Nick: you have described exactly what I would like to see.  I am
currently working on some way of taking into account John's issues and
will hopefully post something later on today.

On Fri, Nov 20, 2009 at 3:55 PM, Nick Spadaccini <nick@csse.uwa.edu.au> wrote:
> The general sentiment is to pass on the string as raw.
>
> Herb's suggestion of raw and cooked can't be done at a lexical level as it
> is with Python. r"..." u"..." are syntactically incorrect in STAR. It can
> happen at a dictionary level where a type could be declared as a raw string
> or a cooked string. Probably better still is that such data items should
> have an associated data item that indicates what the mark up is and how to
> handle it. This will handle a multitude of markups.
>
> James' original email didn't address the whole hog approach, but focussed
> only on what to do with \", \' and \\. I think James is suggesting that if
> a user wants to store the string abc"def as a double-quote string then a
> user will want the application to deal with putting it out as "abc\"def"
>
> The same user will want the application to deliver on a read the string as
> abc"def.
>
> I think this is a reasonable expectation, and we should specify this
> behaviour in a CIF2 parser - THIS IS A PROPOSAL.
>
> But James goes further to discuss \\, which I didn't think was necessary at
> first and neither does Simon. But I can see pathological cases where it is
> problematic. Say the string is abc\"def.
>
> The above algorithm will create "abc\\"def", but on reading according to the
> current draft specification it should probably say \\ -> \, then a " ->
> terminate token. An alternative is we need to create "abc\\\"def" but now we
> have to duplicate all elides. Doing nothing and creating "abc\"def" will
> parse, but return abc"def - incorrect.
>
> But all of this can be solved simply by revisiting the wording of the
> specification. I have said that a \ protects the next character and it
> should be ignored as far as parsing is concerned. But since the contents are
> raw, it should be (eg)
>
> For a "" delimited string a \ has no meaning unless it immediately precedes
> a ", in which case the " is ignored as a token terminator. Hence given the
> above algorithm the string abc\"def is encoded as "abc\\"def" by the parser.
> When it is read the parsing process is first \ precedes a \, hence has no
> meaning (pass it through). Second \ precedes a ", thus this is an elide,
> drop that reverse solidus and pass through the " as a legitimate character.
>
> You end up deriving abc\"def as required.
>
> Have I missed anything?
>
> SUMMARISING.
>
> (a) The contents of delimited strings are returned as raw, with the token
> delimiters removed.
> (b) Where a delimiter character is to be part of the string, that character
> must be preceded by a reverse solidus when written out to the file. When
> read, any reverse solidus preceding a terminating character is deleted.
> (c) It is the responsibility of the writing and reading application to
> insert and remove the reverse solidus preceding the terminating character.
> (d) Otherwise the presence of a reverse solidus in the string has no
> meaning.
>
> Does this cover all bases?
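
Rules (a) to (d) above can be sketched in Python roughly as follows. This is only an illustration and not text from the draft specification: the names write_quoted and read_quoted are made up for the example, and only a single-character delimiter is handled.

```python
def write_quoted(value: str, delim: str = '"') -> str:
    """Rule (b), writing: put a reverse solidus before each delimiter character."""
    return delim + value.replace(delim, "\\" + delim) + delim


def read_quoted(text: str, delim: str = '"') -> str:
    """Rules (a), (b) and (d), reading: scan one token that starts at text[0]."""
    if not text.startswith(delim):
        raise ValueError("token does not start with the delimiter")
    out = []
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] == delim:
            out.append(delim)      # elide: drop the reverse solidus, keep the delimiter
            i += 2
        elif ch == delim:
            return "".join(out)    # an unprotected delimiter terminates the token
        else:
            out.append(ch)         # any other reverse solidus is passed through raw
            i += 1
    raise ValueError("unterminated string")


# The two examples from the message: abc"def and abc\"def
assert read_quoted(write_quoted('abc"def')) == 'abc"def'
assert read_quoted(write_quoted('abc\\"def')) == 'abc\\"def'
```

One case this sketch does not round-trip is a value that ends in a reverse solidus: write_quoted emits "abc\" for abc\, and read_quoted then treats the closing quote as elided and reports an unterminated string.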
>
>
> On 20/11/09 4:57 AM, "SIMON WESTRIP" <simonwestrip@btinternet.com> wrote:
>
>> Dear all
>>
>> Haven't caught up with all the recent discussions yet, but hopefully have
>> identified
>> the following views appropriately:
>>
>> 1) Nick's proposal (preference):
>>
>> "In CIF2 an elide in a string protects the following character from being
>>  interpreted as a delimiter.
>>
>> There is special meaning for \n, \t etc  which
>>  are replaced by their single character.
>>
>> \u123456 (up to 6 hex numbers)
>>  indicate a unicode character which should be replaced by the correct byte
>>  sequence.
>>
>> All other first reverse solidus should be removed, and the
>>  immediately following character passed on as part of the string.
>>
>> Characters can be (multibyte) UTF-8.
>> "
>>
>> SPW: Though the logic of this is unquestionable (from a programmer's
>> perspective),
>> I think this might be too disruptive. Though CIF2 promises interpretable
>> content to
>> enhance data processing, CIF is also an archiving format. I believe that
>> restrictions on
>> the content of a data value should be minimal, governed by necessity
>> (e.g. restrictions to avoid delimiter conflicts), rather than restricting the
>> character set of the
>> content to facilitate parsing or interpretability by any particular programming
>> language.
>> On the one hand CIF2 promises to be a more flexible archiving format by
>> extending its character
>> set, while on the other hand it could become more restrictive by requiring
>> that every reverse solidus
>> has to be 'doubled-up' in a data value.
>>
>> Granted, there are strong arguments that people will decreasingly need to
>> interact with a CIF
>> in its raw form so extra complexities of syntax are not too much of a problem,
>> but as many have pointed out,
>> they still will read/edit raw CIFs, and may well have no alternative on
>> occasion
>> (for example, the IUCr will shortly be requiring authors to include
>> refinement-software instruction
>> listings in their CIFs, which will need to be included 'as is' within the
>> restrictions of the data value delimiters
>> and line lengths, purely for review purposes and only available in their raw
>> form in the CIF)
>>
>> So on a fundamental level, I don't see that \n, \t, ... need to be reserved as
>> special within a data value,
>> nor \u123456. Definition of special meanings for these can be handled at a
>> higher level? Equally, unless the
>> reverse solidus escapes a delimiter character within the context of the
>> identified
>> opening delimiter, I don't see why it should be discarded by a parser.
>>
>> 2) James' proposal:
>>
>> "backslash elides, only two specific ones:
>>
>>  <backslash><terminator> and <backslash><backslash>.
>>
>> Any other use of
>>  backslash would simply leave that backslash untouched.
>> "
>>
>> SPW: tend to agree with this (see above), but why escape a backslash when they
>> will be untouched anyway if they're not
>> followed by a terminator?
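
A minimal sketch of the two-elide scheme in (2), assuming a single-character terminator; the names write_cooked and read_cooked are hypothetical and exist only for this illustration. In this sketch, doubling the reverse solidus on output is what lets a value that ends in a reverse solidus survive a round trip.

```python
def write_cooked(value: str, term: str = '"') -> str:
    """Double every reverse solidus first, then protect the terminator."""
    body = value.replace("\\", "\\\\").replace(term, "\\" + term)
    return term + body + term


def read_cooked(text: str, term: str = '"') -> str:
    """Recognise only <backslash><terminator> and <backslash><backslash>."""
    if not text.startswith(term):
        raise ValueError("token does not start with the terminator")
    out = []
    i = 1
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == "\\" and nxt in ("\\", term):
            out.append(nxt)        # the elide sequence becomes the single character
            i += 2
        elif ch == term:
            return "".join(out)    # an unprotected terminator ends the token
        else:
            out.append(ch)         # any other backslash is left untouched
            i += 1
    raise ValueError("unterminated string")


# Round trips, including a value that ends in a reverse solidus
for value in ['abc"def', 'abc\\"def', 'abc\\', '{\\em I am italicised}']:
    assert read_cooked(write_cooked(value)) == value
```

The cost is the one already raised in this thread: text pasted in by hand has to have its reverse solidi doubled before it reads back unchanged.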
>>
>> 3) Herbert's proposal:
>>
>> "may I suggest that we adopt both cooked and raw quoted strings
>> from python, so that r"  and r' can be used to introduce any raw,
>> unconverted string taken from a CIF1 in which almost all existing
>> CIF1 reverse solidus behavior could be left untouched, and that
>> we accept James' cooked approach for quoted strings not marked with
>> the r' or r".
>> "
>>
>> SPW: could be a neat solution for backward-compatibility, but with more
>> complexity comes the potential for more errors?
>> Also, what about r; (assuming we're not just talking about quoted strings)?
>>
>>
>> So if it's not possible to allow context-sensitive handling of elides (escaping
>> a delimiter if the value is delimited by the same delimiter),
>> then I find myself supporting Nick's earlier conclusion (a month back) that
>> all elides will be returned at the parser level for
>> the application to deal with (THREAD 3)? If either of these approaches is
>> considered unsatisfactory, then 'go the whole hog' and adopt
>> the familiar 'programming syntax' treatment of elides as described by Nick.
>>
>> Cheers
>>
>> Simon
>>
>> PS usual disclaimer that these aren't necessarily the IUCr's views
>>
>>
>> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>> Sent: Thursday, 19 November, 2009 11:55:37
>> Subject: Re: [ddlm-group] Use of elides in strings
>>
>> Dear Colleagues,
>>
>>    My personal preference would be to leave things in what to me seems the
>> simpler approach of passing all reverse solidus glyphs to the application.
>> However, the pragmatics of achieving a consensus and getting on with coding
>> are more important than my personal taste.
>>
>>    The major impact of a change in the handling of the reverse solidus in
>> having some of them absorbed by the CIF2 parsers would be in the
>> handling of legacy CIFs at the IUCr and at the PDB.  James is right
>> that what we are discussing is the difference between raw and cooked
>> python strings.  Inasmuch as CIF2 is now going to forbid the use of
>> quote marks within non-delimited strings, in order to make the
>> conversion of legacy CIFs from CIF1 to CIF2 as easy as possible,
>> may I suggest that we adopt both cooked and raw quoted strings
>> from python, so that r"  and r' can be used to introduce any raw,
>> unconverted string taken from a CIF1 in which almost all existing
>> CIF1 reverse solidus behavior could be left untouched, and that
>> we accept James' cooked approach for quoted strings not marked with
>> the r' or r".
>>
>>    What say the IUCr journal operation and the PDB?  It is their ox we are
>> goring here.
>>
>>    Regards,
>>      Herbert
>> =====================================================
>>   Herbert J. Bernstein, Professor of Computer Science
>>     Dowling College, Kramer Science Center, KSC 121
>>          Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                   +1-631-244-3035
>>                   yaya@dowling.edu
>> =====================================================
>>
>> On Thu, 19 Nov 2009, James Hester wrote:
>>
>>> OK, fair enough.  Just to clarify, I am not advocating the full
>>> repertoire of backslash elides, only two specific ones:
>>> <backslash><terminator> and <backslash><backslash>.  Any other use of
>>> backslash would simply leave that backslash untouched.
>>>
>>> Would suggesting that the cut-and-pasters restrict themselves to
>>> semicolon-delimited strings or triple-quote delimited strings help
>>> with legacy issues?
>>>
>>> Anyway, let us await the opinions of our Western Hemisphere colleagues...
>>>
>>> On Thu, Nov 19, 2009 at 7:02 PM, Nick Spadaccini <nick@csse.uwa.edu.au>
>>> wrote:
>>>>
>>>>
>>>>
>>>> On 19/11/09 12:58 PM, "James Hester" <jamesrhester@gmail.com> wrote:
>>>>
>>>>> We need to figure out the behaviour of elides.  This was previously
>>>>> discussed in a thread entitled "The alphabet of non-delimited
>>>>> strings", especially in messages around Oct 16th.  The behaviour
>>>>> advocated by Nick is for both the eliding and elided character to be
>>>>> returned from the parser.  The behaviour I would prefer is for the
>>>>> eliding character to disappear; it should itself be elided if it is to
>>>>> remain in the string.
>>>>>
>>>>> To summarize Nick's and Herbert's arguments from the emails dated Fri
>>>>> Oct 16, 2009 at 6:22AM and subsequently
>>>>>
>>>>> 1. We don't interpret elides because we don't know what algorithm to
>>>>> use (i.e. it might be a greek character sequence)
>>>>>
>>>>> 2. The elide simply signals that the lexer should not interpret the
>>>>> following character
>>>>>
>>>>> My counter-proposal is similar to Simon's original expectation: if the
>>>>> elide character is really eliding a syntactically significant
>>>>> character (i.e. a terminator character or an elide character), the
>>>>> elide sequence is replaced by the single character.  I counter the
>>>>> above arguments as follows:
>>>>>
>>>>> (a) The profusion of algorithms for backslash processing is
>>>>> irrelevant. We can interpret the elides because the only algorithm
>>>>> that has any relevance at the parser level is the simple
>>>>> <backslash><character> -> <character>.  All other potential uses
>>>>> belong to higher levels.  If the higher levels require a
>>>>> <backslash><quote>, that is created by writing
>>>>> <backslash><backslash><backslash><quote> in the on-disk string.
>>>>
>>>> Couldn't agree with you more, and you are preaching to the converted who
>>>> were converted away by others. This is what I was arguing months ago for how
>>>> to interpret the """ strings. That is \n (EXPLICITLY THE ASCII REVERSE
>>>> SOLIDUS) is always a newline, \t is always a tab etc. The parser should
>>>> always substitute the single binary character for these character doublets
>>>> à la Unix/Python/C etc. And you quite rightly argue if you want \n to really
>>>> mean the IUCr Greek nu then it will have to be \\n, and the same parser will
>>>> give the downstream application \n (having removed the leading elide).
>>>> Beautiful, that's what the computer scientist in me argues.
>>>>
>>>> However others argued that many users vim/emacs the file and cut and paste
>>>> the text content. So if you have a LaTeX string "{\\em I am italicised}"
>>>> that you cut and paste then it fails.  And the blasted backward
>>>> compatibility argument comes in with existing CIF1 files that are not doubly
>>>> elided.
>>>>
>>>> What we can do is push the idea that a CIF2 string is a COMPLETELY different
>>>> beast to a CIF1 string. We know that with CIF1 data names and data values we
>>>> have to push our CIF2 parser in to a different grammar to handle things
>>>> correctly. At that level elides in a string will have a strict CIF1 meaning
>>>> (ie IUCr Greek markup).
>>>>
>>>> In CIF2 an elide in a string protects the following character from being
>>>> interpreted as a delimiter. There is special meaning for \n, \t etc  which
>>>> are replaced by their single character. \u123456 (up to 6 hex numbers)
>>>> indicate a unicode character which should be replaced by the correct byte
>>>> sequence. All other first reverse solidus should be removed, and the
>>>> immediately following character passed on as part of the string. Characters
>>>> can be (multibyte) UTF-8.
>>>>
>>>> If you want to encode LaTeX (or IUCr-speak or something similar) then you
>>>> are going to have to double all your reverse solidi. You can't cut and paste
>>>> from an editor - bad luck.
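
The "whole hog" treatment described above might look something like the sketch below. It assumes the delimiters have already been stripped, handles only \n and \t because the "etc." is not spelled out in the message, and reads \u greedily for up to 6 hex digits; the function name decode_elides is hypothetical.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")
SIMPLE = {"n": "\n", "t": "\t"}   # assumption: only these two of the "\n, \t etc." set


def decode_elides(body: str) -> str:
    """Decode the contents of a delimited string, delimiters already removed."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling reverse solidus at end of string")
        nxt = body[i + 1]
        if nxt in SIMPLE:
            out.append(SIMPLE[nxt])            # \n -> newline, \t -> tab
            i += 2
        elif nxt == "u":
            j = i + 2
            while j < len(body) and j < i + 8 and body[j] in HEX_DIGITS:
                j += 1                         # take at most 6 hex digits
            if j == i + 2:
                raise ValueError("\\u needs at least one hex digit")
            out.append(chr(int(body[i + 2:j], 16)))
            i = j
        else:
            out.append(nxt)                    # drop the elide, keep the next character
            i += 2
    return "".join(out)


# A doubled reverse solidus is needed to get LaTeX-style markup through
assert decode_elides("{\\\\em I am italicised}") == "{\\em I am italicised}"
assert decode_elides("\\\\n") == "\\n"         # on-disk \\n reads back as \n (two characters)
```

Reading the hex digits greedily is a choice made for this sketch; a variable-length \u followed directly by text that happens to start with a hex digit would be ambiguous, and the message does not say how that should be resolved.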
>>>>
>>>> I will wait for Herb's response to this because he was an advocate of
>>>> leaving things as they were (I think). I am happy to move forward with your
>>>> suggested interpretation.
>>>>
>>>>> (b) The profusion of algorithms for backslash processing means that
>>>>> we *must* remove ambiguity by removing the eliding character during
>>>>> processing; otherwise, an application can't tell if it is e.g. looking
>>>>> at an escaped prime or an acute accent without applying ugly
>>>>> heuristics.  Note also that a caller of a CIF reading program doesn't
>>>>> currently need to know what the particular string delimiting character
>>>>> was for a given string value; in order to make a guess at what
>>>>> the backslash might mean, it would often need to know this.
>>>>>
>>>>> It appears that Nick is describing Python raw string behaviour,
>>>>> and I am describing Python 'cooked' string behaviour.  Note the
>>>>> following paragraph from
>>>>> docs.python.org/reference/lexical_analysis.html#strings:
>>>>>
>>>>> When an 'r' or 'R' prefix is present, a character following a
>>>>> backslash is included in the string without change, and all
>>>>> backslashes are left in the string. For example, the string
>>>>> literal r"\n" consists of two characters: a backslash and a
>>>>> lowercase 'n'. String quotes can be escaped with a backslash,
>>>>> but the backslash remains in the string; for example, r"\"" is
>>>>> a valid string literal consisting of two characters: a
>>>>> backslash and a double quote; r"\" is not a valid string
>>>>> literal (even a raw string cannot end in an odd number of
>>>>> backslashes). Specifically, a raw string cannot end in a
>>>>> single backslash (since the backslash would escape the
>>>>> following quote character). Note also that a single backslash
>>>>> followed by a newline is interpreted as those two characters
>>>>> as part of the string, not as a line continuation.
>>>>>
>>>>> Note that raw strings cannot end in a backslash, so I would consider
>>>>> them slightly less expressive than cooked strings, which can express
>>>>> everything.
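
The behaviour described in the quoted paragraph can be checked directly in Python. The helper is_valid_literal is just a convenience written for this example; it uses the built-in compile() to test whether a piece of source text is accepted as an expression.

```python
raw_newline = r"\n"
assert len(raw_newline) == 2            # a backslash and a lowercase 'n'

raw_quote = r"\""
assert raw_quote == '\\"'               # the backslash stays in the string
assert len(raw_quote) == 2

cooked_newline = "\n"
assert len(cooked_newline) == 1         # a single newline character

cooked_trailing = "abc\\"
assert cooked_trailing.endswith("\\")   # a cooked string can end in a backslash
assert len(cooked_trailing) == 4


def is_valid_literal(source: str) -> bool:
    """Return True if `source` compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


assert not is_valid_literal('r"\\"')    # source text r"\" is rejected
assert is_valid_literal('r"\\\\"')      # source text r"\\" is accepted
```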
>>>>>
>>>>> I would challenge Nick et al. to explain what the advantage
>>>>> of keeping the eliding character in the datavalue is, keeping in mind
>>>>> that programs like CIFtbx and PyCIFRW and several others aim to hide
>>>>> CIF syntax from their users (as a service), and this proposal appears
>>>>> to want to expose a confusing part of it to them.  Some questions we
>>>>
>>>> The original "advantage" (if you could call it that) was to keep others
>>>> happy and to support backwards compatibility.
>>>>
>>>>> toolbox maintainers will need to ask if this goes through: Do you
>>>>> handle escaping any strings passed to you for output?  How do you know
>>>>> if the caller has done the escaping already, or not?  Do you really expect
>>>>> the calling software to work out whether it wants a single or double
>>>>> or triple quote delimited string?  Isn't that the service provided by
>>>>> your software?  What are they (not) paying you for, anyway?
>>>>
>>>> When they pay, I'll answer that question!
>>>>
>>>> cheers
>>>>
>>>> Nick
>>>>
>>>> --------------------------------
>>>> Associate Professor N. Spadaccini, PhD
>>>> School of Computer Science & Software Engineering
>>>>
>>>> The University of Western Australia    t: +61 (0)8 6488 3452
>>>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>>>> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
>>>> MBDP  M002
>>>>
>>>> CRICOS Provider Code: 00126G
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> ddlm-group mailing list
>>>> ddlm-group@iucr.org
>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>
>>>
>>>
>>>
>>> --
>>> T +61 (02) 9717 9907
>>> F +61 (02) 9717 3145
>>> M +61 (04) 0249 4148
>>>
>>
>>
>
> cheers
>
> Nick
>
>

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148