Re: [ddlm-group] Use of elides in strings

To: [email protected], Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] Use of elides in strings
From: SIMON WESTRIP <[email protected]>
Date: Fri, 20 Nov 2009 10:22:11 +0000 (GMT)
In-Reply-To: <C72C6177.1252A%[email protected]>
References: <C72C6177.1252A%[email protected]>

> SUMMARISING.
>
> (a) The contents of delimited strings are returned as raw, with the token
> delimiters removed.
> (b) Where a delimiter character is to be part of the string, that character
> must be preceded by a reverse solidus when written out to the file. When
> read, any reverse solidus preceding a terminating character is deleted.
> (c) It is the responsibility of the writing and reading application to
> insert and remove the reverse solidus preceding the terminating character.
> (d) Otherwise the presence of a reverse solidus in the string has no
> meaning.
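
Rules (a) to (d) above can be sketched in a few lines of Python. This is
only an illustration, not an agreed implementation: the names ELIDE,
write_value and read_value are made up here, and a single-character
delimiter is assumed (triple-quote and semicolon delimiters are not
covered). The sketch also does not resolve the case of a value that ends
in a reverse solidus, which the four rules leave open.

```python
ELIDE = "\\"  # the reverse solidus (hypothetical constant name)


def write_value(value: str, delimiter: str = '"') -> str:
    """Writing side of rules (b)/(c): elide each delimiter character, then wrap."""
    return delimiter + value.replace(delimiter, ELIDE + delimiter) + delimiter


def read_value(raw: str, delimiter: str = '"') -> str:
    """Reading side of rules (b)/(c).

    `raw` is the string content with the token delimiters already
    removed, as in rule (a).
    """
    # Only <elide><delimiter> is rewritten; every other reverse solidus
    # is left exactly as found, as in rule (d).
    return raw.replace(ELIDE + delimiter, delimiter)
```

Under these assumptions a LaTeX-like value such as {\em x} passes through
read_value unchanged, because its reverse solidus does not precede the
delimiter.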

Good, this is what I was hoping for and trying to exemplify back in THREAD 3

Cheers

Simon

From: Nick Spadaccini <[email protected]>
To: Group finalising DDLm and associated dictionaries <[email protected]>
Sent: Friday, 20 November, 2009 7:08:39
Subject: Re: [ddlm-group] Use of elides in strings

We are agreed then. When the northern hemisphere wakes up they will have
read a brilliant argument as to why it is Option 1.

As for *nix tools. Isn't that all broken the moment we go to UTF-8? You'll
grep/sed/awk the ascii characters easily enough but the result will be
packed with "junk". And I don't know if I can specify a UTF-8 character
easily to grep on.

If necessary it wouldn't take much to create a *nix shell tool such that you
would execute

cifparse | grep ...

And cifparse can do whatever is necessary to the file to "clean" it up.
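
As a rough sketch of what such a filter might do, the Python below
rewrites every non-ASCII character as an ASCII backslash escape so that
ASCII-only tools can match on it. No cifparse tool is defined in this
thread; the function names, and the choice of Python-style escapes as the
meaning of "clean", are assumptions made purely for illustration.

```python
import io
from typing import BinaryIO, TextIO


def clean_line(line: str) -> str:
    """Replace each non-ASCII character with a Python-style backslash escape."""
    return line.encode("ascii", "backslashreplace").decode("ascii")


def clean_stream(src: BinaryIO, dst: TextIO) -> None:
    """Decode UTF-8 bytes from src line by line and write ASCII-only text to dst."""
    for raw in src:
        dst.write(clean_line(raw.decode("utf-8", errors="replace")))


# Small in-memory demonstration; a real filter would instead call
# clean_stream(sys.stdin.buffer, sys.stdout).
demo_out = io.StringIO()
clean_stream(io.BytesIO("_label C\u03b1\n".encode("utf-8")), demo_out)
```

Wired to standard input and output, a script like this could sit in front
of grep in the pipeline shown above.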

On 20/11/09 2:55 PM, "James Hester" <[email protected]> wrote:

> I agree that it is either option 1 or nothing. Note that option 4 is
> there to see if everyone is awake, but thanks for taking it seriously.
> Don't forget that 'readability' includes not only editors, but fine
> text tools like grep, awk and sed and scripting languages that want to
> do quick and dirty scans without importing or writing a CIF module.
> By keeping CIF compatible with such tools we make the format that much
> more accessible.
>
> Anyway, let's see what our Northern Hemisphere colleagues have to say...
>
> On Fri, Nov 20, 2009 at 5:34 PM, Nick Spadaccini <[email protected]> wrote:
>> On 20/11/09 1:49 PM, "James Hester" <[email protected]> wrote:
>>
>>> The essential issue here appears to be that CIF files are not only
>>> accessed via CIF applications, but also via general-purpose editors
>>> and text utilities, so the lexing/parsing stage should not produce
>>> string values which differ from what might be seen by non-CIF-aware
>>> applications. I think this is a reasonable concern.
>>
>> I do not think it is a reasonable concern any longer and certainly not into
>> the future. I don't know how many people hand-crank or eye-read a CIF but it
>> has got to be vanishingly small. I can't see how a modern system can encode
>> anything in an (essentially) ASCII base and expect to have to conform to
>> human reading.
>>
>> It is almost like saying all image formats have to be ASCIIArt so that the
>> user can view it in an editor. Time to move forward. If anyone wants to
>> read a CIF in that way then it comes with a proviso.
>>
>> Having said that, you are correct in that given the 5 ways to delimit a
>> string the likelihood to have to use elision is low, BUT we need to allow
>> for it so that at least if the case were to arise we can handle it.
>>
>> To that end only option 1 below is reasonable. 3 and 4 would just make
>> things too much more complex, and 2 is not lexically possible and would have
>> to be driven through the dictionary. And this is the way to specify what to
>> do with the raw string regarding its markup, anyway. So the behaviour I
>> specified in my previous mail will only need to be invoked in rare cases,
>> unless there are people who insist on only ever using "" strings in which
>> case elision will be needed more frequently.
>>
>>>
>>> However: if we disallow meddling with string content, then we must
>>> also not provide for eliding terminators, for the reasons put forward
>>> in my previous email: the higher-level application has no guaranteed
>>> way of determining whether the <elide><quote> digraph it finds in a
>>> string refers to a lexical escape, or to a domain specific meaning.
>>>
>>> I would note that we have 5 different possibilities for delimiting
>>> strings. What string is going to fail all those variations? It must
>>> contain both triple double and triple single quotes, as well as a
>>> semicolon following a newline. Frankly, the only realistic scenario
>>> that would produce such a monster string is CIF-inside-CIF, which is a
>>> scenario that can be dealt with either at the dictionary level by
>>> defining a transformation to break up the relevant di/trigraph, or by
>>> preparing the contents of the CIF-inside-CIF using alternative
>>> delimiters.
>>>
>>> I therefore strongly suggest that we just forget about trying to elide
>>> terminator characters.
>>>
>>> If there isn't support for dropping elision, I note that the following
>>> proposals are on the table, in addition to the original one of leaving
>>> the eliding character in the string (numbers 3,4 are ones I just
>>> thought up):
>>>
>>> 1. Nick has just put forward a refined version of my recent proposal
>>> which is about as minimalist as one could make it.
>>> 2. Herbert's r" suggestion
>>> 3. Adding a string concatenation character to the syntax, so that
>>> problem strings could be split into separate bits that use different
>>> delimiters.
>>> 4. Specify Standard LISP behaviour for all lists where the first entry
>>> is 'eval'. String concatenation is one of many possibilities that is
>>> opened up... :)
>>>
>>> On Fri, Nov 20, 2009 at 7:57 AM, SIMON WESTRIP
>>> <[email protected]> wrote:
>>>> Dear all
>>>>
>>>> Haven't caught up with all the recent discussions yet, but hopefully have
>>>> identified the following views appropriately:
>>>>
>>>> 1) Nick's proposal (preference):
>>>>
>>>> "In CIF2 an elide in a string protects the following character from being
>>>> interpreted as a delimiter.
>>>>
>>>> There is special meaning for \n, \t etc which
>>>> are replaced by their single character.
>>>>
>>>> \u123456 (up to 6 hex numbers)
>>>> indicate a unicode character which should be replaced by the correct byte
>>>> sequence.
>>>>
>>>> All other first reverse solidus should be removed, and the
>>>> immediately following character passed on as part of the string.
>>>>
>>>> Characters can be (multibyte) UTF-8.
>>>> "
>>>>
>>>> SPW: Though the logic of this is unquestionable (from a programmer's
>>>> perspective), I think this might be too disruptive. Though CIF2 promises
>>>> interpretable content to enhance data processing, CIF is also an
>>>> archiving format. I believe that restrictions on the content of a data
>>>> value should be minimal, governed by necessity (e.g. restrictions to
>>>> avoid delimiter conflicts), rather than restricting the character set of
>>>> the content to facilitate parsing or interpretability by any particular
>>>> programming language. On the one hand CIF2 promises to be a more flexible
>>>> archiving format by extending its character set, while on the other hand
>>>> it could become more restrictive by requiring that every reverse solidus
>>>> has to be 'doubled-up' in a data value.
>>>>
>>>> Granted, there are strong arguments that people will decreasingly need to
>>>> interact with a CIF in its raw form, so extra complexities of syntax are
>>>> not too much of a problem, but as many have pointed out, they still will
>>>> read/edit raw CIFs, and may well have no alternative on occasion (for
>>>> example, the IUCr will shortly be requiring authors to include
>>>> refinement-software instruction listings in their CIFs, which will need
>>>> to be included 'as is' within the restrictions of the data value
>>>> delimiters and line lengths, purely for review purposes and only
>>>> available in their raw form in the CIF).
>>>>
>>>> So on a fundamental level, I don't see that \n, \t, ... need to be
>>>> reserved as special within a data value, nor \u123456. Definition of
>>>> special meanings for these can be handled at a higher level? Equally,
>>>> unless the reverse solidus escapes a delimiter character within the
>>>> context of the identified opening delimiter, I don't see why it should be
>>>> discarded by a parser.
>>>>
>>>> 2) James' proposal:
>>>>
>>>> "backslash elides, only two specific ones:
>>>>
>>>> <backslash><terminator> and <backslash><backslash>.
>>>>
>>>> Any other use of
>>>> backslash would simply leave that backslash untouched.
>>>> "
>>>>
>>>> SPW: tend to agree with this (see above), but why escape a backslash when
>>>> they will be untouched anyway if they're not
>>>> followed by a terminator?
>>>>
>>>> 3) Herbert's proposal:
>>>>
>>>> "may I suggest that we adopt both cooked and raw quoted strings
>>>> from python, so that r" and r' can be used to introduce any raw,
>>>> unconverted string taken from a CIF1 in which almost all existing
>>>> CIF1 reverse solidus behavior could be left untouched, and that
>>>> we accept James' cooked approach for quoted strings not marked with
>>>> the r' or r".
>>>> "
>>>>
>>>> SPW: could be a neat solution for backward-compatibility, but with more
>>>> complexity comes the potential for more errors?
>>>> Also, what about r; (assuming we're not just talking about quoted strings)?
>>>>
>>>>
>>>> So if it's not possible to allow context-sensitive handling of elides
>>>> (escaping a delimiter if the value is delimited by the same delimiter),
>>>> then I find myself supporting Nick's earlier conclusion (a month back) that
>>>> all elides will be returned at the parser level for
>>>> the application to deal with (THREAD 3)? If either of these approaches is
>>>> considered unsatisfactory, then 'go the whole hog' and adopt
>>>> the familiar 'programming syntax' treatment of elides as described by Nick.
>>>>
>>>> Cheers
>>>>
>>>> Simon
>>>>
>>>> PS usual disclaimer that these aren't necessarily the IUCr's views
>>>> ________________________________
>>>> From: Herbert J. Bernstein <[email protected]>
>>>> To: Group finalising DDLm and associated dictionaries <[email protected]>
>>>> Cc: [email protected]
>>>> Sent: Thursday, 19 November, 2009 11:55:37
>>>> Subject: Re: [ddlm-group] Use of elides in strings
>>>>
>>>> Dear Colleagues,
>>>>
>>>> My personal preference would be to leave things in what to me seems the
>>>> simpler approach of passing all reverse solidus glyphs to the application.
>>>> However, the pragmatics of achieving a consensus and getting on with coding
>>>> are more important than my personal taste.
>>>>
>>>> The major impact of a change in the handling of the reverse solidus, in
>>>> having some of them absorbed by the CIF2 parsers, would be in the
>>>> handling of legacy CIFs at the IUCr and at the PDB. James is right
>>>> that what we are discussing is the difference between raw and cooked
>>>> python strings. Inasmuch as CIF2 is now going to forbid the use of
>>>> quote marks within non-delimited strings, in order to make the
>>>> conversion of legacy CIFs from CIF1 to CIF2 as easy as possible,
>>>> may I suggest that we adopt both cooked and raw quoted strings
>>>> from python, so that r" and r' can be used to introduce any raw,
>>>> unconverted string taken from a CIF1 in which almost all existing
>>>> CIF1 reverse solidus behavior could be left untouched, and that
>>>> we accept James' cooked approach for quoted strings not marked with
>>>> the r' or r".
>>>>
>>>> What say the IUCr journal operation and the PDB? It is their ox we are
>>>> goring here.
>>>>
>>>> Regards,
>>>> Herbert
>>>> =====================================================
>>>> Herbert J. Bernstein, Professor of Computer Science
>>>> Dowling College, Kramer Science Center, KSC 121
>>>> Idle Hour Blvd, Oakdale, NY, 11769
>>>>
>>>> +1-631-244-3035
>>>> [email protected]
>>>> =====================================================
>>>>
>>>> On Thu, 19 Nov 2009, James Hester wrote:
>>>>
>>>>> OK, fair enough. Just to clarify, I am not advocating the full
>>>>> repertoire of backslash elides, only two specific ones:
>>>>> <backslash><terminator> and <backslash><backslash>. Any other use of
>>>>> backslash would simply leave that backslash untouched.
>>>>>
>>>>> Would suggesting that the cut-and-pasters restrict themselves to
>>>>> semicolon-delimited strings or triple-quote delimited strings help
>>>>> with legacy issues?
>>>>>
>>>>> Anyway, let us await the opinions of our Western Hemisphere colleagues...
>>>>>
>>>>> On Thu, Nov 19, 2009 at 7:02 PM, Nick Spadaccini <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 19/11/09 12:58 PM, "James Hester" <[email protected]> wrote:
>>>>>>
>>>>>>> We need to figure out the behaviour of elides. This was previously
>>>>>>> discussed in a thread entitled "The alphabet of non-delimited
>>>>>>> strings", especially in messages around Oct 16th. The behaviour
>>>>>>> advocated by Nick is for both the eliding and elided character to be
>>>>>>> returned from the parser. The behaviour I would prefer is for the
>>>>>>> eliding character to disappear; it should itself be elided if it is to
>>>>>>> remain in the string.
>>>>>>>
>>>>>>> To summarize Nick's and Herbert's arguments from the emails dated Fri
>>>>>>> Oct 16, 2009 at 6:22AM and subsequently
>>>>>>>
>>>>>>> 1. We don't interpret elides because we don't know what algorithm to
>>>>>>> use (i.e. it might be a greek character sequence)
>>>>>>>
>>>>>>> 2. The elide simply signals that the lexer should not interpret the
>>>>>>> following character
>>>>>>>
>>>>>>> My counter-proposal is similar to Simon's original expectation: if the
>>>>>>> elide character is really eliding a syntactically significant
>>>>>>> character (i.e. a terminator character or an elide character), the
>>>>>>> elide sequence is replaced by the single character. I counter the
>>>>>>> above arguments as follows:
>>>>>>>
>>>>>>> (a) The profusion of algorithms for backslash processing is
>>>>>>> irrelevant. We can interpret the elides because the only algorithm
>>>>>>> that has any relevance at the parser level is the simple
>>>>>>> <backslash><character> -> <character>. All other potential uses
>>>>>>> belong to higher levels. If the higher levels require a
>>>>>>> <backslash><quote>, that is created by writing
>>>>>>> <backslash><backslash><backslash><quote> in the on-disk string.
>>>>>>
>>>>>> Couldn't agree with you more, and you are preaching to the converted who
>>>>>> were converted away by others. This is what I was arguing months ago for
>>>>>> how to interpret the """ strings. That is \n (EXPLICITLY THE ASCII
>>>>>> REVERSE SOLIDUS) is always a newline, \t is always a tab etc. The parser
>>>>>> should always substitute the single binary character for these character
>>>>>> doublets a la unix/python/C etc. And you quite rightly argue if you want
>>>>>> \n to really mean the IUCr Greek nu then it will have to be \\n, and the
>>>>>> same parser will give the downstream application \n (having removed the
>>>>>> leading elide). Beautiful, that's what the computer scientist in me
>>>>>> argues.
>>>>>>
>>>>>> However others argued that many users vim/emacs the file and cut and
>>>>>> paste the text content. So if you have a LaTeX string
>>>>>> "{\\em I am italicised}" that you cut and paste then it fails. And the
>>>>>> blasted backward compatibility argument comes in with existing CIF1
>>>>>> files that are not doubly elided.
>>>>>>
>>>>>> What we can do is push the idea that a CIF2 string is a COMPLETELY
>>>>>> different beast to a CIF1 string. We know that with CIF1 data names and
>>>>>> data values we have to push our CIF2 parser into a different grammar to
>>>>>> handle things correctly. At that level elides in a string will have a
>>>>>> strict CIF1 meaning (i.e. IUCr Greek markup).
>>>>>>
>>>>>> In CIF2 an elide in a string protects the following character from being
>>>>>> interpreted as a delimiter. There is special meaning for \n, \t etc
>>>>>> which are replaced by their single character. \u123456 (up to 6 hex
>>>>>> numbers) indicate a unicode character which should be replaced by the
>>>>>> correct byte sequence. All other first reverse solidus should be removed,
>>>>>> and the immediately following character passed on as part of the string.
>>>>>> Characters can be (multibyte) UTF-8.
>>>>>>
>>>>>> If you want to encode LaTeX (or IUCr-speak or something similar) then you
>>>>>> are going to have to double all your reverse solidi. You can't cut and
>>>>>> paste from an editor - bad luck.
>>>>>>
>>>>>> I will wait for Herb's response to this because he was an advocate of
>>>>>> leaving things as they were (I think). I am happy to move forward with
>>>>>> your suggested interpretation.
>>>>>>
>>>>>>> (b) The profusion of algorithms for backslash processing means that
>>>>>>> we *must* remove ambiguity by removing the eliding character during
>>>>>>> processing; otherwise, an application can't tell if it is e.g. looking
>>>>>>> at an escaped prime or an acute accent without applying ugly
>>>>>>> heuristics. Note also that a caller of a CIF reading program doesn't
>>>>>>> currently need to know what the particular string delimiting character
>>>>>>> was for a given string value; in order to make a guess at what
>>>>>>> the backslash might mean, it would often need to know this.
>>>>>>>
>>>>>>> It appears that Nick is describing Python raw string behaviour,
>>>>>>> and I am describing Python 'cooked' string behaviour. Note the
>>>>>>> following paragraph from
>>>>>>> docs.python.org/reference/lexical_analysis.html#strings:
>>>>>>>
>>>>>>> When an 'r' or 'R' prefix is present, a character following a
>>>>>>> backslash is included in the string without change, and all
>>>>>>> backslashes are left in the string. For example, the string
>>>>>>> literal r"\n" consists of two characters: a backslash and a
>>>>>>> lowercase 'n'. String quotes can be escaped with a backslash,
>>>>>>> but the backslash remains in the string; for example, r"\"" is
>>>>>>> a valid string literal consisting of two characters: a
>>>>>>> backslash and a double quote; r"\" is not a valid string
>>>>>>> literal (even a raw string cannot end in an odd number of
>>>>>>> backslashes). Specifically, a raw string cannot end in a
>>>>>>> single backslash (since the backslash would escape the
>>>>>>> following quote character). Note also that a single backslash
>>>>>>> followed by a newline is interpreted as those two characters
>>>>>>> as part of the string, not as a line continuation.
>>>>>>>
>>>>>>> Note that raw strings cannot end in a backslash, so I would consider
>>>>>>> them slightly less expressive than cooked strings, which can express
>>>>>>> everything.
>>>>>>>
>>>>>>> I would challenge Nick et al. to explain what the advantage
>>>>>>> of keeping the eliding character in the datavalue is, keeping in mind
>>>>>>> that programs like CIFtbx and PyCIFRW and several others aim to hide
>>>>>>> CIF syntax from their users (as a service), and this proposal appears
>>>>>>> to want to expose a confusing part of it to them. Some questions we
>>>>>>
>>>>>> The original "advantage" (if you could call it that) was to keep others
>>>>>> happy and to support backwards compatibility.
>>>>>>
>>>>>>> toolbox maintainers will need to ask if this goes through: Do you
>>>>>>> handle escaping any strings passed to you for output? How do you know
>>>>>>> if the caller has done the escaping already, or not? Do you really
>>>>>>> expect the calling software to work out whether it wants a single or
>>>>>>> double or triple quote delimited string? Isn't that the service
>>>>>>> provided by your software? What are they (not) paying you for, anyway?
>>>>>>
>>>>>> When they pay, I'll answer that question!
>>>>>>
>>>>>> cheers
>>>>>>
>>>>>> Nick
>>>>>>
>>>>>> --------------------------------
>>>>>> Associate Professor N. Spadaccini, PhD
>>>>>> School of Computer Science & Software Engineering
>>>>>>
>>>>>> The University of Western Australia t: +61 (0)8 6488 3452
>>>>>> 35 Stirling Highway f: +61 (0)8 6488 1089
>>>>>> CRAWLEY, Perth, WA 6009 AUSTRALIA w3: www.csse.uwa.edu.au/~nick
>>>>>> MBDP M002
>>>>>>
>>>>>> CRICOS Provider Code: 00126G
>>>>>>
>>>>>> e: [email protected]
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> ddlm-group mailing list
>>>>>> [email protected]
>>>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> T +61 (02) 9717 9907
>>>>> F +61 (02) 9717 3145
>>>>> M +61 (04) 0249 4148
>>
>> cheers
>>
>> Nick

cheers

Nick
