Re: [ddlm-group] Use of elides in strings

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] Use of elides in strings
From: Nick Spadaccini <[email protected]>
Date: Fri, 20 Nov 2009 15:08:39 +0800
In-Reply-To: <[email protected]>

We are agreed then. When the northern hemisphere wakes up they will have
read a brilliant argument as to why it is Option 1.

As for *nix tools. Isn't that all broken the moment we go to UTF-8? You'll
grep/sed/awk the ascii characters easily enough but the result will be
packed with "junk". And I don't know if I can specify a UTF-8 character
easily to grep on.
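
On specifying a UTF-8 character to search for: a minimal Python sketch, assuming the file is valid UTF-8. Because UTF-8 is self-synchronising, a plain byte search for the encoded character cannot match in the middle of some other character. The function name `lines_containing` and the sample text are made up for illustration.

```python
def lines_containing(data: bytes, char: str) -> list:
    """Return the decoded lines of UTF-8 `data` that contain `char`."""
    needle = char.encode("utf-8")  # e.g. "α" becomes b"\xce\xb1"
    return [
        line.decode("utf-8")
        for line in data.split(b"\n")
        if needle in line
    ]


# Made-up sample text, not taken from any real CIF.
sample = "first line with α in it\nsecond line, plain ascii\n".encode("utf-8")
print(lines_containing(sample, "α"))
```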

If necessary it wouldn't take much to create a *nix shell tool such that you
would execute

cifparse | grep ...

And cifparse can do whatever is necessary to the file to "clean" it up.
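
For concreteness, the Option 1 behaviour quoted further down this thread (an elide protects a delimiter, \n and \t become single characters, \u takes up to 6 hex digits, any other leading reverse solidus is dropped) might be sketched in Python as below. This is only a sketch under assumptions: the name `cook`, the contents of the escape table (the thread says just "\n, \t etc"), and the greedy matching of hex digits after \u are my own choices, not agreed syntax.

```python
import re

# Assumed table of single-character escapes; the thread only lists "\n, \t etc".
SIMPLE_ESCAPES = {"n": "\n", "t": "\t"}

# A reverse solidus followed by either u + 1..6 hex digits, or any one character.
_ELIDE = re.compile(r"\\(u[0-9A-Fa-f]{1,6}|.)", re.DOTALL)


def cook(raw: str) -> str:
    """Apply the sketched Option 1 elide rules to the text between delimiters."""

    def replace(match):
        body = match.group(1)
        if body[0] == "u" and len(body) > 1:
            # chr() raises ValueError for values above U+10FFFF.
            return chr(int(body[1:], 16))
        # Known escapes map to their character; anything else (a delimiter,
        # another reverse solidus, ...) is passed on without the elide.
        return SIMPLE_ESCAPES.get(body, body)

    return _ELIDE.sub(replace, raw)


print(cook(r"{\\em I am italicised}"))  # prints {\em I am italicised}
```

With these rules the on-disk text `\\n` reaches the application as a reverse solidus followed by `n`, while `\n` becomes a newline, which is the doubling cost discussed in the thread.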


On 20/11/09 2:55 PM, "James Hester" <[email protected]> wrote:

> I agree that it is either option 1 or nothing.  Note that option 4 is
> there to see if everyone is awake, but thanks for taking it seriously.
>  Don't forget that 'readability' includes not only editors, but fine
> text tools like grep, awk and sed and scripting languages that want to
> do quick and dirty scans without importing or writing a CIF module.
> By keeping CIF compatible with such tools we make the format that much
> more accessible.
> 
> Anyway, let's see what our Northern Hemisphere colleagues have to say...
> 
> On Fri, Nov 20, 2009 at 5:34 PM, Nick Spadaccini <[email protected]> wrote:
>> On 20/11/09 1:49 PM, "James Hester" <[email protected]> wrote:
>> 
>>> The essential issue here appears to be that CIF files are not only
>>> accessed via CIF applications, but also via general-purpose editors
>>> and text utilities, so the lexing/parsing stage should not produce
>>> string values which differ from what might be seen by non-CIF-aware
>>> applications.  I think this is a reasonable concern.
>> 
>> I do not think it is a reasonable concern any longer and certainly not into
>> the future. I don't know how many people hand-crank or eye-read a CIF but it
>> has got to be vanishingly small. I can't see how a modern system can encode
>> anything in an (essentially) ASCII base and expect to have to conform to
>> human reading.
>> 
>> It is almost like saying all image formats have to be ASCIIArt so that the
>> user can view it in an editor. Time to move forward. If anyone wants to
>> read a CIF in that way then it comes with a proviso.
>> 
>> Having said that, you are correct in that given the 5 ways to delimit a
>> string the likelihood to have to use elision is low, BUT we need to allow
>> for it so that at least if the case were to arise we can handle it.
>> 
>> To that end only option 1 below is reasonable. 3 and 4 would just make
>> things too much more complex, and 2 is not lexically possible and would have
>> to be driven through the dictionary. And this is the way to specify what to
>> do with the raw string regarding its markup, anyway. So the behaviour I
>> specified in my previous mail will only need to be invoked in rare cases,
>> unless there are people who insist on only ever using "" strings, in which
>> case elision will be needed more frequently.
>> 
>>> 
>>> However: if we disallow meddling with string content, then we must
>>> also not provide for eliding terminators, for the reasons put forward
>>> in my previous email: the higher-level application has no guaranteed
>>> way of determining whether the <elide><quote> digraph it finds in a
>>> string refers to a lexical escape, or to a domain specific meaning.
>>> 
>>> I would note that we have 5 different possibilities for delimiting
>>> strings.  What string is going to fail all those variations?  It must
>>> contain both triple double and triple single quotes, as well as a
>>> semicolon following a newline.  Frankly, the only realistic scenario
>>> that would produce such a monster string is CIF-inside-CIF, which is a
>>> scenario that can be dealt with either at the dictionary level by
>>> defining a transformation to break up the relevant di/trigraph, or by
>>> preparing the contents of the CIF-inside-CIF using alternative
>>> delimiters.
>>> 
>>> I therefore strongly suggest that we just forget about trying to elide
>>> terminator characters.
>>> 
>>> If there isn't support for dropping elision, I note that the following
>>> proposals are on the table, in addition to the original one of leaving
>>> the eliding character in the string (numbers 3,4 are ones I just
>>> thought up):
>>> 
>>> 1. Nick has just put forward a refined version of my recent proposal
>>> which is about as minimalist as one could make it.
>>> 2. Herbert's r" suggestion
>>> 3. Adding a string concatenation character to the syntax, so that
>>> problem strings could be split into separate bits that use different
>>> delimiters.
>>> 4. Specify Standard LISP behaviour for all lists where the first entry
>>> is 'eval'. String concatenation is one of many possibilities that is
>>> opened up... :)
>>> 
>>> On Fri, Nov 20, 2009 at 7:57 AM, SIMON WESTRIP
>>> <[email protected]> wrote:
>>>> Dear all
>>>> 
>>>> Haven't caught up with all the recent discussions yet, but hopefully have
>>>> identified the following views appropriately:
>>>> 
>>>> 1) Nick's proposal (preference):
>>>> 
>>>> "In CIF2 an elide in a string protects the following character from being
>>>> interpreted as a delimiter.
>>>> 
>>>> There is special meaning for \n, \t etc which
>>>> are replaced by their single character.
>>>> 
>>>> \u123456 (up to 6 hex numbers)
>>>> indicate a unicode character which should be replaced by the correct byte
>>>> sequence.
>>>> 
>>>> All other first reverse solidus should be removed, and the
>>>> immediately following character passed on as part of the string.
>>>> 
>>>> Characters can be (multibyte) UTF-8.
>>>> "
>>>> 
>>>> SPW: Though the logic of this is unquestionable (from a programmer's
>>>> perspective), I think this might be too disruptive. Though CIF2 promises
>>>> interpretable content to enhance data processing, CIF is also an archiving
>>>> format. I believe that restrictions on the content of a data value should
>>>> be minimal, governed by necessity (e.g. restrictions to avoid delimiter
>>>> conflicts), rather than restricting the character set of the content to
>>>> facilitate parsing or interpretability by any particular programming
>>>> language. On the one hand CIF2 promises to be a more flexible archiving
>>>> format by extending its character set, while on the other hand it could
>>>> become more restrictive by requiring that every reverse solidus has to be
>>>> 'doubled-up' in a data value.
>>>> 
>>>> Granted, there are strong arguments that people will decreasingly need to
>>>> interact with a CIF in its raw form, so extra complexities of syntax are
>>>> not too much of a problem, but as many have pointed out, they still will
>>>> read/edit raw CIFs, and may well have no alternative on occasion (for
>>>> example, the IUCr will shortly be requiring authors to include
>>>> refinement-software instruction listings in their CIFs, which will need to
>>>> be included 'as is' within the restrictions of the data value delimiters
>>>> and line lengths, purely for review purposes and only available in their
>>>> raw form in the CIF).
>>>> 
>>>> So on a fundamental level, I don't see that \n, \t, ... need to be reserved
>>>> as special within a data value, nor \u123456. Definition of special
>>>> meanings for these can be handled at a higher level? Equally, unless the
>>>> reverse solidus escapes a delimiter character within the context of the
>>>> identified opening delimiter, I don't see why it should be discarded by a
>>>> parser.
>>>> 
>>>> 2) James' proposal:
>>>> 
>>>> "backslash elides, only two specific ones:
>>>> 
>>>> <backslash><terminator> and <backslash><backslash>.
>>>> 
>>>> Any other use of
>>>> backslash would simply leave that backslash untouched.
>>>> "
>>>> 
>>>> SPW: tend to agree with this (see above), but why escape a backslash when
>>>> they will be untouched anyway if they're not
>>>> followed by a terminator?
>>>> 
>>>> 3) Herbert's proposal:
>>>> 
>>>> "may I suggest that we adopt both cooked and raw quoted strings
>>>> from python, so that r" and r' can be used to introduce any raw,
>>>> unconverted string taken from a CIF1 in which almost all existing
>>>> CIF1 reverse solidus behavior could be left untouched, and that
>>>> we accept James' cooked approach for quoted strings not marked with
>>>> the r' or r".
>>>> "
>>>> 
>>>> SPW: could be a neat solution for backward compatibility, but with more
>>>> complexity comes the potential for more errors?
>>>> Also, what about r; (assuming we're not just talking about quoted strings)?
>>>> 
>>>> 
>>>> So if it's not possible to allow context-sensitive handling of elides
>>>> (escaping a delimiter if the value is delimited by the same delimiter),
>>>> then I find myself supporting Nick's earlier conclusion (a month back) that
>>>> all elides will be returned at the parser level for
>>>> the application to deal with (THREAD 3)? If either of these approaches is
>>>> considered unsatisfactory, then 'go the whole hog' and adopt
>>>> the familiar 'programming syntax' treatment of elides as described by Nick.
>>>> 
>>>> Cheers
>>>> 
>>>> Simon
>>>> 
>>>> PS usual disclaimer that these aren't necessarily the IUCr's views
>>>> ________________________________
>>>> From: Herbert J. Bernstein <[email protected]>
>>>> To: Group finalising DDLm and associated dictionaries <[email protected]>
>>>> Cc: [email protected]
>>>> Sent: Thursday, 19 November, 2009 11:55:37
>>>> Subject: Re: [ddlm-group] Use of elides in strings
>>>> 
>>>> Dear Colleagues,
>>>> 
>>>>   My personal preference would be to leave things in what to me seems the
>>>> simpler approach of passing all reverse solidus glyphs to the application.
>>>> However, the pragmatics of achieving a consensus and getting on with coding
>>>> is more important than my personal taste.
>>>> 
>>>>   The major impact of a change in the handling of the reverse solidus in
>>>> having some of them absorbed by the CIF2 parsers would be in the
>>>> handling of legacy CIFs at the IUCr and at the PDB.  James is right
>>>> that what we are discussing is the difference between raw and cooked
>>>> python strings.  Inasmuch as CIF2 is now going to forbid the use of
>>>> quote marks within non-delimited strings, in order to make the
>>>> conversion of legacy CIFs from CIF1 to CIF2 as easy as possible,
>>>> may I suggest that we adopt both cooked and raw quoted strings
>>>> from python, so that r" and r' can be used to introduce any raw,
>>>> unconverted string taken from a CIF1 in which almost all existing
>>>> CIF1 reverse solidus behavior could be left untouched, and that
>>>> we accept James' cooked approach for quoted strings not marked with
>>>> the r' or r".
>>>> 
>>>>   What say the IUCr journal operation and the PDB?  It is their ox we are
>>>> goring here.
>>>> 
>>>>   Regards,
>>>>     Herbert
>>>> =====================================================
>>>>   Herbert J. Bernstein, Professor of Computer Science
>>>>     Dowling College, Kramer Science Center, KSC 121
>>>>         Idle Hour Blvd, Oakdale, NY, 11769
>>>> 
>>>>                   +1-631-244-3035
>>>>                   [email protected]
>>>> =====================================================
>>>> 
>>>> On Thu, 19 Nov 2009, James Hester wrote:
>>>> 
>>>>> OK, fair enough.  Just to clarify, I am not advocating the full
>>>>> repertoire of backslash elides, only two specific ones:
>>>>> <backslash><terminator> and <backslash><backslash>.  Any other use of
>>>>> backslash would simply leave that backslash untouched.
>>>>> 
>>>>> Would suggesting that the cut-and-pasters restrict themselves to
>>>>> semicolon-delimited strings or triple-quote delimited strings help
>>>>> with legacy issues?
>>>>> 
>>>>> Anyway, let us await the opinions of our Western Hemisphere colleagues...
>>>>> 
>>>>> On Thu, Nov 19, 2009 at 7:02 PM, Nick Spadaccini <[email protected]>
>>>>> wrote:
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 19/11/09 12:58 PM, "James Hester" <[email protected]> wrote:
>>>>>> 
>>>>>>> We need to figure out the behaviour of elides.  This was previously
>>>>>>> discussed in a thread entitled "The alphabet of non-delimited
>>>>>>> strings", especially in messages around Oct 16th.  The behaviour
>>>>>>> advocated by Nick is for both the eliding and elided character to be
>>>>>>> returned from the parser.  The behaviour I would prefer is for the
>>>>>>> eliding character to disappear; it should itself be elided if it is to
>>>>>>> remain in the string.
>>>>>>> 
>>>>>>> To summarize Nick's and Herbert's arguments from the emails dated Fri
>>>>>>> Oct 16, 2009 at 6:22AM and subsequently
>>>>>>> 
>>>>>>> 1. We don't interpret elides because we don't know what algorithm to
>>>>>>> use (i.e. it might be a greek character sequence)
>>>>>>> 
>>>>>>> 2. The elide simply signals that the lexer should not interpret the
>>>>>>> following character
>>>>>>> 
>>>>>>> My counter-proposal is similar to Simon's original expectation: if the
>>>>>>> elide character is really eliding a syntactically significant
>>>>>>> character (i.e. a terminator character or an elide character), the
>>>>>>> elide sequence is replaced by the single character.  I counter the
>>>>>>> above arguments as follows:
>>>>>>> 
>>>>>>> (a) The profusion of algorithms for backslash processing is
>>>>>>> irrelevant. We can interpret the elides because the only algorithm
>>>>>>> that has any relevance at the parser level is the simple
>>>>>>> <backslash><character> -> <character>.  All other potential uses
>>>>>>> belong to higher levels.  If the higher levels require a
>>>>>>> <backslash><quote>, that is created by writing
>>>>>>> <backslash><backslash><backslash><quote> in the on-disk string.
>>>>>> 
>>>>>> Couldn't agree with you more, and you are preaching to the converted who
>>>>>> were converted away by others. This is what I was arguing months ago for
>>>>>> how to interpret the """ strings. That is \n (EXPLICITLY THE ASCII REVERSE
>>>>>> SOLIDUS) is always a newline, \t is always a tab etc. The parser should
>>>>>> always substitute the single binary character for these character
>>>>>> doublets a la unix/python/C etc. And you quite rightly argue if you want
>>>>>> \n to really mean the IUCr Greek nu then it will have to be \\n, and the
>>>>>> same parser will give the downstream application \n (having removed the
>>>>>> leading elide). Beautiful, that's what the computer scientist in me argues.
>>>>>> 
>>>>>> However others argued that many users vim/emacs the file and cut and
>>>>>> paste the text content. So if you have a LaTeX string
>>>>>> "{\\em I am italicised}" that you cut and paste then it fails.  And the
>>>>>> blasted backward compatibility argument comes in with existing CIF1 files
>>>>>> that are not doubly elided.
>>>>>> 
>>>>>> What we can do is push the idea that a CIF2 string is a COMPLETELY
>>>>>> different beast to a CIF1 string. We know that with CIF1 data names and
>>>>>> data values we have to push our CIF2 parser into a different grammar to
>>>>>> handle things correctly. At that level elides in a string will have a
>>>>>> strict CIF1 meaning (i.e. IUCr Greek markup).
>>>>>> 
>>>>>> In CIF2 an elide in a string protects the following character from being
>>>>>> interpreted as a delimiter. There is special meaning for \n, \t etc which
>>>>>> are replaced by their single character. \u123456 (up to 6 hex numbers)
>>>>>> indicate a unicode character which should be replaced by the correct byte
>>>>>> sequence. All other first reverse solidus should be removed, and the
>>>>>> immediately following character passed on as part of the string.
>>>>>> Characters can be (multibyte) UTF-8.
>>>>>> 
>>>>>> If you want to encode LaTeX (or IUCr-speak or something similar) then you
>>>>>> are going to have to double all your reverse solidi. You can't cut and
>>>>>> paste from an editor - bad luck.
>>>>>> 
>>>>>> I will wait for Herb's response to this because he was an advocate of
>>>>>> leaving things as they were (I think). I am happy to move forward with
>>>>>> your suggested interpretation.
>>>>>> 
>>>>>>> (b) The profusion of algorithms for backslash processing means that
>>>>>>> we *must* remove ambiguity by removing the eliding character during
>>>>>>> processing; otherwise, an application can't tell if it is e.g. looking
>>>>>>> at an escaped prime or an acute accent without applying ugly
>>>>>>> heuristics.  Note also that a caller of a CIF reading program doesn't
>>>>>>> currently need to know what the particular string delimiting character
>>>>>>> was for a given string value; in order to make a guess at what
>>>>>>> the backslash might mean, it would often need to know this.
>>>>>>> 
>>>>>>> It appears that Nick is describing Python raw string behaviour,
>>>>>>> and I am describing Python 'cooked' string behaviour.  Note the
>>>>>>> following paragraph from
>>>>>>> docs.python.org/reference/lexical_analysis.html#strings:
>>>>>>> 
>>>>>>> When an 'r' or 'R' prefix is present, a character following a
>>>>>>> backslash is included in the string without change, and all
>>>>>>> backslashes are left in the string. For example, the string
>>>>>>> literal r"\n" consists of two characters: a backslash and a
>>>>>>> lowercase 'n'. String quotes can be escaped with a backslash,
>>>>>>> but the backslash remains in the string; for example, r"\"" is
>>>>>>> a valid string literal consisting of two characters: a
>>>>>>> backslash and a double quote; r"\" is not a valid string
>>>>>>> literal (even a raw string cannot end in an odd number of
>>>>>>> backslashes). Specifically, a raw string cannot end in a
>>>>>>> single backslash (since the backslash would escape the
>>>>>>> following quote character). Note also that a single backslash
>>>>>>> followed by a newline is interpreted as those two characters
>>>>>>> as part of the string, not as a line continuation.
>>>>>>> 
>>>>>>> Note that raw strings cannot end in a backslash, so I would consider
>>>>>>> them slightly less expressive than cooked strings, which can express
>>>>>>> everything.
>>>>>>> 
>>>>>>> I would challenge Nick et al. to explain what the advantage
>>>>>>> of keeping the eliding character in the datavalue is, keeping in mind
>>>>>>> that programs like CIFtbx and PyCIFRW and several others aim to hide
>>>>>>> CIF syntax from their users (as a service), and this proposal appears
>>>>>>> to want to expose a confusing part of it to them.  Some questions we
>>>>>> 
>>>>>> The original "advantage" (if you could call it that) was to keep others
>>>>>> happy and to support backwards compatibility.
>>>>>> 
>>>>>>> toolbox maintainers will need to ask if this goes through: Do you
>>>>>>> handle escaping any strings passed to you for output?  How do you know
>>>>>>> if the caller has done the escaping already, or not?  Do you really
>>>>>>> expect the calling software to work out whether it wants a single or
>>>>>>> double or triple quote delimited string?  Isn't that the service provided
>>>>>>> by your software?  What are they (not) paying you for, anyway?
>>>>>> 
>>>>>> When they pay, I'll answer that question!
>>>>>> 
>>>>>> cheers
>>>>>> 
>>>>>> Nick
>>>>>> 
>>>>>> --------------------------------
>>>>>> Associate Professor N. Spadaccini, PhD
>>>>>> School of Computer Science & Software Engineering
>>>>>> 
>>>>>> The University of Western Australia    t: +61 (0)8 6488 3452
>>>>>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>>>>>> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
>>>>>> MBDP  M002
>>>>>> 
>>>>>> CRICOS Provider Code: 00126G
>>>>>> 
>>>>>> e: [email protected]
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> _______________________________________________
>>>>>> ddlm-group mailing list
>>>>>> [email protected]
>>>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> T +61 (02) 9717 9907
>>>>> F +61 (02) 9717 3145
>>>>> M +61 (04) 0249 4148
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> cheers
>> 
>> Nick
>> 
>> 
> 
> 

cheers

Nick

--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: [email protected]



