Discussion List Archives

Re: [ddlm-group] Use of elides in strings

On 20/11/09 1:49 PM, "James Hester" <jamesrhester@gmail.com> wrote:

> The essential issue here appears to be that CIF files are not only
> accessed via CIF applications, but also via general-purpose editors
> and text utilities, so the lexing/parsing stage should not produce
> string values which differ from what might be seen by non-CIF-aware
> applications.  I think this is a reasonable concern.

I do not think it is a reasonable concern any longer, and certainly not into
the future. I don't know how many people hand-crank or eye-read a CIF, but the
number has got to be vanishingly small. I can't see how a modern system can
encode anything in an (essentially) ASCII base and expect to have to conform
to human reading.

It is almost like saying all image formats have to be ASCII art so that the
user can view them in an editor. Time to move forward. If anyone wants to
read a CIF in that way then it comes with a proviso.

Having said that, you are correct that, given the 5 ways to delimit a
string, the likelihood of having to use elision is low, BUT we need to allow
for it so that at least if the case were to arise we can handle it.
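
To make the "5 ways to delimit a string" point concrete, here is a minimal
Python sketch that lists which delimiters could wrap a given value with no
elision at all. The function name `usable_delimiters` and the exact rule
applied to each delimiter are assumptions made for illustration; they are not
taken from the CIF2 draft.

```python
# Hypothetical helper: the delimiter rules below are illustrative assumptions.
DELIMITERS = ["'", '"', "'''", '"""', ";"]


def usable_delimiters(value):
    """Return the delimiters that could wrap `value` without any elision."""
    usable = []
    for delim in DELIMITERS:
        if delim in ("'", '"'):
            # Assumed: a single-line quoted string may contain neither its
            # own quote character nor a line break.
            ok = delim not in value and "\n" not in value
        elif delim in ("'''", '"""'):
            # Assumed: a triple-quoted string may not contain its closing
            # sequence, and may not end with the quote character (which
            # would blur where the closing sequence starts).
            ok = delim not in value and not value.endswith(delim[0])
        else:
            # Semicolon text field: ended by a newline followed by ';'.
            ok = "\n;" not in value
        if ok:
            usable.append(delim)
    return usable


if __name__ == "__main__":
    for sample in ["plain", "it's", "two\nlines", 'a"""b\'\'\'c\n;d']:
        print(repr(sample), "->", usable_delimiters(sample))
```

Under these assumed rules, a value that contains both triple-quote sequences
and a newline followed by a semicolon is left with no usable delimiter, which
is the rare case where elision would be needed.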

To that end only option 1 below is reasonable. 3 and 4 would just make
things far too complex, and 2 is not lexically possible and would have
to be driven through the dictionary. The dictionary is, in any case, the way
to specify what to do with the raw string regarding its markup. So the
behaviour I specified in my previous mail will only need to be invoked in rare
cases, unless there are people who insist on only ever using "" strings, in
which case elision will be needed more frequently.
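
As a rough illustration of the option 1 behaviour (the rule quoted further
down, beginning "In CIF2 an elide in a string protects the following
character"), the following Python sketch cooks a raw string. The function
name `cook`, the choice to support only \n and \t from the "etc." list, and
the treatment of a trailing lone backslash or a \u with no hex digits are all
assumptions of this sketch, not part of the proposal.

```python
# Illustrative sketch only; names and edge-case handling are assumptions.
SIMPLE_ESCAPES = {"n": "\n", "t": "\t"}  # the proposal says "\n, \t etc"
HEX_DIGITS = set("0123456789abcdefABCDEF")


def cook(raw):
    """Apply the elide rules to `raw` and return the cooked string."""
    out = []
    i = 0
    while i < len(raw):
        ch = raw[i]
        if ch != "\\" or i + 1 == len(raw):
            # Ordinary character, or a trailing lone backslash (kept as-is
            # here; the proposal does not say what should happen).
            out.append(ch)
            i += 1
            continue
        nxt = raw[i + 1]
        if nxt in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[nxt])
            i += 2
        elif nxt == "u":
            # Greedily take up to 6 hex digits after \u.
            j = i + 2
            while j < len(raw) and j - (i + 2) < 6 and raw[j] in HEX_DIGITS:
                j += 1
            if j == i + 2:
                out.append("u")  # no digits: treat as an ordinary elide
            else:
                # chr() raises ValueError above U+10FFFF.
                out.append(chr(int(raw[i + 2:j], 16)))
            i = j
        else:
            # Any other elide: drop the backslash, keep the next character.
            out.append(nxt)
            i += 2
    return "".join(out)


if __name__ == "__main__":
    print(cook(r'a \"quoted\" word'))
    print(cook(r"{\\em I am italicised}"))
```

Note that "up to 6 hex digits" is ambiguous when the text after the code
point happens to start with a hex digit; the greedy choice made here is only
one way of resolving that.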

> 
> However: if we disallow meddling with string content, then we must
> also not provide for eliding terminators, for the reasons put forward
> in my previous email: the higher-level application has no guaranteed
> way of determining whether the <elide><quote> digraph it finds in a
> string refers to a lexical escape, or to a domain specific meaning.
> 
> I would note that we have 5 different possibilities for delimiting
> strings.  What string is going to fail all those variations?  It must
> contain both triple double and triple single quotes, as well as a
> semicolon following a newline.  Frankly, the only realistic scenario
> that would produce such a monster string is CIF-inside-CIF, which is a
> scenario that can be dealt with either at the dictionary level by
> defining a transformation to break up the relevant di/trigraph, or by
> preparing the contents of the CIF-inside-CIF using alternative
> delimiters.
> 
> I therefore strongly suggest that we just forget about trying to elide
> terminator characters.
> 
> If there isn't support for dropping elision, I note that the following
> proposals are on the table, in addition to the original one of leaving
> the eliding character in the string (numbers 3,4 are ones I just
> thought up):
> 
> 1. Nick has just put forward a refined version of my recent proposal
> which is about as minimalist as one could make it.
> 2. Herbert's r" suggestion
> 3. Adding a string concatenation character to the syntax, so that
> problem strings could be split into separate bits that use different
> delimiters.
> 4. Specify Standard LISP behaviour for all lists where the first entry
> is 'eval'. String concatenation is one of many possibilities that is
> opened up... :)
> 
> On Fri, Nov 20, 2009 at 7:57 AM, SIMON WESTRIP
> <simonwestrip@btinternet.com> wrote:
>> Dear all
>> 
>> Haven't caught up with all the recent discussions yet, but hopefully have
>> identified the following views appropriately:
>> 
>> 1) Nick's proposal (preference):
>> 
>> "In CIF2 an elide in a string protects the following character from being
>>  interpreted as a delimiter.
>> 
>> There is special meaning for \n, \t etc  which
>>  are replaced by their single character.
>> 
>> \u123456 (up to 6 hex numbers)
>>  indicate a unicode character which should be replaced by the correct byte
>>  sequence.
>> 
>> All other first reverse solidus should be removed, and the
>>  immediately following character passed on as part of the string.
>> 
>> Characters can be (multibyte) UTF-8.
>> "
>> 
>> SPW: Though the logic of this is unquestionable (from a programmer's
>> perspective), I think this might be too disruptive. Though CIF2 promises
>> interpretable content to enhance data processing, CIF is also an archiving
>> format. I believe that restrictions on the content of a data value should
>> be minimal, governed by necessity (e.g. restrictions to avoid delimiter
>> conflicts), rather than restricting the character set of the content to
>> facilitate parsing or interpretability by any particular programming
>> language. On the one hand CIF2 promises to be a more flexible archiving
>> format by extending its character set, while on the other hand it could
>> become more restrictive by requiring that every reverse solidus has to be
>> 'doubled-up' in a data value.
>> 
>> Granted, there are strong arguments that people will decreasingly need to
>> interact with a CIF in its raw form, so extra complexities of syntax are
>> not too much of a problem, but as many have pointed out, they still will
>> read/edit raw CIFs, and may well have no alternative on occasion (for
>> example, the IUCr will shortly be requiring authors to include
>> refinement-software instruction listings in their CIFs, which will need to
>> be included 'as is' within the restrictions of the data value delimiters
>> and line lengths, purely for review purposes and only available in their
>> raw form in the CIF).
>> 
>> So on a fundamental level, I don't see that \n, \t, ... need to be
>> reserved as special within a data value, nor \u123456. Definition of
>> special meanings for these can be handled at a higher level? Equally,
>> unless the reverse solidus escapes a delimiter character within the
>> context of the identified opening delimiter, I don't see why it should be
>> discarded by a parser.
>> 
>> 2) James' proposal:
>> 
>> "backslash elides, only two specific ones:
>> 
>>  <backslash><terminator> and <backslash><backslash>.
>> 
>> Any other use of
>>  backslash would simply leave that backslash untouched.
>> "
>> 
>> SPW: tend to agree with this (see above), but why escape a backslash when
>> they will be untouched anyway if they're not
>> followed by a terminator?
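
A minimal Python sketch of the two-elide rule quoted above may help with this
question. The function name `unescape_minimal` and the restriction to a
single-character terminator are assumptions of the sketch. It suggests one
reading of why <backslash><backslash> is included: without it there would be
no way to write a value in which a literal backslash is immediately followed
by the terminator, which elsewhere in this thread is written on disk as
<backslash><backslash><backslash><quote>.

```python
# Illustrative sketch only; assumes a single-character terminator.
def unescape_minimal(raw, terminator='"'):
    """Process only <backslash><terminator> and <backslash><backslash>."""
    out = []
    i = 0
    while i < len(raw):
        if (
            raw[i] == "\\"
            and i + 1 < len(raw)
            and raw[i + 1] in ("\\", terminator)
        ):
            # Recognised elide: drop the backslash, keep the elided character.
            out.append(raw[i + 1])
            i += 2
        else:
            # Everything else, including other backslashes, is left untouched.
            out.append(raw[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(unescape_minimal(r'say \"hi\"'))  # terminators are unescaped
    print(unescape_minimal(r"\alpha"))      # other backslashes survive
    print(unescape_minimal(r'\\\"'))        # backslash followed by a quote
```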
>> 
>> 3) Herbert's proposal:
>> 
>> "may I suggest that we adopt both cooked and raw quoted strings
>> from python, so that r"  and r' can be used to introduce any raw,
>> unconverted string taken from a CIF1 in which almost all existing
>> CIF1 reverse solidus behavior could be left untouched, and that
>> we accept James's cooked approach for quoted strings not marked with
>> the r' or r".
>> "
>> 
>> SPW: could be a neat solution for backward compatibility, but with more
>> complexity comes the potential for more errors?
>> Also, what about r; (assuming we're not just talking about quoted strings)?
>> 
>> 
>> So if it's not possible to allow context-sensitive handling of elides
>> (escaping a delimiter if the value is delimited by the same delimiter),
>> then I find myself supporting Nick's earlier conclusion (a month back) that
>> all elides will be returned at the parser level for
>> the application to deal with (THREAD 3)? If either of these approaches is
>> considered unsatisfactory, then 'go the whole hog' and adopt
>> the familiar 'programming syntax' treatment of elides as described by Nick.
>> 
>> Cheers
>> 
>> Simon
>> 
>> PS usual disclaimer that these aren't necessarily the IUCr's views
>> ________________________________
>> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>> Cc: Nick.Spadaccini@uwa.edu.au
>> Sent: Thursday, 19 November, 2009 11:55:37
>> Subject: Re: [ddlm-group] Use of elides in strings
>> 
>> Dear Colleagues,
>> 
>>   My personal preference would be to leave things in what to me seems the
>> simpler approach of passing all reverse solidus glyphs to the application.
>> However, the pragmatics of achieving a consensus and getting on with coding
>> is more important than my personal taste.
>> 
>>   The major impact of a change in the handling of the reverse solidus, in
>> having some of them absorbed by the CIF2 parsers, would be in the
>> handling of legacy CIFs at the IUCr and at the PDB.  James is right
>> that what we are discussing is the difference between raw and cooked
>> python strings.  Inasmuch as CIF2 is now going to forbid the use of
>> quote marks within non-delimited strings, in order to make the
>> conversion of legacy CIFs from CIF1 to CIF2 as easy as possible,
>> may I suggest that we adopt both cooked and raw quoted strings
>> from python, so that r"  and r' can be used to introduce any raw,
>> unconverted string taken from a CIF1 in which almost all existing
>> CIF1 reverse solidus behavior could be left untouched, and that
>> we accept James's cooked approach for quoted strings not marked with
>> the r' or r".
>> 
>>   What say the IUCr journal operation and the PDB?  It is their ox we are
>> goring here.
>> 
>>   Regards,
>>     Herbert
>> =====================================================
>>   Herbert J. Bernstein, Professor of Computer Science
>>     Dowling College, Kramer Science Center, KSC 121
>>         Idle Hour Blvd, Oakdale, NY, 11769
>> 
>>                   +1-631-244-3035
>>                   yaya@dowling.edu
>> =====================================================
>> 
>> On Thu, 19 Nov 2009, James Hester wrote:
>> 
>>> OK, fair enough.  Just to clarify, I am not advocating the full
>>> repertoire of backslash elides, only two specific ones:
>>> <backslash><terminator> and <backslash><backslash>.  Any other use of
>>> backslash would simply leave that backslash untouched.
>>> 
>>> Would suggesting that the cut-and-pasters restrict themselves to
>>> semicolon-delimited strings or triple-quote delimited strings help
>>> with legacy issues?
>>> 
>>> Anyway, let us await the opinions of our Western Hemisphere colleagues...
>>> 
>>> On Thu, Nov 19, 2009 at 7:02 PM, Nick Spadaccini <nick@csse.uwa.edu.au>
>>> wrote:
>>>> 
>>>> 
>>>> 
>>>> On 19/11/09 12:58 PM, "James Hester" <jamesrhester@gmail.com> wrote:
>>>> 
>>>>> We need to figure out the behaviour of elides.  This was previously
>>>>> discussed in a thread entitled "The alphabet of non-delimited
>>>>> strings", especially in messages around Oct 16th.  The behaviour
>>>>> advocated by Nick is for both the eliding and elided character to be
>>>>> returned from the parser.  The behaviour I would prefer is for the
>>>>> eliding character to disappear; it should itself be elided if it is to
>>>>> remain in the string.
>>>>> 
>>>>> To summarize Nick's and Herbert's arguments from the emails dated Fri
>>>>> Oct 16, 2009 at 6:22AM and subsequently
>>>>> 
>>>>> 1. We don't interpret elides because we don't know what algorithm to
>>>>> use (i.e. it might be a greek character sequence)
>>>>> 
>>>>> 2. The elide simply signals that the lexer should not interpret the
>>>>> following character
>>>>> 
>>>>> My counter-proposal is similar to Simon's original expectation: if the
>>>>> elide character is really eliding a syntactically significant
>>>>> character (i.e. a terminator character or an elide character), the
>>>>> elide sequence is replaced by the single character.  I counter the
>>>>> above arguments as follows:
>>>>> 
>>>>> (a) The profusion of algorithms for backslash processing is
>>>>> irrelevant. We can interpret the elides because the only algorithm
>>>>> that has any relevance at the parser level is the simple
>>>>> <backslash><character> -> <character>.  All other potential uses
>>>>> belong to higher levels.  If the higher levels require a
>>>>> <backslash><quote>, that is created by writing
>>>>> <backslash><backslash><backslash><quote> in the on-disk string.
>>>> 
>>>> Couldn't agree with you more, and you are preaching to the converted who
>>>> were converted away by others. This is what I was arguing months ago for
>>>> how to interpret the """ strings. That is, \n (EXPLICITLY THE ASCII
>>>> REVERSE SOLIDUS) is always a newline, \t is always a tab etc. The parser
>>>> should always substitute the single binary character for these character
>>>> doublets à la Unix/Python/C etc. And you quite rightly argue if you want
>>>> \n to really mean the IUCr Greek nu then it will have to be \\n, and the
>>>> same parser will give the downstream application \n (having removed the
>>>> leading elide). Beautiful, that's what the computer scientist in me
>>>> argues.
>>>> 
>>>> However others argued that many users vim/emacs the file and cut and
>>>> paste the text content. So if you have a LaTeX string
>>>> "{\\em I am italicised}" that you cut and paste then it fails.  And the
>>>> blasted backward compatibility argument comes in with existing CIF1 files
>>>> that are not doubly elided.
>>>> 
>>>> What we can do is push the idea that a CIF2 string is a COMPLETELY
>>>> different beast to a CIF1 string. We know that with CIF1 data names and
>>>> data values we have to push our CIF2 parser into a different grammar to
>>>> handle things correctly. At that level elides in a string will have a
>>>> strict CIF1 meaning (i.e. IUCr Greek markup).
>>>> 
>>>> In CIF2 an elide in a string protects the following character from being
>>>> interpreted as a delimiter. There is special meaning for \n, \t etc
>>>> which are replaced by their single character. \u123456 (up to 6 hex
>>>> numbers) indicate a unicode character which should be replaced by the
>>>> correct byte sequence. All other first reverse solidus should be removed,
>>>> and the immediately following character passed on as part of the string.
>>>> Characters can be (multibyte) UTF-8.
>>>> 
>>>> If you want to encode LaTeX (or IUCr-speak or something similar) then you
>>>> are going to have to double all your reverse solidi. You can't cut and
>>>> paste from an editor - bad luck.
>>>> 
>>>> I will wait for Herb's response to this because he was an advocate of
>>>> leaving things as they were (I think). I am happy to move forward with
>>>> your suggested interpretation.
>>>> 
>>>>> (b) The profusion of algorithms for backslash processing means that
>>>>> we *must* remove ambiguity by removing the eliding character during
>>>>> processing; otherwise, an application can't tell if it is e.g. looking
>>>>> at an escaped prime or an acute accent without applying ugly
>>>>> heuristics.  Note also that a caller of a CIF reading program doesn't
>>>>> currently need to know what the particular string delimiting character
>>>>> was for a given string value; in order to make a guess at what
>>>>> the backslash might mean, it would often need to know this.
>>>>> 
>>>>> It appears that Nick is describing Python raw string behaviour,
>>>>> and I am describing Python 'cooked' string behaviour.  Note the
>>>>> following paragraph from
>>>>> docs.python.org/reference/lexical_analysis.html#strings:
>>>>> 
>>>>> When an 'r' or 'R' prefix is present, a character following a
>>>>> backslash is included in the string without change, and all
>>>>> backslashes are left in the string. For example, the string
>>>>> literal r"\n" consists of two characters: a backslash and a
>>>>> lowercase 'n'. String quotes can be escaped with a backslash,
>>>>> but the backslash remains in the string; for example, r"\"" is
>>>>> a valid string literal consisting of two characters: a
>>>>> backslash and a double quote; r"\" is not a valid string
>>>>> literal (even a raw string cannot end in an odd number of
>>>>> backslashes). Specifically, a raw string cannot end in a
>>>>> single backslash (since the backslash would escape the
>>>>> following quote character). Note also that a single backslash
>>>>> followed by a newline is interpreted as those two characters
>>>>> as part of the string, not as a line continuation.
>>>>> 
>>>>> Note that raw strings cannot end in a backslash, so I would consider
>>>>> them slightly less expressive than cooked strings, which can express
>>>>> everything.
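
The raw versus cooked distinction described in the quoted Python
documentation can be checked directly. This short sketch uses only the
standard library; the helper name `is_valid_literal` is made up for the
example.

```python
import ast

raw_n = r"\n"       # raw: a backslash followed by the letter n
cooked_n = "\n"     # cooked: a single newline character
raw_quote = r"\""   # raw: the backslash protects the quote but stays in


def is_valid_literal(source):
    """Return True if `source` parses as a Python literal."""
    try:
        ast.literal_eval(source)
        return True
    except SyntaxError:
        return False


if __name__ == "__main__":
    print(len(raw_n), len(cooked_n), len(raw_quote))
    # A raw string cannot end in a single backslash ...
    print(is_valid_literal('r"\\"'))
    # ... whereas a cooked string can express one by doubling it.
    print(is_valid_literal('"\\\\"'))
```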
>>>>> 
>>>>> I would challenge Nick et al. to explain what the advantage
>>>>> of keeping the eliding character in the datavalue is, keeping in mind
>>>>> that programs like CIFtbx and PyCIFRW and several others aim to hide
>>>>> CIF syntax from their users (as a service), and this proposal appears
>>>>> to want to expose a confusing part of it to them.  Some questions we
>>>> 
>>>> The original "advantage" (if you could call it that) was to keep others
>>>> happy and to support backwards compatibility.
>>>> 
>>>>> toolbox maintainers will need to ask if this goes through: Do you
>>>>> handle escaping any strings passed to you for output?  How do you know
>>>>> if the caller has done the escaping already, or not?  Do you really
>>>>> expect the calling software to work out whether it wants a single or
>>>>> double or triple quote delimited string?  Isn't that the service
>>>>> provided by your software?  What are they (not) paying you for, anyway?
>>>> 
>>>> When they pay, I'll answer that question!
>>>> 
>>>> cheers
>>>> 
>>>> Nick
>>>> 
>>>> --------------------------------
>>>> Associate Professor N. Spadaccini, PhD
>>>> School of Computer Science & Software Engineering
>>>> 
>>>> The University of Western Australia    t: +61 (0)8 6488 3452
>>>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>>>> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
>>>> MBDP  M002
>>>> 
>>>> CRICOS Provider Code: 00126G
>>>> 
>>>> e: Nick.Spadaccini@uwa.edu.au
>>>> 
>>>> 
>>>> 
>>>> 
>>>> _______________________________________________
>>>> ddlm-group mailing list
>>>> ddlm-group@iucr.org
>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>> 
>>> 
>>> 
>>> 
>>> --
>>> T +61 (02) 9717 9907
>>> F +61 (02) 9717 3145
>>> M +61 (04) 0249 4148
>>> 
>> 
>> 
> 
> 

cheers

Nick

--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au





