
Re: [ddlm-group] Use of elides in strings

I am now totally lost.  Please start over with a coherent proposal
for the syntax of a quoted string.  In particular, please state
how the following strings will be parsed

"ab\"cd"
'ab\"cd'
"ab\\"cd"
'ab\\"cd'

;ab\"cd\
;

;ab\\"cd\\
;

"""ab\""""
"""ab\\""""


{"abcd\"":ggg}
{'abcd\"':ggg}

"resum\'ee"
'resum\'ee'
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Mon, 23 Nov 2009, Nick Spadaccini wrote:

> I'll try again.
>
> Yes Simon, what I am suggesting here, following James' proposal, is
> different from the conclusion at the end of THREAD 3.
>
> READ MY LIPS (FINGERS)
>
> (1) This is CIF2 specification. What is done in CIF1 encoding is irrelevant.
> ALL parsers will have to drop in and out of CIF1 and CIF2 modes for legacy
> reasons. Please desist from including CIF1 examples, they are a distraction.
> (2) James (and I support) proposes that the very special behaviour of a
> terminator (AND ONLY THIS CHARACTER) within a string delimited by the same
> character is a CIF2 issue that should be handled by the parser (read this to
> mean the application that is responsible for writing and reading CIF 2
> files). WE MUST agree where this is done, because users have to know when
> they write or read a string who will handle the specialness of terminators.
>
> Herb and John argue it is the user's responsibility. Sorry but most users
> would not understand what the issues are (there is enough confusion amongst
> us to make that clear).
> (3) As an example let's just discuss "" delimited strings, and I wish to
> include a " in that string. The proposal is that the parser ONLY EVER deals
> with the " character. ALL OTHER ELIDES etc are irrelevant - they will be
> passed on as raw. The \\ is NOT a special case, and I can see no reason for
> it to be considered.
>
> The process is, if the string to be written out has a " included this will
> be elided as it is written out. When reading, as you parse left to right,
> when you find a " if it has a preceding \ delete it, skip over the " and
> continue, otherwise it must be the terminating character. Everything else is
> left untouched.
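
The write and read steps just described can be sketched roughly as follows. This is only an illustration of the rule as stated for "" delimited strings; the function names write_dq and read_dq are hypothetical and do not come from any CIF library.

```python
def write_dq(value: str) -> str:
    """Write value as a "-delimited string, eliding each embedded "."""
    return '"' + value.replace('"', '\\"') + '"'


def read_dq(text: str) -> str:
    """Read one "-delimited string that starts at text[0], left to right."""
    if not text.startswith('"'):
        raise ValueError("expected an opening double quote")
    out = []
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == '"':
            if out and out[-1] == "\\":
                # A preceding \ marks an elided quote: delete the \,
                # keep the ", and carry on scanning.
                out[-1] = '"'
                i += 1
                continue
            # An unelided " is the terminator.
            return "".join(out)
        # Every other character, including any other \, is passed on raw.
        out.append(ch)
        i += 1
    raise ValueError("unterminated string")


if __name__ == "__main__":
    for value in ['abcd\\"efgh', 'abcd"efgh']:
        written = write_dq(value)
        print(value, "->", written, "->", read_dq(written))
```

One consequence of the rule exactly as worded, at least in this sketch: a value whose last character is a reverse solidus is written with \" at the end, which the reader then takes for an elided quote rather than the terminator.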
>
> Examples
>
> HB example
>
> Say a user (Dr Joe Ordinary) wishes to output in a "" delimited string
>
> (1) abcd\"efgh. The parser would output "abcd\\"efgh"
> (2) abcd"efgh. The parser would output "abcd\"efgh"
>
> JW example
>
> [][ _(),.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*. The parser would output
> "[][ _(),.;:\"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*"
>
> The proposal is that ONLY \" is of any relevance to this process, all other
> elides are irrelevant. John do you really expect to leave it up to the user
> to know that the \" in your regular expression was inserted as a CIF2
> requirement, rather than an actual regex?
>
> Again the price is you can't cut and paste, but you can't do that anyway
> whether you take the JRH/NS view or the HB/JW view.
>
> I see real strength in requiring the parser to do this for the user. I can
> see no downside, but I do see downsides if you expect all users to know what
> to do.
>
> Herb is correct that all other handling of elides (the encodings) is left to
> the dictionary level definitions. JW quite correctly states that they avoid
> the problem by using a different delimiter character to the character they
> wish to insert. There will be very rare cases though, where that may not be
> possible and the above HAS TO be done.
>
>
> On 23/11/09 2:29 AM, "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
> wrote:
>
>> Yes, I would prefer that the lexer and parser deliver
>>
>> "A\"BC"
>>
>> as
>>
>>
>> A\"BC
>>
>> -- Herbert
>>
>> On Sun, 22 Nov 2009, SIMON WESTRIP wrote:
>>
>>> Thanks for this
>>>
>>> Does everyone else agree that the value in
>>>
>>> _label "A\"BC"
>>>
>>> is A\"BC ?
>>>
>>> Sorry to keep on about this, but at the end of THREAD3 I understood this
>>> to be the conclusion, then more recently it looked like there was some
>>> acceptance of dropping the backslash in a context-sensitive manner - i.e.
>>> the value being A"BC in this particular case.
>>>
>>> Cheers
>>>
>>> Simon
>>>
>>>
>>> ____________________________________________________________________________
>>> From: John Westbrook <jwest@pdb-mail.rutgers.edu>
>>> To: SIMON WESTRIP <simonwestrip@btinternet.com>
>>> Cc: jwest@rcsb.rutgers.edu; Group finalising DDLm and associated
>>> dictionaries <ddlm-group@iucr.org>
>>> Sent: Sunday, 22 November, 2009 15:12:10
>>> Subject: Re: [ddlm-group] Use of elides in strings
>>>
>>>
>>> Hi Simon -
>>>
>>> Subject to the regex example we would process this as A\"BC
>>> as the '\' is allowed in the regex.  We have loads of similar cases in which
>>> there is an embedded quote in a character string which is not surrounded
>>> by whitespace.  When Nick visited us recently he analyzed these cases
>>> and we agreed that we would be able to quote these in the opposite
>>> sense, as 'AB"C' or "AB'C", or in semi-colons for the odd case in which
>>> both should occur.    In none of these cases would we expect the
>>> internal quote to be escaped.
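
The quoting policy John describes, with no escaping anywhere, might be sketched like this. The function name quote_without_escapes is hypothetical; the newline check and the refusal on <eol><semicolon> are assumptions added for the sketch, and the triple-quote delimiters discussed elsewhere in the thread are left out.

```python
def quote_without_escapes(value: str) -> str:
    """Pick a delimiter that does not occur in value; never insert a \\."""
    single_line = "\n" not in value  # assumption: quoted strings stay on one line
    if single_line and '"' not in value:
        return '"' + value + '"'
    if single_line and "'" not in value:
        return "'" + value + "'"
    # Both kinds of quote occur (or the value spans lines): fall back to a
    # semicolon-delimited text field, which begins and ends with <eol>;
    if "\n;" in value:
        raise ValueError("value contains <eol><semicolon>; cannot be written raw")
    return "\n;" + value + "\n;"


if __name__ == "__main__":
    for value in ['AB"C', "AB'C", 'A"B\'C']:
        print(repr(value), "->", repr(quote_without_escapes(value)))
```

The internal quote is never touched, so whatever the reader strips off is only the delimiter pair.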
>>>
>>> John
>>>
>>>
>>> SIMON WESTRIP wrote:
>>>> lexical analysis and parsing aside, in terms of specifying the syntax of
>>>> CIF2, what should someone expect the following to represent:
>>>>
>>>> _label "A\"BC"
>>>>
>>>> Is the value A\"BC or A"BC?
>>>>
>>>> I'm talking CIF2 only here (the use of elides for greek etc in CIF1 will
>>>> no longer be part of the spec; rather it can be handled at the application
>>>> level or perhaps defined in the dictionary in some way as some sort of item
>>>> content type?)
>>>>
>>>> Cheers
>>>>
>>>> Simon
>>>>
>>>>
>>>>
>>>>
>>>> ________________________________
>>>> From: John Westbrook <jwest@pdb-mail.rutgers.edu>
>>>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>>>> Sent: Saturday, 21 November, 2009 14:02:19
>>>> Subject: Re: [ddlm-group] Use of elides in strings
>>>>
>>>>
>>>> To take another example in support of passing the data without
>>>> processing back to the application, DDL2 depends heavily on
>>>> using dictionary regex's to define the interpretation for
>>>> the application.  For instance, the regex for an atom code in
>>>> our dictionary is -
>>>>
>>>> [][ _(),.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*
>>>>
>>>> Not only does processing this regex present an issue for the
>>>> treatment of the '\', but it also defines how the '\' will
>>>> be interpreted in data subject to the regex.
>>>>
>>>> I agree with Herb's conclusion that the lexer should do
>>>> the minimum of interpretation.
>>>>
>>>> Regards,
>>>>
>>>> John
>>>>
>>>> Herbert J. Bernstein wrote:
>>>>> Dear Colleagues,
>>>>>
>>>>>   Let us consider James' example.  He is actually making the case
>>>>> for _not_ removing the reverse-solidus from a string at the
>>>>> lexical level.
>>>>>
>>>>>   xxxx<backslash><quote>elxxxx
>>>>>
>>>>> or to be more specific
>>>>>
>>>>>   abcd\'efgh
>>>>>
>>>>> and we are presented with the question of how should the
>>>>> dictionary interpret that string.
>>>>>
>>>>> If we have a string intended to be part of the modern pythonesque
>>>>> world, then I would expect the data element to have been typed
>>>>> in a way that says we should read the string as
>>>>>
>>>>>   abcd'efgh
>>>>>
>>>>> If we have a string that is a legacy from a CIF 1 file with
>>>>> IUCr type-setting codes, I would expect the data element to
>>>>> have been typed in a way that says we should read the string as
>>>>>
>>>>>   abcd(e with an acute accent)fgh
>>>>>
>>>>> Anything the lexer does to remove the reverse-solidus is
>>>>> going to disfavor one interpretation or the other.
>>>>>
>>>>> By moving these two interpretations one level up to two
>>>>> different utility routines, we gain much more use from
>>>>> a common lexer and nobody loses any functionality.
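
The split Herbert describes, a lexer that leaves the reverse solidus alone plus two separate utility routines one level up, could look something like the sketch below. The names python_style, iucr_style and IUCR_CODES are hypothetical, and the code table holds only the single type-setting code used in the example above, not the full IUCr list.

```python
def python_style(raw: str) -> str:
    """Read \\' and \\" as elided quotes (one 'pythonesque' reading).

    Deliberately minimal: no other escape sequences are handled.
    """
    return raw.replace("\\'", "'").replace('\\"', '"')


# Illustrative subset of the CIF1 type-setting codes: \'e is e with an
# acute accent. A real utility would carry the complete table.
IUCR_CODES = {"\\'e": "\u00e9"}


def iucr_style(raw: str) -> str:
    """Read the reverse solidus as introducing an IUCr type-setting code."""
    for code, char in IUCR_CODES.items():
        raw = raw.replace(code, char)
    return raw


if __name__ == "__main__":
    raw = "abcd\\'efgh"  # what the common lexer hands back, untouched
    print(python_style(raw))
    print(iucr_style(raw))
```

Both routines start from the same raw string; which one applies would be decided by how the data element is typed, not by the lexer.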
>>>>>
>>>>>   Regards,
>>>>>     Herbert
>>>>>
>>>>>
>>>>> On Sat, 21 Nov 2009, James Hester wrote:
>>>>>
>>>>>> Joe, I agree with you.  There is a fundamental issue here that I have
>>>>>> already raised, but can't see Herbert and John's proposal addressing:
>>>>>> if we allow lexical escaping, but then pass on both the escaping and
>>>>>> escaped character, how does the dictionary layer know if a given
>>>>>> character sequence represents an escape, or corresponds to something
>>>>>> else?  If the dictionary layer gets a string like:
>>>>>>
>>>>>> 'xxxx<backslash><quote>elxxxx', does that mean:
>>>>>>
>>>>>> 'xxxx<quote>elxxxx'
>>>>>>
>>>>>> or does it mean
>>>>>>
>>>>>> 'xxxx<e acute>lxxxx' ?
>>>>>>
>>>>>> (First case might be from a string "He said 'elephants are pink' ",
>>>>>> second case "Fren<e acute>l formalism" (apologies to French speakers,
>>>>>> I have no idea when to use e acute).
>>>>>>
>>>>>> Similar examples can be constructed no matter what the alternative
>>>>>> meaning of <backslash><quote> might be in the particular domain.  The
>>>>>> key point is that you can't overload the meaning of
>>>>>> <escape><terminator>: either it is an instruction to the lexer, or it
>>>>>> has semantic meaning, but not both.  It doesn't even matter if the
>>>>>> lexer reads the dictionary definition before reading in the string
>>>>>> value: if two meanings are possible, the dictionary layer faces the
>>>>>> same problem.
>>>>>>
>>>>>> So: here is my latest proposal to deal with this issue:
>>>>>>
>>>>>> 1.  As in CIF1, there is no lexical elision available at all, ever.
>>>>>> All instances of the terminator terminate (unlike CIF1).
>>>>>>
>>>>>> 2.  Dictionary writers anticipate when a string value may run into
>>>>>> trouble due to this lack of elision (because those string values could
>>>>>> contain all of triple quote/triple double quote/<eol><semicolon>) and
>>>>>> describe a workaround in the dictionary: for example, inserting a
>>>>>> space between <eol> and <semicolon> when writing these string values,
>>>>>> and removing the space when reading them back.  We could provide
>>>>>> support by defining a special string type in DDLm with these
>>>>>> properties.
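
The workaround in point 2 can be sketched as a pair of routines applied when writing and reading such values. The names protect_semicolons and restore_semicolons are hypothetical, and the sketch assumes the original value never already contains <eol><space><semicolon>, since such a value would come back changed.

```python
def protect_semicolons(value: str) -> str:
    """Before writing: put a space between every <eol> and <semicolon>."""
    return value.replace("\n;", "\n ;")


def restore_semicolons(stored: str) -> str:
    """After reading: take the inserted space back out."""
    return stored.replace("\n ;", "\n;")


if __name__ == "__main__":
    value = "first line\n;second line"
    stored = protect_semicolons(value)
    # The stored form no longer contains the semicolon-field terminator.
    print("\n;" in stored)
    print(restore_semicolons(stored) == value)
```

Nothing here involves the lexer: the terminator always terminates, and the conversion is a convention the dictionary would have to describe.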
>>>>>>
>>>>>> I believe that this deals with all real and imagined problems.
>>>>>>
>>>>>> On Sat, Nov 21, 2009 at 12:40 PM, Joe Krahn <krahn@niehs.nih.gov> wrote:
>>>>>>> Clearly, Herbert is not referring to the 'reading and writing
>>>>>>> application' as the parser, but the application calling the parser. It
>>>>>>> makes things easier for the parser, but harder for the caller. It would
>>>>>>> not be that much of a problem, except that there are now several ways to
>>>>>>> quote strings, and the disallowed character sequences that need encoding
>>>>>>> vary among them.
>>>>>>>
>>>>>>> Herbert seems to view "the calling application" as a middle layer,
>>>>>>> rather than the program making use of the data. That sort of makes
>>>>>>> sense, in that conversion between strings and numeric values cannot
>>>>>>> happen at the CIF level. You could argue that a dictionary level middle
>>>>>>> layer is required to convert data to the final end-user form, and that
>>>>>>> handling character conversions at that level is more flexible. In
>>>>>>> general, that is a reasonable approach. However, even in that case, I
>>>>>>> think it is much less problematic to handle the few conversions that are
>>>>>>> specific to a given string quoting method at the parser level.
>>>>>>>
>>>>>>> Joe
>>>>>>>
>>>>>>> James Hester wrote:
>>>>>>>> First in reply to Joe: I believe that when Nick refers to the 'reading
>>>>>>>> and writing application' he indeed has in mind the CIF parser/CIF
>>>>>>>> writer layer, so I would guess that he agrees with your opinion as
>>>>>>>> well.  The issue is that we do not present an opaque storage format,
>>>>>>>> unlike SQL or HDF; it is pretty easy to create and manipulate CIFs
>>>>>>>> with text tools, so we need to cater to this method of interfacing to
>>>>>>>> CIFs as well.
>>>>>>>>
>>>>>>>> In reply to Herbert: your suggestion implies that we abandon any
>>>>>>>> *lexical* meaning for <elide><terminator>.  Or are you suggesting that
>>>>>>>> an application reads the dataname, then looks up the dictionary to
>>>>>>>> decide if it should continue to input the string when it sees
>>>>>>>> <elide><terminator>?  So we have dictionary-driven parsing?
>>>>>>>>
>>>>>>>> I can't work out from your previous email whether you are now in
>>>>>>>> support of abandoning elision as well as supporting treating all
>>>>>>>> strings as raw.  Please clarify...
>>>>>>>>
>>>>>>>> On Sat, Nov 21, 2009 at 6:44 AM, Herbert J. Bernstein
>>>>>>>> <yaya@bernstein-plus-sons.com> wrote:
>>>>>>>>> Dear Colleagues,
>>>>>>>>>
>>>>>>>>>   There is a difference between what are useful utilities to have in
>>>>>>>>> an API in support of CIF2 and what is formally part of the base CIF2.
>>>>>>>>> I am all in favor of utilities to apply and unapply the various
>>>>>>>>> uses for the reverse solidus -- one for cleaning up python-style
>>>>>>>>> use, one to handle the IUCr special characters, one for line folding,
>>>>>>>>> etc., but I don't think that means we have to make one of those
>>>>>>>>> particular uses formally part of the base CIF2.
>>>>>>>>>
>>>>>>>>>   Regards,
>>>>>>>>>     Herbert
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, 20 Nov 2009, Joe Krahn wrote:
>>>>>>>>>
>>>>>>>>>> Unlike others here, I feel that a proper text archive library should be
>>>>>>>>>> able to take any string from the calling application, and return that
>>>>>>>>>> exact same string when reading it back in. It is the job of the archive
>>>>>>>>>> format to avoid delimiter problems. An application should be able to
>>>>>>>>>> store and retrieve strings without such worries, and interface to an SQL
>>>>>>>>>> database the same as it would interface to CIF. All commonly used
>>>>>>>>>> database libraries work this way. Why should CIF continue to take an
>>>>>>>>>> archaic approach?
>>>>>>>>>>
>>>>>>>>>> I essentially agree with the design below, except that the library
>>>>>>>>>> should handle insertion and removal of the reverse solidus for the
>>>>>>>>>> limited cases where it is required.
>>>>>>>>>>
>>>>>>>>>> If it is the client application's responsibility to deal with reverse
>>>>>>>>>> solidus escape sequences, then the description below doesn't make sense.
>>>>>>>>>> In that case, the reverse solidus never has any special meaning to CIF2.
>>>>>>>>>> Instead, CIF2 simply disallows certain character sequences. A client
>>>>>>>>>> application can use whatever it wants to encode/decode the disallowed
>>>>>>>>>> character sequences.
>>>>>>>>>>
>>>>>>>>>> The advantage of having well-defined escape sequences at the I/O library
>>>>>>>>>> level is that updates to the format do not require updates to client
>>>>>>>>>> applications. A CIF client application should be able to send a string
>>>>>>>>>> to the CIF library, and not have to know in advance what CIF revision is
>>>>>>>>>> in use, or whether the string is semicolon block quoted or triple
>>>>>>>>>> quoted. By requiring the client to escape invalid sequences, the client
>>>>>>>>>> will have to escape strings differently, i.e. triple quote is OK within
>>>>>>>>>> semi-colon quotes, and a leading semicolon is OK within triple quotes,
>>>>>>>>>> but not the other way around.
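
Joe's point that the disallowed sequences differ from one quoting method to the next can be made concrete with a small table. This is a simplified sketch, not the CIF2 grammar: the style names, TERMINATORS, SINGLE_LINE and styles_needing_no_escape are all hypothetical, and the only rules modelled are "the terminator must not occur" and "quote-delimited strings stay on one line".

```python
# Terminating sequence for each quoting method discussed in the thread.
TERMINATORS = {
    "double": '"',
    "single": "'",
    "triple_double": '"""',
    "triple_single": "'''",
    "semicolon": "\n;",
}

# Assumption for this sketch: these styles cannot hold an end-of-line.
SINGLE_LINE = {"double", "single"}


def styles_needing_no_escape(value):
    """List the quoting styles that can hold value exactly as it is."""
    usable = []
    for style, terminator in TERMINATORS.items():
        if terminator in value:
            continue  # the value would end the string early
        if style in SINGLE_LINE and "\n" in value:
            continue
        usable.append(style)
    return usable


if __name__ == "__main__":
    print(styles_needing_no_escape('a"""b'))   # triple quote inside
    print(styles_needing_no_escape("a\n;b"))   # leading semicolon inside
```

A caller that has to do its own escaping needs this table, or something like it, for every CIF revision; a library that escapes internally can keep it private.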
>>>>>>>>>>
>>>>>>>>>> Joe Krahn
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Nick Spadaccini wrote:
>>>>>>>>>>> SUMMARISING.
>>>>>>>>>>>
>>>>>>>>>>> (a) The contents of delimited strings are returned as raw, with the token
>>>>>>>>>>> delimiters removed.
>>>>>>>>>>> (b) Where a delimiter character is to be part of the string, that character
>>>>>>>>>>> must be preceded by a reverse solidus when written out to the file. When
>>>>>>>>>>> read, any reverse solidus preceding a terminating character is deleted.
>>>>>>>>>>> (c) It is the responsibility of the writing and reading application to
>>>>>>>>>>> insert and remove the reverse solidus preceding the terminating character.
>>>>>>>>>>> (d) Otherwise the presence of a reverse solidus in the string has no
>>>>>>>>>>> meaning.
>>>>>>>>>> _______________________________________________
>>>>>>>>>> ddlm-group mailing list
>>>>>>>>>> ddlm-group@iucr.org
>>>>>>>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> -- T +61 (02) 9717 9907
>>>>>> F +61 (02) 9717 3145
>>>>>> M +61 (04) 0249 4148
>>>>>>
>>>>
>>>>
>>>
>>>
>
> cheers
>
> Nick
>
> --------------------------------
> Associate Professor N. Spadaccini, PhD
> School of Computer Science & Software Engineering
>
> The University of Western Australia    t: +61 (0)8 6488 3452
> 35 Stirling Highway                    f: +61 (0)8 6488 1089
> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
> MBDP  M002
>
> CRICOS Provider Code: 00126G
>
> e: Nick.Spadaccini@uwa.edu.au
>
>
>
>
>
