
Re: [ddlm-group] Use of elides in strings

To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] Use of elides in strings
From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
Date: Sat, 21 Nov 2009 08:27:56 -0500 (EST)
In-Reply-To: <279aad2a0911210437v53726196p7ffd0fa9a3e1cee8@mail.gmail.com>
References: <C72C423A.12515%nick@csse.uwa.edu.au><4B06DEAF.4070109@niehs.nih.gov><alpine.BSF.2.00.0911201441550.25803@epsilon.pair.com><279aad2a0911201545m22547e50i39df8f165c1c340e@mail.gmail.com><4B0744F9.3040907@niehs.nih.gov><279aad2a0911210437v53726196p7ffd0fa9a3e1cee8@mail.gmail.com>

Dear Colleagues,

  Let us consider James' example.  He is actually making the case
for _not_ removing the reverse-solidus from a string at the
lexical level.

   xxxx<backslash><quote>elxxxx

or to be more specific

   abcd\'efgh

and we are presented with the question of how the
dictionary should interpret that string.

If we have a string intended to be part of the modern pythonesque
world, then I would expect the data element to have been typed
in a way that says we should read the string as

   abcd'efgh

If we have a string that is a legacy from a CIF 1 file with
IUCr type-setting codes, I would expect the data element to
have been typed in a way that says we should read the string as

   abcd<e acute>fgh

Anything the lexer does to remove the reverse-solidus is
going to disfavor one interpretation or the other.

By moving these two interpretations one level up to two
different utility routines, we gain much more use from
a common lexer and nobody loses any functionality.
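
As a minimal sketch of that layering (not part of any CIF2 draft), the
Python below assumes the lexer hands back the raw string with the
reverse-solidus untouched, and that the dictionary typing of the data
element selects one of two utility routines. The function names and the
small table of accent codes are hypothetical and cover only a subset of
the IUCr type-setting codes.

```python
import unicodedata

# Assumed subset of the CIF 1 IUCr type-setting codes:
# reverse-solidus + mark + letter -> accented letter.
_COMBINING = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    '"': "\u0308",  # umlaut
    "~": "\u0303",  # tilde
}


def unescape_pythonesque(raw: str) -> str:
    """Hypothetical utility: drop a reverse-solidus that precedes a quote."""
    out = []
    i = 0
    while i < len(raw):
        if raw[i] == "\\" and i + 1 < len(raw) and raw[i + 1] in "'\"":
            out.append(raw[i + 1])  # keep the quote, discard the elide
            i += 2
        else:
            out.append(raw[i])
            i += 1
    return "".join(out)


def decode_iucr_codes(raw: str) -> str:
    """Hypothetical utility: turn accent codes into accented letters."""
    out = []
    i = 0
    while i < len(raw):
        if (raw[i] == "\\" and i + 2 < len(raw)
                and raw[i + 1] in _COMBINING and raw[i + 2].isalpha()):
            # Compose letter + combining mark into a single character.
            composed = raw[i + 2] + _COMBINING[raw[i + 1]]
            out.append(unicodedata.normalize("NFC", composed))
            i += 3
        else:
            out.append(raw[i])
            i += 1
    return "".join(out)


# The same raw lexer output, read two ways.
raw = "abcd\\'efgh"
print(unescape_pythonesque(raw))
print(decode_iucr_codes(raw))
```

Both routines start from the identical raw value, so the choice between
them can be left to whatever layer knows how the data element is typed.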

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Sat, 21 Nov 2009, James Hester wrote:

> Joe, I agree with you.  There is a fundamental issue here that I have
> already raised, but can't see Herbert and John's proposal addressing:
> if we allow lexical escaping, but then pass on both the escaping and
> escaped character, how does the dictionary layer know if a given
> character sequence represents an escape, or corresponds to something
> else?  If the dictionary layer gets a string like:
>
> 'xxxx<backslash><quote>elxxxx', does that mean:
>
> 'xxxx<quote>elxxxx'
>
> or does it mean
>
> 'xxxx<e acute>lxxxx' ?
>
> (First case might be from a string "He said 'elephants are pink' ",
> second case "Fren<e acute>l formalism"; apologies to French speakers,
> I have no idea when to use e acute.)
>
> Similar examples can be constructed no matter what the alternative
> meaning of <backslash><quote> might be in the particular domain.  The
> key point is that you can't overload the meaning of
> <escape><terminator>: either it is an instruction to the lexer, or it
> has semantic meaning, but not both.  It doesn't even matter if the
> lexer reads the dictionary definition before reading in the string
> value: if two meanings are possible, the dictionary layer faces the
> same problem.
>
> So: here is my latest proposal to deal with this issue:
>
> 1.  As in CIF1, there is no lexical elision available at all, ever.
> All instances of the terminator terminate (unlike CIF1).
>
> 2.  Dictionary writers anticipate when a string value may run into
> trouble due to this lack of elision (because those string values could
> contain all of triple quote/triple double quote/<eol><semicolon>) and
> describe a workaround in the dictionary: for example, inserting a
> space between <eol> and <semicolon> when writing these string values,
> and removing the space when reading them back.  We could provide
> support by defining a special string type in DDLm with these
> properties.
>
> I believe that this deals with all real and imagined problems.
>
> On Sat, Nov 21, 2009 at 12:40 PM, Joe Krahn <krahn@niehs.nih.gov> wrote:
>> Clearly, Herbert is not referring to the 'reading and writing
>> application' as the parser, but the application calling the parser. It
>> makes things easier for the parser, but harder for the caller. It would
>> not be that much of a problem, except that there are now several ways to
>> quote strings, and the disallowed character sequences that need encoding
>> vary among them.
>>
>> Herbert seems to view "the calling application" as a middle layer,
>> rather than the program making use of the data. That sort of makes
>> sense, in that conversion between strings and numeric values cannot
>> happen at the CIF level. You could argue that a dictionary level middle
>> layer is required to convert data to the final end-user form, and that
>> handling character conversions at that level is more flexible. In
>> general, that is a reasonable approach. However, even in that case, I
>> think it is much less problematic to handle the few conversions that are
>> specific to a given string quoting method at the parser level.
>>
>> Joe
>>
>> James Hester wrote:
>>> First in reply to Joe: I believe that when Nick refers to the 'reading
>>> and writing application' he indeed has in mind the CIF parser/CIF
>>> writer layer, so I would guess that he agrees with your opinion as
>>> well.  The issue is that we do not present an opaque storage format,
>>> unlike SQL or HDF; it is pretty easy to create and manipulate CIFs
>>> with text tools, so we need to cater to this method of interfacing to
>>> CIFs as well.
>>>
>>> In reply to Herbert: your suggestion implies that we abandon any
>>> *lexical* meaning for <elide><terminator>.  Or are you suggesting that
>>> an application reads the dataname, then looks up the dictionary to
>>> decide if it should continue to input the string when it sees
>>> <elide><terminator>?  So we have dictionary-driven parsing?
>>>
>>> I can't work out from your previous email whether you are now in
>>> support of abandoning elision as well as supporting treating all
>>> strings as raw.  Please clarify...
>>>
>>> On Sat, Nov 21, 2009 at 6:44 AM, Herbert J. Bernstein
>>> <yaya@bernstein-plus-sons.com> wrote:
>>>> Dear Colleagues,
>>>>
>>>>   There is a difference between what are useful utilities to have in
>>>> an API in support of CIF2 and what is formally part of the base CIF2.
>>>> I am all in favor of utilities to apply and unapply the various
>>>> uses for the reverse solidus -- one for cleaning up python-style
>>>> use, one to handle the IUCr special characters, one for line folding,
>>>> etc., but I don't think that means we have to make one of those
>>>> particular uses formally part of the base CIF2.
>>>>
>>>>    Regards,
>>>>      Herbert
>>>>
>>>> =====================================================
>>>>   Herbert J. Bernstein, Professor of Computer Science
>>>>     Dowling College, Kramer Science Center, KSC 121
>>>>          Idle Hour Blvd, Oakdale, NY, 11769
>>>>
>>>>                   +1-631-244-3035
>>>>                   yaya@dowling.edu
>>>> =====================================================
>>>>
>>>> On Fri, 20 Nov 2009, Joe Krahn wrote:
>>>>
>>>>> Unlike others here, I feel that a proper text archive library should be
>>>>> able to take any string from the calling application, and return that
>>>>> exact same string when reading it back in. It is the job of the archive
>>>>> format to avoid delimiter problems. An application should be able to
>>>>> store and retrieve strings without such worries, and interface to an SQL
>>>>> database the same as it would interface to CIF. All commonly used
>>>>> database libraries work this way. Why should CIF continue to take an
>>>>> archaic approach?
>>>>>
>>>>> I essentially agree with the design below, except that the library
>>>>> should handle insertion and removal of the reverse solidus for the
>>>>> limited cases where it is required.
>>>>>
>>>>> If it is the client application's responsibility to deal with reverse
>>>>> solidus escape sequences, then the description below doesn't make sense.
>>>>> In that case, the reverse solidus never has any special meaning to CIF2.
>>>>> Instead, CIF2 simply disallows certain character sequences. A client
>>>>> application can use whatever it wants to encode/decode the disallowed
>>>>> character sequences.
>>>>>
>>>>> The advantage of having well-defined escape sequences at the I/O library
>>>>> level is that updates to the format do not require updates to client
>>>>> applications. A CIF client application should be able to send a string
>>>>> to the CIF library, and not have to know in advance what CIF revision is
>>>>> in use, or whether the string is semicolon block quoted or triple
>>>>> quoted. By requiring the client to escape invalid sequences, the client
>>>>> will have to escape strings differently, i.e. triple quote is OK within
>>>>> semi-colon quotes, and a leading semicolon is OK within triple quotes,
>>>>> but not the other way around.
>>>>>
>>>>> Joe Krahn
>>>>>
>>>>>
>>>>> Nick Spadaccini wrote:
>>>>>> SUMMARISING.
>>>>>>
>>>>>> (a) The contents of delimited strings are returned as raw, with the token
>>>>>> delimiters removed.
>>>>>> (b) Where a delimiter character is to be part of the string, that character
>>>>>> must be preceded by a reverse solidus when written out to the file. When
>>>>>> read, any reverse solidus preceding a terminating character is deleted.
>>>>>> (c) It is the responsibility of the writing and reading application to
>>>>>> insert and remove the reverse solidus preceding the terminating character.
>>>>>> (d) Otherwise the presence of a reverse solidus in the string has no
>>>>>> meaning.
>>>>> _______________________________________________
>>>>> ddlm-group mailing list
>>>>> ddlm-group@iucr.org
>>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>>
>>>>
>>>
>>>
>>>
>>
>>
>
>
>
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
>

_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
