
Re: [ddlm-group] Technical issues with Proposal P

I think the application has no choice but to accept that the data value contains the backslash.

I assume the onus is on the user to be aware that the backslash will be
parsed as part of the data value and processed as such by the CIF application
(hence 'potentially confusing' - certainly a 'potential pitfall').
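
For anyone who wants to check the values under discussion, here is a
minimal sketch. It assumes that Python 3 triple-quoted literals behave the
way Proposal P's delimiters would; the helper name show() is mine and only
exists for illustration.

```python
# Python 3 literals standing in for Proposal P's triple-quoted strings
# (an assumption; a real CIF2 parser is not involved here).
cooked = """C\""""   # non-raw: the lexer consumes the backslash
raw = r"""C\""""     # raw: the backslash stays in the data value


def show(value):
    """Spell out a value character by character (illustrative helper)."""
    return " ".join(f"<{ch}>" for ch in value)


print(show(cooked))  # <C> <">
print(show(raw))     # <C> <\> <">
```

The raw value is three characters long, so an application that receives it
sees the backslash as ordinary content.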

Cheers

Simon


From: James Hester <jamesrhester@gmail.com>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Tuesday, 22 February, 2011 22:54:35
Subject: Re: [ddlm-group] Technical issues with Proposal P

I am trying to focus relentlessly on a particular and very real
technical issue.  I repeat that I am not concerned about the
transformation from surface syntax to a sequence of characters.  I
accept that that is well-defined and unambiguous for all proposals on
the table.  If you think that IDLE can resolve this problem, you
haven't understood my question.

My question relates to the next step: how does the CIF application
downstream from the parser interpret this sequence of characters?
Under all previous incarnations of CIF, it was safe to assume that no
artefacts of syntactical representation were left in the string, so
the string had purely domain-specific meaning.  However, with the
introduction of raw strings, <backslash><delimiter> will escape the
delimiter, but the <backslash> is required to remain in the string.
So the downstream application must distinguish between artefacts of the
syntactical representation (<backslash><delimiter>) that have remained
in raw strings, and domain-specific character sequences
(<backslash><delimiter>).  Here are those examples again (remember
this is the character sequence after syntactic processing):

<start> I have no idea what the last characters of this string are\"<finish>
<start> Does this string have two\""" or three internal quotes?<finish>

Assume the domain-specific meaning of <backslash><quote> when found in
a datavalue is to accent the letter preceding the <backslash>.

Does the first string finish with a double quote, or with an accented e?
Does the second string contain an accented o, followed by two double
quotes, or a letter o followed by three quotes?
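
To make the two readings concrete, here is a rough sketch in Python 3.
Both functions are hypothetical policies that a downstream application
might adopt, not anything defined by Proposal P, and the choice of a
combining diaeresis as "the accent" is an arbitrary assumption for
illustration.

```python
# Character sequences as handed over by the parser (after syntactic
# processing), written here as ordinary Python string values.
first = 'I have no idea what the last characters of this string are\\"'
second = 'Does this string have two\\""" or three internal quotes?'


def as_leftover_elide(s):
    """Hypothetical reading (a): <backslash><quote> is an elide left
    behind in a raw string, so the backslash is dropped."""
    return s.replace('\\"', '"')


def as_accent(s):
    """Hypothetical reading (b): <backslash><quote> accents the letter
    before the backslash (combining diaeresis chosen arbitrarily)."""
    out = []
    i = 0
    while i < len(s):
        if s.startswith('\\"', i) and out:
            out[-1] += "\u0308"  # attach the accent to the previous letter
            i += 2               # skip the backslash and the quote
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


for value in (first, second):
    print(as_leftover_elide(value))
    print(as_accent(value))
```

Both functions accept exactly the same input, and nothing in the character
sequence itself indicates which one the author of the file intended.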


On Wed, Feb 23, 2011 at 8:01 AM, SIMON WESTRIP
<simonwestrip@btinternet.com> wrote:
> Dear all
>
> Reviewing the exchanges in this thread ("Technical issues with Proposal P"),
> it seems that
> the 'technical issues' might better be described as 'potentially confusing
> issues' :-)
> That is, under proposal P, there is no ambiguity about how the string should
> be read, but
> there is potential for misinterpretation by the user (e.g. an erroneous
> assumption that by using a backslash
> to escape a quotation mark, the backslash will not be included as part of
> the parsed data value (in the raw variant)).
> So, as John says, perhaps this simply demonstrates that "the complexity of
> the syntax and semantics
> provided by proposal P would be likely to be a source of confusion for
> developers and users both", and maybe
> therein lies the merit of this particular thread? It reinforces those
> arguments against proposal P that suggest
> that the introduction of a more complex syntax for one of the delimiter
> types is a potential source of
> confusion for many existing CIF users.
>
> Cheers
>
> Simon
> ________________________________
> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
> Sent: Tuesday, 22 February, 2011 20:22:57
> Subject: Re: [ddlm-group] Technical issues with Proposal P
>
> Dear Simon,
>
>   I make mistakes on this, too.  That is why I like having IDLE
> handy and sticking to Python syntax.
>
>   Regards,
>     Herbert
>
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
>   Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                 +1-631-244-3035
>                 yaya@dowling.edu
> =====================================================
>
> On Tue, 22 Feb 2011, SIMON WESTRIP wrote:
>
>> Dear Herbert - I've just realized I confused myself by misreading your
>> example
>> and treating it as equivalent to my own example! Sorry about that.
>>
>> Cheers
>>
>> Simon
>>
>>
>>
>> _______________________________________________________________________________________________________________________
>> From: SIMON WESTRIP <simonwestrip@btinternet.com>
>> To: Group finalising DDLm and associated dictionaries
>> <ddlm-group@iucr.org>
>> Sent: Tuesday, 22 February, 2011 14:51:03
>> Subject: Re: [ddlm-group] Technical issues with Proposal P
>>
>> Dear Herbert
>>
>> I'm still a bit confused. Following python semantics,
>> a CIF application reading the following items
>>
>> _item_a """C\""""
>> _item_b r"""C\""""
>>
>> should return values of
>>
>> C" for _item_a
>> C\" for _item_b
>>
>> Are you suggesting that the application should then *assume* that in the
>> case of
>> _item_b the use of the backslash was purely to escape the final quote and
>> should
>> discard the backslash from the value, thus assuming a value of C" ?
>>
>> Cheers
>>
>> Simon
>>
>>
>> _______________________________________________________________________________________________________________________
>> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>> To: Group finalising DDLm and associated dictionaries
>> <ddlm-group@iucr.org>
>> Sent: Tuesday, 22 February, 2011 13:51:02
>> Subject: Re: [ddlm-group] Technical issues with Proposal P
>>
>> Dear Simon,
>>
>>   From the point of view of writing a pure "CIF2" application
>> that is not aware of the whitespace, particular quote marks,
>> comments, etc., those two strings are identical.
>>
>>   From the point of view of a more general CIF API, in which
>> comments, magic numbers, and particular quote marks are exposed,
>> those two strings are different in precisely the same way that
>> the strings 'ABC' and "ABC" are different, and 13.4 and
>> 1.34e1 are different.
>>
>>   This is _not_ an ambiguity.  It is a matter of whether
>> we are looking for the information in a file or looking
>> for the representations of the data in the file.
>>
>>   Regards,
>>     Herbert
>>
>>
>> =====================================================
>> Herbert J. Bernstein, Professor of Computer Science
>>   Dowling College, Kramer Science Center, KSC 121
>>         Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                 +1-631-244-3035
>>                 yaya@dowling.edu
>> =====================================================
>>
>> On Tue, 22 Feb 2011, SIMON WESTRIP wrote:
>>
>> > So
>> > """\\\"""" and r"""\""""
>> > should strictly be treated as different, despite any recommendations you
>> > may
>> > have made to the contrary?
>> >
>> >
>> >
>> > ____________________________________________________________________________
>> > From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>> > To: Group finalising DDLm and associated dictionaries
>> > <ddlm-group@iucr.org>
>> > Sent: Tuesday, 22 February, 2011 12:46:57
>> > Subject: Re: [ddlm-group] Technical issues with Proposal P
>> >
>> > > So what is r"""C\"""" ?
>> > >
>> > > Is it C\" or is it C" ?
>> >
>> > """C\"""" is C"
>> >
>> > r"""C\"""" is C\"
>> >
>> > You can test this with IDLE.  It is very clearly defined and
>> > reproducible Python string behavior, and I believe helps to make
>> > the case for sticking to the Python approach.  It is very easy
>> > for any software developer or user to work out how the boundary
>> > cases are being handled.
>> >
>> > Regards,
>> >   Herbert
>> >
>> > =====================================================
>> > Herbert J. Bernstein, Professor of Computer Science
>> >   Dowling College, Kramer Science Center, KSC 121
>> >         Idle Hour Blvd, Oakdale, NY, 11769
>> >
>> >                 +1-631-244-3035
>> >                 yaya@dowling.edu
>> > =====================================================
>> >
>> > On Tue, 22 Feb 2011, SIMON WESTRIP wrote:
>> >
>> > > I am a little confused:
>> > >
>> > > So what is r"""C\"""" ?
>> > >
>> > > Is it C\" or is it C" ?
>> > >
>> > > Python says it should be C\", so CIF2 should say it's C\" if CIF2 is
>> > > adopting Python?
>> > >
>> > > Or are you suggesting that we should adopt a fuzzy interpretation of
>> > > Python?
>> > >
>> > > Cheers
>> > >
>> > > Simon
>> > >
>> >
>> > > ____________________________________________________________________________
>> > > From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>> > > To: Group finalising DDLm and associated dictionaries
>> > <ddlm-group@iucr.org>
>> > > Sent: Tuesday, 22 February, 2011 12:01:23
>> > > Subject: Re: [ddlm-group] Technical issues with Proposal P
>> > >
>> > > Dear Colleagues,
>> > >
>> > >   Working under the assumption of Ralf's proposal, rather
>> > > than Simon's, we have several very distinct string presentations
>> > > to consider:  a (non-raw) treble quoted string, a raw treble
>> > > quoted string, a unicode treble quoted string and a raw unicode
>> > > treble quoted string.  As for Python 3, under CIF2, because
>> > > the "native" character encoding is UTF-8, under reasonable coding
>> > > constraints, this collapses to just two cases the application
>> > > needs to deal with:  non-raw (i.e. cooked) versus raw.  The intent of
>> > > the cooked is for the lexer to process the elides, so the response
>> > > I gave is, I believe, correct -- just push the string through IDLE.
>> > > The intent of the raw is precisely to push through the string
>> > > with the backslashes still in place, e.g. for TeX text in which
>> > > you don't want to double-up your backslashes.  While I personally
>> > > would recommend against such a use of raw, it is not ambiguous.
>> > > It gives the application a very well-defined string of characters
>> > > to deal with.  Yes, there are applications that are intended to
>> > > deal with CIF with the encoding exposed (e.g. cif2cbf, cif2cif, etc.),
>> > > but I agree that the cleanest design is for an application to
>> > > only make use of the string content, not the representation.
>> > >
>> > >   Thus, for most applications, I would recommend that they treat
>> > >
>> > >   """\\\"""" and r"""\""""
>> > >
>> > > as equivalent, but I would recommend that applications that are,
>> > > for example, intended to make faithful copies of the
>> > > representations treat them as different.
>> > >
>> > >   We have had, and will continue to have this subtle problem
>> > > with all versions of CIF in the handling of things such as
>> > > magic numbers, comments, white space, line folding, and choices
>> > > of quoting characters.  I don't see how the introduction of
>> > > the Python treble quote makes the situation any worse or
>> > > any more or less ambiguous.
>> > >
>> > >   Regards,
>> > >     Herbert
>> > >
>> > > =====================================================
>> > >   Herbert J. Bernstein, Professor of Computer Science
>> > >     Dowling College, Kramer Science Center, KSC 121
>> > >         Idle Hour Blvd, Oakdale, NY, 11769
>> > >
>> > >                   +1-631-244-3035
>> > >                   yaya@dowling.edu
>> > > =====================================================
>> > >
>> > > On Tue, 22 Feb 2011, James Hester wrote:
>> > >
>> > > > I will focus this email on the technical issues and try to return to
>> > > > the other issues at a later date (I've changed the subject
>> > > > accordingly)
>> > > >
>> > > > [edit]
>> > > >
>> > > > My apologies for not being clear: my examples of embedded elides
>> > > > already give the internal representation of the strings,
>> > > > deliberately
>> > > > leaving out the particular delimiters that might have been used to
>> > > > produce those strings.  Herbert mistakenly thought I was giving
>> > > > triple-double-quote delimited strings and asking what the internal
>> > > > representation was. So, unfortunately, IDLE cannot help here, as the
>> > > > internal representation is not in question.
>> > > >
>> > > > My question therefore remains: how does the CIF application
>> > > > interpret
>> > > > these strings? Is the <backslash><delimiter> in my examples simply
>> > > > an
>> > > > elide that could not be removed from a raw string and therefore
>> > > > should
>> > > > be ignored, or is it a character sequence intended for the
>> > > > application
>> > > > (eg a LaTeX accent on the o or e)?
>> > > >
>> > > > In your answer you may assume that the CIF application knows that
>> > > > the
>> > > > string was a raw string delimited by triple double quotes (even
>> > > > though
>> > > > requiring communication of such information would be a very
>> > > > unfortunate loss of clean design).
>> > > >
>> > > > Those strings again:
>> > > >
>> > > > <start> I have no idea what the last characters of this string are\"<finish>
>> > > > <start> Does this string have two\""" or three internal quotes?<finish>
>> > > >
>> > > >
>> > > > Herbert writes:
>> > > >>   Now for your two examples of embedded elides of quotes:
>> > > >>
>> > > >> <start> I have no idea what the last characters of this string are\"<finish>
>> > > >>
>> > > >> is, internally, as a C-string
>> > > >>
>> > > >> I have no idea what the last characters of this string are"\0
>> > > >>
>> > > >> <start> Does this string have two\""" or three internal quotes?<finish>
>> > > >>
>> > > >> is, internally as a C-string
>> > > >>
>> > > >> Does this string have two""" or three internal quotes?\0
>> > > >>
>> > > >> I settled that by simply cranking up IDLE and doing:
>> > > >>
>> > > >> >>> print """I have no idea what the last characters of this string are\""""
>> > > >> I have no idea what the last characters of this string are"
>> > > >> >>> print """Does this string have two\""" or three internal quotes?"""
>> > > >> Does this string have two""" or three internal quotes?
>> > > >>
>> > > >> As you well know, having IDLE around is a big help.
>> > > >>
>> > > >>   Thank you again for taking the time to clarify your position
>> > > >> on Ralf's proposal.  I think I now understand why you prefer
>> > > >> Simon's
>> > > >> proposal.
>> > > >>
>> > > >>   Regards,
>> > > >>     Herbert
>> > > >>
>> > > >>
>> > > >>
>> > > >>
>> > > >>
>> > > >
>> > > >>> One technical issue with Proposal P that has not been resolved is
>> > > >>> how
>> > > >>> a CIF application is supposed to interpret the sequence
>> > > >>> <backslash><double quote> when encountered in a string returned
>> > > >>> from
>> > > >>> the parser.  Is this sequence:
>> > > >>> (a) a terminator elide sequence that was left in a raw string, so
>> > > >>> corresponds to <double quote>?
>> > > >>> (b) something with meaning for the application so should be
>> > > >>> <backslash><double quote>?
>> > > >>>
>> > > >>> Please therefore advise how a CIF application will disambiguate
>> > > >>> the
>> > > >>> following string content from a Proposal P parser:
>> > > >>>
>> > > >>> <start> I have no idea what the last characters of this string are\"<finish>
>> > > >>> <start> Does this string have two\""" or three internal quotes?<finish>
>> > > >>>
>> > > >>> James
>> > > >>>
>> > > >
>> > > >
>> > > >
>> > > > --
>> > > > T +61 (02) 9717 9907
>> > > > F +61 (02) 9717 3145
>> > > > M +61 (04) 0249 4148
>> > > > _______________________________________________
>> > > > ddlm-group mailing list
>> > > > ddlm-group@iucr.org
>> > > > http://scripts.iucr.org/mailman/listinfo/ddlm-group
>> > > >
>> > >
>> > >
>> >
>> >
>>
>>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
>



--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group