Re: [ddlm-group] Use of elides in strings

I'll try again.

Yes Simon, what I am suggesting here, following James' proposal, is
different from the conclusion at the end of THREAD 3.


(1) This is the CIF2 specification. What is done in CIF1 encoding is irrelevant.
ALL parsers will have to drop in and out of CIF1 and CIF2 modes for legacy
reasons. Please desist from including CIF1 examples, they are a distraction.
(2) James proposes (and I support) that the very special behaviour of a
terminator (AND ONLY THIS CHARACTER) within a string delimited by the same
character is a CIF2 issue that should be handled by the parser (read this to
mean the application that is responsible for writing and reading CIF2
files). WE MUST agree where this is done, because users have to know, when
they write or read a string, who will handle the specialness of terminators.

Herb and John argue it is the user's responsibility. Sorry, but most users
would not understand what the issues are (there is enough confusion amongst
us to make that clear).
(3) As an example let's just discuss "" delimited strings, where I wish to
include a " in the string. The proposal is that the parser ONLY EVER deals
with the " character. ALL OTHER ELIDES etc are irrelevant - they will be
passed on as raw. The \\ is NOT a special case, and I can see no reason for
it to be considered.

The process is: if the string to be written out includes a ", it will be
elided (preceded by a \) as it is written out. When reading, as you parse
left to right, when you find a ", if it has a preceding \ delete the \, skip
over the " and continue; otherwise it must be the terminating character.
Everything else is left untouched.


HB example

Say a user (Dr Joe Ordinary) wishes to output in a "" delimited string

(1) abcd\"efgh. The parser would output "abcd\\"efgh"
(2) abcd"efgh. The parser would output "abcd\"efgh"

JW example

[][ _(),.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*. The parser would output
"[][ _(),.;:\"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*"

The proposal is that ONLY \" is of any relevance to this process, all other
elides are irrelevant. John do you really expect to leave it up to the user
to know that the \" in your regular expression was inserted as a CIF2
requirement, rather than an actual regex?
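
To make the process and the two examples above concrete, here is a rough
sketch in Python. The function names write_delimited and read_delimited are
made up for illustration only; they are not part of any CIF2 library or of
the specification, and the sketch assumes a single-character delimiter.

```python
BACKSLASH = "\\"


def write_delimited(value, delim='"'):
    """Wrap value in delim, putting a reverse solidus before each embedded delim.

    No other character is touched; an existing reverse solidus is NOT doubled.
    """
    return delim + value.replace(delim, BACKSLASH + delim) + delim


def read_delimited(text, delim='"'):
    """Read one delimited string that starts at text[0], scanning left to right.

    A delim preceded by a reverse solidus is kept and the reverse solidus is
    dropped; any other delim is taken to be the terminator.
    """
    if not text.startswith(delim):
        raise ValueError("text does not start with the delimiter")
    out = []
    for ch in text[1:]:
        if ch == delim:
            if out and out[-1] == BACKSLASH:
                out[-1] = delim  # replace the eliding reverse solidus
            else:
                return "".join(out)  # unelided delimiter: end of string
        else:
            out.append(ch)
    # Also reached for a value that ends in a reverse solidus, which the
    # rule as stated above does not cover.
    raise ValueError("unterminated string")


# Round trip of the two HB examples.
for value in ('abcd\\"efgh', 'abcd"efgh'):
    assert read_delimited(write_delimited(value)) == value
```

The user hands over abcd"efgh and gets abcd"efgh back; the \" only ever
exists in the file.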

Again the price is you can't cut and paste, but you can't do that anyway
whether you take the JRH/NS view or the HB/JW view.

I see real strength in requiring the parser to do this for the user. I can
see no downside, but I do see downsides if you expect all users to know what
to do.

Herb is correct that all other handling of elides (the encodings) is left to
the dictionary level definitions. JW quite correctly states that they avoid
the problem by using a different delimiter character to the character they
wish to insert. There will be very rare cases though, where that may not be
possible and the above HAS TO be done.

On 23/11/09 2:29 AM, "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com> wrote:

> Yes, I would prefer that the lexer and parser deliver
> "A\"BC"
> as
> A\"BC
> -- Herbert
> =====================================================
>   Herbert J. Bernstein, Professor of Computer Science
>     Dowling College, Kramer Science Center, KSC 121
>          Idle Hour Blvd, Oakdale, NY, 11769
>                   +1-631-244-3035
>                   yaya@dowling.edu
> =====================================================
> On Sun, 22 Nov 2009, SIMON WESTRIP wrote:
>> Thanks for this
>> Does everyone else agree that the value in
>> _label "A\"BC"
>> is A\"BC ?
>> Sorry to keep on about this, but at the end of THREAD3 I understood this to
>> be the conclusion, then more recently it looked like there was some
>> acceptance of dropping the backslash in a context-sensitive manner - i.e.
>> the value being A"BC in this particular case.
>> Cheers
>> Simon
>> ____________________________________________________________________________
>> From: John Westbrook <jwest@pdb-mail.rutgers.edu>
>> To: SIMON WESTRIP <simonwestrip@btinternet.com>
>> Cc: jwest@rcsb.rutgers.edu; Group finalising DDLm and associated
>> dictionaries <ddlm-group@iucr.org>
>> Sent: Sunday, 22 November, 2009 15:12:10
>> Subject: Re: [ddlm-group] Use of elides in strings
>> Hi Simon -
>> Subject to the regex example we would process this as A\"BC
>> as the '\' is allowed in the regex.  We have loads of similar cases in which
>> there is an embedded quote in a character string which is not surrounded
>> by whitespace.  When Nick visited us recently he analyzed these cases
>> and we agreed that we would be able to quote these in the opposite
>> sense, 'AB"C' or "AB'C", or in semi-colons for the odd case in which
>> both should occur.    In none of these cases would we expect the
>> internal quote to be escaped.
>> John
>>> lexical analysis and parsing aside, in terms of specifying the syntax of
>>> CIF2, what should someone expect the following to represent:
>>> _label "A\"BC"
>>> Is the value A\"BC or A"BC?
>>> I'm talking CIF2 only here (the use of elides for greek etc in CIF1 will
>>> no longer be part of the spec; rather it can be handled at the application
>>> level or perhaps defined in the dictionary in some way as some sort of item
>>> content type?)
>>> Cheers
>>> Simon
>>> ________________________________
>>> From: John Westbrook <jwest@pdb-mail.rutgers.edu>
>>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>>> Sent: Saturday, 21 November, 2009 14:02:19
>>> Subject: Re: [ddlm-group] Use of elides in strings
>>> To take another example in support of passing the data without
>>> processing back to the application, DDL2 depends heavily on
>>> using dictionary regex's to define the interpretation for
>>> the application.  For instance, the regex for an atom code in
>>> our dictionary is -
>>> [][ _(),.;:"&<>/\{}'`~!@#$%?+=*A-Za-z0-9|^-]*
>>> Not only does processing this regex present an issue for the
>>> treatment of the '\', but it also defines how the '\' will
>>> be interpreted in data subject to the regex.
>>> I agree with Herb's conclusion that the lexer should do
>>> the minimum of interpretation.
>>> Regards,
>>> John
>>> Herbert J. Bernstein wrote:
>>>> Dear Colleagues,
>>>>   Let us consider James' example.  He is actually making the case
>>>> for _not_ removing the reverse-solidus from a string at the
>>>> lexical level.
>>>>   xxxx<backslash><quote>elxxxx
>>>> or to be more specific
>>>>   abcd\'efgh
>>>> and we are presented with the question of how should the
>>>> dictionary interpret that string.
>>>> If we have a string intended to be part of the modern pythonesque
>>>> world, then I would expect the data element to have been typed
>>>> in a way that says we should read the string as
>>>>   abcd'efgh
>>>> If we have a string that is a legacy from a CIF 1 file with
>>>> IUCr type-setting codes, I would expect the data element to
>>>> have been typed in a way that says we should read the string as
>>>>   abcd{e with an acute accent}fgh
>>>> Anything the lexer does to remove the reverse-solidus is
>>>> going to disfavor one interpretation or the other.
>>>> By moving these two interpretations one level up to two
>>>> different utility routines, we gain much more use from
>>>> a common lexer and nobody loses any functionality.
>>>>   Regards,
>>>>     Herbert
>>>> On Sat, 21 Nov 2009, James Hester wrote:
>>>>> Joe, I agree with you.  There is a fundamental issue here that I have
>>>>> already raised, but can't see Herbert and John's proposal addressing:
>>>>> if we allow lexical escaping, but then pass on both the escaping and
>>>>> escaped character, how does the dictionary layer know if a given
>>>>> character sequence represents an escape, or corresponds to something
>>>>> else?  If the dictionary layer gets a string like:
>>>>> 'xxxx<backslash><quote>elxxxx', does that mean:
>>>>> 'xxxx<quote>elxxxx'
>>>>> or does it mean
>>>>> 'xxxx<e acute>lxxxx' ?
>>>>> (First case might be from a string "He said 'elephants are pink' ",
>>>>> second case "Fren<e acute>l formalism" (apologies to French speakers,
>>>>> I have no idea when to use e acute)).
>>>>> Similar examples can be constructed no matter what the alternative
>>>>> meaning of <backslash><quote> might be in the particular domain.  The
>>>>> key point is that you can't overload the meaning of
>>>>> <escape><terminator>: either it is an instruction to the lexer, or it
>>>>> has semantic meaning, but not both.  It doesn't even matter if the
>>>>> lexer reads the dictionary definition before reading in the string
>>>>> value: if two meanings are possible, the dictionary layer faces the
>>>>> same problem.
>>>>> So: here is my latest proposal to deal with this issue:
>>>>> 1.  As in CIF1, there is no lexical elision available at all, ever.
>>>>> All instances of the terminator terminate (unlike CIF1).
>>>>> 2.  Dictionary writers anticipate when a string value may run into
>>>>> trouble due to this lack of elision (because those string values could
>>>>> contain all of triple quote/triple double quote/<eol><semicolon>) and
>>>>> describe a workaround in the dictionary: for example, inserting a
>>>>> space between <eol> and <semicolon> when writing these string values,
>>>>> and removing the space when reading them back.  We could provide
>>>>> support by defining a special string type in DDLm with these
>>>>> properties.
>>>>> I believe that this deals with all real and imagined problems.
>>>>> On Sat, Nov 21, 2009 at 12:40 PM, Joe Krahn <krahn@niehs.nih.gov> wrote:
>>>>>> Clearly, Herbert is not referring to the 'reading and writing
>>>>>> application' as the parser, but the application calling the parser. It
>>>>>> makes things easier for the parser, but harder for the caller. It would
>>>>>> not be that much of a problem, except that there are now several ways to
>>>>>> quote strings, and the disallowed character sequences that need encoding
>>>>>> vary among them.
>>>>>> Herbert seems to view "the calling application" as a middle layer,
>>>>>> rather than the program making use of the data. That sort of makes
>>>>>> sense, in that conversion between strings and numeric values cannot
>>>>>> happen at the CIF level. You could argue that a dictionary level middle
>>>>>> layer is required to convert data to the final end-user form, and that
>>>>>> handling character conversions at that level is more flexible. In
>>>>>> general, that is a reasonable approach. However, even in that case, I
>>>>>> think it is much less problematic to handle the few conversions that are
>>>>>> specific to a given string quoting method at the parser level.
>>>>>> Joe
>>>>>> James Hester wrote:
>>>>>>> First in reply to Joe: I believe that when Nick refers to the 'reading
>>>>>>> and writing application' he indeed has in mind the CIF parser/CIF
>>>>>>> writer layer, so I would guess that he agrees with your opinion as
>>>>>>> well.  The issue is that we do not present an opaque storage format,
>>>>>>> unlike SQL or HDF; it is pretty easy to create and manipulate CIFs
>>>>>>> with text tools, so we need to cater to this method of interfacing to
>>>>>>> CIFs as well.
>>>>>>> In reply to Herbert: your suggestion implies that we abandon any
>>>>>>> *lexical* meaning for <elide><terminator>.  Or are you suggesting that
>>>>>>> an application reads the dataname, then looks up the dictionary to
>>>>>>> decide if it should continue to input the string when it sees
>>>>>>> <elide><terminator>?  So we have dictionary-driven parsing?
>>>>>>> I can't work out from your previous email whether you are now in
>>>>>>> support of abandoning elision as well as supporting treating all
>>>>>>> strings as raw.  Please clarify...
>>>>>>> On Sat, Nov 21, 2009 at 6:44 AM, Herbert J. Bernstein
>>>>>>> <yaya@bernstein-plus-sons.com> wrote:
>>>>>>>> Dear Colleagues,
>>>>>>>>   There is a difference between what are useful utilities to have in
>>>>>>>> an API in support of CIF2 and what is formally part of the base CIF2.
>>>>>>>> I am all in favor of utilities to apply and unapply the various
>>>>>>>> uses for the reverse solidus -- one for cleaning up python-style
>>>>>>>> use, one to handle the IUCr special characters, one for line folding,
>>>>>>>> etc., but I don't think that means we have to make one of those
>>>>>>>> particular uses formally part of the base CIF2.
>>>>>>>>   Regards,
>>>>>>>>     Herbert
>>>>>>>> On Fri, 20 Nov 2009, Joe Krahn wrote:
>>>>>>>>> Unlike others here, I feel that a proper text archive library should be
>>>>>>>>> able to take any string from the calling application, and return that
>>>>>>>>> exact same string when reading it back in. It is the job of the archive
>>>>>>>>> format to avoid delimiter problems. An application should be able to
>>>>>>>>> store and retrieve strings without such worries, and interface to an SQL
>>>>>>>>> database the same as it would interface to CIF. All commonly used
>>>>>>>>> database libraries work this way. Why should CIF continue to take an
>>>>>>>>> archaic approach?
>>>>>>>>> I essentially agree with the design below, except that the library
>>>>>>>>> should handle insertion and removal of the reverse solidus for the
>>>>>>>>> limited cases where it is required.
>>>>>>>>> If it is the client application's responsibility to deal with reverse
>>>>>>>>> solidus escape sequences, then the description below doesn't make sense.
>>>>>>>>> In that case, the reverse solidus never has any special meaning to CIF2.
>>>>>>>>> Instead, CIF2 simply disallows certain character sequences. A client
>>>>>>>>> application can use whatever it wants to encode/decode the disallowed
>>>>>>>>> character sequences.
>>>>>>>>> The advantage of having well-defined escape sequences at the I/O library
>>>>>>>>> level is that updates to the format do not require updates to client
>>>>>>>>> applications. A CIF client application should be able to send a string
>>>>>>>>> to the CIF library, and not have to know in advance what CIF revision is
>>>>>>>>> in use, or whether the string is semicolon block quoted or triple
>>>>>>>>> quoted. By requiring the client to escape invalid sequences, the client
>>>>>>>>> will have to escape strings differently, i.e. triple quote is OK within
>>>>>>>>> semi-colon quotes, and a leading semicolon is OK within triple quotes,
>>>>>>>>> but not the other way around.
>>>>>>>>> Joe Krahn
>>>>>>>>> Nick Spadaccini wrote:
>>>>>>>>>> SUMMARISING.
>>>>>>>>>> (a) The contents of delimited strings are returned as raw, with the
>>>>>>>>>> token delimiters removed.
>>>>>>>>>> (b) Where a delimiter character is to be part of the string, that
>>>>>>>>>> character must be preceded by a reverse solidus when written out to
>>>>>>>>>> the file. When read, any reverse solidus preceding a terminating
>>>>>>>>>> character is deleted.
>>>>>>>>>> (c) It is the responsibility of the writing and reading application
>>>>>>>>>> to insert and remove the reverse solidus preceding the terminating
>>>>>>>>>> character.
>>>>>>>>>> (d) Otherwise the presence of a reverse solidus in the string has no
>>>>>>>>>> meaning.
>>>>>>>>> _______________________________________________
>>>>>>>>> ddlm-group mailing list
>>>>>>>>> ddlm-group@iucr.org
>>>>>>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>> -- T +61 (02) 9717 9907
>>>>> F +61 (02) 9717 3145
>>>>> M +61 (04) 0249 4148



Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au
