Re: [ddlm-group] Use of elides in strings
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Use of elides in strings
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Tue, 24 Nov 2009 18:53:42 -0500 (EST)
- In-Reply-To: <279aad2a0911241454h12811f4eqfc47dd5eafa22c84@mail.gmail.com>
- References: <279aad2a0911231800g6c26bdaancdd4a38fecebbb7a@mail.gmail.com><C731AC95.125CB%nick@csse.uwa.edu.au><279aad2a0911241414j1d89b6b3mfec464fdc401fbfd@mail.gmail.com><alpine.BSF.2.00.0911241717100.78685@epsilon.pair.com><279aad2a0911241454h12811f4eqfc47dd5eafa22c84@mail.gmail.com>
Dear James,

I started to write:

"No, in CIF 1.1, none of the terminal quote marks, including the \n;, are
effective unless followed by whitespace (\n, space, tab, or end of file).
This is a well-established and very tricky part of the CIF spec going back
to 1990. That is why Nick had to explicitly specify that a terminal quote
mark would be effective no matter what it was followed by."

But the grammar currently on the IUCr web site is _not_ the one that I
recall COMCIFs discussing and approving. It now explicitly removes the
requirement for terminal white space in the special case of the \n; text
field terminator. I don't recall when that change was adopted, but it
appears that you are right under the current spec about the example I chose.

Inasmuch as there is a lot of working code that enforces the original
whitespace handling and relies on it in line-folding, I will not revise
CIFtbx 3, but I will try to do something to adapt to this change for
CIFtbx 4.

I guess we are just going to have yet another few dialects of CIF.

Regards,
Herbert

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya@dowling.edu
=====================================================

On Wed, 25 Nov 2009, James Hester wrote:

> To be precise, we are not 'referring all elides to the application'
> because no elides are recognised by the lexer under Nick's latest
> suggestion, so there are no elides to refer to the application.
>
> My understanding of CIF1.1 syntax suggests that the string you provide
> would produce a syntax error in CIF1.1, as the semicolon at the start
> of the second line would terminate the string, and so whitespace
> should then appear as the second character on the second line, rather
> than reverse solidus.
>
> On Wed, Nov 25, 2009 at 9:23 AM, Herbert J. Bernstein
> <yaya@bernstein-plus-sons.com> wrote:
>> The only problem with referring all elides to the application is that
>> with the removal of the requirement of a blank after a \n; for it to be
>> effective, the line folding protocol develops a slight gap. The
>> case is as follows
>>
>> ;\
>> ;\
>> ;
>>
>> Is a valid single text field in CIF 1.1, which when handled with the
>> line folding protocol translates to the equivalent of ';' because the
>> embedded ;\ is not a valid text terminator. If we require that
>> a text field that begins with "\n;\\" must be terminated by "\n; "
>> or "\n;\n" or "\n;\t" that problem would be fixed.
>>
>> On Wed, 25 Nov 2009, James Hester wrote:
>>
>>> I wholeheartedly agree with Nick's suggestion.
>>>
>>> On Tue, Nov 24, 2009 at 6:30 PM, Nick Spadaccini <nick@csse.uwa.edu.au>
>>> wrote:
>>>>
>>>> It appears to me that we have spent far too long on a syntactic issue
>>>> which can be avoided 99.9999% of the time. Quite simply, given the 5
>>>> ways to delimit strings, it is next to impossible to get a situation
>>>> where you cannot choose one of those to make the problem go away.
>>>>
>>>> I think the RCSB systematically avoid it by choosing
>>>>
>>>> "ab'cd"
>>>> 'ab"cd'
>>>> ;ab'"cd
>>>> ;
>>>>
>>>> But now we additionally have """ and ''' to choose from, making it even
>>>> easier.
>>>>
>>>> So I propose, in line with James' position, there is NO eliding of
>>>> terminator character at the CIF2 syntax level. ALL elides in the string
>>>> are assumed to be user specific encoding (say TeX, IUCr \greek) which
>>>> can be resolved at the dictionary level.
>>>>
>>>> This necessarily means NO terminator character can appear in a string
>>>> delimited by the same terminator character. You will need to choose a
>>>> different terminator character. That is
>>>>
>>>> No " in "strings"
>>>> No ' in 'strings'
>>>> No """ in """strings""" (but separable individual and doublet " are
>>>> allowed)
>>>> No ''' in '''strings''' (but separable individual and doublet ' are
>>>> allowed)
>>>>
>>>> EVERYTHING in the string is returned as raw (except the initiating and
>>>> terminating character).
>>>>
>>>> The only time you will not be able to encode anything in a delimited
>>>> string is when you want to include ' " """ ''' and \n; in the one
>>>> string. The likelihood of that is almost zero, unless you may want to
>>>> include a CIF within a CIF (a silly thing to do IMHO). In that case the
>>>> contents can be encoded in a dictionary driven way. I suggest it be
>>>> declared as a BASE64 type and then all the syntactic ambiguity
>>>> disappears.
>>>>
>>>> Problem solved! No need to elide because of CIF2 syntax rules; all
>>>> elides are user driven, contents are returned raw.
>>>>
>>>> As for Herb's comment in a recent email (what about line-folding?), the
>>>> same holds. That is NOT a lexer issue and it has nothing to do with the
>>>> parser; everything is read literally and returned raw, and what to do
>>>> with it is promulgated to the downstream application.
>>>>
>>>> Straw vote - No elides of terminator strings as described above - Nick
>>>>
>>>>
>>>> On 24/11/09 10:00 AM, "James Hester" <jamesrhester@gmail.com> wrote:
>>>>
>>>>> OK, my rewritten voting proposal appears to be an abject failure. Let
>>>>> me repeat 1 as clearly as possible
>>>>>
>>>>> 1. Should CIF2 allow elision of terminator characters? In other
>>>>> words, should we make it possible to include <quote> as a normal
>>>>> character in a <quote> delimited string?
>>>>>
>>>>> Herbert: It's difficult to understand how to rephrase things if it is
>>>>> not clear where exactly the problem lies.
>>>>>
>>>>> Joe: good point about double backslash. Consider this added to
>>>>> proposal (a).
>>>>>
>>>>> Before we discuss (2) precisely, can we agree to use the following
>>>>> abstract model and terminology for CIF2 file parsing and dictionary
>>>>> application? If not, please indicate your alternative.
>>>>>
>>>>> 1. A CIF lexer separates a CIF file into tokens according to the CIF2
>>>>> syntax specification only, that is, this process cannot be altered by
>>>>> DDL directives.
>>>>>
>>>>> 2. A CIF parser accepts the tokens from the lexer. CIF parsers can be
>>>>> modelled as performing at least the following actions with these
>>>>> tokens:
>>>>>   (i) assignment of datavalue to dataname
>>>>>   (ii) grouping looped datanames into a set
>>>>>   (iii) assigning looped datavalues to the appropriate dataname and
>>>>> packet
>>>>>   (iv) editing datavalues according to the syntax specification if
>>>>> this has not been performed in the lexer (e.g. stripping enclosing
>>>>> quotes, removing elides)
>>>>>
>>>>> 3. DDL dictionaries operate on and refer to the datavalues and
>>>>> datanames returned by the CIF parser after (2). They have no ability
>>>>> to influence the lexing process, or the parsing actions listed above
>>>>> (in particular the datavalue editing).
>>>>>
>>>>> 4. The 'string value' or 'value' of a token is that value returned by
>>>>> the parser in (2). In particular, this is the value that:
>>>>>   (i) may be checked against regular expressions in the dictionary;
>>>>>   (ii) is accessed by dREL expressions;
>>>>>   (iii) is returned by dREL expressions;
>>>>>   (iv) is referred to in dictionary descriptive text;
>>>>>   (v) may be passed to client routines for further editing;
>>>>>   (vi) may be passed to external applications
>>>>>
>>>>> [Side note: in other words the parser returns the CIF "infoset" and
>>>>> the dictionaries refer to the CIF "infoset", but we haven't been
>>>>> talking in those terms so I've been more explicit].
>>>>>
>>>>> So my voting question (2) is: should the 'string value' of a token
>>>>> referred to in (4) include the eliding characters?
>>>>>
>>>>>
>>>>> On Tue, Nov 24, 2009 at 10:57 AM, Joe Krahn <krahn@niehs.nih.gov>
>>>>> wrote:
>>>>>>
>>>>>> A few points to consider:
>>>>>>
>>>>>> James Hester wrote:
>>>>>> ...
>>>>>>>
>>>>>>> 2. Character(s) used to indicate elision should be part of the
>>>>>>> string value
>>>>>>
>>>>>> This does not specify where the elision character should be
>>>>>> stripped. It could be done by the parser or the dictionary-level
>>>>>> code. The rule only refers to the final string for the final output
>>>>>> text, right?
>>>>>>
>>>>>>>
>>>>>>> Now for the specifics:
>>>>>>>
>>>>>>> 3. Which of the following elision proposals do you support (more
>>>>>>> than one OK)?
>>>>>>>
>>>>>>> Proposal (a) (intended to correspond to Nick's)
>>>>>>>   (i) A character which would otherwise be interpreted as a
>>>>>>> delimiter is elided by immediately preceding it with a reverse
>>>>>>> solidus.
>>>>>>>   (ii) Otherwise a reverse solidus in the string has no special
>>>>>>> lexical significance.
>>>>>>>
>>>>>>> Proposal (b)
>>>>>>>   (i) The combinations <reverse solidus><quote> or a <reverse
>>>>>>> solidus><double quote> always signify <quote> and <double quote>
>>>>>>> respectively, regardless of the delimiter used in a particular
>>>>>>> string.
>>>>>>>   (ii) The combinations in (i) elide the <quote> or <double quote>
>>>>>>> character where that character would otherwise terminate the string
>>>>>>>   (iii) Apart from (i) and (ii), the reverse solidus has no special
>>>>>>> significance
>>>>>>>   (iv) If not used as the string delimiter, <quote> or <double
>>>>>>> quote> when not preceded by <reverse solidus> represent themselves.
>>>>>>
>>>>>> In both forms <reverse solidus><reverse solidus> should also be
>>>>>> defined in order to allow a literal string that ends in <reverse
>>>>>> solidus>. For example, a single <reverse solidus> character has to be
>>>>>> written as "\\", to avoid eliding the close quote.
>>>>>>
>>>>>> Joe Krahn
>>>>>>
>>>>>
>>>>
>>>> cheers
>>>>
>>>> Nick
>>>>
>>>> --------------------------------
>>>> Associate Professor N. Spadaccini, PhD
>>>> School of Computer Science & Software Engineering
>>>>
>>>> The University of Western Australia    t: +61 (0)8 6488 3452
>>>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>>>> CRAWLEY, Perth, WA 6009 AUSTRALIA      w3: www.csse.uwa.edu.au/~nick
>>>> MBDP M002
>>>>
>>>> CRICOS Provider Code: 00126G
>>>>
>>>> e: Nick.Spadaccini@uwa.edu.au
>>>>
>>>
>>> --
>>> T +61 (02) 9717 9907
>>> F +61 (02) 9717 3145
>>> M +61 (04) 0249 4148
>>
>
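
[Editorial illustration, not part of the original thread.] The disagreement above turns on whether "\n;" closes a text field only when followed by whitespace or end of file (the rule Herbert recalls) or unconditionally (the grammar he found on the IUCr site). The following is a minimal Python sketch of both readings applied to Herbert's three-line example. The function names `read_text_field` and `unfold`, the `require_whitespace` flag, and the simplified line-folding logic are all assumptions made for illustration; this is not CIFtbx code and not a complete CIF lexer.

```python
WHITESPACE = " \t\n"


def read_text_field(text, require_whitespace):
    """Split text (starting at the opening ';') into (content, rest).

    require_whitespace=True models the rule Herbert recalls: a "\n;" only
    terminates the field if followed by whitespace or end of input.
    require_whitespace=False models the grammar he found on the web site:
    the first "\n;" terminates, whatever follows.
    """
    if not text.startswith(";"):
        raise ValueError("text field must start with ';'")
    pos = text.find("\n;")
    while pos != -1:
        after = pos + 2  # index of the character following the ';'
        at_end = after == len(text)
        if not require_whitespace or at_end or text[after] in WHITESPACE:
            return text[1:pos], text[after:]
        pos = text.find("\n;", pos + 1)
    raise ValueError("unterminated text field")


def unfold(content):
    """Simplified line unfolding: a first line consisting of a lone
    backslash switches folding on; after that, a trailing backslash
    joins a line to the one that follows it."""
    lines = content.split("\n")
    if lines[0].rstrip(" \t") != "\\":
        return content  # folding not in use
    out, buf, pending = [], "", False
    for line in lines[1:]:
        stripped = line.rstrip(" \t")
        if stripped.endswith("\\"):
            buf += stripped[:-1]
            pending = True
        else:
            out.append(buf + line)
            buf, pending = "", False
    if pending:
        out.append(buf)
    return "\n".join(out)


# Herbert's example: the three lines  ;\   ;\   ;
EXAMPLE = ";\\\n;\\\n;\n"

strict_content, strict_rest = read_text_field(EXAMPLE, require_whitespace=True)
loose_content, loose_rest = read_text_field(EXAMPLE, require_whitespace=False)
```

Under the whitespace-required reading, the field runs to the third line and unfolds to a single ';', as Herbert describes. Under the unconditional reading, the field closes at the second line and the character immediately after the terminator is a reverse solidus, which is the syntax error James points out.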
_______________________________________________ ddlm-group mailing list ddlm-group@iucr.org http://scripts.iucr.org/mailman/listinfo/ddlm-group