Re: [ddlm-group] Use of elides in strings
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Use of elides in strings
- From: Brian McMahon <bm@iucr.org>
- Date: Wed, 25 Nov 2009 15:20:51 +0000
- In-Reply-To: <279aad2a0911241806s3129bd4bvc6f315ca0764d3e3@mail.gmail.com>
- References: <279aad2a0911231800g6c26bdaancdd4a38fecebbb7a@mail.gmail.com><C731AC95.125CB%nick@csse.uwa.edu.au><279aad2a0911241414j1d89b6b3mfec464fdc401fbfd@mail.gmail.com><alpine.BSF.2.00.0911241717100.78685@epsilon.pair.com><279aad2a0911241454h12811f4eqfc47dd5eafa22c84@mail.gmail.com><alpine.BSF.2.00.0911241807480.78685@epsilon.pair.com><279aad2a0911241602u63486a1es2e98c940526af7c4@mail.gmail.com><alpine.BSF.2.00.0911241916470.78685@epsilon.pair.com><4B0C825E.5020102@pdb-mail.rutgers.edu><279aad2a0911241806s3129bd4bvc6f315ca0764d3e3@mail.gmail.com>
I believe the IUCr could cope with *not* having the elision mechanism
we have been discussing. [Personally I would prefer the term 'escape'
mechanism rather than 'elide', since the latter implies to me that the
following character should be dropped - but that's just a pedantic
comment.]

James and Nick (and others) make a compelling case for the scope for
confusion that arises in overloading the meaning of an elide character
within a string value. The consequent prohibition of same-terminator
characters within a terminator-delimited string does require care in
writing strings, but I am sure that we can find ways around it.

In practical terms, the prohibition of the ' and " characters from
non-delimited strings (which I think we have already agreed) is more
of a nuisance, because many existing CIFs have atom labels such as
O1', Ti" (and I am told that, according to some chemical naming
conventions, both single and double quotes can occur in the same
label). Some important crystallographic software still has 4-character
limits on atom labels, so forcing the addition of opening and
terminating quote marks will rather reduce its effectiveness! Of
course, such a program may never become CIF2-compatible, so I'm not
using that as an argument for accommodating the new standards to its
limitations; but it's an interesting example of the real cases for
which we will need to create suitable handling procedures.

Regards
Brian

On Wed, Nov 25, 2009 at 01:06:46PM +1100, James Hester wrote:
> Thanks for the quick reply over Thanksgiving, John. I take from your
> message that the PDB does not need any elide mechanism to be defined
> in the CIF2 syntax. Would you therefore be prepared to vote in favour
> of not defining any elides, or would you prefer to abstain?
>
> Votes so far:
>
> No elides: James, Nick, Herbert if the IUCr + PDB say it is OK
> Elides:?
>
> Unknown: John, Joe, David B., Brian, Simon
>
> On Wed, Nov 25, 2009 at 12:03 PM, John Westbrook
> <jwest@pdb-mail.rutgers.edu> wrote:
> >
> > I confess that I am having difficulty keeping up with all aspects
> > of this discussion. Following Herb's suggestion I will try to
> > summarize the quoting issues from the PDB perspective.
> >
> > 1. As there are multiple ways of quoting a string our tools and files
> >    surround embedded quotes with quotes of the opposite sense or with
> >    semicolons in the mixed case. I think that this point has been
> >    covered a number of times now and I believe that Nick has suggested
> >    that all reasonable cases can be handled by using this approach.
> >
> > 2. I too was not aware that the original definition of terminators
> >    had changed and did not include either a leading or trailing
> >    whitespace. Certainly this must still be the case for single
> >    and double quotes. I cannot recall ever seeing an example
> >    where the terminator \n; was followed by a whitespace character,
> >    but about half of the codes that I am familiar with would
> >    fall over on \n;next_token.
> >
> > 3. Line folding has never been an issue for PDB nor has line length.
> >
> > Regards,
> >
> > John
> >
> >
> > Herbert J. Bernstein wrote:
> >> My major concern about anything we do is to be able to preserve
> >> the functionality of the practices that the IUCr is following in
> >> journal publications and the PDB is following. Inasmuch as they seem
> >> able to cope with no elide in CIF 1.1, the remaining question is whether
> >> they will be negatively impacted by the change in string termination
> >> without any elide. If they can use CIF 2 with these changes, my
> >> objections are purely academic and irrelevant. -- Herbert
> >>
> >> =====================================================
> >> Herbert J. Bernstein, Professor of Computer Science
> >> Dowling College, Kramer Science Center, KSC 121
> >> Idle Hour Blvd, Oakdale, NY, 11769
> >>
> >> +1-631-244-3035
> >> yaya@dowling.edu
> >> =====================================================
> >>
> >> On Wed, 25 Nov 2009, James Hester wrote:
> >>
> >>> Herbert: I have the dubious advantage of not having participated in
> >>> all those CIF1.0/1.1 discussions, so only have the spec as written
> >>> down to rely on.
> >>>
> >>> Anyway, how do you feel about abandoning any specification of elides
> >>> in CIF2 syntax, as suggested by Nick?
> >>>
> >>> On Wed, Nov 25, 2009 at 10:53 AM, Herbert J. Bernstein
> >>> <yaya@bernstein-plus-sons.com> wrote:
> >>>> Dear James,
> >>>>
> >>>> I started to write:
> >>>> "No, in CIF 1.1, none of the terminal quote marks, including the
> >>>> \n; are effective unless followed by whitespace (\n, space, tab,
> >>>> or end of file). This is a well-established, and very tricky part
> >>>> of the CIF spec going back to 1990. That is why Nick had to
> >>>> explicitly specify that a terminal quote mark would be effective
> >>>> no matter what it was followed by."
> >>>>
> >>>> But the grammar currently on the IUCr web site is _not_ the one that I
> >>>> recall COMCIFS discussing and approving. It now explicitly removes
> >>>> the requirement for terminal white space in the special case of
> >>>> the \n; text field terminator. I don't recall when that change was
> >>>> adopted, but it appears that you are right under the current spec
> >>>> about the example I chose. Inasmuch as there is a lot of working code
> >>>> that enforces and uses the original whitespace handling and uses it
> >>>> in line-folding, I will not revise CIFtbx 3, but I will try to do
> >>>> something to adapt to this change for CIFtbx 4.
> >>>>
> >>>> I guess we are just going to have yet another few dialects of CIF.
> >>>>
> >>>> Regards,
> >>>> Herbert
> >>>>
> >>>> On Wed, 25 Nov 2009, James Hester wrote:
> >>>>
> >>>>> To be precise, we are not 'referring all elides to the application'
> >>>>> because no elides are recognised by the lexer under Nick's latest
> >>>>> suggestion, so there are no elides to refer to the application.
> >>>>>
> >>>>> My understanding of CIF1.1 syntax suggests that the string you provide
> >>>>> would produce a syntax error in CIF1.1, as the semicolon at the start
> >>>>> of the second line would terminate the string, and so whitespace
> >>>>> should then appear as the second character on the second line, rather
> >>>>> than reverse solidus.
> >>>>>
> >>>>> On Wed, Nov 25, 2009 at 9:23 AM, Herbert J. Bernstein
> >>>>> <yaya@bernstein-plus-sons.com> wrote:
> >>>>>>
> >>>>>> The only problem with referring all elides to the application is that
> >>>>>> with the removal of the requirement of a blank after a \n; for it
> >>>>>> to be effective, the line folding protocol develops a slight gap.
> >>>>>> The case is as follows
> >>>>>>
> >>>>>> ;\
> >>>>>> ;\
> >>>>>> ;
> >>>>>>
> >>>>>> This is a valid single text field in CIF 1.1, which when handled with
> >>>>>> the line folding protocol translates to the equivalent of ';' because
> >>>>>> the embedded ;\ is not a valid text terminator. If we require that
> >>>>>> a text field that begins with "\n;\\" must be terminated by "\n; "
> >>>>>> or "\n;\n" or "\n;\t" that problem would be fixed.
> >>>>>>
> >>>>>> On Wed, 25 Nov 2009, James Hester wrote:
> >>>>>>
> >>>>>>> I wholeheartedly agree with Nick's suggestion.
> >>>>>>>
> >>>>>>> On Tue, Nov 24, 2009 at 6:30 PM, Nick Spadaccini
> >>>>>>> <nick@csse.uwa.edu.au> wrote:
> >>>>>>>>
> >>>>>>>> It appears to me that we have spent far too long on a syntactic
> >>>>>>>> issue which can be avoided 99.9999% of the time. Quite simply,
> >>>>>>>> given the 5 ways to delimit strings, it is next to impossible to
> >>>>>>>> get a situation where you cannot choose one of those to make the
> >>>>>>>> problem go away.
> >>>>>>>>
> >>>>>>>> I think the RCSB systematically avoid it by choosing
> >>>>>>>>
> >>>>>>>> "ab'cd"
> >>>>>>>> 'ab"cd'
> >>>>>>>> ;ab'"cd
> >>>>>>>> ;
> >>>>>>>>
> >>>>>>>> But now we additionally have """ and ''' to choose from, making
> >>>>>>>> it even easier.
> >>>>>>>>
> >>>>>>>> So I propose, in line with James' position, there is NO eliding
> >>>>>>>> of terminator characters at the CIF2 syntax level. ALL elides in
> >>>>>>>> the string are assumed to be user specific encoding (say TeX,
> >>>>>>>> IUCr \greek) which can be resolved at the dictionary level.
> >>>>>>>>
> >>>>>>>> This necessarily means NO terminator character can appear in a
> >>>>>>>> string delimited by the same terminator character. You will need
> >>>>>>>> to choose a different terminator character. That is
> >>>>>>>>
> >>>>>>>> No " in "strings"
> >>>>>>>> No ' in 'strings'
> >>>>>>>> No """ in """strings""" (but separable individual and doublet "
> >>>>>>>>    are allowed)
> >>>>>>>> No ''' in '''strings''' (but separable individual and doublet '
> >>>>>>>>    are allowed)
> >>>>>>>>
> >>>>>>>> EVERYTHING in the string is returned as raw (except the
> >>>>>>>> initiating and terminating character).
> >>>>>>>>
> >>>>>>>> The only time you will not be able to encode anything in a
> >>>>>>>> delimited string is when you want to include ' " """ ''' and \n;
> >>>>>>>> in the one string. The likelihood of that is almost zero, unless
> >>>>>>>> you may want to include a CIF within a CIF (a silly thing to do
> >>>>>>>> IMHO). In that case the contents can be encoded in a dictionary
> >>>>>>>> driven way. I suggest it be declared as a BASE64 type and then
> >>>>>>>> all the syntactic ambiguity disappears.
> >>>>>>>>
> >>>>>>>> Problem solved! No need to elide because of CIF2 syntax rules;
> >>>>>>>> all elides are user driven, contents are returned raw.
> >>>>>>>>
> >>>>>>>> As for Herb's comment in a recent email, what about line-folding,
> >>>>>>>> then the same holds. That is NOT a lexer issue and it has nothing
> >>>>>>>> to do with the parser; everything is read literally and returned
> >>>>>>>> raw, and what to do with it is promulgated to the downstream
> >>>>>>>> application.
> >>>>>>>>
> >>>>>>>> Straw vote - No elides of terminator strings as described above -
> >>>>>>>> Nick
> >>>
> >>> --
> >>> T +61 (02) 9717 9907
> >>> F +61 (02) 9717 3145
> >>> M +61 (04) 0249 4148

_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
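
As an illustration of the delimiter choice discussed in this thread (no elides at the syntax level; pick a delimiter whose terminator does not occur in the value), here is a minimal Python sketch. The function name `choose_delimiter`, the order of preference, the one-line restriction on ' and " strings, and the rejection of a trailing quote inside a triple-quoted string are all assumptions made for this example; none of them is taken from a CIF2 specification or from any existing tool.

```python
from typing import Optional


def choose_delimiter(value: str) -> Optional[str]:
    """Return `value` wrapped in the first usable delimiter, or None.

    Hypothetical helper: a delimiter is usable only if its terminator
    never occurs inside the value, so nothing has to be elided.
    """
    # Assumption for this sketch: ' and " strings stay on a single line.
    single_line = "\n" not in value
    if single_line and "'" not in value:
        return "'" + value + "'"
    if single_line and '"' not in value:
        return '"' + value + '"'
    # Triple-quoted forms: single and doubled quote marks may appear inside,
    # but a trailing quote would run into the terminator, so reject it here.
    for quote in ("'''", '"""'):
        if quote not in value and not value.endswith(quote[0]):
            return quote + value + quote
    # Semicolon text field: terminated by a newline followed by ';'.
    if "\n;" not in value:
        return "\n;" + value + "\n;"
    # Every terminator occurs in the value: it cannot be stored raw.
    return None


if __name__ == "__main__":
    for label in ["O1'", 'Ti"', "ab'\"cd"]:
        print(choose_delimiter(label))
```

For the atom labels mentioned above, `O1'` comes back in double quotes and `Ti"` in single quotes, while a label containing both kinds of quote falls through to a triple-quoted form. Only a value containing every terminator at once returns `None`, which is the case for which a dictionary-level encoding such as BASE64 was suggested.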