Re: [ddlm-group] Use of elides in strings
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Use of elides in strings
- From: Nick Spadaccini <nick@csse.uwa.edu.au>
- Date: Thu, 26 Nov 2009 23:10:27 +0800
- Authentication-Results: postfix;
- In-Reply-To: <539530.205.qm@web87004.mail.ird.yahoo.com>
On 26/11/09 9:59 PM, "SIMON WESTRIP" <simonwestrip@btinternet.com> wrote:

> Just as a distraction from trying to understand modulated structure CIFs,
> here goes:
>
> I'd use semicolon delimiters (see long arrows <------ below)
> and if I didn't know the definition of the item, I would
> respect the whitespace.
>
> Actually, I'd probably bung in a couple of extra newlines for good measure
> if I knew what I were dealing with - i.e.
>
> ;
> O'"
> ;
>
> Funnily enough, this is actually easier to read using my eyes
> than "O'"" :-)

Strictly the string is

;O'"
;

since there is a desire that everything has to be returned as a raw string.
Looking at it as a byte stream we have \n;O'"\n; and once you strip off the
string delimiters (\n;) you get O'". Voila!

Read on; I have inserted additional comments.

> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
> To: Nick.Spadaccini@uwa.edu.au; Group finalising DDLm and associated
> dictionaries <ddlm-group@iucr.org>
> Sent: Thursday, 26 November, 2009 12:41:46
> Subject: Re: [ddlm-group] Use of elides in strings
>
> I am trying to get some CIF-2 related software done, so please advise
> me on some specific cases:
>
> How should the following C-style strings followed by their CIF-1.1
> representations be presented in a CIF 2 document? I've only put
> in CIF 2 cases where I think there is no question, but feel free
> to correct those.
>
> C-style             CIF-1.1 style               CIF-2
>
> "O'"                "O'" or 'O''                "O'"
> "O\""               "O"" or 'O"'                'O"'
> "O'\""              "O'"" or 'O'"'              ?
> <------------------------ \n;O'"\n;

Or '''O'"''' but not with """ because the terminator is corrupted.

> "''O''"             "''O''" or '''O'''          "''O''"
> "'''O'''"           "'''O'''" or ''''O''''      ?
> <------------------------------- \n;'''O'''\n;

Or """'''O'''"""

> "\"\"O'\"\""        """O'""" or '""O'""'        ?
> <---------------------- \n;""O'""\n;

Or '''""O'""'''

> "\"\"\"O'\"\"\""    """"O'"""" or '"""O'"""'    ?
> <---------------------- \n;"""O'"""\n;

Or '''"""O'"""'''

> and for semi-colon delimited string, is the last new-line part of
> the string or part of the delimiter, i.e. if the string is
> "abc\n" is the CIF-2 version

My reading of it has always been given by the definition of the delimiter,
which is \n;. These are what I strip off. When we speak of stripping off the
delimiters at both ends, then just as we strip the """ trigram from both
ends, the same is true of the \n; digram. Hence I say the second of the two
examples

\n;abc\n\n;

> ;abc
> ; <-------------------- if newlines are not required by the item's
> definition, I'd be tempted to strip the whitespace

The above is the string "abc"

> or
>
> ;abc
> <-------------------- without knowing the item's definition, I'd be
> tempted to respect the whitespace
> ;
>

This is "abc\n"

>
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
> Dowling College, Kramer Science Center, KSC 121
> Idle Hour Blvd, Oakdale, NY, 11769
>
> +1-631-244-3035
> yaya@dowling.edu
> =====================================================
>
> On Thu, 26 Nov 2009, Nick Spadaccini wrote:
>
>>
>> On 25/11/09 10:24 PM, "SIMON WESTRIP" <simonwestrip@btinternet.com> wrote:
>>
>>> What Brian has said here - specifically
>>>
>>> "if this were dropped as part of the CIF2 specification,
>>> we would need to think carefully about how else to retain this
>>> functionality"
>>>
>>> is also relevant to how we handle the CIF1.1 markup conventions.
>>> As I understand it in CIF1.1 these are the default conventions for
>>> text fields unless the dictionary prohibits them, but in CIF2 all such
>>> conventions will _not_ be part of the spec, and can only be interpreted at
>>> the dictionary level.
>>>
>>> Is this correct?
>>
>> Yes, this is my understanding.
>> There will be many different conventions I
>> presume, some will be widely accepted and standard, they will be part of the
>> underlying systems that interpret the files. For instance if something is
>> declared as a TeX encoding, we know what to do.
>>
>>>
>>> I'm only asking because we (at the IUCr at least) will have to address this
>>> issue sooner rather than later when adopting CIF2, so I just want to make
>>> sure I understand base CIF2 correctly
>>>
>>> Cheers
>>>
>>> Simon
>>>
>>>
>>> From: Brian McMahon <bm@iucr.org>
>>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>>> Sent: Wednesday, 25 November, 2009 13:34:05
>>> Subject: Re: [ddlm-group] Use of elides in strings
>>>
>>> (I've switched the thread title to deal separately with line folding.)
>>>
>>> As Herbert says, line folding is part of the CIF 1.1 spec (pages 34-35
>>> of the ITG bible). Currently, it invokes a special meaning for the
>>> backslash (reverse solidus) character, but only when it is the first
>>> non-blank after an opening semicolon or comment hash delimiter. We have
>>> yet to discuss whether to extend it to other string types (specifically
>>> the triple-quoted strings).
>>>
>>> It's quite easy these days to generate single strings that are longer
>>> than 2048 characters (or any other arbitrary line limit) - e.g. a
>>> protein or nucleic acid sequence. Many, many chemical names broke the old
>>> 80-character line length limit.
>>>
>>> We're very happy with CIF applications that do not interpret the
>>> line-folding protocol, so long as they preserve the existing backslashes.
>>> However, a fully-compliant CIF 1.1 parser should be able to return an
>>> unfolded string to an application that requests it.
>>>
>>> As Herbert says, if this were dropped as part of the CIF2 specification,
>>> we would need to think carefully about how else to retain this
>>> functionality.
>>>
>>> Regards
>>> Brian
>>>
>>> On Wed, Nov 25, 2009 at 07:54:51AM -0500, Herbert J. Bernstein wrote:
>>>> The line folding protocol was discussed and adopted by COMCIFS and is
>>>> posted, along with other "Common Semantic Features" at
>>>>
>>>> http://www.iucr.org/resources/cif/spec/version1.1/semantics
>>>>
>>>> but that is neither here nor there. The point is that the IUCr uses CIF
>>>> to get work done. If we disable something they are using, we should offer
>>>> some equivalent functionality so they can use CIF 2 to do their work.
>>>> Otherwise, they will have to do the sensible thing, and continue to use
>>>> CIF 1, or, worse, create their own dialect of CIF 2.
>>>>
>>>> Now, I broke my nose yesterday morning and find myself a bit punchy today,
>>>> so I will drop out of this discussion for a while. Hopefully, when I
>>>> return to it, this whole matter will be settled in some way that will
>>>> allow people to actually use CIF 2, instead of it becoming what it seems
>>>> on its way to becoming -- something elegant but not terribly useful, a bit
>>>> like PL/I.
>>>>
>>>> Cheers,
>>>> Herbert
>>>>
>>>> =====================================================
>>>> Herbert J. Bernstein, Professor of Computer Science
>>>> Dowling College, Kramer Science Center, KSC 121
>>>> Idle Hour Blvd, Oakdale, NY, 11769
>>>>
>>>> +1-631-244-3035
>>>> yaya@dowling.edu
>>>> =====================================================
>>>>
>>>> On Wed, 25 Nov 2009, Nick Spadaccini wrote:
>>>>
>>>>> I am with John. STAR has no line-folding protocol. As far as I can recall
>>>>> neither did CIF. Somewhere along the way line folding was discussed (or
>>>>> introduced?), but I am not sure it is formally part of any spec.
>>>>>
>>>>> None of my software handles anything about line folding. I can see no
>>>>> reason for it, since with a 2048 maximum record length, and a free format
>>>>> structure there is plenty of room to output your data.
>>>>> The only time it would be
>>>>> necessary is when (dataname + space + datavalue) > 2048 and when is that
>>>>> ever going to happen?
>>>>>
>>>>> Maybe the desire for it comes from making the data "pretty" and read well
>>>>> in a text editor. Well that is the task of an application to read the CIF
>>>>> and present it appropriately. The CIF is strictly about CONTENT and not
>>>>> FORM.
>>>>>
>>>>> Since we have given up on elided characters being part of CIF syntax, and
>>>>> the belief by others that this not be a lexer issue, I think we should be
>>>>> absolutely consistent. The lexer knows how to identify tokens and reads
>>>>> everything within them as a raw string.
>>>>>
>>>>> If your "encoding" for \n; strings includes characters that break the
>>>>> lexer, then protect it in some way so that when you pass that string back
>>>>> as raw in your software, somebody knows how to unprotect it back to the
>>>>> original (as with ALL string encoding).
>>>>>
>>>>> One concession I think we can consider is to change the delimiter from \n;
>>>>> to \n;\n. I don't see this as causing me any problems, since I handle
>>>>>
>>>>> ; stuff
>>>>> More stuff
>>>>> ; _newname
>>>>>
>>>>> routinely, but others don't. I believe most people do use (and probably
>>>>> think) the delimiter is \n;\n anyway.
>>>>>
>>>>> Two questions
>>>>>
>>>>> (1) Do you agree that line folding is just another encoding and therefore
>>>>> not a STAR/CIF issue? Consequently it is the responsibility of the
>>>>> encoding not to break the lexer.
>>>>> (2) Do we think \n;\n is a better delimiter?
>>>>>
>>>>> On 25/11/09 10:33 AM, "John Westbrook" <jwest@pdb-mail.rutgers.edu> wrote:
>>>>>
>>>>>> Hi James,
>>>>>>
>>>>>> My preference is to avoid the elides in the syntax for the purpose of
>>>>>> escaping terminators in strings, deferring interpretation to the
>>>>>> application.
>>>>>>
>>>>>> I do not understand all of the issues related to line folding, which I
>>>>>> believe is an issue for Brian and Simon.
>>>>>>
>>>>>> John
>>>>>>
>>>>>>
>>>>>> James Hester wrote:
>>>>>>> Thanks for the quick reply over Thanksgiving, John. I take from your
>>>>>>> message that the PDB does not need any elide mechanism to be defined
>>>>>>> in the CIF2 syntax. Would you therefore be prepared to vote in favour
>>>>>>> of not defining any elides, or would you prefer to abstain?
>>>>>>>
>>>>>>> Votes so far:
>>>>>>>
>>>>>>> No elides: James, Nick, Herbert if the IUCr + PDB say it is OK
>>>>>>> Elides:?
>>>>>>>
>>>>>>> Unknown: John, Joe, David B., Brian, Simon
>>>>>>>
>>>>>>> On Wed, Nov 25, 2009 at 12:03 PM, John Westbrook
>>>>>>> <jwest@pdb-mail.rutgers.edu> wrote:
>>>>>>>> I confess that I am having difficulty keeping up with all aspects
>>>>>>>> of this discussion. Following Herb's suggestion I will try to
>>>>>>>> summarize the quoting issues from the PDB perspective.
>>>>>>>>
>>>>>>>> 1. As there are multiple ways of quoting a string our tools and files
>>>>>>>> surround embedded quotes with quotes of the opposite sense or with
>>>>>>>> semicolons in the mixed case. I think that this point has been
>>>>>>>> covered a number of times now and I believe that Nick has suggested
>>>>>>>> that all reasonable cases can be handled by using this approach.
>>>>>>>>
>>>>>>>> 2. I too was not aware that the original definition of terminators
>>>>>>>> had changed and did not include either a leading or trailing
>>>>>>>> whitespace. Certainly this must still be the case for single
>>>>>>>> and double quotes. I cannot recall ever seeing an example
>>>>>>>> where the terminator \n; was followed by a non-whitespace character,
>>>>>>>> but about half of the codes that I am familiar with would
>>>>>>>> fall over on \n;next_token.
>>>>>>>>
>>>>>>>> 3. Line folding has never been an issue for PDB nor has line length.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> John
>>>>>>>>
>>>>>>>>
>>>>>>>> Herbert J. Bernstein wrote:
>>>>>>>>> My major concern about anything we do is to be able to preserve
>>>>>>>>> the functionality of the practices that the IUCr is following in
>>>>>>>>> journal publications and the PDB is following. Inasmuch as they seem
>>>>>>>>> able to cope with no elide in CIF 1.1, the remaining question is
>>>>>>>>> whether they will be negatively impacted by the change in string
>>>>>>>>> termination without any elide. If they can use CIF 2 with these
>>>>>>>>> changes, my objections are purely academic and irrelevant. -- Herbert
>>>>>>>>>
>>>>>>>>> =====================================================
>>>>>>>>> Herbert J. Bernstein, Professor of Computer Science
>>>>>>>>> Dowling College, Kramer Science Center, KSC 121
>>>>>>>>> Idle Hour Blvd, Oakdale, NY, 11769
>>>>>>>>>
>>>>>>>>> +1-631-244-3035
>>>>>>>>> yaya@dowling.edu
>>>>>>>>> =====================================================
>>>>>>>>>
>>>>>>>>> On Wed, 25 Nov 2009, James Hester wrote:
>>>>>>>>>
>>>>>>>>>> Herbert: I have the dubious advantage of not having participated in
>>>>>>>>>> all those CIF1.0/1.1 discussions, so only have the spec as written
>>>>>>>>>> down to rely on.
>>>>>>>>>>
>>>>>>>>>> Anyway, how do you feel about abandoning any specification of elides
>>>>>>>>>> in CIF2 syntax, as suggested by Nick?
>>>>>>>>>>
>>>>>>>>>> On Wed, Nov 25, 2009 at 10:53 AM, Herbert J. Bernstein
>>>>>>>>>> <yaya@bernstein-plus-sons.com> wrote:
>>>>>>>>>>> Dear James,
>>>>>>>>>>>
>>>>>>>>>>> I started to write:
>>>>>>>>>>> "No, in CIF 1.1, none of the terminal quote marks, including the \n;
>>>>>>>>>>> are effective unless followed by whitespace (\n, space, tab, or end of
>>>>>>>>>>> file). This is a well-established, and very tricky part of the CIF
>>>>>>>>>>> spec going back to 1990.
>>>>>>>>>>> That is why Nick had to explicitly specify that a terminal quote
>>>>>>>>>>> mark would be effective no matter what it was followed by."
>>>>>>>>>>>
>>>>>>>>>>> But the grammar currently on the IUCr web site is _not_ the one that I
>>>>>>>>>>> recall COMCIFS discussing and approving. It now explicitly removes
>>>>>>>>>>> the requirement for terminal white space in the special case of
>>>>>>>>>>> the \n; text field terminator. I don't recall when that change was
>>>>>>>>>>> adopted, but it appears that you are right under the current spec
>>>>>>>>>>> about the example I chose. Inasmuch as there is a lot of working code
>>>>>>>>>>> that enforces and uses the original whitespace handling and uses it
>>>>>>>>>>> in line-folding, I will not revise CIFtbx 3, but I will try to do
>>>>>>>>>>> something to adapt to this change for CIFtbx 4.
>>>>>>>>>>>
>>>>>>>>>>> I guess we are just going to have yet another few dialects of CIF.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Herbert
>>>>>>>>>>> =====================================================
>>>>>>>>>>> Herbert J. Bernstein, Professor of Computer Science
>>>>>>>>>>> Dowling College, Kramer Science Center, KSC 121
>>>>>>>>>>> Idle Hour Blvd, Oakdale, NY, 11769
>>>>>>>>>>>
>>>>>>>>>>> +1-631-244-3035
>>>>>>>>>>> yaya@dowling.edu
>>>>>>>>>>> =====================================================
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 25 Nov 2009, James Hester wrote:
>>>>>>>>>>>
>>>>>>>>>>>> To be precise, we are not 'referring all elides to the application'
>>>>>>>>>>>> because no elides are recognised by the lexer under Nick's latest
>>>>>>>>>>>> suggestion, so there are no elides to refer to the application.
>>>>>>>>>>>>
>>>>>>>>>>>> My understanding of CIF1.1 syntax suggests that the string you
>>>>>>>>>>>> provide would produce a syntax error in CIF1.1, as the semicolon at
>>>>>>>>>>>> the start of the second line would terminate the string, and so
>>>>>>>>>>>> whitespace should then appear as the second character on the second
>>>>>>>>>>>> line, rather than reverse solidus.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Nov 25, 2009 at 9:23 AM, Herbert J. Bernstein
>>>>>>>>>>>> <yaya@bernstein-plus-sons.com> wrote:
>>>>>>>>>>>> The only problem with referring all elides to the application is
>>>>>>>>>>>> that with the removal of the requirement of a blank after a \n; for
>>>>>>>>>>>> it to be effective, the line folding protocol develops a slight gap.
>>>>>>>>>>>> The case is as follows
>>>>>>>>>>>>
>>>>>>>>>>>> ;\
>>>>>>>>>>>> ;\
>>>>>>>>>>>> ;
>>>>>>>>>>>>
>>>>>>>>>>>> Is a valid single text field in CIF 1.1, which when handled with
>>>>>>>>>>>> the line folding protocol translates to the equivalent of ';'
>>>>>>>>>>>> because the embedded ;\ is not a valid text terminator. If we
>>>>>>>>>>>> require that a text field that begins with "\n;\\" must be
>>>>>>>>>>>> terminated by "\n; " or "\n;\n" or "\n;\t" that problem would be
>>>>>>>>>>>> fixed.
>>>>>>>>>>>>
>>>>>>>>>>>> =====================================================
>>>>>>>>>>>> Herbert J. Bernstein, Professor of Computer Science
>>>>>>>>>>>> Dowling College, Kramer Science Center, KSC 121
>>>>>>>>>>>> Idle Hour Blvd, Oakdale, NY, 11769
>>>>>>>>>>>>
>>>>>>>>>>>> +1-631-244-3035
>>>>>>>>>>>> yaya@dowling.edu
>>> _______________________________________________
>>> ddlm-group mailing list
>>> ddlm-group@iucr.org
>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>
>> cheers
>>
>> Nick
>>
>> --------------------------------
>> Associate Professor N. Spadaccini, PhD
>> School of Computer Science & Software Engineering
>>
>> The University of Western Australia    t: +61 (0)8 6488 3452
>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>> CRAWLEY, Perth, WA 6009 AUSTRALIA      w3: www.csse.uwa.edu.au/~nick
>> <http://www.csse.uwa.edu.au/%7Enick>
>> MBDP M002
>>
>> CRICOS Provider Code: 00126G
>>
>> e: Nick.Spadaccini@uwa.edu.au

cheers

Nick
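
The delimiter reasoning in this message can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not any parser's actual behaviour: the function names candidate_encodings and strip_text_field are hypothetical, the delimiter set (', ", ''', """ and \n;) is the one discussed in the thread, no elides are recognised, and single-quoted and double-quoted strings are assumed to stay on one line.

```python
TEXT_DELIM = "\n;"  # the digram that is stripped from both ends of a text field


def candidate_encodings(raw: str) -> list[str]:
    """List the delimited forms of `raw` that a no-elide lexer would read back intact."""
    forms = []
    # Assumption: ' and " strings stay on one line and cannot contain their own quote.
    for quote in ("'", '"'):
        if "\n" not in raw and quote not in raw:
            forms.append(quote + raw + quote)
    # Assumption: a triple-quoted form fails if the content contains the trigram
    # or ends with the same quote character (the terminator would be corrupted).
    for quote in ("'", '"'):
        trigram = quote * 3
        if trigram not in raw and not raw.endswith(quote):
            forms.append(trigram + raw + trigram)
    # A semicolon text field fails only if the content itself contains \n;
    if TEXT_DELIM not in raw:
        forms.append(TEXT_DELIM + raw + TEXT_DELIM)
    return forms


def strip_text_field(token: str) -> str:
    """Remove the leading and trailing \\n; digrams and return the raw content."""
    n = len(TEXT_DELIM)
    if len(token) < 2 * n or not (
        token.startswith(TEXT_DELIM) and token.endswith(TEXT_DELIM)
    ):
        raise ValueError("not a semicolon-delimited text field")
    return token[n:-n]


if __name__ == "__main__":
    # The mixed-quote case O'" from the table above.
    for form in candidate_encodings("O'\""):
        print(repr(form))
    # The "abc\n" case: the final newline belongs to the delimiter.
    print(repr(strip_text_field("\n;abc\n\n;")))
```

Under these assumptions the O'" case yields only the '''-delimited and \n;-delimited forms, and stripping \n; from both ends of \n;abc\n\n; leaves abc followed by one newline, which is the reading argued for above.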