Re: [ddlm-group] Use of elides in strings
- To: Nick.Spadaccini@uwa.edu.au, Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Use of elides in strings
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Thu, 26 Nov 2009 10:16:47 -0500 (EST)
- In-Reply-To: <C734BB63.12644%nick@csse.uwa.edu.au>
- References: <C734BB63.12644%nick@csse.uwa.edu.au>
Thank you.  I will code to that spec.  -- Herbert

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya@dowling.edu
=====================================================

On Thu, 26 Nov 2009, Nick Spadaccini wrote:

>
> On 26/11/09 9:59 PM, "SIMON WESTRIP" <simonwestrip@btinternet.com> wrote:
>
>> Just as a distraction from trying to understand modulated structure CIFs,
>> here goes:
>>
>> I'd use semicolon delimiters (see long arrows <------ below)
>> and if I didn't know the definition of the item, I would
>> respect the whitespace.
>>
>> Actually, I'd probably bung in a couple of extra newlines for good measure
>> if I knew what I were dealing with - i.e.
>>
>> ;
>> O'"
>> ;
>>
>> Funnily enough, this is actually easier to read using my eyes
>> than "O'"" :-)
>
> Strictly the string is
>
> ;O'"
> ;
>
> Since there is a desire that everything has to be returned as a raw string.
> Looking at it as a byte stream we have \n;O'"\n; and once you strip off the
> string delimiters (\n;) you get O'". Voila!
>
> Read on, I have inserted additional comments.
>
>> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>> To: Nick.Spadaccini@uwa.edu.au; Group finalising DDLm and associated
>> dictionaries <ddlm-group@iucr.org>
>> Sent: Thursday, 26 November, 2009 12:41:46
>> Subject: Re: [ddlm-group] Use of elides in strings
>>
>> I am trying to get some CIF-2 related software done, so please advise
>> me on some specific cases:
>>
>> How should the following C-style strings followed by their CIF-1.1
>> representations be presented in a CIF 2 document?  I've only put
>> in CIF 2 cases where I think there is no question, but feel free
>> to correct those.
>>
>> C-style              CIF-1.1 style                CIF-2
>>
>> "O'"                 "O'" or 'O''                 "O'"
>
>> "O\""                "O"" or 'O"'                 'O"'
>
>> "O'\""               "O'"" or 'O'"'               ?
>> <------------------------ \n;O'"\n;
> Or '''O'"''' but not with """ because the terminator is corrupted.
>
>> "''O''"              "''O''" or '''O'''           "''O''"
>
>> "'''O'''"            "'''O'''" or ''''O''''       ?
>> <------------------------------- \n;'''O'''\n;
> Or """'''O'''"""
>
>> "\"\"O'\"\""         """O'""" or '""O'""'         ?
>> <---------------------- \n;""O'""\n;
> Or '''""O'""'''
>
>> "\"\"\"O'\"\"\""     """"O'"""" or '"""O'"""'     ?
>> <---------------------- \n;"""O'"""\n;
> Or '''"""O'"""'''
>
>> and for semi-colon delimited string, is the last new-line part of
>> the string or part of the delimiter, i.e. if the string is
>> "abc\n" is the CIF-2 version
>
> My reading of it has always been given by the definition of the delimiter,
> which is \n;. These are what I strip off.
>
> When we speak of stripping off the delimiters at both ends, then just as we
> strip the """ trigram from both ends, the same is true of the \n; digram.
> Hence I say the second of the two examples, \n;abc\n\n;
>
>> ;abc
>> ; <-------------------- if newlines are not required by the item's
>> definition, I'd be tempted to strip the whitespace
>
> The above is the string "abc"
>
>> or
>>
>> ;abc
>> <-------------------- without knowing the item's definition, I'd be
>> tempted to respect the whitespace
>> ;
>>
>
> This is "abc\n"
>>
>> On Thu, 26 Nov 2009, Nick Spadaccini wrote:
>>
>>>
>>> On 25/11/09 10:24 PM, "SIMON WESTRIP" <simonwestrip@btinternet.com> wrote:
>>>
>>>> What Brian has said here - specifically
>>>>
>>>> "if this were dropped as part of the CIF2 specification,
>>>> we would need to think carefully about how else to retain this
>>>> functionality"
>>>>
>>>> is also relevant to how we handle the CIF1.1 markup conventions.
>>>> As I understand it in CIF1.1 these are the default conventions for
>>>> text fields unless the dictionary prohibits them, but in CIF2 all such
>>>> conventions will _not_ be part of the spec, and can only be interpreted at
>>>> the dictionary level.
>>>>
>>>> Is this correct?
>>>
>>> Yes, this is my understanding. There will be many different conventions I
>>> presume, some will be widely accepted and standard, they will be part of the
>>> underlying systems that interpret the files. For instance if something is
>>> declared as a TeX encoding, we know what to do.
>>>
>>>>
>>>> I'm only asking because we (at the IUCr at least) will have to address this
>>>> issue sooner rather than later when adopting CIF2, so I just want to make
>>>> sure I understand base CIF2 correctly
>>>>
>>>> Cheers
>>>>
>>>> Simon
>>>>
>>>> From: Brian McMahon <bm@iucr.org>
>>>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>>>> Sent: Wednesday, 25 November, 2009 13:34:05
>>>> Subject: Re: [ddlm-group] Use of elides in strings
>>>>
>>>> (I've switched the thread title to deal separately with line folding.)
>>>>
>>>> As Herbert says, line folding is part of the CIF 1.1 spec (pages 34-35
>>>> of the ITG bible). Currently, it invokes a special meaning for the
>>>> backslash (reverse solidus) character, but only when it is the first
>>>> non-blank after an opening semicolon or comment hash delimiter. We have
>>>> yet to discuss whether to extend it to other string types (specifically
>>>> the triple-quoted strings).
>>>>
>>>> It's quite easy these days to generate single strings that are longer
>>>> than 2048 characters (or any other arbitrary line limit) - e.g. a
>>>> protein or nucleic acid sequence. Many, many chemical names broke the old
>>>> 80-character line length limit.
>>>>
>>>> We're very happy with CIF applications that do not interpret the
>>>> line-folding protocol, so long as they preserve the existing backslashes.
>>>> However, a fully-compliant CIF 1.1 parser should be able to return an
>>>> unfolded string to an application that requests it.
>>>>
>>>> As Herbert says, if this were dropped as part of the CIF2 specification,
>>>> we would need to think carefully about how else to retain this
>>>> functionality.
>>>>
>>>> Regards
>>>> Brian
>>>>
>>>> On Wed, Nov 25, 2009 at 07:54:51AM -0500, Herbert J. Bernstein wrote:
>>>>> The line folding protocol was discussed and adopted by COMCIFS and is
>>>>> posted, along with other "Common Semantic Features" at
>>>>>
>>>>> http://www.iucr.org/resources/cif/spec/version1.1/semantics
>>>>>
>>>>> but that is neither here nor there. The point is that the IUCr uses CIF
>>>>> to get work done. If we disable something they are using, we should offer
>>>>> some equivalent functionality so they can use CIF 2 to do their work.
>>>>> Otherwise, they will have to do the sensible thing, and continue to use
>>>>> CIF 1, or, worse, create their own dialect of CIF 2.
>>>>>
>>>>> Now, I broke my nose yesterday morning and find myself a bit punchy today,
>>>>> so I will drop out of this discussion for a while. Hopefully, when I
>>>>> return to it, this whole matter will be settled in some way that will
>>>>> allow people to actually use CIF 2, instead of it becoming what it seems
>>>>> on its way to becoming -- something elegant but not terribly useful, a bit
>>>>> like PL/I.
>>>>>
>>>>> Cheers,
>>>>> Herbert
>>>>>
>>>>> On Wed, 25 Nov 2009, Nick Spadaccini wrote:
>>>>>
>>>>>> I am with John. STAR has no line-folding protocol. As far as I can recall
>>>>>> neither did CIF. Somewhere along the way line folding was discussed (or
>>>>>> introduced?), but I am not sure it is formally part of any spec.
>>>>>>
>>>>>> None of my software handles anything about line folding. I can see no
>>>>>> reason for it, since with a 2048 maximum record length, and a free format
>>>>>> structure there is plenty of room to output your data. The only time it
>>>>>> would be necessary is when (dataname + space + datavalue) > 2048 and when
>>>>>> is that ever going to happen?
>>>>>>
>>>>>> Maybe the desire for it comes from making the data "pretty" and read well
>>>>>> in a text editor. Well that is the task of an application to read the CIF
>>>>>> and present it appropriately. The CIF is strictly about CONTENT and not
>>>>>> FORM.
>>>>>>
>>>>>> Since we have given up on elided characters being part of CIF syntax, and
>>>>>> the belief by others that this not be a lexer issue, I think we should be
>>>>>> absolutely consistent. The lexer knows how to identify tokens and reads
>>>>>> everything within them as a raw string.
>>>>>>
>>>>>> If your "encoding" for \n; strings includes characters that break the
>>>>>> lexer, then protect it in some way so that when you pass that string back
>>>>>> as raw in your software, somebody knows how to unprotect it back to the
>>>>>> original (as with ALL string encoding).
>>>>>>
>>>>>> One concession I think we can consider is to change the delimiter from \n;
>>>>>> to \n;\n. I don't see this as causing me any problems, since I handle
>>>>>>
>>>>>> ; stuff
>>>>>> More stuff
>>>>>> ; _newname
>>>>>>
>>>>>> routinely, but others don't. I believe most people do use (and probably
>>>>>> think) the delimiter is \n;\n anyway.
>>>>>>
>>>>>> Two questions
>>>>>>
>>>>>> (1) Do you agree that line folding is just another encoding and therefore
>>>>>> not a STAR/CIF issue? Consequently it is the responsibility of the
>>>>>> encoding not to break the lexer.
>>>>>> (2) Do we think \n;\n is a better delimiter?
>>>>>>
>>>>>> On 25/11/09 10:33 AM, "John Westbrook" <jwest@pdb-mail.rutgers.edu> wrote:
>>>>>>
>>>>>>> Hi James,
>>>>>>>
>>>>>>> My preference is to avoid the elides in the syntax for the purpose of
>>>>>>> escaping terminators in strings, deferring interpretation to the
>>>>>>> application.
>>>>>>>
>>>>>>> I do not understand all of the issues related to line folding, which I
>>>>>>> believe is an issue for Brian and Simon.
>>>>>>>
>>>>>>> John
>>>>>>>
>>>>>>> James Hester wrote:
>>>>>>>> Thanks for the quick reply over Thanksgiving, John. I take from your
>>>>>>>> message that the PDB does not need any elide mechanism to be defined
>>>>>>>> in the CIF2 syntax. Would you therefore be prepared to vote in favour
>>>>>>>> of not defining any elides, or would you prefer to abstain?
>>>>>>>>
>>>>>>>> Votes so far:
>>>>>>>>
>>>>>>>> No elides: James, Nick, Herbert if the IUCr + PDB say it is OK
>>>>>>>> Elides:?
>>>>>>>>
>>>>>>>> Unknown: John, Joe, David B., Brian, Simon
>>>>>>>>
>>>>>>>> On Wed, Nov 25, 2009 at 12:03 PM, John Westbrook
>>>>>>>> <jwest@pdb-mail.rutgers.edu> wrote:
>>>>>>>>> I confess that I am having difficulty keeping up with all aspects
>>>>>>>>> of this discussion. Following Herb's suggestion I will try to
>>>>>>>>> summarize the quoting issues from the PDB perspective.
>>>>>>>>>
>>>>>>>>> 1. As there are multiple ways of quoting a string our tools and files
>>>>>>>>> surround embedded quotes with quotes of the opposite sense or with
>>>>>>>>> semicolons in the mixed case. I think that this point has been
>>>>>>>>> covered a number of times now and I believe that Nick has suggested
>>>>>>>>> that all reasonable cases can be handled by using this approach.
>>>>>>>>>
>>>>>>>>> 2. I too was not aware that the original definition of terminators
>>>>>>>>> had changed and did not include either a leading or trailing
>>>>>>>>> whitespace. Certainly this must still be the case for single
>>>>>>>>> and double quotes. I cannot recall ever seeing an example
>>>>>>>>> where the terminator \n; was followed by a whitespace character,
>>>>>>>>> but about half of the codes that I am familiar with would
>>>>>>>>> fall over on \n;next_token.
>>>>>>>>>
>>>>>>>>> 3. Line folding has never been an issue for PDB nor has line length.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> John
>>>>>>>>>
>>>>>>>>> Herbert J. Bernstein wrote:
>>>>>>>>>> My major concern about anything we do is to be able to preserve
>>>>>>>>>> the functionality of the practices that the IUCr is following in
>>>>>>>>>> journal publications and the PDB is following. Inasmuch as they seem
>>>>>>>>>> able to cope with no elide in CIF 1.1, the remaining question is
>>>>>>>>>> whether they will be negatively impacted by the change in string
>>>>>>>>>> termination without any elide. If they can use CIF 2 with these
>>>>>>>>>> changes, my objections are purely academic and irrelevant. -- Herbert
>>>>>>>>>>
>>>>>>>>>> On Wed, 25 Nov 2009, James Hester wrote:
>>>>>>>>>>
>>>>>>>>>>> Herbert: I have the dubious advantage of not having participated in
>>>>>>>>>>> all those CIF1.0/1.1 discussions, so only have the spec as written
>>>>>>>>>>> down to rely on.
>>>>>>>>>>>
>>>>>>>>>>> Anyway, how do you feel about abandoning any specification of elides
>>>>>>>>>>> in CIF2 syntax, as suggested by Nick?
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 25, 2009 at 10:53 AM, Herbert J. Bernstein
>>>>>>>>>>> <yaya@bernstein-plus-sons.com> wrote:
>>>>>>>>>>>> Dear James,
>>>>>>>>>>>>
>>>>>>>>>>>> I started to write:
>>>>>>>>>>>> "No, in CIF 1.1, none of the terminal quote marks, including the
>>>>>>>>>>>> \n; are effective unless followed by whitespace (\n, space, tab,
>>>>>>>>>>>> or end of file). This is a well-established, and very tricky part
>>>>>>>>>>>> of the CIF spec going back to 1990. That is why Nick had to
>>>>>>>>>>>> explicitly specify that a terminal quote mark would be effective
>>>>>>>>>>>> no matter what it was followed by."
>>>>>>>>>>>>
>>>>>>>>>>>> But the grammar currently on the IUCr web site is _not_ the one
>>>>>>>>>>>> that I recall COMCIFS discussing and approving. It now explicitly
>>>>>>>>>>>> removes the requirement for terminal white space in the special
>>>>>>>>>>>> case of the \n; text field terminator. I don't recall when that
>>>>>>>>>>>> change was adopted, but it appears that you are right under the
>>>>>>>>>>>> current spec about the example I chose. Inasmuch as there is a lot
>>>>>>>>>>>> of working code that enforces and uses the original whitespace
>>>>>>>>>>>> handling and uses it in line-folding, I will not revise CIFtbx 3,
>>>>>>>>>>>> but I will try to do something to adapt to this change for
>>>>>>>>>>>> CIFtbx 4.
>>>>>>>>>>>>
>>>>>>>>>>>> I guess we are just going to have yet another few dialects of CIF.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Herbert
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 25 Nov 2009, James Hester wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> To be precise, we are not 'referring all elides to the application'
>>>>>>>>>>>>> because no elides are recognised by the lexer under Nick's latest
>>>>>>>>>>>>> suggestion, so there are no elides to refer to the application.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My understanding of CIF1.1 syntax suggests that the string you
>>>>>>>>>>>>> provide would produce a syntax error in CIF1.1, as the semicolon
>>>>>>>>>>>>> at the start of the second line would terminate the string, and so
>>>>>>>>>>>>> whitespace should then appear as the second character on the
>>>>>>>>>>>>> second line, rather than reverse solidus.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Nov 25, 2009 at 9:23 AM, Herbert J. Bernstein
>>>>>>>>>>>>> <yaya@bernstein-plus-sons.com> wrote:
>>>>>>>>>>>>> The only problem with referring all elides to the application is
>>>>>>>>>>>>> that with the removal of the requirement of a blank after a \n;
>>>>>>>>>>>>> for it to be effective, the line folding protocol develops a
>>>>>>>>>>>>> slight gap. The case is as follows
>>>>>>>>>>>>>
>>>>>>>>>>>>> ;\
>>>>>>>>>>>>> ;\
>>>>>>>>>>>>> ;
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is a valid single text field in CIF 1.1, which when handled with
>>>>>>>>>>>>> the line folding protocol translates to the equivalent of ';'
>>>>>>>>>>>>> because the embedded ;\ is not a valid text terminator. If we
>>>>>>>>>>>>> require that a text field that begins with "\n;\\" must be
>>>>>>>>>>>>> terminated by "\n; " or "\n;\n" or "\n;\t" that problem would be
>>>>>>>>>>>>> fixed.
>>>>>>>>>>>>>
>>>> _______________________________________________
>>>> ddlm-group mailing list
>>>> ddlm-group@iucr.org
>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>
>>> cheers
>>>
>>> Nick
>>>
>>> --------------------------------
>>> Associate Professor N. Spadaccini, PhD
>>> School of Computer Science & Software Engineering
>>>
>>> The University of Western Australia    t: +61 (0)8 6488 3452
>>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>>> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
>>> <http://www.csse.uwa.edu.au/%7Enick>
>>> MBDP  M002
>>>
>>> CRICOS Provider Code: 00126G
>>>
>>> e: Nick.Spadaccini@uwa.edu.au
>
> cheers
>
> Nick
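
The delimiter rules discussed in this thread can be sketched in a few lines of Python. This is only an illustration of one reading of the messages above, not the CIF2 specification: it assumes that no elides exist, that the semicolon delimiter is exactly the \n; digram at both ends, and that a quote delimiter ends the string wherever it next appears. The function names strip_semicolon_delimiters and cif2_candidates are hypothetical.

```python
def strip_semicolon_delimiters(raw: str) -> str:
    """Return the content of a semicolon-delimited text field.

    `raw` is the character stream from the newline before the opening
    semicolon up to and including the closing semicolon, e.g. "\n;abc\n;".
    Assumption taken from the thread: the delimiter is the digram "\n;"
    at both ends, and nothing else is stripped.
    """
    delimiter = "\n;"
    if (
        len(raw) < 2 * len(delimiter)
        or not raw.startswith(delimiter)
        or not raw.endswith(delimiter)
    ):
        raise ValueError("not a semicolon-delimited text field")
    return raw[len(delimiter):-len(delimiter)]


def cif2_candidates(value: str) -> list[str]:
    """List delimited forms that would round-trip `value` unchanged.

    Assumed rules (a reading of the thread, not a spec):
    - single-line quotes: the value holds neither that quote nor a newline;
    - triple quotes: the value does not hold the triple and does not end
      with the same quote character, which would corrupt the terminator;
    - semicolon field: the value does not hold "\n;".
    """
    candidates = []
    for quote in ("'", '"'):
        if quote not in value and "\n" not in value:
            candidates.append(quote + value + quote)
    for triple in ("'''", '"""'):
        if triple not in value and not value.endswith(triple[0]):
            candidates.append(triple + value + triple)
    if "\n;" not in value:
        candidates.append("\n;" + value + "\n;")
    return candidates


if __name__ == "__main__":
    # The C-style strings from the table in Herbert's message.
    for text in ["O'", 'O"', "O'\"", "''O''", "'''O'''", "\"\"O'\"\""]:
        print(repr(text), "->", [repr(c) for c in cif2_candidates(text)])
```

Under these assumptions the field \n;abc\n; yields "abc" and \n;abc\n\n; yields "abc\n", as in Nick's answer, and the value O'" is accepted in triple single quotes or a semicolon field but rejected in triple double quotes.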