Re: [ddlm-group] Use of elides in strings

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] Use of elides in strings
From: Brian McMahon <[email protected]>
Date: Wed, 25 Nov 2009 15:36:23 +0000
In-Reply-To: <[email protected]>
References: <C732C75F.125E9%[email protected]><[email protected]><[email protected]><[email protected]>
Very relevant to something I've just been discussing with the
Managing Editor. What would be the response of this group to the
suggestion that CIF 2 drops altogether the "CIF markup conventions"
of legacy CIF 1 (pp 35-36 of ITG,
http://www.iucr.org/resources/cif/spec/version1.1/semantics#markup )?

i.e. all the old \' = acute, \^ = circumflex etc. would be dropped
and replaced by Unicode characters.

Certainly "maximally disruptive", but there is a clean upgrade
mapping from the old codes to the new. What is gained is that there
is no ambiguity over the intent of these markup devices, as exists
presently (the specification begins with the rather opaque stanza
"If permitted by the relevant dictionary and if no other indication is
present, the contents of a text or character field are assumed to be
interpretable as text in English or some other human language. Certain
special codes are used to indicate special characters or accented letters
not available in the ASCII character set, as listed below.")
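
A minimal sketch of such an upgrade mapping, in Python. The table is deliberately partial: only \' and \^ are named in this message, and the other four codes are included on the assumption that they match the CIF 1.1 markup table at the URL above, so they should be checked against it. The function name is hypothetical. Note that the sketch converts every matching backslash sequence, which is exactly the ambiguity of intent described above: it cannot tell an accent code from a literal backslash.

```python
import re
import unicodedata

# Partial, illustrative table: CIF 1.1 accent code -> Unicode combining mark.
# Assumption: these codes follow the CIF 1.1 markup conventions; verify
# against the specification before relying on them.
COMBINING = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    '"': "\u0308",  # umlaut (diaeresis)
    "~": "\u0303",  # tilde
    ",": "\u0327",  # cedilla
}

# A backslash, one accent code, then the ASCII letter it modifies.
_ACCENT = re.compile(r"\\(['`^\"~,])([A-Za-z])")


def legacy_markup_to_unicode(text):
    """Replace accent codes such as \\'e with the precomposed character."""
    def repl(match):
        code, letter = match.group(1), match.group(2)
        # Compose base letter + combining mark into one code point where
        # Unicode defines a precomposed form (NFC normalisation).
        return unicodedata.normalize("NFC", letter + COMBINING[code])
    return _ACCENT.sub(repl, text)


if __name__ == "__main__":
    print(legacy_markup_to_unicode(r"Andr\'e, ch\^ateau, M\"uller"))
```

Anything not in the table (Greek letters, subscripts, the special codes for individual letters) is left untouched by this sketch.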

One needs then to be sure that the object in hand is indeed a CIF 2.0 file,
and not a CIF 1.
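
One way that check might look, sketched in Python. It assumes that a CIF 2.0 file announces itself with a leading magic comment of the form #\#CIF_2.0, by analogy with the #\#CIF_1.1 comment of CIF 1.1; that is an assumption here, not something settled in this thread. Both function names are hypothetical, and treating an undeclared file as CIF 1 is only one possible policy.

```python
import re

# Optional byte-order mark, then the magic comment, e.g. "#\#CIF_1.1".
_MAGIC = re.compile(r"^\ufeff?#\\#CIF_(\d+\.\d+)")


def cif_version(first_line):
    """Return the declared version string, or None if there is no magic comment."""
    match = _MAGIC.match(first_line)
    return match.group(1) if match else None


def uses_legacy_markup(first_line):
    """Hypothetical policy: anything not declared as 2.x is treated as CIF 1."""
    version = cif_version(first_line)
    return version is None or not version.startswith("2.")


if __name__ == "__main__":
    for line in ("#\\#CIF_2.0", "#\\#CIF_1.1", "data_example"):
        print(repr(line), cif_version(line), uses_legacy_markup(line))
```

Only when uses_legacy_markup() is true would an application go on to interpret \' and friends as accent codes.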

I've introduced this idea here since it has some bearing on the question of
elides (where retention of the legacy markup meaning for \' was assumed),
but it might be best to take discussion off to yet another thread.

Cheers
Brian


On Wed, Nov 25, 2009 at 02:24:16PM +0000, SIMON WESTRIP wrote:
> What Brian has said here - specifically
> 
> "if this were dropped as part of the CIF2 specification,
> we would need to think carefully about how else to retain this
> functionality"
> 
> is also relevant to how we handle the CIF1.1 markup conventions.
> As I understand it in CIF1.1 these are the default conventions for
> text fields unless the dictionary prohibits them, but in CIF2 all such
> conventions will _not_ be part of the spec, and can only be interpreted at
> the dictionary level.
> 
> Is this correct?
> 
> I'm only asking because we (at the IUCr at least) will have to address this
> issue sooner rather than later when adopting CIF2, so I just want to make sure
> I understand base CIF2 correctly
> 
> Cheers
> 
> Simon
> 
> ________________________________
> 
> From: Brian McMahon <[email protected]>
> To: Group finalising DDLm and associated dictionaries <[email protected]>
> Sent: Wednesday, 25 November, 2009 13:34:05
> Subject: Re: [ddlm-group] Use of elides in strings
> 
> (I've switched the thread title to deal separately with line folding.)
> 
> As Herbert says, line folding is part of the CIF 1.1 spec (pages 34-35
> of the ITG bible). Currently, it invokes a special meaning for the
> backslash (reverse solidus) character, but only when it is the first
> non-blank after an opening semicolon or comment hash delimiter. We have
> yet to discuss whether to extend it to other string types (specifically
> the triple-quoted strings).
> 
> It's quite easy these days to generate single strings that are longer
> than 2048 characters (or any other arbitrary line limit) - e.g. a
> protein or nucleic acid sequence. Many, many chemical names broke the old
> 80-character line length limit.
> 
> We're very happy with CIF applications that do not interpret the
> line-folding protocol, so long as they preserve the existing backslashes.
> However, a fully-compliant CIF 1.1 parser should be able to return an
> unfolded string to an application that requests it.
> 
> As Herbert says, if this were dropped as part of the CIF2 specification,
> we would need to think carefully about how else to retain this
> functionality.
> 
> Regards
> Brian
> 
> On Wed, Nov 25, 2009 at 07:54:51AM -0500, Herbert J. Bernstein wrote:
> > The line folding protocol was discussed and adopted by COMCIFS and is 
> > posted, along with other "Common Semantic Features" at
> > 
> > http://www.iucr.org/resources/cif/spec/version1.1/semantics
> > 
> > but that is neither here nor there.  The point is that the IUCr uses CIF 
> > to get work done.  If we disable something they are using, we should offer 
> > some equivalent functionality so they can use CIF 2 to do their work. 
> > Otherwise, they will have to do the sensible thing, and continue to use 
> > CIF 1, or, worse, create their own dialect of CIF 2.
> > 
> > Now, I broke my nose yesterday morning and find myself a bit punchy today, 
> > so I will drop out of this discussion for a while.  Hopefully, when I 
> > return to it, this whole matter will be settled in some way that will 
> > allow people to actually use CIF 2, instead of it becoming what it seems 
> > on its way to becoming -- something elegant but not terribly useful, a bit
> > like PL/I.
> > 
> > Cheers,
> >    Herbert
> > 
> > =====================================================
> >   Herbert J. Bernstein, Professor of Computer Science
> >     Dowling College, Kramer Science Center, KSC 121
> >          Idle Hour Blvd, Oakdale, NY, 11769
> > 
> >                   +1-631-244-3035
> >                  [email protected]
> > =====================================================
> > 
> > On Wed, 25 Nov 2009, Nick Spadaccini wrote:
> > 
> >> I am with John. STAR has no line-folding protocol. As far as I can recall
> >> neither did CIF. Somewhere along the way line folding was discussed (or
> >> introduced?), but I am not sure it is formally part of any spec.
> >>
> >> None of my software handles anything about line folding. I can see no reason
> >> for it, since with a 2048 maximum record length, and a free format structure
> >> there is plenty of room to output your data. The only time it would be
> >> necessary is when (dataname + space + datavalue)> 2048 and when is that
> >> ever going to happen?
> >>
> >> Maybe the desire for it comes from making the data "pretty" and read well
> >> in a text editor. Well that is the task of an application to read the CIF
> >> and present it appropriately. The CIF is strictly about CONTENT and not
> >> FORM.
> >>
> >> Since we have given up on elided characters being part of CIF syntax, and
> >> the belief by others that this not be a lexer issue, I think we should be
> >> absolutely consistent. The lexer knows how to identify tokens and reads
> >> everything within them as a raw string.
> >>
> >> If your "encoding" for \n; strings includes characters that break the lexer,
> >> then protect it in some way so that when you pass that string back as raw in
> >> your software, somebody knows how to unprotect it back to the original (as
> >> with ALL string encoding).
> >>
> >> One concession I think we can consider is to change the delimiter from \n;
> >> to \n;\n. I don't see this as causing me any problems, since I handle
> >>
> >> ; stuff
> >> More stuff
> >> ; _newname
> >>
> >> routinely, but others don't. I believe most people do use (and probably
> >> think) the delimiter is \n;\n anyway.
> >>
> >> Two questions
> >>
> >> (1) Do you agree that line folding is just another encoding and therefore not a
> >> STAR/CIF issue? Consequently it is the responsibility of the encoding not to
> >> break the lexer.
> >> (2) Do we think \n;\n is a better delimiter?
> >>
> >> On 25/11/09 10:33 AM, "John Westbrook" <[email protected]> wrote:
> >>
> >>> Hi James,
> >>>
> >>> My preference is to avoid elides in the syntax for the purpose of escaping
> >>> terminators in strings, deferring interpretation to the application.
> >>>
> >>> I do not understand all of the issues related to line folding, which I
> >>> believe is an issue for Brian and Simon.
> >>>
> >>> John
> >>>
> >>>
> >>> James Hester wrote:
> >>>> Thanks for the quick reply over Thanksgiving, John.  I take from your
> >>>> message that the PDB does not need any elide mechanism to be defined
> >>>> in the CIF2 syntax.  Would you therefore be prepared to vote in favour
> >>>> of not defining any elides, or would you prefer to abstain?
> >>>>
> >>>> Votes so far:
> >>>>
> >>>> No elides: James, Nick, Herbert if the IUCr + PDB say it is OK
> >>>> Elides:?
> >>>>
> >>>> Unknown: John, Joe, David B., Brian, Simon
> >>>>
> >>>> On Wed, Nov 25, 2009 at 12:03 PM, John Westbrook
> >>>> <[email protected]> wrote:
> >>>>> I confess that I am having difficulty keeping up with all aspects
> >>>>> of this discussion.   Following Herb's suggestion I will try to
> >>>>> summarize the quoting issues from the PDB perspective.
> >>>>>
> >>>>> 1. As there are multiple ways of quoting a string our tools and files
> >>>>> surround embedded quotes with quotes of the opposite sense or with
> >>>>> semicolons in the mixed case.   I think that this point has been
> >>>>> covered a number of times now and I believe that Nick has suggested
> >>>>> that all reasonable cases can be handled by using this approach.
> >>>>>
> >>>>> 2. I too was not aware that the original definition of terminators
> >>>>> had changed and did not include either a leading or trailing
> >>>>> whitespace.  Certainly this must still be the case for single
> >>>>> and double quotes.  I cannot recall ever seeing an example
> >>>>> where the terminator \n; was followed by a whitespace character,
> >>>>> but about half of the codes that I am familiar with would
> >>>>> fall over on \n;next_token.
> >>>>>
> >>>>> 3. Line folding has never been an issue for PDB nor has line length.
> >>>>>
> >>>>> Regards,
> >>>>>
> >>>>> John
> >>>>>
> >>>>>
> >>>>> Herbert J. Bernstein wrote:
> >>>>>> My major concern about anything we do is to be able to preserve
> >>>>>> the functionality of the practices that the IUCr is following in
> >>>>>> journal publications and the PDB is following. Inasmuch as they seem
> >>>>>> able to cope with no elide in CIF 1.1, the remaining question is whether
> >>>>>> they will be negatively impacted by the change in string termination
> >>>>>> without any elide.  If they can use CIF 2 with these changes, my
> >>>>>> objections are purely academic and irrelevant.  -- Herbert
> >>>>>>
> >>>>>> =====================================================
> >>>>>>  Herbert J. Bernstein, Professor of Computer Science
> >>>>>>    Dowling College, Kramer Science Center, KSC 121
> >>>>>>         Idle Hour Blvd, Oakdale, NY, 11769
> >>>>>>
> >>>>>>                  +1-631-244-3035
> >>>>>>                  [email protected]
> >>>>>> =====================================================
> >>>>>>
> >>>>>> On Wed, 25 Nov 2009, James Hester wrote:
> >>>>>>
> >>>>>>> Herbert: I have the dubious advantage of not having participated in
> >>>>>>> all those CIF1.0/1.1 discussions, so only have the spec as written
> >>>>>>> down to rely on.
> >>>>>>>
> >>>>>>> Anyway, how do you feel about abandoning any specification of elides
> >>>>>>> in CIF2 syntax, as suggested by Nick?
> >>>>>>>
> >>>>>>> On Wed, Nov 25, 2009 at 10:53 AM, Herbert J. Bernstein
> >>>>>>> <[email protected]> wrote:
> >>>>>>>> Dear James,
> >>>>>>>>
> >>>>>>>>  I started to write:
> >>>>>>>>  "No, in CIF 1.1, none of the terminal quote marks, including the \n;
> >>>>>>>> are effective unless followed by whitespace (\n, space, tab, or end of
> >>>>>>>> file). This is a well-established, and very tricky part of the CIF spec
> >>>>>>>> going back to 1990.  That is why Nick had to explicitly specify that a
> >>>>>>>> terminal quote mark would be effective no matter what it was followed by."
> >>>>>>>>
> >>>>>>>>  But the grammar currently on the IUCr web site is _not_ the one that I
> >>>>>>>> recall COMCIFS discussing and approving.  It now explicitly removes
> >>>>>>>> the requirement for terminal white space in the special case of
> >>>>>>>> the \n; text field terminator.  I don't recall when that change was
> >>>>>>>> adopted, but it appears that you are right under the current spec
> >>>>>>>> about the example I chose.  Inasmuch as there is a lot of working code
> >>>>>>>> that enforces and uses the original whitespace handling and uses it
> >>>>>>>> in line-folding, I will not revise CIFtbx 3, but I will try to do
> >>>>>>>> something to adapt to this change for CIFtbx 4.
> >>>>>>>>
> >>>>>>>>  I guess we are just going to have yet another few dialects of CIF.
> >>>>>>>>
> >>>>>>>>  Regards,
> >>>>>>>>    Herbert
> >>>>>>>> =====================================================
> >>>>>>>>  Herbert J. Bernstein, Professor of Computer Science
> >>>>>>>>   Dowling College, Kramer Science Center, KSC 121
> >>>>>>>>        Idle Hour Blvd, Oakdale, NY, 11769
> >>>>>>>>
> >>>>>>>>                 +1-631-244-3035
> >>>>>>>>                [email protected]
> >>>>>>>> =====================================================
> >>>>>>>>
> >>>>>>>> On Wed, 25 Nov 2009, James Hester wrote:
> >>>>>>>>
> >>>>>>>>> To be precise, we are not 'referring all elides to the application'
> >>>>>>>>> because no elides are recognised by the lexer under Nick's latest
> >>>>>>>>> suggestion, so there are no elides to refer to the application.
> >>>>>>>>>
> >>>>>>>>> My understanding of CIF1.1 syntax suggests that the string you provide
> >>>>>>>>> would produce a syntax error in CIF1.1, as the semicolon at the start
> >>>>>>>>> of the second line would terminate the string, and so whitespace
> >>>>>>>>> should then appear as the second character on the second line, rather
> >>>>>>>>> than reverse solidus.
> >>>>>>>>>
> >>>>>>>>> On Wed, Nov 25, 2009 at 9:23 AM, Herbert J. Bernstein
> >>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>> The only problem with referring all elides to the application is that
> >>>>>>>>>> with the removal of the requirement of a blank after a \n; for it
> >>>>>>>>>> to be effective, the line folding protocol develops a slight gap.  The
> >>>>>>>>>> case is as follows
> >>>>>>>>>>
> >>>>>>>>>> ;\
> >>>>>>>>>> ;\
> >>>>>>>>>> ;
> >>>>>>>>>>
> >>>>>>>>>> Is a valid single text field in CIF 1.1, which when handled with the
> >>>>>>>>>> line folding protocol translates to the equivalent of ';' because the
> >>>>>>>>>> embedded ;\ is not a valid text terminator.  If we require that
> >>>>>>>>>> a text field that begins with "\n;\\" must be terminated by "\n; "
> >>>>>>>>>> or "\n;\n" or "\n;\t" that problem would be fixed.
> >>>>>>>>>>
> >>>>>>>>>> =====================================================
> >>>>>>>>>>  Herbert J. Bernstein, Professor of Computer Science
> >>>>>>>>>>   Dowling College, Kramer Science Center, KSC 121
> >>>>>>>>>>        Idle Hour Blvd, Oakdale, NY, 11769
> >>>>>>>>>>
> >>>>>>>>>>                 +1-631-244-3035
> >>>>>>>>>>                [email protected]
> _______________________________________________
> ddlm-group mailing list
> [email protected]
> http://scripts.iucr.org/mailman/listinfo/ddlm-group