Discussion List Archives

Re: [ddlm-group] Use of elides in strings

I believe the IUCr could cope with *not* having the elision mechanism
we have been discussing. [Personally I would prefer the term 'escape'
mechanism to 'elide', since the latter implies to me that the following
character should be dropped, but that's just a pedantic comment.]

James and Nick (and others) make a compelling case for the scope for
confusion that arises in overloading the meaning of an elide character
within a string value.

The consequent prohibition of same-terminator characters within a
terminator-delimited string does require care in writing strings,
but I am sure that we can find ways around it.
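
As a minimal sketch of one such way around it, assuming the five
delimiters listed in Nick's proposal quoted below (', ", ''', """ and the
\n; text field), and assuming that ' and " strings must stay on a single
line: pick the first delimiter whose terminator does not occur in the
value. The function name choose_delimiter is hypothetical and not part of
any CIF library.

```python
def choose_delimiter(value: str) -> str:
    """Wrap value in the first delimiter whose terminator it does not contain."""
    # Assumption: single-character quotes are only usable on one line.
    if "\n" not in value:
        for quote in ("'", '"'):
            if quote not in value:
                return quote + value + quote
    for quote in ("'''", '"""'):
        # A value ending in the quote character would run into the
        # terminator and close the string one character early.
        if quote not in value and not value.endswith(quote[0]):
            return quote + value + quote
    # Text field: opened by newline-semicolon, closed by newline-semicolon.
    if "\n;" not in value:
        return "\n;" + value + "\n;"
    # Every terminator occurs in the value; the proposal leaves this case
    # to a dictionary-level encoding, which is outside this sketch.
    raise ValueError("no delimiter fits this value")
```

With no elides, the only check needed is a substring test per delimiter;
a value that defeats all five raises an error here rather than being
escaped.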

In practical terms, the prohibition of the ' and " characters from
non-delimited strings (which I think we have already agreed) is more
of a nuisance, because many existing CIFs have atom labels such as
O1', Ti" (and I am told that, according to some chemical naming
conventions, both single and double quotes can occur in the same
label). Some important crystallographic software still has 4-character
limits on atom labels, so forcing the addition of opening and
terminating quote marks will rather reduce its effectiveness!

Of course, such a program may never become CIF2-compatible, so I'm
not using that as an argument for accommodating the new standards to
its limitations; but it's an interesting example of the real cases
for which we will need to create suitable handling procedures.
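
The atom-label case can be illustrated with a short sketch. It assumes
that the quote marks count towards the 4-character limit (the limit may
equally apply to the label after the quotes are stripped), and that
labels contain no whitespace. FIELD_WIDTH, quote_label and fits_field are
hypothetical names chosen for illustration.

```python
FIELD_WIDTH = 4  # assumption: the legacy program's fixed atom-label width


def quote_label(label: str) -> str:
    """Wrap a label in quotes if it contains ' or ", else return it bare."""
    has_single = "'" in label
    has_double = '"' in label
    if not has_single and not has_double:
        return label
    if not has_single:
        return "'" + label + "'"
    if not has_double:
        return '"' + label + '"'
    # Both quote characters occur: fall back to a triple-quote delimiter
    # whose quote character is not the label's last character.
    quote = '"""' if label.endswith("'") else "'''"
    if quote in label:
        raise ValueError("label cannot be delimited with quotes")
    return quote + label + quote


def fits_field(label: str) -> bool:
    """True if the label, as written in the file, fits in FIELD_WIDTH characters."""
    return len(quote_label(label)) <= FIELD_WIDTH
```

Under these assumptions the three-character label O1' is written with
surrounding double quotes and occupies five characters, so
fits_field("O1'") is False while fits_field("O1") is True.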

Regards
Brian


On Wed, Nov 25, 2009 at 01:06:46PM +1100, James Hester wrote:
> Thanks for the quick reply over Thanksgiving, John.  I take from your
> message that the PDB does not need any elide mechanism to be defined
> in the CIF2 syntax.  Would you therefore be prepared to vote in favour
> of not defining any elides, or would you prefer to abstain?
> 
> Votes so far:
> 
> No elides: James, Nick, Herbert if the IUCr + PDB say it is OK
> Elides:?
> 
> Unknown: John, Joe, David B., Brian, Simon
> 
> On Wed, Nov 25, 2009 at 12:03 PM, John Westbrook
> <jwest@pdb-mail.rutgers.edu> wrote:
> >
> > I confess that I am having difficulty keeping up with all aspects
> > of this discussion.   Following Herb's suggestion I will try to
> > summarize the quoting issues from the PDB perspective.
> >
> > 1. As there are multiple ways of quoting a string our tools and files
> > surround embedded quotes with quotes of the opposite sense or with
> > semicolons in the mixed case.   I think that this point has been
> > covered a number of times now and I believe that Nick has suggested
> > that all reasonable cases can be handled by using this approach.
> >
> > 2. I too was not aware that the original definition of terminators
> > had changed and did not include either a leading or trailing
> > whitespace.  Certainly this must still be the case for single
> > and double quotes.  I cannot recall ever seeing an example
> > where the terminator \n; was followed by a whitespace character,
> > but about half of the codes that I am familiar with would
> > fall over on \n;next_token.
> >
> > 3. Line folding has never been an issue for PDB nor has line length.
> >
> > Regards,
> >
> > John
> >
> >
> > Herbert J. Bernstein wrote:
> >> My major concern about anything we do is to be able to preserve
> >> the functionality of the practices that the IUCr is following in
> >> journal publications and the PDB is following. Inasmuch as they seem
> >> able to cope with no elide in CIF 1.1, the remaining question is whether
> >> they will be negatively impacted by the change in string termination
> >> without any elide.  If they can use CIF 2 with these changes, my
> >> objections are purely academic and irrelevant.  -- Herbert
> >>
> >> =====================================================
> >>  Herbert J. Bernstein, Professor of Computer Science
> >>    Dowling College, Kramer Science Center, KSC 121
> >>         Idle Hour Blvd, Oakdale, NY, 11769
> >>
> >>                  +1-631-244-3035
> >>                  yaya@dowling.edu
> >> =====================================================
> >>
> >> On Wed, 25 Nov 2009, James Hester wrote:
> >>
> >>> Herbert: I have the dubious advantage of not having participated in
> >>> all those CIF1.0/1.1 discussions, so only have the spec as written
> >>> down to rely on.
> >>>
> >>> Anyway, how do you feel about abandoning any specification of elides
> >>> in CIF2 syntax, as suggested by Nick?
> >>>
> >>> On Wed, Nov 25, 2009 at 10:53 AM, Herbert J. Bernstein
> >>> <yaya@bernstein-plus-sons.com> wrote:
> >>>> Dear James,
> >>>>
> >>>>  I started to write:
> >>>>  "No, in CIF 1.1, none of the terminal quote marks, including the \n;
> >>>> are effective unless followed by whitespace (\n, space, tab, or end of
> >>>> file). This is a well-established, and very tricky part of the CIF
> >>>> spec going back to 1990.  That is why Nick had to explicitly specify
> >>>> that a terminal quote mark would be effective no matter what it was
> >>>> followed by."
> >>>>
> >>>>  But the grammar currently on the IUCr web site is _not_ the one that I
> >>>> recall COMCIFs discussing and approving.  It now explicitly removes
> >>>> the requirement for terminal white space in the special case of
> >>>> the \n; text field terminator.  I don't recall when that change was
> >>>> adopted, but it appears that you are right under the current spec
> >>>> about the example I chose.  Inasmuch as there is a lot of working code
> >>>> that enforces and uses the original whitespace handling and uses it
> >>>> in line-folding, I will not revise CIFtbx 3, but I will try to do
> >>>> something to adapt to this change for CIFtbx 4.
> >>>>
> >>>>  I guess we are just going to have yet another few dialects of CIF.
> >>>>
> >>>>  Regards,
> >>>>    Herbert
> >>>>
> >>>> On Wed, 25 Nov 2009, James Hester wrote:
> >>>>
> >>>>> To be precise, we are not 'referring all elides to the application'
> >>>>> because no elides are recognised by the lexer under Nick's latest
> >>>>> suggestion, so there are no elides to refer to the application.
> >>>>>
> >>>>> My understanding of CIF1.1 syntax suggests that the string you provide
> >>>>> would produce a syntax error in CIF1.1, as the semicolon at the start
> >>>>> of the second line would terminate the string, and so whitespace
> >>>>> should then appear as the second character on the second line, rather
> >>>>> than reverse solidus.
> >>>>>
> >>>>> On Wed, Nov 25, 2009 at 9:23 AM, Herbert J. Bernstein
> >>>>> <yaya@bernstein-plus-sons.com> wrote:
> >>>>>>
> >>>>>> The only problem with referring all elides to the application is that
> >>>>>> with the removal of the requirement of a blank after a \n; for it
> >>>>>> to be effective, the line folding protocol develops a slight gap.
> >>>>>> The case is as follows
> >>>>>>
> >>>>>> ;\
> >>>>>> ;\
> >>>>>> ;
> >>>>>>
> >>>>>> is a valid single text field in CIF 1.1, which when handled with the
> >>>>>> line folding protocol translates to the equivalent of ';' because the
> >>>>>> embedded ;\ is not a valid text terminator.  If we require that
> >>>>>> a text field that begins with "\n;\\" must be terminated by "\n; "
> >>>>>> or "\n;\n" or "\n;\t" that problem would be fixed.
> >>>>>>
> >>>>>> On Wed, 25 Nov 2009, James Hester wrote:
> >>>>>>
> >>>>>>> I wholeheartedly agree with Nick's suggestion.
> >>>>>>>
> >>>>>>> On Tue, Nov 24, 2009 at 6:30 PM, Nick Spadaccini
> >>>>>>> <nick@csse.uwa.edu.au>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>> It appears to me that we have spent far too long on a syntactic
> >>>>>>>> issue which can be avoided 99.9999% of the time. Quite simply
> >>>>>>>> given the 5 ways to delimit strings, it is next to impossible to
> >>>>>>>> get a situation where you cannot choose one of those to make the
> >>>>>>>> problem go away.
> >>>>>>>>
> >>>>>>>> I think the RCSB systematically avoid it by choosing
> >>>>>>>>
> >>>>>>>> "ab'cd"
> >>>>>>>> 'ab"cd'
> >>>>>>>> ;ab'"cd
> >>>>>>>> ;
> >>>>>>>>
> >>>>>>>> But now we additionally have """ and ''' to choose from, making
> >>>>>>>> it even easier.
> >>>>>>>>
> >>>>>>>> So I propose in line with James' position there is NO eliding of
> >>>>>>>> terminator character at the CIF2 syntax level. ALL elides in the
> >>>>>>>> string are assumed to be user specific encoding (say TeX, IUCr
> >>>>>>>> \greek) which can be resolved at the dictionary level.
> >>>>>>>>
> >>>>>>>> This necessarily means NO terminator character can appear in a
> >>>>>>>> string delimited by the same terminator character. You will need
> >>>>>>>> to choose a different terminator character. That is
> >>>>>>>>
> >>>>>>>> No " in "strings"
> >>>>>>>> No ' in 'strings'
> >>>>>>>> No """ in """strings""" (but separable individual and doublet " are allowed)
> >>>>>>>> No ''' in '''strings''' (but separable individual and doublet ' are allowed)
> >>>>>>>>
> >>>>>>>> EVERYTHING in the string is returned as raw (except the initiating
> >>>>>>>> and terminating character).
> >>>>>>>>
> >>>>>>>> The only time you will not be able to encode a value in a
> >>>>>>>> delimited string is when you want to include ' " """ ''' and \n;
> >>>>>>>> in the one string. The likelihood of that is almost zero, unless
> >>>>>>>> you want to include a CIF within a CIF (a silly thing to do
> >>>>>>>> IMHO). In that case the contents can be encoded in a
> >>>>>>>> dictionary-driven way. I suggest it be declared as a BASE64 type,
> >>>>>>>> and then all the syntactic ambiguity disappears.
> >>>>>>>>
> >>>>>>>> Problem solved! No need to elide because of CIF2 syntax rules: all
> >>>>>>>> elides are user driven, and contents are returned raw.
> >>>>>>>>
> >>>>>>>> As for Herb's comment in a recent email asking about line-folding,
> >>>>>>>> the same holds. That is NOT a lexer issue and it has nothing to do
> >>>>>>>> with the parser: everything is read literally and returned raw,
> >>>>>>>> and what to do with it is left to the downstream application.
> >>>>>>>>
> >>>>>>>> Straw vote - No elides of terminator strings as described above -
> >>>>>>>> Nick
> >>>>>>>>
> >>>>>>>>
> >>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> T +61 (02) 9717 9907
> >>> F +61 (02) 9717 3145
> >>> M +61 (04) 0249 4148
> >>> _______________________________________________
> >>> ddlm-group mailing list
> >>> ddlm-group@iucr.org
> >>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
> >>>
> >>

