Re: [ddlm-group] Handling single string values longer than maximum line length

I have re-engineered this into the correct thread.

Simon and I are agreed.  But so that others understand clearly, a lexer, when in the appropriate context:

(a) matches "\n;", which initiates a semicolon-delimited string;
(b) gobbles and retains as raw the contents until
(c) it matches the next "\n;", at which point it terminates the semicolon-delimited string;
(d) returns what it gobbled up as a raw string, minus the leading and terminating "\n;".

That is, it behaves exactly the same as for strings, and the lexer knows nothing about the internal encoding (including line folding). All of that is passed to the downstream application to handle. I am trying to ensure we are consistent with what a lexer is supposed to do with any delimited string.
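
For anyone who prefers to see steps (a) to (d) as code, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function name lex_text_field and the convention that the caller already knows it is "in the appropriate context" are mine, not part of any spec or existing parser.

```python
def lex_text_field(source: str, start: int) -> tuple[str, int]:
    """Lex one semicolon-delimited string (hypothetical helper).

    `start` must index the newline of the opening "\n;". Returns the raw
    contents and the index just past the closing "\n;".
    """
    # (a) the opening "\n;" initiates the string
    if source[start:start + 2] != "\n;":
        raise ValueError("not at the start of a semicolon-delimited string")
    body_start = start + 2
    # (b)/(c) gobble everything up to the next "\n;"
    end = source.find("\n;", body_start)
    if end == -1:
        raise ValueError("unterminated semicolon-delimited string")
    # (d) return the raw contents, untouched: no unfolding, no elides
    return source[body_start:end], end + 2


source = "_name\n;line one\\\nline two\n; _next"
raw, pos = lex_text_field(source, source.index("\n;"))
```

Note that the trailing backslash on the first line comes back unchanged in raw; deciding whether it means anything is left to whatever consumes the string.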

On 25/11/09 9:26 PM, "SIMON WESTRIP" <simonwestrip@btinternet.com> wrote:

The IUCr has been implementing the line folding protocol in publCIF
because publCIF still tries to produce CIFs that respect 80 char line length.
However, there's been no need for this for some time now and the 80 char cutoff will be
dropped. In-house, line folding is employed if necessary to pass a CIF through some older
checking software, but the 'folded' CIF is subsequently discarded.
So my answer to

(1) Do you agree that line folding is just another encoding and therefore not a
STAR/CIF issue? Consequently it is the responsibility of the encoding not to
break the lexer.

is Yes
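
To make "just another encoding" concrete, here is a simplified sketch of unfolding done by the application on the raw string the lexer hands back. It assumes the usual convention that a folded value starts with a lone backslash on its first line and that a backslash at the end of a line (optionally followed by spaces or tabs) joins that line to the next. The function name unfold is hypothetical and this is not a complete implementation of the CIF 1.1 protocol.

```python
import re

# Backslash, optional trailing blanks, then the line terminator.
_FOLD = re.compile(r"\\[ \t]*\n")


def unfold(raw: str) -> str:
    """Undo line folding on a raw text-field value (simplified sketch)."""
    marker = _FOLD.match(raw)
    if marker is None:
        # No fold marker on the first line: the value is returned as-is.
        return raw
    body = raw[marker.end():]
    # Remove every backslash + line terminator pair, joining the lines.
    return _FOLD.sub("", body)


folded = "\\\nabc\\\ndef\nghi"
unfolded = unfold(folded)
```

The lexer never calls anything like this; it only matters to software that has agreed to treat the value as folded.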


Regarding question (2) Do we think \n;\n is a better delimiter?

We do encounter

; stuff
More stuff
; _newname

and

loop_
_item_stuff
; stuff
More stuff
; 'quoted stuff' unquoted_stuff

So don't require \n;\n, and as CIF2 still requires (importantly) an "appropriate separator" between items, I don't think
there's any ambiguity.

However, adopting \n;\n may reduce the risk of unintentionally terminating a data value by e.g. a text editor
wrapping a semicolon to the start of a line, or someone pasting in a string that contains \n; (but equally the
same applies to \n;\n).

So I'd say no need to change \n; to \n;\n as it would require extra 'remediation' when converting between CIF1 and CIF2.
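
The difference between the two candidate terminators can be sketched with a pair of regular expressions. This is an illustration only: the names LOOSE, STRICT and closes_at are made up for this message, and accepting end of input after the semicolon in the strict rule is my assumption.

```python
import re
from typing import Optional

# "\n;" rule: a semicolon in column 1 closes the string, whatever follows.
LOOSE = re.compile(r"\n;")
# "\n;\n" rule: the closing semicolon must be alone on its line
# (end of input is also accepted here, which is an assumption).
STRICT = re.compile(r"\n;(?=\n|\Z)")


def closes_at(text: str, rule: re.Pattern) -> Optional[int]:
    """Index of the newline that starts the terminator, or None if not found."""
    found = rule.search(text)
    return found.start() if found else None


# Body of the first example above, after the opening semicolon.
shared_line = "stuff\nMore stuff\n; _newname\n"
own_line = "stuff\nMore stuff\n;\n_newname\n"
```

With shared_line, the loose rule finds a terminator and the strict rule does not, which is the 'remediation' cost when converting such files.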

On the main issue of "no elides" - I'm OK with that.

Cheers

Simon

From: Nick Spadaccini <nick@csse.uwa.edu.au>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Wednesday, 25 November, 2009 3:37:03
Subject: Re: [ddlm-group] Use of elides in strings

I am with John. STAR has no line-folding protocol. As far as I can recall
neither did CIF. Somewhere along the way line folding was discussed (or
introduced?), but I am not sure it is formally part of any spec.

None of my software handles anything about line folding. I can see no reason
for it, since with a 2048 maximum record length, and a free format structure
there is plenty of room to output your data. The only time it would be
necessary is when (dataname + space + datavalue) > 2048 and when is that
ever going to happen?
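
The condition above is easy to state as a check. A tiny sketch, assuming lengths are counted in characters; the names MAX_RECORD_LENGTH and fits_on_one_line are invented for illustration.

```python
MAX_RECORD_LENGTH = 2048  # the maximum record length quoted above


def fits_on_one_line(dataname: str, datavalue: str,
                     limit: int = MAX_RECORD_LENGTH) -> bool:
    """True when dataname + one space + datavalue fits within the limit."""
    return len(dataname) + 1 + len(datavalue) <= limit


ok = fits_on_one_line("_cell_length_a", "5.431")
```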

Maybe the desire for it comes from making the data "pretty" and read well
in a text editor. Well, that is the task of an application: to read the CIF
and present it appropriately. The CIF is strictly about CONTENT and not
FORM.

Since we have given up on elided characters being part of CIF syntax, and
given the belief of others that this should not be a lexer issue, I think we
should be absolutely consistent. The lexer knows how to identify tokens and
reads everything within them as a raw string.

If your "encoding" for \n; strings includes characters that break the lexer,
then protect it in some way so that when you pass that string back as raw in
your software, somebody knows how to unprotect it back to the original (as
with ALL string encoding).
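
As one possible illustration of "protect" and "unprotect", here is a sketch that uses Base64 from the Python standard library. The choice of Base64 and the names protect and unprotect are assumptions made for the example; any reversible encoding whose output cannot contain \n; would serve.

```python
import base64


def protect(text: str) -> str:
    """Encode text so the result cannot contain a newline or a semicolon."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def unprotect(raw: str) -> str:
    """Recover the original text from the raw string the lexer returned."""
    return base64.b64decode(raw).decode("utf-8")


original = "first line\n; this would have ended the string\n"
stored = protect(original)     # what goes into the CIF
restored = unprotect(stored)   # what the application sees
```

The lexer sees only the stored form; the round trip is entirely the business of the software on either side.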

One concession I think we can consider is to change the delimiter from \n;
to \n;\n. I don't see this as causing me any problems, since I handle

; stuff
More stuff
; _newname
 
routinely, but others don't. I believe most people do use \n;\n (and
probably think it is the delimiter) anyway.

Two questions

(1) Do you agree that line folding is just another encoding and therefore not a
STAR/CIF issue? Consequently it is the responsibility of the encoding not to
break the lexer.
(2) Do we think \n;\n is a better delimiter?

On 25/11/09 10:33 AM, "John Westbrook" <jwest@pdb-mail.rutgers.edu> wrote:

> Hi James,
>
> My preference is to avoid elides in the syntax for the purpose of escaping
> terminators in strings, deferring interpretation to the application.
>
> I do not understand all of the issues related to line folding, which I
> believe is an issue for Brian and Simon.
>
> John
>
>
> James Hester wrote:
>> Thanks for the quick reply over Thanksgiving, John.  I take from your
>> message that the PDB does not need any elide mechanism to be defined
>> in the CIF2 syntax.  Would you therefore be prepared to vote in favour
>> of not defining any elides, or would you prefer to abstain?
>>
>> Votes so far:
>>
>> No elides: James, Nick, Herbert if the IUCr + PDB say it is OK
>> Elides:?
>>
>> Unknown: John, Joe, David B., Brian, Simon
>>
>> On Wed, Nov 25, 2009 at 12:03 PM, John Westbrook
>> <jwest@pdb-mail.rutgers.edu> wrote:
>>> I confess that I am having difficulty keeping up with all aspects
>>> of this discussion.   Following Herb's suggestion I will try to
>>> summarize the quoting issues from the PDB perspective.
>>>
>>> 1. As there are multiple ways of quoting a string our tools and files
>>> surround embedded quotes with quotes of the opposite sense or with
>>> semicolons in the mixed case.   I think that this point has been
>>> covered a number of times now and I believe that Nick has suggested
>>> that all reasonable cases can be handled by using this approach.
>>>
>>> 2. I too was not aware that the original definition of terminators
>>> had changed and did not include either a leading or trailing
>>> whitespace.  Certainly this must still be the case for single
>>> and double quotes.  I cannot recall ever seeing an example
>>> where the terminator \n; was followed by a whitespace character,
>>> but about half of the codes that I am familiar with would
>>> fall over on \n;next_token.
>>>
>>> 3. Line folding has never been an issue for PDB nor has line length.
>>>
>>> Regards,
>>>
>>> John
>>>
>>>
>>> Herbert J. Bernstein wrote:
>>>> My major concern about anything we do is to be able to preserve
>>>> the functionality of the practices that the IUCr is following in
>>>> journal publications and the PDB is following. Inasmuch as they seem
>>>> able to cope with no elide in CIF 1.1, the remaining question is whether
>>>> they will be negatively impacted by the change in string termination
>>>> without any elide.  If they can use CIF 2 with these changes, my
>>>> objections are purely academic and irrelevant.  -- Herbert
>>>>
>>>> =====================================================
>>>>  Herbert J. Bernstein, Professor of Computer Science
>>>>    Dowling College, Kramer Science Center, KSC 121
>>>>         Idle Hour Blvd, Oakdale, NY, 11769
>>>>
>>>>                  +1-631-244-3035
>>>>                  yaya@dowling.edu
>>>> =====================================================
>>>>
>>>> On Wed, 25 Nov 2009, James Hester wrote:
>>>>
>>>>> Herbert: I have the dubious advantage of not having participated in
>>>>> all those CIF1.0/1.1 discussions, so only have the spec as written
>>>>> down to rely on.
>>>>>
>>>>> Anyway, how do you feel about abandoning any specification of elides
>>>>> in CIF2 syntax, as suggested by Nick?
>>>>>
>>>>> On Wed, Nov 25, 2009 at 10:53 AM, Herbert J. Bernstein
>>>>> <yaya@bernstein-plus-sons.com> wrote:
>>>>>> Dear James,
>>>>>>
>>>>>>  I started to write:
>>>>>>  "No, in CIF 1.1, none of the terminal quote marks, including the \n;
>>>>>> are effective unless followed by whitespace (\n, space, tab, or end of
>>>>>> file). This is a well-established, and very tricky part of the CIF spec
>>>>>> going back to 1990.  That is why Nick had to explicitly specify that a
>>>>>> terminal quote mark would be effective no matter what it was followed by."
>>>>>>
>>>>>>  But the grammar currently on the IUCr web site is _not_ the one that I
>>>>>> recall COMCIFs discussing and approving.  It now explicitly removes
>>>>>> the requirement for terminal white space in the special case of
>>>>>> the \n; text field terminator.  I don't recall when that change was
>>>>>> adopted, but it appears that you are right under the current spec
>>>>>> about the example I chose.  Inasmuch as there is a lot of working code
>>>>>> that enforces and uses the original whitespace handling and uses it
>>>>>> in line-folding, I will not revise CIFtbx 3, but I will try to do
>>>>>> something to adapt to this change for CIFtbx 4.
>>>>>>
>>>>>>  I guess we are just going to have yet another few dialects of CIF.
>>>>>>
>>>>>>  Regards,
>>>>>>    Herbert
>>>>>> =====================================================
>>>>>>  Herbert J. Bernstein, Professor of Computer Science
>>>>>>   Dowling College, Kramer Science Center, KSC 121
>>>>>>        Idle Hour Blvd, Oakdale, NY, 11769
>>>>>>
>>>>>>                 +1-631-244-3035
>>>>>>                 yaya@dowling.edu
>>>>>> =====================================================
>>>>>>
>>>>>> On Wed, 25 Nov 2009, James Hester wrote:
>>>>>>
>>>>>>> To be precise, we are not 'referring all elides to the application'
>>>>>>> because no elides are recognised by the lexer under Nick's latest
>>>>>>> suggestion, so there are no elides to refer to the application.
>>>>>>>
>>>>>>> My understanding of CIF1.1 syntax suggests that the string you provide
>>>>>>> would produce a syntax error in CIF1.1, as the semicolon at the start
>>>>>>> of the second line would terminate the string, and so whitespace
>>>>>>> should then appear as the second character on the second line, rather
>>>>>>> than reverse solidus.
>>>>>>>
>>>>>>> On Wed, Nov 25, 2009 at 9:23 AM, Herbert J. Bernstein
>>>>>>> <yaya@bernstein-plus-sons.com> wrote:
>>>>>>>> The only problem with referring all elides to the application is that
>>>>>>>> with the removal of the requirement of a blank after a \n; for it
>>>>>>>> to be effective, the line folding protocol develops a slight gap.  The
>>>>>>>> case is as follows
>>>>>>>>
>>>>>>>> ;\
>>>>>>>> ;\
>>>>>>>> ;
>>>>>>>>
>>>>>>>> is a valid single text field in CIF 1.1, which when handled with the
>>>>>>>> line folding protocol translates to the equivalent of ';' because the
>>>>>>>> embedded ;\ is not a valid text terminator.  If we require that
>>>>>>>> a text field that begins with "\n;\\" must be terminated by "\n; "
>>>>>>>> or "\n;\n" or "\n;\t", that problem would be fixed.
>>>>>>>>
>>>>>>>> =====================================================
>>>>>>>>  Herbert J. Bernstein, Professor of Computer Science
>>>>>>>>   Dowling College, Kramer Science Center, KSC 121
>>>>>>>>        Idle Hour Blvd, Oakdale, NY, 11769
>>>>>>>>
>>>>>>>>                 +1-631-244-3035
>>>>>>>>                 yaya@dowling.edu
>>>>>>>> =====================================================
>>>>>>>>
>>>>>>>> On Wed, 25 Nov 2009, James Hester wrote:
>>>>>>>>
>>>>>>>>> I wholeheartedly agree with Nick's suggestion.
>>>>>>>>>
>>>>>>>>> On Tue, Nov 24, 2009 at 6:30 PM, Nick Spadaccini
>>>>>>>>> <nick@csse.uwa.edu.au>
>>>>>>>>> wrote:
>>>>>>>>>> It appears to me that we have spent far too long on a syntactic
>>>>>>>>>> issue which can be avoided 99.9999% of the time. Quite simply, given
>>>>>>>>>> the 5 ways to delimit strings, it is next to impossible to get a
>>>>>>>>>> situation where you cannot choose one of those to make the problem
>>>>>>>>>> go away.
>>>>>>>>>>
>>>>>>>>>> I think the RCSB systematically avoid it by choosing
>>>>>>>>>>
>>>>>>>>>> "ab'cd"
>>>>>>>>>> 'ab"cd'
>>>>>>>>>> ;ab'"cd
>>>>>>>>>> ;
>>>>>>>>>>
>>>>>>>>>> But now we additionally have """ and ''' to choose from, making
>>>>>>>>>> it even easier.
>>>>>>>>>>
>>>>>>>>>> So I propose, in line with James' position, that there is NO eliding
>>>>>>>>>> of terminator characters at the CIF2 syntax level. ALL elides in the
>>>>>>>>>> string are assumed to be user specific encoding (say TeX, IUCr \greek)
>>>>>>>>>> which can be resolved at the dictionary level.
>>>>>>>>>>
>>>>>>>>>> This necessarily means NO terminator character can appear in a
>>>>>>>>>> string delimited by the same terminator character. You will need to
>>>>>>>>>> choose a different terminator character. That is
>>>>>>>>>>
>>>>>>>>>> No " in "strings"
>>>>>>>>>> No ' in 'strings'
>>>>>>>>>> No """ in """strings""" (but separable individual and doublet " are allowed)
>>>>>>>>>> No ''' in '''strings''' (but separable individual and doublet ' are allowed)
>>>>>>>>>>
>>>>>>>>>> EVERYTHING in the string is returned as raw (except the
>>>>>>>>>> initiating and terminating character).
>>>>>>>>>>
>>>>>>>>>> The only time you will not be able to encode anything in a delimited
>>>>>>>>>> string is when you want to include ' " """ ''' and \n; in the one
>>>>>>>>>> string. The likelihood of that is almost zero, unless you want to
>>>>>>>>>> include a CIF within a CIF (a silly thing to do IMHO). In that case
>>>>>>>>>> the contents can be encoded in a dictionary driven way. I suggest it
>>>>>>>>>> be declared as a BASE64 type and then all the syntactic ambiguity
>>>>>>>>>> disappears.
>>>>>>>>>>
>>>>>>>>>> Problem solved! No need to elide because of CIF2 syntax rules; all
>>>>>>>>>> elides are user driven, and contents are returned raw.
>>>>>>>>>>
>>>>>>>>>> As for Herb's comment in a recent email about line-folding, the
>>>>>>>>>> same holds. That is NOT a lexer issue and it has nothing to do
>>>>>>>>>> with the parser; everything is read literally and returned raw, and
>>>>>>>>>> what to do with it is passed on to the downstream application.
>>>>>>>>>>
>>>>>>>>>> Straw vote - No elides of terminator strings as described above -
>>>>>>>>>> Nick
>>>>>>>>>>
>>>>>>>>>>
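
The "choose a different delimiter" rule proposed in the message above can be sketched as a small writer-side helper. This is an illustration under assumptions, not anyone's actual implementation: the function name wrap is hypothetical, restricting single and double quotes to single-line values is an assumption, and so is the extra guard on values that end in a quote character.

```python
from typing import Optional


def wrap(value: str) -> Optional[str]:
    """Wrap value in the first delimiter it does not contain, else None."""
    single_line = "\n" not in value
    if single_line and '"' not in value:
        return f'"{value}"'
    if single_line and "'" not in value:
        return f"'{value}'"
    for triple in ('"""', "'''"):
        # Assumption: a value ending in the quote character is also rejected,
        # so that the closing delimiter stays unambiguous.
        if triple not in value and not value.endswith(triple[0]):
            return f"{triple}{value}{triple}"
    if "\n;" not in value:
        return f"\n;{value}\n;"
    # Every delimiter collides: fall back to a dictionary-driven encoding.
    return None


quoted = wrap("ab'cd")
```

A None result corresponds to the "CIF within a CIF" case, where the message suggests declaring the item as an encoded type instead.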
>>>>>
>>>>>
>>>>> --
>>>>> T +61 (02) 9717 9907
>>>>> F +61 (02) 9717 3145
>>>>> M +61 (04) 0249 4148
>>>>> _______________________________________________
>>>>> ddlm-group mailing list
>>>>> ddlm-group@iucr.org
>>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>>

cheers

Nick

--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au




