
Re: [ddlm-group] Use of elides in strings




On 25/11/09 10:24 PM, "SIMON WESTRIP" <simonwestrip@btinternet.com> wrote:

> What Brian has said here - specifically
> 
> "if this were dropped as part of the CIF2 specification,
> we would need to think carefully about how else to retain this
> functionality"
> 
> is also relevant to how we handle the CIF1.1 markup conventions.
> As I understand it in CIF1.1 these are the default conventions for
> text fields unless the dictionary prohibits them, but in CIF2 all such
> conventions will _not_ be part of the spec, and can only be interpreted at
> the dictionary level.
> 
> Is this correct?

Yes, this is my understanding. I presume there will be many different
conventions. Some will be widely accepted and standard, and those will be
part of the underlying systems that interpret the files. For instance, if
something is declared as a TeX encoding, we know what to do.

> 
> I'm only asking because we (at the IUCr at least) will have to address this
> issue sooner rather than later when adopting CIF2, so I just want to make sure
> I understand base CIF2 correctly.
> 
> Cheers
> 
> Simon
> 
> 
> 
> From: Brian McMahon <bm@iucr.org>
> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
> Sent: Wednesday, 25 November, 2009 13:34:05
> Subject: Re: [ddlm-group] Use of elides in strings
> 
> (I've switched the thread title to deal separately with line folding.)
> 
> As Herbert says, line folding is part of the CIF 1.1 spec (pages 34-35
> of the ITG bible). Currently, it invokes a special meaning for the
> backslash (reverse solidus) character, but only when it is the first
> non-blank after an opening semicolon or comment hash delimiter. We have
> yet to discuss whether to extend it to other string types (specifically
> the triple-quoted strings).
> 
> It's quite easy these days to generate single strings that are longer
> than 2048 characters (or any other arbitrary line limit) - e.g. a
> protein or nucleic acid sequence. Many, many chemical names broke the old
> 80-character line length limit.
> 
> We're very happy with CIF applications that do not interpret the
> line-folding protocol, so long as they preserve the existing backslashes.
> However, a fully-compliant CIF 1.1 parser should be able to return an
> unfolded string to an application that requests it.
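The unfolding Brian describes can be sketched roughly as follows. This is a minimal illustration, not the normative CIF 1.1 algorithm: it assumes a text field is folded when the first line after the opening semicolon is a lone backslash, and that a backslash as the last non-blank character of a later line joins that line to the next. The function name `unfold` is hypothetical.

```python
def unfold(raw):
    """Unfold the raw content of a semicolon-delimited text field.

    `raw` is assumed to be everything after the opening ';' up to, but not
    including, the terminating newline-semicolon.
    """
    lines = raw.split("\n")
    # Assumption: the field is folded only if its first line is a lone backslash.
    if lines[0].rstrip(" \t") != "\\":
        return raw
    result = []
    current = ""
    continuing = False
    for line in lines[1:]:
        stripped = line.rstrip(" \t")
        if stripped.endswith("\\"):
            # Drop the backslash and join with the following line.
            current += stripped[:-1]
            continuing = True
        else:
            result.append(current + line)
            current = ""
            continuing = False
    if continuing:
        # The last line ended with a fold marker and nothing followed it.
        result.append(current)
    return "\n".join(result)


print(unfold("\\\nMKVLAA\\\nGIVGLL\\\nLAQ"))
```

An application that does not interpret the protocol would simply keep `raw` as it is, backslashes included, which is the behaviour Brian says is acceptable.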
> 
> As Herbert says, if this were dropped as part of the CIF2 specification,
> we would need to think carefully about how else to retain this
> functionality.
> 
> Regards
> Brian
> 
> On Wed, Nov 25, 2009 at 07:54:51AM -0500, Herbert J. Bernstein wrote:
>> The line folding protocol was discussed and adopted by COMCIFS and is
>> posted, along with other "Common Semantic Features" at
>> 
>> http://www.iucr.org/resources/cif/spec/version1.1/semantics
>> 
>> but that is neither here nor there.  The point is that the IUCr uses CIF
>> to get work done.  If we disable something they are using, we should offer
>> some equivalent functionality so they can use CIF 2 to do their work.
>> Otherwise, they will have to do the sensible thing, and continue to use
>> CIF 1, or, worse, create their own dialect of CIF 2.
>> 
>> Now, I broke my nose yesterday morning and find myself a bit punchy today,
>> so I will drop out of this discussion for a while.  Hopefully, when I
>> return to it, this whole matter will be settled in some way that will
>> allow people to actually use CIF 2, instead of it becoming what it seems
>> on its way to becoming -- something elegant but not terribly useful, a bit
>> like PL/I.
>> 
>> Cheers,
>>    Herbert
>> 
>> =====================================================
>>   Herbert J. Bernstein, Professor of Computer Science
>>     Dowling College, Kramer Science Center, KSC 121
>>          Idle Hour Blvd, Oakdale, NY, 11769
>> 
>>                   +1-631-244-3035
>>                   yaya@dowling.edu
>> =====================================================
>> 
>> On Wed, 25 Nov 2009, Nick Spadaccini wrote:
>> 
>>> I am with John. STAR has no line-folding protocol. As far as I can recall
>>> neither did CIF. Somewhere along the way line folding was discussed (or
>>> introduced?), but I am not sure it is formally part of any spec.
>>> 
>>> None of my software handles anything about line folding. I can see no reason
>>> for it, since with a 2048 maximum record length, and a free format structure
>>> there is plenty of room to output your data. The only time it would be
>>> necessary is when (dataname + space + datavalue)> 2048 and when is that
>>> ever going to happen?
>>> 
>>> Maybe the desire for it comes from making the data "pretty" and read well
>>> in a text editor. Well that is the task of an application to read the CIF
>>> and present it appropriately. The CIF is strictly about CONTENT and not
>>> FORM.
>>> 
>>> Since we have given up on elided characters being part of CIF syntax, and
>>> given the belief of others that this should not be a lexer issue, I think we
>>> should be absolutely consistent. The lexer knows how to identify tokens and
>>> reads everything within them as a raw string.
>>> 
>>> If your "encoding" for \n; strings includes characters that break the lexer,
>>> then protect it in some way so that when you pass that string back as raw in
>>> your software, somebody knows how to unprotect it back to the original (as
>>> with ALL string encoding).
>>> 
>>> One concession I think we can consider is to change the delimiter from \n;
>>> to \n;\n. I don't see this as causing me any problems, since I handle
>>> 
>>> ; stuff
>>> More stuff
>>> ; _newname
>>> 
>>> routinely, but others don't. I believe most people do use (and probably
>>> think) the delimiter is \n;\n anyway.
>>> 
>>> Two questions
>>> 
>>> (1) Do you agree that line folding is just another encoding and therefore not a
>>> STAR/CIF issue? Consequently it is the responsibility of the encoding not to
>>> break the lexer.
>>> (2) Do we think \n;\n is a better delimiter?
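The difference between the two candidate delimiters in question (2) can be illustrated with a small line-based sketch. It is only an illustration under stated assumptions: the function `closing_line` and its flag are hypothetical names, and the input is taken to be already split into lines with the text field opening on the first one.

```python
def closing_line(lines, require_bare_semicolon):
    """Return the index of the line that closes a text field opened on lines[0].

    With require_bare_semicolon=False the field closes at the first later line
    that starts with ';' (the newline-semicolon delimiter).
    With require_bare_semicolon=True the closing line must be exactly ';'
    (the newline-semicolon-newline delimiter).
    Returns None if no closing line is found.
    """
    for index, line in enumerate(lines[1:], start=1):
        if require_bare_semicolon:
            if line == ";":
                return index
        elif line.startswith(";"):
            return index
    return None


example = ["; stuff", "More stuff", "; _newname"]
print(closing_line(example, require_bare_semicolon=False))
print(closing_line(example, require_bare_semicolon=True))
```

Under the first rule Nick's example closes on its third line and `_newname` is read as the next token. Under the second rule the same input has no closing line at all, which is the kind of file that would need rewriting if the delimiter were changed.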
>>> 
>>> On 25/11/09 10:33 AM, "John Westbrook" <jwest@pdb-mail.rutgers.edu> wrote:
>>> 
>>>> Hi James,
>>>> 
>>>> My preference is to avoid elides in the syntax for the purpose of escaping
>>>> terminators in strings, deferring interpretation to the application.
>>>> 
>>>> I do not understand all of the issues related to line folding, which I
>>>> believe is an issue for Brian and Simon.
>>>> 
>>>> John
>>>> 
>>>> 
>>>> James Hester wrote:
>>>>> Thanks for the quick reply over Thanksgiving, John.  I take from your
>>>>> message that the PDB does not need any elide mechanism to be defined
>>>>> in the CIF2 syntax.  Would you therefore be prepared to vote in favour
>>>>> of not defining any elides, or would you prefer to abstain?
>>>>> 
>>>>> Votes so far:
>>>>> 
>>>>> No elides: James, Nick, Herbert if the IUCr + PDB say it is OK
>>>>> Elides:?
>>>>> 
>>>>> Unknown: John, Joe, David B., Brian, Simon
>>>>> 
>>>>> On Wed, Nov 25, 2009 at 12:03 PM, John Westbrook
>>>>> <jwest@pdb-mail.rutgers.edu> wrote:
>>>>>> I confess that I am having difficulty keeping up with all aspects
>>>>>> of this discussion.   Following Herb's suggestion I will try to
>>>>>> summarize the quoting issues from the PDB perspective.
>>>>>> 
>>>>>> 1. As there are multiple ways of quoting a string our tools and files
>>>>>> surround embedded quotes with quotes of the opposite sense or with
>>>>>> semicolons in the mixed case.   I think that this point has been
>>>>>> covered a number of times now and I believe that Nick has suggested
>>>>>> that all reasonable cases can be handled by using this approach.
>>>>>> 
>>>>>> 2. I too was not aware that the original definition of terminators
>>>>>> had changed and did not include either leading or trailing
>>>>>> whitespace.  Certainly this must still be the case for single
>>>>>> and double quotes.  I cannot recall ever seeing an example
>>>>>> where the terminator \n; was followed by a whitespace character,
>>>>>> but about half of the codes that I am familiar with would
>>>>>> fall over on \n;next_token.
>>>>>> 
>>>>>> 3. Line folding has never been an issue for PDB nor has line length.
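The quoting strategy in point 1 can be sketched as below. This is a hedged illustration rather than the PDB's actual code: `quote_value` is a hypothetical name, and the sketch deliberately refuses values that contain a newline followed by a semicolon, since without an elide mechanism such a value cannot be written as a text field.

```python
def quote_value(value):
    """Choose a delimiter for `value` using quotes of the opposite sense,
    falling back to a semicolon-delimited text field in the mixed case."""
    if "\n;" in value:
        # No elide mechanism is assumed, so this value cannot be protected here.
        raise ValueError("value contains a text field terminator")
    if "\n" not in value:
        if "'" not in value:
            return "'" + value + "'"
        if '"' not in value:
            return '"' + value + '"'
    # Mixed quotes or multi-line content: use a text field, with the
    # semicolons placed at the start of a line.
    return "\n;" + value + "\n;"


print(quote_value("O'Neil"))
print(quote_value('5" tube'))
print(quote_value("5' 3\""))
```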
>>>>>> 
>>>>>> Regards,
>>>>>> 
>>>>>> John
>>>>>> 
>>>>>> 
>>>>>> Herbert J. Bernstein wrote:
>>>>>>> My major concern about anything we do is to be able to preserve
>>>>>>> the functionality of the practices that the IUCr is following in
>>>>>>> journal publications and the PDB is following. Inasmuch as they seem
>>>>>>> able to cope with no elide in CIF 1.1, the remaining question is whether
>>>>>>> they will be negatively impacted by the change in string termination
>>>>>>> without any elide.  If they can use CIF 2 with these changes, my
>>>>>>> objections are purely academic and irrelevant.  -- Herbert
>>>>>>> 
>>>>>>> =====================================================
>>>>>>>  Herbert J. Bernstein, Professor of Computer Science
>>>>>>>    Dowling College, Kramer Science Center, KSC 121
>>>>>>>         Idle Hour Blvd, Oakdale, NY, 11769
>>>>>>> 
>>>>>>>                  +1-631-244-3035
>>>>>>>                  yaya@dowling.edu
>>>>>>> =====================================================
>>>>>>> 
>>>>>>> On Wed, 25 Nov 2009, James Hester wrote:
>>>>>>> 
>>>>>>>> Herbert: I have the dubious advantage of not having participated in
>>>>>>>> all those CIF1.0/1.1 discussions, so only have the spec as written
>>>>>>>> down to rely on.
>>>>>>>> 
>>>>>>>> Anyway, how do you feel about abandoning any specification of elides
>>>>>>>> in CIF2 syntax, as suggested by Nick?
>>>>>>>> 
>>>>>>>> On Wed, Nov 25, 2009 at 10:53 AM, Herbert J. Bernstein
>>>>>>>> <yaya@bernstein-plus-sons.com> wrote:
>>>>>>>>> Dear James,
>>>>>>>>> 
>>>>>>>>>  I started to write:
>>>>>>>>>  "No, in CIF 1.1, none of the terminal quote marks, including the \n;,
>>>>>>>>> are effective unless followed by whitespace (\n, space, tab, or end of
>>>>>>>>> file). This is a well-established, and very tricky, part of the CIF spec
>>>>>>>>> going back to 1990.  That is why Nick had to explicitly specify that a
>>>>>>>>> terminal quote mark would be effective no matter what it was followed by."
>>>>>>>>> 
>>>>>>>>>  But the grammar currently on the IUCr web site is _not_ the one that I
>>>>>>>>> recall COMCIFS discussing and approving.  It now explicitly removes
>>>>>>>>> the requirement for terminal white space in the special case of
>>>>>>>>> the \n; text field terminator.  I don't recall when that change was
>>>>>>>>> adopted, but it appears that you are right under the current spec
>>>>>>>>> about the example I chose.  Inasmuch as there is a lot of working code
>>>>>>>>> that enforces and uses the original whitespace handling and uses it
>>>>>>>>> in line-folding, I will not revise CIFtbx 3, but I will try to do
>>>>>>>>> something to adapt to this change for CIFtbx 4.
>>>>>>>>> 
>>>>>>>>>  I guess we are just going to have yet another few dialects of CIF.
>>>>>>>>> 
>>>>>>>>>  Regards,
>>>>>>>>>    Herbert
>>>>>>>>> =====================================================
>>>>>>>>>  Herbert J. Bernstein, Professor of Computer Science
>>>>>>>>>   Dowling College, Kramer Science Center, KSC 121
>>>>>>>>>        Idle Hour Blvd, Oakdale, NY, 11769
>>>>>>>>> 
>>>>>>>>>                 +1-631-244-3035
>>>>>>>>>                 yaya@dowling.edu
>>>>>>>>> =====================================================
>>>>>>>>> 
>>>>>>>>> On Wed, 25 Nov 2009, James Hester wrote:
>>>>>>>>> 
>>>>>>>>>> To be precise, we are not 'referring all elides to the application'
>>>>>>>>>> because no elides are recognised by the lexer under Nick's latest
>>>>>>>>>> suggestion, so there are no elides to refer to the application.
>>>>>>>>>> 
>>>>>>>>>> My understanding of CIF1.1 syntax suggests that the string you provide
>>>>>>>>>> would produce a syntax error in CIF1.1, as the semicolon at the start
>>>>>>>>>> of the second line would terminate the string, and so whitespace
>>>>>>>>>> should then appear as the second character on the second line, rather
>>>>>>>>>> than reverse solidus.
>>>>>>>>>> 
>>>>>>>>>> On Wed, Nov 25, 2009 at 9:23 AM, Herbert J. Bernstein
>>>>>>>>>> <yaya@bernstein-plus-sons.com> wrote:
>>>>>>>>>>> The only problem with referring all elides to the application is that
>>>>>>>>>>> with the removal of the requirement of a blank after a \n; for it to be
>>>>>>>>>>> effective, the line folding protocol develops a slight gap.  The
>>>>>>>>>>> case is as follows
>>>>>>>>>>> 
>>>>>>>>>>> ;\
>>>>>>>>>>> ;\
>>>>>>>>>>> ;
>>>>>>>>>>> 
>>>>>>>>>>> Is a valid single text field in CIF 1.1, which when handled with the
>>>>>>>>>>> line folding protocol translates to the equivalent of ';' because the
>>>>>>>>>>> embedded ;\ is not a valid text terminator.  If we require that
>>>>>>>>>>> a text field that begins with "\n;\\" must be terminated by "\n; "
>>>>>>>>>>> or "\n;\n" or "\n;\t" that problem would be fixed.
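The gap Herbert points out can be made concrete with a short sketch that splits his three-line example under the two terminator rules discussed in this thread. The regular expressions and the names `ORIGINAL_RULE`, `CURRENT_RULE` and `split_field` are illustrative assumptions, not taken from any CIF parser.

```python
import re

# Assumed reading of the original rule: newline-semicolon only terminates
# when followed by whitespace or end of input.
ORIGINAL_RULE = re.compile(r"\n;(?=[ \t\n]|\Z)")
# Assumed reading of the current rule: newline-semicolon always terminates.
CURRENT_RULE = re.compile(r"\n;")


def split_field(source, terminator):
    """Split `source`, which starts with the opening ';', into the raw field
    content and whatever follows the terminator."""
    match = terminator.search(source, 1)
    if match is None:
        raise ValueError("unterminated text field")
    return source[1:match.start()], source[match.end():]


example = ";\\\n;\\\n;\n"
print(split_field(example, ORIGINAL_RULE))
print(split_field(example, CURRENT_RULE))
```

Under the first rule the content is the fold marker followed by the line `;\`, which unfolds to the single semicolon Herbert describes. Under the second rule the field ends after the fold marker alone, and the remaining input begins with a stray backslash, which matches James's reading that the example is a syntax error under the current grammar.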
>>>>>>>>>>> 
>>>>>>>>>>> =====================================================
>>>>>>>>>>>  Herbert J. Bernstein, Professor of Computer Science
>>>>>>>>>>>   Dowling College, Kramer Science Center, KSC 121
>>>>>>>>>>>        Idle Hour Blvd, Oakdale, NY, 11769
>>>>>>>>>>> 
>>>>>>>>>>>                 +1-631-244-3035
>>>>>>>>>>>                 yaya@dowling.edu
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
> 
> 

cheers

Nick

--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au




