Re: [ddlm-group] Use of elides in strings

To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] Use of elides in strings
From: SIMON WESTRIP <simonwestrip@btinternet.com>
Date: Thu, 26 Nov 2009 13:59:40 +0000 (GMT)
In-Reply-To: <alpine.BSF.2.00.0911260722550.33247@epsilon.pair.com>
References: <C73434A0.1262E%nick@csse.uwa.edu.au><alpine.BSF.2.00.0911260722550.33247@epsilon.pair.com>

Just as a distraction from trying to understand modulated structure CIFs,
here goes:

I'd use semicolon delimiters (see long arrows <------ below)
and if I didn't know the definition of the item, I would
respect the whitespace.

Actually, I'd probably bung in a couple of extra newlines for good measure
if I knew what I was dealing with - i.e.

;
O'"
;

Funnily enough, this is actually easier to read using my eyes
than "O'"" :-)
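
A minimal sketch of how the choices marked with arrows in the quoted table
could be automated, assuming no elide mechanism at all: use the opposite
quote when only one kind of quote is present, and fall back to a
semicolon-delimited field otherwise. The function name and the extra rule
that a run of three identical quotes also forces the semicolon form are
assumptions made only to reproduce the arrows; nothing here is agreed CIF2
syntax.

```python
def choose_delimited_form(value: str) -> str:
    """Return one possible delimited form of `value` without any elides.

    Hypothetical helper: "\n;" is written out literally, as in the table.
    """
    has_single = "'" in value
    has_double = '"' in value
    # Assumption: treat a run of three quotes as unsafe inside quotes.
    has_triple = "'''" in value or '"""' in value
    multiline = "\n" in value

    if not (multiline or has_triple or (has_single and has_double)):
        # Only one kind of quote (or none): wrap in the other kind.
        if not has_double:
            return '"' + value + '"'
        return "'" + value + "'"

    # A semicolon field cannot hold its own terminator without an elide.
    if "\n;" in value:
        raise ValueError("value contains the text field terminator")
    return "\n;" + value + "\n;"


if __name__ == "__main__":
    for sample in ["O'", 'O"', "O'\"", "''O''", "'''O'''"]:
        print(repr(sample), "->", repr(choose_delimited_form(sample)))
```

Whether the trailing newline before the closing semicolon belongs to the
value is a separate question, which this sketch does not try to answer.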

From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
To: Nick.Spadaccini@uwa.edu.au; Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Thursday, 26 November, 2009 12:41:46
Subject: Re: [ddlm-group] Use of elides in strings

I am trying to get some CIF-2 related software done, so please advise
me on some specific cases:

How should the following C-style strings followed by their CIF-1.1
representations be presented in a CIF 2 document? I've only put
in CIF 2 cases where I think there is no question, but feel free
to correct those.

C-style             CIF-1.1 style               CIF-2

"O'"                "O'" or 'O''                "O'"
"O\""               "O"" or 'O"'                'O"'
"O'\""              "O'"" or 'O'"'              ?  <---------------- \n;O'"\n;
"''O''"             "''O''" or '''O'''          "''O''"
"'''O'''"           "'''O'''" or ''''O''''      ?  <---------------- \n;'''O'''\n;
"\"\"O'\"\""        """O'""" or '""O'""'        ?  <---------------- \n;""O'""\n;
"\"\"\"O'\"\"\""    """"O'"""" or '"""O'"""'    ?  <---------------- \n;"""O'"""\n;

and for a semicolon-delimited string, is the last newline part of
the string or part of the delimiter, i.e. if the string is
"abc\n" is the CIF-2 version

;abc
; <-------------------- if newlines are not required by the item's definition, I'd be tempted to strip the whitespace

or

;abc
<-------------------- without knowing the item's definition, I'd be tempted to respect the whitespace
;
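
To make the two readings concrete, here is a small sketch that decodes the
same raw text field both ways. The function name and the boolean flag are
hypothetical, and the field is assumed to be passed in from its opening
semicolon through its closing semicolon, with no line folding applied.

```python
def text_field_value(field: str, newline_is_delimiter: bool) -> str:
    """Decode a semicolon-delimited field under one of the two readings.

    `field` runs from the opening ';' to the closing ';' inclusive.
    """
    if not (field.startswith(";") and field.endswith("\n;")):
        raise ValueError("not a complete semicolon-delimited text field")
    if newline_is_delimiter:
        # Reading 1: the terminator is "\n;", so the newline is dropped.
        return field[1:-2]
    # Reading 2: the terminator is just ";", so the newline is kept.
    return field[1:-1]


if __name__ == "__main__":
    short = ";abc\n;"
    padded = ";abc\n\n;"
    for raw in (short, padded):
        print(repr(raw),
              repr(text_field_value(raw, True)),
              repr(text_field_value(raw, False)))
```

Under the first reading "abc\n" has to be written with the extra line
before the closing semicolon; under the second, the shorter form already
carries the newline.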

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Thu, 26 Nov 2009, Nick Spadaccini wrote:

>
>
>
> On 25/11/09 10:24 PM, "SIMON WESTRIP" <simonwestrip@btinternet.com> wrote:
>
>> What Brian has said here - specifically
>>
>> "if this were dropped as part of the CIF2 specification,
>> we would need to think carefully about how else to retain this
>> functionality"
>>
>> is also relevant to how we handle the CIF1.1 markup conventions.
>> As I understand it in CIF1.1 these are the default conventions for
>> text fields unless the dictionary prohibits them, but in CIF2 all such
>> conventions will _not_ be part of the spec, and can only be interpreted at
>> the dictionary level.
>>
>> Is this correct?
>
> Yes, this is my understanding. I presume there will be many different
> conventions; some will be widely accepted and standard, and they will be part
> of the underlying systems that interpret the files. For instance, if something
> is declared as a TeX encoding, we know what to do.
>
>>
>> I'm only asking because we (at the IUCr at least) will have to address this
>> issue sooner rather than later when adopting CIF2, so I just want to make sure
>> I understand base CIF2 correctly
>>
>> Cheers
>>
>> Simon
>>
>>
>>
>> From: Brian McMahon <bm@iucr.org>
>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>> Sent: Wednesday, 25 November, 2009 13:34:05
>> Subject: Re: [ddlm-group] Use of elides in strings
>>
>> (I've switched the thread title to deal separately with line folding.)
>>
>> As Herbert says, line folding is part of the CIF 1.1 spec (pages 34-35
>> of the ITG bible). Currently, it invokes a special meaning for the
>> backslash (reverse solidus) character, but only when it is the first
>> non-blank after an opening semicolon or comment hash delimiter. We have
>> yet to discuss whether to extend it to other string types (specifically
>> the triple-quoted strings).
>>
>> It's quite easy these days to generate single strings that are longer
>> than 2048 characters (or any other arbitrary line limit) - e.g. a
>> protein or nucleic acid sequence. Many, many chemical names broke the old
>> 80-character line length limit.
>>
>> We're very happy with CIF applications that do not interpret the
>> line-folding protocol, so long as they preserve the existing backslashes.
>> However, a fully-compliant CIF 1.1 parser should be able to return an
>> unfolded string to an application that requests it.
>>
>> As Herbert says, if this were dropped as part of the CIF2 specification,
>> we would need to think carefully about how else to retain this
>> functionality.
>>
>> Regards
>> Brian
>>
>> On Wed, Nov 25, 2009 at 07:54:51AM -0500, Herbert J. Bernstein wrote:
>>> The line folding protocol was discussed and adopted by COMCIFS and is
>>> posted, along with other "Common Semantic Features" at
>>>
>>> http://www.iucr.org/resources/cif/spec/version1.1/semantics
>>>
>>> but that is neither here nor there. The point is that the IUCr uses CIF
>>> to get work done. If we disable something they are using, we should offer
>>> some equivalent functionality so they can use CIF 2 to do their work.
>>> Otherwise, they will have to do the sensible thing, and continue to use
>>> CIF 1, or, worse, create their own dialect of CIF 2.
>>>
>>> Now, I broke my nose yesterday morning and find myself a bit punchy today,
>>> so I will drop out of this discussion for a while. Hopefully, when I
>>> return to it, this whole matter will be settled in some way that will
>>> allow people to actually use CIF 2, instead of it becoming what it seems
>>> on its way to becoming -- something elegant but not terribly useful, a bit
>>> like PL/I.
>>>
>>> Cheers,
>>> Herbert
>>>
>>> =====================================================
>>> Herbert J. Bernstein, Professor of Computer Science
>>> Dowling College, Kramer Science Center, KSC 121
>>> Idle Hour Blvd, Oakdale, NY, 11769
>>>
>>> +1-631-244-3035
>>> yaya@dowling.edu
>>> =====================================================
>>>
>>> On Wed, 25 Nov 2009, Nick Spadaccini wrote:
>>>
>>>> I am with John. STAR has no line-folding protocol. As far as I can recall
>>>> neither did CIF. Somewhere along the way line folding was discussed (or
>>>> introduced?), but I am not sure it is formally part of any spec.
>>>>
>>>> None of my software handles anything about line folding. I can see no reason
>>>> for it, since with a 2048 maximum record length, and a free format structure
>>>> there is plenty of room to output your data. The only time it would be
>>>> necessary is when (dataname + space + datavalue)> 2048 and when is that
>>>> ever going to happen?
>>>>
>>>> Maybe the desire for it comes from making the data "pretty" and read well
>>>> in a text editor. Well that is the task of an application to read the CIF
>>>> and present it appropriately. The CIF is strictly about CONTENT and not
>>>> FORM.
>>>>
>>>> Since we have given up on elided characters being part of CIF syntax, and
>>>> given the belief by others that this not be a lexer issue, I think we should
>>>> be absolutely consistent. The lexer knows how to identify tokens and reads
>>>> everything within them as a raw string.
>>>>
>>>> If your "encoding" for \n; strings includes characters that break the lexer,
>>>> then protect it in some way so that when you pass that string back as raw in
>>>> your software, somebody knows how to unprotect it back to the original (as
>>>> with ALL string encoding).
>>>>
>>>> One concession I think we can consider is to change the delimiter from \n;
>>>> to \n;\n. I don't see this as causing me any problems, since I handle
>>>>
>>>> ; stuff
>>>> More stuff
>>>> ; _newname
>>>>
>>>> routinely, but others don't. I believe most people do use (and probably
>>>> think) the delimiter is \n;\n anyway.
>>>>
>>>> Two questions
>>>>
>>>> (1) Do you agree that line folding is just another encoding and therefore not a
>>>> STAR/CIF issue? Consequently it is the responsibility of the encoding not to
>>>> break the lexer.
>>>> (2) Do we think \n;\n is a better delimiter?
>>>>
>>>> On 25/11/09 10:33 AM, "John Westbrook" <jwest@pdb-mail.rutgers.edu> wrote:
>>>>
>>>>> Hi James,
>>>>>
>>>>> My preference is to avoid elides in the syntax for the purpose of escaping
>>>>> terminators in strings, deferring interpretation to the application.
>>>>>
>>>>> I do not understand all of the issues related to line folding, which I
>>>>> believe is an issue for Brian and Simon.
>>>>>
>>>>> John
>>>>>
>>>>>
>>>>> James Hester wrote:
>>>>>> Thanks for the quick reply over Thanksgiving, John. I take from your
>>>>>> message that the PDB does not need any elide mechanism to be defined
>>>>>> in the CIF2 syntax. Would you therefore be prepared to vote in favour
>>>>>> of not defining any elides, or would you prefer to abstain?
>>>>>>
>>>>>> Votes so far:
>>>>>>
>>>>>> No elides: James, Nick, Herbert if the IUCr + PDB say it is OK
>>>>>> Elides:?
>>>>>>
>>>>>> Unknown: John, Joe, David B., Brian, Simon
>>>>>>
>>>>>> On Wed, Nov 25, 2009 at 12:03 PM, John Westbrook
>>>>>> <jwest@pdb-mail.rutgers.edu> wrote:
>>>>>>> I confess that I am having difficulty keeping up with all aspects
>>>>>>> of this discussion. Following Herb's suggestion I will try to
>>>>>>> summarize the quoting issues from the PDB perspective.
>>>>>>>
>>>>>>> 1. As there are multiple ways of quoting a string our tools and files
>>>>>>> surround embedded quotes with quotes of the opposite sense or with
>>>>>>> semicolons in the mixed case. I think that this point has been
>>>>>>> covered a number of times now and I believe that Nick has suggested
>>>>>>> that all reasonable cases can be handled by using this approach.
>>>>>>>
>>>>>>> 2. I too was not aware that the original definition of terminators
>>>>>>> had changed and did not include either a leading or trailing
>>>>>>> whitespace. Certainly this must still be the case for single
>>>>>>> and double quotes. I cannot recall ever seeing an example
>>>>>>> where the terminator \n; was followed by a whitespace character,
>>>>>>> but about half of the codes that I am familiar with would
>>>>>>> fall over on \n;next_token.
>>>>>>>
>>>>>>> 3. Line folding has never been an issue for PDB nor has line length.
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> John
>>>>>>>
>>>>>>>
>>>>>>> Herbert J. Bernstein wrote:
>>>>>>>> My major concern about anything we do is to be able to preserve
>>>>>>>> the functionality of the practices that the IUCr is following in
>>>>>>>> journal publications and the PDB is following. Inasmuch as they seem
>>>>>>>> able to cope with no elide in CIF 1.1, the remaining question is whether
>>>>>>>> they will be negatively impacted by the change in string termination
>>>>>>>> without any elide. If they can use CIF 2 with these changes, my
>>>>>>>> objections are purely academic and irrelevant. -- Herbert
>>>>>>>>
>>>>>>>> =====================================================
>>>>>>>> Herbert J. Bernstein, Professor of Computer Science
>>>>>>>> Dowling College, Kramer Science Center, KSC 121
>>>>>>>> Idle Hour Blvd, Oakdale, NY, 11769
>>>>>>>>
>>>>>>>> +1-631-244-3035
>>>>>>>> yaya@dowling.edu
>>>>>>>> =====================================================
>>>>>>>>
>>>>>>>> On Wed, 25 Nov 2009, James Hester wrote:
>>>>>>>>
>>>>>>>>> Herbert: I have the dubious advantage of not having participated in
>>>>>>>>> all those CIF1.0/1.1 discussions, so only have the spec as written
>>>>>>>>> down to rely on.
>>>>>>>>>
>>>>>>>>> Anyway, how do you feel about abandoning any specification of elides
>>>>>>>>> in CIF2 syntax, as suggested by Nick?
>>>>>>>>>
>>>>>>>>> On Wed, Nov 25, 2009 at 10:53 AM, Herbert J. Bernstein
>>>>>>>>> <yaya@bernstein-plus-sons.com> wrote:
>>>>>>>>>> Dear James,
>>>>>>>>>>
>>>>>>>>>> I started to write:
>>>>>>>>>> "No, in CIF 1.1, none of the terminal quote marks, including the \n;
>>>>>>>>>> are effective unless followed by whitespace (\n, space, tab, or end of
>>>>>>>>>> file). This is a well-established, and very tricky, part of the CIF
>>>>>>>>>> spec going back to 1990. That is why Nick had to explicitly specify
>>>>>>>>>> that a terminal quote mark would be effective no matter what it was
>>>>>>>>>> followed by."
>>>>>>>>>>
>>>>>>>>>> But the grammar currently on the IUCr web site is _not_ the one that I
>>>>>>>>>> recall COMCIFS discussing and approving. It now explicitly removes
>>>>>>>>>> the requirement for terminal white space in the special case of
>>>>>>>>>> the \n; text field terminator. I don't recall when that change was
>>>>>>>>>> adopted, but it appears that you are right under the current spec
>>>>>>>>>> about the example I chose. Inasmuch as there is a lot of working code
>>>>>>>>>> that enforces and uses the original whitespace handling and uses it
>>>>>>>>>> in line-folding, I will not revise CIFtbx 3, but I will try to do
>>>>>>>>>> something to adapt to this change for CIFtbx 4.
>>>>>>>>>>
>>>>>>>>>> I guess we are just going to have yet another few dialects of CIF.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Herbert
>>>>>>>>>> =====================================================
>>>>>>>>>> Herbert J. Bernstein, Professor of Computer Science
>>>>>>>>>> Dowling College, Kramer Science Center, KSC 121
>>>>>>>>>> Idle Hour Blvd, Oakdale, NY, 11769
>>>>>>>>>>
>>>>>>>>>> +1-631-244-3035
>>>>>>>>>> yaya@dowling.edu
>>>>>>>>>> =====================================================
>>>>>>>>>>
>>>>>>>>>> On Wed, 25 Nov 2009, James Hester wrote:
>>>>>>>>>>
>>>>>>>>>>> To be precise, we are not 'referring all elides to the application'
>>>>>>>>>>> because no elides are recognised by the lexer under Nick's latest
>>>>>>>>>>> suggestion, so there are no elides to refer to the application.
>>>>>>>>>>>
>>>>>>>>>>> My understanding of CIF1.1 syntax suggests that the string you provide
>>>>>>>>>>> would produce a syntax error in CIF1.1, as the semicolon at the start
>>>>>>>>>>> of the second line would terminate the string, and so whitespace
>>>>>>>>>>> should then appear as the second character on the second line, rather
>>>>>>>>>>> than reverse solidus.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 25, 2009 at 9:23 AM, Herbert J. Bernstein
>>>>>>>>>>> <yaya@bernstein-plus-sons.com> wrote:
>>>>>>>>>>>> The only problem with referring all elides to the application is that
>>>>>>>>>>>> with the removal of the requirement of a blank after a \n; for it to be
>>>>>>>>>>>> effective, the line folding protocol develops a slight gap. The
>>>>>>>>>>>> case is as follows
>>>>>>>>>>>>
>>>>>>>>>>>> ;\
>>>>>>>>>>>> ;\
>>>>>>>>>>>> ;
>>>>>>>>>>>>
>>>>>>>>>>>> Is a valid single text field in CIF 1.1, which when handled with the
>>>>>>>>>>>> line folding protocol translates to the equivalent of ';' because the
>>>>>>>>>>>> embedded ;\ is not a valid text terminator. If we require that
>>>>>>>>>>>> a text field that begins with "\n;\\" must be terminated by "\n; "
>>>>>>>>>>>> or "\n;\n" or "\n;\t" that problem would be fixed.
>>>>>>>>>>>>
>>>>>>>>>>>> =====================================================
>>>>>>>>>>>> Herbert J. Bernstein, Professor of Computer Science
>>>>>>>>>>>> Dowling College, Kramer Science Center, KSC 121
>>>>>>>>>>>> Idle Hour Blvd, Oakdale, NY, 11769
>>>>>>>>>>>>
>>>>>>>>>>>> +1-631-244-3035
>>>>>>>>>>>> yaya@dowling.edu
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>
>
> cheers
>
> Nick
>
> --------------------------------
> Associate Professor N. Spadaccini, PhD
> School of Computer Science & Software Engineering
>
> The University of Western Australia t: +61 (0)8 6488 3452
> 35 Stirling Highway f: +61 (0)8 6488 1089
> CRAWLEY, Perth, WA 6009 AUSTRALIA w3: www.csse.uwa.edu.au/~nick
> MBDP M002
>
> CRICOS Provider Code: 00126G
>
> e: Nick.Spadaccini@uwa.edu.au
>
>
>
>