Re: [ddlm-group] Use of elides in strings

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] Use of elides in strings
From: Nick Spadaccini <[email protected]>
Date: Thu, 26 Nov 2009 23:10:27 +0800
In-Reply-To: <[email protected]>



On 26/11/09 9:59 PM, "SIMON WESTRIP" <[email protected]> wrote:

> Just as a distraction from trying to understand modulated structure CIFs,
> here goes:
> 
> I'd use semicolon delimiters (see long arrows <------ below)
> and if I didn't know the definition of the item, I would
> respect the whitespace.
> 
> Actually, I'd probably bung in a couple of extra newlines for good measure
> if I knew what I were dealing with - i.e.
> 
> ;
> O'"
> ;
> 
> Funnily enough, this is actually easier to read using my eyes
> than "O'""  :-)

Strictly the string is

;O'"
;

since there is a desire that everything be returned as a raw string.
Looking at it as a byte stream, we have \n;O'"\n; and once you strip off the
string delimiters (\n;) you get O'". Voila!
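
A minimal sketch of that stripping step, in Python. It assumes the \n; digram
is the delimiter at both ends and that nothing inside is elided; the name
strip_text_field is hypothetical, not taken from any existing CIF library.

```python
DELIM = "\n;"  # assumption: the same digram opens and closes a text field


def strip_text_field(raw: str) -> str:
    # Return the raw content found between the opening and closing delimiter.
    if not raw.startswith(DELIM):
        raise ValueError("text field must open with newline + semicolon")
    # The first newline + semicolon after the opener terminates the field.
    end = raw.find(DELIM, len(DELIM))
    if end == -1:
        raise ValueError("text field is not terminated")
    return raw[len(DELIM):end]


print(repr(strip_text_field("\n;O'\"\n;")))
```

Nothing between the delimiters is interpreted, so whatever the application
gets back is the raw string.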

Read on; I have inserted additional comments.

> From: Herbert J. Bernstein <[email protected]>
> To: [email protected]; Group finalising DDLm and associated
> dictionaries <[email protected]>
> Sent: Thursday, 26 November, 2009 12:41:46
> Subject: Re: [ddlm-group] Use of elides in strings
> 
> I am trying to get some CIF-2 related software done, so please advise
> me on some specific cases:
> 
> How should the following C-style strings followed by their CIF-1.1
> representations be presented in a CIF 2 document?  I've only put
> in CIF 2 cases where I think there is no question, but feel free
> to correct those.
> 
> C-style            CIF-1.1 style                   CIF-2
> 
> "O'"               "O'" or 'O''                    "O'"

> "O\""              "O"" or 'O"'                    'O"'

> "O'\""             "O'"" or 'O'"'                   ?
> <------------------------ \n;O'"\n;
Or '''O'"''', but not delimited with """, because the trailing " of the
content would run into the closing """ and corrupt the terminator.

> "''O''"            "''O''" or '''O'''              "''O''"

> "'''O'''"          "'''O'''" or ''''O''''          ?
> <------------------------------- \n;'''O'''\n;
Or """'''O'''"""

> "\"\"O'\"\""       """O'""" or '""O'""'            ?
> <---------------------- \n;""O'""\n;
Or '''""O'""'''

> "\"\"\"O'\"\"\""   """"O'"""" or '"""O'"""'        ?
> <---------------------- \n;"""O'"""\n;
Or '''"""O'"""'''
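
The cases above amount to picking a delimiter that the content cannot break,
since there are no elides. A rough Python sketch follows. The preference order
(single quote, double quote, ''', """, then \n;) and the name choose_delimiter
are assumptions made for illustration, so for some inputs it picks a different
but equally valid form from the ones listed above.

```python
def choose_delimiter(value: str) -> str:
    # Wrap value in the first delimiter, in an assumed preference order,
    # that the content itself cannot terminate early.
    single_line = "\n" not in value
    for q in ("'", '"'):
        # Plain quotes: assumed unusable if the quote or a newline occurs.
        if single_line and q not in value:
            return q + value + q
    for q in ("'", '"'):
        tri = q * 3
        # A trailing quote character would merge with the closing trigram
        # and corrupt the terminator, so reject that case as well.
        if tri not in value and not value.endswith(q):
            return tri + value + tri
    if "\n;" not in value:
        return "\n;" + value + "\n;"
    raise ValueError("no delimiter can hold this value without an elide")


for s in ["O'", 'O"', "O'\"", "\"\"O'\"\""]:
    print(choose_delimiter(s))
```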

> and for semi-colon delimited string, is the last new-line part of
> the string or part of the delimiter, i.e. if the string is
> "abc\n" is the CIF-2 version

My reading has always followed from the definition of the delimiter, which
is \n;. That is what I strip off.

When we speak of stripping off the delimiters at both ends, then just as we
strip the """ trigram from both ends, the same is true of the \n; digram.
Hence I say it is the second of the two examples, \n;abc\n\n;
> ;abc
> ;            <-------------------- if newlines are not required by the items
> definition, I'd be tempted to strip the whitespace

The above is the string "abc"

> or
> 
> ;abc
>             <-------------------- without knowing the items definition, I'd be
> tempted to respect the whitespace
> ;
>

This is "abc\n"
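
The writer side of the same reading can be sketched in Python as below. It
assumes the delimiter is the bare \n; digram at both ends; to_text_field is a
hypothetical name used only for illustration.

```python
def to_text_field(value: str) -> str:
    # Emit value as a semicolon-delimited text field, byte for byte.
    if "\n;" in value:
        # Without elides, embedded newline + semicolon would end the field early.
        raise ValueError("value contains the text field terminator")
    return "\n;" + value + "\n;"


print(repr(to_text_field("abc")))
print(repr(to_text_field("abc\n")))
```

Under this reading a trailing newline in the value shows up as an extra blank
line before the closing semicolon, and is never absorbed into the delimiter.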
> 
> =====================================================
>   Herbert J. Bernstein, Professor of Computer Science
>     Dowling College, Kramer Science Center, KSC 121
>          Idle Hour Blvd, Oakdale, NY, 11769
> 
>                   +1-631-244-3035
>                   [email protected]
> =====================================================
> 
> On Thu, 26 Nov 2009, Nick Spadaccini wrote:
> 
>> 
>> 
>> 
>> On 25/11/09 10:24 PM, "SIMON WESTRIP" <[email protected]> wrote:
>> 
>>> What Brian has said here - specifically
>>> 
>>> "if this were dropped as part of the CIF2 specification,
>>> we would need to think carefully about how else to retain this
>>> functionality"
>>> 
>>> is also relevant to how we handle the CIF1.1 markup conventions.
>>> As I understand it in CIF1.1 these are the default conventions for
>>> text fields unless the dictionary prohibits them, but in CIF2 all such
>>> conventions will _not_ be part of the spec, and can only be interpreted at
>>> the dictionary level.
>>> 
>>> Is this correct?
>> 
>> Yes, this is my understanding. There will be many different conventions I
>> presume, some will be widely accepted and standard, they will be part of the
>> underlying systems that interpret the files. For instance if something is
>> declared as a TeX encoding, we know what to do.
>> 
>>> 
>>> I'm only asking because we (at the IUCr at least) will have to address this
>>> issue sooner rather than later when adopting CIF2, so I just want to make
>>> sure I understand base CIF2 correctly
>>> 
>>> Cheers
>>> 
>>> Simon
>>> 
>>> 
>>> 
>>> From: Brian McMahon <[email protected]>
>>> To: Group finalising DDLm and associated dictionaries <[email protected]>
>>> Sent: Wednesday, 25 November, 2009 13:34:05
>>> Subject: Re: [ddlm-group] Use of elides in strings
>>> 
>>> (I've switched the thread title to deal separately with line folding.)
>>> 
>>> As Herbert says, line folding is part of the CIF 1.1 spec (pages 34-35
>>> of the ITG bible). Currently, it invokes a special meaning for the
>>> backslash (reverse solidus) character, but only when it is the first
>>> non-blank after an opening semicolon or comment hash delimiter. We have
>>> yet to discuss whether to extend it to other string types (specifically
>>> the triple-quoted strings).
>>> 
>>> It's quite easy these days to generate single strings that are longer
>>> than 2048 characters (or any other arbitrary line limit) - e.g. a
>>> protein or nucleic acid sequence. Many, many chemical names broke the old
>>> 80-character line length limit.
>>> 
>>> We're very happy with CIF applications that do not interpret the
>>> line-folding protocol, so long as they preserve the existing backslashes.
>>> However, a fully-compliant CIF 1.1 parser should be able to return an
>>> unfolded string to an application that requests it.
>>> 
>>> As Herbert says, if this were dropped as part of the CIF2 specification,
>>> we would need to think carefully about how else to retain this
>>> functionality.
>>> 
>>> Regards
>>> Brian
>>> 
>>> On Wed, Nov 25, 2009 at 07:54:51AM -0500, Herbert J. Bernstein wrote:
>>>> The line folding protocol was discussed and adopted by COMCIFS and is
>>>> posted, along with other "Common Semantic Features" at
>>>> 
>>>> http://www.iucr.org/resources/cif/spec/version1.1/semantics
>>>> 
>>>> but that is neither here nor there.  The point is that the IUCr uses CIF
>>>> to get work done.  If we disable something they are using, we should offer
>>>> some equivalent functionality so they can use CIF 2 to do their work.
>>>> Otherwise, they will have to do the sensible thing, and continue to use
>>>> CIF 1, or, worse, create their own dialect of CIF 2.
>>>> 
>>>> Now, I broke my nose yesterday morning and find myself a bit punchy today,
>>>> so I will drop out of this discussion for a while.  Hopefully, when I
>>>> return to it, this whole matter will be settled in some way that will
>>>> allow people to actually use CIF 2, instead of it becoming what it seems
>>>> on its way to becoming -- something elegant but not terribly useful, a bit
>>>> like PL/I.
>>>> 
>>>> Cheers,
>>>>    Herbert
>>>> 
>>>> =====================================================
>>>>   Herbert J. Bernstein, Professor of Computer Science
>>>>     Dowling College, Kramer Science Center, KSC 121
>>>>          Idle Hour Blvd, Oakdale, NY, 11769
>>>> 
>>>>                   +1-631-244-3035
>>>>                   [email protected]
>>>> =====================================================
>>>> 
>>>> On Wed, 25 Nov 2009, Nick Spadaccini wrote:
>>>> 
>>>>> I am with John. STAR has no line-folding protocol. As far as I can recall
>>>>> neither did CIF. Somewhere along the way line folding was discussed (or
>>>>> introduced?), but I am not sure it is formally part of any spec.
>>>>> 
>>>>> None of my software handles anything about line folding. I can see no
>>>>> reason for it, since with a 2048 maximum record length, and a free
>>>>> format structure there is plenty of room to output your data. The only
>>>>> time it would be necessary is when (dataname + space + datavalue) > 2048
>>>>> and when is that ever going to happen?
>>>>> 
>>>>> Maybe the desire for it comes from making the data "pretty" and read well
>>>>> in a text editor. Well that is the task of an application to read the CIF
>>>>> and present it appropriately. The CIF is strictly about CONTENT and not
>>>>> FORM.
>>>>> 
>>>>> Since we have given up on elided characters being part of CIF syntax, and
>>>>> others believe that this should not be a lexer issue, I think we should be
>>>>> absolutely consistent. The lexer knows how to identify tokens and reads
>>>>> everything within them as a raw string.
>>>>> 
>>>>> If your "encoding" for \n; strings includes characters that break the
>>>>> lexer, then protect it in some way so that when you pass that string back
>>>>> as raw in your software, somebody knows how to unprotect it back to the
>>>>> original (as with ALL string encoding).
>>>>> 
>>>>> One concession I think we can consider is to change the delimiter from \n;
>>>>> to \n;\n. I don't see this as causing me any problems, since I handle
>>>>> 
>>>>> ; stuff
>>>>> More stuff
>>>>> ; _newname
>>>>> 
>>>>> routinely, but others don't. I believe most people do use (and probably
>>>>> think) the delimiter is \n;\n anyway.
>>>>> 
>>>>> Two questions
>>>>> 
>>>>> (1) Do you agree that line folding is just another encoding and therefore
>>>>> not a STAR/CIF issue? Consequently it is the responsibility of the
>>>>> encoding not to break the lexer.
>>>>> (2) Do we think \n;\n is a better delimiter?
>>>>> 
>>>>> On 25/11/09 10:33 AM, "John Westbrook" <[email protected]> wrote:
>>>>> 
>>>>>> Hi James,
>>>>>> 
>>>>>> My preference is to avoid elides in the syntax for the purpose of
>>>>>> escaping terminators in strings, deferring interpretation to the
>>>>>> application.
>>>>>> 
>>>>>> I do not understand all of the issues related to line folding, which I
>>>>>> believe is an issue for Brian and Simon.
>>>>>> 
>>>>>> John
>>>>>> 
>>>>>> 
>>>>>> James Hester wrote:
>>>>>>> Thanks for the quick reply over Thanksgiving, John.  I take from your
>>>>>>> message that the PDB does not need any elide mechanism to be defined
>>>>>>> in the CIF2 syntax.  Would you therefore be prepared to vote in favour
>>>>>>> of not defining any elides, or would you prefer to abstain?
>>>>>>> 
>>>>>>> Votes so far:
>>>>>>> 
>>>>>>> No elides: James, Nick, Herbert if the IUCr + PDB say it is OK
>>>>>>> Elides:?
>>>>>>> 
>>>>>>> Unknown: John, Joe, David B., Brian, Simon
>>>>>>> 
>>>>>>> On Wed, Nov 25, 2009 at 12:03 PM, John Westbrook
>>>>>>> <[email protected]> wrote:
>>>>>>>> I confess that I am having difficulty keeping up with all aspects
>>>>>>>> of this discussion.   Following Herb's suggestion I will try to
>>>>>>>> summarize the quoting issues from the PDB perspective.
>>>>>>>> 
>>>>>>>> 1. As there are multiple ways of quoting a string our tools and files
>>>>>>>> surround embedded quotes with quotes of the opposite sense or with
>>>>>>>> semicolons in the mixed case.   I think that this point has been
>>>>>>>> covered a number of times now and I believe that Nick has suggested
>>>>>>>> that all reasonable cases can be handled by using this approach.
>>>>>>>> 
>>>>>>>> 2. I too was not aware that the original definition of terminators
>>>>>>>> had changed and did not include either a leading or trailing
>>>>>>>> whitespace.  Certainly this must still be the case for single
>>>>>>>> and double quotes.  I cannot recall ever seeing an example
>>>>>>>> where the terminator \n; was followed by a whitespace character,
>>>>>>>> but about half of the codes that I am familiar with would
>>>>>>>> fall over on \n;next_token.
>>>>>>>> 
>>>>>>>> 3. Line folding has never been an issue for PDB nor has line length.
>>>>>>>> 
>>>>>>>> Regards,
>>>>>>>> 
>>>>>>>> John
>>>>>>>> 
>>>>>>>> 
>>>>>>>> Herbert J. Bernstein wrote:
>>>>>>>>> My major concern about anything we do is to be able to preserve
>>>>>>>>> the functionality of the practices that the IUCr is following in
>>>>>>>>> journal publications and the PDB is following. Inasmuch as they seem
>>>>>>>>> able to cope with no elide in CIF 1.1, the remaining question is
>>>>>>>>> whether they will be negatively impacted by the change in string
>>>>>>>>> termination without any elide.  If they can use CIF 2 with these
>>>>>>>>> changes, my objections are purely academic and irrelevant.  -- Herbert
>>>>>>>>> 
>>>>>>>>> =====================================================
>>>>>>>>>  Herbert J. Bernstein, Professor of Computer Science
>>>>>>>>>    Dowling College, Kramer Science Center, KSC 121
>>>>>>>>>         Idle Hour Blvd, Oakdale, NY, 11769
>>>>>>>>> 
>>>>>>>>>                  +1-631-244-3035
>>>>>>>>>                  [email protected]
>>>>>>>>> =====================================================
>>>>>>>>> 
>>>>>>>>> On Wed, 25 Nov 2009, James Hester wrote:
>>>>>>>>> 
>>>>>>>>>> Herbert: I have the dubious advantage of not having participated in
>>>>>>>>>> all those CIF1.0/1.1 discussions, so only have the spec as written
>>>>>>>>>> down to rely on.
>>>>>>>>>> 
>>>>>>>>>> Anyway, how do you feel about abandoning any specification of elides
>>>>>>>>>> in CIF2 syntax, as suggested by Nick?
>>>>>>>>>> 
>>>>>>>>>> On Wed, Nov 25, 2009 at 10:53 AM, Herbert J. Bernstein
>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>> Dear James,
>>>>>>>>>>> 
>>>>>>>>>>>  I started to write:
>>>>>>>>>>>  "No, in CIF 1.1, none of the terminal quote marks, including the
>>>>>>>>>>> \n; are effective unless followed by whitespace (\n, space, tab, or
>>>>>>>>>>> end of file).  This is a well-established, and very tricky part of
>>>>>>>>>>> the CIF spec going back to 1990.  That is why Nick had to explicitly
>>>>>>>>>>> specify that a terminal quote mark would be effective no matter what
>>>>>>>>>>> it was followed by."
>>>>>>>>>>> 
>>>>>>>>>>>  But the grammar currently on the IUCr web site is _not_ the one
>>>>>>>>>>> that I recall COMCIFS discussing and approving.  It now explicitly
>>>>>>>>>>> removes the requirement for terminal white space in the special case
>>>>>>>>>>> of the \n; text field terminator.  I don't recall when that change
>>>>>>>>>>> was adopted, but it appears that you are right under the current spec
>>>>>>>>>>> about the example I chose.  Inasmuch as there is a lot of working
>>>>>>>>>>> code that enforces and uses the original whitespace handling and uses
>>>>>>>>>>> it in line-folding, I will not revise CIFtbx 3, but I will try to do
>>>>>>>>>>> something to adapt to this change for CIFtbx 4.
>>>>>>>>>>> 
>>>>>>>>>>>  I guess we are just going to have yet another few dialects of CIF.
>>>>>>>>>>> 
>>>>>>>>>>>  Regards,
>>>>>>>>>>>    Herbert
>>>>>>>>>>> =====================================================
>>>>>>>>>>>  Herbert J. Bernstein, Professor of Computer Science
>>>>>>>>>>>   Dowling College, Kramer Science Center, KSC 121
>>>>>>>>>>>        Idle Hour Blvd, Oakdale, NY, 11769
>>>>>>>>>>> 
>>>>>>>>>>>                 +1-631-244-3035
>>>>>>>>>>>                 [email protected]
>>>>>>>>>>> =====================================================
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, 25 Nov 2009, James Hester wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> To be precise, we are not 'referring all elides to the application'
>>>>>>>>>>>> because no elides are recognised by the lexer under Nick's latest
>>>>>>>>>>>> suggestion, so there are no elides to refer to the application.
>>>>>>>>>>>> 
>>>>>>>>>>>> My understanding of CIF1.1 syntax suggests that the string you
>>>>>>>>>>>> provide would produce a syntax error in CIF1.1, as the semicolon at
>>>>>>>>>>>> the start of the second line would terminate the string, and so
>>>>>>>>>>>> whitespace should then appear as the second character on the second
>>>>>>>>>>>> line, rather than reverse solidus.
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Nov 25, 2009 at 9:23 AM, Herbert J. Bernstein
>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>> The only problem with referring all elides to the application is
>>>>>>>>>>>> that with the removal of the requirement of a blank after a \n; for
>>>>>>>>>>>> it to be effective, the line folding protocol develops a slight gap.
>>>>>>>>>>>> The case is as follows
>>>>>>>>>>>> 
>>>>>>>>>>>> ;\
>>>>>>>>>>>> ;\
>>>>>>>>>>>> ;
>>>>>>>>>>>> 
>>>>>>>>>>>> Is a valid single text field in CIF 1.1, which when handled with
>>>>>>>>>>>> the line folding protocol translates to the equivalent of ';'
>>>>>>>>>>>> because the embedded ;\ is not a valid text terminator.  If we
>>>>>>>>>>>> require that a text field that begins with "\n;\\" must be
>>>>>>>>>>>> terminated by "\n; " or "\n;\n" or "\n;\t" that problem would be
>>>>>>>>>>>> fixed.
>>>>>>>>>>>> 
>>>>>>>>>>>> =====================================================
>>>>>>>>>>>>  Herbert J. Bernstein, Professor of Computer Science
>>>>>>>>>>>>   Dowling College, Kramer Science Center, KSC 121
>>>>>>>>>>>>        Idle Hour Blvd, Oakdale, NY, 11769
>>>>>>>>>>>> 
>>>>>>>>>>>>                 +1-631-244-3035
>>>>>>>>>>>>                 [email protected]
>>> _______________________________________________
>>> ddlm-group mailing list
>>> [email protected]
>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>> 
>> 
>> cheers
>> 
>> Nick
>> 
>> --------------------------------
>> Associate Professor N. Spadaccini, PhD
>> School of Computer Science & Software Engineering
>> 
>> The University of Western Australia    t: +61 (0)8 6488 3452
>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
>> <http://www.csse.uwa.edu.au/%7Enick>
>> MBDP  M002
>> 
>> CRICOS Provider Code: 00126G
>> 
>> e: [email protected]
>> 

cheers

Nick

--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: [email protected]
