Re: [ddlm-group] Use of elides in strings
- To: [email protected], Group finalising DDLm and associated dictionaries <[email protected]>
- Subject: Re: [ddlm-group] Use of elides in strings
- From: "Herbert J. Bernstein" <[email protected]>
- Date: Thu, 26 Nov 2009 10:16:47 -0500 (EST)
- In-Reply-To: <C734BB63.12644%[email protected]>
- References: <C734BB63.12644%[email protected]>
Thank you. I will code to that spec. -- Herbert
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
[email protected]
=====================================================
On Thu, 26 Nov 2009, Nick Spadaccini wrote:
>
>
>
> On 26/11/09 9:59 PM, "SIMON WESTRIP" <[email protected]> wrote:
>
>> Just as a distraction from trying to understand modulated structure CIFs,
>> here goes:
>>
>> I'd use semicolon delimiters (see long arrows <------ below)
>> and if I didn't know the definition of the item, I would
>> respect the whitespace.
>>
>> Actually, I'd probably bung in a couple of extra newlines for good measure
>> if I knew what I were dealing with - i.e.
>>
>> ;
>> O'"
>> ;
>>
>> Funnily enough, this is actually easier to read using my eyes
>> than "O'"" :-)
>
> Strictly the string is
>
> ;O'"
> ;
>
> Since there is a desire that everything has to be returned as a raw string.
> Looking at it as a byte stream we have \n;O'"\n; and once you strip off the
> string delimiters (\n;) you get O'". Voila!
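A minimal sketch of the stripping step described above, assuming the lexer hands over the whole token including both \n; digrams. The name strip_text_field is hypothetical and is not taken from any existing CIF library.

```python
DELIM = "\n;"  # assumption from the thread: the same digram opens and closes the field


def strip_text_field(token: str) -> str:
    """Return the raw string between the opening and closing \\n; digrams."""
    if len(token) < 2 * len(DELIM) or not (
        token.startswith(DELIM) and token.endswith(DELIM)
    ):
        raise ValueError("not a semicolon-delimited text field")
    # Nothing inside the delimiters is interpreted; it is returned as-is.
    return token[len(DELIM):-len(DELIM)]


print(strip_text_field("\n;O'\"\n;"))  # O'"
```

Because nothing between the digrams is touched, any further interpretation of the value is left to the application or dictionary layer.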
>
> Read on; I have inserted additional comments.
>
>> From: Herbert J. Bernstein <[email protected]>
>> To: [email protected]; Group finalising DDLm and associated
>> dictionaries <[email protected]>
>> Sent: Thursday, 26 November, 2009 12:41:46
>> Subject: Re: [ddlm-group] Use of elides in strings
>>
>> I am trying to get some CIF-2 related software done, so please advise
>> me on some specific cases:
>>
>> How should the following C-style strings followed by their CIF-1.1
>> representations be presented in a CIF 2 document? I've only put
>> in CIF 2 cases where I think there is no question, but feel free
>> to correct those.
>>
>> C-style           CIF-1.1 style               CIF-2
>>
>> "O'"              "O'" or 'O''                "O'"
>
>> "O\""             "O"" or 'O"'                'O"'
>
>> "O'\""            "O'"" or 'O'"'              ?
>> <------------------------ \n;O'"\n;
> Or '''O'"''' but not with """ because the terminator is corrupted.
>
>> "''O''"           "''O''" or '''O'''          "''O''"
>
>> "'''O'''"         "'''O'''" or ''''O''''      ?
>> <------------------------------- \n;'''O'''\n;
> Or """'''O'''"""
>
>> "\"\"O'\"\""      """O'""" or '""O'""'        ?
>> <---------------------- \n;""O'""\n;
> Or '''""O'""'''
>
>> "\"\"\"O'\"\"\""  """"O'"""" or '"""O'"""'    ?
>> <---------------------- \n;"""O'"""\n;
> Or '''"""O'"""'''
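The choices in the table above can be sketched as a small delimiter picker. It rests on assumptions drawn from this thread rather than from a final CIF2 spec: no elides exist, a closing quote ends the string whatever follows it, single- and double-quoted strings stay on one line, and a triple-quoted string must not end in its own quote character (the "terminator is corrupted" case). The names fits and quote and the preference order are hypothetical.

```python
def fits(value: str, delim: str) -> bool:
    """True if value can sit between two copies of delim with no elide."""
    if delim == "\n;":
        return delim not in value
    if len(delim) == 1:
        # Single-character quotes: one line only, quote must not occur inside.
        return delim not in value and "\n" not in value
    # Triple quotes: the trigram must not occur inside, and the value must not
    # end with the quote character or it would run into the terminator.
    return delim not in value and not value.endswith(delim[0])


# Assumed preference order, simplest delimiter first.
PREFERENCE = ("'", '"', "'''", '"""', "\n;")


def quote(value: str) -> str:
    for delim in PREFERENCE:
        if fits(value, delim):
            return delim + value + delim
    raise ValueError("no delimiter can hold this value without an elide")


for raw in ("O'", 'O"', "O'\"", "''O''", '""O\'""'):
    print(quote(raw))
```

Other orderings are equally defensible; the semicolon form suggested by Simon is always available as long as the value does not itself contain \n;.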
>
>> and for semi-colon delimited string, is the last new-line part of
>> the string or part of the delimiter, i.e. if the string is
>> "abc\n" is the CIF-2 version
>
> My reading of it has always been given by the definition of the delimiter,
> which is \n;. These are what I strip off.
>
> When we speak of stripping off the delimiters at both ends, then just as we
> strip the """ trigram from both ends, the same is true of the \n; digram.
> Hence I say the second of the two examples, \n;abc\n\n;
>> ;abc
>> ; <-------------------- if newlines are not required by the item's
>> definition, I'd be tempted to strip the whitespace
>
> The above is the string "abc"
>
>> or
>>
>> ;abc
>> <-------------------- without knowing the item's definition, I'd be
>> tempted to respect the whitespace
>> ;
>>
>
> This is "abc\n"
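The two "abc" cases above can be checked with a round trip. This is only a sketch under the reading given in this message (both delimiters are exactly \n;, and the final newline belongs to the delimiter, not to the value); to_text_field and from_text_field are hypothetical names.

```python
DELIM = "\n;"  # assumed opening and closing delimiter of a text field


def to_text_field(value: str) -> str:
    """Wrap value in \\n; delimiters; no elide is available to protect it."""
    if DELIM in value:
        raise ValueError("value contains \\n; and cannot be written as a text field")
    return DELIM + value + DELIM


def from_text_field(token: str) -> str:
    """Strip one delimiter from each end and return the rest untouched."""
    return token[len(DELIM):len(token) - len(DELIM)]


for value in ("abc", "abc\n"):
    token = to_text_field(value)
    assert from_text_field(token) == value  # round trip preserves the value
    print(repr(token))
# '\n;abc\n;'
# '\n;abc\n\n;'
```

A value that ends in a newline therefore shows up in the file as a blank-looking line before the closing semicolon.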
>>
>> On Thu, 26 Nov 2009, Nick Spadaccini wrote:
>>
>>>
>>>
>>>
>>> On 25/11/09 10:24 PM, "SIMON WESTRIP" <[email protected]> wrote:
>>>
>>>> What Brian has said here - specifically
>>>>
>>>> "if this were dropped as part of the CIF2 specification,
>>>> we would need to think carefully about how else to retain this
>>>> functionality"
>>>>
>>>> is also relevant to how we handle the CIF1.1 markup conventions.
>>>> As I understand it in CIF1.1 these are the default conventions for
>>>> text fields unless the dictionary prohibits them, but in CIF2 all such
>>>> conventions will _not_ be part of the spec, and can only be interpreted at
>>>> the dictionary level.
>>>>
>>>> Is this correct?
>>>
>>> Yes, this is my understanding. There will be many different conventions I
>>> presume, some will be widely accepted and standard, they will be part of the
>>> underlying systems that interpret the files. For instance if something is
>>> declared as a TeX encoding, we know what to do.
>>>
>>>>
>>>> I'm only asking because we (at the IUCr at least) will have to address this
>>>> issue sooner rather than later when adopting CIF2, so I just want to make
>>>> sure I understand base CIF2 correctly
>>>>
>>>> Cheers
>>>>
>>>> Simon
>>>>
>>>>
>>>>
>>>> From: Brian McMahon <[email protected]>
>>>> To: Group finalising DDLm and associated dictionaries <[email protected]>
>>>> Sent: Wednesday, 25 November, 2009 13:34:05
>>>> Subject: Re: [ddlm-group] Use of elides in strings
>>>>
>>>> (I've switched the thread title to deal separately with line folding.)
>>>>
>>>> As Herbert says, line folding is part of the CIF 1.1 spec (pages 34-35
>>>> of the ITG bible). Currently, it invokes a special meaning for the
>>>> backslash (reverse solidus) character, but only when it is the first
>>>> non-blank after an opening semicolon or comment hash delimiter. We have
>>>> yet to discuss whether to extend it to other string types (specifically
>>>> the triple-quoted strings).
>>>>
>>>> It's quite easy these days to generate single strings that are longer
>>>> than 2048 characters (or any other arbitrary line limit) - e.g. a
>>>> protein or nucleic acid sequence. Many, many chemical names broke the old
>>>> 80-character line length limit.
>>>>
>>>> We're very happy with CIF applications that do not interpret the
>>>> line-folding protocol, so long as they preserve the existing backslashes.
>>>> However, a fully-compliant CIF 1.1 parser should be able to return an
>>>> unfolded string to an application that requests it.
>>>>
>>>> As Herbert says, if this were dropped as part of the CIF2 specification,
>>>> we would need to think carefully about how else to retain this
>>>> functionality.
>>>>
>>>> Regards
>>>> Brian
>>>>
>>>> On Wed, Nov 25, 2009 at 07:54:51AM -0500, Herbert J. Bernstein wrote:
>>>>> The line folding protocol was discussed and adopted by COMCIFS and is
>>>>> posted, along with other "Common Semantic Features" at
>>>>>
>>>>> http://www.iucr.org/resources/cif/spec/version1.1/semantics
>>>>>
>>>>> but that is neither here nor there. The point is that the IUCr uses CIF
>>>>> to get work done. If we disable something they are using, we should offer
>>>>> some equivalent functionality so they can use CIF 2 to do their work.
>>>>> Otherwise, they will have to do the sensible thing, and continue to use
>>>>> CIF 1, or, worse, create their own dialect of CIF 2.
>>>>>
>>>>> Now, I broke my nose yesterday morning and find myself a bit punchy today,
>>>>> so I will drop out of this discussion for a while. Hopefully, when I
>>>>> return to it, this whole matter will be settled in some way that will
>>>>> allow people to actually use CIF 2, instead of it becoming what it seems
>>>>> on its way to becoming -- something elegant but not terribly useful, a bit
>>>>> like PL/I.
>>>>>
>>>>> Cheers,
>>>>> Herbert
>>>>>
>>>>> On Wed, 25 Nov 2009, Nick Spadaccini wrote:
>>>>>
>>>>>> I am with John. STAR has no line-folding protocol. As far as I can recall
>>>>>> neither did CIF. Somewhere along the way line folding was discussed (or
>>>>>> introduced?), but I am not sure it is formally part of any spec.
>>>>>>
>>>>>> None of my software handles anything about line folding. I can see no
>>>>>> reason for it, since with a 2048 maximum record length, and a free format
>>>>>> structure there is plenty of room to output your data. The only time it
>>>>>> would be necessary is when (dataname + space + datavalue) > 2048 and when
>>>>>> is that ever going to happen?
>>>>>>
>>>>>> Maybe the desire for it comes from making the data "pretty" and read well
>>>>>> in a text editor. Well that is the task of an application to read the CIF
>>>>>> and present it appropriately. The CIF is strictly about CONTENT and not
>>>>>> FORM.
>>>>>>
>>>>>> Since we have given up on elided characters being part of CIF syntax, and
>>>>>> given the belief of others that this should not be a lexer issue, I think
>>>>>> we should be absolutely consistent. The lexer knows how to identify tokens
>>>>>> and reads everything within them as a raw string.
>>>>>>
>>>>>> If your "encoding" for \n; strings includes characters that break the
>>>>>> lexer, then protect it in some way so that when you pass that string back
>>>>>> as raw in your software, somebody knows how to unprotect it back to the
>>>>>> original (as with ALL string encoding).
>>>>>>
>>>>>> One concession I think we can consider is to change the delimiter from \n;
>>>>>> to \n;\n. I don't see this as causing me any problems, since I handle
>>>>>>
>>>>>> ; stuff
>>>>>> More stuff
>>>>>> ; _newname
>>>>>>
>>>>>> routinely, but others don't. I believe most people do use (and probably
>>>>>> think) the delimiter is \n;\n anyway.
>>>>>>
>>>>>> Two questions
>>>>>>
>>>>>> (1) Do you agree that line folding is just another encoding and therefore
>>>>>> not a STAR/CIF issue? Consequently it is the responsibility of the
>>>>>> encoding not to break the lexer.
>>>>>> (2) Do we think \n;\n is a better delimiter?
>>>>>>
>>>>>> On 25/11/09 10:33 AM, "John Westbrook" <[email protected]> wrote:
>>>>>>
>>>>>>> Hi James,
>>>>>>>
>>>>>>> My preference is to avoid the elides in the syntax for the purpose of
>>>>>>> escaping terminators in strings, deferring interpretation to the
>>>>>>> application.
>>>>>>>
>>>>>>> I do not understand all of the issues related to line folding, which I
>>>>>>> believe is an issue for Brian and Simon.
>>>>>>>
>>>>>>> John
>>>>>>>
>>>>>>>
>>>>>>> James Hester wrote:
>>>>>>>> Thanks for the quick reply over Thanksgiving, John. I take from your
>>>>>>>> message that the PDB does not need any elide mechanism to be defined
>>>>>>>> in the CIF2 syntax. Would you therefore be prepared to vote in favour
>>>>>>>> of not defining any elides, or would you prefer to abstain?
>>>>>>>>
>>>>>>>> Votes so far:
>>>>>>>>
>>>>>>>> No elides: James, Nick, Herbert if the IUCr + PDB say it is OK
>>>>>>>> Elides:?
>>>>>>>>
>>>>>>>> Unknown: John, Joe, David B., Brian, Simon
>>>>>>>>
>>>>>>>> On Wed, Nov 25, 2009 at 12:03 PM, John Westbrook
>>>>>>>> <[email protected]> wrote:
>>>>>>>>> I confess that I am having difficulty keeping up with all aspects
>>>>>>>>> of this discussion. Following Herb's suggestion I will try to
>>>>>>>>> summarize the quoting issues from the PDB perspective.
>>>>>>>>>
>>>>>>>>> 1. As there are multiple ways of quoting a string our tools and files
>>>>>>>>> surround embedded quotes with quotes of the opposite sense or with
>>>>>>>>> semicolons in the mixed case. I think that this point has been
>>>>>>>>> covered a number of times now and I believe that Nick has suggested
>>>>>>>>> that all reasonable cases can be handled by using this approach.
>>>>>>>>>
>>>>>>>>> 2. I too was not aware that the original definition of terminators
>>>>>>>>> had changed and did not include either a leading or trailing
>>>>>>>>> whitespace. Certainly this must still be the case for single
>>>>>>>>> and double quotes. I cannot recall ever seeing an example
>>>>>>>>> where the terminator \n; was followed by a whitespace character,
>>>>>>>>> but about half of the codes that I am familiar with would
>>>>>>>>> fall over on \n;next_token.
>>>>>>>>>
>>>>>>>>> 3. Line folding has never been an issue for PDB nor has line length.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> John
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Herbert J. Bernstein wrote:
>>>>>>>>>> My major concern about anything we do is to be able to preserve
>>>>>>>>>> the functionality of the practices that the IUCr is following in
>>>>>>>>>> journal publications and the PDB is following. Inasmuch as they seem
>>>>>>>>>> able to cope with no elide in CIF 1.1, the remaining question is
>>>>>>>>>> whether they will be negatively impacted by the change in string
>>>>>>>>>> termination without any elide. If they can use CIF 2 with these
>>>>>>>>>> changes, my objections are purely academic and irrelevant. -- Herbert
>>>>>>>>>>
>>>>>>>>>> On Wed, 25 Nov 2009, James Hester wrote:
>>>>>>>>>>
>>>>>>>>>>> Herbert: I have the dubious advantage of not having participated in
>>>>>>>>>>> all those CIF1.0/1.1 discussions, so only have the spec as written
>>>>>>>>>>> down to rely on.
>>>>>>>>>>>
>>>>>>>>>>> Anyway, how do you feel about abandoning any specification of elides
>>>>>>>>>>> in CIF2 syntax, as suggested by Nick?
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Nov 25, 2009 at 10:53 AM, Herbert J. Bernstein
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>> Dear James,
>>>>>>>>>>>>
>>>>>>>>>>>> I started to write:
>>>>>>>>>>>> "No, in CIF 1.1, none of the terminal quote marks, including the
>>>>>>>>>>>> \n; are effective unless followed by whitespace (\n, space, tab, or
>>>>>>>>>>>> end of file). This is a well-established, and very tricky part of
>>>>>>>>>>>> the CIF spec going back to 1990. That is why Nick had to explicitly
>>>>>>>>>>>> specify that a terminal quote mark would be effective no matter what
>>>>>>>>>>>> it was followed by."
>>>>>>>>>>>>
>>>>>>>>>>>> But the grammar currently on the IUCr web site is _not_ the one that
>>>>>>>>>>>> I recall COMCIFS discussing and approving. It now explicitly removes
>>>>>>>>>>>> the requirement for terminal white space in the special case of
>>>>>>>>>>>> the \n; text field terminator. I don't recall when that change was
>>>>>>>>>>>> adopted, but it appears that you are right under the current spec
>>>>>>>>>>>> about the example I chose. Inasmuch as there is a lot of working
>>>>>>>>>>>> code that enforces and uses the original whitespace handling and
>>>>>>>>>>>> uses it in line-folding, I will not revise CIFtbx 3, but I will try
>>>>>>>>>>>> to do something to adapt to this change for CIFtbx 4.
>>>>>>>>>>>>
>>>>>>>>>>>> I guess we are just going to have yet another few dialects of CIF.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Herbert
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 25 Nov 2009, James Hester wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> To be precise, we are not 'referring all elides to the application'
>>>>>>>>>>>>> because no elides are recognised by the lexer under Nick's latest
>>>>>>>>>>>>> suggestion, so there are no elides to refer to the application.
>>>>>>>>>>>>>
>>>>>>>>>>>>> My understanding of CIF1.1 syntax suggests that the string you
>>>>>>>>>>>>> provide would produce a syntax error in CIF1.1, as the semicolon at
>>>>>>>>>>>>> the start of the second line would terminate the string, and so
>>>>>>>>>>>>> whitespace should then appear as the second character on the second
>>>>>>>>>>>>> line, rather than reverse solidus.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Nov 25, 2009 at 9:23 AM, Herbert J. Bernstein
>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>> The only problem with referring all elides to the application is
>>>>>>>>>>>>> that with the removal of the requirement of a blank after a \n; for
>>>>>>>>>>>>> it to be effective, the line folding protocol develops a slight gap.
>>>>>>>>>>>>> The case is as follows
>>>>>>>>>>>>>
>>>>>>>>>>>>> ;\
>>>>>>>>>>>>> ;\
>>>>>>>>>>>>> ;
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is a valid single text field in CIF 1.1, which when handled with
>>>>>>>>>>>>> the line folding protocol translates to the equivalent of ';'
>>>>>>>>>>>>> because the embedded ;\ is not a valid text terminator. If we
>>>>>>>>>>>>> require that a text field that begins with "\n;\\" must be
>>>>>>>>>>>>> terminated by "\n; " or "\n;\n" or "\n;\t" that problem would be
>>>>>>>>>>>>> fixed.
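The gap described above can be reproduced with a short sketch. It assumes a simplified reading of the CIF 1.1 line-folding protocol (a field whose first line is a lone backslash is folded, and a trailing backslash joins a line to the next) and compares two terminator rules: \n; followed by whitespace or end of file, versus a bare \n;. The names find_terminator and unfold are hypothetical.

```python
import re


def find_terminator(text: str, start: int, need_whitespace: bool) -> int:
    """Index of the newline that begins the closing \\n;, searching from start."""
    pattern = r"\n;(?=\s|\Z)" if need_whitespace else r"\n;"
    match = re.compile(pattern).search(text, start)
    if match is None:
        raise ValueError("unterminated text field")
    return match.start()


def unfold(body: str) -> str:
    """Undo line folding in a text-field body (simplified sketch)."""
    first, _, rest = body.partition("\n")
    if not re.fullmatch(r"\\[ \t]*", first):
        return body  # the field does not opt in to folding
    lines = rest.split("\n")
    out = []
    for i, line in enumerate(lines):
        folded = re.fullmatch(r"(.*)\\[ \t]*", line)
        if folded:
            out.append(folded.group(1))  # drop the backslash and the line break
        elif i < len(lines) - 1:
            out.append(line + "\n")
        else:
            out.append(line)
    return "".join(out)


field = "\n;\\\n;\\\n;"  # the three-line example above, as a character stream
for need_ws in (True, False):
    end = find_terminator(field, 2, need_ws)
    print(need_ws, repr(unfold(field[2:end])), repr(field[end + 2:]))
# True ';' ''
# False '' '\\\n;'
```

Under the whitespace rule the whole example is one field that unfolds to ';'. Under the bare \n; rule the field ends on the second line and the remaining characters are left over for the lexer to reject.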
>>>>>>>>>>>>>
>>>> _______________________________________________
>>>> ddlm-group mailing list
>>>> [email protected]
>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>
>>> cheers
>>>
>>> Nick
>>>
>>> --------------------------------
>>> Associate Professor N. Spadaccini, PhD
>>> School of Computer Science & Software Engineering
>>>
>>> The University of Western Australia t: +61 (0)8 6488 3452
>>> 35 Stirling Highway f: +61 (0)8 6488 1089
>>> CRAWLEY, Perth, WA 6009 AUSTRALIA w3: www.csse.uwa.edu.au/~nick
>>> <http://www.csse.uwa.edu.au/%7Enick>
>>> MBDP M002
>>>
>>> CRICOS Provider Code: 00126G
>>>
>>> e: [email protected]
>>>
>
> cheers
>
> Nick