Re: [ddlm-group] Use of elides in strings

On 25/11/09 11:36 PM, "Brian McMahon" <bm@iucr.org> wrote:

> Very relevant to something I've just been discussing with the
> Managing Editor. What would be the response of this group to the
> suggestion that CIF 2 drops altogether the "CIF markup conventions"
> of legacy CIF 1 (pp 35-36 of ITG,
> http://www.iucr.org/resources/cif/spec/version1.1/semantics#markup )?
> 
> i.e. all the old \' = acute, \^ = circumflex etc. would be dropped
> and replaced by Unicode characters.

The response of this individual would be to quote the "words of the old
Negro spiritual: Free at last! Free at last! Thank God Almighty, we are free
at last!"

That would make things easier.
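
For anyone who has to migrate legacy content, here is a minimal Python sketch of what converting the old markup to Unicode might look like. It covers only the two codes named above (\' = acute, \^ = circumflex). The function name and the lookup table are hypothetical, and a real converter would need the full table from the CIF 1.1 semantics document.

```python
import re
import unicodedata

# Partial, illustrative table: only the two codes named in this thread.
# Values are Unicode combining characters (assumption: NFC output is wanted).
COMBINING = {
    "'": "\u0301",  # \' = acute
    "^": "\u0302",  # \^ = circumflex
}

# CIF 1 markup puts the accent code before the letter it modifies, e.g. \'e
_MARKUP = re.compile(r"\\(['^])([A-Za-z])")


def cif1_markup_to_unicode(text):
    """Replace \\'x and \\^x with the precomposed Unicode letter."""
    def repl(match):
        accent, letter = match.group(1), match.group(2)
        # Compose base letter + combining mark into a single code point.
        return unicodedata.normalize("NFC", letter + COMBINING[accent])
    return _MARKUP.sub(repl, text)


print(cif1_markup_to_unicode(r"M\'endez"))  # Méndez
print(cif1_markup_to_unicode(r"r\^ole"))    # rôle
```

Note that this is only safe on a file already known to be CIF 1; applied to other content, a \' that was meant as something else would be silently rewritten.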

> One needs then to be sure that the object in hand is indeed a CIF 2.0 file,
> and not a CIF 1.

We have guaranteed a need for that to be done (whether the user does or
doesn't is another issue).

> I've introduced this idea here since it has some bearing on the question of
> elides (where retention of the legacy markup meaning for \' was assumed),
> but it might be best to take discussion off to yet another thread.

Not really, since \' exists in TeX, so we will need to cater for that anyway.
But reducing the number of markups/encodings to handle simplifies things.

> 
> Cheers
> Brian
> 
> 
> On Wed, Nov 25, 2009 at 02:24:16PM +0000, SIMON WESTRIP wrote:
>> What Brian has said here - specifically
>> 
>> "if this were dropped as part of the CIF2 specification,
>> we would need to think carefully about how else to retain this
>> functionality"
>> 
>> is also relevant to how we handle the CIF1.1 markup conventions.
>> As I understand it in CIF1.1 these are the default conventions for
>> text fields unless the dictionary prohibits them, but in CIF2 all such
>> conventions will _not_ be part of the spec, and can only be interpreted at
>> the dictionary level.
>> 
>> Is this correct?
>> 
>> I'm only asking because we (at the IUCr at least) will have to address this
>> issue sooner rather than later when adopting CIF2, so I just want to make
>> sure I understand base CIF2 correctly.
>> 
>> Cheers
>> 
>> Simon
>> 
>> ________________________________
>> 
>> From: Brian McMahon <bm@iucr.org>
>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>> Sent: Wednesday, 25 November, 2009 13:34:05
>> Subject: Re: [ddlm-group] Use of elides in strings
>> 
>> (I've switched the thread title to deal separately with line folding.)
>> 
>> As Herbert says, line folding is part of the CIF 1.1 spec (pages 34-35
>> of the ITG bible). Currently, it invokes a special meaning for the
>> backslash (reverse solidus) character, but only when it is the first
>> non-blank after an opening semicolon or comment hash delimiter. We have
>> yet to discuss whether to extend it to other string types (specifically
>> the triple-quoted strings).
>> 
>> It's quite easy these days to generate single strings that are longer
>> than 2048 characters (or any other arbitrary line limit) - e.g. a
>> protein or nucleic acid sequence. Many, many chemical names broke the old
>> 80-character line length limit.
>> 
>> We're very happy with CIF applications that do not interpret the
>> line-folding protocol, so long as they preserve the existing backslashes.
>> However, a fully-compliant CIF 1.1 parser should be able to return an
>> unfolded string to an application that requests it.
>> 
>> As Herbert says, if this were dropped as part of the CIF2 specification,
>> we would need to think carefully about how else to retain this
>> functionality.
>> 
>> Regards
>> Brian
>> 
>> On Wed, Nov 25, 2009 at 07:54:51AM -0500, Herbert J. Bernstein wrote:
>>> The line folding protocol was discussed and adopted by COMCIFS and is
>>> posted, along with other "Common Semantic Features" at
>>> 
>>> http://www.iucr.org/resources/cif/spec/version1.1/semantics
>>> 
>>> but that is neither here nor there.  The point is that the IUCr uses CIF
>>> to get work done.  If we disable something they are using, we should offer
>>> some equivalent functionality so they can use CIF 2 to do their work.
>>> Otherwise, they will have to do the sensible thing, and continue to use
>>> CIF 1, or, worse, create their own dialect of CIF 2.
>>> 
>>> Now, I broke my nose yesterday morning and find myself a bit punchy today,
>>> so I will drop out of this discussion for a while.  Hopefully, when I
>>> return to it, this whole matter will be settled in some way that will
>>> allow people to actually use CIF 2, instead of it becoming what it seems
>>> on its way to becoming -- something elegant but not terribly useful, a bit
>>> like PL/I.
>>> 
>>> Cheers,
>>>    Herbert
>>> 
>>> =====================================================
>>>   Herbert J. Bernstein, Professor of Computer Science
>>>     Dowling College, Kramer Science Center, KSC 121
>>>          Idle Hour Blvd, Oakdale, NY, 11769
>>> 
>>>                   +1-631-244-3035
>>>                  yaya@dowling.edu
>>> =====================================================
>>> 
>>> On Wed, 25 Nov 2009, Nick Spadaccini wrote:
>>> 
>>>> I am with John. STAR has no line-folding protocol. As far as I can recall
>>>> neither did CIF. Somewhere along the way line folding was discussed (or
>>>> introduced?), but I am not sure it is formally part of any spec.
>>>> 
>>>> None of my software handles anything about line folding. I can see no
>>>> reason for it, since with a 2048 maximum record length and a free format
>>>> structure there is plenty of room to output your data. The only time it
>>>> would be necessary is when (dataname + space + datavalue) > 2048, and when
>>>> is that ever going to happen?
>>>> 
>>>> Maybe the desire for it comes from making the data "pretty" and read well
>>>> in a text editor. Well, that is the task of an application to read the CIF
>>>> and present it appropriately. The CIF is strictly about CONTENT and not
>>>> FORM.
>>>> 
>>>> Since we have given up on elided characters being part of CIF syntax, and
>>>> others believe that this should not be a lexer issue, I think we should be
>>>> absolutely consistent. The lexer knows how to identify tokens and reads
>>>> everything within them as a raw string.
>>>> 
>>>> If your "encoding" for \n; strings includes characters that break the
>>>> lexer, then protect it in some way so that when you pass that string back
>>>> as raw in your software, somebody knows how to unprotect it back to the
>>>> original (as with ALL string encoding).
>>>> 
>>>> One concession I think we can consider is to change the delimiter from \n;
>>>> to \n;\n. I don't see this as causing me any problems, since I handle
>>>> 
>>>> ; stuff
>>>> More stuff
>>>> ; _newname
>>>> 
>>>> routinely, but others don't. I believe most people do use (and probably
>>>> think) the delimiter is \n;\n anyway.
>>>> 
>>>> Two questions
>>>> 
>>>> (1) Do you agree that line folding is just another encoding and therefore
>>>> not a STAR/CIF issue? Consequently it is the responsibility of the encoding
>>>> not to break the lexer.
>>>> (2) Do we think \n;\n is a better delimiter?
>>>> 
>>>> On 25/11/09 10:33 AM, "John Westbrook" <jwest@pdb-mail.rutgers.edu> wrote:
>>>> 
>>>>> Hi James,
>>>>> 
>>>>> My preference is to avoid elides in the syntax for the purpose of
>>>>> escaping terminators in strings, deferring interpretation to the
>>>>> application.
>>>>> 
>>>>> I do not understand all of the issues related to line folding, which I
>>>>> believe is an issue for Brian and Simon.
>>>>> 
>>>>> John
>>>>> 
>>>>> 
>>>>> James Hester wrote:
>>>>>> Thanks for the quick reply over Thanksgiving, John.  I take from your
>>>>>> message that the PDB does not need any elide mechanism to be defined
>>>>>> in the CIF2 syntax.  Would you therefore be prepared to vote in favour
>>>>>> of not defining any elides, or would you prefer to abstain?
>>>>>> 
>>>>>> Votes so far:
>>>>>> 
>>>>>> No elides: James, Nick, Herbert if the IUCr + PDB say it is OK
>>>>>> Elides:?
>>>>>> 
>>>>>> Unknown: John, Joe, David B., Brian, Simon
>>>>>> 
>>>>>> On Wed, Nov 25, 2009 at 12:03 PM, John Westbrook
>>>>>> <jwest@pdb-mail.rutgers.edu> wrote:
>>>>>>> I confess that I am having difficulty keeping up with all aspects
>>>>>>> of this discussion.   Following Herb's suggestion I will try to
>>>>>>> summarize the quoting issues from the PDB perspective.
>>>>>>> 
>>>>>>> 1. As there are multiple ways of quoting a string our tools and files
>>>>>>> surround embedded quotes with quotes of the opposite sense or with
>>>>>>> semicolons in the mixed case.   I think that this point has been
>>>>>>> covered a number of times now and I believe that Nick has suggested
>>>>>>> that all reasonable cases can be handled by using this approach.
>>>>>>> 
>>>>>>> 2. I too was not aware that the original definition of terminators
>>>>>>> had changed and did not include either a leading or trailing
>>>>>>> whitespace.  Certainly this must still be the case for single
>>>>>>> and double quotes.  I cannot recall ever seeing an example
>>>>>>> where the terminator \n; was followed by a whitespace character,
>>>>>>> but about half of the codes that I am familiar with would
>>>>>>> fall over on \n;next_token.
>>>>>>> 
>>>>>>> 3. Line folding has never been an issue for PDB nor has line length.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> 
>>>>>>> John
>>>>>>> 
>>>>>>> 
>>>>>>> Herbert J. Bernstein wrote:
>>>>>>>> My major concern about anything we do is to be able to preserve
>>>>>>>> the functionality of the practices that the IUCr is following in
>>>>>>>> journal publications and the PDB is following. Inasmuch as they seem
>>>>>>>> able to cope with no elide in CIF 1.1, the remaining question is
>>>>>>>> whether they will be negatively impacted by the change in string
>>>>>>>> termination without any elide.  If they can use CIF 2 with these
>>>>>>>> changes, my objections are purely academic and irrelevant.  -- Herbert
>>>>>>>> 
>>>>>>>> On Wed, 25 Nov 2009, James Hester wrote:
>>>>>>>> 
>>>>>>>>> Herbert: I have the dubious advantage of not having participated in
>>>>>>>>> all those CIF1.0/1.1 discussions, so only have the spec as written
>>>>>>>>> down to rely on.
>>>>>>>>> 
>>>>>>>>> Anyway, how do you feel about abandoning any specification of elides
>>>>>>>>> in CIF2 syntax, as suggested by Nick?
>>>>>>>>> 
>>>>>>>>> On Wed, Nov 25, 2009 at 10:53 AM, Herbert J. Bernstein
>>>>>>>>> <yaya@bernstein-plus-sons.com> wrote:
>>>>>>>>>> Dear James,
>>>>>>>>>> 
>>>>>>>>>>  I started to write:
>>>>>>>>>>  "No, in CIF 1.1, none of the terminal quote marks, including the \n;
>>>>>>>>>> are effective unless followed by whitespace (\n, space, tab, or end of
>>>>>>>>>> file). This is a well-established, and very tricky, part of the CIF
>>>>>>>>>> spec going back to 1990.  That is why Nick had to explicitly specify
>>>>>>>>>> that a terminal quote mark would be effective no matter what it was
>>>>>>>>>> followed by."
>>>>>>>>>> 
>>>>>>>>>>  But the grammar currently on the IUCr web site is _not_ the one that
>>>>>>>>>> I recall COMCIFS discussing and approving.  It now explicitly removes
>>>>>>>>>> the requirement for terminal white space in the special case of
>>>>>>>>>> the \n; text field terminator.  I don't recall when that change was
>>>>>>>>>> adopted, but it appears that you are right under the current spec
>>>>>>>>>> about the example I chose.  Inasmuch as there is a lot of working
>>>>>>>>>> code that enforces and uses the original whitespace handling and uses
>>>>>>>>>> it in line-folding, I will not revise CIFtbx 3, but I will try to do
>>>>>>>>>> something to adapt to this change for CIFtbx 4.
>>>>>>>>>> 
>>>>>>>>>>  I guess we are just going to have yet another few dialects of CIF.
>>>>>>>>>> 
>>>>>>>>>>  Regards,
>>>>>>>>>>    Herbert
>>>>>>>>>> 
>>>>>>>>>> On Wed, 25 Nov 2009, James Hester wrote:
>>>>>>>>>> 
>>>>>>>>>>> To be precise, we are not 'referring all elides to the application'
>>>>>>>>>>> because no elides are recognised by the lexer under Nick's latest
>>>>>>>>>>> suggestion, so there are no elides to refer to the application.
>>>>>>>>>>> 
>>>>>>>>>>> My understanding of CIF1.1 syntax suggests that the string you
>>>>>>>>>>> provide would produce a syntax error in CIF1.1, as the semicolon at
>>>>>>>>>>> the start of the second line would terminate the string, and so
>>>>>>>>>>> whitespace should then appear as the second character on the second
>>>>>>>>>>> line, rather than reverse solidus.
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Nov 25, 2009 at 9:23 AM, Herbert J. Bernstein
>>>>>>>>>>> <yaya@bernstein-plus-sons.com> wrote:
>>>>>>>>>>>> The only problem with referring all elides to the application is
>>>>>>>>>>>> that with the removal of the requirement of a blank after a \n; for
>>>>>>>>>>>> it to be effective, the line folding protocol develops a slight gap.
>>>>>>>>>>>> The case is as follows
>>>>>>>>>>>> 
>>>>>>>>>>>> ;\
>>>>>>>>>>>> ;\
>>>>>>>>>>>> ;
>>>>>>>>>>>> 
>>>>>>>>>>>> Is a valid single text field in CIF 1.1, which when handled with
>>>>>>>>>>>> the line folding protocol translates to the equivalent of ';'
>>>>>>>>>>>> because the embedded ;\ is not a valid text terminator.  If we
>>>>>>>>>>>> require that a text field that begins with "\n;\\" must be
>>>>>>>>>>>> terminated by "\n; " or "\n;\n" or "\n;\t" that problem would be
>>>>>>>>>>>> fixed.
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
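
As an aside on the line folding quoted above, here is a minimal Python sketch of the unfolding step. It assumes the lexer has already handed back the raw body of a text field (everything after the opening semicolon, up to but not including the closing \n;). The function name is hypothetical, and the rule implemented is only the one described in this thread: a backslash as the first non-blank after the opening semicolon switches folding on, and a trailing backslash joins a line to the next. It is not a full implementation of the CIF 1.1 protocol (folded comments, for one, are not handled).

```python
import re

# A fold marker: a backslash at end of line, optionally followed by blanks.
_FOLD = re.compile(r"\\[ \t]*$")


def unfold_text_field(body):
    """Return the unfolded text, or the body unchanged if folding is not signalled."""
    lines = body.split("\n")
    # Folding is only active when the first line holds nothing but a backslash.
    if lines[0].strip() != "\\":
        return body
    out = []
    current = ""
    for line in lines[1:]:
        match = _FOLD.search(line)
        if match:
            # Drop the backslash and trailing blanks, continue on the next line.
            current += line[:match.start()]
        else:
            out.append(current + line)
            current = ""
    if current:
        out.append(current)  # body ended on a folded line
    return "\n".join(out)


print(unfold_text_field("\\\nABC\\\nDEF\nGHI"))  # ABCDEF, then GHI on the next line
print(unfold_text_field("\\\n;\\"))              # ;
```

The second call is Herbert's three-line example: with the old whitespace rule the lexer passes "\" and ";\" through as the body, and unfolding yields a lone semicolon. Under the current grammar the lexer would instead stop at the second line, which is exactly the gap he describes; the sketch leaves that decision entirely to the lexer.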

cheers

Nick

--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au



