
Re: [ddlm-group] THREAD TRIPLE QUOTES - Specification

Herb presses on a number of issues here that require quite a bit of
rethinking about the way things have been done versus how they will be done.

Firstly, I preface what I say below with the recognition that under the new
DDLm and syntax there will have to be a large degree of remediation of
archived CIFs. That is a given, because un-delimited strings have been
submitted in the past that cannot be allowed under the new changes.

I want people to think about how we can do things, and not necessarily about
how what we have done fits into what we want to do. I don't want to get into
the situation where Microsoft continually builds holes into its operating
systems simply because it wants to be backwards compatible all the way back
to DOS 1.0.

Firstly, un-delimited strings are a bad thing, and were from day one. I
can understand that they look "nice" because you read them the way you would
in a journal paper - which is actually the underlying reason why CIF is the
way it is. Back in 1987 the dominant view of data was through the eyes of a
journal article. But un-delimited strings are a problem. We should
increasingly "push" delimited strings.

(1) A previously perfectly legal string like
1,1-trans-,[Bis]Carbonato-Whatever is no longer acceptable. Why? Because in
allowing for compound data structures (Lists, Tables etc.) the , indicates
that the next element is being defined, and the [] indicates a recursive
form.
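
To make the problem concrete, here is a minimal Python sketch of a check for
whether a value could still be written un-delimited. The reserved set is an
assumption: only the comma and the square brackets are named above, and the
actual restricted alphabet is left to the follow-up thread. The names
RESERVED, reserved_chars_in and can_stay_undelimited are hypothetical.

```python
# Assumed reserved characters: only "," and "[" "]" are named in this
# message; the final restricted alphabet had not been published.
RESERVED = frozenset(",[]")


def reserved_chars_in(value):
    """Return the assumed-reserved characters that occur in value, sorted."""
    return sorted({ch for ch in value if ch in RESERVED})


def can_stay_undelimited(value):
    """True if value is non-empty, has no whitespace and no reserved character.

    This only models the new restriction discussed here, not the existing
    CIF rules about how an un-delimited value may begin.
    """
    if not value:
        return False
    return not any(ch in RESERVED or ch.isspace() for ch in value)


if __name__ == "__main__":
    name = "1,1-trans-,[Bis]Carbonato-Whatever"
    print(can_stay_undelimited(name), reserved_chars_in(name))
```

Under these assumptions the chemical name above would have to be written as
a delimited string, while a value with none of those characters would not.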

I favour not allowing un-delimited strings, however these have a foothold
and I am unlikely to win that one. So un-delimited strings will HAVE TO HAVE
a restricted alphabet. The details will come in a follow-up thread (I am in
an airport lounge ready to head home, so they will come in the next day).

I will push for the restrictions on un-delimited strings to hold for
data names also. We have gone through all existing small-molecule CIFs and
with these restrictions we will be breaking few data names or values. mmCIF
will have more because of its use of [] in data names. But these have easy
workarounds.

(2) The parsing of CIF should be very easy to define through either an
LALR(k) or an LL(k) formalism. However, the original proposal that every
character be available in a string, even the token delimiters, has made it
much more difficult. In any other language a quote-delimited string is
initiated by a " and terminated by a " (notwithstanding support for any
elides). Not so in STAR/CIF, where they are <whitespace(s)>" and
"<whitespace(s)>. So whereas any other language considers whitespace a
meaningless separator of tokens, in STAR/CIF the whitespace IS part of the
token (along with the ").
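
The difference can be sketched in Python as follows. This is only an
illustration, assuming the existing CIF convention that a closing quote
counts only when it is followed by whitespace or the end of the line. The
function names are hypothetical, and the check for whitespace before the
opening quote is left out.

```python
def scan_cif_quoted(line, start=0):
    """Scan a quoted string opening at line[start]; return (value, next_index).

    Existing CIF-style rule: a quote character closes the string only when
    it is followed by whitespace or is the last character of the line.
    """
    quote = line[start]
    if quote not in "'\"":
        raise ValueError("no opening quote at the given position")
    i = start + 1
    while i < len(line):
        at_end = i + 1 == len(line)
        if line[i] == quote and (at_end or line[i + 1].isspace()):
            return line[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated quoted string")


def scan_conventional_quoted(line, start=0):
    """Rule used by most languages: the next matching quote closes the string."""
    quote = line[start]
    end = line.index(quote, start + 1)  # raises ValueError if there is none
    return line[start + 1:end], end + 1


if __name__ == "__main__":
    text = "'a dog's life' next"
    print(scan_cif_quoted(text))
    print(scan_conventional_quoted(text))
```

The same input gives two different tokens, which is why the terminating
delimiter cannot be described by a single character in a grammar.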

This becomes silly when you define a recursive compound data structure (or
even its BNF). A list is initiated by a <whitespace(s)>[ and is recursive,
BUT you can't reuse the definition of a list that you just wrote, because
a nested list DOESN'T need leading whitespace(s).
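
A small recursive-descent sketch in Python shows the two rules side by side.
The list syntax used here (comma-separated items, bare words as elements) is
an assumption for illustration only, and every function name is
hypothetical.

```python
def _skip_ws(text, pos):
    """Advance pos past any whitespace."""
    while pos < len(text) and text[pos].isspace():
        pos += 1
    return pos


def parse_item(text, pos):
    """item := list | word. A nested list needs no leading whitespace."""
    if pos < len(text) and text[pos] == "[":
        return parse_list(text, pos)
    end = pos
    while end < len(text) and text[end] not in ",[]" and not text[end].isspace():
        end += 1
    if end == pos:
        raise ValueError(f"expected an item at position {pos}")
    return text[pos:end], end


def parse_list(text, pos=0):
    """list := '[' [ item ( ',' item )* ] ']'. Returns (items, next_index)."""
    if pos >= len(text) or text[pos] != "[":
        raise ValueError(f"expected '[' at position {pos}")
    pos = _skip_ws(text, pos + 1)
    items = []
    if pos < len(text) and text[pos] == "]":
        return items, pos + 1
    while True:
        item, pos = parse_item(text, pos)
        items.append(item)
        pos = _skip_ws(text, pos)
        if pos < len(text) and text[pos] == ",":
            pos = _skip_ws(text, pos + 1)
        elif pos < len(text) and text[pos] == "]":
            return items, pos + 1
        else:
            raise ValueError(f"expected ',' or ']' at position {pos}")


def parse_whitespace_led_list(text, pos=0):
    """Top-level rule with the whitespace folded into the delimiter."""
    if pos >= len(text) or not text[pos].isspace():
        raise ValueError("top-level list must be preceded by whitespace")
    return parse_list(text, _skip_ws(text, pos))


if __name__ == "__main__":
    print(parse_whitespace_led_list(" [a,[b,c],d]"))
```

If parse_item called parse_whitespace_led_list instead of parse_list, the
nested [b,c] would be rejected, so two definitions of the same structure are
needed as long as whitespace is part of the opening delimiter.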

Part of the follow-up threads will propose establishing token delimiters
which are single characters or multiple characters. The only one I can't get
around is the <newline>; construct. But that is enshrined and has to be
supported.

What I will propose will make things much easier, both in defining in a BNF
and in building a parser using compiler-compiler tools. We will want people
to continue to use a space before the token delimiter, but if they don't
there is a recovery mechanism if we choose to invoke it.

James knows what is coming because I spent the last 2 weeks in a room with
him banging out these issues. The three weeks I spent in Chester "jelled"
some new ideas about how to use DDLm in quite different discipline domains.
The ANSTO stay has made me think about how to be more rigorous to simplify
building things.

Nick.

Those threads will be out shortly. They will be longish because I will try
to cover all our thinking over the last several weeks.

On 12/09/09 3:17 PM, "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
wrote:

> Dear Colleagues,
> 
>     Inasmuch as we have already generated many CIFs under the existing
> handling of the reverse solidus, which was to treat it as subordinate
> to the existing quoting schemes, I would suggest that we retain the
> same ordering of lexical scan for the treble quotes.  This would
> allow us to keep the handling of embedded treble quotes within treble
> quotes without much special handling at all -- just break up any
> embedded treble quotes with a reverse solidus.  To be more specific,
> here is what I would suggest:
> 
> 1.  On reads:
>      1.1.  The reverse solidus is an ordinary character in the lexical
> scan of a quoted string;
>      1.2.  At the level of a CIF we retain the rule that no terminal
> quote mark is recognized unless followed by whitespace
>      1.3.  At the level of a CIF we strip trailing whitespace from all
> lines prior to the lexical scan
>      1.4.  That, on read, recognition of the reverse solidus is an
> optional semantic interpretation (perhaps handled by a second level
> lexical scan or handled in the application) following the same
> rules as Brian laid out for comments, semi-colon delimited strings
> and, now, for treble quoted strings.
> 
> 2.  On writes:
>      In writing a treble quoted string, if a treble quote is
> encountered as part of the quoted text, a reverse-solidus-newline
> digraph would be inserted after the third quote mark, i.e.
> 
> """This is an example
> of a treble-quoted
> string"""
> 
> might be written as
> 
> """"""\
> This is an example of a treble-quoted
> string"""\
> """
> 
> The interesting case to then consider is whether we need to
> do any quoting to protect the reverse soliduses (solidii?)
> in the example when quoting it.
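
A minimal Python sketch of the write and read rules quoted above, under
stated assumptions: the terminating treble quote is recognised only when it
is followed by whitespace or the end of the input (rule 1.2), the removal of
reverse-solidus-newline digraphs is an optional second pass (rule 1.4), and
the stripping of trailing whitespace (rule 1.3) is not modelled. The
function names are hypothetical.

```python
FOLD = "\\\n"  # the reverse-solidus-newline digraph


def write_treble_quoted(text):
    """Wrap text in treble quotes, inserting the digraph after each
    treble quote that occurs inside the text."""
    protected = text.replace('"""', '"""' + FOLD)
    return '"""' + protected + '"""'


def read_treble_quoted(source, unfold=True):
    """Return the content of a treble-quoted string starting at source[0].

    First pass: the reverse solidus is an ordinary character, and a treble
    quote terminates only when followed by whitespace or end of input.
    Second, optional pass: remove every reverse-solidus-newline digraph.
    """
    if not source.startswith('"""'):
        raise ValueError("no opening treble quote")
    i = 3
    while True:
        i = source.find('"""', i)
        if i == -1:
            raise ValueError("unterminated treble-quoted string")
        after = i + 3
        if after == len(source) or source[after].isspace():
            break
        i += 1
    raw = source[3:i]
    return raw.replace(FOLD, "") if unfold else raw


if __name__ == "__main__":
    example = '"""This is an example\nof a treble-quoted\nstring"""'
    written = write_treble_quoted(example)
    print(written)
    print(read_treble_quoted(written) == example)
```

In this sketch a text that already contains a literal reverse solidus
followed by a newline does not survive the second pass, which is exactly the
open question about protecting reverse soliduses.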
> 
> The main advantage of this approach is that ordinary quoted
> cif-strings such as
> 
> _mugwump "muddy big water"
> 
> could then be treble quoted very directly as
> """_mugwump "muddy big water" """
> 
> rather than as
> 
> """_mugwump " muddy big water" """
> 
> as would be required under Nick's suggestion
> 
> Regards,
>    Herbert
> 
> =====================================================
>   Herbert J. Bernstein, Professor of Computer Science
>     Dowling College, Kramer Science Center, KSC 121
>          Idle Hour Blvd, Oakdale, NY, 11769
> 
>                   +1-631-244-3035
>                   yaya@dowling.edu
> =====================================================
> 
> On Fri, 11 Sep 2009, Herbert J. Bernstein wrote:
> 
>> The main value of the treble quoted string is that it allows a much
>> neater presentation of examples of chunks of CIFs and text in which
>> presenting such information within semi-colon quoted strings gets
>> somewhat confusing.
>> 
>> For this reason, I would suggest that the most important test of
>> Nick's suggestions would be how faithfully a semi-colon delimited
>> example could be included with _no_ added or subtracted characters,
>> so that people reading dictionaries by eye will reproduce those
>> examples correctly.
>> 
>> For that reason, I hope we can stay as close to """ and ''' delimiting
>> truly raw data as possible.
>> 
>> Regards,
>>  Herbert
>> 
>> 
>> On Fri, 11 Sep 2009, Nick Spadaccini wrote:
>> 
>>> Our last discussion on the implementation of triple quoted strings resulted
>>> in much to-ing and fro-ing and in the end the conclusion was that its
>>> behaviour was to be identical to the semi-colon delimited strings. I
>>> preferred a greater degree of parsing of the string but this was not
>>> popular.
>>> 
>>> Now our illustrious chair, who sits next to me now, asks the question "what
>>> is the point of the triple quoted string", to which I can only shrug my
>>> shoulders. The triple quote string will be useful for containing strings
>>> that include ", ' and ; (in the first character in a record). They will of
>>> course fail when you attempt to include the sequence """. Hence they are no
>>> different to a ; delimited string that cannot include a ; as the first
>>> character of a line.
>>> 
>>> Here is a suggestion. The triple quote string (delimited by """) will treat
>>> its contents as raw, except that
>>> 
>>> (a) When writing the string, ALL quotes contained within will have a space
>>> inserted immediately after the " character. This will allow the triple
>>> quote to be contained within the string by breaking the sequence with
>>> spaces so the tokeniser is not fooled into terminating the string. Clearly
>>> the reverse operation is required in reading the string. In this way it is
>>> possible to include all manner of text, markup and programming scripts
>>> within a triple quoted string.
>>> 
>>> (b) We will formally accept in this string the "eliding" of the newline
>>> character. Hence a reverse solidus (\) immediately prior to the record
>>> terminating character(s) will imply the \ and the record terminating
>>> characters are deleted from the stream, and the next line is wrapped
>>> around.
>>> 
>>> To allow for the odd case when one wants to literally include the \ and
>>> the record terminating character(s) in the string then the required \ will
>>> be elided.
>>> 
>>> In parsing the contents of a string the only things required are
>>> (1) delete one of the spaces after every "
>>> (2) treat \<newline> as a wrap around
>>> (3) treat \\<newline> as the raw string \<newline>
>>> 
>>> All other characters are left as is.
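
The three parsing rules quoted above, together with the matching write
step, can be sketched in Python. This is an illustration under assumptions:
the record terminator is taken to be a single newline, the writer never
folds lines itself, and all function names are hypothetical.

```python
import re


def encode_body(text):
    """Write step: protect a literal reverse solidus before a newline by
    doubling it, then insert a space after every double quote."""
    escaped = text.replace("\\\n", "\\\\\n")
    return escaped.replace('"', '" ')


def write_triple_quoted(text):
    """Wrap the encoded body in triple quotes."""
    return '"""' + encode_body(text) + '"""'


def decode_body(body):
    """Read step, applying the three rules in order."""
    # (1) delete one of the spaces after every double quote
    unspaced = body.replace('" ', '"')
    # (2) reverse solidus + newline is a wrap-around and is removed
    # (3) two reverse soliduses + newline stand for a raw solidus + newline
    return re.sub(
        r"(\\?)\\\n",
        lambda m: "\\\n" if m.group(1) else "",
        unspaced,
    )


def read_triple_quoted(source):
    """The first triple quote after the opening one terminates the string,
    because an encoded body never contains three quotes in a row."""
    if not source.startswith('"""'):
        raise ValueError("no opening triple quote")
    end = source.find('"""', 3)
    if end == -1:
        raise ValueError("unterminated triple-quoted string")
    return decode_body(source[3:end])


if __name__ == "__main__":
    print(write_triple_quoted('_mugwump "muddy big water"'))
    print(read_triple_quoted('"""one \\\ntwo"""'))
```

Because every quote in the body is followed by a space, the tokeniser can
stop at the first triple quote it meets, at the cost of the stored text
differing from the original by the inserted spaces.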
>>> 
>>> 
>>> cheers
>>> 
>>> Nick
>>> 
>>> --------------------------------
>>> Associate Professor N. Spadaccini, PhD
>>> School of Computer Science & Software Engineering
>>> 
>>> The University of Western Australia    t: +61 (0)8 6488 3452
>>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>>> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
>>> MBDP  M002
>>> 
>>> CRICOS Provider Code: 00126G
>>> 
>>> e: Nick.Spadaccini@uwa.edu.au
>>> 
>>> 
>>> 
>>> 
>>> 
>>> _______________________________________________
>>> ddlm-group mailing list
>>> ddlm-group@iucr.org
>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>> 
>> 

cheers

Nick

--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au





