
Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Sorry, I have been loose in my use of the term “remediate”. I do mean that our applications will have to handle the different contexts of incoming DDL1/DDL2/DDLm files. This can be achieved by context sensitivity in our parsers, BUT it would be much easier if each file in the existing archives were tagged with a leading comment string indicating the DDL type it adheres to. Yes, this will mean changing the archive by adding a comment, which we can do via a simple (trivial) script. But it is not critical that it happen this way.

The IUCr will have to decide on that one.
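
As a rough illustration of how trivial such a tagging script could be, here is a minimal Python sketch. The tag text ("# DDL_TYPE: ...") and the function names are assumptions made up for this example; the thread does not define a tag format, and a real script would also have to decide how the tag interacts with any existing first-line comment.

```python
from pathlib import Path

# Hypothetical tag prefix; nothing in this thread fixes the actual wording.
TAG_PREFIX = "# DDL_TYPE: "


def tag_cif_text(text: str, ddl: str) -> str:
    """Return the CIF text with a leading DDL-type comment line.

    If the first line already carries a tag, the text is returned unchanged,
    so running the script twice is harmless.
    """
    first_line = text.split("\n", 1)[0]
    if first_line.startswith(TAG_PREFIX):
        return text
    return TAG_PREFIX + ddl + "\n" + text


def tag_cif_file(path: Path, ddl: str) -> None:
    """Rewrite one archive file in place with the leading tag added."""
    original = path.read_text(encoding="utf-8")
    path.write_text(tag_cif_text(original, ddl), encoding="utf-8")
```

A parser could then read the first line and select DDL1, DDL2 or DDLm handling before scanning the rest of the file.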

On 23/09/09 12:25 AM, "SIMON WESTRIP" <simonwestrip@btinternet.com> wrote:

Dear all

I can't speak for the IUCr, but I can't foresee them 'remediating all existing CIFs' in their archives, which after all, represent 'published' data that strictly shouldn't be altered once published?

Rather, I suspect that any new software based on DDLm will have to be aware that it might have to deal with such CIFs and, where possible, 'quietly' correct any violations (this approach is already being used for CIFs submitted to the IUCr that have 'minor' syntax and dictionary violations).

Cheers

Simon P. Westrip



From:
Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
To: Nick.Spadaccini@uwa.edu.au; Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Tuesday, 22 September, 2009 10:55:54 AM
Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Dear Colleagues,

   I think it urgent to at least hear from the PDB and the IUCr journals
operation on the subject of remediating all existing CIFs, as well as
from the managers of the major graphics and data processing packages very
early in the discussion.

   However, never one to fail to rush in where angels fear to tread, here
are my comments on substance:

   I would prefer to retain the current CIF approach of recognizing
anything that can be whitespace delimited and which is not in a small
list of reserved items as a whitespace delimited value.  I would suggest
that the reserved items be:

   Any item beginning with an underscore ('_')
   Any item beginning with "data_" or "save_" (case insensitive)
   Any item consisting of "global_", "loop_", "stop_" (case insensitive)
   Any item beginning with one of the quote marks:
      '"' (double quote)
      '\'' (single quote)
      '\n;' (newline-semicolon) (where newline is system dependent)
      '[', '{', '(' (the three bracket constructs in the original DDLm
         proposal)
      '\'\'\'' or '"""' (the two treble quote marks used in other languages)

When an item begins with one of the quote marks it would then have to
conform to the conventions specified for those quote marks, but in general
at the top level, the mating terminal quote mark would not be recognized
as a terminal quote mark unless followed by whitespace.
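
The reserved-item list above can be sketched as a small classifier over whitespace-delimited items. This is only one reading of the proposal: the function name, the category labels, and the at_line_start flag (standing in for the newline before a semicolon) are assumptions for illustration.

```python
def classify_token(token: str, at_line_start: bool = False) -> str:
    """Classify one whitespace-delimited item using the reserved-item list.

    at_line_start says whether the item directly follows a newline, which
    is what turns a leading ';' into the newline-semicolon quote mark.
    """
    lowered = token.lower()
    if token.startswith("_"):
        return "data_name"
    if lowered.startswith(("data_", "save_")):
        return "block_or_frame"
    # These three are reserved only when they make up the whole item.
    if lowered in ("global_", "loop_", "stop_"):
        return "keyword"
    # Check treble quotes before single-character quotes.
    if token.startswith(("'''", '"""')):
        return "treble_quoted"
    if token.startswith(('"', "'")):
        return "quoted"
    if at_line_start and token.startswith(";"):
        return "text_field"
    if token.startswith(("[", "{", "(")):
        return "bracketed"
    # Everything else is an ordinary whitespace-delimited value.
    return "plain_value"
```

Under this reading an unquoted x,y+1/2,z stays a plain value, because it matches none of the reserved prefixes.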

   I would prefer to handle the elides one level down, i.e. not treating
'"""\\\n' as a terminal treble quote mark because the last '"' is followed
by a reverse solidus rather than by whitespace.
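
A minimal Python sketch of the "terminal quote mark only when followed by whitespace" rule follows. The function name and return convention are assumptions for this example, and end of input is treated as whitespace, which the proposal does not spell out.

```python
def find_closing_quote(text: str, start: int, quote: str) -> int:
    """Return the index just past the terminal quote mark, or -1 if none.

    text[start:] must begin with the opening quote mark. A candidate
    closing mark counts only when followed by whitespace (or, as an
    assumption of this sketch, by end of input).
    """
    pos = start + len(quote)
    while True:
        idx = text.find(quote, pos)
        if idx == -1:
            return -1  # no terminal quote mark found
        end = idx + len(quote)
        if end == len(text) or text[end].isspace():
            return end
        # Not followed by whitespace: treat it as content and keep looking.
        pos = idx + 1
```

With this rule the embedded quote in 'it's here' is ordinary content, and a treble quote followed by a reverse solidus is skipped, leaving the elide to be dealt with one level down.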

   I would prefer to accept all UTF-8 text.

   I believe that this approach would reduce the impact of the remediation
on existing CIFs and existing software and would also allow recursion to
be handled with minimal confusion.

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Mon, 21 Sep 2009, Nick Spadaccini wrote:

> I am for wider discussion so long as we converge on something. But I
> reiterate that, whatever we do, we need to remediate existing cifs anyway.
>
> The changes are drastic on paper only. There aren't that many existing cifs
> that violate the restrictions I am suggesting. However in order to allow for
> truly recursive data structures, some restrictions need to be put in place,
> otherwise scanners will have to be built according to a large number of
> exceptions rather than any rules.
>
> Having supported Herb's suggestion I will also state that I would prefer some
> discussion on this list BEFORE it goes out to the wider community. James and
> I have thought these issues out over 2 weeks together, so we have fairly
> deep reasoning behind our proposals. We would like discussion here to draw
> out those reasons so we are all on the same page.
>
>
> On 18/09/09 2:45 PM, "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
> wrote:
>
>> This would seem to create a serious divergence between valid DDLm CIFs and
>> valid DDL1 and DDL2 CIFs.  I would suggest putting this specific proposal
>> out to the wider community for comments.  Up until now we had been trying
>> to assure people that the change to DDLm would not invalidate existing
>> CIFs.  I for one would hope we could do something less drastic.
>>    -- Herbert
>>
>> On Fri, 18 Sep 2009, Nick Spadaccini wrote:
>>
>>> As I have written before, non-delimited strings (non-DS) that are not of the
>>> Number or Measured types cause problems. Everything you need to include in a
>>> string can be handled by the delimited string types. With the introduction
>>> of compound data structures, restrictions have to be imposed on the allowed
>>> alphabet of non-delimited strings so the scanner is not "fooled".
>>>
>>> If you HAVE to use non-delimited strings then the alphabet is restricted to,
>>>
>>> non-DS = [A-Za-z0-9./-()+?][A-Za-z0-9_./-()+?]*
>>>
>>> (Allowing for / is in deference to James, I don't see a great need for it.)
>>>
>>> The square brackets [] are part of the regexp and are not allowed characters.
>>> This will cover all numerics including Measured, and disallow _ as the
>>> first character. None of the token delimiters are included in the alphabet.
>>> Note also that the classic symop example x,y+1/2,z IS NOT allowed, though a
>>> quick scan of the IUCr cif archive shows many submissions already quite
>>> sensibly use "x,y+1/2,z".
>>>
>>> One level of simplification will be in the definition of datanames (DN). We
>>> could simply define a data name as
>>>
>>> DN = _{non-DS}
>>>
>>> All CIF data names in the new DDLm dictionaries are consistent with this
>>> restriction. A small amount of remediation (which has to be undertaken
>>> anyway) will need to be done for existing domain dictionaries written in
>>> either DDL1 or DDL2.
>>>
>>> A further simplification is that one can write the scanner to look for token
>>> terminating characters, rather than DEMANDING it be followed or preceded by
>>> a whitespace.
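
The non-DS alphabet and the data name rule quoted above can be checked with a short Python sketch. The function names are assumptions for this example, and the hyphen is escaped inside the character class because Python would otherwise read "/-(" as a range; the intended alphabet is the one given in the proposal.

```python
import re

# Proposed non-DS alphabet: letters, digits and . / - ( ) + ?
# An underscore is allowed anywhere except as the first character.
NON_DS = re.compile(r"[A-Za-z0-9./\-()+?][A-Za-z0-9_./\-()+?]*")


def is_non_ds(value: str) -> bool:
    """True if the whole value fits the proposed non-delimited alphabet."""
    return NON_DS.fullmatch(value) is not None


def is_data_name(name: str) -> bool:
    """True if name is an underscore followed by a non-DS (DN = _{non-DS})."""
    return name.startswith("_") and is_non_ds(name[1:])
```

Numeric values with a standard uncertainty such as 1.234(5) pass, while an unquoted x,y+1/2,z fails because the comma is outside the alphabet.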
>>>
>>>
>>> cheers
>>>
>>> Nick
>>>
>>> --------------------------------
>>> Associate Professor N. Spadaccini, PhD
>>> School of Computer Science & Software Engineering
>>>
>>> The University of Western Australia    t: +61 (0)8 6488 3452
>>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>>> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
>>> MBDP  M002
>>>
>>> CRICOS Provider Code: 00126G
>>>
>>> e: Nick.Spadaccini@uwa.edu.au
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> ddlm-group mailing list
>>> ddlm-group@iucr.org
>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>
>
> cheers
>
> Nick


cheers

Nick