
Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Dear Colleagues,

   I sense a certain strong emotion in this.  I don't think that is the
way to resolve this.  Nick has his views.  I have mine.  Neither of us
has the final say.  I suggest that we put these matters to a straw
vote, tell the community the outcome, and then move on to more
substantive issues.

   Issue 1:  Removing the requirement for a trailing whitespace after
quoted strings outside of bracketed constructs.
   Options:  1.1. Preserve the current convention as is
             1.2. Terminate all quoted strings on the occurrence of the
closing quote delimiter, without consideration of the next character

   Issue 2:  Restriction of the character set for non-delimited strings
outside of bracketed constructs
   Options: 2.1.  Preserve the current convention as is
            2.2.  Modify the current convention to deprecate use of
                  any characters other than a strictly limited set
                  of characters, adding a warning on reads and
                  defaulting to add quote marks on write
            2.3.  Modify the current convention to forbid the use of
                  any characters other than a strictly limited set
                  of characters, making it an error to read a non-delimited
                  string that does not comply even if the intention
                  can be inferred from context

    Issue 3:  Use of UTF-8
    Options:  3.1.  Do not use UTF-8
              3.2.  Use UTF-8

My votes would be 1.1, 2.2, 3.2

Whatever the outcome of the vote, I will code at least one variant of a 
parser to comply, but it will take longer if the vote goes for 1.2 and 
2.3.
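
As an illustration only, option 2.2 (warn on read, add quote marks on write) could be sketched as below. The function names are hypothetical, and the restricted set is assumed to be the one proposed in the quoted message further down (no " ' , : { } in a non-delimited string). Other CIF rules, such as reserved words and special leading characters, are ignored here.

```python
import warnings

# Assumption: the excluded characters are those proposed in the quoted
# message for non-delimited strings. Whitespace always ends a token.
RESTRICTED = set("\"',:{}")


def is_plain(value: str) -> bool:
    """True if value could be written without delimiters under the assumed set."""
    return bool(value) and not any(c in RESTRICTED or c.isspace() for c in value)


def read_non_delimited(token: str) -> str:
    """Option 2.2 on read: accept the token, but warn about deprecated characters."""
    if not is_plain(token):
        warnings.warn(
            f"deprecated character in non-delimited string: {token!r}",
            DeprecationWarning,
            stacklevel=2,
        )
    return token


def write_value(value: str) -> str:
    """Option 2.2 on write: add quote marks whenever the value is not plain."""
    if is_plain(value):
        return value
    if "\n" not in value:
        # Pick a delimiter that does not occur inside the value.
        if "'" not in value:
            return f"'{value}'"
        if '"' not in value:
            return f'"{value}"'
    if "\n;" in value:
        raise ValueError("value cannot be written as a semicolon text field")
    # Fall back to a semicolon-delimited text field.
    return f"\n;{value}\n;"
```

Under option 2.3 the warning in `read_non_delimited` would become an error; the write side would be unchanged.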

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Fri, 9 Oct 2009, Nick Spadaccini wrote:

> Ok. Back on board. I am proposing some old and some new stuff here. From the
> beginning,
>
> (1) restricting the character set of non-delimited strings is
> NON-NEGOTIABLE. If we don't restrict it, then we can't build recursive data
> structures and exploit DDLm. If we aren't going to exploit DDLm, IUCr should
> drop it now and stick with its current DDL.
>
> IUCr needs to make that decision now.
>
> I have built a new lexer for the current syntax specification and checked
> for cases where
>
> (1) a double-quote-delimited string contains a double quote.
> (2) a single-quote-delimited string contains a single quote.
> (3) a non-delimited string contains any of " ' , : { }
> (4) a data name contains any of (3)
>
> The characters in (3) are, I think, a sufficient restriction on non-delimited
> strings to enable us to move forward.
>
> I have scanned 10345 of the 60173 (17%) mmCIF files in the archive. The
> results are
>
> (1) 0 of the 3.4M (M = million) data values failed the test.
>
> (2) 4 of the 1.3M data values failed the test.
> When I pointed these out to John he said these SHOULD have been in
> semi-colon delimited text because at the PDB they have been systematically
> dealing with quotes within quotes to avoid parsing problems.
>
> HENCE not allowing a string delimiter character within the string delimited
> by the same character poses very little or no problem in mmCIF.
>
> (3) 138,733 of the 2,009M data values failed the test (.007%)
>
> Again the magnitude of the problem has been exaggerated. The restrictions
> will not affect many of the archived data items. All the failures were
> limited to 3-5 data names. These were those with embedded : which includes
> the specification of a URL, and those with embedded , to which Herb has
> already alluded. John has stipulated that the restrictions we are
> suggesting can be quickly and efficiently implemented (I am here and have
> looked at their systems: it is a single change to a dictionary entry, and
> all software handles the change immediately). I believe the PDB has a
> remediation process that will resolve all legacy issues (at least for them).
>
> Conclusion: This restriction has minimal (.007%) impact on how things have
> been done, and can be easily implemented for files from here on.
>
> (4) 0 data names contain these characters.
>
> I will not comment further on this point until I have done the same analysis
> for the IUCr archive. I suspect the problem will be bigger for those files
> because they represent a more lackadaisical period in CIF's evolution, when
> we suggested you could do whatever you want, and there are also IUCr
> markups that are likely to cause problems. Once I get my hands on that
> archive I will let people know.
>
> Now guess what? If we don't allow a ' within a '..' and a " within a ".."
> and any "',:{} within a non-delimited string or a data name WE DON'T NEED A
> SPACE BEFORE OR AFTER THE TOKEN DELIMITER. This simplifies AND more
> importantly NORMALIZES the grammar.
>
> I don't accept the argument that the new parser is so much more difficult
> than existing parsers. Currently you have (if you are inside a double quote
> delimited string)
>
> if (c == '"') {
>   next = lookahead(1);   /* peek at the character after the quote */
>   if (next == ' ') return END_OF_STRING;
>   else continue;         /* quote is part of the string contents */
> }
>
> In the new parser you will have
>
> if (c == '"') return END_OF_STRING;
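
The difference between the two termination rules can be sketched as follows. This is an illustration, not either party's parser: the function names are hypothetical, `start` is assumed to be the index of the first character after the opening delimiter, and the current rule is taken to accept any whitespace or end of input after the closing delimiter.

```python
def end_of_quoted_current(text: str, start: int, quote: str) -> int:
    """Current rule: a delimiter closes the string only when it is followed
    by whitespace or the end of input. Returns its index, or -1."""
    i = start
    while i < len(text):
        if text[i] == quote and (i + 1 == len(text) or text[i + 1].isspace()):
            return i
        i += 1
    return -1


def end_of_quoted_proposed(text: str, start: int, quote: str) -> int:
    """Proposed rule: the first delimiter closes the string, with no
    lookahead. Returns its index, or -1."""
    return text.find(quote, start)


line = "'it's fine' next"
print(line[1:end_of_quoted_current(line, 1, "'")])   # it's fine
print(line[1:end_of_quoted_proposed(line, 1, "'")])  # it
```

The two rules agree on every string that does not contain its own delimiter, which is the case the survey figures above address.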
>
> YOU WILL NOTE:
>
> I have not included the [] characters in the restriction. There is too much
> legacy associated with their existence in data names in both small and mm
> CIFs.
>
> I am going to suggest a single token to represent lists, lists of lists and
> associative arrays, namely  {...}. These are new, and don't present a
> problem.
>
> UTF-8 encoding. This is a 1-4 byte variable-length encoding scheme (actually
> originally up to 6 bytes providing 31 bits of representation). It is a
> binary representation. The encoding algorithm is not brain busting, but
> neither is it trivial. Having a CIF file not editable by a bog standard
> editor will upset some people. I propose the introduction of a new string
> type within the DDLm semantics that allows one to define it to be Unicode.
> Within the string I propose we adopt a \uABCD[EF] (ie 1-6 HEX characters) to
> represent the character. Equally we could go with the HTML approach of
> &#xABCD[EF]; (ie 1-6 HEX characters).
>
> I also strongly propose support for the UNICODE string within """ strings
> ONLY. Let's start from a restrictive stance from the outset.
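
A minimal sketch of decoding the proposed \u escape, for illustration only. The function name is hypothetical. The proposal does not say how a 1-6 digit escape is terminated, so this sketch assumes a greedy match of up to six hex digits; under that assumption an escape cannot be followed directly by a literal hex digit.

```python
import re

# Assumption: an escape is "\u" followed by 1-6 hex digits, matched greedily.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{1,6})")


def decode_unicode_escapes(s: str) -> str:
    """Replace each \\uH..H escape with the character it names."""

    def replace(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF:
            # Six hex digits can exceed the Unicode range.
            raise ValueError(f"not a Unicode code point: {match.group(0)}")
        return chr(code_point)

    return _ESCAPE.sub(replace, s)


text = decode_unicode_escapes(r"\u3b1-helix")
print(text)                       # α-helix
print(len(text.encode("utf-8")))  # 8 bytes: α takes two bytes in UTF-8
```

A file written this way stays pure ASCII on disk and editable in any editor, while a reader recovers the full characters after parsing.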
>
> I will be arriving at Dowling at about noon on Wednesday Herb. I'll bring my
> boxing gloves, Frances can referee :)
>
> Nick
>
> On 6/10/09 11:01 PM, "James Hester" <jamesrhester@gmail.com> wrote:
>
>> Dear All:
>>
>> As a result of the discussion with Herbert I can see two differing
>> approaches to these CIF syntax changes:
>>
>> 1. Any changes to CIF syntax should be such that earlier syntax
>> versions form a subset of the new syntax, i.e. files in the older
>> syntax will also conform to the new syntax
>>
>> or
>>
>> 2. When making changes to the standard, the opportunity should be
>> taken to simplify and streamline syntax as much as possible.
>>
>> Advantages of (1): a single CIF parser can be maintained for all
>> syntax versions; a CIF writer is always conformant to the latest
>> version and only needs changing if new syntax features are to be used;
>> the existing CIF software ecosystem is minimally affected
>>
>> Advantages of (2): implementation of CIF readers/writers from scratch
>> is easier; the standard is easier to define formally and more
>> aesthetically pleasing; mistakes in previous versions can be fixed,
>> warts do not accumulate
>>
>> I would like to suggest we act as follows: in essence, we deprecate
>> rather than exclude.  In detail:
>>
>> 1. For this edition of the standard (1.2) we follow Herbert's line,
>> leaving everything currently defined untouched.  We simply add triple
>> quote delimited strings and bracket expressions.  The content of
>> non-delimited strings in bracket expressions will be as proposed by
>> Nick.
>>
>> 2. In the documents associated with the new standard we strongly
>> suggest that all non-delimited strings use the same character set as
>> for non-delimited strings in bracket expressions (i.e. Nick's original
>> proposal).  We might point out that this simplifies code for writing
>> CIFs, and perhaps (if all agree) we add that using the CIF1.1
>> non-delimited string character set is deprecated, darkly foreshadowing
>> that a future version of the syntax standard will adopt this character
>> set for all non-delimited strings.
>>
>> 3. We also deprecate including string delimiters inside strings,
>> regardless of whitespace issues.
>>
>> 4. In all dictionaries we adopt the restricted character set for
>> non-delimited strings and exclusion of string delimiters in strings.
>>
>> 5. We ask that CheckCIF emit a warning about use of deprecated
>> characters in non-delimited strings
>>
>> 6. When (say in 10 years' time) a sufficiently large proportion of
>> incoming CIFs conform to the new non-delimited string character set,
>> we promulgate the 1.3 version of the standard.
>>
>
> cheers
>
> Nick
>
> --------------------------------
> Associate Professor N. Spadaccini, PhD
> School of Computer Science & Software Engineering
>
> The University of Western Australia    t: +61 (0)8 6488 3452
> 35 Stirling Highway                    f: +61 (0)8 6488 1089
> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
> MBDP  M002
>
> CRICOS Provider Code: 00126G
>
> e: Nick.Spadaccini@uwa.edu.au
>
>
>
>
>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
