
Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

To: [email protected], Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
From: Brian McMahon <[email protected]>
Date: Sat, 10 Oct 2009 17:50:35 +0100
In-Reply-To: <C6F46C08.11FE9%[email protected]>
References: <[email protected]><C6F46C08.11FE9%[email protected]>

Nick

I am still mulling over your proposals for restricting character
sets within quote-delimited strings. Just so I have my thinking
straight, in what way (or ways) in your latest proposal can you
legally express the string data value
          O1'
(an atom label) in a CIF?

Thanks
Brian

On Fri, Oct 09, 2009 at 04:26:48AM +0800, Nick Spadaccini wrote:
> Ok. Back on board. I am proposing some old and some new stuff here. From the
> beginning,
> 
> (1) restricting the character set of non-delimited strings is
> NON-NEGOTIABLE. If we don't restrict it, then we can't build recursive data
> structures and exploit DDLm. If we aren't going to exploit DDLm, IUCr should
> drop it now and stick with its current DDL.
> 
> IUCr needs to make that decision now.
> 
> I have built a new lexer for the current syntax specification and checked
> for cases where
> 
> (1) a double-quote-delimited string contains a double quote.
> (2) a single-quote-delimited string contains a single quote.
> (3) a non-delimited string contains any of " ' , : { }
> (4) a data name contains any of (3)
> 
> The contents of (3) are (I think) a sufficient restriction on non-delimited
> strings to enable us to move forward.
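
A minimal sketch, in Python, of checks (1) to (4) listed above. It is not the lexer described in the message; the function name `violations` and the token labels 'double', 'single', 'bare' and 'name' are hypothetical, and `text` is assumed to be the content between the delimiters (or the whole token for non-delimited strings and data names).

```python
# Characters the proposal bars from non-delimited strings and data names.
RESTRICTED = set("\"',:{}")


def violations(kind, text):
    """Return the sorted list of banned characters present in `text`.

    `kind` is a hypothetical label for the token type:
    'double' (content of a "..." string), 'single' (content of a '...' string),
    'bare' (non-delimited string) or 'name' (data name).
    """
    if kind == "double":
        banned = {'"'}          # check (1)
    elif kind == "single":
        banned = {"'"}          # check (2)
    elif kind in ("bare", "name"):
        banned = RESTRICTED     # checks (3) and (4)
    else:
        raise ValueError(f"unknown token kind: {kind}")
    return sorted(set(text) & banned)
```

Applied to the atom label O1', `violations("bare", "O1'")` and `violations("single", "O1'")` both report the single quote, while `violations("double", "O1'")` returns an empty list.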
> 
> I have scanned 10345 of the 60173 (17%) mmCIF files in the archive. The
> results are
> 
> (1) 0 of the 3.4M (M = million) data values failed the test.
> 
> (2) 4 of the 1.3M data values failed the test.
> When I pointed these out to John he said these SHOULD have been in
> semi-colon delimited text because at the PDB they have been systematically
> dealing with quotes within quotes to avoid parsing problems.
> 
> HENCE not allowing a string delimiter character within the string delimited
> by the same character poses very little or no problem in mmCIF.
> 
> (3) 138,733 of the 2,009M data values failed the test (.007%)
> 
> Again the magnitude of the problem has been exaggerated. The restrictions
> will not affect many of the archived data items. All the failures were
> limited to 3-5 data names. These were those with embedded : which includes
> the specification of a URL, and those with embedded , to which Herb has
> already alluded. John has stipulated that those restrictions we are
> suggesting can be quickly and efficiently implemented (I am here and have looked
> at their systems: the change is a single edit to a dictionary entry, and
> all software handles the change immediately). I believe the PDB has a
> remediation process that will resolve all legacy issues (at least for them).
> 
> Conclusion: This restriction has minimal (.007%) impact on how things have
> been done, and can be easily implemented for files from here on.
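
The percentages quoted above can be reproduced with a few lines of Python. This is only an arithmetic check, and it assumes that "2,009M" means 2,009 million non-delimited data values; the variable names are illustrative.

```python
files_scanned, files_total = 10_345, 60_173
failures = 138_733
values = 2_009_000_000  # assumption: "2,009M" read as 2,009 million

share = files_scanned / files_total   # fraction of the archive scanned
rate = failures / values              # fraction of values failing test (3)

print(f"files scanned: {share:.1%}")
print(f"failure rate:  {rate:.4%}")
```

Under that reading the scanned share comes to about 17.2% and the failure rate to about 0.0069%, consistent with the rounded figures of 17% and .007%.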
> 
> (4) 0 data names contain these characters.
> 
> I will not comment further on this point until I have done the same analysis
> for the IUCr archive. I suspect the problem will be bigger for those files
> because they represent a more lackadaisical period in CIF's evolution where
> we suggested you could do whatever you want etc, and also there are IUCr
> mark ups that likely cause problems. Once I get my hands on that archive I
> will let people know.
> 
> Now guess what? If we don't allow a ' within a '..' and a " within a ".."
> and any "',:{} within a non-delimited string or a data name WE DON'T NEED A
> SPACE BEFORE OR AFTER THE TOKEN DELIMITER. This simplifies AND more
> importantly NORMALIZES the grammar.
> 
> I don't accept the argument that the new parser is so much more difficult
> than existing parsers. Currently you have (if you are inside a double quote
> delimited string)
> 
> if (ch == '"') {                  /* candidate closing quote */
>   next = lookahead(1);
>   if (next == ' ') return END_OF_STRING;
>   else continue;                  /* embedded quote, keep scanning */
> }
> 
> In the new parser you will have
> 
> if (ch == '"') return END_OF_STRING;
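
The two rules can be compared side by side with a small Python sketch. The function names `end_current` and `end_proposed` are hypothetical, and the current-rule version assumes that any whitespace character or the end of input (not only a space) may follow the closing quote.

```python
WHITESPACE = " \t\r\n"


def end_current(text, start):
    """Index of the closing quote under the current rule.

    A '"' only closes the string when it is followed by whitespace
    or by the end of the input; otherwise it is part of the value.
    Returns -1 if no closing quote is found.
    """
    for i in range(start, len(text)):
        if text[i] == '"' and (i + 1 == len(text) or text[i + 1] in WHITESPACE):
            return i
    return -1


def end_proposed(text, start):
    """Index of the closing quote under the proposed rule.

    The first '"' after the opening delimiter closes the string.
    Returns -1 if no closing quote is found.
    """
    return text.find('"', start)
```

For the input `"a"b" c`, scanning from index 1 (just after the opening quote), `end_current` returns 4 and so reads the value as `a"b`, whereas `end_proposed` returns 2 and reads the value as `a`. For a string with no embedded delimiter, such as `"ab" c`, both return the same index.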
> 
> YOU WILL NOTE:
> 
> I have not included the [] characters in the restriction. There is too much
> legacy associated with their existence in data names in both small and mm
> CIFs.
> 
> I am going to suggest a single token to represent lists, lists of lists and
> associative arrays, namely  {...}. These are new, and don't present a
> problem.
> 
> UTF-8 encoding. This is a 1-4 byte variable encoding scheme (actually
> originally up to 6 bytes providing 31 bits of representation). It is a
> binary representation. The encoding algorithm is not brain busting, but
> neither is it trivial. Having a CIF file not editable by a bog standard
> editor will upset some people. I propose the introduction of a new string
> type within the DDLm semantics that allows one to define it to be Unicode.
> Within the string I propose we adopt a \uABCD[EF] (ie 1-6 HEX characters) to
> represent the character. Equally we could go with the HTML approach of
> &#xABCDEF; (ie 1-6 HEX characters).
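
A rough Python sketch of how either escape notation could be turned into characters. The function name `decode_escapes` is hypothetical, and the choice to read the `\u` form greedily (up to six hex digits) is an assumption, since the message does not say how that form is terminated.

```python
import re

# One pass over both notations: \uH..H (1-6 hex digits, read greedily)
# and &#xH..H; (1-6 hex digits, terminated by the semicolon).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{1,6})|&#x([0-9A-Fa-f]{1,6});")


def _to_char(match):
    digits = match.group(1) or match.group(2)
    code_point = int(digits, 16)
    if code_point > 0x10FFFF:      # beyond the Unicode range: leave untouched
        return match.group(0)
    return chr(code_point)


def decode_escapes(text):
    """Replace \\uHHHH-style and &#xHHHH;-style escapes with characters."""
    return _ESCAPE.sub(_to_char, text)
```

With this sketch, `decode_escapes(r"\u212B")` and `decode_escapes("&#x212B;")` both give the angstrom sign U+212B, which occupies three bytes when encoded as UTF-8. The greedy reading means `\u41BC` is taken as the single code point U+41BC rather than `A` followed by `BC`; the `&#x...;` form avoids that ambiguity because the semicolon marks the end of the escape.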
> 
> I also strongly propose support for the UNICODE string within """ strings
> ONLY. Let's start from a restrictive stance from the outset.
> 
> I will be arriving at Dowling at about noon on Wednesday Herb. I'll bring my
> boxing gloves, Frances can referee :)
> 
> Nick
> 
> On 6/10/09 11:01 PM, "James Hester" <[email protected]> wrote:
> 
> > Dear All:
> > 
> > As a result of the discussion with Herbert I can see two differing
> > approaches to these CIF syntax changes:
> > 
> > 1. Any changes to CIF syntax should be such that earlier syntax
> > versions form a subset of the new syntax, i.e. files in the older
> > syntax will also conform to the new syntax
> > 
> > or
> > 
> > 2. When making changes to the standard, the opportunity should be
> > taken to simplify and streamline syntax as much as possible.
> > 
> > Advantages of (1): a single CIF parser can be maintained for all
> > syntax versions; a CIF writer is always conformant to the latest
> > version and only needs changing if new syntax features are to be used;
> > the existing CIF software ecosystem is minimally affected
> > 
> > Advantages of (2): implementation of CIF readers/writers from scratch
> > is easier; the standard is easier to define formally and more
> > aesthetically pleasing; mistakes in previous versions can be fixed,
> > warts do not accumulate
> > 
> > I would like to suggest we act as follows: in essence, we deprecate
> > rather than exclude.  In detail:
> > 
> > 1. For this edition of the standard (1.2) we follow Herbert's line,
> > leaving everything currently defined untouched.  We simply add triple
> > quote delimited strings and bracket expressions.  The content of
> > non-delimited strings in bracket expressions will be as proposed by
> > Nick.
> > 
> > 2. In the documents associated with the new standard we strongly
> > suggest that all non-delimited strings use the same character set as
> > for non-delimited strings in bracket expressions (i.e. Nick's original
> > proposal).  We might point out that this simplifies code for writing
> > CIFs, and perhaps (if all agree) we add that using the CIF1.1
> > non-delimited string character set is deprecated, darkly foreshadowing
> > that a future version of the syntax standard will adopt this character
> > set for all non-delimited strings.
> > 
> > 3. We also deprecate including string delimiters inside strings,
> > regardless of whitespace issues.
> > 
> > 4. In all dictionaries we adopt the restricted character set for
> > non-delimited strings and exclusion of string delimiters in strings.
> > 
> > 5. We ask that CheckCIF emit a warning about use of deprecated
> > characters in non-delimited strings
> > 
> > 6. When (say in 10 years' time) a sufficiently large proportion of
> > incoming CIFs conform to the new non-delimited string character set,
> > we promulgate the 1.3 version of the standard.
> > 
> 
> cheers
> 
> Nick
> 
> --------------------------------
> Associate Professor N. Spadaccini, PhD
> School of Computer Science & Software Engineering
> 
> The University of Western Australia    t: +61 (0)8 6488 3452
> 35 Stirling Highway                    f: +61 (0)8 6488 1089
> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
> MBDP  M002
> 
> CRICOS Provider Code: 00126G
> 
> e: [email protected]
> 
> 
> 
> 
> 
> _______________________________________________
> ddlm-group mailing list
> [email protected]
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
