Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- To: Nick.Spadaccini@uwa.edu.au, Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- From: Brian McMahon <bm@iucr.org>
- Date: Sat, 10 Oct 2009 17:50:35 +0100
- In-Reply-To: <C6F46C08.11FE9%nick@csse.uwa.edu.au>
- References: <279aad2a0910060801g4afdebaep4f35d3fb0180c0b7@mail.gmail.com><C6F46C08.11FE9%nick@csse.uwa.edu.au>
Nick

I am still mulling over your proposals for restricting character sets
within quote-delimited strings. Just so I have my thinking straight,
in what way (or ways) in your latest proposal can you legally express
the string data value O1' (an atom label) in a CIF?

Thanks
Brian

On Fri, Oct 09, 2009 at 04:26:48AM +0800, Nick Spadaccini wrote:
> Ok. Back on board. I am proposing some old and some new stuff here. From the
> beginning,
>
> (1) restricting the character set of non-delimited strings is
> NON-NEGOTIABLE. If we don't restrict it, then we can't build recursive data
> structures and exploit DDLm. If we aren't going to exploit DDLm, IUCr should
> drop it now and stick with its current DDL.
>
> IUCr needs to make that decision now.
>
> I have built a new lexer for the current syntax specification and checked
> for cases where
>
> (1) a double-quote-delimited string contains a double quote.
> (2) a single-quote-delimited string contains a single quote.
> (3) a non-delimited string contains any of " ' , : { }
> (4) a data name contains any of (3)
>
> The contents of (3) are a sufficient (I think) restriction on non-delimited
> strings to enable us to move forward.
>
> I have scanned 10345 of the 60173 (17%) mmCIF files in the archive. The
> results are
>
> (1) 0 of the 3.4M (M = million) data values failed the test.
>
> (2) 4 of the 1.3M data values failed the test.
> When I pointed these out to John he said these SHOULD have been in
> semi-colon delimited text because at the PDB they have been systematically
> dealing with quotes within quotes to avoid parsing problems.
>
> HENCE not allowing a string delimiter character within the string delimited
> by the same character poses very little or no problem in mmCIF.
>
> (3) 138,733 of the 2,009M data values failed the test (.007%)
>
> Again the magnitude of the problem has been exaggerated. The restrictions
> will not affect many of the archived data items. All the failures were
> limited to 3-5 data names.
> These were those with embedded : which includes
> the specification of a URL, and those with embedded , to which Herb has
> already alluded. John has stipulated that those restrictions we are
> suggesting can be quickly and efficiently implemented (I am here and looked
> at their systems and the changes are a single change to a dictionary entry and
> all software handles the change immediately). I believe the PDB has a
> remediation process that will resolve all legacy issues (at least for them).
>
> Conclusion: This restriction has minimal (.007%) impact on how things have
> been done, and can be easily implemented for files from here on.
>
> (4) 0 data names contain these characters.
>
> I will not comment further on this point until I have done the same analysis
> for the IUCr archive. I suspect the problem will be bigger for those files
> because they represent a more lackadaisical period in CIF's evolution where
> we suggested you could do whatever you want etc, and also there are IUCr
> mark ups that likely cause problems. Once I get my hands on that archive I
> will let people know.
>
> Now guess what? If we don't allow a ' within a '..' and a " within a ".."
> and any "',:{} within a non-delimited string or a data name WE DON'T NEED A
> SPACE BEFORE OR AFTER THE TOKEN DELIMITER. This simplifies AND more
> importantly NORMALIZES the grammar.
>
> I don't accept the argument that the new parser is so much more difficult
> than existing parsers. Currently you have (if you are inside a double quote
> delimited string)
>
> if (char == \") {
>    tmpchar=lookahead(1);
>    if (tmpchar == " ") return END_OF_STRING;
>    else continue;
> }
>
> In the new parser you will have
>
> if (char == \") return END_OF_STRING
>
> YOU WILL NOTE:
>
> I have not included the [] characters in the restriction. There is too much
> legacy associated with their existence in data names in both small and mm
> CIFs.
>
> I am going to suggest a single token to represent lists, lists of lists and
> associative arrays, namely {...}. These are new, and don't present a
> problem.
>
> UTF-8 encoding. This is a 1-4 byte variable encoding schema (actually
> originally up to 6 bytes providing 31 bits of representation). It is a
> binary representation. The encoding algorithm is not brain busting, but
> neither is it trivial. Having a CIF file not editable by a bog standard
> editor will upset some people. I propose the introduction of a new string
> type within the DDLm semantics that allows one to define it to be Unicode.
> Within the string I propose we adopt a \uABCD[EF] (ie 1-6 HEX characters) to
> represent the character. Equally we could go with the HTML approach of
> &#xABCDEF; (ie 1-6 HEX characters).
>
> I also strongly propose support for the UNICODE string within """ strings
> ONLY. Let's start from a restrictive stance from the outset.
>
> I will be arriving at Dowling at about noon on Wednesday Herb. I'll bring my
> boxing gloves, Frances can referee :)
>
> Nick
>
> On 6/10/09 11:01 PM, "James Hester" <jamesrhester@gmail.com> wrote:
>
> > Dear All:
> >
> > As a result of the discussion with Herbert I can see two differing
> > approaches to these CIF syntax changes:
> >
> > 1. Any changes to CIF syntax should be such that earlier syntax
> > versions form a subset of the new syntax, i.e. files in the older
> > syntax will also conform to the new syntax
> >
> > or
> >
> > 2. When making changes to the standard, the opportunity should be
> > taken to simplify and streamline syntax as much as possible.
> >
> > Advantages of (1): a single CIF parser can be maintained for all
> > syntax versions; a CIF writer is always conformant to the latest
> > version and only needs changing if new syntax features are to be used;
> > the existing CIF software ecosystem is minimally affected
> >
> > Advantages of (2): implementation of CIF readers/writers from scratch
> > is easier; the standard is easier to define formally and more
> > aesthetically pleasing; mistakes in previous versions can be fixed,
> > warts do not accumulate
> >
> > I would like to suggest we act as follows: in essence, we deprecate
> > rather than exclude. In detail:
> >
> > 1. For this edition of the standard (1.2) we follow Herbert's line,
> > leaving everything currently defined untouched. We simply add triple
> > quote delimited strings and bracket expressions. The content of
> > non-delimited strings in bracket expressions will be as proposed by
> > Nick.
> >
> > 2. In the documents associated with the new standard we strongly
> > suggest that all non-delimited strings use the same character set as
> > for non-delimited strings in bracket expressions (i.e. Nick's original
> > proposal). We might point out that this simplifies code for writing
> > CIFs, and perhaps (if all agree) we add that using the CIF1.1
> > non-delimited string character set is deprecated, darkly foreshadowing
> > that a future version of the syntax standard will adopt this character
> > set for all non-delimited strings.
> >
> > 3. We also deprecate including string delimiters inside strings,
> > regardless of whitespace issues.
> >
> > 4. In all dictionaries we adopt the restricted character set for
> > non-delimited strings and exclusion of string delimiters in strings.
> >
> > 5. We ask that CheckCIF emit a warning about use of deprecated
> > characters in non-delimited strings
> >
> > 6. When (say in 10 years' time) a sufficiently large proportion of
> > incoming CIFs conform to the new non-delimited string character set,
> > we promulgate the 1.3 version of the standard.
> >
>
> cheers
>
> Nick
>
> --------------------------------
> Associate Professor N. Spadaccini, PhD
> School of Computer Science & Software Engineering
>
> The University of Western Australia    t: +61 (0)8 6488 3452
> 35 Stirling Highway                    f: +61 (0)8 6488 1089
> CRAWLEY, Perth, WA 6009 AUSTRALIA     w3: www.csse.uwa.edu.au/~nick
> MBDP M002
>
> CRICOS Provider Code: 00126G
>
> e: Nick.Spadaccini@uwa.edu.au
>
>
>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
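
The two string-termination rules contrasted in the quoted message, and
the restricted alphabet of test (3), can be sketched in Python roughly
as follows. This is only an illustration under stated assumptions, not
the lexer described in the thread: all function names are hypothetical,
and the lookahead rule here accepts any whitespace (or end of input)
after the closing quote, whereas the quoted fragment tests for a single
space.

```python
# Characters listed under test (3) in the quoted message.
RESTRICTED = set("\"',:{}")


def end_of_quoted_lookahead(text, start):
    """Closing-quote index under the lookahead rule (assumed reading).

    text[start] is the opening quote. A matching quote ends the string
    only if followed by whitespace or end of input. Returns -1 if the
    string is never closed.
    """
    quote = text[start]
    for i in range(start + 1, len(text)):
        if text[i] == quote and (i + 1 == len(text) or text[i + 1].isspace()):
            return i
    return -1


def end_of_quoted_proposed(text, start):
    """Closing-quote index under the proposed rule: first matching quote wins."""
    return text.find(text[start], start + 1)


def allowed_undelimited(value):
    """True if value avoids every character in the restricted set."""
    return not any(ch in RESTRICTED for ch in value)


if __name__ == "__main__":
    single = "'O1'' rest"      # O1' wrapped in single quotes
    double = '"O1\'" rest'     # O1' wrapped in double quotes
    for rule in (end_of_quoted_lookahead, end_of_quoted_proposed):
        for token in (single, double):
            end = rule(token, 0)
            print(rule.__name__, repr(token), "->", repr(token[1:end]))
    print("undelimited O1' allowed:", allowed_undelimited("O1'"))
```

Read this way, the single-quoted form keeps the trailing quote only
under the lookahead rule, while the double-quoted form gives O1' under
both rules; whether that is the intended spelling is exactly the
question put to Nick above.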