Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- To: Nick.Spadaccini@uwa.edu.au, Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Thu, 8 Oct 2009 20:45:01 -0400 (EDT)
- In-Reply-To: <C6F4915D.11FF0%nick@csse.uwa.edu.au>
- References: <C6F4915D.11FF0%nick@csse.uwa.edu.au>
Dear Nick,

If you have persuaded the others to your view, then you will win on a
straw vote. I hope you have not persuaded a majority, because I agree
with neither your premises nor your conclusions, but the only way to
find out is to hear from the others.

I still think the right way to resolve this is to put the items I have
listed to a vote and then move on.

Regards,
Herbert

P.S. From your comments about binary, it sounds as if you intend to
"excommunicate" imgCIF from DDLm. I think that would be a mistake.
imgCIF will benefit greatly from the use of methods, but at worst, I
can always go back to the original name: imgNCIF, where the N stands
for "not", and use methods without the blessing of it being officially
a "CIF" dictionary.

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya@dowling.edu
=====================================================

On Fri, 9 Oct 2009, Nick Spadaccini wrote:

> On 9/10/09 5:37 AM, "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
> wrote:
>
>> Dear Colleagues,
>>
>> I sense a certain strong emotion in this. I don't think that is the
>> way to resolve this. Nick has his views. I have mine. Neither of us
>> has the final say. I suggest that these matters be put to a straw
>> vote, tell the community the outcome, and then move on to more
>> substantive issues.
>
> There isn't emotion in this Herb, but when I say something is not negotiable
> it is a statement of fact.
>
> At least we agree on the item you have probably viewed as emotional, my
> statement of non-negotiability on Issue 2. I put it much stronger: 2.1
> simply is not an option. However it is not a strictly limited set of
> characters. The only restriction I am suggesting is on those 6 characters
> that are token delimiters.
>
> The problem with your suggestions Herb is that they refer to deprecating and
> not enforcing when we are trying to specify the standard. Standards tend to
> be strict, though individual parsers can be liberal in what to do when error
> states arise. That is fair enough, BUT the standard can't be liberal.
>
> As a standard I much prefer 2.3 with the added restrictions for "" and ''
> strings. With that in place, 1.1 doesn't make sense so clearly I prefer 1.2.
>
> UTF-8 introduces strictly binary data into the file. I don't think this is
> the direction to take, notwithstanding that most of us wouldn't know how to
> encode into UTF-8. So what are we going to do? We will probably identify
> the characters we want to encode in some ascii presentation, likely unicode,
> and then use a library function/method to encode it.
>
> To write utf-8 (binary) into the cif file you will have to execute something
> like
>
> outputToCIF("\u1234".encode('utf-8'))
>
> To me it makes more sense to
>
> outputToCIF("\u1234")
>
> And then do the encoding once you read the string in from the CIF. That way
> the CIF remains ascii readable.
>
> I think 1.2, 2.3 with the added restrictions on "" and '', and ascii-fied
> unicode in strings.
>
>> Issue 1: Removing the requirement for a trailing whitespace after
>>          quoted strings outside of bracketed constructs.
>> Options: 1.1. Preserve the current convention as is
>>          1.2. Terminate all quoted strings on the occurrence of the
>>               trailing quoted delimiter without consideration of the
>>               next character
>>
>> Issue 2: Restriction of the character set for non-delimited strings
>>          outside of bracketed constructs
>> Options: 2.1. Preserve the current convention as is
>>          2.2. Modify the current convention to deprecate use of
>>               any characters other than a strictly limited set
>>               of characters, adding a warning on reads and
>>               defaulting to add quote marks on write
>>          2.3. Modify the current convention to forbid the use of
>>               any characters other than a strictly limited set
>>               of characters, making it an error to read a non-delimited
>>               string that does not comply even if the intention
>>               can be inferred from context
>>
>> Issue 3: Use of UTF-8
>> Options: 3.1. Do not use UTF-8
>>          3.2. Use UTF-8
>>
>> My votes would be 1.1, 2.2, 3.2
>>
>> Whatever the outcome of the vote, I will code at least one variant of a
>> parser to comply, but it will take longer if the vote goes for 1.2 and
>> 2.3.
>>
>> Regards,
>> Herbert
>>
>> On Fri, 9 Oct 2009, Nick Spadaccini wrote:
>>
>>> Ok. Back on board. I am proposing some old and some new stuff here. From the
>>> beginning,
>>>
>>> (1) restricting the character set of non-delimited strings is
>>> NON-NEGOTIABLE. If we don't restrict it, then we can't build recursive data
>>> structures and exploit DDLm. If we aren't going to exploit DDLm, IUCr should
>>> drop it now and stick with its current DDL.
>>>
>>> IUCr needs to make that decision now.
>>>
>>> I have built a new lexer for the current syntax specification and checked
>>> for cases where
>>>
>>> (1) a double-quote-delimited string contains a double quote.
>>> (2) a single-quote-delimited string contains a single quote.
>>> (3) a non-delimited string contains any of " ' , : { }
>>> (4) a data name contains any of (3)
>>>
>>> The contents of (3) are (I think) a sufficient restriction on non-delimited
>>> strings to enable us to move forward.
>>>
>>> I have scanned 10345 of the 60173 (17%) mmCIF files in the archive. The
>>> results are
>>>
>>> (1) 0 of the 3.4M (M = million) data values failed the test.
>>>
>>> (2) 4 of the 1.3M data values failed the test.
>>> When I pointed these out to John he said these SHOULD have been in
>>> semi-colon delimited text because at the PDB they have been systematically
>>> dealing with quotes within quotes to avoid parsing problems.
>>>
>>> HENCE not allowing a string delimiter character within the string delimited
>>> by the same character poses very little or no problem in mmCIF.
>>>
>>> (3) 138,733 of the 2,009M data values failed the test (.007%)
>>>
>>> Again the magnitude of the problem has been exaggerated. The restrictions
>>> will not affect many of the archived data items. All the failures were
>>> limited to 3-5 data names. These were those with embedded : which includes
>>> the specification of a URL, and those with embedded , to which Herb has
>>> already alluded. John has stipulated that those restrictions we are
>>> suggesting can be quickly and efficiently implemented (I am here and looked
>>> at their systems and the changes are a single change to dictionary entry and
>>> all software handles the change immediately). I believe the PDB has a
>>> remediation process that will resolve all legacy issues (at least for them).
>>>
>>> Conclusion: This restriction has minimal (.007%) impact on how things have
>>> been done, and can be easily implemented for files from here on.
>>>
>>> (4) 0 data names contain these characters.
>>>
>>> I will not comment further on this point until I have done the same analysis
>>> for the IUCr archive. I suspect the problem will be bigger for those files
>>> because they represent a more lackadaisical period in CIF's evolution where
>>> we suggested you could do whatever you want etc, and also there are IUCr
>>> mark ups that likely cause problems. Once I get my hands on that archive I
>>> will let people know.
>>>
>>> Now guess what? If we don't allow a ' within a '..' and a " within a ".."
>>> and any "',:{} within a non-delimited string or a data name WE DON'T NEED A
>>> SPACE BEFORE OR AFTER THE TOKEN DELIMITER. This simplifies AND more
>>> importantly NORMALIZES the grammar.
>>>
>>> I don't accept the argument that the new parser is so much more difficult
>>> than existing parsers. Currently you have (if you are inside a double quote
>>> delimited string)
>>>
>>> if (ch == '"') {
>>>     tmpchar = lookahead(1);
>>>     if (tmpchar == ' ') return END_OF_STRING;
>>>     else continue;
>>> }
>>>
>>> In the new parser you will have
>>>
>>> if (ch == '"') return END_OF_STRING;
>>>
>>> YOU WILL NOTE:
>>>
>>> I have not included the [] characters in the restriction. There is too much
>>> legacy associated with their existence in data names in both small and mm
>>> CIFs.
>>>
>>> I am going to suggest a single token to represent lists, lists of lists and
>>> associative arrays, namely {...}. These are new, and don't present a
>>> problem.
>>>
>>> UTF-8 encoding. This is a 1-4 byte variable encoding schema (actually
>>> originally up to 6 bytes providing 31 bits of representation). It is a
>>> binary representation. The encoding algorithm is not brain busting, but
>>> neither is it trivial. Having a CIF file not editable by a bog standard
>>> editor will upset some people. I propose the introduction of a new string
>>> type within the DDLm semantics that allows one to define it to be Unicode.
>>> Within the string I propose we adopt a \uABCD[EF] (ie 1-6 HEX characters) to
>>> represent the character. Equally we could go with the HTML approach of
>>> &#xABCDEF; (ie 1-6 HEX characters).
>>>
>>> I also strongly propose support for the UNICODE string within """ strings
>>> ONLY. Let's start from a restrictive stance from the outset.
>>>
>>> I will be arriving at Dowling at about noon on Wednesday Herb. I'll bring my
>>> boxing gloves, Frances can referee :)
>>>
>>> Nick
>>>
>>> On 6/10/09 11:01 PM, "James Hester" <jamesrhester@gmail.com> wrote:
>>>
>>>> Dear All:
>>>>
>>>> As a result of the discussion with Herbert I can see two differing
>>>> approaches to these CIF syntax changes:
>>>>
>>>> 1. Any changes to CIF syntax should be such that earlier syntax
>>>> versions form a subset of the new syntax, i.e. files in the older
>>>> syntax will also conform to the new syntax
>>>>
>>>> or
>>>>
>>>> 2. When making changes to the standard, the opportunity should be
>>>> taken to simplify and streamline syntax as much as possible.
>>>>
>>>> Advantages of (1): a single CIF parser can be maintained for all
>>>> syntax versions; a CIF writer is always conformant to the latest
>>>> version and only needs changing if new syntax features are to be used;
>>>> the existing CIF software ecosystem is minimally affected
>>>>
>>>> Advantages of (2): implementation of CIF readers/writers from scratch
>>>> is easier; the standard is easier to define formally and more
>>>> aesthetically pleasing; mistakes in previous versions can be fixed,
>>>> warts do not accumulate
>>>>
>>>> I would like to suggest we act as follows: in essence, we deprecate
>>>> rather than exclude. In detail:
>>>>
>>>> 1. For this edition of the standard (1.2) we follow Herbert's line,
>>>> leaving everything currently defined untouched. We simply add triple
>>>> quote delimited strings and bracket expressions. The content of
>>>> non-delimited strings in bracket expressions will be as proposed by
>>>> Nick.
>>>>
>>>> 2. In the documents associated with the new standard we strongly
>>>> suggest that all non-delimited strings use the same character set as
>>>> for non-delimited strings in bracket expressions (i.e. Nick's original
>>>> proposal). We might point out that this simplifies code for writing
>>>> CIFs, and perhaps (if all agree) we add that using the CIF1.1
>>>> non-delimited string character set is deprecated, darkly foreshadowing
>>>> that a future version of the syntax standard will adopt this character
>>>> set for all non-delimited strings.
>>>>
>>>> 3. We also deprecate including string delimiters inside strings,
>>>> regardless of whitespace issues.
>>>>
>>>> 4. In all dictionaries we adopt the restricted character set for
>>>> non-delimited strings and exclusion of string delimiters in strings.
>>>>
>>>> 5. We ask that CheckCIF emit a warning about use of deprecated
>>>> characters in non-delimited strings
>>>>
>>>> 6. When (say in 10 years' time) a sufficiently large proportion of
>>>> incoming CIFs conform to the new non-delimited string character set,
>>>> we promulgate the 1.3 version of the standard.
>>>>
>>>
>>> cheers
>>>
>>> Nick
>>>
>>> --------------------------------
>>> Associate Professor N. Spadaccini, PhD
>>> School of Computer Science & Software Engineering
>>>
>>> The University of Western Australia    t: +61 (0)8 6488 3452
>>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>>> CRAWLEY, Perth, WA 6009 AUSTRALIA     w3: www.csse.uwa.edu.au/~nick
>>> MBDP M002
>>>
>>> CRICOS Provider Code: 00126G
>>>
>>> e: Nick.Spadaccini@uwa.edu.au
>>>
>>> _______________________________________________
>>> ddlm-group mailing list
>>> ddlm-group@iucr.org
>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
> cheers
>
> Nick
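Illustration of the two quoted-string rules compared in the thread (options
1.1 and 1.2) and of the six-character restriction on non-delimited strings.
This is only a minimal sketch: the function names are hypothetical, it is not
code from any existing CIF parser, and it assumes the quoted string sits on a
single line and that "whitespace" means space, tab, CR or LF.

```python
# Hypothetical helper names; a sketch of the rules discussed in the thread,
# not an excerpt from any real CIF library.

FORBIDDEN = set("\"',:{}")   # the six token delimiters proposed for exclusion
WHITESPACE = " \t\r\n"       # assumption: what counts as trailing whitespace


def end_of_quoted_current(text, start):
    """Option 1.1: the closing quote must be followed by whitespace or end of input.

    `start` is the index of the opening quote. Returns the index of the
    closing quote, or -1 if the string is never terminated.
    """
    quote = text[start]
    for i in range(start + 1, len(text)):
        if text[i] == quote:
            at_end = i + 1 == len(text)
            if at_end or text[i + 1] in WHITESPACE:
                return i
    return -1


def end_of_quoted_proposed(text, start):
    """Option 1.2: the first matching quote ends the string, no lookahead."""
    return text.find(text[start], start + 1)


def forbidden_chars(value):
    """Return the sorted delimiter characters found in a non-delimited string."""
    return sorted(set(value) & FORBIDDEN)
```

For the input 'a dog's life' followed by a space, the first function reports
the closing quote at index 13 while the second stops at index 6, which is why
option 1.2 only works if a delimiter may not appear inside a string delimited
by the same character. A value such as http://x is flagged for its colon,
matching the URL case mentioned in the scan results.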
- Follow-Ups:
- Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings. (SIMON WESTRIP)
- Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings. (Nick Spadaccini)
- References:
- Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings. (Nick Spadaccini)