Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- To: Nick.Spadaccini@uwa.edu.au, Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Fri, 9 Oct 2009 18:05:55 -0400 (EDT)
- In-Reply-To: <C6F5BF24.1200E%nick@csse.uwa.edu.au>
- References: <C6F5BF24.1200E%nick@csse.uwa.edu.au>
Dear Nick,

imgCIF and CBFlib have supported UTF-8 and will continue to do so. Any
application that supports ASCII can trivially support UTF-8. In addition,
one of the encodings in imgCIF and CBFlib uses UTF-16/UCS-2. If these are
not valid for a CIF, we can always go back to using the name imgNCIF.

Most larger CIFs are no more readable than PostScript files, but there are
editors that do a nice job of displaying UTF-8 properly. I use them for the
multi-lingual strings for the message catalog for RasMol.

The world has many languages, and it makes sense for a data representation
language to be able to handle them. Even for the western European
languages, UTF-8 makes much more sense than using national code pages.

Regards,
Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Sat, 10 Oct 2009, Nick Spadaccini wrote:

> I am willing to be convinced on 3.2 vs an ASCII-based representation. But I
> need answers to several things.
>
> UTF-8 removes text readability from CIF, which is something many still hold
> dear, but that may be a cost.
>
> However, here is a practical example. A user wishes to add the author names
> to an existing CIF. They fire up vim or emacs. The possible non-readability
> of the file already presents a problem. But more importantly, how do they
> inject the UTF-8 coding equivalent of what they need?
>
> On 10/10/09 3:34 AM, "SIMON WESTRIP" <simonwestrip@btinternet.com> wrote:
>
>> Dear all
>>
>> Without having discussed this with the IUCr, my vote would be:
>>
>> 1.2 - delimiters lose trailing whitespace condition
>>
>> 2.3 - restricted char set of non-delimited strings
>>
>> Although I'm sure these two will 'invalidate' many archived CIFs (IUCr
>> archives), just as our software is able to recognize the current specs,
>> it could equally use the same ability to 'remediate' any offending items.
>> Granted this is not an ideal situation, but I don't think the current
>> use of delimiters is ideal either (based on experience of handling CIFs that
>> were edited manually - though fortunately this is not such common practice
>> these days). So if these changes are necessary to realize the potential of
>> DDLm, I have no major objections.
>>
>> 3.2 - allow UTF-8
>>
>> Though this would probably require far more effort from CIF developers than
>> handling the first two changes, in the longer term I'm not sure this should be
>> ruled out. After all, support for such encoding is growing (dare I mention
>> XML?), and the rendering issues are far less of a problem than a few years
>> back (widespread font support).
>>
>> That said, I have to confess that support for 3.2 is partly driven by the fact
>> that a large part of the development of software to support the IUCr's CIF
>> publishing activities involves translation from UTF-8 to ASCII CIF codes;
>> furthermore, we are actively looking at ways to include 'richer' content in
>> CIFs.
>> So for my part, I would at least like to see support for an ASCII-based
>> representation of a wider character set.
>>
>> I have to stress that these are my views (as someone who writes CIF
>> applications for the IUCr) - I've yet to speak with Brian et al. regarding an
>> 'official' view on these matters.
>>
>> Anyway, hope this helps in your deliberations.
>>
>> Cheers
>>
>> Simon
>>
>> Simon P. Westrip
>>
>> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>> To: Nick.Spadaccini@uwa.edu.au; Group finalising DDLm and associated
>> dictionaries <ddlm-group@iucr.org>
>> Sent: Friday, 9 October, 2009 1:45:01
>> Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
>>
>> Dear Nick,
>>
>> If you have persuaded the others to your view, then you will win on a
>> straw vote. I hope you have not persuaded a majority, because I agree
>> neither with your premises, nor your conclusions, but the only way to find
>> out is to hear from the others.
>>
>> I still think the right way to resolve this is to put the items I have
>> listed to a vote and then move on.
>>
>> Regards,
>> Herbert
>>
>> P.S. From your comments about binary, it sounds as if you intend to
>> "excommunicate" imgCIF from DDLm. I think that would be a mistake. imgCIF
>> will benefit greatly from the use of methods, but at worst, I can always
>> go back to the original name: imgNCIF, where the N stands for "not", and
>> use methods without the blessing of it being officially a "CIF"
>> dictionary.
>>
>> On Fri, 9 Oct 2009, Nick Spadaccini wrote:
>>
>>>> On 9/10/09 5:37 AM, "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
>>>> wrote:
>>>>
>>>>>> Dear Colleagues,
>>>>>>
>>>>>> I sense a certain strong emotion in this. I don't think that is the
>>>>>> way to resolve this. Nick has his views. I have mine. Neither of us
>>>>>> has the final say. I suggest that these matters be put to a straw
>>>>>> vote, tell the community the outcome, and then move on to more
>>>>>> substantive issues.
>>>>
>>>> There isn't emotion in this Herb, but when I say something is not
>>>> negotiable it is a statement of fact.
>>>>
>>>> At least we agree on the item you have probably viewed as emotional, my
>>>> statement of non-negotiability on Issue 2. I put it much stronger, 2.1
>>>> simply is not an option. However it is not a strictly limited set of
>>>> characters. The only restrictions I am suggesting are those 6 characters
>>>> that are token delimiters.
>>>>
>>>> The problem with your suggestions Herb is that they refer to deprecating
>>>> and not enforcing when we are trying to specify the standard. Standards
>>>> tend to be strict, though individual parsers can be liberal in what to do
>>>> when error states arise. That is fair enough, BUT the standard can't be
>>>> liberal.
>>>>
>>>> As a standard I much prefer 2.3 with the added restrictions for "" and ''
>>>> strings. With that in place, 1.1 doesn't make sense so clearly I prefer
>>>> 1.2.
>>>>
>>>> UTF-8 introduces strictly binary data into the file. I don't think this is
>>>> the direction to take. Notwithstanding, most of us wouldn't know how to
>>>> encode into UTF-8. So what are we going to do? We will probably identify
>>>> the characters we want to encode in some ASCII presentation, likely
>>>> Unicode, and then use a library function/method to encode it.
>>>>
>>>> To write UTF-8 (binary) into the CIF file you will have to execute
>>>> something like
>>>>
>>>>     outputToCIF("\u1234".encode('utf-8'))
>>>>
>>>> To me it makes more sense to
>>>>
>>>>     outputToCIF("\u1234")
>>>>
>>>> And then do the encoding once you read the string in from the CIF. That way
>>>> the CIF remains ASCII readable.
>>>>
>>>> I think 1.2, 2.3 with the added restrictions on "" and '', and ASCII-fied
>>>> Unicode in strings.
>>>>
>>>>>> Issue 1: Removing the requirement for a trailing whitespace after
>>>>>> quoted strings outside of bracketed constructs.
>>>>>> Options: 1.1. Preserve the current convention as is
>>>>>>          1.2. Terminate all quoted strings on the occurrence of the
>>>>>>               trailing quoted delimiter without consideration of the
>>>>>>               next character
>>>>>>
>>>>>> Issue 2: Restriction of the character set for non-delimited strings
>>>>>> outside of bracketed constructs
>>>>>> Options: 2.1. Preserve the current convention as is
>>>>>>          2.2. Modify the current convention to deprecate use of
>>>>>>               any characters other than a strictly limited set
>>>>>>               of characters, adding a warning on reads and
>>>>>>               defaulting to add quote marks on write
>>>>>>          2.3. Modify the current convention to forbid the use of
>>>>>>               any characters other than a strictly limited set
>>>>>>               of characters, making it an error to read a
>>>>>>               non-delimited string that does not comply even if
>>>>>>               the intention can be inferred from context
>>>>>>
>>>>>> Issue 3: Use of UTF-8
>>>>>> Options: 3.1. Do not use UTF-8
>>>>>>          3.2. Use UTF-8
>>>>>>
>>>>>> My votes would be 1.1, 2.2, 3.2
>>>>>>
>>>>>> Whatever the outcome of the vote, I will code at least one variant of a
>>>>>> parser to comply, but it will take longer if the vote goes for 1.2 and
>>>>>> 2.3.
>>>>>>
>>>>>> Regards,
>>>>>> Herbert
>>>>>>
>>>>>> On Fri, 9 Oct 2009, Nick Spadaccini wrote:
>>>>>>
>>>>>>>> Ok. Back on board. I am proposing some old and some new stuff here.
>>>>>>>> From the beginning,
>>>>>>>>
>>>>>>>> (1) restricting the character set of non-delimited strings is
>>>>>>>> NON-NEGOTIABLE. If we don't restrict it, then we can't build recursive
>>>>>>>> data structures and exploit DDLm.
>>>>>>>> If we aren't going to exploit DDLm, IUCr should drop it now and stick
>>>>>>>> with its current DDL.
>>>>>>>>
>>>>>>>> IUCr needs to make that decision now.
>>>>>>>>
>>>>>>>> I have built a new lexer for the current syntax specification and
>>>>>>>> checked for cases where
>>>>>>>>
>>>>>>>> (1) a double-quote-delimited string contains a double quote.
>>>>>>>> (2) a single-quote-delimited string contains a single quote.
>>>>>>>> (3) a non-delimited string contains any of " ' , : { }
>>>>>>>> (4) a data name contains any of (3)
>>>>>>>>
>>>>>>>> The contents of (3) are a sufficient (I think) restriction to
>>>>>>>> non-delimited strings to enable us to move forward.
>>>>>>>>
>>>>>>>> I have scanned 10345 of the 60173 (17%) mmCIF files in the archive. The
>>>>>>>> results are
>>>>>>>>
>>>>>>>> (1) 0 of the 3.4M (M = million) data values failed the test.
>>>>>>>>
>>>>>>>> (2) 4 of the 1.3M data values failed the test.
>>>>>>>> When I pointed these out to John he said these SHOULD have been in
>>>>>>>> semi-colon delimited text because at the PDB they have been
>>>>>>>> systematically dealing with quotes within quotes to avoid parsing
>>>>>>>> problems.
>>>>>>>>
>>>>>>>> HENCE not allowing a string delimiter character within the string
>>>>>>>> delimited by the same character poses very little or no problem in
>>>>>>>> mmCIF.
>>>>>>>>
>>>>>>>> (3) 138,733 of the 2,009M data values failed the test (.007%)
>>>>>>>>
>>>>>>>> Again the magnitude of the problem has been exaggerated. The
>>>>>>>> restrictions will not affect many of the archived data items. All the
>>>>>>>> failures were limited to 3-5 data names. These were those with embedded
>>>>>>>> : which includes the specification of a URL, and those with embedded ,
>>>>>>>> to which Herb has already alluded.
>>>>>>>> John has stipulated that those restrictions we are suggesting can be
>>>>>>>> quickly and efficiently implemented (I am here and looked at their
>>>>>>>> systems and the changes are a single change to a dictionary entry and
>>>>>>>> all software handles the change immediately). I believe the PDB has a
>>>>>>>> remediation process that will resolve all legacy issues (at least for
>>>>>>>> them).
>>>>>>>>
>>>>>>>> Conclusion: This restriction has minimal (.007%) impact on how things
>>>>>>>> have been done, and can be easily implemented for files from here on.
>>>>>>>>
>>>>>>>> (4) 0 data names contain these characters.
>>>>>>>>
>>>>>>>> I will not comment further on this point until I have done the same
>>>>>>>> analysis for the IUCr archive. I suspect the problem will be bigger for
>>>>>>>> those files because they represent a more lackadaisical period in CIF's
>>>>>>>> evolution where we suggested you could do whatever you want etc, and
>>>>>>>> also there are IUCr mark ups that likely cause problems. Once I get my
>>>>>>>> hands on that archive I will let people know.
>>>>>>>>
>>>>>>>> Now guess what? If we don't allow a ' within a '..' and a " within a
>>>>>>>> ".." and any "',:{} within a non-delimited string or a data name WE
>>>>>>>> DON'T NEED A SPACE BEFORE OR AFTER THE TOKEN DELIMITER. This simplifies
>>>>>>>> AND more importantly NORMALIZES the grammar.
>>>>>>>>
>>>>>>>> I don't accept the argument that the new parser is so much more
>>>>>>>> difficult than existing parsers.
>>>>>>>> Currently you have (if you are inside a double quote delimited string)
>>>>>>>>
>>>>>>>>     if (ch == '"') {
>>>>>>>>         tmpchar = lookahead(1);
>>>>>>>>         if (tmpchar == ' ') return END_OF_STRING;
>>>>>>>>         else continue;
>>>>>>>>     }
>>>>>>>>
>>>>>>>> In the new parser you will have
>>>>>>>>
>>>>>>>>     if (ch == '"') return END_OF_STRING;
>>>>>>>>
>>>>>>>> YOU WILL NOTE:
>>>>>>>>
>>>>>>>> I have not included the [] characters in the restriction. There is too
>>>>>>>> much legacy associated with their existence in data names in both small
>>>>>>>> and mm CIFs.
>>>>>>>>
>>>>>>>> I am going to suggest a single token to represent lists, lists of lists
>>>>>>>> and associative arrays, namely {...}. These are new, and don't present
>>>>>>>> a problem.
>>>>>>>>
>>>>>>>> UTF-8 encoding. This is a 1-4 byte variable encoding schema (actually
>>>>>>>> originally up to 6 bytes providing 31 bits of representation). It is a
>>>>>>>> binary representation. The encoding algorithm is not brain busting, but
>>>>>>>> neither is it trivial. Having a CIF file not editable by a bog standard
>>>>>>>> editor will upset some people. I propose the introduction of a new
>>>>>>>> string type within the DDLm semantics that allows one to define it to
>>>>>>>> be Unicode. Within the string I propose we adopt a \uABCD[EF] (i.e. 1-6
>>>>>>>> HEX characters) to represent the character. Equally we could go with
>>>>>>>> the HTML approach of &#xABCD[EF]; (i.e. 1-6 HEX characters).
>>>>>>>>
>>>>>>>> I also strongly propose support for the UNICODE string within """
>>>>>>>> strings ONLY. Let's start from a restrictive stance from the outset.
>>>>>>>>
>>>>>>>> I will be arriving at Dowling at about noon on Wednesday Herb. I'll
>>>>>>>> bring my boxing gloves, Frances can referee :)
>>>>>>>>
>>>>>>>> Nick
>>>>>>>>
>>>>>>>> On 6/10/09 11:01 PM, "James Hester" <jamesrhester@gmail.com> wrote:
>>>>>>>>
>>>>>>>>>> Dear All:
>>>>>>>>>>
>>>>>>>>>> As a result of the discussion with Herbert I can see two differing
>>>>>>>>>> approaches to these CIF syntax changes:
>>>>>>>>>>
>>>>>>>>>> 1. Any changes to CIF syntax should be such that earlier syntax
>>>>>>>>>> versions form a subset of the new syntax, i.e. files in the older
>>>>>>>>>> syntax will also conform to the new syntax
>>>>>>>>>>
>>>>>>>>>> or
>>>>>>>>>>
>>>>>>>>>> 2. When making changes to the standard, the opportunity should be
>>>>>>>>>> taken to simplify and streamline syntax as much as possible.
>>>>>>>>>>
>>>>>>>>>> Advantages of (1): a single CIF parser can be maintained for all
>>>>>>>>>> syntax versions; a CIF writer is always conformant to the latest
>>>>>>>>>> version and only needs changing if new syntax features are to be
>>>>>>>>>> used; the existing CIF software ecosystem is minimally affected
>>>>>>>>>>
>>>>>>>>>> Advantages of (2): implementation of CIF readers/writers from scratch
>>>>>>>>>> is easier; the standard is easier to define formally and more
>>>>>>>>>> aesthetically pleasing; mistakes in previous versions can be fixed,
>>>>>>>>>> warts do not accumulate
>>>>>>>>>>
>>>>>>>>>> I would like to suggest we act as follows: in essence, we deprecate
>>>>>>>>>> rather than exclude. In detail:
>>>>>>>>>>
>>>>>>>>>> 1. For this edition of the standard (1.2) we follow Herbert's line,
>>>>>>>>>> leaving everything currently defined untouched. We simply add triple
>>>>>>>>>> quote delimited strings and bracket expressions. The content of
>>>>>>>>>> non-delimited strings in bracket expressions will be as proposed by
>>>>>>>>>> Nick.
>>>>>>>>>>
>>>>>>>>>> 2. In the documents associated with the new standard we strongly
>>>>>>>>>> suggest that all non-delimited strings use the same character set as
>>>>>>>>>> for non-delimited strings in bracket expressions (i.e. Nick's
>>>>>>>>>> original proposal). We might point out that this simplifies code for
>>>>>>>>>> writing CIFs, and perhaps (if all agree) we add that using the CIF1.1
>>>>>>>>>> non-delimited string character set is deprecated, darkly
>>>>>>>>>> foreshadowing that a future version of the syntax standard will adopt
>>>>>>>>>> this character set for all non-delimited strings.
>>>>>>>>>>
>>>>>>>>>> 3. We also deprecate including string delimiters inside strings,
>>>>>>>>>> regardless of whitespace issues.
>>>>>>>>>>
>>>>>>>>>> 4. In all dictionaries we adopt the restricted character set for
>>>>>>>>>> non-delimited strings and exclusion of string delimiters in strings.
>>>>>>>>>>
>>>>>>>>>> 5. We ask that CheckCIF emit a warning about use of deprecated
>>>>>>>>>> characters in non-delimited strings
>>>>>>>>>>
>>>>>>>>>> 6. When (say in 10 years' time) a sufficiently large proportion of
>>>>>>>>>> incoming CIFs conform to the new non-delimited string character set,
>>>>>>>>>> we promulgate the 1.3 version of the standard.
>>>>>>>>>>
>>>>>>>>
>>>>>>>> cheers
>>>>>>>>
>>>>>>>> Nick
>>>>>>>>
>>>>>>>> --------------------------------
>>>>>>>> Associate Professor N. Spadaccini, PhD
>>>>>>>> School of Computer Science & Software Engineering
>>>>>>>>
>>>>>>>> The University of Western Australia    t: +61 (0)8 6488 3452
>>>>>>>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>>>>>>>> CRAWLEY, Perth, WA 6009 AUSTRALIA      w3: www.csse.uwa.edu.au/~nick
>>>>>>>> MBDP M002
>>>>>>>>
>>>>>>>> CRICOS Provider Code: 00126G
>>>>>>>>
>>>>>>>> e: Nick.Spadaccini@uwa.edu.au
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> ddlm-group mailing list
>>>>>>>> ddlm-group@iucr.org
>>>>>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>
>>>> cheers
>>>>
>>>> Nick
>
> cheers
>
> Nick
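
The two options argued over in this thread, ASCII-fied Unicode escapes (3.1) versus raw UTF-8 (3.2), can be compared with a short Python sketch. It is only an illustration and not code from CBFlib or any CIF library: all function names are hypothetical, and the fixed six-hex-digit form of the \u escape is an assumption made here, because a variable-length 1-6 digit escape becomes ambiguous when the following character happens to be a hex letter (for example the "c" in "École"). The sketch also does not escape literal backslashes, which a real scheme would need to handle.

```python
import re

# Assumed escape form: backslash, "u", then exactly six hex digits.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{6})")


def to_ascii_escapes(text: str) -> str:
    """Replace each non-ASCII character with a \\uXXXXXX escape (ASCII-fied form)."""
    return "".join(
        ch if ord(ch) < 128 else "\\u{:06X}".format(ord(ch)) for ch in text
    )


def from_ascii_escapes(text: str) -> str:
    """Turn \\uXXXXXX escapes back into Unicode characters after reading."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def to_utf8_bytes(text: str) -> bytes:
    """Encode the text directly as UTF-8 bytes (raw UTF-8 form)."""
    return text.encode("utf-8")


if __name__ == "__main__":
    author = "M\u00fcller"               # "Müller"
    print(to_ascii_escapes(author))      # stays editable in any ASCII editor
    print(to_utf8_bytes(author))         # two bytes are needed for the u-umlaut
    print(from_ascii_escapes(to_ascii_escapes(author)) == author)
```

Either form round-trips the same text; the difference is where the decoding step lives. With escapes, every reader must decode after parsing; with UTF-8, the editor and the parser must both agree on the encoding of the file.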