Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- From: James Hester <jamesrhester@gmail.com>
- Date: Sat, 10 Oct 2009 15:13:36 +0300
- In-Reply-To: <279aad2a0910100249o2c09897anb767ab28b06cbdcf@mail.gmail.com>
- References: <C6F5BF24.1200E%nick@csse.uwa.edu.au><645410.77656.qm@web87015.mail.ird.yahoo.com><279aad2a0910100249o2c09897anb767ab28b06cbdcf@mail.gmail.com>
Following is my vote + opinion on the issues as presented by Herbert. I have summarized some of my previous arguments as well.

Nick and Herb: I appreciate your frustration that we don't seem able to move past basic syntax, but remember that bracket structures are approved in any case, and whatever we decide at the moment is simply fine-tuning by comparison. I think it is good that we are taking the time to think over the syntax issues, as syntax is the crucial common element in all CIF programs - dictionaries and DDLs are far less widely processed in software.

Issue 1: Removing the requirement for trailing whitespace after quoted strings outside of bracketed constructs.

Options:
1.1. Preserve the current convention as is
1.2. Terminate all quoted strings on the occurrence of the trailing quoted delimiter without consideration of the next character

I would like to vote 1.2; however, a quick run over the IUCr CIF archive as of June 2008 gives 640 cases of embedded delimiters (out of 32,300 files, i.e. 2%). Mostly these arise from accented characters in names and primes in chemical compound names. Therefore, for the same reasons as Issue 2, I vote 1.3 - deprecate first, then remove. And I would require that there is always whitespace between tokens, regardless of the status of embedded delimiters.

Issue 2: Restriction of the character set for non-delimited strings outside of bracketed constructs.

Options:
2.1. Preserve the current convention as is
2.2. Modify the current convention to deprecate use of any characters other than a strictly limited set of characters, adding a warning on reads and defaulting to add quote marks on write
2.3. Modify the current convention to forbid the use of any characters other than a strictly limited set of characters, making it an error to read a non-delimited string that does not comply even if the intention can be inferred from context

Again, I go for 2.2, with the intention to move to 2.3 when writing software has caught up.
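To make the two issues concrete, here is a minimal Python sketch of what options 1.1/1.2 and 2.2 would mean for a reader and a writer. Everything named in it is hypothetical: `read_quoted`, `check_bare`, `write_value` and `CifDeprecationWarning` are illustrative names, and `SAFE_CHARS` is only a placeholder, since the actual restricted character set has not been agreed in this thread. The sketch also ignores reserved values such as `.` and `?`.

```python
import string
import warnings

# Placeholder only: the real "strictly limited set" is still under discussion.
SAFE_CHARS = frozenset(string.ascii_letters + string.digits + "+-.")


class CifDeprecationWarning(UserWarning):
    """Hypothetical warning category for deprecated CIF syntax."""


def read_quoted(line, start, require_trailing_ws=True):
    """Read the quoted string whose opening quote is at line[start].

    require_trailing_ws=True models option 1.1 (a quote only closes the
    string if whitespace or end of line follows it); False models option 1.2
    (the first matching quote closes the string).
    Returns (value, index just past the closing quote).
    """
    quote = line[start]
    for i in range(start + 1, len(line)):
        if line[i] != quote:
            continue
        at_end = i + 1 == len(line)
        if not require_trailing_ws or at_end or line[i + 1] in " \t":
            return line[start + 1:i], i + 1
    raise ValueError("unterminated quoted string")


def check_bare(token):
    """Option 2.2 on read: accept the non-delimited token, but warn."""
    if not token or not set(token) <= SAFE_CHARS:
        warnings.warn(
            f"non-delimited string {token!r} uses deprecated characters",
            CifDeprecationWarning,
        )
    return token


def write_value(value):
    """Option 2.2 on write: quote anything outside the restricted set."""
    if value and set(value) <= SAFE_CHARS:
        return value
    if "'" in value and '"' in value:
        raise ValueError("both quote characters present; not handled here")
    quote = '"' if "'" in value else "'"
    return quote + value + quote


# An embedded delimiter of the kind found in the archive survey:
line = "'a dog's life'"
print(read_quoted(line, 0))                             # ("a dog's life", 14)
print(read_quoted(line, 0, require_trailing_ws=False))  # ('a dog', 7)
print(write_value("O'Neill"))                           # "O'Neill"
```

Under option 1.2 the second call stops at the apostrophe, which is why files with embedded delimiters would need to be rewritten before the old behaviour is removed.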
I make a distinction between breaking CIF readers and breaking CIF writers. The most important CIF readers are at a few locations: the IUCr journals, the PDB, the archives. As their software is written and deployed in house, and in a programming sense the changes we are discussing are close to trivial, they can tolerate being forced to upgrade, especially if there is some potential benefit. There are some other readers of importance (CMPR and Cifedit come to mind).

CIF writers, however, are mostly distributed via standard software packages far and wide in the various crystallographic labs. Even if all the software authors were prepared to update their CIF output routines immediately, the time it would take for those changes to propagate to most of the end-users is of the order of years. And let's not forget that the CIF writers are located where the data are produced and/or analysed, and in that sense are key to the adoption of CIF.

We can therefore split our changes into two categories: those that break writers, and those that break readers. Breaking readers is non-controversial, as it is easy to fix them (see above). Breaking writers is much more of an issue. Therefore I propose adopting the changes that break readers immediately (UTF-8, brackets), and adopting a slower process of deprecation followed by strict requirement for the changes that break writers. Note that the period between deprecation and strict requirement could be quite short if we actively encourage and work with the authors of CIF-writing software (i.e. SHELX, PLATON, etc.) to update and distribute their revised programs.

Perform a thought experiment of option 2.3: we rewrite the standard to make all current CIF-writing software non-conforming. The IUCr journals, CCDC etc. continue to receive CIF1.1 files. They can't just reject them out of hand, as the labs that produced them have no way to fix the files (short of hand editing), so the IUCr etc. accept them.
Where is the stimulus to upgrade? Everything works as before. So by deprecating and then enforcing, we at least provide a way for the labs to catch up, and give the journals a reasonable basis to reject non-conforming CIFs once they know that the relevant software packages have been updated. Of course, if a CIF data file has #CIF1.1 in the header, I think it should always be accepted as such.

Issue 3: Use of UTF-8

Options:
3.1. Do not use UTF-8
3.2. Use UTF-8

Definitely 3.2. As we are forcing CIF readers to update (by introducing bracket expressions), now is the time to introduce UTF-8 as well. As I've written previously, when a UTF-8 CIF is read by software that does not expect it, non-ASCII characters will at worst simply be represented as the wrong characters, and will not be interpreted as control characters, so the problem of reading a UTF-8 CIF will typically be confined to a few isolated characters.

As far as writing a UTF-8 CIF goes, as UTF-8 and Unicode become the standard for encoding non-ASCII characters (which I think they will), editors that work with Unicode will become more common, and certainly those people who regularly work with non-ASCII character sets will know how to edit a UTF-8 text file (even if, e.g., I personally don't know how right now). Note also that the current IUCr backslash encodings for accented characters and Greek text can coexist with UTF-8 versions of these characters.
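The UTF-8 point above rests on a property of the encoding: every byte of a multi-byte UTF-8 sequence has its high bit set, so none of those bytes can coincide with an ASCII control character or an ASCII CIF delimiter. The short Python sketch below illustrates this. The function names are hypothetical, and Latin-1 is only an assumption about what a legacy reader might use; what such a reader actually displays for the high-bit bytes depends on the encoding it assumes.

```python
def non_ascii_bytes(text):
    """Return the UTF-8 bytes contributed by the non-ASCII characters in text."""
    return [b for ch in text if ord(ch) > 0x7F for b in ch.encode("utf-8")]


def misread_as_latin1(text):
    """Simulate a legacy reader that treats UTF-8 bytes as Latin-1 (assumed)."""
    return text.encode("utf-8").decode("latin-1")


name = "Müller"

# Every byte of the encoded "ü" is >= 0x80, i.e. outside the ASCII range.
assert all(b >= 0x80 for b in non_ascii_bytes(name))

# The damage is confined to the accented character; ASCII text is untouched.
print(misread_as_latin1(name))                  # MÃ¼ller
print(misread_as_latin1("_cell_length_a 5.4"))  # _cell_length_a 5.4
```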