Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Sat, 10 Oct 2009 10:08:37 -0400 (EDT)
- In-Reply-To: <279aad2a0910100513u1e9ef18dua5f984cc20ac9a9b@mail.gmail.com>
- References: <C6F5BF24.1200E%nick@csse.uwa.edu.au><645410.77656.qm@web87015.mail.ird.yahoo.com><279aad2a0910100249o2c09897anb767ab28b06cbdcf@mail.gmail.com><279aad2a0910100513u1e9ef18dua5f984cc20ac9a9b@mail.gmail.com>
Dear Colleagues,

I find what James has said quite reasonable. I have my doubts about
whether 1.3 will ever manage to progress from deprecation to denial, but
it cannot hurt to give it a try, so I change my vote to

  1.3, 2.2 and 3.2

Many thanks to James for a very clear presentation.

Regards,
Herbert

P.S. I am not frustrated. I think this to be a very useful discussion.
It has certainly helped me to clarify the issues in my own head.

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Sat, 10 Oct 2009, James Hester wrote:

> Following is my vote + opinion on the issues as presented by Herbert.
> I have summarized some of my previous arguments as well. Nick and
> Herb: I appreciate your frustration that we don't seem able to move
> past basic syntax, but remember that bracket structures are approved
> in any case and whatever we decide at the moment is simply fine-tuning
> by comparison. I think it is good that we are taking the time to
> think over the syntax issues, as syntax is the crucial common element
> in all CIF programs - dictionaries and DDLs are far less widely
> processed in software.
>
> Issue 1: Removing the requirement for a trailing whitespace after
> quoted strings outside of bracketed constructs.
> Options: 1.1. Preserve the current convention as is
>          1.2. Terminate all quoted strings on the occurrence of the
>               trailing quoted delimiter without consideration of the
>               next character
>
> I would like to vote 1.2, however, a quick run over the IUCr CIF
> archive as of June 2008 gives 640 cases of embedded delimiters (out of
> 32,300 files, i.e. 2%). Mostly these arise from accented characters in
> names and primes in chemical compound names. Therefore, for the same
> reasons as Issue 2, I vote 1.3 - deprecate first, then remove. And I
> would require that there was always whitespace between tokens,
> regardless of the status of embedded delimiters.
>
> Issue 2: Restriction of the character set for non-delimited strings
> outside of bracketed constructs
> Options: 2.1. Preserve the current convention as is
>          2.2. Modify the current convention to deprecate use of
>               any characters other than a strictly limited set
>               of characters, adding a warning on reads and
>               defaulting to add quote marks on write
>          2.3. Modify the current convention to forbid the use of
>               any characters other than a strictly limited set
>               of characters, making it an error to read a non-delimited
>               string that does not comply even if the intention
>               can be inferred from context
>
> Again, I go for 2.2, with the intention to move to 2.3 when writing
> software has caught up. I make a distinction between breaking CIF
> readers, and breaking CIF writers. The most important CIF readers are
> at a few locations: the IUCr journals, the PDB, the archives. As their
> software is written and deployed in house and in a programming sense
> the changes we are discussing are close to trivial, they can tolerate
> being forced to upgrade, especially if there is some potential
> benefit. There are some other readers of importance (CMPR, Cifedit
> come to mind).
>
> CIF writers, however, are mostly distributed via standard software
> packages far and wide in the various crystallographic labs. Even if
> all the software authors were prepared to immediately update their CIF
> output routines, the time it would take for those changes to propagate
> to most of the end-users is of the order of years. And let's not
> forget that the CIF writers are located where the data are produced
> and/or analysed and in that sense are key to the adoption of CIF.
>
> We can therefore split our changes into two categories: those that
> break writers, and those that break readers. Breaking readers is
> non-controversial, as it is easy to fix them (see above). Breaking
> writers is much more of an issue. Therefore I propose splitting our
> changes into those that break readers, adopting these immediately
> (UTF8, brackets), and into those that break writers, adopting a slower
> process of deprecation followed by adoption for these writer-related
> changes.
>
> Note that the period between deprecation and strict requirement could
> be quite short, if we actively encourage and work with CIF writing
> software authors (i.e. SHELX, PLATON, etc.) to update and distribute
> their revised programs.
>
> Perform a thought experiment of option 2.3: we rewrite the standard
> to make all current CIF writing software non-conforming. The IUCr
> journals, CCDC etc. continue to receive CIF1.1 files. They can't just
> reject them out of hand, as the labs that produced them have no way to
> fix the files (short of hand editing), so the IUCr etc. accept them.
> Where is the stimulus to upgrade? Everything works as before. So by
> deprecating and then enforcing, we at least provide a way for the labs
> to catch up and give the journals a reasonable basis to reject
> non-conforming CIFs once they know that the relevant software packages
> have been updated. Of course, if a CIF data file has #CIF1.1 in the
> header, I think it should always be accepted as such.
>
> Issue 3: Use of UTF-8
> Options: 3.1. Do not use UTF-8
>          3.2. Use UTF-8
>
> Definitely 3.2. As we are forcing CIF readers to update (by
> introducing bracket expressions) now is the time to introduce UTF8 as
> well. As I've written previously, on reading a UTF8 CIF, non-ASCII
> characters will at worst simply be represented as the wrong
> characters, and will not be interpreted as control characters, so the
> problem of reading a UTF8 CIF will typically be contained to a few
> isolated characters. As far as writing a UTF8 CIF goes, as UTF8 and
> Unicode become the standard for encoding non-ASCII characters (which I
> think they will), editors to work with Unicode will become more
> common, and certainly those people who regularly work with non-ASCII
> character sets will know how to edit a UTF8 text file (even if e.g. I
> personally don't know how right now). Note also that the current IUCr
> backslash encodings for accented characters and Greek text can coexist
> with UTF8 versions of these characters.
>
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
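
The reader and writer behaviour discussed under Issues 1 and 2 can be
sketched in a few lines of Python. This is only an illustration under
stated assumptions, not code from any existing CIF library: the names
scan_quoted, read_unquoted, write_value and CifSyntaxError are
hypothetical, and SAFE_UNQUOTED is a placeholder, since the thread does
not say which characters the "strictly limited set" would contain.

```python
import re
import warnings

# ASSUMPTION: stand-in for the "strictly limited set" of Issue 2.
# The thread does not define the actual alphabet.
SAFE_UNQUOTED = re.compile(r"[A-Za-z0-9+.?-][A-Za-z0-9_+.()-]*")


class CifSyntaxError(ValueError):
    """Hypothetical error type for non-conforming input."""


def scan_quoted(text, start, legacy=True):
    """Return (value, next_index) for the quoted string opening at text[start].

    legacy=True  -> option 1.1: a delimiter closes the string only when it
                    is followed by whitespace or the end of the input.
    legacy=False -> option 1.2: the first matching delimiter closes it.
    """
    quote = text[start]
    i = start + 1
    while i < len(text):
        if text[i] == quote:
            at_end = i + 1 == len(text)
            if not legacy or at_end or text[i + 1].isspace():
                return text[start + 1:i], i + 1
        i += 1
    raise CifSyntaxError("unterminated quoted string")


def read_unquoted(token, strict=False):
    """Option 2.2 (strict=False) warns on read; option 2.3 (strict=True) rejects."""
    if SAFE_UNQUOTED.fullmatch(token):
        return token
    if strict:
        raise CifSyntaxError(f"unquoted value not allowed: {token!r}")
    warnings.warn(f"deprecated unquoted value: {token!r}", DeprecationWarning)
    return token


def write_value(value):
    """Option 2.2 on write: add quote marks unless the value is in the safe set."""
    if SAFE_UNQUOTED.fullmatch(value):
        return value
    for quote in ("'", '"'):
        if quote not in value and "\n" not in value:
            return f"{quote}{value}{quote}"
    # Values holding both quote characters or a newline are out of scope here.
    raise CifSyntaxError("value needs a multi-line text field")
```

For the input 'O'Neil' followed by a space, scan_quoted yields O'Neil
under the current convention and only O under option 1.2, which is the
kind of embedded delimiter counted in the archive survey above. A writer
that calls write_value emits "2'-deoxy" in double quotes, so its output
reads the same under both conventions.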