
Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
From: James Hester <jamesrhester@gmail.com>
Date: Sat, 10 Oct 2009 15:13:36 +0300
In-Reply-To: <279aad2a0910100249o2c09897anb767ab28b06cbdcf@mail.gmail.com>
References: <C6F5BF24.1200E%nick@csse.uwa.edu.au><645410.77656.qm@web87015.mail.ird.yahoo.com><279aad2a0910100249o2c09897anb767ab28b06cbdcf@mail.gmail.com>

Following is my vote + opinion on the issues as presented by Herbert.
I have summarized some of my previous arguments as well.  Nick and
Herb: I appreciate your frustration that we don't seem able to move
past basic syntax, but remember that bracket structures are approved
in any case and whatever we decide at the moment is simply fine-tuning
by comparison.  I think it is good that we are taking the time to
think over the syntax issues as syntax is the crucial common element
in all CIF programs - dictionaries and DDLs are far less widely
processed in software.

Issue1:  Removing the requirement for trailing whitespace after
quoted strings outside of bracketed constructs.
  Options:  1.1. Preserve the current convention as is
            1.2. Terminate all quoted strings on the occurrence of the
closing quote delimiter without consideration of the next character

I would like to vote 1.2.  However, a quick run over the IUCr CIF
archive as of June 2008 gives 640 cases of embedded delimiters (out of
32,300 files, i.e. about 2%).  Mostly these arise from accented
characters in names and primes in chemical compound names.  Therefore,
for the same reasons as I give under Issue 2 below, I vote for a third
option, 1.3: deprecate first, then remove.  I would also require that
there is always whitespace between tokens, regardless of the status of
embedded delimiters.
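
To make the difference between 1.1 and 1.2 concrete, here is a minimal
Python sketch of the two termination rules.  The function name and the
strict_whitespace flag are my own illustration, not part of any
existing CIF library, and the sketch ignores text fields and line
folding.

```python
def read_quoted(line, start, strict_whitespace=True):
    """Return (value, index just past the closing quote).

    strict_whitespace=True  mimics option 1.1: a quote character only
    closes the string when followed by whitespace or end of line.
    strict_whitespace=False mimics option 1.2: the first matching quote
    character closes the string.
    """
    quote = line[start]
    i = start + 1
    while i < len(line):
        if line[i] == quote:
            at_end = (i + 1 == len(line))
            if not strict_whitespace or at_end or line[i + 1] in " \t":
                return line[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated quoted string")


# An embedded delimiter of the kind found in the archive survey.
line = "'O'Brien' next"
print(read_quoted(line, 0, strict_whitespace=True))
print(read_quoted(line, 0, strict_whitespace=False))
```

Under the 1.1 rule the whole name is recovered; under the 1.2 rule the
string ends at the embedded quote and the rest of the name runs
straight into the next token, which is why existing files with
embedded delimiters would break.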

  Issue2:  Restriction of the character set for non-delimited strings
outside of bracketed constructs
  Options  2.1.  Preserve the current convention as is
           2.2.  Modify the current convention to deprecate use of
                 any characters other than a strictly limited set
                 of characters, adding a warning on reads and
                 defaulting to add quote marks on write
           2.3.  Modify the current convention to forbid the use of
                 any characters other than a strictly limited set
                 of characters, making it an error to read a non-delimited
                 string that does not comply even if the intention
                 can be inferred from context

Again, I go for 2.2, with the intention to move to 2.3 when writing
software has caught up.  I make a distinction between breaking CIF
readers, and breaking CIF writers.  The most important CIF readers are
at a few locations: the IUCr journals, the PDB, the archives.  Their
software is written and deployed in house, and in a programming sense
the changes we are discussing are close to trivial, so they can
tolerate being forced to upgrade, especially if there is some
potential benefit.  There are some other readers of importance (CMPR
and Cifedit come to mind).
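
As a rough sketch of what option 2.2 asks of software, the following
Python fragment warns on read and adds quote marks on write.  The
restricted character set in SAFE_UNQUOTED is purely hypothetical (the
actual set is what this thread is trying to decide), and the function
names are mine.

```python
import re
import warnings

# Hypothetical restricted alphabet, for illustration only.
SAFE_UNQUOTED = re.compile(r"[A-Za-z0-9+.-][A-Za-z0-9_.+-]*\Z")


def check_on_read(token):
    """Read side of 2.2: accept the token, but warn if it is deprecated."""
    if SAFE_UNQUOTED.match(token) is None:
        warnings.warn(
            "deprecated characters in non-delimited string: %r" % token,
            DeprecationWarning,
            stacklevel=2,
        )
    return token


def format_on_write(value):
    """Write side of 2.2: quote anything outside the restricted set."""
    if SAFE_UNQUOTED.match(value):
        return value
    for quote in ("'", '"'):
        if quote not in value:
            return quote + value + quote
    # A real writer would fall back to a semicolon-delimited text field.
    raise ValueError("value contains both quote characters")


print(format_on_write("1.234"))
print(format_on_write("a[b]"))
```

Moving from 2.2 to 2.3 would then amount to turning the warning in
check_on_read into an error, which is the reader-side change; the
writer-side function stays the same.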

CIF writers, however, are mostly distributed via standard software
packages far and wide in the various crystallographic labs.  Even if
all the software authors were prepared to immediately update their CIF
output routines, the time it would take for those changes to propagate
to most of the end-users is of the order of years.  And let's not
forget that the CIF writers are located where the data are produced
and/or analysed and in that sense are key to the adoption of CIF.

We can therefore split our changes into two categories: those that
break writers, and those that break readers.  Breaking readers is
non-controversial, as it is easy to fix them (see above).  Breaking
writers is much more of an issue.  Therefore I propose splitting our
changes into those that break readers, adopting these immediately
(UTF8, brackets), and into those that break writers, adopting a slower
process of deprecation followed by adoption for these writer-related
changes.

Note that the period between deprecation and strict requirement could
be quite short, if we actively encourage and work with CIF writing
software authors (e.g. SHELX, PLATON) to update and distribute
their revised programs.

As a thought experiment, take option 2.3: we rewrite the standard
to make all current CIF writing software non-conforming.  The IUCr
journals, CCDC etc. continue to receive CIF1.1 files. They can't just
reject them out of hand, as the labs that produced them have no way to
fix the files (short of hand editing), so the IUCr etc. accept them.
Where is the stimulus to upgrade?  Everything works as before.  So by
deprecating and then enforcing, we at least provide a way for the labs
to catch up and give the journals a reasonable basis to reject
non-conforming CIFs once they know that the relevant software packages
have been updated.  Of course, if a CIF data file has #CIF1.1 in the
header, I think it should always be accepted as such.

   Issue 3:  Use of UTF-8
   Options:  3.1.  Do not use UTF-8
             3.2.  Use UTF-8

Definitely 3.2.  As we are forcing CIF readers to update (by
introducing bracket expressions), now is the time to introduce UTF8 as
well. As I've written previously, on reading a UTF8 CIF, non-ASCII
characters will at worst simply be represented as the wrong
characters, and will not be interpreted as control characters, so the
problem of reading a UTF8 CIF will typically be contained to a few
isolated characters.  As far as writing a UTF8 CIF goes, as UTF8 and
Unicode become the standard for encoding non-ASCII characters (which I
think they will), editors that work with Unicode will become more
common, and certainly those people who regularly work with non-ASCII
character sets will know how to edit a UTF8 text file (even if I
personally don't know how right now).  Note also that the current IUCr
backslash encodings for accented characters and Greek text can coexist
with UTF8 versions of these characters.
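
The containment argument rests on a property of UTF-8 itself: every
byte of a multi-byte sequence is 0x80 or above, so none can be
mistaken for an ASCII control character, whitespace or quote
delimiter.  A small Python sketch, in which ascii_view is a
hypothetical stand-in for an ASCII-only reader and the data line is
made up:

```python
def multibyte_bytes_are_high(text):
    """True if every byte encoding a non-ASCII character is >= 0x80."""
    return all(
        b >= 0x80
        for ch in text
        if ord(ch) > 0x7F
        for b in ch.encode("utf-8")
    )


def ascii_view(raw):
    """Mimic an ASCII-only reader: keep ASCII bytes, show others as '?'."""
    return "".join(chr(b) if b < 0x80 else "?" for b in raw)


line = "_publ_author_name 'Müller, J.'"
raw = line.encode("utf-8")

print(multibyte_bytes_are_high(line))
print(ascii_view(raw))
# The token structure survives: same number of quotes and spaces.
print(ascii_view(raw).count("'") == line.count("'"))
```

Only the accented character is damaged (it shows up as two wrong
characters); the data name, the delimiters and the rest of the value
are read exactly as before.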


-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148