
Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Dear Colleagues,

   I find what James has said quite reasonable.  I have my doubts on 
whether 1.3 will ever manage to progress from deprecation to denial,
but it cannot hurt to give it a try, so I change my vote to

   1.3, 2.2 and 3.2

   Many thanks to James for a very clear presentation.

   Regards,
     Herbert

P.S.  I am not frustrated.  I think this to be a very useful discussion.
It has certainly helped me to clarify the issues in my own head.

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Sat, 10 Oct 2009, James Hester wrote:

> Following is my vote + opinion on the issues as presented by Herbert.
> I have summarized some of my previous arguments as well.  Nick and
> Herb: I appreciate your frustration that we don't seem able to move
> past basic syntax, but remember that bracket structures are approved
> in any case and whatever we decide at the moment is simply fine-tuning
> by comparison.  I think it is good that we are taking the time to
> think over the syntax issues as syntax is the crucial common element
> in all CIF programs - dictionaries and DDLs are far less widely
> processed in software.
>
> Issue1:  Removing the requirement for a trailing whitespace after
> quoted strings outside of bracketed constructs.
>  Options:  1.1. Preserve the current convention as is
>            1.2. Terminate all quoted strings on the occurrence of the
> trailing quoted delimiter without consideration of the next character
>
> I would like to vote 1.2, however, a quick run over the IUCr CIF
> archive as of June 2008 gives 640 cases of embedded delimiters (out of
> 32,300 files i.e. 2%).  Mostly these arise from accented characters in
> names and primes in chemical compound names.  Therefore, for the same
> reasons as Issue 2, I vote 1.3 - deprecate first, then remove.  And I
> would require that there was always whitespace between tokens,
> regardless of the status of embedded delimiters.
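To make the difference between options 1.1 and 1.2 concrete, here is a
minimal sketch of the two termination rules. The function names and the
sample line are illustrative assumptions, not taken from any existing CIF
parser, and the sketch handles only single-line quoted strings.

```python
def read_quoted(line, start, require_trailing_space=True):
    """Return (value, index just past the closing quote).

    require_trailing_space=True  : current convention (option 1.1), the
        closing quote only counts if followed by whitespace or end of line.
    require_trailing_space=False : option 1.2, the first matching quote
        always terminates the string.
    """
    quote = line[start]
    if quote not in "'\"":
        raise ValueError("not positioned at a quote character")
    i = start + 1
    while i < len(line):
        if line[i] == quote:
            at_end = i + 1 == len(line)
            if not require_trailing_space or at_end or line[i + 1] in " \t":
                return line[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated quoted string")


def has_embedded_delimiter(line, start):
    """True if the two conventions disagree on where the string ends."""
    return read_quoted(line, start, True) != read_quoted(line, start, False)
```

For the line 'O'Neill' next, the current convention yields the value
O'Neill, while option 1.2 stops after the O. A check along the lines of
has_embedded_delimiter is one way an archive scan like the one described
above could flag affected files.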
>
>  Issue2:  Restriction of the character set for non-delimited strings
> outside of bracketed constructs
>  Options  2.1.  Preserve the current convention as is
>           2.2.  Modify the current convention to deprecate use of
>                 any characters other than a strictly limited set
>                 of characters, adding a warning on reads and
>                 defaulting to add quote marks on write
>           2.3.  Modify the current convention to forbid the use of
>                 any characters other than a strictly limited set
>                 of characters, making it an error to read a non-delimited
>                 string that does not comply even if the intention
>                 can be inferred from context
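Option 2.2 could be sketched roughly as follows: warn on read, quote on
write. The permitted alphabet below (SAFE_BARE) is a placeholder
assumption, since this message does not define the limited set, and the
function names are hypothetical. The sketch is meant for character-string
values only; multi-line values and values containing both quote
characters would need a semicolon-delimited text field, which is not
covered here.

```python
import re
import warnings

# ASSUMPTION: placeholder alphabet, not the set under discussion in the thread.
SAFE_BARE = re.compile(r"[A-Za-z0-9.+-][A-Za-z0-9._+-]*\Z")


def check_bare_on_read(token):
    """Reader side of option 2.2: accept the token, but warn if it uses
    characters outside the limited set."""
    if not SAFE_BARE.match(token):
        warnings.warn(
            f"deprecated characters in non-delimited string: {token!r}",
            DeprecationWarning,
        )
    return token


def format_on_write(value):
    """Writer side of option 2.2: emit the value bare only if it is within
    the limited set, otherwise default to adding quote marks."""
    if SAFE_BARE.match(value):
        return value
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    raise ValueError("value needs a text field, which this sketch omits")
```

Under option 2.3 the reader would raise an error where this sketch
issues a warning; the writer side would stay the same.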
>
> Again, I go for 2.2, with the intention to move to 2.3 when writing
> software has caught up.  I make a distinction between breaking CIF
> readers, and breaking CIF writers.  The most important CIF readers are
> at a few locations: the IUCr journals, the PDB, the archives. As their
> software is written and deployed in house and in a programming sense
> the changes we are discussing are close to trivial, they can tolerate
> being forced to upgrade, especially if there is some potential
> benefit.  There are some other readers of importance (CMPR, Cifedit
> come to mind).
>
> CIF writers, however, are mostly distributed via standard software
> packages far and wide in the various crystallographic labs.  Even if
> all the software authors were prepared to immediately update their CIF
> output routines, the time it would take for those changes to propagate
> to most of the end-users is of the order of years.  And let's not
> forget that the CIF writers are located where the data are produced
> and/or analysed and in that sense are key to the adoption of CIF.
>
> We can therefore split our changes into two categories: those that
> break writers, and those that break readers.  Breaking readers is
> non-controversial, as it is easy to fix them (see above).  Breaking
> writers is much more of an issue.  Therefore I propose splitting our
> changes into those that break readers, adopting these immediately
> (UTF8, brackets), and into those that break writers, adopting a slower
> process of deprecation followed by adoption for these writer-related
> changes.
>
> Note that the period between deprecation and strict requirement could
> be quite short, if we actively encourage and work with CIF writing
> software authors (i.e. SHELX, PLATON, etc.) to update and distribute
> their revised programs.
>
> Perform a thought experiment of option 2.3: we rewrite the standard
> to make all current CIF writing software non-conforming.  The IUCr
> journals, CCDC etc. continue to receive CIF1.1 files. They can't just
> reject them out of hand, as the labs that produced them have no way to
> fix the files (short of hand editing), so the IUCr etc. accept them.
> Where is the stimulus to upgrade?  Everything works as before.  So by
> deprecating and then enforcing, we at least provide a way for the labs
> to catch up and give the journals a reasonable basis to reject
> non-conforming CIFs once they know that the relevant software packages
> have been updated.  Of course, if a CIF data file has #CIF1.1 in the
> header, I think it should always be accepted as such.
>
>   Issue 3:  Use of UTF-8
>   Options:  3.1.  Do not use UTF-8
>             3.2.  Use UTF-8
>
> Definitely 3.2.  As we are forcing CIF readers to update (by
> introducing bracket expressions) now is the time to introduce UTF8 as
> well. As I've written previously, on reading a UTF8 CIF, non-ASCII
> characters will at worst simply be represented as the wrong
> characters, and will not be interpreted as control characters, so the
> problem of reading a UTF8 CIF will typically be contained to a few
> isolated characters.  As far as writing a UTF8 CIF goes, as UTF8 and
> Unicode become the standard for encoding non-ASCII characters (which I
> think they will), editors to work with Unicode will become more
> common, and certainly those people who regularly work with non-ASCII
> character sets will know how to edit a UTF8 text file (even if e.g. I
> personally don't know how right now).  Note also that the current IUCr
> backslash encodings for accented characters and Greek text can coexist
> with UTF8 versions of these characters.
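The containment argument above rests on a property of UTF-8: every byte
of a multi-byte sequence has its high bit set, so none of those bytes can
be mistaken for an ASCII control character, quote or whitespace. A small
sketch of that check, with hypothetical function names and a Latin-1
reader chosen only as an example of a non-UTF-8-aware program:

```python
def non_ascii_bytes_stay_above_ascii(text):
    """True if every byte encoding a non-ASCII character is >= 0x80,
    i.e. cannot collide with any ASCII character."""
    for ch in text:
        if ord(ch) > 0x7F:
            if any(b < 0x80 for b in ch.encode("utf-8")):
                return False
    return True


def misread_as_latin1(text):
    """Simulate an old reader that treats UTF-8 bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")
```

For example, misread_as_latin1("caf\u00e9 'x'") gives "caf\u00c3\u00a9 'x'":
the accented letter turns into two wrong characters, but the quotes and
the space around it are untouched, so tokenization is unaffected.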
>
>
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
