Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Dear Colleagues,

   I find what James has said quite reasonable.  I have my doubts on 
whether 1.3 will ever manage to progress from deprecation to denial,
but it cannot hurt to give it a try, so I change my vote to

   1.3, 2.2 and 3.2

   Many thanks to James for a very clear presentation.

   Regards,
     Herbert

P.S.  I am not frustrated.  I think this to be a very useful discussion.
It has certainly helped me to clarify the issues in my own head.

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Sat, 10 Oct 2009, James Hester wrote:

> Following is my vote + opinion on the issues as presented by Herbert.
> I have summarized some of my previous arguments as well.  Nick and
> Herb: I appreciate your frustration that we don't seem able to move
> past basic syntax, but remember that bracket structures are approved
> in any case and whatever we decide at the moment is simply fine-tuning
> by comparison.  I think it is good that we are taking the time to
> think over the syntax issues as syntax is the crucial common element
> in all CIF programs - dictionaries and DDLs are far less widely
> processed in software.
>
> Issue1:  Removing the requirement for a trailing whitespace after
> quoted strings outside of bracketed constructs.
>  Options:  1.1. Preserve the current convention as is
>            1.2. Terminate all quoted strings on the occurrence of the
> trailing quoted delimiter without consideration of the next character
>
> I would like to vote 1.2, however, a quick run over the IUCr CIF
> archive as of June 2008 gives 640 cases of embedded delimiters (out of
> 32,300 files i.e. 2%).  Mostly these arise from accented characters in
> names and primes in chemical compound names.  Therefore, for the same
> reasons as Issue 2, I vote 1.3 - deprecate first, then remove.  And I
> would require that there was always whitespace between tokens,
> regardless of the status of embedded delimiters.
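
A minimal sketch of the difference between options 1.1 and 1.2 for a single-line quoted string. The function names are hypothetical and the handling is simplified (no multi-line text fields); it only illustrates why values with embedded delimiters, such as primes in compound names, read differently under the two rules.

```python
def read_quoted_cif11(line, start):
    """Option 1.1: a closing quote only terminates the string when it is
    followed by whitespace or the end of the line."""
    quote = line[start]
    i = start + 1
    while i < len(line):
        if line[i] == quote and (i + 1 == len(line) or line[i + 1] in " \t"):
            return line[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated quoted string")


def read_quoted_strict(line, start):
    """Option 1.2: the first matching quote terminates the string,
    whatever character follows it."""
    quote = line[start]
    end = line.find(quote, start + 1)
    if end == -1:
        raise ValueError("unterminated quoted string")
    return line[start + 1:end], end + 1


def has_embedded_delimiter(line, start):
    """True if the value (read under option 1.1) contains its own delimiter,
    i.e. the kind of case counted in the archive survey."""
    value, _ = read_quoted_cif11(line, start)
    return line[start] in value
```

Under the first rule the input `'2'-deoxy'` yields the single value `2'-deoxy`; under the second it yields `2` and leaves `-deoxy'` to be parsed as something else, which is why such files need a deprecation period.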
>
>  Issue2:  Restriction of the character set for non-delimited strings
> outside of bracketed constructs
>  Options  2.1.  Preserve the current convention as is
>           2.2.  Modify the current convention to deprecate use of
>                 any characters other than a strictly limited set
>                 of characters, adding a warning on reads and
>                 defaulting to add quote marks on write
>           2.3.  Modify the current convention to forbid the use of
>                 any characters other than a strictly limited set
>                 of characters, making it an error to read a non-delimited
>                 string that does not comply even if the intention
>                 can be inferred from context
>
> Again, I go for 2.2, with the intention to move to 2.3 when writing
> software has caught up.  I make a distinction between breaking CIF
> readers, and breaking CIF writers.  The most important CIF readers are
> at a few locations: the IUCr journals, the PDB, the archives. As their
> software is written and deployed in house and in a programming sense
> the changes we are discussing are close to trivial, they can tolerate
> being forced to upgrade, especially if there is some potential
> benefit.  There are some other readers of importance (CMPR, Cifedit
> come to mind).
>
> CIF writers, however, are mostly distributed via standard software
> packages far and wide in the various crystallographic labs.  Even if
> all the software authors were prepared to immediately update their CIF
> output routines, the time it would take for those changes to propagate
> to most of the end-users is of the order of years.  And let's not
> forget that the CIF writers are located where the data are produced
> and/or analysed and in that sense are key to the adoption of CIF.
>
> We can therefore split our changes into two categories: those that
> break writers, and those that break readers.  Breaking readers is
> non-controversial, as it is easy to fix them (see above).  Breaking
> writers is much more of an issue.  Therefore I propose splitting our
> changes into those that break readers, adopting these immediately
> (UTF8, brackets), and into those that break writers, adopting a slower
> process of deprecation followed by adoption for these writer-related
> changes.
>
> Note that the period between deprecation and strict requirement could
> be quite short, if we actively encourage and work with CIF writing
> software authors (i.e. SHELX, PLATON, etc.) to update and distribute
> their revised programs.
>
> Perform a thought experiment of option 2.3: we rewrite the standard
> to make all current CIF writing software non-conforming.  The IUCr
> journals, CCDC etc. continue to receive CIF1.1 files. They can't just
> reject them out of hand, as the labs that produced them have no way to
> fix the files (short of hand editing), so the IUCr etc. accept them.
> Where is the stimulus to upgrade?  Everything works as before.  So by
> deprecating and then enforcing, we at least provide a way for the labs
> to catch up and give the journals a reasonable basis to reject
> non-conforming CIFs once they know that the relevant software packages
> have been updated.  Of course, if a CIF data file has #CIF1.1 in the
> header, I think it should always be accepted as such.
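
A rough sketch of how "deprecate, then enforce" might look in software: warn on read under option 2.2, raise an error under option 2.3, and add quote marks on write in both cases. The character set `SAFE_CHARS` is a placeholder assumption, since the actual limited set is not fixed in this message, and all names here are hypothetical. Special values and multi-line text fields are ignored.

```python
import string
import warnings

# ASSUMPTION: placeholder for the "strictly limited set" of characters;
# the real set is still under discussion.
SAFE_CHARS = frozenset(string.ascii_letters + string.digits + "+-._")


def is_safe_bare(value):
    """True if value could be written without delimiters under the restricted set."""
    return bool(value) and all(c in SAFE_CHARS for c in value)


def read_bare(token, strict=False):
    """Accept a non-delimited token.

    strict=False models option 2.2 (warn but accept);
    strict=True models option 2.3 (reject).
    """
    if not is_safe_bare(token):
        msg = "non-delimited string %r uses characters outside the limited set" % token
        if strict:
            raise ValueError(msg)
        warnings.warn(msg, DeprecationWarning)
    return token


def write_value(value):
    """Write bare when safe, otherwise default to adding quote marks."""
    if is_safe_bare(value):
        return value
    if "'" not in value:
        return "'" + value + "'"
    if '"' not in value:
        return '"' + value + '"'
    raise ValueError("value contains both quote characters; needs a text field")
```

A writer updated this way produces files that pass both the lenient and the strict reader, so it can be distributed before enforcement begins.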
>
>   Issue 3:  Use of UTF-8
>   Options:  3.1.  Do not use UTF-8
>             3.2.  Use UTF-8
>
> Definitely 3.2.  As we are forcing CIF readers to update (by
> introducing bracket expressions) now is the time to introduce UTF8 as
> well. As I've written previously, on reading a UTF8 CIF, non-ASCII
> characters will at worst simply be represented as the wrong
> characters, and will not be interpreted as control characters, so the
> problem of reading a UTF8 CIF will typically be contained to a few
> isolated characters.  As far as writing a UTF8 CIF goes, as UTF8 and
> Unicode become the standard for encoding non-ASCII characters (which I
> think they will), editors to work with Unicode will become more
> common, and certainly those people who regularly work with non-ASCII
> character sets will know how to edit a UTF8 text file (even if e.g. I
> personally don't know how right now).  Note also that the current IUCr
> backslash encodings for accented characters and greek text can coexist
> with UTF8 versions of these characters.
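
A small illustration of the containment argument, assuming a legacy reader that treats each byte as one Latin-1 character (that choice of legacy encoding is an assumption made for the example). Every byte UTF-8 uses for a non-ASCII character is 0x80 or above, so it cannot collide with ASCII control characters, quotes or whitespace; the ASCII part of the file reads unchanged.

```python
def non_ascii_bytes(text):
    """Bytes that UTF-8 uses to encode the non-ASCII characters of text."""
    return [b for ch in text if ord(ch) > 0x7F for b in ch.encode("utf-8")]


# "Muller" with u-umlaut, followed by a Greek alpha.
sample = "M\u00fcller \u03b1"
raw = sample.encode("utf-8")

# None of the extra bytes falls in the ASCII range (0x00-0x7F).
assert all(b >= 0x80 for b in non_ascii_bytes(sample))

# A byte-per-character reader sees wrong characters in place of the
# accented letter and the Greek letter, but the ASCII text is intact.
legacy_view = raw.decode("latin-1")
ascii_part = "".join(c for c in legacy_view if ord(c) < 0x80)
```

Here `ascii_part` is `"Mller "`: only the two non-ASCII characters are affected, and each shows up as two unrelated characters in the legacy view.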
>
>
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>