Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- To: Group finalising DDLm and associated dictionaries <[email protected]>
- Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- From: "Herbert J. Bernstein" <[email protected]>
- Date: Sat, 10 Oct 2009 10:08:37 -0400 (EDT)
- In-Reply-To: <[email protected]>
- References: <C6F5BF24.1200E%[email protected]><[email protected]><[email protected]><[email protected]>
Dear Colleagues,
I find what James has said quite reasonable. I have my doubts about
whether 1.3 will ever manage to progress from deprecation to denial,
but it cannot hurt to give it a try, so I change my vote to
1.3, 2.2 and 3.2
Many thanks to James for a very clear presentation.
Regards,
Herbert
P.S. I am not frustrated. I think this to be a very useful discussion.
It has certainly helped me to clarify the issues in my own head.
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
[email protected]
=====================================================
On Sat, 10 Oct 2009, James Hester wrote:
> Following is my vote + opinion on the issues as presented by Herbert.
> I have summarized some of my previous arguments as well. Nick and
> Herb: I appreciate your frustration that we don't seem able to move
> past basic syntax, but remember that bracket structures are approved
> in any case and whatever we decide at the moment is simply fine-tuning
> by comparison. I think it is good that we are taking the time to
> think over the syntax issues as syntax is the crucial common element
> in all CIF programs - dictionaries and DDLs are far less widely
> processed in software.
>
> Issue1: Removing the requirement for a trailing whitespace after
> quoted strings outside of bracketed constructs.
> Options: 1.1. Preserve the current convention as is
> 1.2. Terminate all quoted strings on the occurrence of the
> trailing quote delimiter without consideration of the next character
>
> I would like to vote 1.2, however, a quick run over the IUCr CIF
> archive as of June 2008 gives 640 cases of embedded delimiters (out of
> 32,300 files i.e. 2%). Mostly these arise from accented characters in
> names and primes in chemical compound names. Therefore, for the same
> reasons as Issue 2, I vote 1.3 - deprecate first, then remove. And I
> would require that there always be whitespace between tokens,
> regardless of the status of embedded delimiters.
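
To make the difference between options 1.1 and 1.2 concrete, here is a minimal sketch of the two termination rules. The function name read_quoted and its require_trailing_space flag are hypothetical, and the sketch only handles single-line quoted strings; it is not taken from any existing CIF parser.

```python
def read_quoted(text, start, require_trailing_space=True):
    """Return (value, index_after_closing_quote) for the quoted string
    that begins at text[start].

    require_trailing_space=True  mimics option 1.1: a quote character only
        closes the string if it is followed by whitespace or end of input.
    require_trailing_space=False mimics option 1.2: the first matching
        quote character closes the string.
    """
    quote = text[start]
    if quote not in "'\"":
        raise ValueError("not a quoted string")
    i = start + 1
    while i < len(text):
        if text[i] == quote:
            at_end = i + 1 == len(text)
            if not require_trailing_space or at_end or text[i + 1].isspace():
                return text[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated quoted string")


line = "'a dog's life' 1.0"
print(read_quoted(line, 0))                                # option 1.1
print(read_quoted(line, 0, require_trailing_space=False))  # option 1.2
```

Under the first rule the embedded quote survives and the value is "a dog's life"; under the second the value is cut to "a dog", which is the kind of breakage the archive count above is measuring.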
>
> Issue2: Restriction of the character set for non-delimited strings
> outside of bracketed constructs
> Options 2.1. Preserve the current convention as is
> 2.2. Modify the current convention to deprecate use of
> any characters other than a strictly limited set
> of characters, adding a warning on reads and
> defaulting to add quote marks on write
> 2.3. Modify the current convention to forbid the use of
> any characters other than a strictly limited set
> of characters, making it an error to read a non-delimited
> string that does not comply even if the intention
> can be inferred from context
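
The practical difference between 2.2 and 2.3 can be sketched as follows. The restricted alphabet in SAFE_BARE is purely an assumption for illustration (the thread does not fix the actual set), and check_bare and format_value are hypothetical names.

```python
import re
import warnings

# Hypothetical restricted alphabet for non-delimited strings.
# The real set is still under discussion; this one is an assumption.
SAFE_BARE = re.compile(r"[A-Za-z0-9.+-][A-Za-z0-9_.+-]*")


def check_bare(token, strict=False):
    """Reader side: accept a non-delimited token.

    strict=False follows option 2.2 (warn, but accept the token).
    strict=True  follows option 2.3 (treat the token as an error).
    """
    if SAFE_BARE.fullmatch(token):
        return token
    message = f"non-delimited string {token!r} uses characters outside the restricted set"
    if strict:
        raise ValueError(message)
    warnings.warn(message, DeprecationWarning)
    return token


def format_value(value):
    """Writer side: add quote marks by default when the value is not safe bare."""
    if SAFE_BARE.fullmatch(value):
        return value
    if "\n" not in value:
        # Pick a delimiter that does not occur in the value, so no
        # embedded delimiters are produced.
        if "'" not in value:
            return f"'{value}'"
        if '"' not in value:
            return f'"{value}"'
    # Fall back to a semicolon-delimited text field.
    return f"\n;{value}\n;"


print(format_value("1.54"))    # written bare
print(format_value("C2/c"))    # written quoted
print(format_value("5'-ATP"))  # written with the other quote character
```

A reader built this way can be switched from 2.2 to 2.3 by flipping one flag, while a writer that always goes through format_value already produces output that satisfies either option.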
>
> Again, I go for 2.2, with the intention to move to 2.3 when writing
> software has caught up. I make a distinction between breaking CIF
> readers, and breaking CIF writers. The most important CIF readers are
> at a few locations: the IUCr journals, the PDB, the archives. As their
> software is written and deployed in house and in a programming sense
> the changes we are discussing are close to trivial, they can tolerate
> being forced to upgrade, especially if there is some potential
> benefit. There are some other readers of importance (CMPR, Cifedit
> come to mind).
>
> CIF writers, however, are mostly distributed via standard software
> packages far and wide in the various crystallographic labs. Even if
> all the software authors were prepared to immediately update their CIF
> output routines, the time it would take for those changes to propagate
> to most of the end-users is of the order of years. And let's not
> forget that the CIF writers are located where the data are produced
> and/or analysed and in that sense are key to the adoption of CIF.
>
> We can therefore split our changes into two categories: those that
> break writers, and those that break readers. Breaking readers is
> non-controversial, as it is easy to fix them (see above). Breaking
> writers is much more of an issue. Therefore I propose splitting our
> changes into those that break readers, adopting these immediately
> (UTF8, brackets), and into those that break writers, adopting a slower
> process of deprecation followed by adoption for these writer-related
> changes.
>
> Note that the period between deprecation and strict requirement could
> be quite short, if we actively encourage and work with CIF writing
> software authors (i.e. SHELX, PLATON, etc.) to update and distribute
> their revised programs.
>
> Perform a thought experiment of option 2.3: we rewrite the standard
> to make all current CIF writing software non-conforming. The IUCr
> journals, CCDC etc. continue to receive CIF1.1 files. They can't just
> reject them out of hand, as the labs that produced them have no way to
> fix the files (short of hand editing), so the IUCr etc. accept them.
> Where is the stimulus to upgrade? Everything works as before. So by
> deprecating and then enforcing, we at least provide a way for the labs
> to catch up and give the journals a reasonable basis to reject
> non-conforming CIFs once they know that the relevant software packages
> have been updated. Of course, if a CIF data file has #CIF1.1 in the
> header, I think it should always be accepted as such.
>
> Issue 3: Use of UTF-8
> Options: 3.1. Do not use UTF-8
> 3.2. Use UTF-8
>
> Definitely 3.2. As we are forcing CIF readers to update (by
> introducing bracket expressions) now is the time to introduce UTF8 as
> well. As I've written previously, on reading a UTF8 CIF, non-ASCII
> characters will at worst simply be represented as the wrong
> characters, and will not be interpreted as control characters, so the
> problem of reading a UTF8 CIF will typically be contained to a few
> isolated characters. As far as writing a UTF8 CIF goes, as UTF8 and
> Unicode become the standard for encoding non-ASCII characters (which I
> think they will), editors that work with Unicode will become more
> common, and certainly those people who regularly work with non-ASCII
> character sets will know how to edit a UTF8 text file (even if e.g. I
> personally don't know how right now). Note also that the current IUCr
> backslash encodings for accented characters and Greek text can coexist
> with UTF8 versions of these characters.
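
The containment argument for UTF-8 rests on the fact that every byte of a multi-byte UTF-8 sequence is 0x80 or above, so none of them can be mistaken for an ASCII control character, quote, or whitespace. A minimal sketch, assuming a legacy reader that decodes bytes as Latin-1 (that choice of legacy encoding, the sample line, and the helper name are assumptions):

```python
def utf8_bytes_stay_out_of_ascii(text):
    """True if every byte used to encode a non-ASCII character is >= 0x80."""
    return all(
        byte >= 0x80
        for ch in text if ord(ch) > 0x7F
        for byte in ch.encode("utf-8")
    )


# A CIF-like line containing one non-ASCII character (u with diaeresis).
line = "_publ_author_name 'M\u00fcller, K.'"
raw = line.encode("utf-8")

# What a reader that assumes one byte per character would see.
legacy_view = raw.decode("latin-1")

print(utf8_bytes_stay_out_of_ascii(line))
print(legacy_view.split()[0])   # the data name is unchanged
print(legacy_view.count("'"))   # the quote delimiters are unchanged
print(len(legacy_view) - len(line))  # one extra, wrong character
```

The legacy reader shows two wrong characters in place of the accented one, but the data name, the delimiters, and the token boundaries are all intact.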
>
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> [email protected]
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>