Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
- From: James Hester <jamesrhester@gmail.com>
- Date: Sat, 10 Oct 2009 12:49:44 +0300
- In-Reply-To: <645410.77656.qm@web87015.mail.ird.yahoo.com>
- References: <C6F5BF24.1200E%nick@csse.uwa.edu.au><645410.77656.qm@web87015.mail.ird.yahoo.com>
Dear All,

This was formulated before the latest round of messages, but I think what I have said still holds.

Nick writes:

> (1) restricting the character set of non-delimited strings is NON-NEGOTIABLE. If we don't restrict it, then we can't build recursive data structures and exploit DDLm. If we aren't going to exploit DDLm, IUCr should drop it now and stick with its current DDL.

Nick, surely it is negotiable insofar as non-delimited strings inside bracket expressions could be treated differently to those outside bracket expressions? I realise that having different productions would be inelegant, but we need to have some regard for the installed base of CIF writing software (obviously reading software is not affected). It is good that the PDB are happy to update their software, but what about all the single-crystal software that is actively used at the moment in all those labs? Don't those software authors require some warning in the form of deprecation first, and some time to distribute updated versions?

Nick again:

> Now guess what? If we don't allow a ' within a '..' and a " within a ".." and any "',:{} within a non-delimited string or a data name WE DON'T NEED A SPACE BEFORE OR AFTER THE TOKEN DELIMITER. This simplifies AND more importantly NORMALIZES the grammar.

I agree with removing string delimiters from inside delimited strings. I have already stated that I prefer to keep enforced whitespace separation for human-readability purposes. I'll admit that I don't understand the use of 'normalizing' in the above sentence, and as my ability to find out anything on the internet is very limited at the moment, perhaps Nick could elaborate on what is meant by this and why it is desirable in the CIF context.

Apropos UTF8: it is clear (to me at least) that, with time, UTF8 will become the new ASCII (i.e. the universally understood encoding), so the question for CIF is not whether to do this, but when and how.
A possible downside for CIF is that, if we jump to UTF8 too early, we sacrifice the ability to view any CIF perfectly in any text editor on any OS. But this is ultimately not a big deal, as the 'mystery' characters will not be interpreted as dangerous control characters (all will be >127) and the rest of the file will remain understandable in any case. I also think the example given by Nick, of someone wanting to insert a Unicode character in vi or emacs, is not such a big deal; presumably someone who needs to use such characters from day to day will know how to deal with Unicode.

More importantly, introducing non-ASCII characters into strings immediately breaks most (or all) current CIF readers, so the right time to do this is when we are breaking CIF readers anyway. That would be now, as we are introducing bracket structures which break CIF1.1 readers.

I don't think there is a need to restrict UTF8 to special string types (e.g. triple quote delimited strings), and I don't see the long-term advantage of using an intermediate representation (i.e. \uA054 etc.). While an intermediate representation may keep us from breaking CIF readers now, it renders the string incomprehensible even in a Unicode-aware editor and, when UTF8 has taken over the world, will appear backward. Nick is concerned about the task of encoding/decoding Unicode, but if we use pure UTF8 with no intermediate encoding, this becomes a job for the producer/consumer of the string, and the CIF layer simply passes these strings back and forth.

I believe the journals should comment on the usefulness of UTF8 to them, as most of the non-ASCII characters will be the various Greek and mathematical symbols, as well as various diacritics in authors' names. Note that the current ad hoc CIF control characters can be interleaved with UTF8 encoded text without confusion, as the former are a pure ASCII representation.
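The two byte-level claims above (non-ASCII characters encode only to bytes above 127, and the ASCII accent codes pass through untouched) can be checked with a rough Python sketch. The sample string and the helper names `non_ascii_bytes` and `ascii_view` are invented for illustration and do not come from any CIF library.

```python
# Sketch only: SAMPLE is a made-up string, not taken from a real CIF.
# It mixes an ad hoc ASCII accent code (\"u) with raw UTF8 characters.
SAMPLE = 'M\\"uller, α = 90°'


def non_ascii_bytes(text: str) -> list[int]:
    """Return the UTF8 byte values that fall outside 7-bit ASCII."""
    return [b for b in text.encode("utf-8") if b > 127]


def ascii_view(text: str) -> str:
    """Roughly what an ASCII-only viewer shows: each high byte becomes '?'."""
    return "".join(chr(b) if b < 128 else "?" for b in text.encode("utf-8"))


encoded = SAMPLE.encode("utf-8")
high = non_ascii_bytes(SAMPLE)
view = ascii_view(SAMPLE)

# The two non-ASCII characters occupy two bytes each, all above 127.
print([hex(b) for b in high])
# The accent code and the rest of the ASCII text survive unchanged.
print(view)
```

In this sample the only unreadable parts in an ASCII viewer are the bytes of the two non-ASCII characters; the accent code and the surrounding text remain legible.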
There is also a certain symmetry in introducing UTF8 at the same time as we disallow string delimiters internally, as most of the single quotes that appear internally in single quote delimited strings are in order to represent accented characters. We give with one hand (these characters can be included in UTF8 encoding instead) and take with the other (no more single quotes in single quote delimited strings).

I will give my vote in Herb's straw poll in a separate email.

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
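As a postscript to the point about accented characters, here is a minimal Python sketch of how backslash accent codes might be turned into directly encoded characters, so that no quote marks remain inside the string. The table covers only a small assumed subset of the ad hoc codes (backslash, accent mark, letter), and the function name `accents_to_unicode` is hypothetical.

```python
import re
import unicodedata

# Assumed subset of the ad hoc accent codes: backslash, accent mark, letter.
# Each accent mark maps to the matching Unicode combining character.
COMBINING = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    '"': "\u0308",  # umlaut / diaeresis
    "~": "\u0303",  # tilde
}

# A backslash, one of the accent marks above, then a single ASCII letter.
ACCENT_CODE = re.compile(r"""\\(['`^"~])([A-Za-z])""")


def accents_to_unicode(text: str) -> str:
    """Replace each recognised accent code with a precomposed character."""
    def compose(match: re.Match) -> str:
        mark, letter = match.group(1), match.group(2)
        # NFC normalisation merges letter + combining mark where possible.
        return unicodedata.normalize("NFC", letter + COMBINING[mark])

    return ACCENT_CODE.sub(compose, text)


# Made-up example value: Ren\'e M\"uller
name = accents_to_unicode("Ren\\'e M\\\"uller")
print(name)
```

After conversion the example value contains no single or double quote characters, so it could sit inside a single quote delimited string under the stricter rule.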
- Follow-Ups:
- References:
- Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings. (Nick Spadaccini)
- Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings. (SIMON WESTRIP)