Re: [ddlm-group] Treatment of CIF2 unicode characters with CIF1 equivalents
- To: Group finalising DDLm and associated dictionaries <email@example.com>
- Subject: Re: [ddlm-group] Treatment of CIF2 unicode characters with CIF1 equivalents
- From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
- Date: Fri, 28 Apr 2017 14:34:06 +0000
- In-Reply-To: <CAM+dB2eZP-Bx_2uuZe3H+yrSWEtcynyg_4CqzSHo+7z3FNBBog@mail.gmail.com>
- References: <firstname.lastname@example.org><email@example.com><CAM+dB2eZP-Bx_2uuZe3H+yrSWEtcynyg_4CqzSHo+7z3FNBBog@mail.gmail.com>
Dear DDLm group,
Inasmuch as ITvG positions the markup codes as an aspect of data _semantics_ and ties their usage to item definitions, it seems to follow that the version of CIF syntax (if any) with which the data are conveyed should not be a factor in whether or how the codes are interpreted. Certainly it is preferable to use Unicode characters instead of escape codes where the medium permits, but item values must be interpreted according to their definitions, regardless of how the values were transported, and that includes interpreting the markup codes where the definition calls for it. I think this leads to the same approach James suggests.
I think it would be worthwhile, however, to clarify the details of the correspondence between the escape codes and Unicode characters. Some, especially among the codes for Greek characters, are straightforward, but others are not.
For example, I have always supposed that \f corresponds to U+03C6 ("GREEK SMALL LETTER PHI"), but is it incorrect to translate it instead to its compatibility equivalent, U+03D5 ("GREEK PHI SYMBOL")? Also, when writing converters in the past, I have converted the codes for diacritics to Unicode combining characters, and moved them after their base character, but what about series of *multiple* diacritics? (I have converted these to multiple combining characters associated with the base character.)
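To make the two questions above concrete, here is a minimal Python sketch of the kind of conversion described: diacritic codes become combining characters placed after their base character, and a run of several codes is attached to the same base. The table COMBINING and the function convert_diacritics are hypothetical names, the table covers only an assumed subset of the codes and should be checked against ITvG, and keeping multiple marks in source order is an assumption rather than anything ITvG prescribes. The last line illustrates the phi question: U+03D5 folds to U+03C6 only under compatibility normalization.

```python
import unicodedata

# Assumed subset of the diacritic codes (code character -> combining mark).
COMBINING = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    '"': "\u0308",  # umlaut / diaeresis
    "~": "\u0303",  # tilde
    ",": "\u0327",  # cedilla
}


def convert_diacritics(text):
    """Replace runs of backslash-diacritic codes with base + combining marks."""
    out = []
    i = 0
    n = len(text)
    while i < n:
        marks = []
        j = i
        # Collect a run of consecutive diacritic codes, e.g. \'\^ before a base.
        while j + 1 < n and text[j] == "\\" and text[j + 1] in COMBINING:
            marks.append(COMBINING[text[j + 1]])
            j += 2
        if marks and j < n and text[j].isalpha():
            # Base character first, then the marks in source order (assumption).
            out.append(text[j] + "".join(marks))
            i = j + 1
        else:
            # Not a code followed by a letter: keep the character as-is.
            out.append(text[i])
            i += 1
    # NFC composes base + mark into a single code point where one exists.
    return unicodedata.normalize("NFC", "".join(out))


# U+03D5 folds to U+03C6 only under compatibility normalization (NFKC/NFKD).
PHI_SYMBOL_FOLDS_TO_PHI = unicodedata.normalize("NFKC", "\u03d5") == "\u03c6"
```

With this sketch, convert_diacritics('M\\"uller') yields "Müller" with a precomposed U+00FC, while a doubled code such as \'\^a yields a base character carrying both marks. Whether a converter should also apply NFC, or leave the combining sequence untouched, is a separate choice that this sketch does not settle.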
Of course, there are some codes that have no Unicode equivalent; these include at least the codes for sub- and superscripting, but I’ve also never found Unicode characters that map, semantically, to any of ---, \\db, \\tb, or \\ddb. There are also some that could map to any of several different characters, such as --, \\langle, and \\rangle. The exact characters used don’t much matter for typesetting, but they do matter if the codes are interpreted for other purposes.
And some of the codes for specific symbols suggest the possibility of generalization. In particular, the provision for \%a and \%A suggests that \%, when not followed by a space, could be interpreted as one of the general codes for diacritical marks, but that’s not accommodated by a strict reading of ITvG. Is that what’s really meant?
As for representing the '\' character itself, I have always supposed that where that character appears but does not form the first character of a valid markup code, it should be interpreted literally. In particular, it then follows that it is always interpreted literally when followed by a space or the end of the value. If we wanted to add a specific code for that, however, then the natural one would be "\\", to be interpreted according to paragraph (35) of chapter 188.8.131.52. I’ve always interpreted the last sentence of that paragraph to require *all* of the \\X codes presented therein (not just those specifically identified in the sentence) to be followed by a space or end-of-input, with the space being interpreted as part of the code. (Otherwise the interpretation of \\simeq would be ambiguous.) That interpretation affords assigning "\\ " meaning as a double-slash code with empty label, representing a single '\'.
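The literal-backslash reading described above can be sketched as follows, in Python. This is only an illustration under stated assumptions: KNOWN_CODES is a hypothetical and deliberately tiny set of single-character codes, the function name is invented for the example, and the double-backslash (\\X) codes discussed above are not handled at all.

```python
# Hypothetical, deliberately tiny set of recognised single-character codes.
KNOWN_CODES = frozenset("abf'`^\"~,")


def literal_backslash_positions(value):
    """Return the indices of backslashes that would be read literally."""
    positions = []
    i = 0
    while i < len(value):
        if value[i] == "\\":
            # Empty string when the backslash is the last character of the value.
            nxt = value[i + 1] if i + 1 < len(value) else ""
            if nxt in KNOWN_CODES:
                i += 2  # backslash starts a recognised code: consume both
                continue
            # Followed by a space, end of value, or an unrecognised character.
            positions.append(i)
        i += 1
    return positions
```

Under these assumptions a backslash followed by a space or by the end of the value is always reported as literal, which matches the consequence drawn above.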
From: ddlm-group [mailto:firstname.lastname@example.org]
On Behalf Of James Hester
Hi Simon and others,
I think I prefer this approach of clarifying markup conventions for CIF2, and so dropping the CIF2->CIF1 translation approach I had initially proposed. Natural translations to ASCII strings will suggest themselves in the process. This is in the spirit of the CIF2 syntax paper appendix as well.
As a simple solution, I suggest that the markup conventions described in Vol G 184.108.40.206 remain available under the same conditions as described in 220.127.116.11.13. There are then two ways to represent the Unicode code points corresponding to the characters listed in 18.104.22.168.14-16. Any other approach is likely to be fraught, as "ASCII string" is a subset of "UTF-8 string" and switching on and off markup depending on the presence of non-ASCII characters is fragile behaviour.
If we agree, I suggest that it is added to our CIF2 FAQ and eventually finds its way into the equivalent of 22.214.171.124 in the new Vol G.
all the best,
On 26 April 2017 at 23:05, SIMON WESTRIP <email@example.com> wrote: