
Re: [ddlm-group] Treatment of CIF2 unicode characters with CIF1 equivalents

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] Treatment of CIF2 unicode characters with CIF1 equivalents
From: "Bollinger, John C" <[email protected]>
Date: Fri, 28 Apr 2017 14:34:06 +0000

Dear DDLm group,

Inasmuch as ITvG positions the markup codes as an aspect of data _semantics_ and ties their usage to item definitions, it seems to follow that the version of CIF syntax (if any) with which the data are conveyed should not be a factor in whether or how the codes are interpreted. Certainly it is preferable to use Unicode characters instead of escape codes where the medium permits, but item values must be interpreted according to their definitions, regardless of how the values were transported, and that includes interpreting the markup codes where the definition calls for it. I think this leads to the same approach James suggests.

I think it would be worthwhile, however, to clarify the details of the correspondence between the escape codes and Unicode characters. Some, especially among the codes for Greek characters, are straightforward, but others are not.

For example, I have always supposed that \f corresponds to U+03C6 ("GREEK SMALL LETTER PHI"), but is it incorrect to translate it instead to its compatibility equivalent, U+03D5 ("GREEK PHI SYMBOL")? Also, when writing converters in the past, I have converted the codes for diacritics to Unicode combining characters, and moved them after their base character, but what about series of *multiple* diacritics? (I have converted these to multiple combining characters associated with the base character.)
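The diacritic conversion described above can be sketched roughly as follows. This is only an illustration, not a reference implementation: the function name `convert_diacritics`, the small subset of codes in the table, and the choice to emit multiple combining characters in the order the codes appear are all assumptions.

```python
import re
import unicodedata

# Illustrative subset of CIF1 diacritic codes mapped to Unicode combining
# characters. The selection and the mapping are assumptions for this sketch.
COMBINING = {
    "'": "\u0301",  # COMBINING ACUTE ACCENT
    "`": "\u0300",  # COMBINING GRAVE ACCENT
    "^": "\u0302",  # COMBINING CIRCUMFLEX ACCENT
    '"': "\u0308",  # COMBINING DIAERESIS
    "~": "\u0303",  # COMBINING TILDE
    ",": "\u0327",  # COMBINING CEDILLA
}

# One or more backslash+mark pairs, followed by a base letter.
_PATTERN = re.compile(
    r"((?:\\[" + re.escape("".join(COMBINING)) + r"])+)([A-Za-z])"
)


def convert_diacritics(text, normalize=False):
    """Move diacritic codes after their base letter as combining characters."""

    def repl(match):
        marks = match.group(1)[1::2]  # drop the backslashes, keep the marks
        base = match.group(2)
        # Assumption: multiple marks keep their source order.
        return base + "".join(COMBINING[m] for m in marks)

    result = _PATTERN.sub(repl, text)
    # Optionally compose to precomposed characters where they exist.
    return unicodedata.normalize("NFC", result) if normalize else result
```

With this sketch, `\'e` becomes `e` followed by U+0301, or the single character U+00E9 when `normalize=True`; a sequence such as `\f` is left untouched because it is not one of the diacritic codes.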

Of course, there are some codes that have no Unicode equivalent; these include at least the codes for sub- and superscripting, but I’ve also never found Unicode characters that map, semantically, to any of ---, \\db, \\tb, or \\ddb. There are also some that could map to any of several different characters, such as --, \\langle, and \\rangle. The exact characters used don’t much matter for typesetting, but they do matter if the codes are interpreted for other purposes.
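To make the ambiguity concrete, the sketch below lists a few Unicode characters that a converter might plausibly choose for \\langle and \\rangle. The candidate lists, the `UNMAPPED` set, and the helper name `describe` are assumptions made for illustration; nothing here is prescribed by ITvG.

```python
import unicodedata

# Hypothetical candidate code points for the ambiguous codes.
CANDIDATES = {
    r"\\langle": ["\u27E8", "\u2329", "\u3008"],
    r"\\rangle": ["\u27E9", "\u232A", "\u3009"],
}

# Codes for which no semantically matching Unicode character is assumed.
UNMAPPED = {"---", r"\\db", r"\\tb", r"\\ddb"}


def describe(code):
    """Return 'U+XXXX NAME' strings for each candidate, or [] if unmapped."""
    if code in UNMAPPED:
        return []
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch)}"
        for ch in CANDIDATES.get(code, [])
    ]
```

All three candidates for \\langle render similarly in print, but they are distinct code points, so software that compares or searches interpreted values would treat them as different.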

And some of the codes for specific symbols suggest the possibility of generalization. In particular, the provision for \%a and \%A suggests that \%, when not followed by a space, could be interpreted as one of the general codes for diacritical marks, but that’s not accommodated by a strict reading of ITvG. Is that what’s really meant?

As for representing the '\' character itself, I have always supposed that where that character appears but does not form the first character of a valid markup code, it should be interpreted literally. In particular, it then follows that it is always interpreted literally when followed by a space or the end of the value. If we wanted to add a specific code for that, however, then the natural one would be "\\", to be interpreted according to paragraph (35) of chapter 2.2.7.4. I’ve always interpreted the last sentence of that paragraph to require *all* of the \\X codes presented therein (not just those specifically identified in the sentence) to be followed by a space or end-of-input, with the space being interpreted as part of the code. (Otherwise the interpretation of \\simeq would be ambiguous.) That interpretation affords assigning "\\ " meaning as a double-slash code with empty label, representing a single '\'.
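A minimal sketch of that reading, assuming every double-backslash code must be followed by a space or end-of-input, that the space is consumed as part of the code, and that anything not forming a valid code is left literal. The table `DOUBLE_CODES` holds only a hypothetical subset, the Unicode targets are assumptions, and the empty label is the proposed (not established) code for a single backslash.

```python
import re

# Hypothetical subset of double-backslash codes; targets are assumptions.
DOUBLE_CODES = {
    "": "\\",            # proposed: empty label represents a single backslash
    "sim": "\u223C",     # TILDE OPERATOR
    "simeq": "\u2243",   # ASYMPTOTICALLY EQUAL TO
    "infty": "\u221E",   # INFINITY
}

# Two backslashes, an alphabetic label, then a space or end-of-input.
_DOUBLE = re.compile(r"\\\\([A-Za-z]*)(?: |\Z)")


def expand_double_codes(text):
    """Expand recognised double-backslash codes; leave everything else literal."""

    def repl(match):
        label = match.group(1)
        if label in DOUBLE_CODES:
            return DOUBLE_CODES[label]  # terminating space is consumed
        return match.group(0)           # unknown label: keep the text as-is

    return _DOUBLE.sub(repl, text)
```

Because the terminating space belongs to the code, `\\sim ` and `\\simeq ` cannot be confused with each other, and `\\simeqx` is not silently read as `\\simeq` or `\\sim` followed by stray letters.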

Regards,

John

From: ddlm-group [mailto:[email protected]] On Behalf Of James Hester
Sent: Thursday, April 27, 2017 6:09 PM
To: ddlm-group <[email protected]>
Subject: Re: [ddlm-group] Treatment of CIF2 unicode characters with CIF1 equivalents

Hi Simon and others,

I think I prefer this approach of clarifying markup conventions for CIF2, and so dropping the CIF2->CIF1 translation approach I had initially proposed. Natural translations to ASCII strings will suggest themselves in the process. This is in the spirit of the CIF2 syntax paper appendix as well.

As a simple solution, I suggest that the markup conventions described in Vol G 2.2.7.4 remain available under the same conditions as described in 2.2.7.4.13. There are then two ways to represent the Unicode code points corresponding to the characters listed in 2.2.7.4.14-16. Any other approach is likely to be fraught, as "ASCII string" is a subset of "UTF-8 string" and switching on and off markup depending on the presence of non-ASCII characters is fragile behaviour.

Should we introduce "triple backslash" to represent backslash, as double backslash is already used?

If we agree, I suggest that it is added to our CIF2 FAQ and eventually finds its way into the equivalent of 2.2.7.4 in the new Vol G.

all the best,

James.

On 26 April 2017 at 23:05, SIMON WESTRIP <[email protected]> wrote:

Hi James

I think that the 'common semantic features' need reviewing fully in any revision of Vol G, not only in the light of CIF2 (e.g. I'm not sure it currently states how to represent a literal backslash: does C:\foldername\filename contain Greek phi... :-) ).

So **if** the subject of CIF2->CIF1 is to be addressed in this context and recommendations made, why not extend the semantics?

I'd prefer not to prescribe any conventions for CIF2->CIF1, but rather clarify the use of some of these semantics with CIF2.

Although the IUCr journals have yet to receive/publish a CIF2, I suspect that when they do there will be CIF2 files that contain 'CIF1 markup'...

Cheers

Simon

PS just for info: I use \#xxxxxx when handling any Unicode that isn't covered by the CIF1 semantics, but that is very rare.

--

T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

_______________________________________________
ddlm-group mailing list
[email protected]
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group



