# Re: [ddlm-group] Treatment of CIF2 unicode characters with CIF1 equivalents

Dear All,

I for one don't have time, or the expertise, to get into the details here. Perhaps if we have no objections to the general principle of a complete default translation between Unicode and ASCII markup, we can commission Simon to come up with a scheme in time for the next edition of Volume G? I figure that the IUCr journals are in the best position to talk about markup and CIF.

all the best,
James.

On 30 April 2017 at 01:23, SIMON WESTRIP wrote:
Dear all

I am happy to hear that these markup conventions will be clarified in the light of CIF2
and agree that adding an FAQ to indicate that the current markup conventions will be respected in CIF2 is a good idea
(assuming that's the case - i.e. CIF2/DDLm is the same as CIF1/DDL1 inasmuch as the conventions apply by default unless explicitly prohibited).

With regard to the 'quirkiness' of those conventions
(e.g. you can't encode <degree>A because \%A is Angstrom;
you can't encode a <backslash>;
\%a is <aring>, but \%u is <degree>u...)
I would prefer simply to extend the markup to enable ASCII representation of any Unicode
(e.g. \#00b0;A could represent <degree>A;
\#005c; could represent a backslash;
\#016f; could represent <uring>...)
This would address these rare but nevertheless real issues, and have the added benefit of enabling any Unicode symbol to be represented if absolutely necessary.
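
A minimal Python sketch of how such an extension might be decoded is given here for illustration. The syntax it assumes (a backslash, '#', four to six hexadecimal digits, then ';') is read off the three examples above and is not an agreed convention; the function names are hypothetical.

```python
import re

# Assumed form of the proposed escape: backslash, '#', 4-6 hex digits, ';'.
# This pattern is inferred from the examples in the message, not from any spec.
_NUMERIC_ESCAPE = re.compile(r"\\#([0-9A-Fa-f]{4,6});")


def decode_numeric_escapes(text: str) -> str:
    """Replace each numeric escape with the Unicode character it names."""
    def _sub(match: re.Match) -> str:
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF:
            # Not a valid Unicode code point: leave the escape untouched.
            return match.group(0)
        return chr(code_point)

    return _NUMERIC_ESCAPE.sub(_sub, text)


def encode_non_ascii(text: str) -> str:
    """Write every non-ASCII character, and the backslash itself, as an escape.

    Assumes the input is plain Unicode text that contains no existing
    CIF markup codes, since their backslashes would be escaped as well.
    """
    return "".join(
        ch if ord(ch) < 128 and ch != "\\" else f"\\#{ord(ch):04x};"
        for ch in text
    )
```

For example, `decode_numeric_escapes(r"\#00b0;A")` yields the degree sign followed by `A`, and existing codes such as `\%A` pass through unchanged because they do not match the pattern.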

Just for interest: in a collection of ~40000 full-text CIFs submitted to Acta E and C (1995-2010),
fewer than 40 actually appear to have needed any backslashes to be marked-up
(using the dubious, yet intuitive and unofficially supported practice of simply using a double backslash in phrases such as "centroid of the C28\\C29\\C30\\C31\\C32\\C33 ring").

Cheers

Simon

From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Friday, 28 April 2017, 15:35

Subject: Re: [ddlm-group] Treatment of CIF2 unicode characters with CIF1 equivalents

Dear DDLm group,

Inasmuch as ITvG positions the markup codes as an aspect of data _semantics_ and ties their usage to item definitions, it seems to follow that the version of CIF syntax (if any) with which the data are conveyed should not be a factor in whether or how the codes are interpreted.  Certainly it is preferable to use Unicode characters instead of escape codes where the medium permits, but item values must be interpreted according to their definitions, regardless of how the values were transported, and that includes interpreting the markup codes where the definition calls for it.  I think this leads to the same approach James suggests.

I think it would be worthwhile, however, to clarify the details of the correspondence between the escape codes and Unicode characters.  Some, especially among the codes for Greek characters, are straightforward, but others not.

For example, I have always supposed that \f corresponds to U+03C6 ("GREEK SMALL LETTER PHI"), but is it incorrect to translate it instead to its compatibility equivalent, U+03D5 ("GREEK PHI SYMBOL")?  Also, when writing converters in the past, I have converted the codes for diacritics to Unicode combining characters, and moved them after their base character, but what about series of *multiple* diacritics?  (I have converted these to multiple combining characters associated with the base character.)
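
The conversion described above can be sketched in Python as follows. This is an illustration under assumptions, not the converter referred to in the message: the `COMBINING` table is a hypothetical subset of the diacritic codes and should be checked against ITvG, and the choice to emit combining characters in the same order as the codes appear is likewise an assumption.

```python
import unicodedata

# Assumed subset of the CIF diacritic codes (the character after the backslash)
# mapped to Unicode combining characters. Verify against ITvG before use.
COMBINING = {
    "'": "\u0301",  # acute
    "`": "\u0300",  # grave
    "^": "\u0302",  # circumflex
    '"': "\u0308",  # umlaut / diaeresis
    "~": "\u0303",  # tilde
    ",": "\u0327",  # cedilla
}


def convert_diacritics(text: str) -> str:
    """Move diacritic codes after their base letter as combining characters."""
    out = []
    i, n = 0, len(text)
    while i < n:
        marks = []
        j = i
        # Collect a run of consecutive diacritic codes preceding one base letter.
        while j + 1 < n and text[j] == "\\" and text[j + 1] in COMBINING:
            marks.append(COMBINING[text[j + 1]])
            j += 2
        if marks and j < n and text[j].isalpha():
            out.append(text[j] + "".join(marks))
            i = j + 1
        else:
            # No base letter follows: keep the character literally.
            out.append(text[i])
            i += 1
    # NFC composes base + mark into a precomposed character where one exists.
    return unicodedata.normalize("NFC", "".join(out))


# The phi question can be checked directly: NFKC folds U+03D5 to U+03C6.
PHI_SYMBOL_FOLDS_TO_PHI = unicodedata.normalize("NFKC", "\u03d5") == "\u03c6"
```

Applying NFC at the end is a design choice made here so that common cases come out as single precomposed characters; a converter that must preserve the decomposed form would omit that step.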

Of course, there are some codes that have no Unicode equivalent; these include at least the codes for sub- and superscripting, but I’ve also never found Unicode characters that map, semantically, to any of ---, \\db,  \\tb,  or \\ddb.  There are also some that could map to any of several different characters, such as --, \\langle, and \\rangle. The exact characters used don’t much matter for typesetting, but they do matter if the codes are interpreted for other purposes.

And some of the codes for specific symbols suggest the possibility of generalization.  In particular, the provision for \%a and \%A suggests that \%, when not followed by a space, could be interpreted as one of the general codes for diacritical marks, but that’s not accommodated by a strict reading of ITvG.  Is that what’s really meant?

As for representing the '\' character itself, I have always supposed that where that character appears but does not form the first character of a valid markup code, it should be interpreted literally.  In particular, it then follows that it is always interpreted literally when followed by a space or the end of the value.  If we wanted to add a specific code for that, however, then the natural one would be "\\", to be interpreted according to paragraph (35) of chapter 2.2.7.4.  I’ve always interpreted the last sentence of that paragraph to require *all* of the \\X codes presented therein (not just those specifically identified in the sentence) to be followed by a space or end-of-input, with the space being interpreted as part of the code.  (Otherwise the interpretation of \\simeq would be ambiguous.)  That interpretation affords assigning "\\ " meaning as a double-slash code with empty label, representing a single '\'.
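
The literal-backslash reading in the preceding paragraph can be sketched in Python as follows. The function name and the small set of codes in the usage note are hypothetical; a real implementation would need the full code list from ITvG, and the longest-match rule used here is an assumption.

```python
def literal_backslashes(text: str, known_codes: set[str]) -> list[int]:
    """Return the positions of backslashes that start no known markup code.

    Under the reading sketched here, such backslashes are taken literally.
    Codes are tried longest first, and a matched code is skipped as a whole.
    """
    codes = sorted(known_codes, key=len, reverse=True)
    literal = []
    i = 0
    while i < len(text):
        if text[i] != "\\":
            i += 1
            continue
        match = next((c for c in codes if text.startswith(c, i)), None)
        if match is None:
            literal.append(i)  # no valid code begins here: literal backslash
            i += 1
        else:
            i += len(match)  # consume the whole code, trailing space included
    return literal
```

With the hypothetical set `{"\\a", "\\f", "\\%", "\\\\simeq "}`, a backslash followed by a space or by the end of the value is reported as literal, while the backslash in `C:\foldername` is not, because `\f` is a valid code. Adding a double backslash followed by a space to the set makes that sequence a single code, as suggested at the end of the paragraph.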

Regards,

John

From: ddlm-group [mailto:ddlm-group-bounces@iucr.org] On Behalf Of James Hester
Sent: Thursday, April 27, 2017 6:09 PM
To: ddlm-group <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] Treatment of CIF2 unicode characters with CIF1 equivalents

Hi Simon and others,
I think I prefer this approach of clarifying markup conventions for CIF2, so that we drop the CIF2->CIF1 translation approach I had initially proposed. Natural translations to ASCII strings will suggest themselves in the process.  This is in the spirit of the CIF2 syntax paper appendix as well.
As a simple solution, I suggest that the markup conventions described in Vol G 2.2.7.4 remain available under the same conditions as described in 2.2.7.4.13.  There are then two ways to represent the Unicode code points corresponding to the characters listed in 2.2.7.4.14-16.  Any other approach is likely to be fraught, as "ASCII string" is a subset of "UTF-8 string" and switching on and off markup depending on the presence of non-ASCII characters is fragile behaviour.

Should we introduce "triple backslash" to represent backslash, as double backslash is already used?
If we agree, I suggest that it is added to our CIF2 FAQ and eventually finds its way into the equivalent of 2.2.7.4 in the new Vol G.
all the best,
James.

On 26 April 2017 at 23:05, SIMON WESTRIP <simonwestrip@btinternet.com> wrote:
Hi James

I think that the 'common semantic features' need reviewing fully in any revision of Vol G,
not only in the light of CIF2 (e.g. I'm not sure it currently states how to represent a literal backslash,
does C:\foldername\filename contain Greek phi... :-)

So **if** the subject of CIF2->CIF1 is to be addressed in this context and recommendations made,
why not extend the semantics?
I'd prefer not to prescribe any conventions for CIF2->CIF1;
rather clarify the use of some of these semantics with CIF2.
Although the IUCr journals have yet to receive/publish a CIF2, I suspect that when they do there will be CIF2
files that contain 'CIF1 markup'...

Cheers

Simon

PS just for info: I use \#xxxxxx when handling any unicode that isn't covered by the CIF1 semantics -
but that is very rare.

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group
