Re: [ddlm-group] Treatment of CIF2 unicode characters with CIF1 equivalents
- To: SIMON WESTRIP <simonwestrip@btinternet.com>, Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Treatment of CIF2 unicode characters with CIF1 equivalents
- From: James Hester <jamesrhester@gmail.com>
- Date: Wed, 26 Apr 2017 17:40:53 +1000
- In-Reply-To: <270116483.7872715.1493032195779@mail.yahoo.com>
- References: <CAM+dB2fj5b9wvBk2JZU4ATX-4qjJkKJfY1p8zst5k8jrR_XiWQ@mail.gmail.com><270116483.7872715.1493032195779@mail.yahoo.com>
Hi Simon,
While the translation between the markup digraphs and Unicode is essentially a no-brainer, if we go beyond that to cover the rest of Unicode, there are a variety of alternatives, none of which are likely to suit every use case. Off the top of my head I can think of: (a) silently drop characters that have no legacy representation (b) substitute the ASCII name of the character (c) use some backslash convention that is not \Uxxxxxx (d) replace the characters or whole datavalue with a question mark. Which of these is acceptable will depend on the use case. Where a faithful reproduction of the definition is needed (for example, typesetting a CIF dictionary) there is no avoiding the need to adjust legacy code. At some point it becomes easier to make the legacy software Unicode-aware rather than implement whatever convention we might think up.
Of course, authors and organisations are free to invent their own procedures for interfacing Unicode strings to legacy software, as they understand what is easiest for them and what level of fidelity is acceptable.
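To make the alternatives above a little more concrete, here is a rough Python sketch of how a converter might apply the defined markup and then fall back to one of the options. Everything in it is an assumption for illustration only: the function name `to_legacy`, the policy names, and the three-entry `CIF1_MARKUP` table (a tiny subset; the real table is the one in Vol G). Option (c) is left out because no such backslash convention has been defined.

```python
import unicodedata

# Hypothetical, deliberately incomplete subset of the CIF1 markup table.
# The authoritative list is in Vol G section 2.2.7.4.
CIF1_MARKUP = {
    "\u03b1": "\\a",  # Greek small alpha
    "\u03b2": "\\b",  # Greek small beta
    "\u03b3": "\\g",  # Greek small gamma
}


def to_legacy(value, policy="error"):
    """Return an ASCII-only version of a CIF2 text datavalue.

    Characters with a known markup digraph are substituted. For anything
    else, `policy` picks one of the fallbacks discussed in this thread:
      "drop"     - silently drop the character          (option a)
      "name"     - substitute the Unicode character name (option b)
      "question" - replace the character with "?"       (option d)
      "error"    - refuse, since no CIF1 equivalent exists
    """
    out = []
    for ch in value:
        if ord(ch) < 128:
            out.append(ch)  # already ASCII, pass through unchanged
        elif ch in CIF1_MARKUP:
            out.append(CIF1_MARKUP[ch])
        elif policy == "drop":
            continue
        elif policy == "name":
            # Fall back to a U+XXXX label if the code point has no name.
            label = unicodedata.name(ch, "U+%04X" % ord(ch))
            out.append("{" + label + "}")
        elif policy == "question":
            out.append("?")
        else:
            raise ValueError("no CIF1 markup for U+%04X" % ord(ch))
    return "".join(out)
```

With this sketch, `to_legacy("\u03b1-quartz")` gives `\a-quartz`, while a value containing a character outside the table is dropped, named, replaced or rejected depending on the policy chosen. None of the fallbacks can be reversed reliably, which is part of why no single choice suits every use case.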
It is possible that I don't have a good handle on practical issues, so perhaps you could expand on why having a full Unicode conversion scheme is better than the current proposal ("undefined"). Also, as mentioned in my recent email in reply to John, we could drop this CIF2->CIF1 conversion conversation and instead discuss strategies in our online materials and Vol G.
On 24 April 2017 at 21:09, SIMON WESTRIP <simonwestrip@btinternet.com> wrote:
I agree with the approach, but think that if CIF2->CIF1 is to be mentioned at all, a full convention for conversion for legacy processing should be described.
Cheers
Simon
From: James Hester <jamesrhester@gmail.com>
To: ddlm-group <ddlm-group@iucr.org>
Sent: Monday, 24 April 2017, 0:45
Subject: [ddlm-group] Treatment of CIF2 unicode characters with CIF1 equivalents
Dear DDLm-group (aka COMCIFS technical committee)
There has been some lively discussion on the cif-developers mailing list of late which you may review at http://www.iucr.org/__data/iucr/lists/cif-developers/ . One issue raised was what to do about CIF2 datavalues that contained unicode characters that have equivalent ASCII sequences described by the CIF markup conventions (e.g. Greek characters).
According to section 2.2.7.4.13 - 17 of International Tables Vol G, by default Greek and some other non-ASCII characters can be represented in text datavalues using a backslash notation <backslash><ascii character>, e.g. \a is alpha. Different markup conventions are possible on a per-dictionary or per-definition basis. In CIF2, these characters can be represented natively, but legacy CIF applications presented with a datavalue containing non-ASCII values may not be prepared to typeset or present them appropriately. On the other hand, it would seem inefficient to define separate Unicode-aware datanames for every text value simply to avoid legacy problems.
Proposal: add the following paragraph to Vol G section 2.2.7.4. Note that "meets the requirements of paragraph 2.2.7.4.13" means that this paragraph only applies in those cases for which the CIF1 markup conventions would apply.
(2.2.7.4.18) Whenever an application is required to convert a datavalue from a CIF2 datafile containing code points outside the ASCII range to a datavalue containing only ASCII codepoints, the appropriate markup as per paragraphs 2.2.7.4.13-16 should be substituted, provided that the relevant definition meets the requirements of paragraph 2.2.7.4.13. If no markup is defined for the Unicode code point, no CIF1 equivalent value exists and application behaviour is undefined.
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group
--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148