Discussion List Archives

Re: [ddlm-group] Treatment of CIF2 unicode characters with CIF1 equivalents

Dear DDLm-ers,

Comments inserted below:

On 25 April 2017 at 00:40, Bollinger, John C <John.Bollinger@stjude.org> wrote:

Dear DDLm-group,

 

Although the kind of transformation described by James’s proposed rule is appropriate under the conditions the rule describes, I am a bit apprehensive about adding the specific proposed text to Vol G.  In particular,

 

(1) I am inclined to think it inappropriate to assert rules for *applications* in a chapter presenting specifications for the CIF *format*, except to the extent that such rules can be construed as de facto specifications for details of the format or its intended interpretation.


Indeed, except that the current chapter does contain information about markup conventions.  Insofar as these conventions exist, they need to go somewhere. I'm happy as co-editor (or for the authors of the chapter) to consider moving them to a place that is more appropriate and does not risk them being misinterpreted.

 

(2) The proposed rule comes awfully close to comingling syntax with semantics, and I think the more separation we can maintain between those, the better.


Absolutely. It needs to be clear that these rules are intended as a convention to help software agree about presentation of text intended for human consumption.  Perhaps a caveat to this effect should be included in the future equivalent of section 2.2.7.4.
 

(3) I’m not convinced that such a provision is needed at all, inasmuch as it seems to follow directly from the principle that CIF data values will be interpreted according to their items’ definitions.


Indeed, it is within the power of any given dictionary definition to describe the interpretation of datavalue contents. However, having a default approach to markup is convenient for dictionary and datafile processors, as the number of different, ad-hoc definition-dependent conventions is reduced.  Likewise, definition authors have a ready-made markup guide.  As such a convention is useful, it has a place somewhere in our documentation.

Given that code exists which takes advantage of the original convention (and I believe that the IUCr Journals have quite a bit of this code), the question arises of how modern CIF libraries might convert Unicode datavalues for such legacy processors.  We could allow potentially incompatible ad-hoc transformations to flourish, or we could give software authors some useful default option to cluster around.  So I see this instruction as leading to options in CIF library software to convert non-ASCII characters "according to IUCr convention", as well as the reverse option, to substitute Unicode code points for backslash digraphs "according to IUCr convention".  Mind you, the proposal as it stands seems to verge on the trivial and obvious, but the question was asked, so it is not obvious to everybody.
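
To make the two library options concrete, here is a minimal Python sketch of a conversion "according to IUCr convention" in both directions. It is only an illustration under stated assumptions: the function names are hypothetical, and the three-entry table is a deliberately tiny subset. Only "\a is alpha" is stated in this thread; the beta and gamma entries are assumed and should be checked against Vol G section 2.2.7.4 before use.

```python
# Illustrative subset of the default markup table (NOT the full convention).
# Only the alpha entry is stated in this thread; beta and gamma are assumed.
GREEK_MARKUP = {
    "\u03b1": "\\a",  # Greek small letter alpha
    "\u03b2": "\\b",  # Greek small letter beta (assumed)
    "\u03b3": "\\g",  # Greek small letter gamma (assumed)
}
REVERSE_MARKUP = {digraph: char for char, digraph in GREEK_MARKUP.items()}


def to_iucr_markup(text: str) -> str:
    """Replace each mapped non-ASCII character with its backslash digraph."""
    return "".join(GREEK_MARKUP.get(ch, ch) for ch in text)


def from_iucr_markup(text: str) -> str:
    """Replace each known backslash digraph with its Unicode character."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in REVERSE_MARKUP:
            out.append(REVERSE_MARKUP[pair])
            i += 2
        else:
            # Unknown sequences and ordinary characters pass through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(to_iucr_markup("\u03b1-quartz"))  # prints \a-quartz
```

A real library would also have to decide how to treat a literal backslash in the original text and characters with no entry in the table; this sketch simply passes both through.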
 

 

Evidently we’re talking about the next edition of Vol G, and it is unclear to me exactly how that edition will need to be changed to cover CIF 2.0.  Inasmuch as I imagine it will incorporate at least some of the content of the CIF 2.0 specification paper, however, I observe that appendix A.2 of that paper addresses exactly this area already, in the context of a more complete discussion of format conversions such as the current proposal considers.  Since we already have applicable text, I suggest we use that as context and starting point, instead of drafting something new from scratch.  (Hyperlink for the paper: http://journals.iucr.org/j/issues/2016/01/00/aj5269/index.html)


Sure, I guess Vol G is on my mind a lot lately so I've been thinking in terms of it. In the context of section A.2 of the CIF2.0 paper, the proposed statement takes point (b) and gives a bit more force to the idea that datavalues before/after substitution can be considered equivalent - perhaps a caveat to the effect of "for presentational purposes" is needed in the proposed text? 

I would have no objection if this proposal reverts to the less prescriptive Appendix A.2(b) of the paper plus some web-based materials and suggestions and, eventually, some discussion in Vol G second edition.  What does everybody think?

all the best,
James.

 

 

From: ddlm-group [mailto:ddlm-group-bounces@iucr.org] On Behalf Of James Hester
Sent: Sunday, April 23, 2017 6:46 PM
To: ddlm-group <ddlm-group@iucr.org>
Subject: [ddlm-group] Treatment of CIF2 unicode characters with CIF1 equivalents

 

Dear DDLm-group (aka COMCIFS technical committee)

There has been some lively discussion on the cif-developers mailing list of late which you may review at http://www.iucr.org/__data/iucr/lists/cif-developers/ .

One issue raised was what to do about CIF2 datavalues that contain Unicode characters that have equivalent ASCII sequences described by the CIF markup conventions (e.g. Greek characters).

According to section 2.2.7.4.13 - 17 of International Tables Vol G, by default Greek and some other non-ASCII characters can be represented in text datavalues using a backslash notation <backslash><ascii character>, e.g. \a is alpha.   Different markup conventions are possible on a per-dictionary or per-definition basis. In CIF2, these characters can be represented natively, but legacy CIF applications presented with a datavalue containing non-ASCII values may not be prepared to typeset or present them appropriately.  On the other hand, it would seem inefficient to define separate Unicode-aware datanames for every text value simply to avoid legacy problems.

 

Proposal: add the following paragraph to Vol G section 2.2.7.4. Note that "meets the requirements of paragraph 2.2.7.4.13" means that this paragraph only applies in those cases for which the CIF1 markup conventions would apply. 

(2.2.7.4.18) Whenever an application is required to convert a datavalue from a CIF2 datafile containing code points outside the ASCII range to a datavalue containing only ASCII codepoints, the appropriate markup as per paragraphs 2.2.7.4.13-16 should be substituted, provided that the relevant definition meets the requirements of paragraph 2.2.7.4.13. If no markup is defined for the Unicode code point, no CIF1 equivalent value exists and application behaviour is undefined.
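
As a rough illustration of how an application might follow the proposed paragraph, consider the Python sketch below. Everything named in it is hypothetical: the function, the exception class, the `uses_default_markup` flag (standing in for "the relevant definition meets the requirements of paragraph 2.2.7.4.13") and the two-entry table, of which only the alpha entry is stated in this message. Because the proposal leaves behaviour undefined when no markup exists, raising an error is just one possible choice made for this sketch.

```python
class NoCif1Equivalent(ValueError):
    """Raised when a datavalue has no CIF1 (ASCII-only) equivalent."""


# Illustrative subset of the markup table; the beta entry is assumed.
MARKUP = {
    "\u03b1": "\\a",  # Greek small letter alpha
    "\u03b2": "\\b",  # Greek small letter beta (assumed)
}


def to_ascii_datavalue(value: str, uses_default_markup: bool) -> str:
    """Convert a CIF2 text datavalue to an ASCII-only datavalue.

    uses_default_markup is a stand-in for the check that the item's
    definition meets the requirements of paragraph 2.2.7.4.13.
    """
    if all(ord(ch) < 128 for ch in value):
        return value  # already ASCII, nothing to convert
    if not uses_default_markup:
        # The proposed rule does not apply to this definition.
        raise NoCif1Equivalent("default markup conventions do not apply")
    out = []
    for ch in value:
        if ord(ch) < 128:
            out.append(ch)
        elif ch in MARKUP:
            out.append(MARKUP[ch])
        else:
            # Behaviour is undefined by the proposal; this sketch refuses.
            raise NoCif1Equivalent(f"no markup defined for U+{ord(ch):04X}")
    return "".join(out)
```

For example, `to_ascii_datavalue("\u03b1-phase", True)` yields the ASCII string `\a-phase`, whereas a value containing a code point absent from the table raises `NoCif1Equivalent`.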

 

Please comment.

James.

--

T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148




_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group




