On Wednesday, June 30, 2010 7:08 AM, Brian McMahon wrote:


>Are we proposing to allow also the "old" CIF1
>character encoding scheme (\/l <=> Unicode U+0142) within CIF2, and
>have we a procedure for distinguishing when it has been applied?

Since CIF1 takes great pains to specify that this scheme is a "common semantic feature" and not part of the specification per se, I see no need to promote it into the CIF2 format specification.  I would prefer instead to deprecate it for use with CIF2, and to recommend that the appropriate Unicode characters be used directly.  By "deprecate", however, I do not mean to forbid CIF2 parsers or applications from using that scheme, should they wish to do so in addition to providing direct support for Unicode.

Furthermore, I feel obliged to point out that any mapping between the old encoding scheme and Unicode is at best ambiguous, in the sense that Unicode provides for both composed and decomposed versions of many of the resulting characters, and offers multiple possible characters for a few of the codes (and no characters for a few others).  My favorite examples are \\langle and \\rangle.  The former, for example, could be decoded as U+003C "Less-Than Sign", but it seems more likely to be intended as one (or more) of U+2329 "Left-Pointing Angle Bracket", U+27E8 "Mathematical Left Angle Bracket", or U+3008 "Left Angle Bracket".  Overall, many of the old codes are oriented toward typesetting (for which purpose the distinctions Unicode draws among various possible meanings of \\langle are unimportant), whereas Unicode largely ignores formatting and typesetting, focusing instead on character-level semantics.
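
To make the ambiguity concrete, here is a minimal Python sketch.  The CANDIDATES table and the two helper functions are hypothetical and deliberately partial; they are not taken from any CIF document or library, and serve only to show that one code may correspond to several Unicode strings.  It assumes that \'e is the CIF1 code for e with an acute accent, and it uses NFC normalization from the standard unicodedata module to check which candidates remain distinct.

```python
import unicodedata

# Hypothetical, partial table for illustration only: each CIF1 code maps to
# the Unicode strings it could plausibly be decoded to.
CANDIDATES = {
    "\\/l": ["\u0142"],                      # single, unambiguous candidate
    "\\'e": ["\u00e9", "e\u0301"],           # composed vs. decomposed form
    "\\\\langle": ["\u003c", "\u2329", "\u27e8", "\u3008"],
}


def distinct_after_nfc(code):
    """Return the set of candidate decodings of `code` after NFC normalization."""
    return {unicodedata.normalize("NFC", s) for s in CANDIDATES[code]}


def describe(code):
    """List code points and character names for every candidate decoding."""
    lines = []
    for s in CANDIDATES[code]:
        points = " ".join(f"U+{ord(ch):04X}" for ch in s)
        names = " + ".join(unicodedata.name(ch) for ch in s)
        lines.append(f"{code} -> {points} ({names})")
    return lines


if __name__ == "__main__":
    for code in CANDIDATES:
        print("\n".join(describe(code)))
        print(f"  distinct after NFC: {len(distinct_after_nfc(code))}")
```

Normalization removes the composed/decomposed ambiguity for \'e, but it cannot choose among the semantically different candidates for \\langle; that choice would have to come from somewhere outside the character codes themselves.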

>The elephant in the room: the discussion has so far addressed
>alphabetic character encodings. To retain at least the functionality
>of CIF1, we will need to specify schemes for handling other constructs
>such as sub/superscripts and possibly some other mathematical
>constructs - though most of the CIF1 maths is at the level of
>character representations which may be catered for by Unicode code
>points. Will procedures for identifying which of these schemes apply
>(possibly at the level of individual CIF data items) be orthogonal
>to specification of character encodings in the CIF, will the two
>interfere destructively, or can they somehow be handled in much the
>same way in terms of attaching "metadata about encoding" of some sort
>to the relevant objects?

The CIF1 markup codes for sub- and superscripts, boldface, and italics are expressly _not_ part of the CIF1 specification, and I am comfortable with that approach.  It is ultimately a question of data type: do data types 'char' and 'uchar' convey plain text or styled text?  Does the answer vary by tag?  By application?  I submit that the answer *should* vary by tag, and *may* also vary by application.  That lifts it out of the scope of the CIF2 specification, into the realm of dictionaries and usage profiles.  The CIF1 Common Semantic Features document points in this direction when it remarks that "if it is necessary to convey more complex typographic information than is permitted by these special character codes and conventions, the entire text field should be of a richer content type [...]" (paragraph 37).


