
Re: [ddlm-group] Recommended character set and use restrictions

I can see the advantages of using Unicode in data values where one may wish to render text in some non-ASCII format, but is there any reason why data names should not be restricted (at least for the foreseeable future) to ASCII characters?  These names are assigned by COMCIFS and we are in no real danger of running out of ASCII data names.  One day we may need to write our dictionaries in Arabic, but I doubt that any of us will be around when that happens.  If we allowed non-ASCII characters only in delimited strings we would meet all the needs of the community for many years to come, and save ourselves a lot of grief trying to sort out which code points to allow.


Bollinger, John C wrote:
On Friday, June 18, 2010 5:08 PM, Herbert J. Bernstein wrote:

The current specification is in terms of what is included, rather than
what is excluded:

#x9 #xA #xD
#x20 - #xD7FF
#xE000 - #xFFFD
#x10000 - #x10FFF
Yes, but do note that the spec contains an apparent typo in the last line: the upper bound should almost certainly be U+10FFFF, the upper bound of the Unicode code point range.  There are already many Unicode characters assigned to code points greater than U+10FFF (roughly 20% of all assignments).  Also, XML, which the CIF spec references as its guide in this regard, allows characters up to U+10FFFF.
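For illustration only, here is a minimal Python sketch of a membership test against the included ranges quoted above, assuming the last upper bound is meant to be #x10FFFF rather than #x10FFF. The names ALLOWED_RANGES and is_allowed are hypothetical and do not come from the spec.

```python
# Included ranges as quoted from the current spec, with the last upper
# bound taken to be #x10FFFF (assumed intent; the spec text says #x10FFF).
ALLOWED_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]


def is_allowed(ch):
    """Return True if the single character ch lies in an included range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)
```

With the bound as literally written (#x10FFF), any supplementary character above U+10FFF, such as U+1F600, would be rejected by the same test.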


I think John is proposing that the included set become:

#x9 #xA #xD
#x20 - #x7E
#xA0 - #xD7FF
#xE000 - #xFDCF
#xFDF0 - #xFFFD
#x10000 - #x10FFD
Basically, yes, but the upper bound should be U+10FFFD, and there are additional designated non-characters that I propose to exclude:
Unicode explicitly guarantees that no character will ever be assigned to any of these 30 code points, or to the other designated non-character code points among the excluded characters on Herb's list.  They are "meaningless", as Unicode puts it at one point.
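The list of code points itself does not appear above. As a hedged sketch, assuming the 30 code points meant are the last two code points of each of planes 1 through 15 (the designated non-characters that the quoted ranges would otherwise admit), they can be enumerated in Python as follows. The function name noncharacters and the variable extra are illustrative only.

```python
def noncharacters():
    """Return the 66 code points Unicode designates as non-characters."""
    result = list(range(0xFDD0, 0xFDF0))      # U+FDD0..U+FDEF, 32 code points
    for plane in range(17):                   # planes 0 through 16
        base = plane * 0x10000
        result += [base + 0xFFFE, base + 0xFFFF]
    return sorted(result)


# Non-characters that still fall inside the range #x10000 - #x10FFFD,
# i.e. those not already excluded by the quoted list (assumed reading).
extra = [cp for cp in noncharacters() if 0x10000 <= cp <= 0x10FFFD]
```

Under that assumption extra holds U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF and so on up to U+FFFFE, U+FFFFF.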

In addition, he is proposing to exclude the general class of "non-printing
Unicode characters" from tag names. For many of these he is right
that they would not be missed, but some of them are essential to
correct rendering of some languages (e.g. Arabic, in which the joining
or non-joining of characters is essential to proper rendering).
My argument is only partly that these characters would not be missed.  It is also that those that might be missed would be harmful to allow outside quoted text.  Herb is right to raise the question of rendering, because that's entirely what the non-printing character issue is about.  None of these distinctions are likely to confuse a parsing program.

Consider, for example, the potential save frame code consisting of only character U+0600 ("Arabic Number Sign").  This is a formatting character -- it does not have an independent glyph, but rather directs a Unicode renderer to present a different form of one or more subsequent digits (or preceding ones, if you look at it from an Arabic right-to-left point of view).  A Unicode text presentation system, e.g. an editor, might do any one of several things with that, from altering the presentation of the preceding "save_" through presenting a placeholder character, to displaying no visible representation at all.  I judge the last alternative fairly likely, and it is the resulting confusing appearance that concerns me (in this example, a save frame header rendered identically to a save frame terminator).

Similar considerations apply to such a character appearing in or as a data block code or whitespace-delimited value.
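To make the example concrete, here is a small Python sketch using the standard unicodedata module. It only shows that the two strings are distinct to a program while one of them carries a format character; how an editor actually displays them is not something the code can demonstrate. The variable names are illustrative.

```python
import unicodedata

ARABIC_NUMBER_SIGN = "\u0600"

# A save frame header whose frame code is the single character U+0600,
# next to the save frame terminator it might be mistaken for on screen.
header = "save_" + ARABIC_NUMBER_SIGN
terminator = "save_"

name = unicodedata.name(ARABIC_NUMBER_SIGN)          # official character name
category = unicodedata.category(ARABIC_NUMBER_SIGN)  # general category code
distinct = header != terminator                      # a parser sees a difference
```

A parser comparing the strings has no difficulty; the concern is purely with what a person sees.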

I would suggest
a careful review of the non-printing characters one by one before
making a firm decision on which ones to exclude from tag names, but
the general idea that a tag should be made up of characters that
either print or make a clear change in the way in which the
characters print seems sensible to me.
The current spec does not specify a particular version of Unicode, and I have interpreted that to mean CIF is intended for use with any past, present, or future version.  I therefore recommended excluding whole Unicode categories, because doing so will make CIF more resilient with respect to future Unicode updates.  That is, if the general idea is accepted that non-printing characters should not appear in data names etc., then I think it wiser to formulate that part of the specification such that it is not subject to review and possible update with each Unicode update.  If necessary, specific characters could be added back based on a review such as Herb suggests.
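A category-based rule of this kind might be sketched in Python as below. The set of excluded general categories is an assumption made for illustration (control, format, and line and paragraph separator characters), not a list taken from the proposal, and the names EXCLUDED_CATEGORIES and offending_chars are hypothetical.

```python
import unicodedata

# Hypothetical choice of excluded general categories:
# Cc = control, Cf = format, Zl = line separator, Zp = paragraph separator.
EXCLUDED_CATEGORIES = {"Cc", "Cf", "Zl", "Zp"}


def offending_chars(name):
    """Return the characters of name whose general category is excluded."""
    return [ch for ch in name
            if unicodedata.category(ch) in EXCLUDED_CATEGORIES]
```

For example, offending_chars("_cell_length_a") returns an empty list, while a name containing U+200C (zero width non-joiner) or U+FEFF is flagged. The result depends on the Unicode database bundled with the Python interpreter (see unicodedata.unidata_version), but the rule itself needs no rewording when new characters are assigned to those categories.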

On the other hand, these issues present no problem for a parser, so CIF has no overriding need to address them at all.  I made the non-printing character suggestion largely because it seemed to dovetail with the discussion of what should be done with embedded U+FEFF (which is one of the non-printing characters that would be affected).  I assume that it is already understood that certain character errors in a CIF could cause its visual presentation to be confusing and/or inconsistent, and that a determined person could craft a well-formed CIF whose Unicode-compliant visual presentation was either very misleading or a nasty hash.  Such a CIF would serve no useful scientific purpose I can see, however, so maybe CIF does not need to formally invalidate it.


John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital


ddlm-group mailing list

fn:I.David Brown
org:McMaster University;Brockhouse Institute for Materials Research
adr:;;King St. W;Hamilton;Ontario;L8S 4M1;Canada
title:Professor Emeritus
tel;work:+905 525 9140 x 24710
tel;fax:+905 521 2773
