Discussion List Archives

Re: [ddlm-group] Recommended character set and use restrictions.


On Friday, June 18, 2010 5:08 PM, Herbert J. Bernstein wrote:

>The current specification is in terms of what is included, rather than
>what is excluded:
>
>#x9 #xA #xD
>#x20 - #xD7FF
>#xE000 - #xFFFD
>#x10000 - #x10FFF

Yes, but do note that the spec contains an apparent typo in the last line: the upper bound should almost certainly be U+10FFFF, the upper bound of the Unicode code point range.  There are already many Unicode characters assigned to code points greater than U+10FFF (roughly 20% of all assignments).  Also, XML, which the CIF spec references as its guide in this regard, allows characters up to U+10FFFF.
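
To make the effect of the apparent typo concrete, here is a minimal Python sketch of a membership test over the ranges quoted above, once as written and once with the last upper bound corrected to U+10FFFF. The names RANGES_AS_WRITTEN, RANGES_CORRECTED and in_ranges are hypothetical, chosen only for illustration; the corrected bound is my reading of the intent, not something the spec text states.

```python
# Ranges as quoted from the current specification (last bound as written).
RANGES_AS_WRITTEN = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFF),
]

# Same ranges, assuming the last upper bound was meant to be U+10FFFF.
RANGES_CORRECTED = RANGES_AS_WRITTEN[:-1] + [(0x10000, 0x10FFFF)]


def in_ranges(code_point, ranges):
    """Return True if code_point falls inside any inclusive (low, high) range."""
    return any(low <= code_point <= high for low, high in ranges)


# U+1D11E (MUSICAL SYMBOL G CLEF) is an assigned character above U+10FFF.
G_CLEF = 0x1D11E
rejected_as_written = not in_ranges(G_CLEF, RANGES_AS_WRITTEN)
accepted_when_corrected = in_ranges(G_CLEF, RANGES_CORRECTED)
```

Under the ranges as written, any assigned character between U+11000 and U+10FFFF would be rejected, which is presumably not what was intended.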

[...]

>I think John is proposing that the included set become:
>
>#x9 #xA #xD
>#x20 - #x7E
>#xA0 - #xD7FF
>#xE000 - #xFDCF
>#xFDF0 - #xFFFD
>#x10000 - #x10FFD

Basically, yes, but the upper bound should be U+10FFFD, and there are additional designated non-characters that I propose to exclude:
        U+1FFFE U+1FFFF U+2FFFE U+2FFFF U+3FFFE U+3FFFF ... U+FFFFE U+FFFFF.
Unicode explicitly guarantees that no character will ever be assigned to any of these 30 code points, or to the other designated non-character code points among the excluded characters on Herb's list.  They are "meaningless", as Unicode puts it at one point.
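
For concreteness, the proposed set could be sketched in Python roughly as follows. This is only an illustration of the ranges and exclusions listed above; the names PROPOSED_RANGES, PLANE_NONCHARACTERS and is_allowed_proposed are hypothetical and do not come from any CIF document.

```python
# Included ranges from the proposal, with the last upper bound at U+10FFFD.
PROPOSED_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0x7E),
    (0xA0, 0xD7FF),
    (0xE000, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0x10FFFD),
]

# U+1FFFE, U+1FFFF, U+2FFFE, ..., U+FFFFE, U+FFFFF: the last two code
# points of each of planes 1 through 15.
PLANE_NONCHARACTERS = {
    (plane << 16) | low
    for plane in range(1, 16)
    for low in (0xFFFE, 0xFFFF)
}


def is_allowed_proposed(code_point):
    """Return True if code_point is in the proposed included set."""
    if code_point in PLANE_NONCHARACTERS:
        return False
    return any(low <= code_point <= high for low, high in PROPOSED_RANGES)
```

The set comprehension yields exactly the 30 code points mentioned above; U+10FFFE and U+10FFFF are handled by the upper bound of the last range rather than by the explicit set.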

>In addition, he is proposing to exclude the general class of "non-printing
>Unicode characters" from tag names. For many of these he is right
>that they would not be missed, but some of them are essential to
>correct rendering
>of some languages (e.g. Arabic in which the joining or non-joining
>of characters is essential to proper rendering).

My argument is only partly that these characters would not be missed.  It is also that those that might be missed would be harmful to allow outside quoted text.  Herb is right to raise the question of rendering, because that's entirely what the non-printing character issue is about.  None of these characters is likely to confuse a parsing program; the concern is what a human reader sees.

Consider, for example, the potential save frame code consisting of only the character U+0600 ("Arabic Number Sign").  This is a formatting character -- it does not have an independent glyph, but rather directs a Unicode renderer to present a different form of one or more subsequent digits (or preceding ones, if you look at it from an Arabic right-to-left point of view).  A Unicode text presentation system, e.g. an editor, might do any one of several things with that, from altering the presentation of the preceding "save_", through presenting a placeholder character, to displaying no visible representation at all.  I judge the last alternative fairly likely, and it is the resulting confusing appearance that concerns me (in this example, a save frame header rendered identically to a save frame terminator).

Similar considerations apply to such a character appearing in or as a data block code or whitespace-delimited value.
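
The save frame example can be mimicked in a few lines of Python. The function visible_skeleton below is a hypothetical and deliberately crude stand-in for a renderer that draws nothing for format (category Cf) characters; real editors may behave differently, as noted above.

```python
import unicodedata

header = "save_\u0600"   # save frame header whose frame code is just U+0600
terminator = "save_"     # save frame terminator


def visible_skeleton(text):
    """Drop format (Cf) characters, imitating a renderer that shows nothing
    for them. This is an assumption for illustration, not real rendering."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# A parser compares code points, so the two strings are plainly different.
parser_sees_difference = header != terminator

# A reader looking at the imitated rendering cannot tell them apart.
reader_sees_difference = visible_skeleton(header) != visible_skeleton(terminator)
```

Here parser_sees_difference is True while reader_sees_difference is False, which is the whole of the concern: the file parses fine but reads wrongly.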

>  I would suggest
>a careful review of the non-printing characters one by one before
>making a firm decision on which ones to exclude from tag names, but
>the general idea that a tag should be made of characters that
>either print or which make a clear change in the way in which the
>characters print seems sensible to me.

The current spec does not specify a particular version of Unicode, and I have interpreted that to mean CIF is intended for use with any past, present, or future version.  I therefore recommended excluding whole Unicode categories, because doing so will make CIF more resilient with respect to future Unicode updates.  That is, if the general idea is accepted that non-printing characters should not appear in data names etc., then I think it wiser to formulate that part of the specification such that it is not subject to review and possible update with each Unicode update.  If necessary, specific characters could be added back based on a review such as Herb suggests.
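
A category-based rule of this kind might look roughly like the following Python sketch. The particular set of excluded general categories is an assumption made for illustration, not the exact list under discussion, and find_nonprinting is a hypothetical helper name.

```python
import unicodedata

# Assumed set of "non-printing" general categories (illustrative only):
# control, format, unassigned, private use, surrogate, line and paragraph
# separators.
EXCLUDED_CATEGORIES = {"Cc", "Cf", "Cn", "Co", "Cs", "Zl", "Zp"}


def find_nonprinting(text, excluded=EXCLUDED_CATEGORIES):
    """Return (index, 'U+XXXX', category) for each excluded character."""
    hits = []
    for index, ch in enumerate(text):
        category = unicodedata.category(ch)
        if category in excluded:
            hits.append((index, "U+%04X" % ord(ch), category))
    return hits


clean = find_nonprinting("_atom_site_label")
with_bom = find_nonprinting("_atom\ufeff_site")    # embedded U+FEFF
with_zwj = find_nonprinting("_name\u200dsuffix")   # zero width joiner
```

The rule itself never changes; only the category data behind unicodedata.category does, and that tracks whatever Unicode version the Python build ships (reported by unicodedata.unidata_version). Note that the zero width joiner is caught as well, which is the sort of character Herb's Arabic example suggests might need to be added back.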

On the other hand, these issues present no problem for a parser, so CIF has no overriding need to address them at all.  I made the non-printing character suggestion largely because it seemed to dovetail with the discussion of what should be done with embedded U+FEFF (which is one of the non-printing characters that would be affected).  I assume that it is already understood that certain character errors in a CIF could cause its visual presentation to be confusing and/or inconsistent, and that a determined person could craft a well-formed CIF whose Unicode-compliant visual presentation was either very misleading or a nasty hash.  Such a CIF would serve no useful scientific purpose I can see, however, so maybe CIF does not need to formally invalidate it.


Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital


