[ddlm-group] Recommended character set and use restrictions

Hello All,

The current spec excludes most ASCII control characters as well as code points U+FFFE and U+FFFF from the CIF character set, apparently following XML.  I think it would be wise also to exclude the "C1 Controls" and all the other permanent non-characters (which is also XML's recommendation, if that gives it extra authority).  Excluding the C1 control characters is justified by the same logic that justifies excluding most of the ASCII controls.  Excluding the non-characters is appropriate because Unicode formally specifies that they have no meaning and never will have.  The additional excluded characters would be U+007F – U+009F (except possibly allowing U+0085 "next line"), U+FDD0 – U+FDEF, and all code points of the form U+xFFFE or U+xFFFF for x = any hex digit or 10 (that is, the last two code points of every plane).
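
For illustration only, the additional exclusions listed above could be tested with something like the following Python sketch.  The function name and the allow_nel switch are hypothetical, not part of any spec; the switch only reflects the open question of whether U+0085 should remain allowed.

```python
def is_proposed_excluded(cp: int, allow_nel: bool = True) -> bool:
    """Return True if code point cp is in the additional exclusions proposed above.

    allow_nel is a hypothetical switch: when True, U+0085 "next line" is
    exempted from the U+007F - U+009F range.
    """
    # U+007F - U+009F, optionally sparing U+0085
    if 0x7F <= cp <= 0x9F:
        return not (allow_nel and cp == 0x85)
    # Non-characters U+FDD0 - U+FDEF
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+xFFFE and U+xFFFF: the last two code points of each plane 0 through 0x10
    if 0 <= cp <= 0x10FFFF and (cp & 0xFFFF) in (0xFFFE, 0xFFFF):
        return True
    return False
```

Note that the last test also matches U+FFFE and U+FFFF themselves, which the current spec already excludes.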

Furthermore, I suggest that all non-printing Unicode characters be forbidden everywhere except in quoted data values, where "non-printing Unicode characters" means all characters in Unicode general categories Cc, Cf, Zl, Zp, and Zs.  Some of these are intended to affect the formatting of characters near them, and some are spaces of various lengths and characteristics, but many of them have no visual representation at all.  They do not present a problem from an automated processing perspective, but they could cause a great deal of confusion for humans.  (For what it's worth, U+FEFF is in category Cf.)
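
As a rough sketch of how the category test above might be applied, the following Python uses the standard unicodedata module to locate such characters in a piece of text found outside a quoted data value.  The function name is hypothetical.  The exemption for ASCII space, tab, LF, and CR is an assumption of the sketch (those characters fall in Zs or Cc, but CIF syntax presumably still needs them as whitespace); it is not something stated in the proposal.

```python
import unicodedata

# General categories named in the proposal above
NON_PRINTING_CATEGORIES = {"Cc", "Cf", "Zl", "Zp", "Zs"}

# Assumption of this sketch: ordinary ASCII whitespace remains allowed
ASSUMED_ALLOWED = {" ", "\t", "\n", "\r"}


def non_printing_positions(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for each non-printing character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch not in ASSUMED_ALLOWED
        and unicodedata.category(ch) in NON_PRINTING_CATEGORIES
    ]
```

A caller would report any returned positions as errors for unquoted text, and skip the check entirely inside quoted data values.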


John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer:  www.stjude.org/emaildisclaimer
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group