Discussion List Archives

Re: [ddlm-group] Recommended character set and use restrictions

The current specification is expressed in terms of what is included,
rather than what is excluded:

#x9 #xA #xD
#x20 - #xD7FF
#xE000 - #xFFFD
#x10000 - #x10FFFF
The characters #xE000-#xF8FF are reserved for private use, and the
IUCr can specify what these characters must be.
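
For illustration only, the included set above can be written as a small
membership test. This is a sketch, not part of any CIF specification:
the names CURRENT_RANGES, in_current_set and is_private_use are made up
for this example, and the last range is assumed to end at #x10FFFF, the
top of the Unicode code space, as in XML's Char production.

```python
# Hypothetical sketch: the "included" set as inclusive code point ranges.
CURRENT_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),   # tab, line feed, carriage return
    (0x20, 0xD7FF),                        # everything up to the surrogates
    (0xE000, 0xFFFD),                      # rest of the BMP, minus U+FFFE/U+FFFF
    (0x10000, 0x10FFFF),                   # assumption: upper bound as in XML
]


def in_current_set(ch):
    """Return True if the single character ch falls in one of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CURRENT_RANGES)


def is_private_use(ch):
    """Return True for the BMP private use area, #xE000-#xF8FF."""
    return 0xE000 <= ord(ch) <= 0xF8FF
```

Under these ranges a C1 control such as U+0085 is still accepted, while
U+0000 and U+FFFF are not.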

I think John is proposing that the included set become:

#x9 #xA #xD
#x20 - #x7E
#xA0 - #xD7FF
#xE000 - #xFDCF
#xFDF0 - #xFFFD
#x10000 - #x10FFD

In addition, he is proposing to exclude the general class of "non-printing
Unicode characters" from tag names. For many of these he is right that
they would not be missed, but some of them are essential to the correct
rendering of some languages (e.g. Arabic, in which the joining or
non-joining of characters is essential to proper rendering). I would
suggest a careful review of the non-printing characters, one by one,
before making a firm decision on which ones to exclude from tag names,
but the general idea that a tag should be made up of characters that
either print or clearly change the way in which other characters print
seems sensible to me.
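
As a starting point for such a review, here is a sketch that flags the
characters in a candidate tag name that fall in the general categories
listed in John's message below (Cc, Cf, Zl, Zp, Zs). It uses Python's
standard unicodedata module; the names NON_PRINTING and
non_printing_chars are made up for this example, and the categories
reported depend on the Unicode version bundled with the interpreter.

```python
import unicodedata

# General categories named in the quoted proposal.
NON_PRINTING = {"Cc", "Cf", "Zl", "Zp", "Zs"}


def non_printing_chars(tag):
    """List (code point, category) for each flagged character in tag."""
    flagged = []
    for ch in tag:
        category = unicodedata.category(ch)
        if category in NON_PRINTING:
            flagged.append((f"U+{ord(ch):04X}", category))
    return flagged


# U+200C ZERO WIDTH NON-JOINER is a format character (Cf) that affects
# joining behaviour in scripts such as Arabic.
example = non_printing_chars("_cell\u200clength")
# example == [("U+200C", "Cf")]
```

A blanket exclusion by category would therefore reject U+200C along
with characters such as U+FEFF, which is why I would look at them one
at a time.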

   -- Herbert

At 4:05 PM -0500 6/18/10, Bollinger, John C wrote:
>Hello All,
>
>The current spec excludes most ASCII control characters as well as 
>code points U+FFFE and U+FFFF from the CIF character set, apparently 
>following XML.  I think it would be wise to exclude also the "C1 
>Controls" and all the other permanent non-characters (which is also 
>XML's recommendation, if that gives it extra authority).  Excluding 
>the C1 control characters should be justified by the same logic that 
>justifies excluding most of the ASCII controls.  Excluding the 
>non-characters is appropriate because Unicode formally specifies 
>that they have no meaning and never will have.  The additional 
>excluded characters would be U+007F - U+009F (except possibly 
>allowing U+0085 "next line"), U+FDD0 - U+FDEF, and all code points 
>of the form U+xFFFE or U+xFFFF for x = any hex digit or 10.
>
>Furthermore, I suggest that all non-printing Unicode characters be 
>forbidden from use except in quoted data values, where "non-printing 
>Unicode characters" includes all in Unicode general categories Cc, 
>Cf, Zl, Zp, and Zs.  Some of these are intended to affect the 
>formatting of characters near them, and some are spaces of various 
>lengths and characteristics, but many of them have no visual 
>representation at all.  They do not present a problem from an 
>automated processing perspective, but they could cause a great deal 
>of confusion for humans.  (For what it's worth, U+FEFF is in 
>category Cf.)
>
>John C. Bollinger, Ph.D.
>Department of Structural Biology
>St. Jude Children's Research Hospital

  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

International Union of Crystallography