
Re: [ddlm-group] Recommended character set and use restrictions

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] Recommended character set and use restrictions
From: "Herbert J. Bernstein" <[email protected]>
Date: Fri, 18 Jun 2010 18:07:44 -0400
In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA541661229519@SJMEMXMBS11.stjude.sjcrh.local>
References: <[email protected]><8F77913624F7524AACD2A92EAF3BFA541661229515@SJMEMXMBS11.stjude.sjcrh.local> <[email protected]><8F77913624F7524AACD2A92EAF3BFA541661229518@SJMEMXMBS11.stjude.sjcrh.local><8F77913624F7524AACD2A92EAF3BFA541661229519@SJMEMXMBS11.stjude.sjcrh.local>

The current specification is in terms of what is included, rather than
what is excluded:

#x9 #xA #xD
#x20 - #xD7FF
#xE000 - #xFFFD
#x10000 - #x10FFFF
The characters #xE000-#xF8FF are reserved for private use, and the
IUCr can specify what these characters must be.
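
As a rough illustration of the inclusion list above, here is a minimal
Python sketch of a membership test. The names CURRENT_RANGES,
in_current_set and is_private_use are made up for this example, and the
last range is assumed to end at #x10FFFF, as in the XML 1.0 Char
production that the list appears to follow.

```python
# Inclusive code point ranges of the current character set, as listed above.
# Assumption: the final range ends at 0x10FFFF (the XML 1.0 upper bound).
CURRENT_RANGES = [
    (0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]


def in_current_set(ch: str) -> bool:
    """Return True if the single character ch falls in one of the ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in CURRENT_RANGES)


def is_private_use(ch: str) -> bool:
    """Return True for the private-use block #xE000-#xF8FF mentioned above."""
    return 0xE000 <= ord(ch) <= 0xF8FF


if __name__ == "__main__":
    for sample in ["A", "\x00", "\x7f", "\ue000", "\uffff"]:
        print(f"U+{ord(sample):04X}", in_current_set(sample), is_private_use(sample))
```

Note that under these ranges U+007F and the C1 controls are still
accepted, which is the gap the proposal below addresses.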

I think John is proposing that the included set become:


#x9 #xA #xD
#x20 - #x7E
#xA0 - #xD7FF
#xE000 - #xFDCF
#xFDF0 - #xFFFD
#x10000 - #x10FFD

In addition, he is proposing to exclude the general class of "non-printing
Unicode characters" from tag names. For many of these he is right that
they would not be missed, but some of them are essential to correct
rendering of some languages (e.g. Arabic, in which the joining or
non-joining of characters is essential to proper rendering).  I would
suggest a careful review of the non-printing characters one by one before
making a firm decision on which ones to exclude from tag names, but
the general idea that a tag should be made up of characters that
either print or make a clear change in the way in which the
characters print seems sensible to me.
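
To make the one-by-one review concrete, here is a minimal Python sketch
that reports the non-printing characters in a tag name, using the
general categories John lists (Cc, Cf, Zl, Zp, Zs) and the standard
unicodedata module. The function name non_printing_in_tag and the
exceptions set are hypothetical. The two joiners in REVIEWED_EXCEPTIONS
(U+200C and U+200D, both category Cf) are only an example of characters
a review might decide to keep, not a decision.

```python
import unicodedata

# General categories treated as "non-printing" in John's message.
NON_PRINTING_CATEGORIES = {"Cc", "Cf", "Zl", "Zp", "Zs"}

# Hypothetical allow-list: zero width non-joiner and zero width joiner,
# which affect how neighbouring characters render in some scripts.
REVIEWED_EXCEPTIONS = frozenset({"\u200c", "\u200d"})


def non_printing_in_tag(tag, exceptions=frozenset()):
    """List (index, code point, category) for each flagged character in tag."""
    flagged = []
    for i, ch in enumerate(tag):
        category = unicodedata.category(ch)
        if category in NON_PRINTING_CATEGORIES and ch not in exceptions:
            flagged.append((i, f"U+{ord(ch):04X}", category))
    return flagged


if __name__ == "__main__":
    print(non_printing_in_tag("_cell_length_a"))
    print(non_printing_in_tag("_cell\ufefflength"))
    print(non_printing_in_tag("_a\u200cb", REVIEWED_EXCEPTIONS))
```

Passing a different exceptions set lets the same check be rerun as the
review settles which characters, if any, stay allowed in tag names.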

   -- Herbert



At 4:05 PM -0500 6/18/10, Bollinger, John C wrote:
>Hello All,
>
>The current spec excludes most ASCII control characters as well as 
>code points U+FFFE and U+FFFF from the CIF character set, apparently 
>following XML.  I think it would be wise to exclude also the "C1 
>Controls" and all the other permanent non-characters (which is also 
>XML's recommendation, if that gives it extra authority).  Excluding 
>the C1 control characters should be justified by the same logic that 
>justifies excluding most of the ASCII controls.  Excluding the 
>non-characters is appropriate because Unicode formally specifies 
>that they have no meaning and never will have.  The additional 
>excluded characters would be U+007F - U+009F (except possibly 
>allowing U+0085 "next line"), U+FDD0 - U+FDEF, and all code points 
>of the form U+xFFFE or U+xFFFF for x = any hex digit or 10.
>
>Furthermore, I suggest that all non-printing Unicode characters be 
>forbidden from use except in quoted data values, where "non-printing 
>Unicode characters" includes all in Unicode general categories Cc, 
>Cf, Zl, Zp, and Zs.  Some of these are intended to affect the 
>formatting of characters near them, and some are spaces of various 
>lengths and characteristics, but many of them have no visual 
>representation at all.  They do not present a problem from an 
>automated processing perspective, but they could cause a great deal 
>of confusion for humans.  (For what it's worth, U+FEFF is in 
>category Cf.)
>
>
>Regards,
>
>John
>--
>John C. Bollinger, Ph.D.
>Department of Structural Biology
>St. Jude Children's Research Hospital
>
>
>
>Email Disclaimer:  www.stjude.org/emaildisclaimer
>_______________________________________________
>ddlm-group mailing list
>[email protected]
>http://scripts.iucr.org/mailman/listinfo/ddlm-group


-- 
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  [email protected]
=====================================================
