Re: [ddlm-group] Recommended character set and use restrictions
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Recommended character set and use restrictions
- From: James Hester <jamesrhester@gmail.com>
- Date: Mon, 21 Jun 2010 16:17:51 +1000
- In-Reply-To: <a0624080ac84197c8f154@192.168.2.104>
- References: <AANLkTikPRP0zLmeWCde-UjR599qJBDP4ps8mpT2FB07E@mail.gmail.com><84803.69690.qm@web87001.mail.ird.yahoo.com><a0624080ac84197c8f154@192.168.2.104>
I agree with John's suggestions, and Herbert makes a good point that we
may be unduly restricting the languages in which tags can be written by
excluding non-printing characters. May I suggest that the precise list of
excluded non-printing characters be given as an addendum to the main CIF2
syntax standard, allowing us to proceed with ratification of the main
standard?

So who is volunteering to come up with a list of non-printing characters
to be allowed/excluded from tags?

On Sat, Jun 19, 2010 at 8:07 AM, Herbert J. Bernstein
<yaya@bernstein-plus-sons.com> wrote:
> The current specification is in terms of what is included, rather than
> what is excluded:
>
>   #x9 #xA #xD
>   #x20 - #xD7FF
>   #xE000 - #xFFFD
>   #x10000 - #x10FFF
>
> The characters #xE000-#xF8FF are reserved for private use, and the
> IUCr can specify what these characters must be.
>
> I think John is proposing that the included set become:
>
>   #x9 #xA #xD
>   #x20 - #x7E
>   #xA0 - #xD7FF
>   #xE000 - #xFDCF
>   #xFDF0 - #xFFFD
>   #x10000 - #x10FFD
>
> In addition, he is proposing to exclude the general class of
> "non-printing Unicode characters" from tag names. For many of these he
> is right that they would not be missed, but some of them are essential
> to correct rendering of some languages (e.g. Arabic, in which the
> joining or non-joining of characters is essential to proper rendering).
> I would suggest a careful review of the non-printing characters one by
> one before making a firm decision on which ones to exclude from tag
> names, but the general idea that a tag should be made up of characters
> that either print or make a clear change in the way in which the
> characters print seems sensible to me.
>
> -- Herbert
>
> At 4:05 PM -0500 6/18/10, Bollinger, John C wrote:
>>Hello All,
>>
>>The current spec excludes most ASCII control characters as well as
>>code points U+FFFE and U+FFFF from the CIF character set, apparently
>>following XML. I think it would be wise to exclude also the "C1
>>Controls" and all the other permanent non-characters (which is also
>>XML's recommendation, if that gives it extra authority). Excluding
>>the C1 control characters should be justified by the same logic that
>>justifies excluding most of the ASCII controls. Excluding the
>>non-characters is appropriate because Unicode formally specifies
>>that they have no meaning and never will have. The additional
>>excluded characters would be U+007F - U+009F (except possibly
>>allowing U+0085 "next line"), U+FDD0 - U+FDEF, and all code points
>>of the form U+xFFFE or U+xFFFF for x = any hex digit or 10.
>>
>>Furthermore, I suggest that all non-printing Unicode characters be
>>forbidden from use except in quoted data values, where "non-printing
>>Unicode characters" includes all in Unicode general categories Cc,
>>Cf, Zl, Zp, and Zs. Some of these are intended to affect the
>>formatting of characters near them, and some are spaces of various
>>lengths and characteristics, but many of them have no visual
>>representation at all. They do not present a problem from an
>>automated processing perspective, but they could cause a great deal
>>of confusion for humans. (For what it's worth, U+FEFF is in
>>category Cf.)
>>
>>Regards,
>>
>>John
>>--
>>John C. Bollinger, Ph.D.
>>Department of Structural Biology
>>St. Jude Children's Research Hospital
>>
>>Email Disclaimer: www.stjude.org/emaildisclaimer
>>_______________________________________________
>>ddlm-group mailing list
>>ddlm-group@iucr.org
>>http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
> --
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
> Dowling College, Kramer Science Center, KSC 121
> Idle Hour Blvd, Oakdale, NY, 11769
>
> +1-631-244-3035
> yaya@dowling.edu
> =====================================================
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
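
Editorial sketch (not part of the original message): the two rules discussed in this thread could be checked roughly as follows. This is only an illustration under stated assumptions, not the CIF2 specification. The function names are hypothetical, the upper limit of 0x10FFFF is taken from John's wording "x = any hex digit or 10" rather than from the range lists quoted above, and whether U+0085 is allowed is left as a switch because the thread leaves it open. Category lookups depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata

# Inclusive ranges that John's message proposes to exclude in addition
# to what the current spec already leaves out.
EXTRA_EXCLUDED_RANGES = [
    (0x007F, 0x009F),  # DEL plus the C1 controls
    (0xFDD0, 0xFDEF),  # permanent non-characters
]

# General categories John's message calls "non-printing".
NONPRINTING_CATEGORIES = {"Cc", "Cf", "Zl", "Zp", "Zs"}


def in_proposed_charset(cp: int, allow_nel: bool = False) -> bool:
    """Return True if code point cp is in the proposed CIF character set.

    allow_nel is a hypothetical switch for the undecided U+0085 case.
    """
    if cp < 0 or cp > 0x10FFFF:  # assumed upper limit (plane 0x10)
        return False
    if cp < 0x20:
        # Only tab, line feed and carriage return survive below U+0020.
        return cp in (0x9, 0xA, 0xD)
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogates fall in the gap between #xD7FF and #xE000.
        return False
    if allow_nel and cp == 0x85:
        return True
    if any(lo <= cp <= hi for lo, hi in EXTRA_EXCLUDED_RANGES):
        return False
    # Code points of the form U+xFFFE or U+xFFFF in every plane.
    if (cp & 0xFFFF) in (0xFFFE, 0xFFFF):
        return False
    return True


def allowed_in_tag(ch: str) -> bool:
    """Return True if the single character ch could appear in a tag name
    under the blanket category-based exclusion John suggests."""
    if not in_proposed_charset(ord(ch)):
        return False
    return unicodedata.category(ch) not in NONPRINTING_CATEGORIES
```

Note that `allowed_in_tag("\u200d")` rejects ZERO WIDTH JOINER because it is in category Cf, which is exactly the kind of character Herbert's point about joining behaviour concerns. A reviewed, character-by-character list would replace the blanket category test here.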