Re: [ddlm-group] UTF-8 BOM
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] UTF-8 BOM
- From: James Hester <jamesrhester@gmail.com>
- Date: Mon, 21 Jun 2010 16:00:24 +1000
- In-Reply-To: <alpine.BSF.2.00.1006181057080.92557@epsilon.pair.com>
- References: <alpine.BSF.2.00.1005111250250.60002@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54165DF337E1@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005181330210.38662@epsilon.pair.com><AANLkTimOLbOkIqCwqgsKJ36eVctlZccsAN4XAjYDr4Qd@mail.gmail.com><20100614142541.GA356@emerald.iucr.org><8F77913624F7524AACD2A92EAF3BFA54165DF3381E@SJMEMXMBS11.stjude.sjcrh.local><AANLkTikeIbft9SKfvpgTpGZVpo47Vg_acYBbXi-eUvU-@mail.gmail.com><alpine.BSF.2.00.1006152223480.59900@epsilon.pair.com><AANLkTimmOPFkQhY1KY24Dg5kz3MUB4mO2sjoM848bqjV@mail.gmail.com><alpine.BSF.2.00.1006160719520.58405@epsilon.pair.com><881462.27872.qm@web87009.mail.ird.yahoo.com><AANLkTin51hXra-cIPzH3VMcUxJHMaUPWL71Kf1zM8SNt@mail.gmail.com><alpine.BSF.2.00.1006172025070.91418@epsilon.pair.com><AANLkTimEn-5bOcLNsa1DSOjDS7XqFmqVKA-W-6Z4NxFO@mail.gmail.com><alpine.BSF.2.00.1006172107430.91418@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA541661229514@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1006181057080.92557@epsilon.pair.com>
So, can I count your vote on code point 0xFEFF as 'always a syntax error in the code point stream' (option 2(a))?

On Sat, Jun 19, 2010 at 1:10 AM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
> Ah, but I remember the days when FEFF was also a non-character and I am
> happy to see it deprecated and headed back to that status. In a UCS-2
> byte stream it is a disaster as a code point, causing confusion with
> a UTF-16 big-endian BOM. It was a mistake ever to make it anything
> other than a BOM, and we would be making a mistake in perpetuating
> a deprecated use.
>
> I suppose COMCIFS could decide to change the specification of imgCIF
> and diverge from the already-established use of UCS-2/UTF-16 in the
> bin-utf encoding of images, but why do that?
>
> Nothing is gained by COMCIFS diverging from Unicode and XML practice
> on handling BOMs.
>
> In any case, the real question is whether CIF2 will be itself a binary
> format, or whether CIF2 will be a text format. I think we would serve
> the community best with a text format for at least the next ten years.
>
>
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  yaya@dowling.edu
> =====================================================
>
> On Fri, 18 Jun 2010, Bollinger, John C wrote:
>
>> We have already acknowledged several times that use of U+FEFF as a
>> ZWNBSP is deprecated by Unicode, but that does not in any way make it an
>> invalid character. Its definition in the code chart (as something other
>> than "not a character") in fact shows that it *is* a valid character.
>>
>> Unicode explicitly specifies that certain code points have not been and
>> never will be assigned to characters; they are called "not a character"
>> or "non-character" in Unicode-speak. Those are to be distinguished from
>> "unassigned code points" and also from deprecated characters and uses.
>> U+FFFE is a non-character, but U+FEFF is merely deprecated (as ZWNBSP).
>> Unicode in fact adopts an approach similar to COMCIFS's, and for similar
>> reasons: once a character is defined, it has the defined meaning
>> forever. U+FEFF's significance as ZWNBSP therefore will never be removed
>> from Unicode.
>>
>> I daresay most of us have code in production that relies on deprecated
>> features of various programming languages, and some of us may even from
>> time to time write new code relying on such features. These programs
>> are not for that reason non-conformant. Similarly, let's not
>> overinterpret Unicode's deprecation of U+FEFF's use as ZWNBSP. Unicode
>> does advise against that use in new documents, but it does not forbid
>> it. We can choose.
>>
>> Inasmuch as all CIF2 documents will be new documents, CIF2 could
>> incorporate Unicode's recommendation as a requirement. That would allow
>> U+FEFF to be reserved for use as an encoding switch or other protocol
>> metacharacter, but at the cost of creating a new way for otherwise
>> perfectly acceptable, innocently created CIFs to be ill-formed. For
>> example, imagine using Unicode-aware (but not CIF-specific) tools to
>> copy perfectly good text from an existing manuscript into a CIF text
>> block, and having the result turn out to be invalid because it contains
>> U+FEFF (used as ZWNBSP). This is a much more likely scenario than
>> U+FEFF introduced by concatenation of CIFs, and it is inconsistent with
>> this group's continued interest in keeping CIF compatible with
>> general-purpose tools.
>>
>> There are plenty of non-character code points in Unicode that could be
>> used in imgCIF or other protocols as escape characters to introduce
>> meta-functionality such as an encoding switch. These include U+FDD0 -
>> U+FDEF, U+FFFE (maybe), U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, ...
>> U+10FFFE, and U+10FFFF. (NOTE: other than U+FFFE and U+FFFF, none of
>> these are explicitly forbidden in the current CIF2 draft, but they can
>> and should be. They are non-characters.) For example, U+FDD0 (encoded
>> appropriately) could be inserted to signal the end of a sub-stream,
>> after which an imgCIF parser could perform encoding detection and any
>> other start-of-stream behavior it deems appropriate on whatever follows.
>> Alternatively, a defined Unicode character that is excluded from CIF,
>> such as U+0000 (null), U+0004 (end-of-transmission), or another ASCII
>> control character could serve this purpose.
>>
>> See also comments in-line below.
>>
>> On Thursday, June 17, 2010 8:31 PM, Herbert J. Bernstein wrote:
>>> Sorry, you are mistaken. What the code chart says is:
>>>
>>> Special
>>>   FEFF  ZERO WIDTH NO-BREAK SPACE
>>>         = BYTE ORDER MARK (BOM), ZWNBSP
>>>         may be used to detect byte order by contrast
>>>         with the noncharacter code point FFFE
>>>         use as an indication of non-breaking is
>>>         deprecated; see 2060 instead
>>>         -> 200B zero width space
>>>         -> 2060 word joiner
>>>         -> FFFE <not a character>
>>>
>>> So, under the latest version of unicode, the use you are describing is
>>> deprecated. The unicode consortium has the character back to what it
>>> originally was -- the BOM, which is not a character, and I intend to
>>> process it that way, not in the very odd way that some people followed
>>> for a few recent Unicode versions that made no sense and has now been
>>> deprecated.
>>
>> As I observed above, deprecating the use of U+FEFF as ZWNBSP is not at all the same thing as removing that meaning of the character. U+FEFF in the body of a Unicode document is, and always will be, a ZWNBSP. Applications and protocols are, however, *allowed* to reject such documents without violating Unicode. That they are allowed to do so is not an argument that they must do so.
>>
>>> In theory there could be old unicode UTF-8 files somehow with stray
>>> FEFF characters in them as code points, but inasmuch as CIF2 is new, we
>>> are all spared the puzzlement of dealing with this non-problem of
>>> dealing with a noncharacter which became a strange character and is
>>> now again a noncharacter.
>>
>> U+FEFF is still a character, and always will be. I presented above a reasonable scenario for how and why CIF2 might need to deal with embedded U+FEFF. The problem for CIF is that it serves as a protocol for combining multiple types of information from multiple sources, and therefore it is exposed to the behavior, quirks, and foibles of those sources.
>>
>>> In addition, if you read the bizarre discussions on FEFF when people
>>> were trying to use it as a code point instead of just stopping it at
>>> the text processing level, you will see that the only thing they
>>> could do with it was throw it away (that is what a zero width no-break
>>> space means)
>>
>> Do you have an example? Unicode has provided U+2060 as a replacement for U+FEFF's ZWNBSP function, so evidently they believe that there is use for it other than just throwing it away. I have in fact used U+2060 myself. Its designation "word joiner" is a better functional description than is ZWNBSP, but ultimately they mean the same thing.
>>
>>> The _only_ fully compliant use for FEFF in the current standard is as
>>> a BOM, not as a valid code point, so it is not really an issue for CIF2
>>> any more than FFFF or FFFE are, none of which should be delivered as
>>> code points in text processing.
>>
>> I'm sorry, but I don't see the Unicode standard or ancillary documents supporting that conclusion. CIF2 may choose to disallow U+FEFF outside its function as a BOM, but that is *not* a requirement for Unicode compliance. The deprecation is primarily a warning to *users*, not a conformance constraint on programs or specifications.
>>
>>
>> Regards,
>>
>> John
>> --
>> John C. Bollinger, Ph.D.
>> Department of Structural Biology
>> St. Jude Children's Research Hospital
>>
>> Email Disclaimer: www.stjude.org/emaildisclaimer

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
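
For readers following the thread, here is a minimal Python sketch of the two positions being voted on. It is illustrative only and is not taken from any CIF2 draft or parser. The names `is_noncharacter`, `decode_cif2_text`, and the flag `reject_embedded_feff` are hypothetical. It assumes a UTF-8 byte stream, assumes that a single leading U+FEFF is consumed as a BOM before the code point stream is examined, and assumes that all Unicode non-characters are rejected, as suggested in the quoted message.

```python
BOM = "\ufeff"  # U+FEFF: byte order mark / (deprecated) ZERO WIDTH NO-BREAK SPACE


def is_noncharacter(cp: int) -> bool:
    """Return True for the Unicode non-characters:
    U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in each of the 17 planes."""
    return (0xFDD0 <= cp <= 0xFDEF) or (cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE)


def decode_cif2_text(data: bytes, reject_embedded_feff: bool = True) -> str:
    """Hypothetical helper: decode UTF-8, drop one leading BOM, then validate.

    reject_embedded_feff=True models 'always a syntax error in the code
    point stream'; False models accepting U+FEFF in the body as ZWNBSP.
    """
    text = data.decode("utf-8")  # strict decoding; malformed bytes raise
    if text.startswith(BOM):
        text = text[1:]  # assumption: a leading U+FEFF is a BOM, not content
    for offset, ch in enumerate(text):
        cp = ord(ch)
        if is_noncharacter(cp):
            raise ValueError(f"non-character U+{cp:04X} at offset {offset}")
        if reject_embedded_feff and ch == BOM:
            raise ValueError(f"embedded U+FEFF at offset {offset}")
    return text
```

With the default flag, text pasted from a manuscript that contains U+FEFF as ZWNBSP is rejected, which is the cost described in the quoted message. With `reject_embedded_feff=False` the same text is accepted and the U+FEFF is passed through unchanged.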