Re: [ddlm-group] UTF-8 BOM
- To: Group finalising DDLm and associated dictionaries <[email protected]>
- Subject: Re: [ddlm-group] UTF-8 BOM
- From: James Hester <[email protected]>
- Date: Mon, 21 Jun 2010 16:00:24 +1000
So, can I count your vote on code point 0xFEFF as 'always a syntax error in the code point stream' (option 2(a))?

On Sat, Jun 19, 2010 at 1:10 AM, Herbert J. Bernstein <[email protected]> wrote:
> Ah, but I remember the days when FEFF was also a non-character and I am
> happy to see it deprecated and headed back to that status. In a UCS-2
> byte stream it is a disaster as a code point, causing confusion with
> a UTF-16 big-endian BOM. It was a mistake ever to make it anything
> other than a BOM, and we would be making a mistake in perpetuating
> a deprecated use.
>
> I suppose COMCIFS could decide to change the specification of imgCIF
> and diverge from the already-established use of UCS-2/UTF-16 in the
> bin-utf encoding of images, but why do that?
>
> Nothing is gained by COMCIFS diverging from Unicode and XML practice
> on handling BOMs.
>
> In any case, the real question is whether CIF2 will be itself a binary
> format, or whether CIF2 will be a text format. I think we would serve
> the community best with a text format for at least the next ten years.
>
>
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  [email protected]
> =====================================================
>
> On Fri, 18 Jun 2010, Bollinger, John C wrote:
>
>> We have already acknowledged several times that use of U+FEFF as a
>> ZWNBSP is deprecated by Unicode, but that does not in any way make it an
>> invalid character. Its definition in the code chart (as something other
>> than "not a character") in fact shows that it *is* a valid character.
>>
>> Unicode explicitly specifies that certain code points have not been and
>> never will be assigned to characters; they are called "not a character"
>> or "non-character" in Unicode-speak. Those are to be distinguished from
>> "unassigned code points" and also from deprecated characters and uses.
>> U+FFFE is a non-character, but U+FEFF is merely deprecated (as ZWNBSP).
>> Unicode in fact adopts an approach similar to COMCIFS's, and for similar
>> reasons: once a character is defined, it has the defined meaning
>> forever. U+FEFF's significance as ZWNBSP therefore will never be removed
>> from Unicode.
>>
>> I daresay most of us have code in production that relies on deprecated
>> features of various programming languages, and some of us may even from
>> time to time write new code relying on such features. These programs
>> are not for that reason non-conformant. Similarly, let's not
>> overinterpret Unicode's deprecation of U+FEFF's use as ZWNBSP. Unicode
>> does advise against that use in new documents, but it does not forbid
>> it. We can choose.
>>
>> Inasmuch as all CIF2 documents will be new documents, CIF2 could
>> incorporate Unicode's recommendation as a requirement. That would allow
>> U+FEFF to be reserved for use as an encoding switch or other protocol
>> metacharacter, but at the cost of creating a new way for otherwise
>> perfectly acceptable, innocently created CIFs to be ill-formed. For
>> example, imagine using Unicode-aware (but not CIF-specific) tools to
>> copy perfectly good text from an existing manuscript into a CIF text
>> block, and having the result turn out to be invalid because it contains
>> U+FEFF (used as ZWNBSP). This is a much more likely scenario than
>> U+FEFF introduced by concatenation of CIFs, and it is inconsistent with
>> this group's continued interest in keeping CIF compatible with
>> general-purpose tools.
>>
>> There are plenty of non-character code points in Unicode that could be
>> used in imgCIF or other protocols as escape characters to introduce
>> meta-functionality such as an encoding switch. These include U+FDD0 -
>> U+FDEF, U+FFFE (maybe), U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, ...
>> U+10FFFE, and U+10FFFF. (NOTE: other than U+FFFE and U+FFFF, none of
>> these are explicitly forbidden in the current CIF2 draft, but they can
>> and should be. They are non-characters.) For example, U+FDD0 (encoded
>> appropriately) could be inserted to signal the end of a sub-stream,
>> after which an imgCIF parser could perform encoding detection and any
>> other start-of-stream behavior it deems appropriate on whatever follows.
>> Alternatively, a defined Unicode character that is excluded from CIF,
>> such as U+0000 (null), U+0004 (end-of-transmission), or another ASCII
>> control character could serve this purpose.
>>
>> See also comments in-line below.
>>
>> On Thursday, June 17, 2010 8:31 PM, Herbert J. Bernstein wrote:
>>> Sorry, you are mistaken. What the code chart says is:
>>>
>>> Special
>>> FEFF    ZERO WIDTH NO-BREAK SPACE
>>>         = BYTE ORDER MARK (BOM), ZWNBSP
>>>          may be used to detect byte order by contrast
>>>          with the noncharacter code point FFFE
>>>          use as an indication of non-breaking is
>>>          deprecated; see 2060 instead
>>>          -> 200B zero width space
>>>          -> 2060 word joiner
>>>          -> FFFE <not a character>
>>>
>>> So, under the latest version of Unicode, the use you are describing is
>>> deprecated. The Unicode Consortium has the character back to what it
>>> originally was -- the BOM, which is not a character, and I intend to
>>> process it that way, not in the very odd way that some people followed
>>> for a few recent Unicode versions that made no sense and has now been
>>> deprecated.
>>
>> As I observed above, deprecating the use of U+FEFF as ZWNBSP is not at all the same thing as removing that meaning of the character. U+FEFF in the body of a Unicode document is, and always will be, a ZWNBSP. Applications and protocols are, however, *allowed* to reject such documents without violating Unicode. That they are allowed to do so is not an argument that they must do so.
>>
>>> In theory there could be old Unicode UTF-8 files somehow with stray
>>> FEFF characters in them as code points, but inasmuch as CIF2 is new, we
>>> are all spared the puzzlement of dealing with this non-problem of
>>> dealing with a noncharacter which became a strange character and is
>>> now again a noncharacter.
>>
>> U+FEFF is still a character, and always will be. I presented above a reasonable scenario for how and why CIF2 might need to deal with embedded U+FEFF. The problem for CIF is that it serves as a protocol for combining multiple types of information from multiple sources, and therefore it is exposed to the behavior, quirks, and foibles of those sources.
>>
>>> In addition, if you read the bizarre discussions on FEFF when people
>>> were trying to use it as a code point instead of just stopping it at
>>> the text processing level, you will see that the only thing they
>>> could do with it was throw it away (that is what a zero width no-break
>>> space means)
>>
>> Do you have an example? Unicode has provided U+2060 as a replacement for U+FEFF's ZWNBSP function, so evidently they believe that there is use for it other than just throwing it away. I have in fact used U+2060 myself. Its designation "word joiner" is a better functional description than is ZWNBSP, but ultimately they mean the same thing.
>>
>>> The _only_ fully compliant use for FEFF in the current standard is as
>>> a BOM, not as a valid code point, so it is not really an issue for CIF2
>>> any more than FFFF or FFFE are, none of which should be delivered as
>>> code points in text processing.
>>
>> I'm sorry, but I don't see the Unicode standard or ancillary documents supporting that conclusion. CIF2 may choose to disallow U+FEFF outside its function as a BOM, but that is *not* a requirement for Unicode compliance. The deprecation is primarily a warning to *users*, not a conformance constraint on programs or specifications.
>>
>>
>> Regards,
>>
>> John
>> --
>> John C. Bollinger, Ph.D.
>> Department of Structural Biology
>> St. Jude Children's Research Hospital
>>
>>
>>
>>
>> Email Disclaimer: www.stjude.org/emaildisclaimer
>>
>> _______________________________________________
>> ddlm-group mailing list
>> [email protected]
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>
> _______________________________________________
> ddlm-group mailing list
> [email protected]
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
[email protected]
http://scripts.iucr.org/mailman/listinfo/ddlm-group
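
For readers following the thread, the two positions can be compared with a small sketch. It is an illustration only and not taken from any CIF2 draft or parser. The names is_noncharacter, check_code_points and embedded_bom_is_error are hypothetical, and the sketch assumes that a single leading U+FEFF is consumed as a BOM before the code point stream is examined. With embedded_bom_is_error=True it approximates the "always a syntax error in the code point stream" reading; with False it approximates the view that an embedded U+FEFF is an ordinary (if deprecated) ZWNBSP. Noncharacters from the list quoted above are reported either way.

```python
BOM = 0xFEFF


def is_noncharacter(cp: int) -> bool:
    """True for the 66 Unicode noncharacters: U+FDD0..U+FDEF plus the
    last two code points of each of the 17 planes (U+xxFFFE, U+xxFFFF)."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def check_code_points(text: str, embedded_bom_is_error: bool = True):
    """Return (body, problems) for an already-decoded string.

    Assumption: one leading U+FEFF is treated as a BOM and dropped.
    problems is a list of (index_in_body, description) tuples.
    """
    body = text[1:] if text.startswith("\ufeff") else text
    problems = []
    for index, ch in enumerate(body):
        cp = ord(ch)
        if cp == BOM:
            # Policy switch: reject embedded U+FEFF, or accept it as ZWNBSP.
            if embedded_bom_is_error:
                problems.append((index, "embedded U+FEFF"))
        elif is_noncharacter(cp):
            problems.append((index, f"noncharacter U+{cp:04X}"))
    return body, problems


if __name__ == "__main__":
    sample = "\ufeffdata_a\ufeffb\ufdd0"
    for strict in (True, False):
        _, found = check_code_points(sample, embedded_bom_is_error=strict)
        print(strict, found)
```

The policy is a single flag, so the cost of each choice is visible: the strict setting rejects text pasted from a manuscript that happens to contain U+FEFF, while the lenient setting passes it through unchanged.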