Re: [ddlm-group] UTF-8 BOM

Ah, but I remember the days when FEFF was also a non-character and I am
happy to see it deprecated and headed back to that status.  In a UCS-2
byte stream it is a disaster as a code point, causing confusion with
a UTF-16 big-endian BOM.  It was a mistake ever to make it anything
other than a BOM, and we would be making a mistake in perpetuating
a deprecated use.

I suppose COMCIFS could decide to change the specification of imgCIF
and diverge from the already-established use of UCS-2/UTF-16 in the
bin-utf encoding of images, but why do that?

Nothing is gained by COMCIFS diverging from Unicode and XML practice
on handling BOMs.
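
As a rough illustration of that practice (a sketch only: the function
name split_bom is hypothetical and not part of any CIF or imgCIF
software), a reader can treat a leading BOM as an encoding signature
and consume it before any text reaches the parser. UTF-32 signatures
are deliberately left out to keep the example short.

```python
import codecs

# Signatures checked at the very start of a byte stream.
# Assumption for this sketch: only UTF-8 and UTF-16 are of interest.
_SIGNATURES = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def split_bom(data: bytes):
    """Return (encoding_name_or_None, payload_without_leading_BOM)."""
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            # The BOM is consumed here and never delivered as text.
            return name, data[len(bom):]
    return None, data
```

Only the first bytes of the stream are examined, so this sketch says
nothing about how a U+FEFF later in the stream should be handled.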

In any case, the real question is whether CIF2 will be itself a binary
format, or whether CIF2 will be a text format.  I think we would serve
the community best with a text format for at least the next ten years.


=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Fri, 18 Jun 2010, Bollinger, John C wrote:

> We have already acknowledged several times that use of U+FEFF as a 
> ZWNBSP is deprecated by Unicode, but that does not in any way make it an 
> invalid character.  Its definition in the code chart (as something other 
> than "not a character") in fact shows that it *is* a valid character.
>
> Unicode explicitly specifies that certain code points have not been and 
> never will be assigned to characters; they are called "not a character" 
> or "non-character" in Unicode-speak.  Those are to be distinguished from 
> "unassigned code points" and also from deprecated characters and uses. 
> U+FFFE is a non-character, but U+FEFF is merely deprecated (as ZWNBSP). 
> Unicode in fact adopts an approach similar to COMCIFS's, and for similar 
> reasons: once a character is defined, it has the defined meaning 
> forever. U+FEFF's significance as ZWNBSP therefore will never be removed 
> from Unicode.
>
> I daresay most of us have code in production that relies on deprecated 
> features of various programming languages, and some of us may even from 
> time to time write new code relying on such features.  These programs 
> are not for that reason non-conformant.  Similarly, let's not 
> overinterpret Unicode's deprecation of U+FEFF's use as ZWNBSP.  Unicode 
> does advise against that use in new documents, but it does not forbid 
> it.  We can choose.
>
> Inasmuch as all CIF2 documents will be new documents, CIF2 could 
> incorporate Unicode's recommendation as a requirement.  That would allow 
> U+FEFF to be reserved for use as an encoding switch or other protocol 
> metacharacter, but at the cost of creating a new way for otherwise 
> perfectly acceptable, innocently created CIFs to be ill-formed.  For 
> example, imagine using Unicode-aware (but not CIF-specific) tools to 
> copy perfectly good text from an existing manuscript into a CIF text 
> block, and having the result turn out to be invalid because it contains 
> U+FEFF (used as ZWNBSP).  This is a much more likely scenario than 
> U+FEFF introduced by concatenation of CIFs, and it is inconsistent with 
> this group's continued interest in keeping CIF compatible with 
> general-purpose tools.
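
The copy-and-paste scenario above can be made concrete with a small
sketch. It assumes already-decoded text, and the helper names are
hypothetical. It locates U+FEFF anywhere except the very first position
and, as one possible policy, rewrites those occurrences to U+2060 WORD
JOINER. Whether CIF2 should reject, rewrite, or accept such characters
is exactly what this thread is debating; the code takes no side.

```python
BOM = "\ufeff"          # U+FEFF: BOM at stream start, ZWNBSP elsewhere
WORD_JOINER = "\u2060"  # U+2060: Unicode's replacement for the ZWNBSP use


def embedded_feff_positions(text: str) -> list:
    """Indices of U+FEFF occurring anywhere except index 0."""
    return [i for i, ch in enumerate(text) if ch == BOM and i > 0]


def replace_embedded_feff(text: str) -> str:
    """Keep a leading U+FEFF as is; turn later ones into U+2060."""
    head = BOM if text.startswith(BOM) else ""
    return head + text[len(head):].replace(BOM, WORD_JOINER)
```

A validator could call embedded_feff_positions to report offending
positions, while a lenient importer could call replace_embedded_feff.
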
>
> There are plenty of non-character code points in Unicode that could be 
> used in imgCIF or other protocols as escape characters to introduce 
> meta-functionality such as an encoding switch.  These include U+FDD0 - 
> U+FDEF, U+FFFE (maybe), U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, ... 
> U+10FFFE, and U+10FFFF.  (NOTE: other than U+FFFE and U+FFFF, none of 
> these are explicitly forbidden in the current CIF2 draft, but they can 
> and should be.  They are non-characters.)  For example, U+FDD0 (encoded 
> appropriately) could be inserted to signal the end of a sub-stream, 
> after which an imgCIF parser could perform encoding detection and any 
> other start-of-stream behavior it deems appropriate on whatever follows. 
> Alternatively, a defined Unicode character that is excluded from CIF, 
> such as U+0000 (null), U+0004 (end-of-transmission), or another ASCII 
> control character could serve this purpose.
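
The list of non-characters above follows a simple pattern: the
contiguous range U+FDD0 through U+FDEF, plus the last two code points
of each of the 17 planes. A minimal sketch of a membership test follows.
The function name is hypothetical and nothing here is taken from the
CIF2 draft.

```python
def is_noncharacter(cp: int) -> bool:
    """True if cp is one of Unicode's permanently reserved non-characters."""
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # U+nFFFE and U+nFFFF for every plane n from 0x00 to 0x10:
    # the low 16 bits, with the lowest bit ignored, equal 0xFFFE.
    return 0 <= cp <= 0x10FFFF and (cp & 0xFFFE) == 0xFFFE
```

By this test U+FFFE is a non-character and U+FEFF is not, which is the
distinction drawn earlier in this message.
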
>
> See also comments in-line below.
>
> On Thursday, June 17, 2010 8:31 PM, Herbert J. Bernstein wrote:
>> Sorry, you are mistaken.  What the code chart says is:
>>
>> Special
>> FEFF    ZERO WIDTH NO-BREAK SPACE
>>         = BYTE ORDER MARK (BOM), ZWNBSP
>>          may be used to detect byte order by contrast
>>          with the noncharacter code point FFFE
>>          use as an indication of non-breaking is
>>          deprecated; see 2060 instead
>>          -> 200B zero width space
>>          -> 2060 word joiner
>>          -> FFFE <not a character>
>>
>> So, under the latest version of Unicode, the use you are describing is
>> deprecated. The Unicode Consortium has the character back to what it
>> originally was -- the BOM, which is not a character, and I intend to
>> process it that way, not in the very odd way that some people followed
>> for a few recent Unicode versions that made no sense and has now been
>> deprecated.
>
> As I observed above, deprecating the use of U+FEFF as ZWNBSP is not at
> all the same thing as removing that meaning of the character.  U+FEFF in
> the body of a Unicode document is, and always will be, a ZWNBSP.
> Applications and protocols are, however, *allowed* to reject such
> documents without violating Unicode.  That they are allowed to do so is
> not an argument that they must do so.
>
>> In theory there could be old unicode UTF-8 files somehow with stray
>> FEFF characters in them as code points, but inasmuch as CIF2 is new, we
>> are all spared the puzzlement of dealing with this non-problem of
>> dealing with a noncharacter which became a strange character and is
>> now again a noncharacter.
>
> U+FEFF is still a character, and always will be.  I presented above a
> reasonable scenario for how and why CIF2 might need to deal with embedded
> U+FEFF.  The problem for CIF is that it serves as a protocol for
> combining multiple types of information from multiple sources, and
> therefore it is exposed to the behavior, quirks, and foibles of those
> sources.
>
>> In addition, if you read the bizarre discussions on FEFF when people
>> were trying to use it as a code point instead of just stopping it at
>> the text processing level, you will see that the only thing they
>> could do with it was throw it away (that is what a zero width no-break
>> space means)
>
> Do you have an example?  Unicode has provided U+2060 as a replacement for
> U+FEFF's ZWNBSP function, so evidently they believe that there is use for
> it other than just throwing it away.  I have in fact used U+2060 myself.
> Its designation "word joiner" is a better functional description than is
> ZWNBSP, but ultimately they mean the same thing.
>
>> The _only_ fully compliant use for FEFF in the current standard is as
>> a BOM, not as a valid code point, so it is not really an issue for CIF2
>> any more than FFFF or FFFE are, none of which should be delivered as
>> code points in text processing.
>
> I'm sorry, but I don't see the Unicode standard or ancillary documents
> supporting that conclusion.  CIF2 may choose to disallow U+FEFF outside
> its function as a BOM, but that is *not* a requirement for Unicode
> compliance.  The deprecation is primarily a warning to *users*, not a
> conformance constraint on programs or specifications.
>
>
> Regards,
>
> John
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
>
>
>
>
> Email Disclaimer:  www.stjude.org/emaildisclaimer
>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>