
Re: [ddlm-group] UTF-8 BOM

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] UTF-8 BOM
From: James Hester <[email protected]>
Date: Mon, 21 Jun 2010 16:00:24 +1000

So, can I count your vote on code point 0xFEFF as 'always a syntax
error in the code point stream' (option 2(a))?

On Sat, Jun 19, 2010 at 1:10 AM, Herbert J. Bernstein
<[email protected]> wrote:
> Ah, but I remember the days when FEFF was also a non-character and I am
> happy to see it deprecated and headed back to that status. In a UCS-2
> byte stream it is a disaster as a code point, causing confusion with
> a UTF-16 big-endian BOM. It was a mistake ever to make it anything
> other than a BOM, and we would be making a mistake in perpetuating
> a deprecated use.
>
> I suppose COMCIFS could decide to change the specification of imgCIF
> and diverge from the already-established use of UCS-2/UTF-16 in the
> bin-utf encoding of images, but why do that?
>
> Nothing is gained by COMCIFS diverging from Unicode and XML practice
> on handling BOMs.
>
> In any case, the real question is whether CIF2 will be itself a binary
> format, or whether CIF2 will be a text format. I think we would serve
> the community best with a text format for at least the next ten years.
>
>
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  [email protected]
> =====================================================
>
> On Fri, 18 Jun 2010, Bollinger, John C wrote:
>
>> We have already acknowledged several times that use of U+FEFF as a
>> ZWNBSP is deprecated by Unicode, but that does not in any way make it an
>> invalid character. Its definition in the code chart (as something other
>> than "not a character") in fact shows that it *is* a valid character.
>>
>> Unicode explicitly specifies that certain code points have not been and
>> never will be assigned to characters; they are called "not a character"
>> or "non-character" in Unicode-speak. Those are to be distinguished from
>> "unassigned code points" and also from deprecated characters and uses.
>> U+FFFE is a non-character, but U+FEFF is merely deprecated (as ZWNBSP).
>> Unicode in fact adopts an approach similar to COMCIFS's, and for similar
>> reasons: once a character is defined, it has the defined meaning
>> forever. U+FEFF's significance as ZWNBSP therefore will never be removed
>> from Unicode.
>>
>> I daresay most of us have code in production that relies on deprecated
>> features of various programming languages, and some of us may even from
>> time to time write new code relying on such features. These programs
>> are not for that reason non-conformant. Similarly, let's not
>> overinterpret Unicode's deprecation of U+FEFF's use as ZWNBSP. Unicode
>> does advise against that use in new documents, but it does not forbid
>> it. We can choose.
>>
>> Inasmuch as all CIF2 documents will be new documents, CIF2 could
>> incorporate Unicode's recommendation as a requirement. That would allow
>> U+FEFF to be reserved for use as an encoding switch or other protocol
>> metacharacter, but at the cost of creating a new way for otherwise
>> perfectly acceptable, innocently created CIFs to be ill-formed. For
>> example, imagine using Unicode-aware (but not CIF-specific) tools to
>> copy perfectly good text from an existing manuscript into a CIF text
>> block, and having the result turn out to be invalid because it contains
>> U+FEFF (used as ZWNBSP). This is a much more likely scenario than
>> U+FEFF introduced by concatenation of CIFs, and it is inconsistent with
>> this group's continued interest in keeping CIF compatible with
>> general-purpose tools.
>>
>> There are plenty of non-character code points in Unicode that could be
>> used in imgCIF or other protocols as escape characters to introduce
>> meta-functionality such as an encoding switch. These include U+FDD0 -
>> U+FDEF, U+FFFE (maybe), U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, ...
>> U+10FFFE, and U+10FFFF. (NOTE: other than U+FFFE and U+FFFF, none of
>> these are explicitly forbidden in the current CIF2 draft, but they can
>> and should be. They are non-characters.) For example, U+FDD0 (encoded
>> appropriately) could be inserted to signal the end of a sub-stream,
>> after which an imgCIF parser could perform encoding detection and any
>> other start-of-stream behavior it deems appropriate on whatever follows.
>> Alternatively, a defined Unicode character that is excluded from CIF,
>> such as U+0000 (null), U+0004 (end-of-transmission), or another ASCII
>> control character could serve this purpose.
>>
>> See also comments in-line below.
>>
>> On Thursday, June 17, 2010 8:31 PM, Herbert J. Bernstein wrote:
>>> Sorry, you are mistaken. What the code chart says is:
>>>
>>> Special
>>> FEFF    ZERO WIDTH NO-BREAK SPACE
>>>         = BYTE ORDER MARK (BOM), ZWNBSP
>>>          may be used to detect byte order by contrast
>>>          with the noncharacter code point FFFE
>>>          use as an indication of non-breaking is
>>>          deprecated; see 2060 instead
>>>          -> 200B zero width space
>>>          -> 2060 word joiner
>>>          -> FFFE <not a character>
>>>
>>> So, under the latest version of Unicode, the use you are describing is
>>> deprecated. The Unicode Consortium has the character back to what it
>>> originally was -- the BOM, which is not a character, and I intend to
>>> process it that way, not in the very odd way that some people followed
>>> for a few recent Unicode versions that made no sense and has now been
>>> deprecated.
>>
>> As I observed above, deprecating the use of U+FEFF as ZWNBSP is not at all the same thing as removing that meaning of the character. U+FEFF in the body of a Unicode document is, and always will be, a ZWNBSP. Applications and protocols are, however, *allowed* to reject such documents without violating Unicode. That they are allowed to do so is not an argument that they must do so.
>>
>>> In theory there could be old unicode UTF-8 files somehow with stray
>>> FEFF characters in them as code points, but inasmuch as CIF2 is new, we
>>> are all spared the puzzlement of dealing with this non-problem of
>>> dealing with a noncharacter which became a strange character and is
>>> now again a noncharacter.
>>
>> U+FEFF is still a character, and always will be. I presented above a reasonable scenario for how and why CIF2 might need to deal with embedded U+FEFF. The problem for CIF is that it serves as a protocol for combining multiple types of information from multiple sources, and therefore it is exposed to the behavior, quirks, and foibles of those sources.
>>
>>> In addition, if you read the bizarre discussions on FEFF when people
>>> were trying to use it as a code point instead of just stopping it at
>>> the text processing level, you will see that the only thing they
>>> could do with it was throw it away (that is what a zero width no-break
>>> space means)
>>
>> Do you have an example? Unicode has provided U+2060 as a replacement for U+FEFF's ZWNBSP function, so evidently they believe that there is use for it other than just throwing it away. I have in fact used U+2060 myself. Its designation "word joiner" is a better functional description than is ZWNBSP, but ultimately they mean the same thing.
>>
>>> The _only_ fully compliant use for FEFF in the current standard is as
>>> a BOM, not as a valid code point, so it is not really an issue for CIF2
>>> any more than FFFF or FFFE are, none of which should be delivered as
>>> code points in text processing.
>>
>> I'm sorry, but I don't see the Unicode standard or ancillary documents supporting that conclusion. CIF2 may choose to disallow U+FEFF outside its function as a BOM, but that is *not* a requirement for Unicode compliance. The deprecation is primarily a warning to *users*, not a conformance constraint on programs or specifications.
>>
>>
>> Regards,
>>
>> John
>> --
>> John C. Bollinger, Ph.D.
>> Department of Structural Biology
>> St. Jude Children's Research Hospital
>>
>>
>>
>>
>> Email Disclaimer: www.stjude.org/emaildisclaimer
>>
>> _______________________________________________
>> ddlm-group mailing list
>> [email protected]
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>
>
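
For reference, a minimal Python sketch of the distinction drawn in the
quoted message: U+FEFF is an assigned character, whereas the
non-characters are U+FDD0 to U+FDEF plus the last two code points of
each of the 17 planes. The function names are hypothetical, and treating
a single leading U+FEFF as a BOM is an assumption made for illustration
only, not a statement of what CIF2 requires.

```python
from typing import List


def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of Unicode's 66 non-character code points."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF in every plane n = 0..16.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def embedded_feff_positions(text: str) -> List[int]:
    """Return indices of U+FEFF anywhere except position 0.

    Assumption (illustrative): a U+FEFF at index 0 is a BOM and is not
    reported; every later occurrence is an embedded ZWNBSP that a parser
    would have to either accept or reject.
    """
    return [i for i, ch in enumerate(text) if ch == "\ufeff" and i > 0]
```

A parser that takes the "always a syntax error" position would reject any
input for which embedded_feff_positions() returns a non-empty list; one
that takes the other position would pass those code points through as
ordinary characters.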



-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148