Re: [ddlm-group] UTF-8 BOM

So, can I count your vote on code point 0xFEFF as 'always a syntax
error in the code point stream' (option 2(a))?
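
For concreteness, here is a minimal sketch of what option 2(a) could look
like in a reader. It assumes that a single leading UTF-8 BOM is consumed by
the decoding layer and never reaches the code point stream; the function
name decode_cif2 is hypothetical and is not taken from any CIF2 draft.

```python
BOM = "\ufeff"


def decode_cif2(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping one leading BOM and rejecting any other U+FEFF.

    Hypothetical helper: assumes the leading BOM belongs to the encoding
    layer, so any U+FEFF that survives decoding is treated as a syntax error.
    """
    # The "utf-8-sig" codec skips exactly one leading EF BB BF sequence.
    text = raw.decode("utf-8-sig")
    pos = text.find(BOM)
    if pos != -1:
        raise ValueError(f"U+FEFF at code point offset {pos} is a syntax error")
    return text
```

Under this reading a file that starts with a BOM is accepted, while a
U+FEFF pasted into a text block as ZWNBSP, or left behind by concatenating
two files, is rejected.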

On Sat, Jun 19, 2010 at 1:10 AM, Herbert J. Bernstein
<yaya@bernstein-plus-sons.com> wrote:
> Ah, but I remember the days when FEFF was also a non-character and I am
> happy to see it deprecated and headed back to that status.  In a UCS-2
> byte stream it is a disaster as a code point, causing confusion with
> a UTF-16 big-endian BOM.  It was a mistake ever to make it anything
> other than a BOM, and we would be making a mistake in perpetuating
> a deprecated use.
> I suppose COMCIFS could decide to change the specification of imgCIF
> and diverge from the already-established use of UCS-2/UTF-16 in the
> bin-utf encoding of images, but why do that?
> Nothing is gained by COMCIFS diverging from Unicode and XML practice
> on handling BOMs.
> In any case, the real question is whether CIF2 will be itself a binary
> format, or whether CIF2 will be a text format.  I think we would serve
> the community best with a text format for at least the next ten years.
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>                  +1-631-244-3035
>                  yaya@dowling.edu
> =====================================================
> On Fri, 18 Jun 2010, Bollinger, John C wrote:
>> We have already acknowledged several times that use of U+FEFF as a
>> ZWNBSP is deprecated by Unicode, but that does not in any way make it an
>> invalid character.  Its definition in the code chart (as something other
>> than "not a character") in fact shows that it *is* a valid character.
>> Unicode explicitly specifies that certain code points have not been and
>> never will be assigned to characters; they are called "not a character"
>> or "non-character" in Unicode-speak.  Those are to be distinguished from
>> "unassigned code points" and also from deprecated characters and uses.
>> U+FFFE is a non-character, but U+FEFF is merely deprecated (as ZWNBSP).
>> Unicode in fact adopts an approach similar to COMCIFS's, and for similar
>> reasons: once a character is defined, it has the defined meaning
>> forever. U+FEFF's significance as ZWNBSP therefore will never be removed
>> from Unicode.
>> I daresay most of us have code in production that relies on deprecated
>> features of various programming languages, and some of us may even from
>> time to time write new code relying on such features.  These programs
>> are not for that reason non-conformant.  Similarly, let's not
>> overinterpret Unicode's deprecation of U+FEFF's use as ZWNBSP.  Unicode
>> does advise against that use in new documents, but it does not forbid
>> it.  We can choose.
>> Inasmuch as all CIF2 documents will be new documents, CIF2 could
>> incorporate Unicode's recommendation as a requirement.  That would allow
>> U+FEFF to be reserved for use as an encoding switch or other protocol
>> metacharacter, but at the cost of creating a new way for otherwise
>> perfectly acceptable, innocently created CIFs to be ill-formed.  For
>> example, imagine using Unicode-aware (but not CIF-specific) tools to
>> copy perfectly good text from an existing manuscript into a CIF text
>> block, and having the result turn out to be invalid because it contains
>> U+FEFF (used as ZWNBSP).  This is a much more likely scenario than
>> U+FEFF introduced by concatenation of CIFs, and it is inconsistent with
>> this group's continued interest in keeping CIF compatible with
>> general-purpose tools.
>> There are plenty of non-character code points in Unicode that could be
>> used in imgCIF or other protocols as escape characters to introduce
>> meta-functionality such as an encoding switch.  These include U+FDD0 -
>> U+FDEF, U+FFFE (maybe), U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, ...
>> U+10FFFE, and U+10FFFF.  (NOTE: other than U+FFFE and U+FFFF, none of
>> these are explicitly forbidden in the current CIF2 draft, but they can
>> and should be.  They are non-characters.)  For example, U+FDD0 (encoded
>> appropriately) could be inserted to signal the end of a sub-stream,
>> after which an imgCIF parser could perform encoding detection and any
>> other start-of-stream behavior it deems appropriate on whatever follows.
>> Alternatively, a defined Unicode character that is excluded from CIF,
>> such as U+0000 (null), U+0004 (end-of-transmission), or another ASCII
>> control character could serve this purpose.
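
The non-character code points listed above follow a simple pattern:
U+FDD0 through U+FDEF, plus the last two code points of each of the 17
planes. A minimal sketch of a test for them follows; the function name
is_noncharacter is hypothetical and not part of any CIF2 draft.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the 66 Unicode noncharacter code points."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    # U+FDD0..U+FDEF, plus U+nFFFE and U+nFFFF for every plane n (0..16).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
```

Note that U+FFFE satisfies this test and U+FEFF does not, which is the
distinction drawn in the quoted message.
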
>> See also comments in-line below.
>> On Thursday, June 17, 2010 8:31 PM, Herbert J. Bernstein wrote:
>>> Sorry, you are mistaken.  What the code chart says is:
>>> Special
>>>          may be used to detect byte order by contrast
>>>          with the noncharacter code point FFFE
>>>          use as an indication of non-breaking is
>>>          deprecated; see 2060 instead
>>>          -> 200B zero width space
>>>          -> 2060 word joiner
>>>          -> FFFE <not a character>
>>> So, under the latest version of Unicode, the use you are describing is
>>> deprecated. The Unicode Consortium has put the character back to what it
>>> originally was -- the BOM, which is not a character, and I intend to
>>> process it that way, not in the very odd way that some people followed
>>> for a few recent Unicode versions that made no sense and has now been
>>> deprecated.
>> As I observed above, deprecating the use of U+FEFF as ZWNBSP is not at
>> all the same thing as removing that meaning of the character.  U+FEFF
>> in the body of a Unicode document is, and always will be, a ZWNBSP.
>> Applications and protocols are, however, *allowed* to reject such
>> documents without violating Unicode.  That they are allowed to do so is
>> not an argument that they must do so.
>>> In theory there could be old unicode UTF-8 files somehow with stray
>>> FEFF characters in them as code points, but inasmuch as CIF2 is new, we
>>> are all spared the puzzlement of dealing with this non-problem of
>>> dealing with a noncharacter which became a strange character and is
>>> now again a noncharacter.
>> U+FEFF is still a character, and always will be.  I presented above a
>> reasonable scenario for how and why CIF2 might need to deal with
>> embedded U+FEFF.  The problem for CIF is that it serves as a protocol
>> for combining multiple types of information from multiple sources, and
>> therefore it is exposed to the behavior, quirks, and foibles of those
>> sources.
>>> In addition, if you read the bizarre discussions on FEFF when people
>>> were trying to use it as a code point instead of just stopping it at
>>> the text processing level, you will see that the only thing they
>>> could do with it was throw it away (that is what a zero width no-break
>>> space means)
>> Do you have an example?  Unicode has provided U+2060 as a replacement
>> for U+FEFF's ZWNBSP function, so evidently they believe that there is
>> use for it other than just throwing it away.  I have in fact used
>> U+2060 myself.  Its designation "word joiner" is a better functional
>> description than is ZWNBSP, but ultimately they mean the same thing.
>>> The _only_ fully compliant use for FEFF in the current standard is as
>>> a BOM, not as a valid code point, so it is not really an issue for CIF2
>>> any more than FFFF or FFFE are, none of which should be delivered as
>>> code points in text processing.
>> I'm sorry, but I don't see the Unicode standard or ancillary documents
>> supporting that conclusion.  CIF2 may choose to disallow U+FEFF outside
>> its function as a BOM, but that is *not* a requirement for Unicode
>> compliance.  The deprecation is primarily a warning to *users*, not a
>> conformance constraint on programs or specifications.
>> Regards,
>> John
>> --
>> John C. Bollinger, Ph.D.
>> Department of Structural Biology
>> St. Jude Children's Research Hospital
>> Email Disclaimer:  www.stjude.org/emaildisclaimer
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group

T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148