Re: [ddlm-group] UTF-8 BOM

We have already acknowledged several times that use of U+FEFF as a ZWNBSP is deprecated by Unicode, but that does not in any way make it an invalid character.  Its definition in the code chart (as something other than "not a character") in fact shows that it *is* a valid character.

Unicode explicitly specifies that certain code points have not been and never will be assigned to characters; they are called "not a character" or "non-character" in Unicode-speak.  Those are to be distinguished from "unassigned code points" and also from deprecated characters and uses.  U+FFFE is a non-character, but U+FEFF is merely deprecated (as ZWNBSP).  Unicode in fact adopts an approach similar to COMCIF's, and for similar reasons: once a character is defined, it has the defined meaning forever. U+FEFF's significance as ZWNBSP therefore will never be removed from Unicode.
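
To make the distinction concrete, here is a minimal sketch that asks the Unicode Character Database bundled with Python about these code points. The helper name `describe` is mine, purely for illustration. Note that the general category "Cn" covers both non-characters and merely unassigned code points, so the category alone cannot tell those two apart; what it does show is that U+FEFF is an assigned format character with a name, and U+FFFE is not.

```python
import unicodedata

def describe(cp):
    """Return (general category, character name or None) for a code point."""
    ch = chr(cp)
    # The second argument is returned instead of raising ValueError
    # when the code point has no character name.
    return unicodedata.category(ch), unicodedata.name(ch, None)

print(describe(0xFEFF))  # ('Cf', 'ZERO WIDTH NO-BREAK SPACE')
print(describe(0x2060))  # ('Cf', 'WORD JOINER')
print(describe(0xFFFE))  # ('Cn', None)
```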

I daresay most of us have code in production that relies on deprecated features of various programming languages, and some of us may even from time to time write new code relying on such features.  These programs are not for that reason non-conformant.  Similarly, let's not overinterpret Unicode's deprecation of U+FEFF's use as ZWNBSP.  Unicode does advise against that use in new documents, but it does not forbid it.  We can choose.

Inasmuch as all CIF2 documents will be new documents, CIF2 could incorporate Unicode's recommendation as a requirement.  That would allow U+FEFF to be reserved for use as an encoding switch or other protocol metacharacter, but at the cost of creating a new way for otherwise perfectly acceptable, innocently created CIFs to be ill-formed.  For example, imagine using Unicode-aware (but not CIF-specific) tools to copy perfectly good text from an existing manuscript into a CIF text block, and having the result turn out to be invalid because it contains U+FEFF (used as ZWNBSP).  This is a much more likely scenario than U+FEFF introduced by concatenation of CIFs, and it is inconsistent with this group's continued interest in keeping CIF compatible with general-purpose tools.

There are plenty of non-character code points in Unicode that could be used in imgCIF or other protocols as escape characters to introduce meta-functionality such as an encoding switch.  These include U+FDD0 - U+FDEF, U+FFFE (maybe), U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, ... U+10FFFE, and U+10FFFF.  (NOTE: other than U+FFFE and U+FFFF, none of these are explicitly forbidden in the current CIF2 draft, but they can and should be.  They are non-characters.)  For example, U+FDD0 (encoded appropriately) could be inserted to signal the end of a sub-stream, after which an imgCIF parser could perform encoding detection and any other start-of-stream behavior it deems appropriate on whatever follows.  Alternatively, a defined Unicode character that is excluded from CIF, such as U+0000 (null), U+0004 (end-of-transmission), or another ASCII control character could serve this purpose.
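
As a rough sketch of what forbidding the non-characters could look like in a checker, the following tests a code point against the ranges listed above (U+FDD0 - U+FDEF, plus the last two code points of every plane). The function names are hypothetical and this is not text from the CIF2 draft; it only illustrates that the full set is small and easy to test for.

```python
def is_noncharacter(cp):
    """True if cp is one of Unicode's 66 permanently reserved non-characters."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (cp,))
    # U+FDD0..U+FDEF, or U+nFFFE / U+nFFFF in any of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def find_noncharacters(text):
    """Return (index, code point) pairs for every non-character in text."""
    return [(i, ord(ch)) for i, ch in enumerate(text) if is_noncharacter(ord(ch))]

# U+FEFF is a character, so it is not reported; U+FDD0 is.
print(find_noncharacters("data_a\ufeff\ufdd0data_b"))
```

A protocol that adopted, say, U+FDD0 as an end-of-sub-stream marker could use the same test to locate the marker without colliding with any character that legitimately occurs in text.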

See also comments in-line below.

On Thursday, June 17, 2010 8:31 PM, Herbert J. Bernstein wrote:
>Sorry, you are mistaken.  What the code chart says is:
>
>Special
>FEFF    ZERO WIDTH NO-BREAK SPACE
>         = BYTE ORDER MARK (BOM), ZWNBSP
>          may be used to detect byte order by contrast
>          with the noncharacter code point FFFE
>          use as an indication of non-breaking is
>          deprecated; see 2060 instead
>          -> 200B zero width space
>          -> 2060 word joiner
>          -> FFFE <not a character>
>
>So, under the latest version of unicode, the use you are describing is
>deprecated. The unicode consortium has the character back to what it
>originally was -- the BOM, which is not a character, and I intend to
>process it that way, not in the very odd way that some people followed
>for a few recent Unicode versions that made no sense and has now been
>deprecated.

As I observed above, deprecating the use of U+FEFF as ZWNBSP is not at all the same thing as removing that meaning of the character.  U+FEFF in the body of a Unicode document is, and always will be, a ZWNBSP.  Applications and protocols are, however, *allowed* to reject such documents without violating Unicode.  That they are allowed to do so is not an argument that they must do so.
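
To illustrate that this is a policy choice rather than something Unicode dictates, here is a minimal sketch of a reader that strips a leading U+FEFF as a BOM and then treats any embedded U+FEFF according to a caller-selected policy. The function name and the three policy names are assumptions of mine, not anything defined by CIF2 or by Unicode.

```python
BOM = "\ufeff"
WORD_JOINER = "\u2060"

def handle_feff(text, policy="keep"):
    """Drop a leading BOM, then apply `policy` to any embedded U+FEFF.

    policy: "keep"    - pass embedded U+FEFF through as ZWNBSP
            "replace" - substitute U+2060 WORD JOINER
            "reject"  - raise ValueError (what a stricter CIF2 might do)
    """
    if policy not in ("keep", "replace", "reject"):
        raise ValueError("unknown policy: %r" % (policy,))
    # Only a U+FEFF at the very start of the stream acts as a BOM.
    body = text[1:] if text.startswith(BOM) else text
    if BOM not in body or policy == "keep":
        return body
    if policy == "replace":
        return body.replace(BOM, WORD_JOINER)
    raise ValueError("embedded U+FEFF at index %d" % body.index(BOM))
```

All three behaviors are consistent with Unicode; the question for CIF2 is only which one it wants to require.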

>In theory there could be old unicode UTF-8 files somehow with stray
>FEFF characters in them as code points, but inasmuch as CIF2 is new, we
>are all spared the puzzlement of dealing with this non-problem of
>dealing with a noncharacter which became a strange character and is
>now again a noncharacter.

U+FEFF is still a character, and always will be.  I presented above a reasonable scenario for how and why CIF2 might need to deal with embedded U+FEFF.  The problem for CIF is that it serves as a protocol for combining multiple types of information from multiple sources, and therefore it is exposed to the behavior, quirks, and foibles of those sources.

>In addition, if you read the bizarre discussions on FEFF when people
>were trying to use it as a code point instead of just stopping it at
>the text processing level, you will see that the only thing they
>could do with it was throw it away (that is what a zero width no-break
>space means)

Do you have an example?  Unicode has provided U+2060 as a replacement for U+FEFF's ZWNBSP function, so evidently they believe that there is use for it other than just throwing it away.  I have in fact used U+2060 myself.  Its designation "word joiner" is a better functional description than is ZWNBSP, but ultimately they mean the same thing.

>The _only_ fully compliant use for FEFF in the current standard is as
>a BOM, not as a valid code point, so it is not really an issue for CIF2
>any more than FFFF or FFFE are, none of which should be delivered as
>code points in text processing.

I'm sorry, but I don't see the Unicode standard or ancillary documents supporting that conclusion.  CIF2 may choose to disallow U+FEFF outside its function as a BOM, but that is *not* a requirement for Unicode compliance.  The deprecation is primarily a warning to *users*, not a conformance constraint on programs or specifications.


Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital



