Discussion List Archives


Re: [ddlm-group] [SPAM] ASSP UTF-8 BOM

Dear colleagues,

I did some experiments with iconv, a robust text encoding conversion
tool. It will silently skip over an embedded BOM for any UTF encoding.
So, this seems to be a reasonable behavior even though it is non-standard.
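
To make the leading-versus-embedded distinction concrete, here is a minimal
Python sketch. It is an illustration only, not the iconv experiment itself.
The two fragments and their contents are hypothetical; the point is that
Python's "utf-8-sig" codec drops only a BOM at the very start of the input,
so a BOM introduced by concatenation survives as U+FEFF unless it is
removed explicitly.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF

# Two hypothetical fragments, each saved with its own BOM,
# then concatenated byte-wise (as "cat a.cif b.cif" would do).
fragment_a = BOM + b"data_a\n"
fragment_b = BOM + b"data_b\n"
raw = fragment_a + fragment_b

# "utf-8-sig" strips a BOM only at the start of the input.
text = raw.decode("utf-8-sig")
has_embedded_bom = "\ufeff" in text

# Dropping every remaining U+FEFF mimics the "silently skip" behavior.
cleaned = text.replace("\ufeff", "")
```

After this runs, has_embedded_bom is true and cleaned contains only the two
data lines.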

From internet searches, there does not seem to be a consensus on whether
UTF-8 should include a BOM. I suspect that the dislike of UTF-8 BOM
comes from people who only work on systems where UTF-8 is the default.

CIF1 does not allow "extended" ASCII encoding for characters 128-255, so
there is no potential for misinterpreting valid CIF files. Guarding
against that kind of misinterpretation is the main purpose of a UTF-8 BOM.
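
A small Python sketch of why this holds, under the assumption that the
alternative encodings in question are ASCII-compatible ones such as
Latin-1. The data line is just an illustrative example. Bytes below 128
decode identically under ASCII, UTF-8 and Latin-1, whereas a UTF-8 encoded
non-ASCII character is garbled when read as Latin-1.

```python
# A pure-ASCII line decodes to the same text under all three encodings.
ascii_bytes = b"_cell_length_a 5.43\n"
same = (
    ascii_bytes.decode("ascii")
    == ascii_bytes.decode("utf-8")
    == ascii_bytes.decode("latin-1")
)

# A non-ASCII character is ambiguous without knowing the encoding.
accented = "\u00e9".encode("utf-8")      # e-acute as UTF-8: bytes C3 A9
as_latin1 = accented.decode("latin-1")   # read as two Latin-1 characters
```

Here same is true, while as_latin1 is a two-character string rather than
the single character that was written.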

So arguments for or against BOMs in CIF files both seem reasonable. CIF
parsers should tolerate leading and embedded UTF-8 BOMs, as well as
files with no BOM. For writing CIF2, it is probably best to make BOMs
optional, because the parser needs to tolerate both variations on input.
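
A minimal sketch of that policy in Python, under stated assumptions. The
function names decode_cif_bytes and encode_cif_text are hypothetical and
do not come from any existing CIF library. Replacing an embedded U+FEFF
with a space and issuing a warning follows the "whitespace, with a
warning" suggestion quoted below; a real parser might choose differently.

```python
import codecs
import warnings

BOM_CHAR = "\ufeff"

def decode_cif_bytes(raw: bytes) -> str:
    """Hypothetical helper: decode UTF-8, tolerating leading and embedded BOMs."""
    text = raw.decode("utf-8")  # strict: raises UnicodeDecodeError on bad input
    if text.startswith(BOM_CHAR):
        text = text[1:]  # a leading BOM is simply dropped
    if BOM_CHAR in text:
        # Assumption: treat an embedded BOM as whitespace, but warn about it.
        warnings.warn("embedded U+FEFF treated as whitespace; file may not be portable")
        text = text.replace(BOM_CHAR, " ")
    return text

def encode_cif_text(text: str, with_bom: bool = False) -> bytes:
    """Hypothetical helper: encode as UTF-8, with the BOM optional on output."""
    body = text.encode("utf-8")
    return codecs.BOM_UTF8 + body if with_bom else body
```

The reader accepts input with or without a BOM, and the writer lets the
caller decide, which matches the "tolerant on input, optional on output"
position above.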

Krahn, Joe (NIH/NIEHS) [C] wrote:
> In general, CIF does not directly deal with encoding. It should be
> possible to allow a low-level I/O library to deal with all encoding
> issues. Therefore, supporting non-standard stream interpretation should
> be avoided.
> 
> It is a good practical idea to allow mid-stream BOMs by interpreting
> them as character 0xFEFF, and allow it as whitespace, with a warning. It
> should not be a required feature, because it is non-standard, and only
> exists in UTF-8 for backwards-compatibility. Eventually, conforming I/O
> libraries will interpret them as invalid. Ideally, text concatenation
> software will become BOM-aware.
> 
> Interpreting mid-stream BOMs and allowing mixed encodings is a major hack,
> and impractical for systems that deal with encoding at the I/O level. It
> is reasonable to allow it as a non-standard extension, but should
> always give a warning so that people realize that such files are likely
> to be broken elsewhere.
> 
> In summary: Standard CIF2 needs to support standard UTF-8 BOMs at the
> beginning of a file. Anything else should be considered a non-standard
> extension. For practical reasons, CIF2 parsers should be encouraged but
> not required to allow mid-stream UTF-8 0xFEFF as whitespace.
> 
> Joe
> 
> Herbert J. Bernstein wrote:
>> Dear Colleagues,
>>
>>    While it is certainly prudent to tell people to either write a pure
>> UTF-8 file with no BOM or to prefix it with a BOM, and that is
>> how a compliant CIF writer should work, it is not practical to
>> insist that CIF readers should reject embedded BOMs.  Indeed, the
>> URL cited by John does not tell you they are illegal, but that
>> you should treat them as a zero width non-breaking space.
>>
>>    The reason we cannot insist on readers demanding that BOMs occur
>> at the beginning is that users may concatenate whole CIFs or
>> build one CIF out of fragments of text, and this will very likely
>> result in embedded BOMs and possibly switches in encodings.  If
>> we fail to handle the BOMs we are much more likely to garble
>> such files.  I strongly recommend the approach in my prior
>> message -- recognize BOMs at all times.
>>
>>    Regards,
>>      Herbert
>>
>> =====================================================
>>   Herbert J. Bernstein, Professor of Computer Science
>>     Dowling College, Kramer Science Center, KSC 121
>>          Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                   +1-631-244-3035
>>                   yaya@dowling.edu
>> =====================================================
>>
>> On Tue, 11 May 2010, Bollinger, John C wrote:
>>
>>> Dear Colleagues,
>>>
>>> I think CIF processor behavior such as Herb describes would be
>>> outstanding, and I commend Herb for his dedication to providing such
>>> capable and robust software.  I do disagree about one of his specific
>>> points, however:
>>>
>>>> The
>>>> minimum to do with any BOM is:
>>> [...]
>>>
>>>>   1.  Accept it at any point in a character stream.
>>> It would be both unconventional and programmatically inconvenient to
>>> give special significance to U+FEFF anywhere other than at the very
>>> beginning of a file.  The Unicode consortium in fact addresses this exact
>>> question in its FAQ: http://www.unicode.org/faq/utf_bom.html#bom6.
>>> Although the Unicode Consortium's comments do allow for protocol-specific support
>>> for accepting U+FEFF as a BOM other than at the beginning of the stream,
>>> I see little advantage to adding such a complication to the CIF2
>>> specifications.
>>>
>>> This all expands the scope of the topic far beyond what I had intended,
>>> however.  I think it is perhaps useful to recognize at this point,
>>> therefore, that the CIF2 language specification and the behavior of CIF2
>>> processors are separate questions.  This group has already decided that
>>> files compliant with CIF 2.0 are encoded in UTF-8, period.  I do not want
>>> to reopen that debate.  On the other hand, that in no way prevents CIF
>>> processors from -- as an extension -- recognizing and handling putative
>>> CIFs that violate the spec by employing character encodings different
>>> from UTF-8.  That sort of thing is generally heralded as beneficial for
>>> ease of use, and it is consistent with the good design principle of being
>>> relaxed about inputs but strict about outputs.  (And in that vein I would
>>> hope that any CIF 2.0 writer's normal behavior would be to encode in
>>> UTF-8.)
>>>
>>> My suggestion is slightly different, as I hope this restatement will
>>> show: *in light of the fact that spec-compliant CIF2 files are encoded in
>>> UTF-8*, I suggest that the spec allow a file beginning with a UTF-8 BOM
>>> to be spec-compliant (subject to the compliance of the rest of the
>>> contents).  Like Herb, I intend that my parsers will accept such CIFs
>>> whether they strictly comply with the spec or not, but the question is
>>> whether accepting such files should be a compliance requirement or an
>>> extension.  Either way, I think it will be valuable to document this
>>> decision in the spec, if only to draw attention to the issue.
>>>
>>>
>>> Best Regards,
>>>
>>> John
>>> --
>>> John C. Bollinger, Ph.D.
>>> Department of Structural Biology


_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
