
Re: [ddlm-group] [SPAM] ASSP UTF-8 BOM

Dear colleagues,

I did some experiments with iconv, a robust text encoding conversion
tool. It will silently skip over an embedded BOM for any UTF encoding.
So, this seems to be a reasonable behavior even though it is non-standard.
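
For comparison, here is a small Python sketch (not the iconv experiment
itself) of how an embedded BOM arises from plain concatenation, and what
a standard decoder does with it. The two fragments and their data names
are made up for illustration; the only library features used are
codecs.BOM_UTF8 and the "utf-8-sig" codec.

```python
import codecs

# Two hypothetical CIF fragments, each saved with a UTF-8 BOM by its editor.
part_a = codecs.BOM_UTF8 + "data_a\n_cell.length_a 10.0\n".encode("utf-8")
part_b = codecs.BOM_UTF8 + "data_b\n_cell.length_a 12.5\n".encode("utf-8")

# Byte-level concatenation, as a tool like `cat` would do it.
combined = part_a + part_b

# The "utf-8-sig" codec drops a BOM only at the very start of the input.
text = combined.decode("utf-8-sig")

# Any later BOM survives decoding as the character U+FEFF.
embedded = [i for i, ch in enumerate(text) if ch == "\ufeff"]
print("embedded U+FEFF at character offsets:", embedded)
```

So a decoder that follows the usual convention leaves the second BOM in
the character stream, and the CIF parser has to decide what to do with it.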

From internet searches, there does not seem to be a consensus on whether
UTF-8 should include a BOM. I suspect that the dislike of UTF-8 BOM
comes from people who only work on systems where UTF-8 is the default.

CIF1 does not allow "extended" ASCII encodings for characters 128-255, so
there is no potential for misinterpreting valid CIF1 files. Avoiding that
kind of misinterpretation is the main purpose of a UTF-8 BOM.

So arguments for or against BOMs in CIF files both seem reasonable. CIF
parsers should tolerate leading and embedded UTF-8 BOMs, as well as
files with no BOM. For writing CIF2, it is probably best to make BOMs
optional, because the parser needs to tolerate both variations on input.
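
A rough sketch of such a tolerant reader in Python follows. The function
name decode_cif_text is hypothetical, and replacing an embedded U+FEFF
with a space (plus a warning) is an assumption taken from my earlier
message quoted below; silently dropping it, as iconv does, would be an
equally small change.

```python
import codecs
import warnings


def decode_cif_text(raw: bytes) -> str:
    """Decode UTF-8 CIF bytes, tolerating leading and embedded BOMs.

    Hypothetical helper for illustration, not part of any CIF library.
    """
    # A BOM at the very start is accepted and removed without comment.
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]

    # Strict UTF-8 decoding; a file in another encoding raises an error.
    text = raw.decode("utf-8")

    # Any U+FEFF still present was embedded mid-stream.
    count = text.count("\ufeff")
    if count:
        warnings.warn(
            f"{count} embedded U+FEFF character(s) treated as whitespace; "
            "other CIF software may reject this file"
        )
        text = text.replace("\ufeff", " ")
    return text
```

Files with no BOM pass through unchanged. Note that the substitution is
done before tokenizing, so a U+FEFF inside a quoted data value would also
become a space; a real parser might prefer to handle it in the lexer.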

Krahn, Joe (NIH/NIEHS) [C] wrote:
> In general, CIF does not directly deal with encoding. It should be
> possible to allow a low-level I/O library to deal with all encoding
> issues. Therefore, supporting non-standard stream interpretation should
> be avoided.
> 
> It is a good practical idea to allow mid-stream BOMs by interpreting
> them as character 0xFEFF, and allow it as whitespace, with a warning. It
> should not be a required feature, because it is non-standard, and only
> exists in UTF-8 for backwards-compatibility. Eventually, conforming I/O
> libraries will interpret them as invalid. Ideally, text concatenation
> software will become BOM-aware.
> 
> Interpreting mid-stream BOMs and allowing mixed encodings is a major hack,
> and impractical for systems that deal with encoding at the I/O level. It
> is reasonable to allow it as a non-standard extension, but parsers should
> always give a warning so that people realize that such files are likely
> to be broken elsewhere.
> 
> In summary: Standard CIF2 needs to support standard UTF-8 BOMs at the
> beginning of a file. Anything else should be considered a non-standard
> extension. For practical reasons, CIF2 parsers should be encouraged but
> not required to allow mid-stream UTF-8 0xFEFF as whitespace.
> 
> Joe
> 
> Herbert J. Bernstein wrote:
>> Dear Colleagues,
>>
>>    While it is certainly prudent to tell people to either write a pure
>> UTF-8 file with no BOM or to prefix it with a BOM, and that is
>> how a compliant CIF writer should work, it is not practical to
>> insist that CIF readers should reject embedded BOMs.  Indeed, the
>> URL cited by John does not tell you they are illegal, but that
>> you should treat them as a zero width non-breaking space.
>>
>>    The reason we cannot insist on readers demanding that BOMs occur
>> at the beginning is that users may concatenate whole CIFs or
>> build one CIF out of fragments of text, and this will very likely
>> result in embedded BOMs and possibly switches in encodings.  If
>> we fail to handle the BOMs we are much more likely to garble
>> such files.  I strongly recommend the approach in my prior
>> message -- recognize BOMs at all times.
>>
>>    Regards,
>>      Herbert
>>
>> =====================================================
>>   Herbert J. Bernstein, Professor of Computer Science
>>     Dowling College, Kramer Science Center, KSC 121
>>          Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                   +1-631-244-3035
>>                   yaya@dowling.edu
>> =====================================================
>>
>> On Tue, 11 May 2010, Bollinger, John C wrote:
>>
>>> Dear Colleagues,
>>>
>>> I think CIF processor behavior such as Herb describes would be
>>> outstanding, and I commend Herb for his dedication to providing such
>>> capable and robust software.  I do disagree about one of his specific
>>> points, however:
>>>
>>>> The
>>>> minimum to do with any BOM is:
>>> [...]
>>>
>>>>   1.  Accept it at any point in a character stream.
>>> It would be both unconventional and programmatically inconvenient to
>>> give special significance to U+FEFF anywhere other than at the very
>>> beginning of a file.  The Unicode Consortium in fact addresses this exact
>>> question in its FAQ: http://www.unicode.org/faq/utf_bom.html#bom6.
>>> Although the Consortium's comments do allow for protocol-specific support
>>> for accepting U+FEFF as a BOM other than at the beginning of the stream,
>>> I see little advantage to adding such a complication to the CIF2
>>> specifications.
>>>
>>> This all expands the scope of the topic far beyond what I had intended,
>>> however.  I think it is perhaps useful to recognize at this point,
>>> therefore, that the CIF2 language specification and the behavior of CIF2
>>> processors are separate questions.  This group has already decided that
>>> files compliant with CIF 2.0 are encoded in UTF-8, period.  I do not want
>>> to reopen that debate.  On the other hand, that in no way prevents CIF
>>> processors from -- as an extension -- recognizing and handling putative
>>> CIFs that violate the spec by employing character encodings different
>>> from UTF-8.  That sort of thing is generally heralded as beneficial for
>>> ease of use, and it is consistent with the good design principle of being
>>> relaxed about inputs but strict about outputs.  (And in that vein I would
>>> hope that any CIF 2.0 writer's normal behavior would be to encode in
>>> UTF-8.)
>>>
>>> My suggestion is slightly different, as I hope this restatement will
>>> show: *in light of the fact that spec-compliant CIF2 files are encoded in
>>> UTF-8*, I suggest that the spec allow a file beginning with a UTF-8 BOM
>>> to be spec-compliant (subject to the compliance of the rest of the
>>> contents).  Like Herb, I intend that my parsers will accept such CIFs
>>> whether they strictly comply with the spec or not, but the question is
>>> whether accepting such files should be a compliance requirement or an
>>> extension.  Either way, I think it will be valuable to document this
>>> decision in the spec, if only to draw attention to the issue.
>>>
>>>
>>> Best Regards,
>>>
>>> John
>>> --
>>> John C. Bollinger, Ph.D.
>>> Department of Structural Biology


_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
