
Re: [ddlm-group] UTF-8 BOM

In general, CIF does not directly deal with encoding. It should be
possible to allow a low-level I/O library to deal with all encoding
issues. Therefore, supporting non-standard stream interpretation should
be avoided.
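
As a rough illustration of leaving encoding to the I/O layer, here is a
sketch in Python. It is not part of any CIF specification, and the helper
name read_cif_text is hypothetical. It relies on Python's standard
"utf-8-sig" codec, which decodes UTF-8 and drops a single leading BOM if
one is present; any later U+FEFF is passed through as an ordinary
character.

```python
import io


def read_cif_text(stream):
    """Decode a binary stream as UTF-8, dropping one leading BOM if present.

    Hypothetical helper for illustration only. The BOM handling is done
    entirely by the standard 'utf-8-sig' codec, not by CIF-specific logic.
    """
    return stream.read().decode("utf-8-sig")


# Usage: a leading BOM disappears, a mid-stream U+FEFF is left in the text.
with_bom = io.BytesIO(b"\xef\xbb\xbfdata_example\n")
print(repr(read_cif_text(with_bom)))
```

With this approach the parser itself never sees the leading BOM, so it
needs no encoding logic of its own.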

It is a good practical idea to allow mid-stream BOMs by interpreting
them as the character U+FEFF and allowing it as whitespace, with a
warning. This should not be a required feature, because it is
non-standard, and the BOM only exists in UTF-8 for backwards
compatibility. Eventually, conforming I/O libraries will treat
mid-stream BOMs as invalid. Ideally, text concatenation software will
become BOM-aware.

Interpreting mid-stream BOMs and allowing mixed encodings is a major
hack, and impractical for systems that deal with encoding at the I/O
level. It is reasonable to allow this as a non-standard extension, but a
parser that does so should always give a warning so that people realize
that such files are likely to be broken elsewhere.

In summary: Standard CIF2 needs to support a standard UTF-8 BOM at the
beginning of a file. Anything else should be considered a non-standard
extension. For practical reasons, CIF2 parsers should be encouraged but
not required to allow a mid-stream U+FEFF as whitespace.
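
The behaviour summarized above could be sketched as follows in Python.
This is only an illustration under stated assumptions: the function name
normalize_bom is hypothetical, the input is assumed to be text already
decoded as plain UTF-8 (so a leading BOM is still present as U+FEFF),
and replacing a mid-stream U+FEFF with a space is just one way of
"treating it as whitespace".

```python
import warnings

BOM = "\ufeff"  # U+FEFF, the character a UTF-8 BOM decodes to


def normalize_bom(text):
    """Drop a leading BOM; turn any later U+FEFF into a space, with a warning.

    Hypothetical helper: the leading BOM is standard and removed silently,
    while mid-stream occurrences are a non-standard extension and are
    reported via warnings.warn.
    """
    if text.startswith(BOM):
        text = text[1:]
    if BOM in text:
        warnings.warn(
            "mid-stream U+FEFF treated as whitespace (non-standard extension)"
        )
        text = text.replace(BOM, " ")
    return text
```

A strict parser would instead reject the mid-stream case; the warning is
there so that users learn such files are likely to fail elsewhere.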


Herbert J. Bernstein wrote:
> Dear Colleagues,
>    While it is certainly prudent to tell people to either write a pure
> UTF-8 file with no BOM or to prefix it with a BOM, and that is
> how a compliant CIF writer should work, it is not practical to
> insist that CIF readers should reject embedded BOMs.  Indeed, the
> URL cited by John does not tell you they are illegal, but that
> you should treat them as a zero width non-breaking space.
>    The reason we cannot insist on readers demanding that BOMs occur
> at the beginning is that users may concatenate whole CIFs or
> build one CIF out of fragments of text, and this will very likely
> result in embedded BOMs and possibly switches in encodings.  If
> we fail to handle the BOMs we are much more likely to garble
> such files.  I strongly recommend the approach in my prior
> message -- recognize BOMs at all times.
>    Regards,
>      Herbert
> =====================================================
>   Herbert J. Bernstein, Professor of Computer Science
>     Dowling College, Kramer Science Center, KSC 121
>          Idle Hour Blvd, Oakdale, NY, 11769
>                   +1-631-244-3035
>                   yaya@dowling.edu
> =====================================================
> On Tue, 11 May 2010, Bollinger, John C wrote:
>> Dear Colleagues,
>> I think CIF processor behavior such as Herb describes would be
>> outstanding, and I commend Herb for his dedication to providing such
>> capable and robust software.  I do disagree about one of his specific
>> points, however:
>>> The
>>> minimum to do with any BOM is:
>> [...]
>>>   1.  Accept it at any point in a character stream.
>> It would be both unconventional and programmatically inconvenient to
>> give special significance to U+FEFF anywhere other than at the very
>> beginning of a file.  The Unicode Consortium in fact addresses this exact
>> question in its FAQ: http://www.unicode.org/faq/utf_bom.html#bom6.
>> Although the Unicode Consortium's comments do allow for protocol-specific
>> support for accepting U+FEFF as a BOM other than at the beginning of the stream,
>> I see little advantage to adding such a complication to the CIF2
>> specifications.
>> This all expands the scope of the topic far beyond what I had intended,
>> however.  I think it is perhaps useful to recognize at this point,
>> therefore, that the CIF2 language specification and the behavior of CIF2
>> processors are separate questions.  This group has already decided that
>> files compliant with CIF 2.0 are encoded in UTF-8, period.  I do not want
>> to reopen that debate.  On the other hand, that in no way prevents CIF
>> processors from -- as an extension -- recognizing and handling putative
>> CIFs that violate the spec by employing character encodings different
>> from UTF-8.  That sort of thing is generally heralded as beneficial for
>> ease of use, and it is consistent with the good design principle of being
>> relaxed about inputs but strict about outputs.  (And in that vein I would
>> hope that any CIF 2.0 writer's normal behavior would be to encode in
>> UTF-8.)
>> My suggestion is slightly different, as I hope this restatement will
>> show: *in light of the fact that spec-compliant CIF2 files are encoded in
>> UTF-8*, I suggest that the spec allow a file beginning with a UTF-8 BOM
>> to be spec-compliant (subject to the compliance of the rest of the
>> contents).  Like Herb, I intend that my parsers will accept such CIFs
>> whether they strictly comply with the spec or not, but the question is
>> whether accepting such files should be a compliance requirement or an
>> extension.  Either way, I think it will be valuable to document this
>> decision in the spec, if only to draw attention to the issue.
>> Best Regards,
>> John
>> --
>> John C. Bollinger, Ph.D.
>> Department of Structural Biology
>> St. Jude Children's Research Hospital
>> Email Disclaimer:  www.stjude.org/emaildisclaimer
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
