Discussion List Archives

Re: [ddlm-group] UTF-8 BOM

Dear Colleagues,

I think CIF processor behavior such as Herb describes would be
outstanding, and I commend Herb for his dedication to providing such
capable and robust software.  I do disagree about one of his specific
points, however:

> The
> minimum to do with any BOM is:

[...]

>   1.  Accept it at any point in a character stream.

It would be both unconventional and programmatically inconvenient to
give special significance to U+FEFF anywhere other than at the very
beginning of a file.  The Unicode Consortium in fact addresses this exact
question in its FAQ: http://www.unicode.org/faq/utf_bom.html#bom6.
Although the Consortium's comments do allow for protocol-specific support
for accepting U+FEFF as a BOM other than at the beginning of the stream,
I see little advantage to adding such a complication to the CIF2
specifications.
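
A minimal sketch of the "beginning of the file only" treatment, assuming
Python and two hypothetical helper names (strip_leading_bom and
decode_cif_bytes are illustrative, not part of any CIF library).  U+FEFF
is dropped only when it is the first code point; anywhere else it is left
in the text as an ordinary character for the parser to deal with.

```python
BOM = "\ufeff"  # U+FEFF, which serves as the byte order mark


def strip_leading_bom(text: str) -> str:
    """Drop U+FEFF only when it is the very first code point."""
    return text[1:] if text.startswith(BOM) else text


def decode_cif_bytes(data: bytes) -> str:
    """Decode UTF-8 bytes and remove a BOM found at the start, if any.

    Python's plain "utf-8" codec keeps a leading BOM as U+FEFF, so it is
    stripped explicitly here.  A U+FEFF later in the stream is untouched.
    """
    return strip_leading_bom(data.decode("utf-8"))
```

Python's built-in "utf-8-sig" codec performs the same leading-only
removal on decoding and could be used instead of the explicit helper.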

This all expands the scope of the topic far beyond what I had intended,
however.  I think it is perhaps useful to recognize at this point,
therefore, that the CIF2 language specification and the behavior of CIF2
processors are separate questions.  This group has already decided that
files compliant with CIF 2.0 are encoded in UTF-8, period.  I do not want
to reopen that debate.  On the other hand, that in no way prevents CIF
processors from -- as an extension -- recognizing and handling putative
CIFs that violate the spec by employing character encodings different
from UTF-8.  That sort of thing is generally heralded as beneficial for
ease of use, and it is consistent with the good design principle of being
relaxed about inputs but strict about outputs.  (And in that vein I would
hope that any CIF 2.0 writer's normal behavior would be to encode in
UTF-8.)
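
The "relaxed about inputs, strict about outputs" idea might look roughly
like the following Python sketch.  The function names are hypothetical,
and recognizing UTF-16 by its BOM is only one example of an extension a
processor could choose to offer; nothing here is required by the spec.

```python
import codecs


def read_cif_leniently(data: bytes) -> str:
    """Extension behavior: also accept UTF-16 input announced by a BOM.

    UTF-32 is deliberately not handled in this sketch.
    """
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec consumes the BOM and uses it to pick byte order.
        return data.decode("utf-16")
    # Default path: spec-compliant input is UTF-8.  Invalid bytes raise
    # UnicodeDecodeError rather than being silently replaced.
    return data.decode("utf-8")


def write_cif_strictly(text: str) -> bytes:
    """Output is always UTF-8, whatever encoding the input used."""
    return text.encode("utf-8")
```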

My suggestion is slightly different, as I hope this restatement will
show: *in light of the fact that spec-compliant CIF2 files are encoded in
UTF-8*, I suggest that the spec allow a file beginning with a UTF-8 BOM
to be spec-compliant (subject to the compliance of the rest of the
contents).  Like Herb, I intend that my parsers will accept such CIFs
whether they strictly comply with the spec or not, but the question is
whether accepting such files should be a compliance requirement or an
extension.  Either way, I think it will be valuable to document this
decision in the spec, if only to draw attention to the issue.
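
To illustrate the distinction between "compliance requirement" and
"extension", here is a small Python sketch.  The function name and the
bom_allowed_by_spec flag are assumptions made for the example: the reader
accepts a leading UTF-8 BOM either way, and the flag only decides whether
the file is reported as deviating from the spec.

```python
import codecs


def check_leading_bom(data: bytes, bom_allowed_by_spec: bool) -> tuple[bytes, list[str]]:
    """Strip a leading UTF-8 BOM and report how it was treated.

    Returns the payload without the BOM plus a list of notes.  The list is
    empty when the file needs no comment under the chosen policy.
    """
    notes: list[str] = []
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
        if not bom_allowed_by_spec:
            # Accepted anyway, but flagged as outside the spec.
            notes.append("leading UTF-8 BOM accepted as an extension")
    return data, notes
```

Under either policy the parsed content is identical; only the compliance
report differs.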


Best Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital


