Discussion List Archives


Re: [ddlm-group] UTF-8 BOM

Allow me to clarify my position, so there is no misunderstanding:

I believe that we will be dealing with a world with at least UTF-8
and UCS-2/UTF-16 encodings for many years to come.  I have no
objection to CIF2 being specified solely in terms of UTF-8 for
simplicity and consistency, but if we are to write software that
people can use, we must have a reasonable position with respect
to the encodings people use, and that means that, at the very
least, we need to accept and process UTF-8 BOMs as harmless
additional text.  Some of us will also be supporting UCS-2/UTF-16
directly in our applications.  I don't mind if other applications
are only going to support UTF-8, but as long as we have Java and
web browsers, we are going to encounter UCS-2/UTF-16, so we should
do something sensible when a UCS-2/UTF-16 BOM pops up:  either do
the translation internally, or, if a particular application does
not wish to handle UCS-2/UTF-16, issue a polite warning suggesting
the use of an external translator.

BOMs will almost always appear in modern UCS-2/UTF-16 files, and when
they are converted to UTF-8 that will give us yet another source of
UTF-8 BOMs.  I believe the sensible thing to do is to recognize BOMs.
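
To make the suggestion concrete, here is a minimal sketch in Python of
what "recognizing BOMs" on input might look like.  It is only an
illustration under stated assumptions, not part of any CIF2 draft or
existing CIF library: the names sniff_encoding and read_cif_text and
the translate_utf16 switch are hypothetical, and raising an error with
a polite message stands in for the "polite warning".

```python
import codecs

# BOM byte signatures. The UTF-32 entries come first because the UTF-32 LE
# BOM begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_encoding(data):
    """Guess the encoding from a leading BOM; assume UTF-8 when there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return "utf-8"


def read_cif_text(data, translate_utf16=True):
    """Decode raw file bytes to text (hypothetical helper).

    A UTF-8 BOM is accepted and dropped as harmless. A UTF-16 BOM is either
    translated internally or rejected with a polite message, depending on
    translate_utf16 (an assumed application-level choice).
    """
    encoding = sniff_encoding(data)
    if encoding in ("utf-8", "utf-8-sig"):
        # "utf-8-sig" strips one leading BOM if it is present.
        return data.decode("utf-8-sig")
    if encoding == "utf-16" and translate_utf16:
        # The "utf-16" codec reads the BOM to pick the byte order, then drops it.
        return data.decode("utf-16")
    raise ValueError(
        "This file appears to be encoded as %s rather than UTF-8; "
        "please convert it to UTF-8 with an external translator." % encoding
    )
```

An application that only supports UTF-8 would call
read_cif_text(data, translate_utf16=False) and show the message to the
user; one that supports UCS-2/UTF-16 directly would leave the default.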

Regards,
     Herbert
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Tue, 18 May 2010, Bollinger, John C wrote:

> Herbert Bernstein wrote:
>> Let me see if I understand this correctly -- a user takes 2 perfectly good
>> CIF2 files, edits each to clean up, say, some comments to keep straight where
>> one begins and one ends, using a well-designed modern text editor that
>> happens to put a BOM at the start of each file, concatenates the two files
>> with cat to ship them into the IUCr, and suddenly they have a syntax error
>> caused by a character that they cannot see!!!
>>
>> To me this seems pointless when it is trivial for software to recognize the
>> character and handle it sensibly.
>
> And that is my principal rationale for preferring that embedded U+FEFF be recognized as CIF whitespace.  With that approach, the concatenation of two well-formed CIF2 files is always a well-formed CIF2 file, regardless of the presence or absence of BOMs in the original files.  Note, too, that such concatenation cannot produce a mixed-encoding file because files encoded in UTF-16[BE|LE], UTF-32[BE|LE], or any other encoding that can be distinguished from UTF-8 are not well-formed CIF2 files to start.  The file concatenation scenario thus does not provide a use case for the CIF2 *specification* to recognize embedded U+FEFF as an encoding marker.
>
> On the other hand, I again feel compelled to distinguish program behaviors from the CIF2 format specification.  None of the above would prevent a CIF processor from recognizing and handling CIF-like character streams encoded via schemes other than UTF-8, nor from recognizing embedded U+FEFF code sequences in various encodings as encoding switches, thereby handling mixed-encoding files.  Indeed, such a program or library would be invaluable for correcting encoding-related errors.  That does not, however, mean that such files must be considered well-formed CIF2, no matter how likely they may (or may not) be to arise.
>
>
> James Hester wrote:
>> I would be happy to call an embedded BOM a syntax error.
>
> In light of the possibility of U+FEFF appearing in a data value (for example, from cutting text from a Unicode manuscript and pasting it into a CIF), I need to refine my earlier blanket alternative of treating embedded U+FEFF as a syntax error.  I now think it would be ok to treat U+FEFF as a syntax error *provided* that it appears outside a delimited string.  That's still not my preference, though, and I feel confident that Herb will still disagree.
>
>
> Regards,
>
> John
> --
> John C. Bollinger, Ph.D.
> Computing and X-Ray Scientist
> Department of Structural Biology
> St. Jude Children's Research Hospital
> John.Bollinger@StJude.org
> (901) 595-3166 [office]
> www.stjude.org
>
>
>
> Email Disclaimer:  www.stjude.org/emaildisclaimer
>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
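
The concatenation scenario discussed in the quoted message can be
sketched as follows.  This is a rough illustration only: split_tokens
is a hypothetical helper, the whitespace set (space, tab, CR, LF) is an
assumption rather than the CIF2 definition, and delimited strings are
deliberately ignored.  It shows that joining two BOM-prefixed UTF-8
files leaves a U+FEFF in the middle of the text, and that counting
U+FEFF as whitespace lets a tokenizer pass over it.

```python
import re

# Assumed whitespace characters plus U+FEFF, as proposed in the thread.
# A real CIF2 tokenizer would also have to handle delimited strings,
# where U+FEFF may be legitimate data; this sketch does not.
_TOKEN = re.compile(r"[^ \t\r\n\ufeff]+")


def split_tokens(text):
    """Split text into tokens, treating U+FEFF like any other whitespace."""
    return _TOKEN.findall(text)


# Two files, each saved by an editor that writes a UTF-8 BOM.
file_a = "\ufeffdata_a\n_x 1\n".encode("utf-8")
file_b = "\ufeffdata_b\n_y 2\n".encode("utf-8")

# Equivalent of "cat a.cif b.cif": the second BOM ends up embedded.
joined = (file_a + file_b).decode("utf-8")

tokens = split_tokens(joined)
```

Under the alternative of treating U+FEFF outside a delimited string as
a syntax error, the same tokenizer would instead report the position of
the embedded character rather than skip it.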
