On Tuesday, June 22, 2010 11:06 AM, SIMON WESTRIP wrote:
>CIF may currently be handled with multiple encodings, but as it's restricted to ASCII, the
>encoding issue hasn't really been relevant - most code pages include the ASCII code points?

It is a common feature of many encodings to be congruent with 7-bit ASCII over its range, but that is not universal.  UTF-16 and UTF-32, for example, are not congruent with ASCII anywhere.  Neither is EBCDIC.  Shift-JIS is mostly congruent with ASCII, but varies at two code points (0x5C and 0x7E, which it assigns to the yen sign and the overline).
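
As a rough illustration of what I mean by congruence, here is a small Python sketch.  The function name is mine, and I am using Python's "cp037" codec as a stand-in for EBCDIC (there are many EBCDIC variants).  It simply asks whether an encoding produces the same bytes as ASCII for all 128 ASCII characters.

```python
# All 128 ASCII characters, and the bytes ASCII itself assigns to them.
ASCII_BYTES = bytes(range(128))
ASCII_TEXT = ASCII_BYTES.decode("ascii")

def congruent_with_ascii(encoding: str) -> bool:
    """Return True if `encoding` encodes every ASCII character as its ASCII byte."""
    try:
        return ASCII_TEXT.encode(encoding) == ASCII_BYTES
    except UnicodeEncodeError:
        # The encoding cannot represent some ASCII character at all.
        return False

# "cp037" is one EBCDIC variant, used here only as an example.
for name in ("utf-8", "latin-1", "cp1252", "utf-16-le", "utf-32-le", "cp037"):
    print(f"{name:10} {congruent_with_ascii(name)}")
```

I have left Shift-JIS out of the loop on purpose: codec libraries differ in how they treat its two variant code points, so the answer depends on the implementation.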

>If CIF2 is also to allow multiple encodings, it is quite possible that a basic text editor will not render the content
>appropriately for anything outside the ASCII range if it is unable to determine the encoding (it may not even attempt to
>determine the encoding - my linux text editors aren't very good at autodetection - I don't know about windows notepad,
>but last time I looked it couldn't even interpret linux line endings appropriately).

Indeed.  I believe some basic text editors will assume that any file presented to them uses the host's default encoding.  In many cases that is not UTF-8, so selecting UTF-8 as the only CIF encoding does not promote CIF interoperability with those particular programs.
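
To make the failure mode concrete, here is a minimal Python sketch of what such an editor effectively does when the host default is a Windows code page.  The choice of cp1252 as the "host default" and the sample character are assumptions for illustration only.

```python
# One non-ASCII character, written to disk as UTF-8.
text = "\u00e9"                      # LATIN SMALL LETTER E WITH ACUTE
utf8_bytes = text.encode("utf-8")    # two bytes: 0xC3 0xA9

# An editor that assumes the (hypothetical) host default cp1252
# decodes each byte as a separate character.
misread = utf8_bytes.decode("cp1252")

print(len(text), len(misread))       # one character in, two characters shown
print(misread == text)
```

Nothing outside the ASCII range survives that treatment, whichever single encoding the standard names.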

>In the absence of a BOM, the only solution is to use an heuristic approach to determine the encoding?

Not necessarily.  If the data are delivered via web form or other HTTP-based method, for example, then HTTP provides support for specifying the encoding, via the charset parameter of the Content-Type header.  Similarly, if the file is delivered as part of a MIME multipart message, then the content type specified by its MIME headers can express the encoding.
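
For what it's worth, reading the declared encoding out of such a header takes only a few lines.  The sketch below uses Python's standard email package; the function name and the UTF-8 fallback are my own assumptions, not anything CIF or HTTP prescribes.

```python
from email.message import Message

def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value.

    `default` is a hypothetical fallback used when no charset is declared.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset in lower case, or None.
    return msg.get_content_charset() or default

print(charset_from_content_type("text/plain; charset=ISO-8859-1"))
print(charset_from_content_type("text/plain"))
```

The same parsing applies to the Content-Type of an individual part in a MIME multipart message.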

>Such heuristics would also have to be applied in order to process the CIF (which I'd already decided I will have to do
>because of the likelihood of receiving non-UTF8 CIF2's)

Were I in your shoes, I would plan to transcode non-UTF-8 CIFs to UTF-8 upon receipt, as part of the verification process.  I would store only the UTF-8 version; thereafter, no worries.  One of the advantages of defining CIF2 as an encoding-independent text format would be that doing as I describe would preserve the original *CIF* data (i.e. the text) with 100% fidelity, even though it might not preserve the exact byte stream.
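
A minimal sketch of that transcoding step, in Python, might look like the following.  It assumes the source encoding has already been determined by whatever means; the function name and the sample line are made up for illustration.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode `raw` from `source_encoding` and re-encode the text as UTF-8.

    Decoding is strict, so invalid input raises UnicodeDecodeError, which
    can double as part of the verification on receipt.
    """
    text = raw.decode(source_encoding)
    return text.encode("utf-8")

# Hypothetical CIF-like line containing non-ASCII characters.
original_text = "_cell_length_a 5.43 # \u00c5ngstr\u00f6ms\n"
received = original_text.encode("latin-1")   # as delivered, not UTF-8
stored = transcode_to_utf8(received, "latin-1")

print(stored.decode("utf-8") == original_text)  # is the text preserved?
print(stored == received)                       # is the byte stream preserved?
```

The first comparison is about the text and the second about the bytes, which is exactly the distinction I am drawing.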

>So I still believe that as a *standard* we should specify UTF8.
>However, that does not mean that we cannot be tolerant of other encodings?
>If a system exists that processes all its CIFs in a different encoding, I see no reason for it to change -
>only when the CIF is to be made publicly available should it be converted to UTF-8.
>Likewise, if such a system is capable of handling current CIFs, surely it will manage UTF-8 CIFs with
>little overhead? After all, CIF2 is going to be different from CIF1.

This nicely captures my point about the CIF data format vs. CIF storage and interchange.  UTF-8 can very easily be a standard for CIF interchange -- perhaps the only standard -- without conflating that with the CIF data format.


