Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .

To: "'Group finalising DDLm and associated dictionaries'" <[email protected]>
Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .
From: "Bollinger, John C" <[email protected]>
Date: Tue, 22 Jun 2010 12:25:10 -0500

On Tuesday, June 22, 2010 11:06 AM, SIMON WESTRIP wrote:
>CIF may currently be handled with multiple encodings, but as it's restricted to ASCII, the
>encoding issue hasn't really been relevant - most code pages include the ASCII code points?

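Many encodings are congruent with 7-bit ASCII over its range, but that is not universal.  UTF-16 and UTF-32, for example, are not congruent with ASCII anywhere, because they represent every character, ASCII characters included, with two- or four-byte code units.  Neither is EBCDIC.  Shift-JIS is mostly congruent with ASCII, but varies at two code points: it assigns 0x5C to the yen sign rather than the backslash, and 0x7E to the overline rather than the tilde.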
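
To make the congruence point concrete, here is a minimal sketch that encodes all 128 ASCII characters under several encodings and compares the resulting bytes with plain ASCII.  The helper name `ascii_congruent` is hypothetical, and `cp037` is used only as one representative EBCDIC code page; the codec names are those of Python's standard library.

```python
# All 128 characters of the 7-bit ASCII range, as one string.
ASCII_TEXT = "".join(chr(i) for i in range(128))


def ascii_congruent(encoding: str) -> bool:
    """Return True if `encoding` produces the same bytes as ASCII
    for every character in the ASCII range."""
    return ASCII_TEXT.encode(encoding) == ASCII_TEXT.encode("ascii")


# cp037 is one EBCDIC code page, chosen here purely for illustration.
for name in ("utf-8", "latin-1", "cp1252", "utf-16-le", "utf-32-le", "cp037"):
    print(f"{name:10s} congruent with ASCII: {ascii_congruent(name)}")
```

A file restricted to ASCII characters is therefore byte-identical under the first three encodings, but not under the last three.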

>If CIF2 is also to allow multiple encodings, it is quite possible that a basic text editor will not render the content
>appropriately for anything outside the ASCII range if it is unable to determine the encoding (it may not even attempt to
>determine the encoding - my linux text editors aren't very good at autodetection - I don't know about windows notepad,
>but last time I looked it couldn't even interpret linux line endings appropriately).

Indeed.  I believe some basic text editors will assume that any file presented to them uses the host's default encoding.  In many cases that is not UTF-8, so selecting UTF-8 as the only CIF encoding does not promote CIF interoperability with those particular programs.

>In the absence of a BOM, the only solution is to use an heuristic approach to determine the encoding?

Not necessarily.  If the data are delivered via web form or other HTTP-based method, for example, then HTTP itself supports specifying the encoding, via the charset parameter of the Content-Type header.  Similarly, if the file is delivered as part of a MIME multipart message, then the content type specified by its MIME headers can express the encoding.
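
As an illustration of relying on transport metadata rather than heuristics, the following sketch reads the charset parameter from a Content-Type header value (the same header syntax is used by HTTP and MIME) and uses it to decode the received bytes.  The function name `charset_from_content_type`, the sample header, and the UTF-8 fallback are assumptions made for the example, not part of any CIF specification.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value.

    Falls back to `default` (an assumption of this sketch) when the
    header carries no charset parameter.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


# Hypothetical received payload: ISO-8859-1 bytes plus the header that came with them.
body = "1.54 \u00c5".encode("iso-8859-1")
charset = charset_from_content_type("text/plain; charset=ISO-8859-1")
text = body.decode(charset)
print(charset, repr(text))
```

When the sender labels the content this way, the receiver needs no BOM and no guesswork.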

>Such heuristics would also have to be applied in order to process the CIF (which I'd already decided I will have to do
>because of the likelihood of receiving non-UTF8 CIF2's)

Were I in your shoes, I would plan to transcode non-UTF-8 CIFs to UTF-8 upon receipt, as part of the verification process.  I would store only the UTF-8 version; thereafter, no worries.  One of the advantages of defining CIF2 as an encoding-independent text format would be that doing as I describe would preserve the original *CIF* data (i.e. the text) with 100% fidelity, even though it might not preserve the exact byte stream.
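
A minimal sketch of that transcode-on-receipt step, assuming the source encoding is already known (declared by the sender or otherwise determined).  The function name `transcode_to_utf8` and the sample data item are hypothetical.  Strict decoding doubles as a verification step, since it raises an error on bytes that are invalid in the claimed encoding.

```python
def transcode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode `raw` from `source_encoding` and re-encode it as UTF-8.

    Decoding is strict, so a UnicodeDecodeError signals that the bytes
    do not match the claimed encoding.
    """
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Hypothetical CIF fragment received as UTF-16 (with BOM).
received = "_cell_length_a 5.43 \u00c5".encode("utf-16")
stored = transcode_to_utf8(received, "utf-16")

# The byte streams differ, but the text they carry is identical.
print(stored != received)
print(stored.decode("utf-8") == received.decode("utf-16"))
```

This is the sense in which the text survives with full fidelity while the exact byte stream does not.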

>So I still believe that as a *standard* we should specify UTF8.
>
>However, that does not mean that we cannot be tolerant of other encodings?
>If a system exists that processes all its CIFs in a different encoding, I see no reason for it to change -
>only when the CIF is to be made publicly available should it be converted to UTF-8.
>Likewise, if such a system is capable of handling current CIFs, surely it will manage UTF-8 CIFs with
>little overhead? After all, CIF2 is going to be different from CIF1.

This nicely captures my point about the CIF data format vs. CIF storage and interchange.  UTF-8 can very easily be a standard for CIF interchange -- perhaps the only standard -- without conflating that with the CIF data format.

Cheers,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

