
Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .


I prefer leaving the issue of character encoding entirely out of the scope of the CIF format specification (effectively allowing any encoding).  On the other hand, I think it's a bit of an aggrandizement to characterize UTF-16 / Shift-JIS / etc. as "ways in which many of our colleagues get their science done."  In no way do I dispute that many of our colleagues indeed use these encodings routinely, but I am doubtful that editing Unicode text with a text editor constitutes a significant part of many of their research programs.  At least, few of my English-speaking colleagues edit flat Unicode text files with any frequency, if they do so at all.

I think there is already good software, some of it free (both senses), for operating systems at least as old as Windows 9x, that supports editing UTF-8 encoded text.  Most of it also supports a multitude of other encodings.  We would leave no one out by requiring UTF-8, and I do not see that respect for our colleagues demands that CIF2 be equally convenient to create and edit with every text editor in current use.  If that is doubtful, however, and respect is our goal, then wouldn't the most respectful thing be to *ask* a few of the people about whom we are concerned?

My issue here is different, and at least partly philosophical.  The CIF format can and should be about the structure and meaning of CIF text content.  Character encoding is on a different level: it's a characteristic of storage and interchange.  Commingling these layers is inelegant and unnecessary.

Moreover, a CIF2 requirement to encode in UTF-8 will be small comfort when you are presented with a file that is not, in fact, encoded that way.  What can you then do?  Either reject the file or autodetect the encoding.  If CIF2 does not specify a particular encoding, and you receive the same file, then what can you do?  Exactly the same things, but then it's more likely that the file's provider will have also specified the encoding by some means.  (Particularly so if the CIF2 spec calls attention to the need to do so.)
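
To make the "reject or autodetect" choice concrete, here is a minimal Python sketch of what a reader might do with putative CIF bytes.  The function name decode_putative_cif and its declared_encoding parameter are hypothetical, invented for illustration; the autodetection shown is only a byte-order-mark check followed by a strict UTF-8 attempt, not a general-purpose encoding detector.

```python
import codecs
from typing import Optional


def decode_putative_cif(data: bytes, declared_encoding: Optional[str] = None) -> str:
    """Decode bytes that are claimed to be CIF text (hypothetical helper)."""
    # If the provider stated the encoding by some means, trust it and decode
    # strictly; a mismatch raises UnicodeDecodeError.
    if declared_encoding is not None:
        return data.decode(declared_encoding)

    # Otherwise, autodetect from a leading byte-order mark. UTF-32 is checked
    # before UTF-16 because the UTF-32-LE mark begins with the UTF-16-LE mark.
    bom_table = (
        (codecs.BOM_UTF8, "utf-8-sig"),
        (codecs.BOM_UTF32_LE, "utf-32"),
        (codecs.BOM_UTF32_BE, "utf-32"),
        (codecs.BOM_UTF16_LE, "utf-16"),
        (codecs.BOM_UTF16_BE, "utf-16"),
    )
    for bom, codec_name in bom_table:
        if data.startswith(bom):
            return data.decode(codec_name)

    # No mark: accept the data only if it is well-formed UTF-8, else reject.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise ValueError("cannot determine encoding; rejecting file") from exc
```

Note that the reader behaves the same way whether or not the specification mandates UTF-8; the only thing that changes the outcome is whether declared_encoding is supplied.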

Perhaps something like this would be an acceptable compromise:
a) Rewrite change 2 to remove the requirement for UTF-8
b) Add:
====
CHANGE 9 - NEW (CIF Interchange Format)

Many alternative encodings are available for recording and exchanging Unicode character data via byte-oriented media.  The CIF format itself is encoding independent, but that allows for uncertainty as to how to handle putative CIF data unaccompanied by encoding information.  We therefore define a simple, binary CIF Interchange Format, consisting of CIF2 text encoded in UTF-8, with an optional initial UTF-8 byte-order mark.  CIF Interchange Format is intended as a storage and interchange standard for CIF2.  Its use is strongly encouraged, but its existence should not be taken as a prohibition against use of alternative storage and interchange formats among agreeing parties.

The standard file name extension for CIF Interchange Format files is .cif.
====
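
As an illustration of how little the proposed change 9 asks of software, here is a minimal Python sketch of writing and reading CIF Interchange Format as worded above: CIF2 text encoded in UTF-8, with an optional initial UTF-8 byte-order mark.  The function names are hypothetical, and the sketch deals only with the encoding layer, not with CIF2 syntax.

```python
import codecs


def encode_cif_interchange(text: str, with_bom: bool = False) -> bytes:
    """Encode CIF2 text as UTF-8, optionally prefixed by a UTF-8 BOM."""
    payload = text.encode("utf-8")
    return codecs.BOM_UTF8 + payload if with_bom else payload


def decode_cif_interchange(data: bytes) -> str:
    """Decode interchange-format bytes back to CIF2 text."""
    # The initial byte-order mark is optional; drop it if present.
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    # Strict decoding: bytes that are not well-formed UTF-8 raise
    # UnicodeDecodeError, so such input is not accepted as interchange format.
    return data.decode("utf-8")
```

Parties who agree on some other encoding would simply not use these functions; nothing about the CIF2 text itself changes.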


Regards,

John
--
John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital


Email Disclaimer:  www.stjude.org/emaildisclaimer
