Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Tue, 22 Jun 2010 13:40:47 -0400
- In-Reply-To: <227318.9387.qm@web87008.mail.ird.yahoo.com>
- References: <alpine.BSF.2.00.1005111250250.60002@epsilon.pair.com><alpine.BSF.2.00.1006172025070.91418@epsilon.pair.com><AANLkTimEn-5bOcLNsa1DSOjDS7XqFmqVKA-W-6Z4NxFO@mail.gmail.com><alpine.BSF.2.00.1006172107430.91418@epsilon.pair.com><AANLkTilJUtXpw5UFQv0Y04Knrv9wCPLr5eertWPCcTzz@mail.gmail.com><alpine.BSF.2.00.1006180703230.91255@epsilon.pair.com><alpine.BSF.2.00.1006180837330.91255@epsilon.pair.com><AANLkTildS0DVEj76rffd8sgXgno2INL8zkXI_qsBjSLP@mail.gmail.com><a06240803c845518a843e@192.168.2.104><AANLkTilyJE2mCxprlBYaSkysu1OBjY7otWrXDWm3oOT9@mail.gmail.com><alpine.BSF.2.00.1006212018430.91069@epsilon.pair.com><AANLkTilolZk4SzLF8mzqOz4EagFJcEHDKOAblGMnoqpW@mail.gmail.com><alpine.BSF.2.00.1006212120510.91069@epsilon.pair.com><AANLkTiklvzlKquqlRQIrpPGZjJfuRzLqiv2E6Stcq6wd@mail.gmail.com><alpine.BSF.2.00.1006212241210.4105@epsilon.pair.com><AANLkTilACXxnPRtJXEjGD39eleDl9dxlAcwar8j9MBPr@mail.gmail.com><alpine.BSF.2.00.1006220753471.87930@epsilon.pair.com> <8F77913624F7524AACD2A92EAF3BFA54166122951E@SJMEMXMBS11.stjude.sjcrh.local><227318.9387.qm@web87008.mail.ird.yahoo.com>
Dear Colleagues,

Except when I find the time to work with hardware, much of the science I do ends up involving a great deal of editing of documents -- and it is a royal waste of time to tell somebody to learn new editing habits without a very good reason, so it is very much the case that such mundane issues as encodings and keyboard layouts are a large factor in how science gets done by many people. Most people don't even realize how many different text encodings they use and how different the text encodings used by their colleagues may be. In going from system to system, e.g. by email, the translations among encodings are close to invisible.

Instead of focusing on the change document, could we please focus on what the CIF2 specification as a complete, coherent document should say? Taking into account what has been said thus far, here is a slightly revised version of what I proposed:

=====================================================================

CIF2 is a specification for the interchange of text files. Text files have many possible system-dependent representations and encodings. To ensure clarity in the specification of CIF2, this document is written in terms of a sequence of Unicode code points, and all fully compliant CIF2 processing systems should, at a minimum, be able to process text files as Unicode code points represented in UTF-8, subject to the XML-based restrictions below. This approach is not meant to prevent people from preparing valid CIF2 files with non-UTF-8-based text editors, but, if a non-UTF-8 file format is produced, it is important to clearly specify the intended mapping to UTF-8. Almost all modern systems have available a standard mapping from their internal text representation to and from UTF-8.

Special care is needed in dealing with end-of-line indicators (see http://en.wikipedia.org/wiki/Newline). This document will only refer to LF (line feed or newline) as the line terminator.
When handling CIF2 files produced under MS Windows, CR-LF sequences should be accepted as an alternative to LF, and when handling CIF2 files produced under Mac OS, CR should be accepted as an alternative to LF. The safest policy is to accept any of CR-LF or CR or LF as line terminators if possible, and to map all of them to LF on reading a CIF. Systems with other, additional line terminators should avoid introducing them into CIF2 files meant for interchange.

To ensure compatibility with older Fortran text processing software, lines in CIF2 files should be restricted to no more than 2048 code points in length, not including the line terminator itself. Note that the UTF-8 encoding of such a line may well be much longer.

===================================================================

At 4:13 PM +0000 6/22/10, SIMON WESTRIP wrote:
>Perhaps John's compromise might be the way forward?
>
>
>From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
>To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>Sent: Tuesday, 22 June, 2010 16:15:36
>Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .
>
>
>I prefer leaving the issue of character encoding entirely out of the
>scope of the CIF format specification (effectively allowing any
>encoding). On the other hand, I think it's a bit of an
>aggrandizement to characterize UTF-16 / Shift-JIS / etc. as "ways in
>which many of our colleagues get their science done." In no way do
>I dispute that many of our colleagues indeed use these encodings
>routinely, but I am doubtful that editing Unicode text with a text
>editor constitutes a significant part of many of their research
>programs. At least, few of my English-speaking colleagues edit flat
>Unicode text files with any frequency, if ever they do at all.
>
>I think there is already good software, some of it free (both
>senses), for operating systems at least as old as Windows 9x, that
>supports editing UTF-8 encoded text.
>Most of it also supports a
>multitude of other encodings. We would leave no one out by
>requiring UTF-8, and I do not see that respect for our colleagues
>demands that CIF2 be equally convenient to create and edit with
>every text editor in current use. If that is doubtful, however, and
>respect is our goal, then wouldn't the most respectful thing be to
>*ask* a few of the people about whom we are concerned?
>
>My issue here is different, and at least partly philosophical. The
>CIF format can and should be about the structure and meaning of CIF
>text content. Character encoding is on a different level: it's a
>characteristic of storage and interchange. Comingling these layers
>is inelegant and unnecessary.
>
>Moreover, a CIF2 requirement to encode in UTF-8 will be small
>comfort when presented with a file that is not, in fact, encoded
>that way. What can you then do? Either reject the file or
>autodetect the encoding. If CIF2 does not specify a particular
>encoding, and you receive the same file, then what can you do?
>Exactly the same things, but then it's more likely that the file's
>provider will have also specified the encoding by some means.
>(Particularly so if the CIF2 spec calls attention to the need to do
>so.)
>
>Perhaps something like this would be an acceptable compromise:
>a) Rewrite change 2 to remove the requirement for UTF-8
>b) Add:
>====
>CHANGE 9 - NEW (CIF Interchange Format)
>
>Many alternative encodings are available for recording and
>exchanging Unicode character data via byte-oriented media. The CIF
>format itself is encoding independent, but that allows for
>uncertainty as to how to handle putative CIF data unaccompanied by
>encoding information. We therefore define a simple, binary CIF
>Interchange Format, consisting of CIF2 text encoded in UTF-8, with
>an optional initial UTF-8 byte-order mark. CIF Interchange Format
>is intended as a storage and interchange standard for CIF2.
>Its use
>is strongly encouraged, but its existence should not be taken as a
>prohibition against use of alternative storage and interchange
>formats among agreeing parties.
>
>The standard file name extension for CIF Interchange Format files is .cif.
>====
>
>
>Regards,
>
>John
>--
>John C. Bollinger, Ph.D.
>Department of Structural Biology
>St. Jude Children's Research Hospital
>
>
>Email Disclaimer: www.stjude.org/emaildisclaimer
>
>_______________________________________________
>ddlm-group mailing list
>ddlm-group@iucr.org
>http://scripts.iucr.org/mailman/listinfo/ddlm-group

--
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================
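
For concreteness, the reading policy in the proposed text (UTF-8 input, an optional initial byte-order mark as in John's CHANGE 9, any of CR-LF, CR or LF mapped to LF, and a limit of 2048 code points per line) might be sketched in Python roughly as follows. This is only an illustration under those assumptions: the function names and the constant are hypothetical and are not taken from any CIF2 draft or existing CIF library.

```python
import codecs
import re

# Limit taken from the proposed text; the constant name is hypothetical.
MAX_LINE_CODE_POINTS = 2048

# CR-LF is listed before a lone CR so that the pair maps to a single LF.
_LINE_TERMINATOR = re.compile(r"\r\n|\r")


def decode_cif2_bytes(raw: bytes) -> str:
    """Decode UTF-8 bytes, dropping an optional leading byte-order mark."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    # Raises UnicodeDecodeError if the input is not valid UTF-8.
    return raw.decode("utf-8")


def normalize_line_terminators(text: str) -> str:
    """Map CR-LF and lone CR to LF; existing LF is left unchanged."""
    return _LINE_TERMINATOR.sub("\n", text)


def overlong_lines(text: str, limit: int = MAX_LINE_CODE_POINTS) -> list:
    """Return 1-based numbers of lines longer than `limit` code points.

    Expects text whose line terminators are already LF. Python's len()
    on a str counts code points, not UTF-8 bytes.
    """
    return [
        number
        for number, line in enumerate(text.split("\n"), start=1)
        if len(line) > limit
    ]
```

With these helpers, the bytes of a file containing mixed CR-LF, CR and LF terminators would be passed through `decode_cif2_bytes` and then `normalize_line_terminators` before any further parsing. The length check counts code points, so a line of 2048 two-byte characters passes even though its UTF-8 encoding occupies 4096 bytes.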