
Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .

1) So if a compliant CIF2 processing system can reject any non-UTF-8 CIF,
all non-UTF-8 CIFs are non-compliant?

2) So why not just state that only Unicode encodings are acceptable?



PS I totally accept the point you're making about how we are often oblivious to the underlying encoding used by our software,
but it also demonstrates what can happen if you do not know what the encoding is :-)

From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Tuesday, 22 June, 2010 20:57:20
Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .

I am pleased to report that your UTF-8, quoted-printable encoded email is
about as close to unreadable as almost anything I have seen in recent
years.  Well done!!!!

Let me take one question at a time.

1.  Does this mean that a fully compliant CIF2 processing system need
*not* accept text encoded in anything else?

  Yes, in deference to those who wanted _just_ UTF-8, I am proposing that
we accept as compliant a CIF2 processing system that is unable to
process anything else.
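
A minimal sketch of what such a UTF-8-only reader might look like, assuming the CIF arrives as raw bytes. The function name `read_cif2_utf8_only` is hypothetical and is not part of any CIF2 specification or library; it only illustrates "accept UTF-8, reject everything else".

```python
def read_cif2_utf8_only(data: bytes) -> str:
    """Decode CIF2 bytes as UTF-8 and reject anything that is not valid UTF-8.

    Hypothetical helper for illustration only.
    """
    try:
        # errors="strict" makes any malformed byte sequence raise an exception.
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(
            f"not valid UTF-8 at byte offset {exc.start}"
        ) from exc
```

A file saved in some other encoding will usually, though not always, fail this check; a byte stream that happens to be valid UTF-8 is accepted whatever the author intended.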

2.  What exactly do you mean by "it is important to clearly specify
the intended mapping to UTF-8"?

I mean "it is important to clearly specify the intended mapping to UTF-8".
In other words, if you are working with CIFs and you are working with
a text system for which it is not clear how to map the characters
with which you are working to valid Unicode code points, and you would
like anybody other than yourself to ever be able to work with that
CIF, then you have a responsibility for resolving the issue of that
mapping.  Once you have made it to Unicode code points, the rest of the
journey to UTF-8 is well specified.
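
As an illustration of the two steps, here is a small Python sketch. It assumes the source encoding is known and named explicitly; the helper `to_utf8` and the choice of Latin-1 and Windows-1251 as example encodings are assumptions made for the example, not anything proposed in this thread.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Map bytes in a named encoding to Unicode code points, then to UTF-8.

    Hypothetical helper for illustration only.
    """
    text = raw.decode(source_encoding)  # step 1: bytes -> Unicode code points
    return text.encode("utf-8")         # step 2: code points -> UTF-8 (well specified)


# The same byte maps to different code points under different encodings,
# which is why the intended mapping has to be stated by whoever made the file.
ambiguous = b"\xe9"
as_latin1 = ambiguous.decode("latin-1")  # U+00E9, LATIN SMALL LETTER E WITH ACUTE
as_cp1251 = ambiguous.decode("cp1251")   # U+0439, CYRILLIC SMALL LETTER SHORT I
```

Only the first step needs outside knowledge; once the code points are fixed, the UTF-8 bytes follow mechanically.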

And thank you for demonstrating the normally invisible encodings with
which we all have to work.
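
For readers who want to see what the quoted-printable layer does, a short sketch using Python's standard `quopri` module follows. The sample bytes are made up for the example, in the style of the message quoted below: `=0A` is an encoded line feed, `=3D` is an encoded equals sign, and a trailing `=` is a soft line break that joins two lines.

```python
import quopri

# Made-up sample: a soft line break splits the word "processing".
encoded = b"Two questions:=0A=0A1) pro=\ncessing =3D UTF-8"

# quopri undoes the transfer encoding and yields bytes; the charset
# (UTF-8 here, as declared in the message headers) is a separate layer.
decoded = quopri.decodestring(encoded).decode("utf-8")
```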


  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769


On Tue, 22 Jun 2010, SIMON WESTRIP wrote:

> Dear Herbert
>
> I have to confess to not entirely understanding your proposed description.
>
> Two questions:
>
> 1) "all fully compliant CIF2 processing systems should, at a minimum be able to process
> text files as unicode code points represented in UTF-8"
>
> Does this mean that a fully compliant CIF2 processing system need *not* accept text encoded in anything else?
>
> More importantly:
>
> 2) What exactly do you mean by
>
>  "it is important to clearly specify the intended mapping to UTF-8" ?
>
>
> Thanks
>
> Simon
>
> ________________________________
> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
> Sent: Tuesday, 22 June, 2010 18:40:47
> Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .
>
> Dear Colleagues,
>
>   Except when I find the time to work with hardware, much of the science
> I do ends up involving a great deal of editing of documents -- and it
> is a royal waste of time to tell somebody to learn new editing habits
> without a very good reason, so it is very much the case that such
> mundane issues as encodings and keyboard layouts are a large factor
> in how science gets done by many people.
>
>   Most people don't even realize how many different text encodings
> they use and how different the text encodings used by their colleagues
> may be.  In going from system to system, e.g. by email, the translations
> among encodings are close to invisible.
>
>   Instead of focusing on the change document, could we please focus
> on what the CIF2 specification as a complete, coherent document should
> say.  Taking into account what has been said thus far, here is a
> slightly revised version of what I proposed:
>
> ======================================================================
>
> CIF2 is a specification for the interchange of text files.  Text files
> have many possible system dependent representations and encodings.  To
> ensure clarity in the specification of CIF2, this document is written
> in terms of a sequence of unicode code points, and all fully compliant
> CIF2 processing systems should, at a minimum be able to process
> text files as unicode code points represented in UTF-8, subject to the
> XML-based restrictions below.  This approach is not meant to prevent
> people from preparing valid CIF2 files with non-UTF-8-based text
> editors, but, if a non-UTF-8 file format is produced, it is important
> to clearly specify the intended mapping to UTF-8.  Almost all modern
> systems have available a standard mapping from their internal text
> representation to and from UTF-8.
>
> Special care is needed in dealing with end-of-line indicators (see
> http://en.wikipedia.org/wiki/Newline).  This document will only
> refer to LF (line feed or newline) as the line terminator.  When handling
> CIF2 files produced under MS Windows, CR-LF sequences should be accepted as
> an alternative to LF, and when handling CIF2 files produced under
> Mac OS, CR should be accepted as an alternative to LF.  The safest policy
> is to accept any of CR-LF or CR or LF as line terminators if possible,
> and to map all of them to LF on reading a CIF.  Systems with other,
> additional line terminators should avoid introducing them into CIF2
> files meant for interchange.
>
> To ensure compatibility with older Fortran text processing software,
> lines in CIF2 files should be restricted to no more than 2048
> code points in length, not including the line terminator itself.
> Note that the UTF-8 encoding of such a line may well be much longer.
>
> ======================================================================
>
>
> At 4:13 PM +0000 6/22/10, SIMON WESTRIP wrote:
> > Perhaps John's compromise might be the way forward?
> >
> >
> > From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
> > To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
> > Sent: Tuesday, 22 June, 2010 16:15:36
> > Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .
> >
> >
> > I prefer leaving the issue of character encoding entirely out of the
> > scope of the CIF format specification (effectively allowing any
> > encoding).  On the other hand, I think it's a bit of an
> > aggrandizement to characterize UTF-16 / Shift-JIS / etc. as "ways in
> > which many of our colleagues get their science done."  In no way do
> > I dispute that many of our colleagues indeed use these encodings
> > routinely, but I am doubtful that editing Unicode text with a text
> > editor constitutes a significant part of many of their research
> > programs.  At least, few of my English-speaking colleagues edit flat
> > Unicode text files with any frequency, if ever they do at all.
> >
> > I think there is already good software, some of it free (both
> > senses), for operating systems at least as old as Windows 9x, that
> > supports editing UTF-8 encoded text.  Most of it also supports a
> > multitude of other encodings.  We would leave no one out by
> > requiring UTF-8, and I do not see that respect for our colleagues
> > demands that CIF2 be equally convenient to create and edit with
> > every text editor in current use.  If that is doubtful, however, and
> > respect is our goal, then wouldn't the most respectful thing be to
> > *ask* a few of the people about whom we are concerned?
> >
> > My issue here is different, and at least partly philosophical.  The
> > CIF format can and should be about the structure and meaning of CIF
> > text content.  Character encoding is on a different level: it's a
> > characteristic of storage and interchange.  Comingling these layers
> > is inelegant and unnecessary.
> >
> > Moreover, a CIF2 requirement to encode in UTF-8 will be small
> > comfort when presented with a file that is not, in fact, encoded
> > that way.  What can you then do?  Either reject the file or
> > autodetect the encoding.  If CIF2 does not specify a particular
> > encoding, and you receive the same file, then what can you do?
> > Exactly the same things, but then it's more likely that the file's
> > provider will have also specified the encoding by some means.
> > (Particularly so if the CIF2 spec calls attention to the need to do
> > so.)
> >
> > Perhaps something like this would be an acceptable compromise:
> > a) Rewrite change 2 to remove the requirement for UTF-8
> > b) Add:
> > ====
> > CHANGE 9 - NEW (CIF Interchange Format)
> >
> > Many alternative encodings are available for recording and
> > exchanging Unicode character data via byte-oriented media.  The CIF
> > format itself is encoding independent, but that allow
> [*** Terminated Message ***]
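
The line-terminator and line-length rules in the quoted proposal can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `normalize_newlines`, `overlong_lines` and `MAX_CODE_POINTS` are hypothetical, and the text is assumed to have been read without newline translation (for example, opened with `newline=""`) so that CR and CR-LF are still present.

```python
import re

MAX_CODE_POINTS = 2048  # proposed per-line limit, excluding the terminator


def normalize_newlines(text: str) -> str:
    """Map CR-LF and lone CR to LF; existing LF is left alone."""
    # CR-LF is listed first so it becomes one LF rather than two.
    return re.sub(r"\r\n|\r", "\n", text)


def overlong_lines(text: str, limit: int = MAX_CODE_POINTS) -> list[int]:
    """Return 1-based numbers of lines longer than `limit` code points."""
    lines = normalize_newlines(text).split("\n")
    # len() on a Python 3 str counts code points, not encoded bytes.
    return [n for n, line in enumerate(lines, start=1) if len(line) > limit]
```

Because the limit is counted in code points, a line of 2048 accented characters passes the check even though its UTF-8 encoding is 4096 bytes, which is the point of the final remark in the proposal.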
ddlm-group mailing list