Discussion List Archives

Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .

OK, I think I'm starting to understand - by specifying CIF as 'text', we are
obliged to accept any 'text' encoding and do the best we can with it (which is basically
what I've been thinking from a practical point of view).

I'm happy enough to work with this (or anything that keeps me in gainful employment :-),
but I would suggest that if this is the route that CIF2 takes, the specification will need to be
a bit more explicit.

I still have reservations about having to employ heuristic encoding determination for a
'CIF standard', but in the end, it's all 'do-able'.
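
For readers unfamiliar with what "heuristic encoding determination" involves, the following is a minimal sketch, not anything proposed in this thread. It assumes a simple policy: trust a byte-order mark if one is present, otherwise try strict UTF-8, otherwise fall back to a caller-supplied list of legacy encodings. The function name guess_decode and the cp1252 fallback are illustrative assumptions only.

```python
import codecs

# Byte-order marks, checked longest-first so that a UTF-32 BOM is not
# mistaken for the UTF-16 BOM it begins with.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_decode(raw: bytes, fallbacks=("cp1252",)):
    """Return (text, encoding_name) using a simple, fallible heuristic.

    The fallback list is an assumption for illustration; a real system
    would choose it from knowledge of where its files come from.
    """
    # 1. A byte-order mark is the strongest hint available.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return raw.decode(name), name
    # 2. Strict UTF-8 rarely succeeds by accident on non-UTF-8 input.
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # 3. Try each legacy encoding in turn.
    for name in fallbacks:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    # 4. latin-1 maps every byte to a code point, so it never fails,
    #    but the result may well be wrong.
    return raw.decode("latin-1"), "latin-1"
```

Step 4 illustrates the reservation expressed above: a heuristic can always return something, but it cannot guarantee that what it returns is what the author intended.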



From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Tuesday, 22 June, 2010 21:49:12
Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .

Dear Simon,

  No, the processor is compliant, but it is unable to process everything.
To use that rather limited processing program, you first have to pass
files in other encodings through a filter to make it happy, just as,
if you now have an EBCDIC CIF1 program, which is happily compliant
on some older IBM system, and sftp an ASCII CIF to it, then, even
though the EBCDIC CIF1 program and the ASCII CIF are both
CIF compliant, before the file can be processed it has to go through
an ASCII to EBCDIC conversion program.
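
A filter of the kind described here can be very small. The sketch below is illustrative only: the function name transcode is hypothetical, and cp037 (one common EBCDIC code page) is an assumption, since the message does not say which EBCDIC variant is meant. It decodes the incoming bytes to Unicode code points and re-encodes them for the target system.

```python
def transcode(raw: bytes, source_encoding: str, target_encoding: str) -> bytes:
    """Convert bytes from one text encoding to another via Unicode.

    Raises UnicodeDecodeError or UnicodeEncodeError if the data cannot
    be represented, rather than silently substituting characters.
    """
    text = raw.decode(source_encoding)    # bytes -> Unicode code points
    return text.encode(target_encoding)   # code points -> target bytes


# Assumed example: an ASCII CIF fragment sent to an EBCDIC (cp037) system.
ascii_cif = b"data_example"
ebcdic_cif = transcode(ascii_cif, "ascii", "cp037")
```

The same function run in the other direction (source "cp037", target "utf-8") would serve a UTF-8-only processor that is handed an EBCDIC file.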

  This sort of limitation is true of many programs and many data
standards.  The value to the community in having text-based standards
has, in general, outweighed the nuisance involved.

  There has been far more trouble with binary standards.  Over the years
I have spent many happy hours cracking old binary files that were
written according to supposedly stable binary data file standards that,
after a decade or so, nobody could read anymore.

  I am sure Unicode will survive for many decades.  I am not sure any
particular encoding of Unicode will survive long term.  UTF-8 is good and
useful, but it is not the last word.


Herbert J. Bernstein, Professor of Computer Science
  Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769


On Tue, 22 Jun 2010, SIMON WESTRIP wrote:

> 1) So if a compliant CIF2 processing system can reject any non-UTF-8 CIF,
> all non-UTF-8 CIFs are non-compliant?
> 2) So why not just state that only Unicode encodings are acceptable?
> Cheers
> Simon
> PS I totally accept the point you're making about how we are often oblivious to the
> underlying encoding used by our software,
> but it also demonstrates what can happen if you do not know what the encoding is :-)
> _____________________________________________________________________________________________
> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
> Sent: Tuesday, 22 June, 2010 20:57:20
> Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .
> I am pleased to report that your UTF-8; quoted-printable encoded email is
> about as close to unreadable as almost anything I have seen in recent
> years.  Well done!!!!
> Let me take one question at a time.
> 1.  Does this mean that a fully compliant CIF2 processing system need
> *not* accept text encoded in anything else?
>   Yes, in deference to those who wanted _just_ UTF8, I am proposing that
> we accept as compliant a CIF2 processing system that is unable to
> process anything else.
> 2.  What exactly do you mean by it is important to clearly specify
> the intended mapping to UTF-8?
> I mean "it is important to clearly specify the intended mapping to UTF-8".
> In other words, if you are working with CIFs and you are working with
> a text system for which it is not clear how to map the characters
> with which you are working to valid Unicode code points, and you would
> like anybody other than yourself to ever be able to work with that
> CIF, then you have a responsibility for resolving the issue of that
> mapping.  Once you have made it to Unicode code points, the rest of the
> journey to UTF-8 is well specified.
> And thank you for demonstrating the normally invisible encodings with
> which we all have to work.
> Regards,
>   Herbert
> =====================================================
>   Herbert J. Bernstein, Professor of Computer Science
>     Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>                   +1-631-244-3035
>                   yaya@dowling.edu
> =====================================================
> On Tue, 22 Jun 2010, SIMON WESTRIP wrote:
> > Dear Herbert
> >
> > I have to confess to not entirely understanding your proposed description.
> >
> > Two questions:
> >
> > 1) "all fully compliant CIF2 processing systems should, at a minimum be able to process
> > text files as unicode code points represented in UTF-8"
> >
> > Does this mean that a fully compliant CIF2 processing system need *not* accept text encoded in anything else?
> >
> > More importantly:
> >
> > 2) What exactly do you mean by
> >
> >  "it is important to clearly specify the intended mapping to UTF-8" ?
> >
> > Thanks
> >
> > Simon
> >
> > ________________________________
> > From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
> > To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
> > Sent: Tuesday, 22 June, 2010 18:40:47
> > Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .
> >
> > Dear Colleagues,
> >
> >   Except when I find the time to work with hardware, much of the science
> > I do ends up involving a great deal of editing of documents -- and it
> > is a royal waste of time to tell somebody to learn new editing habits
> > without a very good reason, so it is very much the case that such
> > mundane issues as encodings and keyboard layouts are a large factor
> > in how science gets done by many people.
> >
> >   Most people don't even realize how many different text encodings
> > they use and how different the text encodings used by their colleagues
> > may be.  In going from system to system, e.g. by email, the translations
> > among encodings are close to invisible.
> >
> >   Instead of focusing on the change document, could we please focus
> > on what the CIF2 specification as a complete, coherent document should
> > say.  Taking into account what has been said thus far, here is a
> > slightly revised version of what I proposed:
> >
> > =====================================================================
> >
> > CIF2 is a specification for the interchange of text files.  Text files
> > have many possible system dependent representations and encodings.  To
> > ensure clarity in the specification of CIF2, this document is written
> > in terms of a sequence of unicode code points, and all fully compliant
> > CIF2 processing systems should, at a minimum be able to process
> > text files as unicode code points represented in UTF-8, subject to the
> > XML-based restrictions below.  This approach is not meant to prevent
> > people from preparing valid CIF2 files with non-UTF-8-based text
> > editors, but, if a non-UTF-8 file format is produced, it is important
> > to clearly specify the intended mapping to UTF-8.  Almost all modern
> > systems have available a standard mapping from their internal text
> > representation to and from UTF-8.
> >
> > Special care is needed in dealing with end-of-line indicators (see
> > http://en.wikipedia.org/wiki/Newline).  This document will only
> > refer to LF (line feed or newline) as the line terminator.  When handling
> > CIF2 files produced under MS windows, CR-LF sequences should be accepted as
> > an alternative to LF, and when handling CIF2 files produced under
> > Mac OS, CR should be accepted as an alternative to LF.  The safest policy
> > is to accept any of CR-LF or CR or LF as line terminators if possible,
> > and to map all of them to LF on reading a CIF.  Systems with other,
> > additional line terminators should avoid introducing them into CIF2
> > files meant for interchange.
> >
> > To ensure compatibility with older Fortran text processing software,
> > lines in CIF2 files should be restricted to no more than 2048
> > code points in length, not including the line terminator itself.
> > Note that the UTF-8 encoding of such a line may well be much longer.
> >
> > =====================================================================
> >
> > At 4:13 PM +0000 6/22/10, SIMON WESTRIP wrote:
> > >Perhaps John's compromise might be the way forward?
> > >
> > >
> > >From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
> > >To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
> > >Sent: Tuesday, 22 June, 2010 16:15:36
> > >Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .
> > >
> > >
> > >I prefer leaving the issue of character encoding entirely out of the
> > >scope of the CIF format specification (effectively allowing any
> > >encoding).  On the other hand, I think it's a bit of an
> > >aggrandizement to characterize UTF-16 / Shift-JIS / etc. as "ways in
> > >which many of our colleagues get their science done."  In no way do
> > >I dispute that many of our colleagues indeed use these encodings
> > >routinely, but I am doubtful that editing Unicode text with a text
> > >editor constitutes a significant part of many of their research
> > >programs.  At least, few of my English-speaking colleagues edit flat
> > >Unicode text files with any frequency, if ever they do at all.
> > >
> > >I think there is already good software, some of it free (both
> > >senses), for operating systems at least as old as Windows 9x, that
> > >supports editing UTF-8 encoded text.  Most of it also supports a
> > >multitude of other encodings.  We would leave no one out by
> > >requiring UTF-8, and I do not see that respect for our colleagues
> > >demands that CIF2 be equally convenient to create and edit with
> > >every text editor in current use.  If that is doubtful, however, and
> > >respect is our goal, then wouldn't the most respectful thing be to
> > >*ask* a few of the people about whom we are concerned?
> > >
> > >My issue here is different, and at least partly philosophical.  The
> > >CIF format can and should be about the structure and meaning of CIF
> > >text content.  Character encoding is on a different level: it's a
> > >characteristic of storage and interchange.  Comingling these layers
> > >is inelegant and unnecessary.
> > >
> > >Moreover, a CIF2 requirement to encode in UTF-8 will be small
> > >comfort when presented with a file that is not, in fact, encoded
> > >that way.  What can you then do?  Either reject the file or
> > >autodetect the encoding.  If CIF2 does not specify a particular
> > >encoding, and you receive the same file, then what can you do?
> > >Exactly the same things, but then it's more likely that the file's
> > >provider will have also specified the encoding by some means.
> > >(Particularly so if the CIF2 spec calls attention to the need to do
> > >so.)
> > >
> > >Perhaps something like this would be an acceptable compromise:
> > >a) Rewrite change 2 to remove the requirement for UTF-8
> > >b) Add:
> > >====
> > >CHANGE 9 - NEW (CIF Interchange Format)
> > >
> > >Many alternative encodings are available for recording and
> > >exchanging Unicode character data via byte-oriented media.  The CIF
> > >format itself is encoding independent, but that allow
> >
> >
> > [*** Terminated Message ***]
> >
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
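
The end-of-line and line-length rules in the draft specification text quoted above (accept CR-LF, CR or LF and map all of them to LF; no more than 2048 code points per line, excluding the terminator) can be sketched as follows. This is an illustration under stated assumptions, not part of the proposal: the names normalize_eol and overlong_lines are hypothetical, and the input is assumed to have been decoded to a Python str already, so that len() counts code points rather than bytes.

```python
import re

# Limit taken from the quoted draft: code points per line, excluding
# the line terminator itself.
MAX_LINE_CODE_POINTS = 2048

# CR-LF must be tried before a lone CR so it becomes one LF, not two.
_EOL = re.compile(r"\r\n|\r|\n")


def normalize_eol(text: str) -> str:
    """Map CR-LF, CR and LF line terminators to LF."""
    return _EOL.sub("\n", text)


def overlong_lines(text: str, limit: int = MAX_LINE_CODE_POINTS):
    """Return the 1-based numbers of lines longer than `limit` code points."""
    lines = normalize_eol(text).split("\n")
    return [number for number, line in enumerate(lines, start=1)
            if len(line) > limit]
```

Because the limit is counted in code points, a line of 2048 non-ASCII characters passes the check even though its UTF-8 encoding occupies more than 2048 bytes, which is the point made in the last sentence of the draft.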
International Union of Crystallography
