Discussion List Archives


Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .

Dear James,

   I would suggest you put your proposed UTF-8 mandate out to the
community at large.  Perhaps I am wrong, and people will be delighted
to put in the money, time and effort to make the transition.  The
best way to find out is to ask them, not _tell_ them, but _ask_ them.

   Certainly, if the IUCr wants to say that for Acta submissions, you
must send them UTF-8, that is up to the IUCr.  What is not up to
the IUCr is to tell an author what they have to use to edit the
file before they send it to the IUCr, nor to tell them that they
cannot call an EBCDIC file a CIF, nor that they cannot post
a UCS-2 CIF to the web on their own server, nor that they cannot
call a code page 1251 Cyrillic document or, more likely, a KOI8-R
code page document a CIF.

   It really is a matter of respect.

   Regards,
     Herbert

P.S.  I am less interested in being absolutely consistent than
in making the necessary compromises to get work done, but if
I were to publish a CD of CIFs, I would choose a very limited
number of text encodings (probably only one, and most likely
UTF-8), but I most certainly would not tell those submitting the
CIFs that they had to edit them in my favorite encoding.  I would
tell them that submitting in UTF-8 would reduce the chance of errors,
but can see no reasons to reject documents that have been reliably
identified as being in one of a reasonable range of encodings.
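
The point about encodings being "reliably identified" can be illustrated with a minimal Python sketch. It assumes only the standard library codecs; the helper name to_utf8 and the sample word are hypothetical and chosen purely for illustration. It shows that the same Cyrillic text has different bytes in KOI8-R and code page 1251, that a correctly labelled file converts cleanly to UTF-8, and that a mislabelled one decodes without any error into different text.

```python
# Illustrative sketch only: `to_utf8` and the sample word are assumptions,
# not part of any CIF specification or of the original message.

def to_utf8(raw: bytes, declared_encoding: str) -> bytes:
    """Decode bytes using the declared encoding and re-encode them as UTF-8."""
    text = raw.decode(declared_encoding)  # raises UnicodeDecodeError on invalid bytes
    return text.encode("utf-8")

word = "Кристалл"  # Russian for "crystal"

# The same text stored in two common Cyrillic encodings yields different bytes.
as_koi8 = word.encode("koi8_r")
as_cp1251 = word.encode("cp1251")

# With the correct label, both convert to the same UTF-8 bytes.
utf8_from_koi8 = to_utf8(as_koi8, "koi8_r")
utf8_from_cp1251 = to_utf8(as_cp1251, "cp1251")

# With the wrong label, decoding still succeeds but yields different text,
# which is why the encoding has to be identified rather than guessed.
wrong = as_koi8.decode("cp1251")
```

Nothing in the mislabelled case signals a problem to the program, so a policy of accepting several encodings depends on the label accompanying the file being trustworthy.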


At 10:38 AM +1000 6/23/10, James Hester wrote:
>Herbert writes:
>
>On Wed, Jun 23, 2010 at 8:48 AM, Herbert J. Bernstein
><yaya@bernstein-plus-sons.com> wrote:
>>  No, you are not obliged to accept any 'text' encoding.  It is perfectly
>>  reasonable for you to insist that the user provide you with some encoding
>>  that you are prepared to read reliably.
>
>No, it is not perfectly reasonable, because you and the "user" may not
>have the opportunity to negotiate encodings.  Think a collection of
>archived CIF files on CD - maybe they all have different encodings,
>and you just have to figure it out for each of the 32,000 files.  Hope your
>Chinese is good enough to tell the difference between those 3 Chinese
>encodings.  And wouldn't insisting on an encoding that you are
>prepared to accept contradict your principle about 'respect' for the
>way other people do things?  No, to be consistent you, Herbert, must
>be prepared to accept absolutely every encoding that comes your way.
>My own point of view is that this step of 'insist that the user
>provide you with some encoding that you are prepared to read reliably'
>is best done right here, when the standard is drafted.  That way,
>there is no need for negotiation of encoding.
>
>Frankly, I am amazed that I am the only one who thinks mandating a
>single encoding is the obvious way forward.  Perhaps it is because I
>worked in Japan for many years, and in addition have had to deal with
>Russian text frequently.  Believe me when I say, I know the problems
>caused by the simultaneous coexistence of multiple encodings. Why
>anybody would want to pass up the opportunity to settle on one is
>beyond me.
>
>>  Thus it would be fine for the IUCr
>>  to decline to handle, say, a 7-track CDC display code mag tape or a paper
>>  tape in 5-level Murray Code, but there is an amazing range of text
>>  encodings that can now be handled easily under Linux, Windows and Mac OS X.
>>   For example, on my Mac the utility I use, Cyclone, is willing
>>  to convert the following encodings:
>>
>>   Unicode
>>     Unicode
>>     Unicode 2.1
>>     Unicode 3.0
>>     Unicode 3.2
>>     Unicode 4.0
>>     Unicode 5.0
>>       With variants for UTF-16,
>>                         UTF-16 Canonical Decomposition,
>>                         UTF-16 Canonical Composition,
>>                         UTF-16 HFS+ Decomposition,
>>                         UTF-16 HFS+ Composition,
>>                         UTF-7,
>>                         UTF-7 Canonical Decomposition,
>>                         UTF-7 Canonical Composition,
>>                         UTF-7 HFS+ Decomposition,
>>                         UTF-7 HFS+ Composition,
>>                         UTF-8,
>>                         UTF-8 Canonical Decomposition,
>>                         UTF-8 Canonical Composition,
>>                         UTF-8 HFS+ Decomposition,
>>                         UTF-8 HFS+ Composition,
>>                         UTF-7,
>>                         UTF-7 Canonical Decomposition,
>>                         UTF-7 Canonical Composition,
>>                         UTF-7 HFS+ Decomposition,
>>                         UTF-7 HFS+ Composition,
>>     Windows
>>       Arabic (CP1256)
>>       Baltic (CP1257)
>>       Central European - Latin-2 (CP1250)
>>       Chinese Simplified (CP 936)
>>       Chinese Traditional (CP 950)
>>       Cyrillic (CP1251)
>>       Greek (CP1253)
>>       Hebrew (CP1255)
>>       Japanese (CP932)
>>       Korean (CP949)
>>       Thai (CP874)
>>       Turkish - Latin-5 (CP1254)
>>       Vietnamese (CP1258)
>>       Western - Latin-1 (CP1252)
>>    6 different Chinese Simplified encodings
>>    3 Different Chinese Traditional encodings
>>    etc., etc., etc.
>>
>>  Under unix (or Mac OSX or MINGW) you can use the iconv utility
>>
>>  and the wikipedia will show you many more.
>>
>>  Bottom line -- it makes sense to impose reasonable limits on what
>>  encodings to handle, and to ask people to be very clear about
>>  what encoding they used, but reasonable limits should cover a very wide
>>  range of encodings these days.
>>
>>  Regards,
>>     Herbert
>>
>>
>>  =====================================================
>>   Herbert J. Bernstein, Professor of Computer Science
>>    Dowling College, Kramer Science Center, KSC 121
>>         Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                  +1-631-244-3035
>>                  yaya@dowling.edu
>>  =====================================================
>>
>>  On Tue, 22 Jun 2010, SIMON WESTRIP wrote:
>>
>>>  OK, I think I'm starting to understand - by specifying CIF as 'text', we
>>>  are
>>>  obliged to accept any 'text' encoding and do the best with it as we can
>>>  (which is basically
>>>  what I've been thinking from a practical point of view).
>>>
>>>  I'm happy enough to work with this (or anything that keeps me in gainful
>>>  employment :-),
>>>  but I would suggest that if this is the route that CIF2 takes, the
>>>  specification will need
>>>  to be
>>>  a bit more explicit.
>>>
>>>  I still have reservations about having to employ heuristic encoding
>>>  determination for a
>>>  'CIF standard', but in the end, it's all 'do-able'.
>>>
>>>  Cheers
>>>
>>>  Simon
>>>
>>>
>>> 
>>>___________________________________________________________________________________________
>>>  From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>>>  To: Group finalising DDLm and associated dictionaries
>>>  <ddlm-group@iucr.org>
>>>  Sent: Tuesday, 22 June, 2010 21:49:12
>>>  Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .
>>>
>>>  Dear Simon,
>>>
>>>    No, the processor is compliant, but it is unable to process everything.
>>>  To use that rather limited processing program, you first have to pass
>>>  files in other encodings through a filter to make it happy, just as,
>>>  if you now have an EBCDIC CIF1 program, which is happily compliant
>>>  on some older IBM system, and sftp an ASCII CIF to it, then, even
>>>  though the EBCDIC CIF1 program and the ASCII CIF are both
>>>  CIF compliant, before the file can be processed it has to go through
>>>  an ASCII to EBCDIC conversion program.
>>>
>>>    This sort of limitation is true of many programs and many data
>>>  standards.  The value to the community in having text-based standards
>>>  has, in general, outweighed the nuisance involved.
>>>
>>>    There has been far more trouble with binary standards.  Over the years
>>>  I have spent many happy hours cracking old binary files that were
>>>  written according to supposedly stable binary data file standards that,
>>>  after a decade or so, nobody could read anymore.
>>>
>>>    I am sure Unicode will survive for many decades.  I am not sure any
>>>  particular encoding of Unicode will survive long term.  UTF8 is good and
>>>  useful, but it is not the last word.
>>>
>>>    Regards,
>>>      Herbert
>>>
>>>  =====================================================
>>>  Herbert J. Bernstein, Professor of Computer Science
>>>    Dowling College, Kramer Science Center, KSC 121
>>>          Idle Hour Blvd, Oakdale, NY, 11769
>>>
>>>                  +1-631-244-3035
>>>                  yaya@dowling.edu
>>>  =====================================================
>>>
>>>  On Tue, 22 Jun 2010, SIMON WESTRIP wrote:
>>>
>>>  > 1) So if a compliant CIF2 processing system can reject any non-UTF-8
>>>  > CIF,
>>>  > all non-UTF-8 CIFs are non-compliant?
>>>  >
>>>  > 2) So why not just state that only Unicode encodings are acceptable?
>>>  >
>>>  > Cheers
>>>  >
>>>  > Simon
>>>  >
>>>  > PS I totally accept the point you're making about how we are often
>>>  > oblivious to the
>>>  > underlying encoding used by our software,
>>>  > but it also demonstrates what can happen if you do not know  what the
>>>  > encoding is :-)
>>>  >
>>>  >
>>>  >
>>>
>>>  > _____________________________________________________________________________________________
>>>  > From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>>>  > To: Group finalising DDLm and associated dictionaries
>>>  > <ddlm-group@iucr.org>
>>>  > Sent: Tuesday, 22 June, 2010 20:57:20
>>>  > Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .
>>>  >
>>>  > I am pleased to report that your UTF-8; quoted-printable encoded email is
>>>  > about as close to unreadable as almost anything I have seen in recent
>>>  > years.  Well done!!!!
>>>  >
>>>  > Let me take one question at a time.
>>>  >
>>>  > 1.  Does this mean that a fully compliant CIF2 processing system need
>>>  > *not* accept text encoded in anything else?
>>>  >
>>>  >   Yes, in deference to those who wanted _just_ UTF8, I am proposing that
>>>  > we accept as compliant a CIF2 processing system that is unable to
>>>  > process anything else.
>>>  >
>>>  > 2.  What exactly do you mean by it is important to clearly specify
>>>  > the intended mapping to UTF-8?
>>>  >
>>>  > I mean "it is important to clearly specify the intended mapping to
>>>  > UTF-8".
>>>  > In other words, if you are working with CIFs and you are working with
>>>  > a text system for which it is not clear how to map the characters
>>>  > with which you are working to valid Unicode code points, and you would
>>>  > like anybody other than yourself to ever be able to work with that
>>>  > CIF, then you have a responsibility for resolving the issue of that
>>>  > mapping.  Once you have made it to Unicode code points, the rest of the
>>>  > journey to UTF-8 is well specified.
>>>  >
>>>  > And thank you for demonstrating the normally invisible encodings with
>>>  > which we all have to work.
>>>  >
>>>  > Regards,
>>>  >   Herbert
>>>  >
>>>  >
>>>  >
>>>  >
>>>  > =====================================================
>>>  >   Herbert J. Bernstein, Professor of Computer Science
>>>  >     Dowling College, Kramer Science Center, KSC 121
>>>  >         Idle Hour Blvd, Oakdale, NY, 11769
>>>  >
>>>  >                   +1-631-244-3035
>>>  >                   yaya@dowling.edu
>>>  > =====================================================
>>>  >
>>>  > On Tue, 22 Jun 2010, SIMON WESTRIP wrote:
>>>  >
>>>  > >> --===============1314147413==
>>>  > > Content-Type: multipart/alternative;
>>>  > > boundary="0-856724611-1277235714=:44070"
>>>  > >
>>>  > >> --0-856724611-1277235714=:44070
>>>  > > Content-Type: text/plain; charset=utf-8
>>>  > > Content-Transfer-Encoding: quoted-printable
>>>  > >
>>>  > > Dear Herbert=0A=0AI have to confess to not entirely understanding your
>>>  > > prop=
>>>  > > osed description.=0A=0ATwo questions:=0A=0A1) "all fully compliant
>>>  > > CIF2 pro=
>>>  > > cessing systems should, at a minimum be able to process=0Atext files
>>>  > > as uni=
>>>  > > code code points represented in UTF-8"=0A=0ADoes this mean that a
>>>  > > fully com=
>>>  > > pliant CIF2 processing system need *not* accept text encoded in
>>>  > > anything el=
>>>  > > se?=0A=0AMore importantly:=0A=0A2) What exactly do you mean by=0A=0A
>>>  > > "it is=
>>>  > > important to clearly specify the intended mapping to UTF-8"
>  >> > > ?=0A=0A=0AThan=
>>>  > >
>>>  > > 
>>>ks=0A=0ASimon=0A=0A=0A=0A=0A=0A=0A=0A=0A________________________________=0A=
>>>  > > From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>=0ATo: Group
>>>  > > final=
>>>  > > ising DDLm and associated dictionaries <ddlm-group@iucr.org>=0ASent:
>>>  > > Tuesda=
>>>  > > y, 22 June, 2010 18:40:47=0ASubject: Re: [ddlm-group] options/text vs
>>>  > > binar=
>>>  > > y/end-of-line. .. ..    .=0A=0ADear Colleagues,=0A=0A  Except when I
>  >> > > find =
>>>  > > the time to work with hardware, much of the science=0AI do ends up
>>>  > > involvin=
>>>  > > g a great deal of editing of documents -- and it=0Ais a royal waste of
>>>  > > time=
>>>  > > to tell somebody to learn new editing habits=0Awithout a very good
>>>  > > reason,=
>>>  > > so it is very much the case the such=0Amundane issues as encodings and
>>>  > > key=
>>>  > > board layouts are a large factor=0Ain how science gets done by many
>>>  > > people.=
>>>  > > =0A=0A  Most people don't even realize how many different text
>>>  > > encodings=
>>>  > > =0Athey use and how different the text encodings used by their
>>>  > > colleagues=
>>>  > > =0Amay be.  In going from system to system, e.g. by email, the
>>>  > > translations=
>>>  > > =0Aamong encodings are close to invisible.=0A=0A  Instead of focusing
>>>  > > on t=
>>>  > > he change document, could we please focus=0Aon what the CIF2
>>>  > > specification =
>>>  > > as a complete, coherent document should=0Asay.  Taking into account
>>>  > > what ha=
>>>  > > s been said thus far, here is a =0Aslightly revised version of what I
>>>  > > propo=
>>>  > >
>>>  > > 
>>>sed:=0A=0A=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
>>>  > >
>>>  > > 
>>>=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
>>>  > >
>>>  > > 
>>>=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=0A=0A=
>>>  > > CIF2 is a specification for the interchange of text files.  Text
>>>  > > files=0Aha=
>>>  > > ve many possible system dependent representations and encodings.
>>>  > > To=0Aensu=
>>>  > > re clarity in the specification of CIF2, this document is written=0Ain
>>>  > > term=
>>>  > > s of a sequence of unicode code points, and all fully compliant=0ACIF2
>>>  > > proc=
>>>  > > essing systems should, at a minimum be able to process=0Atext files as
>>>  > > unic=
>>>  > > ode code points represented in UTF-8, subject to the=0AXML-based
>>>  > > restrictio=
>>>  > > ns below.  This approach is not meant to prevent=0Apeople from
>>>  > > preparing va=
>>>  > > lid CIF2 files with non-UTF-8-based text=0Aeditors, but, if a
>>>  > > non-UTF-8 fil=
>>>  > > e format is produced, it is important=0Ato clearly specify the
>>>  > > intended map=
>>>  > > ping to UTF-8.  Almost all modern=0Asystems have available a standard
>>>  > > mappi=
>>>  > > ng from their internal text=0Arepresentation to and from
>>>  > > UTF-8.=0A=0ASpecia=
>>>  > > l care is needed in dealing with end-of-line indicators
>>>  > > (see=0Ahttp://en.wi=
>>>  > > kipedia.org/wiki/Newline).  This document will only=0Arefer to LF
>>>  > > (line fe=
>>>  > > ed or newline) as the line terminator.  When handling=0ACIF2 files
>>>  > > produced=
>>>  > > under MS windows, CR-LF sequences should be accepted as=0Aan
>>>  > > alternative t=
>>>  > > o LF, and when handling CIF2 files produced under=0AMac OS, CR should
>>>  > > be ac=
>>>  > > cepted as an alternative to LF.  The safest policy=0Ais to accept any
>>>  > > of CR=
>>>  > > -LF or CR or LF and line terminators if possible,=0Aand to map all of
>>>  > > them =
>>>  > > to LF on reading a CIF.  Systems with other,=0Aadditional line
>>>  > > terminators =
>>>  > > should avoid introducing them into CIF2=0Afiles meant for
>>>  > > interchange.=0A=
>>>  > > =0ATo ensure compatibility with older Fortran text processing
>>>  > > software,=0Al=
>>>  > > ines in CIF2 files should be restricted to no more than 2048=0Acode
>>>  > > points =
>>>  > > in length, not including the line terminator itself.=0ANot that the
>>>  > > UTF-8 e=
>>>  > > ncoding of such a line may well be much longer."=0A
>>>  > > =3D=3D=3D=3D=3D=3D=3D=
>>>  > >
>>>  > > 
>>>=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
>  >> > >
>>>  > > 
>>>=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
>>>  > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=0A=0A=0A=0AAt 4:13 PM +0000 6/22/10,
>>>  > > SIMON W=
>>>  > > ESTRIP wrote:=0A>Perhaps John's compromise might be the way
>>>  > > forward?=0A>=0A=
>>>  > >>> =0A>=0A>=0A>From: "Bollinger, John C"
>>>  > >>> <John.Bollinger@STJUDE.ORG>=0A>To: G=
>>>  > > roup finalising DDLm and associated dictionaries
>>>  > > <ddlm-group@iucr.org>=0A>S=
>>>  > > ent: Tuesday, 22 June, 2010 16:15:36=0A>Subject: Re: [ddlm-group]
>  >> > > options/t=
>>>  > > ext vs binary/end-of-line. .. .. .=0A>=0A>=0A>I prefer leaving the
>>>  > > issue of=
>>>  > > character encoding entirely out of the =0A>scope of the CIF format
>>>  > > specifi=
>>>  > > cation (effectively allowing any =0A>encoding).  On the other hand, I
>>>  > > think=
>>>  > > it's a bit of an =0A>aggrandizement to characterize UTF-16 / Shift-JIS
>>>  > > / e=
>>>  > > tc. as "ways in =0A>which many of our colleagues get their science
>>>  > > done."  =
>>>  > > In no way do =0A>I dispute that many of our colleagues indeed use
>>>  > > these enc=
>>>  > > odings =0A>routinely, but I am doubtful that editing Unicode text with
>>>  > > a te=
>>>  > > xt =0A>editor constitutes a significant part of many of their research
>>>  > > =0A>=
>>>  > > programs.  At least, few of my English-speaking colleagues edit flat
>>>  > > =0A>Un=
>>>  > > icode text files with any frequency, if ever they do at all.=0A>=0A>I
>>>  > > think=
>>>  > > there is already good software, some of it free (both =0A>senses), for
>>>  > > ope=
>>>  > > rating systems at least as old as Windows 9x, that =0A>supports
>>>  > > editing UTF=
>>>  > > -8 encoded text.  Most of it also supports a =0A>multitude of other
>>>  > > encodin=
>>>  > > gs.  We would leave no one out by =0A>requiring UTF-8, and I do not
>>>  > > see tha=
>>>  > > t respect for our colleagues =0A>demands that CIF2 be equally
>>>  > > convenient to=
>>>  > > create and edit with =0A>every text editor in current use.  If that is
>>>  > > dou=
>>>  > > btful, however, and =0A>respect is our goal, then wouldn't the most
>>>  > > respect=
>>>  > > ful thing be to =0A>*ask* a few of the people about whom we are
>>>  > > concerned?=
>>>  > > =0A>=0A>My issue here is different, and at least partly
>>>  > > philosophical.  The=
>>>  > > =0A>CIF format can and should be about the structure and meaning of
>>>  > > CIF =
>>>  > > =0A>text content.  Character encoding is on a different level: it's a
>>>  > > =0A>c=
>>>  > > haracteristic of storage and interchange.  Comingling these layers
>>>  > > =0A>is i=
>>>  > > nelegant and unnecessary.=0A>=0A>Moreover, a CIF2 requirement to
>>>  > > encode in =
>>>  > > UTF-8 will be small =0A>comfort when presented with a file that is
>>>  > > not, in =
>>>  > > fact, encoded =0A>that way.  What can you then do?  Either reject the
>>>  > > file =
>>>  > > or =0A>autodetect the encoding.  If CIF2 does not specify a particular
>>>  > > =0A>=
>>>  > > encoding, and you receive the same file, then what can you do?
>>>  > > =0A>Exactly =
>>>  > > the same things, but then it's more likely that the file's
>>>  > > =0A>provider wil=
>>>  > > l have also specified the encoding by some means. =0A>(Particularly so
>>>  > > if t=
>>>  > > he CIF2 spec calls attention to the need to do =0A>so.)=0A>=0A>Perhaps
>>>  > > some=
>>>  > > thing like this would be an acceptable compromise:=0A>a) Rewrite
>>>  > > change 2 t=
>>>  > > o remove the requirement for UTF-8=0A>b)
>>>  > > Add:=0A>=3D=3D=3D=3D=0A>CHANGE 9 -=
>>>  > > NEW (CIF Interchange Format)=0A>=0A>Many alternative encodings are
>>>  > > availab=
>>>  > > le for recording and =0A>exchanging Unicode character data via
>>>  > > byte-oriente=
>>>  > > d media.  The CIF =0A>format itself is encoding independent, but that
>>>  > > allow=
>>>  > >
>>>  > >
>>>  > > [*** Terminated Message ***]
>>>  > >
>>>  > _______________________________________________
>>>  > ddlm-group mailing list
>>>  > ddlm-group@iucr.org
>>>  > http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>  >
>>>  >
>>>
>>
>  >
>>
>
>
>
>--
>T +61 (02) 9717 9907
>F +61 (02) 9717 3145
>M +61 (04) 0249 4148


-- 
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
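
The draft CIF2 wording quoted earlier in this thread asks readers to accept CR-LF, CR, or LF as line terminators, to map all of them to LF on reading, and to keep lines to no more than 2048 code points excluding the terminator. A minimal Python sketch of such a reader-side check might look like the following. The function names are hypothetical, and the input is assumed to have been decoded to a Unicode string already; this is one possible reading of the draft text, not a reference implementation.

```python
import re
from typing import List

# Limit taken from the draft wording quoted in the thread; the constant name is illustrative.
MAX_LINE_CODE_POINTS = 2048

def normalize_newlines(text: str) -> str:
    """Map CR-LF and bare CR to LF. CR-LF is matched first so it becomes one LF, not two."""
    return re.sub(r"\r\n|\r", "\n", text)

def overlong_lines(text: str, limit: int = MAX_LINE_CODE_POINTS) -> List[int]:
    """Return 1-based numbers of lines longer than `limit` code points.

    The line terminator itself is not counted, and len() on a Python str
    counts Unicode code points rather than encoded bytes.
    """
    lines = normalize_newlines(text).split("\n")
    return [n for n, line in enumerate(lines, start=1) if len(line) > limit]

# A line of 2048 two-byte characters is within the code point limit,
# although its UTF-8 encoding is twice as long in bytes.
sample_line = "\u00e9" * 2048
utf8_length = len(sample_line.encode("utf-8"))
```

Counting code points rather than bytes matches the draft's remark that the UTF-8 encoding of a maximal line may be considerably longer than 2048 bytes.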
