
Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .
From: James Hester <[email protected]>
Date: Wed, 23 Jun 2010 10:38:33 +1000
In-Reply-To: <[email protected]>
References: <[email protected]><[email protected]><[email protected]><[email protected]><[email protected]><[email protected]><[email protected]><[email protected]><[email protected]><[email protected]><[email protected]><[email protected]><[email protected]><[email protected]><[email protected]>

Herbert writes:

On Wed, Jun 23, 2010 at 8:48 AM, Herbert J. Bernstein
<[email protected]> wrote:
> No, you are not obliged to accept any 'text' encoding.  It is perfectly
> reasonable for you to insist that the user provide you with some encoding
> that you are prepared to read reliably.

No, it is not perfectly reasonable, because you and the "user" may not
have the opportunity to negotiate encodings.  Think a collection of
archived CIF files on CD - maybe they all have different encodings,
and you just have to figure it out for every one of 32,000 files.  Hope your
Chinese is good enough to tell the difference between those 3 Chinese
encodings.  And wouldn't insisting on an encoding that you are
prepared to accept contradict your principle about 'respect' for the
way other people do things?  No, to be consistent you, Herbert, must
be prepared to accept absolutely every encoding that comes your way.
My own point of view is that this step of 'insist that the user
provide you with some encoding that you are prepared to read reliably'
is best done right here, when the standard is drafted.  That way,
there is no need for negotiation of encoding.
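
To make the "figure it out" problem concrete, here is a minimal Python
sketch (the helper name plausible_decodings and the sample bytes are
illustrative assumptions, not taken from any real CIF).  It shows that
one byte sequence can decode without error under several candidate
encodings, each giving different text, so a successful decode does not
tell you which encoding was intended:

```python
def plausible_decodings(data, candidates):
    """Return {encoding: text} for every candidate that decodes without error."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            pass  # this candidate is ruled out; the remaining ones are not
    return results


# Two Chinese characters (U+4E2D U+6587) written out as GBK bytes.
sample = "\u4e2d\u6587".encode("gbk")

guesses = plausible_decodings(
    sample, ["utf-8", "gbk", "cp1251", "koi8_r", "latin-1"]
)
for name, text in guesses.items():
    print(f"{name}: {text!r}")
```

Only the UTF-8 attempt fails here; the other candidates all "work",
and nothing in the bytes themselves says which reading is the right
one.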

Frankly, I am amazed that I am the only one who thinks mandating a
single encoding is the obvious way forward.  Perhaps it is because I
worked in Japan for many years, and in addition have had to deal with
Russian text frequently.  Believe me when I say, I know the problems
caused by the simultaneous coexistence of multiple encodings. Why
anybody would want to pass up the opportunity to settle on one is
beyond me.

> Thus it would be fine for the IUCr
> to decline to handle, say, a 7-track CDC display code mag tape or a paper
> tape in 5-level Murray Code, but there is an amazing range of text
> encodings that can now be handled easily under Linux, Windows and Mac OS X.
> For example, on my Mac the utility I use, Cyclone, is willing
> to convert the following encodings:
>
>  Unicode
>    Unicode
>    Unicode 2.1
>    Unicode 3.0
>    Unicode 3.2
>    Unicode 4.0
>    Unicode 5.0
>      With variants for UTF-16,
>                        UTF-16 Canonical Decomposition,
>                        UTF-16 Canonical Composition,
>                        UTF-16 HFS+ Decomposition,
>                        UTF-16 HFS+ Composition,
>                        UTF-7,
>                        UTF-7 Canonical Decomposition,
>                        UTF-7 Canonical Composition,
>                        UTF-7 HFS+ Decomposition,
>                        UTF-7 HFS+ Composition,
>                        UTF-8,
>                        UTF-8 Canonical Decomposition,
>                        UTF-8 Canonical Composition,
>                        UTF-8 HFS+ Decomposition,
>                        UTF-8 HFS+ Composition,
>                        UTF-7,
>                        UTF-7 Canonical Decomposition,
>                        UTF-7 Canonical Composition,
>                        UTF-7 HFS+ Decomposition,
>                        UTF-7 HFS+ Composition,
>    Windows
>      Arabic (CP1256)
>      Baltic (CP1257)
>      Central European - Latin-2 (CP1250)
>      Chinese Simplified (CP 936)
>      Chinese Traditional (CP 950)
>      Cyrillic (CP1251)
>      Greek (CP1253)
>      Hebrew (CP1255)
>      Japanese (CP932)
>      Korean (CP949)
>      Thai (CP874)
>      Turkish - Latin-5 (CP1254)
>      Vietnamese (CP1258)
>      Western - Latin-1 (CP1252)
>   6 different Chinese Simplified encodings
>   3 different Chinese Traditional encodings
>   etc., etc., etc.
>
> Under unix (or Mac OSX or MINGW) you can use the iconv utility
>
> and the wikipedia will show you many more.
>
> Bottom line -- it makes sense to impose reasonable limits on what
> encodings to handle, and to ask people to be very clear about
> what encoding they used, but reasonable limits should cover a very wide
> range of encodings these days.
>
> Regards,
>    Herbert
>
>
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  [email protected]
> =====================================================
>
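The iconv-style conversion Herbert describes only works when the source
encoding has been stated.  A minimal Python sketch of that step follows;
the function name transcode_to_utf8 and the sample word are illustrative
assumptions.  The command-line counterpart would be along the lines of
"iconv -f CP1252 -t UTF-8".

```python
def transcode_to_utf8(data: bytes, declared_encoding: str) -> bytes:
    """Re-encode bytes from a declared source encoding to UTF-8.

    Decoding is strict: bytes that are invalid in the declared encoding
    raise UnicodeDecodeError, and an unknown encoding name raises
    LookupError, instead of silently producing damaged text.
    """
    text = data.decode(declared_encoding)
    return text.encode("utf-8")


# "Angstrom" spelled with U+00C5 and U+00F6, as a CP1252 editor would save it.
legacy = "\u00c5ngstr\u00f6m".encode("cp1252")
converted = transcode_to_utf8(legacy, "cp1252")
print(legacy)     # single-byte form of the two accented letters
print(converted)  # each accented letter now takes two bytes
```

The conversion itself is mechanical; the hard part is knowing what to
pass as declared_encoding.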
> On Tue, 22 Jun 2010, SIMON WESTRIP wrote:
>
>> OK, I think I'm starting to understand - by specifying CIF as 'text', we
>> are
>> obliged to accept any 'text' encoding and do the best with it as we can
>> (which is basically
>> what I've been thinking from a practical point of view).
>>
>> I'm happy enough to work with this (or anything that keeps me in gainful
>> employment :-),
>> but I would suggest that if this is the route that CIF2 takes, the
>> specification will need
>> to be
>> a bit more explicit.
>>
>> I still have reservations about having to employ heuristic encoding
>> determination for a
>> 'CIF standard', but in the end, it's all 'do-able'.
>>
>> Cheers
>>
>> Simon
>>
>>
>> ___________________________________________________________________________________________
>> From: Herbert J. Bernstein <[email protected]>
>> To: Group finalising DDLm and associated dictionaries
>> <[email protected]>
>> Sent: Tuesday, 22 June, 2010 21:49:12
>> Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .
>>
>> Dear Simon,
>>
>>   No, the processor is compliant, but it is unable to process everything.
>> To use that rather limited processing program, you first have to pass
>> files in other encodings through a filter to make it happy, just as,
>> if you now have an EBCDIC CIF1 program, which is happily compliant
>> on some older IBM system and sftp an ASCII CIF to it, then, even
>> though the EBCDIC CIF1 program and the ASCII CIF are both
>> CIF compliant, before the file can be processed it has to go through
>> an ASCII to EBCDIC conversion program.
>>
>>   This sort of limitation is true of many programs and many data
>> standards.  The value to the community in having text-based standards
>> has, in general, outweighed the nuisance involved.
>>
>>   There has been far more trouble with binary standards.  Over the years
>> I have spent many happy hours cracking old binary files that were
>> written according to supposedly stable binary data file standards that,
>> after a decade or so, nobody could read anymore.
>>
>>   I am sure Unicode will survive for many decades.  I am not sure any
>> particular encoding of Unicode will survive long term.  UTF-8 is good and
>> useful, but it is not the last word.
>>
>>   Regards,
>>     Herbert
>>
>> =====================================================
>> Herbert J. Bernstein, Professor of Computer Science
>>   Dowling College, Kramer Science Center, KSC 121
>>         Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                 +1-631-244-3035
>>                 [email protected]
>> =====================================================
>>
>> On Tue, 22 Jun 2010, SIMON WESTRIP wrote:
>>
>> > 1) So if a compliant CIF2 processing system can reject any non-UTF-8
>> > CIF,
>> > all non-UTF-8 CIFs are non-compliant?
>> >
>> > 2) So why not just state that only Unicode encodings are acceptable?
>> >
>> > Cheers
>> >
>> > Simon
>> >
>> > PS I totally accept the point you're making about how we are often
>> > oblivious to the
>> > underlying encoding used by our software,
>> > but it also demonstrates what can happen if you do not know what the
>> > encoding is :-)
>> >
>> >
>> >
>> >
>> > ___________________________________________________________________________________________
>> > From: Herbert J. Bernstein <[email protected]>
>> > To: Group finalising DDLm and associated dictionaries
>> > <[email protected]>
>> > Sent: Tuesday, 22 June, 2010 20:57:20
>> > Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .
>> >
>> > I am pleased to report that your UTF-8; quoted-printable encoded email
>> > is
>> > about as close to unreadable as almost anything I have seen in recent
>> > years.  Well done!!!!
>> >
>> > Let me take one question at a time.
>> >
>> > 1.  Does this mean that a fully compliant CIF2 processing system need
>> > *not* accept text encoded in anything else?
>> >
>> >   Yes, in deference to those who wanted _just_ UTF8, I am proposing that
>> > we accept as compliant a CIF2 processing system that is unable to
>> > process anything else.
>> >
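A processor of the minimal kind described in point 1, one that handles
UTF-8 and nothing else, could be sketched in Python as below.  The
function name read_cif2_text is a hypothetical illustration, and the
sketch covers only the encoding check, not CIF2 syntax or the XML-based
character restrictions mentioned in the proposal.

```python
def read_cif2_text(data: bytes) -> str:
    """Decode bytes as UTF-8 or refuse them; no guessing, no fallback."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first offending byte
        raise ValueError(
            f"input is not valid UTF-8 (first bad byte at offset {exc.start})"
        ) from exc


print(read_cif2_text("data_example".encode("utf-8")))

try:
    read_cif2_text("\u00c5ngstr\u00f6m".encode("cp1252"))
except ValueError as err:
    print("rejected:", err)
```

Such a processor stays compliant under the proposal; files in other
encodings have to be converted before they reach it.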
>> > 2.  What exactly do you mean by it is important to clearly specify
>> > the intended mapping to UTF-8?
>> >
>> > I mean "it is important to clearly specify the intended mapping to
>> > UTF-8".
>> > In other words, if you are working with CIFs and you are working with
>> > a text system for which it is not clear how to map the characters
>> > with which you are working to valid Unicode code points, and you would
>> > like anybody other than yourself to ever be able to work with that
>> > CIF, then you have a responsibility for resolving the issue of that
>> > mapping.  Once you have made it to Unicode code points, the rest of the
>> > journey to UTF-8 is well specified.
>> >
>> > And thank you for demonstrating the normally invisible encodings with
>> > which we all have to work.
>> >
>> > Regards,
>> >   Herbert
>> >
>> >
>> >
>> >
>> > =====================================================
>> >   Herbert J. Bernstein, Professor of Computer Science
>> >     Dowling College, Kramer Science Center, KSC 121
>> >         Idle Hour Blvd, Oakdale, NY, 11769
>> >
>> >                   +1-631-244-3035
>> >                   [email protected]
>> > =====================================================
>> >
>> > On Tue, 22 Jun 2010, SIMON WESTRIP wrote:
>> >
>> > >> --===============1314147413==
>> > > Content-Type: multipart/alternative;
>> > > boundary="0-856724611-1277235714=:44070"
>> > >
>> > >> --0-856724611-1277235714=:44070
>> > > Content-Type: text/plain; charset=utf-8
>> > > Content-Transfer-Encoding: quoted-printable
>> > >
>> > > Dear Herbert=0A=0AI have to confess to not entirely understanding your
>> > > prop=
>> > > osed description.=0A=0ATwo questions:=0A=0A1) "all fully compliant
>> > > CIF2 pro=
>> > > cessing systems should, at a minimum be able to process=0Atext files
>> > > as uni=
>> > > code code points represented in UTF-8"=0A=0ADoes this mean that a
>> > > fully com=
>> > > pliant CIF2 processing system need *not* accept text encoded in
>> > > anything el=
>> > > se?=0A=0AMore importantly:=0A=0A2) What exactly do you mean by=0A=0A
>> > > "it is=
>> > > important to clearly specify the intended mapping to UTF-8"
>> > > ?=0A=0A=0AThan=
>> > >
>> > > ks=0A=0ASimon=0A=0A=0A=0A=0A=0A=0A=0A=0A________________________________=0A=
>> > > From: Herbert J. Bernstein <[email protected]>=0ATo: Group
>> > > final=
>> > > ising DDLm and associated dictionaries <[email protected]>=0ASent:
>> > > Tuesda=
>> > > y, 22 June, 2010 18:40:47=0ASubject: Re: [ddlm-group] options/text vs
>> > > binar=
>> > > y/end-of-line. .. .. .=0A=0ADear Colleagues,=0A=0A  Except when I
>> > > find =
>> > > the time to work with hardware, much of the science=0AI do ends up
>> > > involvin=
>> > > g a great deal of editing of documents -- and it=0Ais a royal waste of
>> > > time=
>> > > to tell somebody to learn new editing habits=0Awithout a very good
>> > > reason,=
>> > > so it is very much the case the such=0Amundane issues as encodings and
>> > > key=
>> > > board layouts are a large factor=0Ain how science gets done by many
>> > > people.=
>> > > =0A=0A  Most people don't even realize how many different text
>> > > encodings=
>> > > =0Athey use and how different the text encodings used by their
>> > > colleagues=
>> > > =0Amay be.  In going from system to system, e.g. by email, the
>> > > translations=
>> > > =0Aamong encodings are close to invisible.=0A=0A  Instead of focusing
>> > > on t=
>> > > he change document, could we please focus=0Aon what the CIF2
>> > > specification =
>> > > as a complete, coherent document should=0Asay.  Taking into account
>> > > what ha=
>> > > s been said thus far, here is a =0Aslightly revised version of what I
>> > > propo=
>> > >
>> > > sed:=0A=0A=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
>> > >
>> > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
>> > >
>> > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=0A=0A=
>> > > CIF2 is a specification for the interchange of text files.  Text
>> > > files=0Aha=
>> > > ve many possible system dependent representations and encodings.
>> > > To=0Aensu=
>> > > re clarity in the specification of CIF2, this document is written=0Ain
>> > > term=
>> > > s of a sequence of unicode code points, and all fully compliant=0ACIF2
>> > > proc=
>> > > essing systems should, at a minimum be able to process=0Atext files as
>> > > unic=
>> > > ode code points represented in UTF-8, subject to the=0AXML-based
>> > > restrictio=
>> > > ns below.  This approach is not meant to prevent=0Apeople from
>> > > preparing va=
>> > > lid CIF2 files with non-UTF-8-based text=0Aeditors, but, if a
>> > > non-UTF-8 fil=
>> > > e format is produced, it is important=0Ato clearly specify the
>> > > intended map=
>> > > ping to UTF-8.  Almost all modern=0Asystems have available a standard
>> > > mappi=
>> > > ng from their internal text=0Arepresentation to and from
>> > > UTF-8.=0A=0ASpecia=
>> > > l care is needed in dealing with end-of-line indicators
>> > > (see=0Ahttp://en.wi=
>> > > kipedia.org/wiki/Newline).  This document will only=0Arefer to LF
>> > > (line fe=
>> > > ed or newline) as the line terminator.  When handling=0ACIF2 files
>> > > produced=
>> > > under MS windows, CR-LF sequences should be accepted as=0Aan
>> > > alternative t=
>> > > o LF, and when handling CIF2 files produced under=0AMac OS, CR should
>> > > be ac=
>> > > cepted as an alternative to LF.  The safest policy=0Ais to accept any
>> > > of CR=
>> > > -LF or CR or LF and line terminators if possible,=0Aand to map all of
>> > > them =
>> > > to LF on reading a CIF.  Systems with other,=0Aadditional line
>> > > terminators =
>> > > should avoid introducing them into CIF2=0Afiles meant for
>> > > interchange.=0A=
>> > > =0ATo ensure compatibility with older Fortran text processing
>> > > software,=0Al=
>> > > ines in CIF2 files should be restricted to no more than 2048=0Acode
>> > > points =
>> > > in length, not including the line terminator itself.=0ANot that the
>> > > UTF-8 e=
>> > > ncoding of such a line may well be much longer."=0A
>> > > =3D=3D=3D=3D=3D=3D=3D=
>> > >
>> > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
>> > >
>> > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=
>> > > =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=0A=0A=0A=0AAt 4:13 PM +0000 6/22/10,
>> > > SIMON W=
>> > > ESTRIP wrote:=0A>Perhaps John's compromise might be the way
>> > > forward?=0A>=0A=
>> > >>> =0A>=0A>=0A>From: "Bollinger, John C"
>> > >>> <[email protected]>=0A>To: G=
>> > > roup finalising DDLm and associated dictionaries
>> > > <[email protected]>=0A>S=
>> > > ent: Tuesday, 22 June, 2010 16:15:36=0A>Subject: Re: [ddlm-group]
>> > > options/t=
>> > > ext vs binary/end-of-line. .. .. .=0A>=0A>=0A>I prefer leaving the
>> > > issue of=
>> > > character encoding entirely out of the =0A>scope of the CIF format
>> > > specifi=
>> > > cation (effectively allowing any =0A>encoding).  On the other hand, I
>> > > think=
>> > > it's a bit of an =0A>aggrandizement to characterize UTF-16 / Shift-JIS
>> > > / e=
>> > > tc. as "ways in =0A>which many of our colleagues get their science
>> > > done."  =
>> > > In no way do =0A>I dispute that many of our colleagues indeed use
>> > > these enc=
>> > > odings =0A>routinely, but I am doubtful that editing Unicode text with
>> > > a te=
>> > > xt =0A>editor constitutes a significant part of many of their research
>> > > =0A>=
>> > > programs.  At least, few of my English-speaking colleagues edit flat
>> > > =0A>Un=
>> > > icode text files with any frequency, if ever they do at all.=0A>=0A>I
>> > > think=
>> > > there is already good software, some of it free (both =0A>senses), for
>> > > ope=
>> > > rating systems at least as old as Windows 9x, that =0A>supports
>> > > editing UTF=
>> > > -8 encoded text.  Most of it also supports a =0A>multitude of other
>> > > encodin=
>> > > gs.  We would leave no one out by =0A>requiring UTF-8, and I do not
>> > > see tha=
>> > > t respect for our colleagues =0A>demands that CIF2 be equally
>> > > convenient to=
>> > > create and edit with =0A>every text editor in current use.  If that is
>> > > dou=
>> > > btful, however, and =0A>respect is our goal, then wouldn't the most
>> > > respect=
>> > > ful thing be to =0A>*ask* a few of the people about whom we are
>> > > concerned?=
>> > > =0A>=0A>My issue here is different, and at least partly
>> > > philosophical.  The=
>> > > =0A>CIF format can and should be about the structure and meaning of
>> > > CIF =
>> > > =0A>text content.  Character encoding is on a different level: it's a
>> > > =0A>c=
>> > > haracteristic of storage and interchange.  Comingling these layers
>> > > =0A>is i=
>> > > nelegant and unnecessary.=0A>=0A>Moreover, a CIF2 requirement to
>> > > encode in =
>> > > UTF-8 will be small =0A>comfort when presented with a file that is
>> > > not, in =
>> > > fact, encoded =0A>that way.  What can you then do?  Either reject the
>> > > file =
>> > > or =0A>autodetect the encoding.  If CIF2 does not specify a particular
>> > > =0A>=
>> > > encoding, and you receive the same file, then what can you do?
>> > > =0A>Exactly =
>> > > the same things, but then it's more likely that the file's
>> > > =0A>provider wil=
>> > > l have also specified the encoding by some means. =0A>(Particularly so
>> > > if t=
>> > > he CIF2 spec calls attention to the need to do =0A>so.)=0A>=0A>Perhaps
>> > > some=
>> > > thing like this would be an acceptable compromise:=0A>a) Rewrite
>> > > change 2 t=
>> > > o remove the requirement for UTF-8=0A>b)
>> > > Add:=0A>=3D=3D=3D=3D=0A>CHANGE 9 -=
>> > > NEW (CIF Interchange Format)=0A>=0A>Many alternative encodings are
>> > > availab=
>> > > le for recording and =0A>exchanging Unicode character data via
>> > > byte-oriente=
>> > > d media.  The CIF =0A>format itself is encoding independent, but that
>> > > allow=
>> > >
>> > >
>> > > [*** Terminated Message ***]
>> > >
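The raw body above is quoted-printable: "=0A" stands for a line feed,
"=3D" for an equals sign, and a trailing "=" marks a soft line break.
As a minimal sketch, Python's standard quopri module can undo it.  The
excerpt below is assumed to have been cleaned by hand first, that is,
the ">> > >" quote prefixes stripped and the lines re-wrapped by the
archive rejoined.

```python
import quopri

# Hand-cleaned excerpt of the opening of the quoted message.
raw = (
    b"Dear Herbert=0A=0AI have to confess to not entirely understanding "
    b"your prop=\n"
    b"osed description."
)

# decodestring() expands the =XX escapes and removes soft line breaks;
# the resulting bytes are then interpreted with the charset the message
# declared (utf-8 in its Content-Type header).
decoded = quopri.decodestring(raw).decode("utf-8")
print(decoded)
```

Mail clients normally do this silently, which is why the transfer
encoding usually goes unnoticed.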
>> > _______________________________________________
>> > ddlm-group mailing list
>> > [email protected]
>> > http://scripts.iucr.org/mailman/listinfo/ddlm-group
>> >
>> >
>>
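The specification text proposed in the quoted message asks readers to
accept CR-LF, CR or LF and map them all to LF, and to keep lines to at
most 2048 code points, excluding the terminator.  A minimal Python
sketch of those two rules follows; the function names are illustrative
assumptions, and the input is assumed to be text that has already been
decoded.

```python
MAX_CODE_POINTS = 2048  # per-line limit from the proposed text


def normalize_newlines(text: str) -> str:
    """Map CR-LF and lone CR to LF; CR-LF is handled first so it yields one LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def overlong_lines(text: str, limit: int = MAX_CODE_POINTS) -> list:
    """Return 1-based numbers of lines longer than `limit` code points."""
    lines = normalize_newlines(text).split("\n")
    # len() of a Python str counts code points, not encoded bytes
    return [n for n, line in enumerate(lines, start=1) if len(line) > limit]


mixed = "a\r\nb\rc\nd"
print(repr(normalize_newlines(mixed)))

sample = "x" * 2048 + "\r\n" + "y" * 2049
print(overlong_lines(sample))

# A line within the limit can still exceed 2048 bytes once encoded.
print(len(("\u00c5" * 2048).encode("utf-8")))
```

The last line illustrates the caveat in the proposed text: the limit
counts code points, so the UTF-8 form of a permitted line may be
considerably longer in bytes.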
>
> _______________________________________________
> ddlm-group mailing list
> [email protected]
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
>



-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
