Discussion List Archives

Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .

I agree completely that you cannot reliably autodetect encodings.
Worse, trying to do so is dangerous -- potentially creating
serious garbling of what was intended to be in a file.  You
also cannot autodetect incorrect use of units, mislabelled
bottles of chemicals, and many other things in science.  That is
why it is a very good idea to label things, including encodings,
correctly.

For Unicode files, that brings us to the use of BOMs, the
use of which I highly recommend.

For CIF files, that brings us to the use of magic numbers,
the use of which I highly recommend.
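
As a rough illustration of what such labels look like at the byte level,
the Python sketch below checks the start of a file for a UTF-8 BOM and
for a CIF magic number.  The magic number string and the function name
are assumptions made for illustration; this thread does not fix the
exact form of either.

```python
import codecs

# Assumed magic number, for illustration only; not settled in this thread.
CIF2_MAGIC = b"#\\#CIF_2.0"


def read_labels(data: bytes) -> dict:
    """Report which explicit labels appear at the start of the byte stream."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    # Skip the BOM, if present, before looking for the magic number.
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    return {
        "utf8_bom": has_bom,
        "cif2_magic": body.startswith(CIF2_MAGIC),
    }
```

A reader that finds neither label has nothing to go on but guesswork,
which is the situation the labels are meant to avoid.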

For engineers, that brings us to the need to clearly label handedness 
conventions for axes.

For writers of English, that brings us to the need to clearly
specify which of the several dialects of English we are using
(e.g. American versus UK versus Brooklyn).

In the heterogeneous world in which we live, clear labelling is
a much better practice than falsely assuming a "standard" has
been adhered to.

All of which gets us no closer to having an agreed specification
for CIF2.  We need specific words for the documentation on the
subject of text versus binary and the character encoding(s)
to be used.

Regards,
   Herbert 
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Wed, 23 Jun 2010, James Hester wrote:

> Hi Simon, I suggest you review the discussion at
> http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00068.html and
> http://www.iucr.org/__data/iucr/lists/ddlm-group/msg00043.html as
> starting points.  You can't autodetect encodings, and including
> explicit encoding information is prone to error as no widespread
> standards exist for doing this.  You will be in gainful employment for
> a very long time if your employer is willing to wait for you to come
> up with a reliable way of detecting all encodings, potentially based
> on only a couple of characters!
>
> I would suggest that it is not 'doable' at all.  What's more, now is
> your chance to influence this group so that you aren't forced to deal
> with multiple encodings any more than is absolutely necessary, and can
> spend time on more exciting IUCr projects.
>
> On Wed, Jun 23, 2010 at 7:20 AM, SIMON WESTRIP
> <simonwestrip@btinternet.com> wrote:
>> OK, I think I'm starting to understand - by specifying CIF as 'text', we are
>> obliged to accept any 'text' encoding and do the best with it as we can
>> (which is basically
>> what I've been thinking from a practical point of view).
>>
>> I'm happy enough to work with this (or anything that keeps me in gainful
>> employment :-),
>> but I would suggest that if this is the route that CIF2 takes, the
>> specification will need to be
>> a bit more explicit.
>>
>> I still have reservations about having to employ heuristic encoding
>> determination for a
>> 'CIF standard', but in the end, it's all 'do-able'.
>>
>> Cheers
>>
>> Simon
>>
>> ________________________________
>> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>> Sent: Tuesday, 22 June, 2010 21:49:12
>> Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .
>>
>> Dear Simon,
>>
>>   No, the processor is compliant, but it is unable to process everything.
>> To use that rather limited processing program, you first have to pass
>> files in other encodings through a filter to make it happy, just as,
>> if you now have an EBCDIC CIF1 program, which is happily compliant
>> on some older IBM system, and sftp an ASCII CIF to it, then, even
>> though the EBCDIC CIF1 program and the ASCII CIF are both
>> CIF compliant, before the file can be processed it has to go through
>> an ASCII to EBCDIC conversion program.
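
A filter of the kind described here can be very small.  The Python
sketch below converts an ASCII CIF to EBCDIC.  It assumes the IBM code
page 037 variant of EBCDIC (Python's "cp037" codec), which the message
does not specify, and the function name is hypothetical.

```python
def ascii_to_ebcdic(data: bytes, ebcdic_codec: str = "cp037") -> bytes:
    """Transcode ASCII bytes to EBCDIC by way of Unicode code points.

    cp037 is an assumed EBCDIC variant; other systems use other code pages.
    """
    text = data.decode("ascii")  # raises UnicodeDecodeError on non-ASCII input
    return text.encode(ebcdic_codec)
```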
>>
>>   This sort of limitation is true of many programs and many data
>> standards.  The value to the community in having text-based standards
>> has, in general, outweighed the nuisance involved.
>>
>>   There has been far more trouble with binary standards.  Over the years
>> I have spent many happy hours cracking old binary files that were
>> written according to supposedly stable binary data file standards that,
>> after a decade or so, nobody could read anymore.
>>
>>   I am sure Unicode will survive for many decades.  I am not sure any
>> particular encoding of Unicode will survive long term.  UTF-8 is good and
>> useful, but it is not the last word.
>>
>>   Regards,
>>     Herbert
>>
>> On Tue, 22 Jun 2010, SIMON WESTRIP wrote:
>>
>>> 1) So if a compliant CIF2 processing system can reject any non-UTF-8 CIF,
>>> all non-UTF-8 CIFs are non-compliant?
>>>
>>> 2) So why not just state that only Unicode encodings are acceptable?
>>>
>>> Cheers
>>>
>>> Simon
>>>
>>> PS I totally accept the point you're making about how we are often
>>> oblivious to the
>>> underlying encoding used by our software,
>>> but it also demonstrates what can happen if you do not know  what the
>>> encoding is :-)
>>>
>>>
>>>
>>>
>>> _____________________________________________________________________________________________
>>> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>>> To: Group finalising DDLm and associated dictionaries
>>> <ddlm-group@iucr.org>
>>> Sent: Tuesday, 22 June, 2010 20:57:20
>>> Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .
>>>
>>> I am pleased to report that your UTF-8, quoted-printable encoded email is
>>> about as close to unreadable as almost anything I have seen in recent
>>> years.  Well done!!!!
>>>
>>> Let me take one question at a time.
>>>
>>> 1.  Does this mean that a fully compliant CIF2 processing system need
>>> *not* accept text encoded in anything else?
>>>
>>>   Yes, in deference to those who wanted _just_ UTF8, I am proposing that
>>> we accept as compliant a CIF2 processing system that is unable to
>>> process anything else.
>>>
>>> 2.  What exactly do you mean by it is important to clearly specify
>>> the intended mapping to UTF-8?
>>>
>>> I mean "it is important to clearly specify the intended mapping to UTF-8".
>>> In other words, if you are working with CIFs and you are working with
>>> a text system for which it is not clear how to map the characters
>>> with which you are working to valid Unicode code points, and you would
>>> like anybody other than yourself to ever be able to work with that
>>> CIF, then you have a responsibility for resolving the issue of that
>>> mapping.  Once you have made it to Unicode code points, the rest of the
>>> journey to UTF-8 is well specified.
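
To illustrate why the mapping has to be stated, the Python sketch below
decodes the same bytes under two different assumed encodings and only
then re-encodes them as UTF-8.  The choice of Latin-1 and code page 437,
and the function name to_utf8, are illustrative assumptions and not
anything proposed in this thread.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Map bytes to Unicode code points using the declared encoding,
    then serialise those code points as UTF-8."""
    code_points = data.decode(source_encoding, errors="strict")
    return code_points.encode("utf-8")


raw = b"1.54 \xc5"  # written as "1.54 Å" by a Latin-1 editor

as_latin1 = raw.decode("latin-1")  # byte 0xC5 becomes U+00C5 (Å)
as_cp437 = raw.decode("cp437")     # byte 0xC5 becomes U+253C (box drawing)
```

Both decodings succeed without error, so nothing but an explicit label
tells a reader which set of code points was intended.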
>>>
>>> And thank you for demonstrating the normally invisible encodings with
>>> which we all have to work.
>>>
>>> Regards,
>>>   Herbert
>>>
>>> On Tue, 22 Jun 2010, SIMON WESTRIP wrote:
>>>
>>>> Dear Herbert
>>>>
>>>> I have to confess to not entirely understanding your proposed description.
>>>>
>>>> Two questions:
>>>>
>>>> 1) "all fully compliant CIF2 processing systems should, at a minimum be
>>>> able to process text files as unicode code points represented in UTF-8"
>>>>
>>>> Does this mean that a fully compliant CIF2 processing system need *not*
>>>> accept text encoded in anything else?
>>>>
>>>> More importantly:
>>>>
>>>> 2) What exactly do you mean by
>>>>
>>>>  "it is important to clearly specify the intended mapping to UTF-8" ?
>>>>
>>>> Thanks
>>>>
>>>> Simon
>>>>
>>>> ________________________________
>>>> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>>>> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>>>> Sent: Tuesday, 22 June, 2010 18:40:47
>>>> Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. ..    .
>>>>
>>>> Dear Colleagues,
>>>>
>>>>   Except when I find the time to work with hardware, much of the science
>>>> I do ends up involving a great deal of editing of documents -- and it
>>>> is a royal waste of time to tell somebody to learn new editing habits
>>>> without a very good reason, so it is very much the case that such
>>>> mundane issues as encodings and keyboard layouts are a large factor
>>>> in how science gets done by many people.
>>>>
>>>>   Most people don't even realize how many different text encodings
>>>> they use and how different the text encodings used by their colleagues
>>>> may be.  In going from system to system, e.g. by email, the translations
>>>> among encodings are close to invisible.
>>>>
>>>>   Instead of focusing on the change document, could we please focus
>>>> on what the CIF2 specification as a complete, coherent document should
>>>> say.  Taking into account what has been said thus far, here is a
>>>> slightly revised version of what I proposed:
>>>>
>>>> ======================================================================
>>>>
>>>> CIF2 is a specification for the interchange of text files.  Text files
>>>> have many possible system dependent representations and encodings.  To
>>>> ensure clarity in the specification of CIF2, this document is written
>>>> in terms of a sequence of unicode code points, and all fully compliant
>>>> CIF2 processing systems should, at a minimum be able to process
>>>> text files as unicode code points represented in UTF-8, subject to the
>>>> XML-based restrictions below.  This approach is not meant to prevent
>>>> people from preparing valid CIF2 files with non-UTF-8-based text
>>>> editors, but, if a non-UTF-8 file format is produced, it is important
>>>> to clearly specify the intended mapping to UTF-8.  Almost all modern
>>>> systems have available a standard mapping from their internal text
>>>> representation to and from UTF-8.
>>>>
>>>> Special care is needed in dealing with end-of-line indicators (see
>>>> http://en.wikipedia.org/wiki/Newline).  This document will only
>>>> refer to LF (line feed or newline) as the line terminator.  When handling
>>>> CIF2 files produced under MS Windows, CR-LF sequences should be accepted as
>>>> an alternative to LF, and when handling CIF2 files produced under
>>>> Mac OS, CR should be accepted as an alternative to LF.  The safest policy
>>>> is to accept any of CR-LF or CR or LF as line terminators if possible,
>>>> and to map all of them to LF on reading a CIF.  Systems with other,
>>>> additional line terminators should avoid introducing them into CIF2
>>>> files meant for interchange.
>>>>
>>>> To ensure compatibility with older Fortran text processing software,
>>>> lines in CIF2 files should be restricted to no more than 2048
>>>> code points in length, not including the line terminator itself.
>>>> Note that the UTF-8 encoding of such a line may well be much longer.
>>>>
>>>> ======================================================================
>>>>
>>>> At 4:13 PM +0000 6/22/10, SIMON WESTRIP wrote:
>>>> >Perhaps John's compromise might be the way forward?
>>>> >
>>>> >From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
>>>> >To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>>>> >Sent: Tuesday, 22 June, 2010 16:15:36
>>>> >Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .
>>>> >
>>>> >I prefer leaving the issue of character encoding entirely out of the
>>>> >scope of the CIF format specification (effectively allowing any
>>>> >encoding).  On the other hand, I think it's a bit of an
>>>> >aggrandizement to characterize UTF-16 / Shift-JIS / etc. as "ways in
>>>> >which many of our colleagues get their science done."  In no way do
>>>> >I dispute that many of our colleagues indeed use these encodings
>>>> >routinely, but I am doubtful that editing Unicode text with a text
>>>> >editor constitutes a significant part of many of their research
>>>> >programs.  At least, few of my English-speaking colleagues edit flat
>>>> >Unicode text files with any frequency, if ever they do at all.
>>>> >
>>>> >I think there is already good software, some of it free (both
>>>> >senses), for operating systems at least as old as Windows 9x, that
>>>> >supports editing UTF-8 encoded text.  Most of it also supports a
>>>> >multitude of other encodings.  We would leave no one out by
>>>> >requiring UTF-8, and I do not see that respect for our colleagues
>>>> >demands that CIF2 be equally convenient to create and edit with
>>>> >every text editor in current use.  If that is doubtful, however, and
>>>> >respect is our goal, then wouldn't the most respectful thing be to
>>>> >*ask* a few of the people about whom we are concerned?
>>>> >
>>>> >My issue here is different, and at least partly philosophical.  The
>>>> >CIF format can and should be about the structure and meaning of CIF
>>>> >text content.  Character encoding is on a different level: it's a
>>>> >characteristic of storage and interchange.  Comingling these layers
>>>> >is inelegant and unnecessary.
>>>> >
>>>> >Moreover, a CIF2 requirement to encode in UTF-8 will be small
>>>> >comfort when presented with a file that is not, in fact, encoded
>>>> >that way.  What can you then do?  Either reject the file or
>>>> >autodetect the encoding.  If CIF2 does not specify a particular
>>>> >encoding, and you receive the same file, then what can you do?
>>>> >Exactly the same things, but then it's more likely that the file's
>>>> >provider will have also specified the encoding by some means.
>>>> >(Particularly so if the CIF2 spec calls attention to the need to do
>>>> >so.)
>>>> >
>>>> >Perhaps something like this would be an acceptable compromise:
>>>> >a) Rewrite change 2 to remove the requirement for UTF-8
>>>> >b) Add:
>>>> >====
>>>> >CHANGE 9 - NEW (CIF Interchange Format)
>>>> >
>>>> >Many alternative encodings are available for recording and
>>>> >exchanging Unicode character data via byte-oriented media.  The CIF
>>>> >format itself is encoding independent, but that allow
>>>>
>>>> [*** Terminated Message ***]
>>>>
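
The end-of-line and line-length rules in the draft text quoted above can
be sketched as follows in Python.  The function names are hypothetical,
and the 2048 limit is taken from the draft as quoted; this is an
illustration under those assumptions, not agreed CIF2 behaviour.

```python
import re
from typing import List

MAX_LINE_CODE_POINTS = 2048  # limit taken from the quoted draft


def normalize_line_terminators(text: str) -> str:
    """Map CR-LF and lone CR to LF, as the draft suggests doing on read."""
    return re.sub(r"\r\n|\r", "\n", text)


def overlong_lines(text: str) -> List[int]:
    """Return the 1-based numbers of lines that exceed the limit.

    Length is counted in Unicode code points, not in UTF-8 bytes.
    """
    lines = normalize_line_terminators(text).split("\n")
    return [number for number, line in enumerate(lines, start=1)
            if len(line) > MAX_LINE_CODE_POINTS]
```

A line of 2048 "Å" characters passes this check even though its UTF-8
encoding occupies 4096 bytes, which is the point of the draft's closing
remark about encoded length.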
>>> _______________________________________________
>>> ddlm-group mailing list
>>> ddlm-group@iucr.org
>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>
>>>
>
>
>
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148