Re: [ddlm-group] options/text vs binary/end-of-line. .. .

With the usual apologies for my tardiness in keeping up with this
correspondence...

> You will have to ask Brian what encodings the IUCr saw in Chester.

To the best of my knowledge, the only encodings we have had to deal
with have been ASCII and "Quoted-Printable" (and possibly other MIME
transfer encodings such as base64), the latter having come either from
broken mail transmissions or from naive extraction of message bodies
from CIFs sent as (encoded) email messages.
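
For illustration only, here is a minimal Python sketch of how a file that was
naively extracted from a Quoted-Printable message body might be spotted and
decoded. The function names and the detection pattern are assumptions made
for this example, not a description of anything the IUCr actually ran; the
pattern is a heuristic and could misfire on a CIF that legitimately contains
"=" followed by two hex digits.

```python
import quopri
import re

# Heuristic (an assumption for this sketch): Quoted-Printable text contains
# "=XX" hex escapes and/or soft line breaks, i.e. "=" at the end of a line.
_QP_PATTERN = re.compile(rb"=[0-9A-F]{2}|=\r?\n")


def looks_quoted_printable(raw: bytes) -> bool:
    """Return True if the bytes show typical Quoted-Printable markers."""
    return _QP_PATTERN.search(raw) is not None


def recover(raw: bytes) -> bytes:
    """Decode Quoted-Printable debris if it is detected, else return raw unchanged."""
    if looks_quoted_printable(raw):
        return quopri.decodestring(raw)
    return raw
```

A file with no such markers passes through untouched, so the check can be
applied to every incoming file.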

I don't draw too profound a conclusion from this: some proportion may
in fact have originated on EBCDIC systems but been correctly
translated by ftp text-mode or other transmission protocols. (I am
aware that there is not a unique EBCDIC->ASCII translation, so
"correctly" implies the use somewhere along the way of heuristic
transformations.) Such things did exist, and worked OK in many
cases. As a historical note, in the very early days of CIF some of the
files we got will have reached us from BITNET hosts via JANET's
Coloured Book protocol converters - our very first connection to the
Internet was via an X.29 gateway.

To guard against character set conversions, we appended a 4-line
comment to our distributed template file:

# The following lines are used to test the character set of files sent by
# network email or other means. They are not part of the CIF data set.
# abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789
# !@#$%^&*()_+{}:"~<>?|\-=[];'`,./

(I wonder how that looks in your email reader!) I think the curly
braces were the most likely characters to change in EBCDIC/ASCII
translation. This signature is still found on the templates on our ftp
site. I don't know whether we still do a character check on this
signature, if it is found. In the early days at least, the comparison
would have been done by simple OS text-mode tools - grep, sed, string
equality tests in Bourne shell... One imagines that something more
sophisticated would be needed if one wanted not only to test for
deviation from a canonical encoding, but to suggest the most likely
intended encoding.
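
As a rough sketch of what such a check could look like today, the Python
below tests whether the two signature lines survive verbatim and, failing
that, tries a short list of candidate codecs to see which one reproduces
them. The function names and the candidate list (ASCII plus the EBCDIC code
pages cp037 and cp500) are assumptions chosen for illustration; this is not
the check used on the IUCr templates.

```python
# The two signature lines are copied from the template comment quoted above.
SIGNATURE = (
    "# abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789",
    "# !@#$%^&*()_+{}:\"~<>?|\\-=[];'`,./",
)

# Assumed candidates: ASCII plus two common EBCDIC code pages.
CANDIDATE_CODECS = ("ascii", "cp037", "cp500")


def signature_intact(text):
    """Return True if both signature lines appear verbatim in the text."""
    lines = [line.rstrip() for line in text.splitlines()]
    return all(sig in lines for sig in SIGNATURE)


def guess_codec(raw):
    """Return the first candidate codec that yields an intact signature, else None."""
    for codec in CANDIDATE_CODECS:
        try:
            text = raw.decode(codec)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this codec; try the next
        if signature_intact(text):
            return codec
    return None
```

Because cp037 and cp500 place characters such as "[", "]" and "!" at
different byte values, the punctuation line is what tells the two EBCDIC
variants apart; a file whose signature matches under none of the candidates
returns None and would need manual inspection.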

> Brian should also be able to provide the history behind the note
> I cited on encodings in CIF1.

I can't readily locate the correspondence, but will try to do so if
anyone is very interested. To my recollection, there was a lengthy
correspondence touching on many of the same requirements for
accommodating authors whose working environment was beyond their
control, understanding or interest. Then, as now, there was no real
sense of conflicting interests, only a protracted exploration of how
best to express the desired outcome without over-complicating the
standard.

As has been stated many times, the ideal outcome (perfect, guaranteed
uncorrupted transmission of information) is never going to be
attainable because of the many layers of transmission protocols that
are implemented in the real world by different vendors, programmers
etc. We're still tussling with the optimal tradeoff between complexity
and functionality, heuristics and algorithms, respect and
authoritarianism.

James's summary, just received, lays out the dialectics quite nicely,
and I'll respond when I've properly digested it. My *inclination* at
this stage is towards establishing as compact a standard as possible
that is yet amenable to extension in the light of need, dictated by
real-world experience.

Best wishes
Brian

On Wed, Jun 23, 2010 at 06:48:27AM -0400, Herbert J. Bernstein wrote:
> Dear James,
> 
>   You seem to be asking for the specific encoding used for specific
> CIFs in the time window from 1991 to 2010.  Precisely because of
> the automatic shifts in encodings in the transfer of text files,
> I don't know the encodings that the CIFs I worked with from 1995
> on used.  All I know are the encodings I used in working with them,
> which were 7- and 8- bit ASCII, CDC display code and several different
> code-page based encodings.  Personally, I tried to avoid EBCDIC.  You will 
> have to ask Brian what encodings the IUCr saw in Chester.  I do know that 
> I had great difficulty nailing down the representation of anything in 
> Unicode until just a few years ago, and that just 2 years ago I had 
> serious trouble  under Windows-XP with confusion between q and a when 
> working with an English keyboard on a French-localized system.
> 
>   Brian should also be able to provide the history behind the note
> I cited on encodings in CIF1.
> 
>   Regards,
>     Herbert
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
> 
>                  +1-631-244-3035
>                  yaya@dowling.edu
> =====================================================