Discussion List Archives


Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .

There are practical as well as philosophical issues here. My other
post on handling ASCII 'null' had two purposes. One was to elicit
information about the validity or otherwise of a null byte (thanks
John for the clear answer to that). The other was to draw attention
to the original vcif's difficulty in handling this specific issue
because it was using C stdio library routines that recognised null
as a string terminator - not its intended role in this context.
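
The truncation effect described above can be illustrated with a small sketch. It is written in Python rather than C, and it only simulates what a NUL-terminated string routine would see; the function names and the sample line are hypothetical and are not taken from vcif.

```python
def as_c_string(raw: bytes) -> bytes:
    """Bytes a NUL-terminated string routine would treat as the whole string."""
    end = raw.find(b"\x00")
    return raw if end == -1 else raw[:end]


def nul_offsets(raw: bytes) -> list:
    """Offsets of every NUL byte, so a reader can report them instead of truncating."""
    return [i for i, byte in enumerate(raw) if byte == 0]


# Hypothetical input line with an embedded NUL byte.
line = b"_cell_length_a 5.4\x0031(2)\n"

print(as_c_string(line))  # everything after the embedded NUL is silently lost
print(nul_offsets(line))  # positions a validator could report to the user
```

Scanning the raw bytes before handing them to any string routine is one way an application could report the problem rather than silently dropping data.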

In similar vein, I suppose that many CIF applications will implement
procedures such as testing for validity of character values through
reliance on existing libraries rather than direct bitwise comparisons.
Experience suggests that there will be subtle (or not so subtle)
differences between library implementations from GNU, IBM, Microsoft,
Sun, Apple, Perl, Python, English, American, Japanese, ... authors.
Mostly people will get by with applications that suit their purpose
most of the time, perhaps occasionally helped along by a bit of
stream-editing or opening a file in some editor and saving it out
again in some other format. Simon's mangled mail is another example
of how things do not always work in the real world.

Expecting every CIF application to be robustly able to handle every
conceivable - or even every reasonable - encoding is (what's the
word?) "optimistic", and places a heavy burden on application
developers.

Consider instead the approach of defining the CIF standard as a
text file and using UTF-8 for a "canonical" description of low-level
representations. Supply a set of such canonical CIFs in the
next-generation trip test suite. Require a "compliant" CIF
application to handle the trip tests with the canonical encoding.
Permit - indeed encourage - applications developers to accommodate
other encodings to the extent they can easily do with their standard
text-processing libraries/utilities/tools. Encourage or perhaps
commission a "canonicalisation" suite for use in contexts where
an application cannot natively handle a submitted encoding. Note that
such a suite might have a combination of automatic converters, where
the required translations are fully deterministic; but it may also
require interactive visual tools if there are non-deterministic
translations. It doesn't exclude the possibility of moving to a
different canonical encoding in subsequent revisions.
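
As a rough sketch of the "automatic converter" half of such a suite: the following assumes that a leading byte-order mark is the only evidence treated as fully deterministic, and that anything else which is not valid UTF-8 is passed to a human. The names `canonicalise` and `NeedsHumanReview` are hypothetical, and this is an illustration under those assumptions, not a proposed implementation.

```python
import codecs


class NeedsHumanReview(Exception):
    """Raised when the encoding cannot be determined deterministically."""


# BOM -> codec. UTF-32 is listed before UTF-16 because the UTF-32-LE BOM
# begins with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def canonicalise(raw: bytes) -> bytes:
    """Return the content re-encoded as UTF-8 without a BOM, or raise."""
    for bom, codec in _BOMS:
        if raw.startswith(bom):
            try:
                return raw[len(bom):].decode(codec).encode("utf-8")
            except UnicodeDecodeError as exc:
                raise NeedsHumanReview(f"BOM suggests {codec}, but decoding failed") from exc
    try:
        raw.decode("utf-8")  # strict check; ASCII passes as a subset of UTF-8
    except UnicodeDecodeError as exc:
        raise NeedsHumanReview("no BOM and not valid UTF-8; encoding is ambiguous") from exc
    return raw


print(canonicalise("data_example\n".encode("utf-16")))
```

Legacy encodings without a signature (Shift-JIS, for example) fall through to `NeedsHumanReview` in this sketch, which corresponds to the non-deterministic case where an interactive tool would be needed.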

This isn't a radical new suggestion; it seems to me to encapsulate
many of the points of common ground around which we're still
negotiating our points of principle or philosophy, but I would hope it
can help us to move forward.

Best wishes

On Wed, Jun 23, 2010 at 11:04:45AM +1000, James Hester wrote:
> Thanks John for putting in the effort to come up with a decent
> compromise proposal.  I would add something along the lines of
> 'Compliant CIF2 processors should at a minimum be able to deal with
> files in CIF interchange format'.  And somewhere I would really like
> to warn people of the dangers of using anything else for storage.  But
> I think I could live with what you've come up with, as it looks like
> I'm unlikely to get support for anything more restrictive.
>
> On Wed, Jun 23, 2010 at 1:15 AM, Bollinger, John C
> <John.Bollinger@stjude.org> wrote:
>> I prefer leaving the issue of character encoding entirely out of
>> the scope of the CIF format specification (effectively allowing
>> any encoding).  On the other hand, I think it's a bit of an
>> aggrandizement to characterize UTF-16 / Shift-JIS / etc. as "ways
>> in which many of our colleagues get their science done."  In no way
>> do I dispute that many of our colleagues indeed use these encodings
>> routinely, but I am doubtful that editing Unicode text with a text
>> editor constitutes a significant part of many of their research
>> programs.  At least, few of my English-speaking colleagues edit
>> flat Unicode text files with any frequency, if ever they do at all.
>> I think there is already good software, some of it free (both
>> senses), for operating systems at least as old as Windows 9x, that
>> supports editing UTF-8 encoded text.  Most of it also supports a
>> multitude of other encodings.  We would leave no one out by
>> requiring UTF-8, and I do not see that respect for our colleagues
>> demands that CIF2 be equally convenient to create and edit with
>> every text editor in current use.  If that is doubtful, however,
>> and respect is our goal, then wouldn't the most respectful thing
>> be to *ask* a few of the people about whom we are concerned?
>>
>> My issue here is different, and at least partly philosophical. The
>> CIF format can and should be about the structure and meaning of
>> CIF text content.  Character encoding is on a different level:
>> it's a characteristic of storage and interchange.  Comingling
>> these layers is inelegant and unnecessary.
>>
>> Moreover, a CIF2 requirement to encode in UTF-8 will be small
>> comfort when presented with a file that is not, in fact, encoded
>> that way.  What can you then do?  Either reject the file or
>> autodetect the encoding.  If CIF2 does not specify a particular
>> encoding, and you receive the same file, then what can you do? 
>> Exactly the same things, but then it's more likely that the file's
>> provider will have also specified the encoding by some
>> means.  (Particularly so if the CIF2 spec calls attention to the
>> need to do so.)
>>
>> Perhaps something like this would be an acceptable compromise:
>> a) Rewrite change 2 to remove the requirement for UTF-8
>> b) Add:
>> ====
>> CHANGE 9 - NEW (CIF Interchange Format)
>> Many alternative encodings are available for recording and
>> exchanging Unicode character data via byte-oriented media. The
>> CIF format itself is encoding independent, but that allows for
>> uncertainty as to how to handle putative CIF data unaccompanied by
>> encoding information.  We therefore define a simple, binary CIF
>> Interchange Format, consisting of CIF2 text encoded in UTF-8, with an
>> optional initial UTF-8 byte-order mark.  CIF Interchange Format is
>> intended as a storage and interchange standard for CIF2.  Its use is
>> strongly encouraged, but its existence should not be taken as a
>> prohibition against use of alternative storage and interchange
>> formats among agreeing parties.
>> The standard file name extension for CIF Interchange Format files is .cif.
>> ====
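
For illustration only, the proposed CIF Interchange Format (CIF2 text in UTF-8 with an optional initial byte-order mark) could be written and read along the following lines. The function names are hypothetical, the reader takes the "reject" option rather than attempting autodetection, and nothing here checks CIF2 syntax itself.

```python
def to_interchange(text: str, with_bom: bool = False) -> bytes:
    """Encode text as UTF-8, optionally preceded by the UTF-8 BOM."""
    # Python's "utf-8-sig" codec writes the BOM on encoding.
    return text.encode("utf-8-sig" if with_bom else "utf-8")


def from_interchange(raw: bytes) -> str:
    """Decode UTF-8, accepting an optional leading BOM; reject anything else."""
    try:
        # "utf-8-sig" strips a leading BOM if present and is strict by default.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError as exc:
        raise ValueError("not CIF Interchange Format: invalid UTF-8") from exc


text = "data_example\n_note 'caf\u00e9'\n"
assert from_interchange(to_interchange(text, with_bom=True)) == text
assert from_interchange(to_interchange(text)) == text
```

A file in some other encoding makes `from_interchange` raise `ValueError`; whether to reject at that point or fall back to autodetection is the policy choice discussed in the quoted message.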

ddlm-group mailing list

International Union of Crystallography
