Re: [Cif2-encoding] Let's all take a deep breath...

Dear Colleagues,

    As one might expect, I respectfully disagree with almost everything
James has said, but the really critical point of disagreement is

>(iii) It is extremely misleading to think that simply substituting
>UTF8 in CIF2 for ASCII in CIF1 will lead to even approximately the
>same results as we had for CIF1.  The 'any encoding' clause in the
>CIF1 standard was essentially irrelevant - encodings used in the
>overwhelming majority of systems producing CIF1 files coincided with
>ASCII for CIF text, as I have said many times before, so software had
>no trouble in turning a stream of CIF bytes from any unknown source
>into the same text that the CIF writer was working from.  If I repeat
>this point endlessly, it is only because the CIF1 approach continues
>to be invoked like magic fairy dust that will make everything OK, when
>in fact the magic fairy dust was the dominance of ASCII encoding for
>ASCII codepoints.  There is *no such uniformity* in encoding of
>Unicode codepoints.  We have a new problem for CIF, and whatever we do
>will have *new* consequences, and that very much includes the 'as for
>CIF1' proposal.  So please, enough with the 'CIF1 has served us well
>for 15 years' line.

I vigorously disagree on this point.  If the only change we were to
make in going to CIF2 were to be that we were inserting UTF8 in place
of ASCII there would be absolutely _no_ impact on any existing
CIF application or CIF data file, because for the characters that
are formally legal under CIF1, UTF8 and ASCII are identical encodings.
The relevant portion of the CIF1 syntax specification is:

"22. Characters within a CIF are restricted to certain printable or
white-space characters. Specifically, these are the ones located in
the ASCII character set at decimal positions 09 (HT or horizontal
tab), 10 (LF or line feed), 13 (CR or carriage return) and the
letters, numerals and punctuation marks at positions 32-126."

Any existing data file or application that conforms to that restriction
in _any_ encoding will be unaffected by the "UTF8 in place of ASCII"
change. For those applications and data files, this is not a change.
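
A minimal Python sketch of the claim above, assuming only the character
set quoted from rule 22 (the names CIF1_CODEPOINTS and
same_bytes_in_ascii_and_utf8 are hypothetical, chosen for illustration).
It checks that text restricted to those code points produces the same
bytes whether it is encoded as ASCII or as UTF8:

```python
# Code points permitted by the CIF1 rule quoted above:
# HT (9), LF (10), CR (13) and the printable range 32-126.
CIF1_CODEPOINTS = [9, 10, 13] + list(range(32, 127))


def same_bytes_in_ascii_and_utf8(text: str) -> bool:
    """Return True if text encodes to identical bytes as ASCII and as UTF-8."""
    try:
        return text.encode("ascii") == text.encode("utf-8")
    except UnicodeEncodeError:
        # The text contains a character outside the ASCII repertoire.
        return False


# Build one string containing every CIF1-legal character and compare.
cif1_text = "".join(chr(cp) for cp in CIF1_CODEPOINTS)
identical = same_bytes_in_ascii_and_utf8(cif1_text)
print(identical)
```

The check says nothing about characters outside that set, which is where
the open encoding question lies.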

As James implies, the only new problems that arise are in introducing
characters into CIFs drawn from codepoints 128 ff, but we already have
that problem under CIF1.  The use of "UTF8 in place of ASCII" simply
allows us to coherently consider how to handle those characters in the
future.  If we don't use them in any IUCr-sanctioned dictionary tags
for the moment, we are in no worse shape going forward under my proposal
with Brian's recommendations added than we are staying with CIF1 and
ASCII, and, I believe, in much better shape.
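
As a hedged illustration of why the characters from codepoints 128 ff
are the open problem, the Python sketch below encodes a single such
character under two encodings and then reads the UTF8 bytes back under
the other one. The character U+00C5 and the choice of Latin-1 as the
second encoding are assumptions made purely for the example:

```python
# One character above code point 127, chosen only for illustration.
text = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE

# The same character yields different byte sequences in different encodings.
utf8_bytes = text.encode("utf-8")
latin1_bytes = text.encode("latin-1")

# A reader that assumes the wrong encoding recovers different text
# from the bytes than the writer intended.
misread = utf8_bytes.decode("latin-1")

print(utf8_bytes, latin1_bytes, misread == text)
```

Unlike the 32-126 range, nothing in the bytes alone tells the reader
which of these encodings the writer used.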

This is a serious matter, not appropriate for sarcastic "fairy dust"
comments.  It really is true that "CIF1 has served us well
for 15 years," and we should take our time on the encoding issue
and be certain we are really improving things, not making them
worse by what we propose.  I agree that we need to discuss and
resolve the encoding issue, but it is not a new problem suddenly
introduced by using UTF8.  In my opinion, however, a hasty, ill-considered
resolution to that serious problem would be a very bad idea, but
delaying all of CIF2 in order to wait until we work our way out of
a thicket that has no clear exit yet also seems to me to be a very bad
idea.  I expect if we ever manage to meet face to face, or even in
a series of Skype meetings, we could come to closure fairly quickly,
but as things now stand it seems unlikely that we will have a chance
to do that before the IUCr meeting.

I use CIF as a text format and I use it as a binary format.  I use
both DDL1 CIF and DDL2 CIF. I am also a cross-platform and cross
version CIF programmer.  I do not fool myself into thinking CIF1 to be
perfect.  It is not.  But it is a very useful tool, and I would like
to be certain that what we propose as CIF2 is at least as useful as
CIF1 and hopefully more so.  I do not believe that options 3, 4 or
5 are far enough along to provide such utility, and, by being too
prescriptive at this stage, they may well do harm.

I urge all concerned to support either options 1 or 2 or both, so we
can get CIF2 out for the IUCr meeting this coming summer, and
to let the encoding issue take its own time.  If by some chance
we come up with a solution before summer 2011, so much the better,
but please don't make the perfect (CIF2 with all issues including
the encoding issue resolved) the enemy of the good (the CIF2 we have
now with the encoding issue left open).

Regards,
   Herbert


At 9:46 AM +1000 9/27/10, James Hester wrote:
>Well, I didn't even manage to properly call a vote and everybody has
>piled in, Simon even managed to vote twice (and that's quite OK Simon,
>we are trying to determine what the will of the group is and so I
>think it only reasonable that if somebody's assessment of the
>situation changes that they can 'update' their vote).  I am however
>unhappy that both Brian and Simon introduced new concerns and nobody
>has had a chance to comment on how the various proposals under
>consideration might affect those concerns.  I would therefore like to
>suggest that the voting period continues until the end of this week,
>and that we all endeavour to express any concerns or comments that we
>need to make in a timely fashion.  I will be commenting on Brian and
>Simon's concerns presently, and also on Herbert's proposal, which I
>have not subjected to my hopefully not too long-winded scrutiny.
>
>None of us should feel steamrolled by a certain artificial urgency that
>has appeared in the dialogue - while we do need to wrap things up in a
>timely fashion, it has only been 4 days since I even started
>discussing the vote.
>
>Some initial general comments (I will comment separately on Brian and
>Simon's issues).
>
>(i) We are *not* in an infinite loop.  The last few months have seen
>several proposals analysed and explored, and it is my perception that
>these discussions have led at least some participants (including
>myself) to a better understanding of the consequences of what they are
>proposing.  So nobody should feel that throwing out a new criticism of
>an old or new proposal is somehow hindering progress by looping over
>old ground.  Quite the reverse, it is making progress.  What *is*
>important is to get your comments into the mix in a timely fashion,
>because time is indeed short.
>
>(ii) It is not correct to assume that we can figure out the encoding
>issues later.  Maybe we can, but maybe we can't. Once CIF2 files are
>produced and software is distributed, you can't put the genie back in
>the bottle, by which I mean you can't easily change the way that
>distributed software behaves, and how files are interpreted.  We have
>to therefore be confident that the standard we promulgate does not
>close off an avenue we need for solving encoding issues.
>
>(iii) It is extremely misleading to think that simply substituting
>UTF8 in CIF2 for ASCII in CIF1 will lead to even approximately the
>same results as we had for CIF1.  The 'any encoding' clause in the
>CIF1 standard was essentially irrelevant - encodings used in the
>overwhelming majority of systems producing CIF1 files coincided with
>ASCII for CIF text, as I have said many times before, so software had
>no trouble in turning a stream of CIF bytes from any unknown source
>into the same text that the CIF writer was working from.  If I repeat
>this point endlessly, it is only because the CIF1 approach continues
>to be invoked like magic fairy dust that will make everything OK, when
>in fact the magic fairy dust was the dominance of ASCII encoding for
>ASCII codepoints.  There is *no such uniformity* in encoding of
>Unicode codepoints.  We have a new problem for CIF, and whatever we do
>will have *new* consequences, and that very much includes the 'as for
>CIF1' proposal.  So please, enough with the 'CIF1 has served us well
>for 15 years' line.
>
>(iv) The majority are currently in favour of the 'as for CIF1'
>approach, which if nobody changes their vote by the end of the week,
>is what we will be taking to the DDLm group and COMCIFS.  This means
>we will have a pure text standard, and I mean really pure, because
>there is no predictable link between this beautiful textual castle in
>the sky and the solid ground of bytes on disk.
>
>I am a cross-platform CIF programmer. Looking forward to the halcyon
>'as for CIF1' days that await us, a small question occupies my mind.
>As my program does not operate in that glorious abstract space
>occupied by pure text standards that are most certainly not anybody's
>laughing stock, my program will be forced to (as briefly as possible)
>deal with humble plebeian bytes according to some encoding to obtain
>the exalted CIF text.   Under the 'as for CIF1' proposal, how does my
>program turn these bytes into text in the way that the writer of the
>bytes intended?  If that is not yet resolved, how can anybody even
>write a CIF2 program?
>
>--
>T +61 (02) 9717 9907
>F +61 (02) 9717 3145
>M +61 (04) 0249 4148
>_______________________________________________
>cif2-encoding mailing list
>cif2-encoding@iucr.org
>http://scripts.iucr.org/mailman/listinfo/cif2-encoding


-- 
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================