
Re: [Cif2-encoding] Let's all take a deep breath...

To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@xxxxxxxx>
Subject: Re: [Cif2-encoding] Let's all take a deep breath...
From: "Herbert J. Bernstein" <yaya@xxxxxxxxxxxxxxxxxxxxxxx>
Date: Mon, 27 Sep 2010 10:12:58 -0400 (EDT)
In-Reply-To: <[email protected]>
References: <[email protected]><[email protected]><[email protected]>

Dear James,

   Unless you run filter code in your CIF applications that recognizes
and reports characters beyond 126 as errors, people can slip in
UTF8 or various code page representations of accented characters
right now under CIF1, and I seem to recall an earlier message
in this discussion from Simon or Brian reporting exactly that
problem already hitting the journal workflows.
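
   A minimal sketch of such a filter, assuming Python and the permitted
character set given in CIF1 syntax rule 22 as quoted further down in this
thread (HT, LF, CR and positions 32-126).  The names CIF1_ALLOWED and
find_illegal_bytes are hypothetical, chosen for illustration only, and are
not taken from any existing CIF library:

```python
# Bytes permitted by CIF1 syntax rule 22: HT, LF, CR and printable ASCII 32-126.
CIF1_ALLOWED = frozenset({9, 10, 13}) | frozenset(range(32, 127))


def find_illegal_bytes(data: bytes):
    """Return (line, column, byte value) for every byte outside the CIF1 set."""
    problems = []
    line, col = 1, 0
    for b in data:
        col += 1
        if b not in CIF1_ALLOWED:
            problems.append((line, col, b))
        if b == 10:  # LF starts a new line
            line += 1
            col = 0
    return problems


# Hypothetical input: an accented character slipped in as UTF8 bytes.
sample = b"data_x\n_a 'caf\xc3\xa9'\n"
report = find_illegal_bytes(sample)
```

A filter of this kind only reports that something outside the CIF1 range is
present; it cannot say which encoding the author had in mind.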

   How does the UTF8 in place of ASCII proposal tell users to
start using those characters in their journal submissions?  Unless
we put UTF8 characters in tags in the new dictionaries, the
only issue is for data values that are intended to be free-form
text, an area in which the control is _not_ at the CIF level
but in the advice to authors and the type-setting programs.
How does anything for a journal submission change because of the use of 
UTF8 in place of ASCII if the advice to authors and the type-setting
programs remain based on Brian's current elides, except to get better,
in that there is a slightly better chance of figuring out what an
uncooperative author who doesn't read instructions meant by the
strange characters he chose to introduce?

   James, we can go back and forth by email this way for years, each
thinking the other is not understanding the obvious.  It is an
unfortunate effect of email discussions.  We _need_ to talk face-to-face
to resolve this.  If we cannot get together for a meeting, how about
using Skype?  Right now this really is an infinite loop.

   To keep this from growing to infinite size, I'll just respond to
one last point -- am I proposing that we not use the UTF8 characters
in tags?  I am _not_ proposing that as part of the CIF2 specification.
In order to get from where we are now to a system with reasonable
handling of characters above code point 126, we need to have room
in the specification to try approaches out.  What I am proposing
is that we be careful in the current crop of dictionaries to not
introduce such tags yet for IUCr official dictionaries.  I am not
proposing to restrict values in the specification for the same
reason, but I am suggesting that the IUCr and the PDB be careful
not to encourage deposition of CIFs with characters with code
points above 126 until there is a clear understanding of how
to handle them and software to support them.

   I am trying to separate problems, to make the transition to CIF2
modular, so that it has a good chance of success.

   Please consider a Skype meeting.  It might not work, but I don't
think it can make things any worse than they are right now.

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  [email protected]
=====================================================

On Mon, 27 Sep 2010, James Hester wrote:

> See interpolated comments.
> 
> On Mon, Sep 27, 2010 at 11:36 AM, Herbert J. Bernstein
> <[email protected]> wrote:
>       Dear Colleagues,
>
>          As one might expect, I respectfully disagree with almost everything
>       James has said, but the really critical point of disagreement is
>
>       >(iii) It is extremely misleading to think that simply substituting
>       >UTF8 in CIF2 for ASCII in CIF1 will lead to even approximately the
>       >same results as we had for CIF1.  The 'any encoding' clause in the
>       >CIF1 standard was essentially irrelevant - encodings used in the
>       >overwhelming majority of systems producing CIF1 files coincided with
>       >ASCII for CIF text, as I have said many times before, so software had
>       >no trouble in turning a stream of CIF bytes from any unknown source
>       >into the same text that the CIF writer was working from.  If I repeat
>       >this point endlessly, it is only because the CIF1 approach continues
>       >to be invoked like magic fairy dust that will make everything OK, when
>       >in fact the magic fairy dust was the dominance of ASCII encoding for
>       >ASCII codepoints.  There is *no such uniformity* in encoding of
>       >Unicode codepoints.  We have a new problem for CIF, and whatever we do
>       >will have *new* consequences, and that very much includes the 'as for
>       >CIF1' proposal.  So please, enough with the 'CIF1 has served us well
>       >for 15 years' line.
> 
> I vigorously disagree on this point.  If the only change we were to
> make in going to CIF2 were to be that we were inserting UTF8 in place
> of ASCII there would be absolutely _no_ impact on any existing
> CIF application or CIF data file, because for the characters that
> are formally legal under CIF1, UTF8 and ASCII are identical encodings.
> The relevant portion of the CIF1 syntax specification is:
> 
> "22. Characters within a CIF are restricted to certain printable or
> white-space characters. Specifically, these are the ones located in
> the ASCII character set at decimal positions 09 (HT or horizontal
> tab), 10 (LF or line feed), 13 (CR or carriage return) and the
> letters, numerals and punctuation marks at positions 32-126."
> 
> Any existing data file or application that conforms to that restriction
> in _any_ encoding will be indistinguishable with the "UTF8 in place
> of ASCII" change. For those applications and data files, this is not a change.
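
The claim about the CIF1 character set can be checked with a short sketch,
assuming Python. The four encodings listed are examples picked for
illustration, not an exhaustive survey; encodings that are not
ASCII-compatible (UTF16, for instance) would produce different bytes:

```python
# Characters permitted by CIF1 syntax rule 22: HT, LF, CR and positions 32-126.
cif1_chars = "\t\n\r" + "".join(chr(c) for c in range(32, 127))

# Example ASCII-compatible encodings (an illustrative selection only).
encodings = ["ascii", "utf-8", "latin-1", "cp1252"]

# Encode the same text under each encoding and compare the byte strings.
encoded = {name: cif1_chars.encode(name) for name in encodings}
all_identical = len(set(encoded.values())) == 1
```

For this set of encodings the byte strings come out the same, which is the
sense in which the change is invisible to conforming CIF1 files.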
> 
> 
> Up to here, I absolutely agree with you.
>
>       As James implies, the only new problems that arise are in introducing
>       characters into CIFs drawn from codepoints 128 ff, but we already have
>       that problem under CIF1.
> 
> 
> No, we don't have that problem at all in CIF1, because we don't accept
> characters from above codepoint 128 in CIF1.  While it is indeed the only
> new problem, it is a huge one.
>
>       The use of "UTF8 in place of ASCII" simply
>       allows us to coherently consider how to handle those characters
>       in the future.
> 
> 
> How does it create this breathing space?  Because as far as I can see, 'as
> for CIF1' is allowing any encoding to be used for Unicode codepoints. This
> is where I was forced to invoke magic fairy dust, because the 'UTF8 in place
> of ASCII' approach advocates a formulation that was largely irrelevant in
> CIF1 to solve a problem that was not present in CIF1 in the first place,
> with no justification beyond "it worked for CIF1", when in fact it worked
> about as well as my most excellent elephant repellent does (look, no
> elephants, it must work!).  Sorry for being obtuse, but I don't follow your
> logic at all.
>
>       If we don't use them in any IUCr-sanctioned dictionary tags
>       for the moment, we are in no worse shape going forward under my
>       proposal with Brian's recommendations added than we are staying
>       with CIF1 and ASCII, and, I believe, in much better shape.
> 
> 
> Is this also part of your proposal - that the IUCr hold off on using
> non-ASCII characters in tags (and, I assume, values) until we sort this
> out?
> 
>
>       This is a serious matter, not appropriate for sarcastic "fairy
>       dust" comments.
> 
> 
> You will note that I am taking it extremely seriously (my wife despairs), to
> the point that I am forced to write about castles in the sky.  Which is why
> I really need to understand the thinking behind your statements and those of
> others.
>
>       It really is true that "CIF1 has served us well
>       for 15 years,"
> 
> 
> Yes it has.  This however does not justify adopting the same approach to
> encoding.  See above for why.
>
>       and we should take our time on the encoding issue
>       and be certain we are really improving things, not making them
>       worse by what we propose.
> 
> 
> By adopting Unicode we immediately create big new problems, to which we need
> new solutions to even get close to the current situation in CIF1.
>
>       I agree that we need to discuss and
>       resolve the encoding issue, but it is not a new problem suddenly
>       introduced by using UTF8.
> 
> 
> (I assume you mean Unicode, not UTF8).  No, the opposite is true. This is a
> new problem arising entirely from adoption of Unicode, because in the CIF1
> ASCII range most encodings are identical.  This is not so for Unicode.
>
>       In my opinion, however, a hasty, ill-considered
>       resolution to that serious problem would be a very bad idea, but
>       delaying all of CIF2 in order to wait until we work our way out of
>       a thicket that has no clear exit yet also seems to me to be a
>       very bad idea.
> 
> 
> You are right, although hasty is a bit of an exaggeration for the pointy end
> of a several-month-long discussion.  Unfortunately, your proposal does not
> in my opinion create breathing room but simply ducks the new problem of
> multiple encodings.  If you want breathing room, you allow only
> self-identifying encodings now (UTF8 and UTF16) and then work on how you
> would allow other encodings.  Something like: "Only self-identifying
> encodings are currently supported for use in CIF2, but other encodings may
> become available at a later date".  Then you can spend all the time in the
> world working on systems for incorporating other encodings.
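
One possible reading of "self-identifying" can be sketched as follows,
assuming Python. The interpretation is an assumption made for illustration:
UTF16 is recognised by its byte order mark, and anything else must pass
strict UTF8 decoding or be rejected. The function name
decode_self_identifying is hypothetical, and UTF16 without a byte order mark
is deliberately not handled:

```python
import codecs


def decode_self_identifying(data: bytes) -> str:
    """Decode bytes as UTF16 (if a BOM is present) or strict UTF8.

    Raises UnicodeDecodeError for input that is neither, rather than guessing.
    """
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The "utf-16" codec reads the BOM to pick the byte order and drops it.
        return data.decode("utf-16")
    # "utf-8-sig" accepts UTF8 with or without a leading BOM; errors are strict.
    return data.decode("utf-8-sig")
```

Under this reading, bytes written in a code page such as Latin-1 that contain
accented characters will usually fail the strict UTF8 check instead of being
silently misread, although a byte sequence that happens to be valid UTF8
would still pass.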
>
>       I expect if we ever manage to meet face to face, or even in
>       a series of Skype meetings, we could come to closure fairly quickly,
>       but as things now stand it seems unlikely that we will have a chance
>       to do that before the IUCr meeting.
> 
> 
> I have bought a web cam and when I spend less time researching encoding
> issues I should be able to get set up and running.
>
>       I use CIF as a text format and I use it as a binary format.  I use
>       both DDL1 CIF and DDL2 CIF. I am also a cross-platform and cross
>       version CIF programmer.  I do not fool myself into thinking CIF1 to be
>       perfect.  It is not.  But it is a very useful tool, and I would like
>       to be certain that what we propose as CIF2 is at least as useful as
>       CIF1 and hopefully more so.  I do not believe that options 3, 4 or
>       5 are far enough along to provide such utility, and by being too
>       prescriptive at this stage, may well do harm.
> 
> 
> Why can't we be open-ended, and say that "UTF8 and UTF16 are acceptable in
> the whole Unicode range. All other encodings are acceptable in the ASCII
> range only.  We are investigating ways of extending the range of
> applicability of non UTF encodings".  Given that UTF8 and UTF16 are
> definitely encodings that CIF2 will use, and require no extra CIF2 machinery
> (hashcodes etc.) to identify them, they are ready for use now.  I am more
> than happy for us to continue to investigate ways of allowing a wider or
> infinite variety of encodings.
>
>       I urge all concerned to support either options 1 or 2 or both, so we
>       can get CIF2 out for the IUCr meeting this coming summer, and
>       to let the encoding issue take its own time.  If by some chance
>       we come up with a solution before summer 2011, so much the better,
>       but please don't make the perfect (CIF2 with all issues including
>       the encoding issue resolved) the enemy of the good (the CIF2 we have
>       now with the encoding issue left open).
> 
> 
> Both options 3 and 4 can also be used as the basis of a proposal that allows
> further work on the encoding issue (see previous paragraph).  Indeed,
> proposals based on 3 and 4 allow Unicode code points to be used in a
> controlled way and do not encourage proliferation of encodings before we are
> able to manage that proliferation.
>
> I invite you to answer the question at the end of my previous email
> (reproduced below).  Note that under proposals 3, 4 and 5 it has a simple
> answer.
> 
> >Under the 'as for CIF1' proposal, how does my
> >program turn these bytes into text in the way that the writer of the
> >bytes intended?  If that is not yet resolved, how can anybody even
> >write a CIF2 program?
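
The difficulty behind this question can be seen in a two-line experiment,
assuming Python. The example character and the two decodings are chosen for
illustration only; the same byte sequence yields different text depending on
which encoding the reader assumes:

```python
# The writer stores U+00E9 (e with acute accent) as UTF8: two bytes, C3 A9.
raw = "\u00e9".encode("utf-8")

# A reader assuming UTF8 recovers the intended character.
as_utf8 = raw.decode("utf-8")

# A reader assuming Latin-1 gets two characters, U+00C3 and U+00A9, instead.
as_latin1 = raw.decode("latin-1")
```

Both decodings succeed without error, so the bytes alone do not tell the
reading program which result the writer intended.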
> 
> all the best,
> James.
>
>       Regards,
>         Herbert
> 
> 
> At 9:46 AM +1000 9/27/10, James Hester wrote:
> >Well, I didn't even manage to properly call a vote and everybody has
> >piled in, Simon even managed to vote twice (and that's quite OK Simon,
> >we are trying to determine what the will of the group is and so I
> >think it only reasonable that if somebody's assessment of the
> >situation changes that they can 'update' their vote).  I am however
> >unhappy that both Brian and Simon introduced new concerns and nobody
> >has had a chance to comment on how the various proposals under
> >consideration might affect those concerns.  I would therefore like to
> >suggest that the voting period continues until the end of this week,
> >and that we all endeavour to express any concerns or comments that we
> >need to make in a timely fashion.  I will be commenting on Brian and
> >Simon's concerns presently, and also on Herbert's proposal, which I
> >have not subjected to my hopefully not too long-winded scrutiny.
> >
> >None of us should feel steamrolled by a certain artificial urgency that
> >has appeared in the dialogue - while we do need to wrap things up in a
> >timely fashion, it has only been 4 days since I even started
> >discussing the vote.
> >
> >Some initial general comments (I will comment separately on Brian and
> >Simon's issues).
> >
> >(i) We are *not* in an infinite loop.  The last few months have seen
> >several proposals analysed and explored, and it is my perception that
> >these discussions have led at least some participants (including
> >myself) to a better understanding of the consequences of what they are
> >proposing.  So nobody should feel that throwing out a new criticism of
> >an old or new proposal is somehow hindering progress by looping over
> >old ground.  Quite the reverse, it is making progress.  What *is*
> >important is to get your comments into the mix in a timely fashion,
> >because time is indeed short.
> >
> >(ii) It is not correct to assume that we can figure out the encoding
> >issues later.  Maybe we can, but maybe we can't. Once CIF2 files are
> >produced and software is distributed, you can't put the genie back in
> >the bottle, by which I mean you can't easily change the way that
> >distributed software behaves, and how files are interpreted.  We have
> >to therefore be confident that the standard we promulgate does not
> >close off an avenue we need for solving encoding issues.
> >
> >(iii) It is extremely misleading to think that simply substituting
> >UTF8 in CIF2 for ASCII in CIF1 will lead to even approximately the
> >same results as we had for CIF1.  The 'any encoding' clause in the
> >CIF1 standard was essentially irrelevant - encodings used in the
> >overwhelming majority of systems producing CIF1 files coincided with
> >ASCII for CIF text, as I have said many times before, so software had
> >no trouble in turning a stream of CIF bytes from any unknown source
> >into the same text that the CIF writer was working from.  If I repeat
> >this point endlessly, it is only because the CIF1 approach continues
> >to be invoked like magic fairy dust that will make everything OK, when
> >in fact the magic fairy dust was the dominance of ASCII encoding for
> >ASCII codepoints.  There is *no such uniformity* in encoding of
> >Unicode codepoints.  We have a new problem for CIF, and whatever we do
> >will have *new* consequences, and that very much includes the 'as for
> >CIF1' proposal.  So please, enough with the 'CIF1 has served us well
> >for 15 years' line.
> >
> >(iv) The majority are currently in favour of the 'as for CIF1'
> >approach, which if nobody changes their vote by the end of the week,
> >is what we will be taking to the DDLm group and COMCIFS.  This means
> >we will have a pure text standard, and I mean really pure, because
> >there is no predictable link between this beautiful textual castle in
> >the sky and the solid ground of bytes on disk.
> >
> >I am a cross-platform CIF programmer. Looking forward to the halcyon
> >'as for CIF1' days that await us, a small question occupies my mind.
> >As my program does not operate in that glorious abstract space
> >occupied by pure text standards that are most certainly not anybody's
> >laughing stock, my program will be forced to (as briefly as possible)
> >deal with humble plebeian bytes according to some encoding to obtain
> >the exalted CIF text.  Under the 'as for CIF1' proposal, how does my
> >program turn these bytes into text in the way that the writer of the
> >bytes intended?  If that is not yet resolved, how can anybody even
> >write a CIF2 program?
> >
> >--
> >T +61 (02) 9717 9907
> >F +61 (02) 9717 3145
> >M +61 (04) 0249 4148
> >_______________________________________________
> >cif2-encoding mailing list
> >[email protected]
> >http://scripts.iucr.org/mailman/listinfo/cif2-encoding
> 
> 
