Discussion List Archives


Re: [Cif2-encoding] Revised Motion

Dear Herbert,

On Thursday, September 30, 2010 1:05 PM, Herbert J. Bernstein wrote:

>It appears you are proposing to add the words
>"Reference to text files means binary representations of sequences of
>characters, either in a system-dependent form, provided that the
>characters are all drawn from the ASCII set, or alternatively as the
>sequence of bytes resulting from encoding the character sequence according
>to UTF-8."

Yes.  I am open to variations on the wording, but I'm looking for something along those lines to be added to the spec.  Am I wrong that yesterday we were close to doing just that via James's proposal?
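
To make that reading concrete, here is a minimal Python sketch of how a program might test a byte sequence against the proposed wording. It is only one interpretation: the function name classify_under_proposed_wording is hypothetical, and the sketch assumes that the "system-dependent form" on the machine at hand is ASCII-compatible (it would not be on an EBCDIC system, for example).

```python
def classify_under_proposed_wording(data: bytes) -> str:
    """Classify bytes under one reading of the proposed definition.

    Assumption: the local "system-dependent form" stores ASCII
    characters as single bytes below 0x80.
    """
    if all(b < 0x80 for b in data):
        # Only ASCII characters: acceptable in the local form.
        return "ascii"
    try:
        # Python's decoder is strict by default, so malformed
        # UTF-8 raises UnicodeDecodeError.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "neither"


print(classify_under_proposed_wording(b"data_example"))
print(classify_under_proposed_wording("\u00e9".encode("utf-8")))
print(classify_under_proposed_wording("\u00e9".encode("latin-1")))
```

Under this reading, the Latin-1 bytes in the last call fall outside the definition, which is exactly the kind of case the spec would need to settle one way or the other.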

>Is, unfortunately, inaccurate and confusing and gets us back into the
>looping discussion of binary versus text.  It opens up exactly the
>issues we just tried to get away from of making it appear that
>CIF2 is going to invalidate encodings that happen to be neither
>ASCII nor UTF8.  I realize that is not what you intend, but that
>is what your paragraph seems to imply.

I can accept that the wording may be confusing, and I would welcome constructive criticism on that topic.  You cannot sustain a claim that my text is inaccurate, however, without providing at least a partial alternative definition that conflicts.  In other words, what's inaccurate about it?  This is a highly relevant question, because I find my text to be entirely reasonable, and I might well program according to that interpretation without some guidance otherwise.  If I don't use that, then what *do* I use?

As for binary vs. text, I have realized that's a false dichotomy in our context.  Every computer file is binary, in the sense that it is a sequence of bytes.  Some are a _particular type_ of binary that we call "text" (but can't seem to define).  The two are not mutually exclusive.  This is quite different from the traditional binary vs. text issue, which relates to questions such as whether to represent the number 12345 in IEEE 32-bit floating-point format or as five decimal digits.
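
The traditional distinction can be illustrated with a short Python sketch. It uses only the standard struct module; the choice of big-endian byte order for the floating-point form is an assumption made for the illustration.

```python
import struct

value = 12345

# "Binary" representation: IEEE 754 32-bit float, big-endian (assumed order).
as_float32 = struct.pack(">f", float(value))

# "Text" representation: five decimal digits, one ASCII byte each.
as_digits = str(value).encode("ascii")

print(as_float32.hex())  # 4640e400
print(as_digits.hex())   # 3132333435

# Both are byte sequences; only the interpretation differs.
assert struct.unpack(">f", as_float32)[0] == 12345.0
assert int(as_digits.decode("ascii")) == 12345
```

Either way the file is a sequence of bytes, which is why the binary/text split says nothing about which byte sequences a CIF2 reader must accept.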

>This is not an easy concept to define.  I just went through a large
>number of text file definitions on the web, and it is amazing how
>flawed they are in one way or another.

That is precisely why I am so persistent about putting a definition in the spec.  If I choose the definition I think best, and you the one you think best, and James and Simon likewise, then will any of our programs be fully compatible with each other?  Simon likes identifiable encodings, so maybe he'll feel free to write UTF-32LE CIFs.  Will your programs accept those?  Should they?  To be prepared to process all conformant CIFs, does my program need to be able to handle KOI8-R and Shift-JIS CIFs?  If I use MS Word to create CIFs, and I save them in Rich *Text* Format, then should I be upset when James's software rejects them?
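
The compatibility question can be made concrete with a small Python sketch. It encodes one non-ASCII character in several of the encodings named above and then reads each result the way a UTF-8-only program would. The sample character and the helper name read_as_utf8 are hypothetical choices for the illustration, not anything taken from the CIF documents.

```python
from typing import Optional

# Hypothetical sample: CYRILLIC SMALL LETTER DE, representable in all
# four encodings below.
SAMPLE = "\u0434"

ENCODINGS = ["utf-8", "utf-32-le", "koi8-r", "shift_jis"]


def read_as_utf8(data: bytes) -> Optional[str]:
    """Decode as strict UTF-8; return None if the bytes are not valid."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# The same character yields a different byte sequence in each encoding.
encoded = {name: SAMPLE.encode(name) for name in ENCODINGS}

# What a UTF-8-only reader makes of each byte sequence.
results = {name: read_as_utf8(raw) for name, raw in encoded.items()}

for name in ENCODINGS:
    recovered = results[name] == SAMPLE
    print(f"{name:10} bytes={encoded[name].hex()} recovered={recovered}")
```

Only the UTF-8 bytes come back as the original character. The others are either rejected or, worse, decoded without error into different characters, so a reader and a writer that chose different definitions of "text" would not interoperate.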

I don't think it's correct to say that the concept is difficult to define.  I could write half a dozen definitions in as many minutes, each appropriate for some particular purpose.  It's more accurate to say that there are many alternative definitions in use, none of them completely compatible with the others.  There is no reason why we can't choose the one we find most suitable, or write one of our own.


>Coming to an acceptable formal resolution on the meaning of "text" would
>seem likely to take a very, very long time.

You already provided a definition that was good enough for me.  My proposed text summarizes and abridges it, perhaps too much, as "a system-dependent form".  I would be content to replace that phrase with your full text, or with the functionally equivalent text I labeled "local".

> We need to move on.

We need to answer the question.  Or COMCIFS does, if we're not up to it.  The spec is incomplete and inadequate without an answer.

>Please recall that what we are discussing is a revision to the existing,
>larger CIF 1.1 syntax definition to create the CIF2 syntax definition,
>and are just trying to get a clear enough definition of what users and
>software developers need to do to cope with the extension of the
>number of code points past 126.

And what definition are we then providing?  The only clear thing I see is that users and developers are *probably* safe if they write UTF-8.  If UTF-8 is the only safe option for CIFs with non-ASCII characters, then how does that differ from my proposal?

>I would suggest that we go forward with the motion as it stands now
>and that we all carefully read CIF 1.1 syntax definition to see if
>and where it might make sense to insert some clear, agreed definition of
>a text file at some future time, but I really don't think most users or
>software developers will have a serious problem in getting started with
>CIF2 leaving any ambiguity about the concept of a text file at the same
>level it has been under CIF1 with this motion added.

This area presents a much greater problem for CIF2 and its expanded character set than it did for CIF1.  I quite agree that most users and software developers would get started with CIF2 despite the ambiguity.  I cannot see how we would then avoid a slew of problems of the form "X software doesn't handle my CIF" and "Y software produces broken CIFs" and "Z software is incompatible with W software".  I do not see how that could be construed as a win for CIF2.

>Once we have a clear, agreed understanding of the more metaphysical
>aspects of what text is, we can then share that with the
>community.  Meanwhile, they hopefully will already be using CIF2.

This is not an arcane subject that we need to "understand"; it is a question that we have the opportunity to *answer* for the purposes of CIF.  We do not need a definition that everyone, everywhere will acclaim as the full and perfect meaning of "text".  We just need to be clear what the specification means by the term.  If we don't know what the specification means by the term, then we should be embarrassed to advance it.


John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer:  www.stjude.org/emaildisclaimer

cif2-encoding mailing list

International Union of Crystallography
