
Re: [Cif2-encoding] Fundamental source of disagreement

Dear Colleagues,

   James presents an interesting view, but one that, with respect to
text encodings, is without support in fact.

   HTML may have begun as a markup language, but it has become one
of the most important standards for interchange of documents we
have and is, in the current XHTML form, an important dialect of XML.

  We all seem to agree that XML is an interchange standard, and
that CIF1 is an interchange standard.

   Now let us look at what is and is not, in fact, optional in all
three, beginning with encodings and beginning with XML and its
pious hope to avoid having optional features:

   For our reference on XML let us confine our attention to the XML-5 
document at


and ignore any deviations from that standard that have arisen in
practice.  The point at issue is, of course, having optional alternatives
in character encodings, on which the document says:

"The mechanism for encoding character code points into bit patterns may 
vary from entity to entity. All XML processors MUST accept the UTF-8 and 
UTF-16 encodings of Unicode [Unicode]; the mechanisms for signaling which 
of the two is in use, or for bringing other encodings into play, are 
discussed later, in 4.3.3 Character Encoding in Entities."

and let us count a few of the optional features in that document:

   Section 3:  "optional white space"
   Section 3.2:  "At user option, an XML processor MAY issue a warning when 
a declaration mentions an element type for which no declaration is 
provided, but this is not an error."
   Section 3.2.1:  "An element type has element content when elements of 
that type MUST contain only child elements (no character data), optionally 
separated by white space"
   Section 4.6: "the double escaping here is OPTIONAL but harmless"

There are many more in the document.  Yes, the document expresses the
design goal (i.e. pious hope) that

"The number of optional features in XML is to be kept to the absolute 
minimum, ideally zero."

but the fact is that XML has many options, including the option of
being expressed in a wide range of character encodings.  Lest anyone
think this an aberration, please note the highly complex optional
behavior of XML with respect to DTDs and the various "non-normative"
practices at the end of the document.

For XML, the standard specifies many options.
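
To make the encoding option concrete, here is a minimal sketch. It is
not taken from the XML specification; it assumes Python's standard
xml.etree.ElementTree parser, and the element name and helper functions
are made up for illustration. The same document is serialized in two
encodings and read back to the same text:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element name "formula" is illustrative only.
BODY = "<formula>caf\u00e9</formula>"


def encode_document(encoding):
    """Serialize the sample with a matching encoding declaration."""
    text = '<?xml version="1.0" encoding="%s"?>%s' % (encoding, BODY)
    return text.encode(encoding)


def read_text(raw_bytes):
    """Parse raw bytes; the parser works out the encoding for itself."""
    return ET.fromstring(raw_bytes).text


utf8_doc = encode_document("UTF-8")
utf16_doc = encode_document("UTF-16")  # Python's codec prepends a byte order mark
```

The two byte streams differ, yet a conforming reader recovers the same
character data from both.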

For HTML we are all aware of how many optional features and encodings
there are, even more than with XML.  There is an effort to tighten
this up in XHTML with strict conformance.

As for CIF1, I recommend reading the specification, which was written
to allow writing of CIFs as text documents on a wide range of computers
with a wide range of character encodings (including even EBCDIC).  Just
to pick one obvious example of optional behavior -- look at the
stripping of optional blanks at the ends of lines.
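
The trailing-blank example can be sketched as a liberal reader paired
with a strict writer. This is only an illustration under stated
assumptions: the function names are hypothetical, "blank" is taken here
to mean a space or tab, and nothing below is quoted from the CIF1
specification or from any CIF library.

```python
def read_lines_liberal(text):
    """Accept lines with or without trailing blanks and normalize them."""
    return [line.rstrip(" \t") for line in text.splitlines()]


def write_lines_strict(lines):
    """Always emit lines with no trailing blanks and a single newline."""
    return "".join(line.rstrip(" \t") + "\n" for line in lines)


# Hypothetical input: the same two lines, once padded and once not.
padded = "data_example   \n_cell_length_a 5.4 \n"
clean = "data_example\n_cell_length_a 5.4\n"
```

Both inputs read to the same list of lines, and the writer produces the
unpadded form whichever one it was given.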

Once again, with all due respect to the adherents to the goal of not
having any options -- the interchange specifications that are working
and in common use have options (including for encodings) and the
well-established practice of liberal readers and strict writers.
Those are the facts, and, while it is fine for each of us to form
our own opinions, it is not viable for each of us to create our
own facts.


  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769


On Mon, 16 Aug 2010, James Hester wrote:

> I'm not sure that you have identified the fundamental source of
> disagreement, but if we disagree on our approaches to optional
> behaviour we will have trouble finalising the standard, so I have
> addressed Herb's comments below.
> On Tue, Aug 10, 2010 at 8:42 PM, Herbert J. Bernstein
> <yaya@bernstein-plus-sons.com> wrote:
>> With all due respect to James and others who adhere to the view that:
>> "There is no such thing as 'optional' for an information interchange standard."
>> I believe this is the fundamental source of our disagreement on
>> the direction for CIF2.
>> Optional features are common in almost all current successful standards
>> for information interchange, including HTML4, XML and CIF1.  As a
>> practical matter, one tries to have strict writers and liberal readers
>> for interchange standards to encourage migration to as common a
>> convention as possible.  Even so, if we are too strict in our rules
>> for what is and is not a proper CIF, we will probably encourage
>> the growth of multiple unofficial, unmanaged and non-interchangeable
>> CIF2 dialects.
> I dispute all three unsupported statements in the above paragraph.
> Taking the first one, where HTML, XML and CIF1 are put forward as
> successful standards for information interchange that have optional
> features:
> (1) HTML is no more an information interchange standard than Rich Text
> Format.  It is primarily a standard for marking up documents for
> presentation to the human reader.  If you wish to argue by analogy
> with HTML, you will need to draw a much tighter parallel to the goals
> of CIF.
> (2) I agree that the goals of XML are similar to those of CIF, and I
> would be pleased if we adopted their approach to optional behaviour.
> The fifth of the 10 design goals for XML was (see the XML 1.1 standard
> at http://www.w3.org/TR/2006/REC-xml11-20060816):
> "5.The number of optional features in XML is to be kept to the
> absolute minimum, ideally zero."
> So, if XML is to be our guiding light, then we should avoid optional behaviour.
> (3) As for CIF1 having optional behaviours, what might those be?  I
> would assert that regardless of the wording of the standard, those
> optional features are either never supported, or else always
> supported, or else irrelevant to the core use of CIF.
> So: I don't think that appealing to HTML or XML proves that optional
> behaviour is a good thing, and lacking supporting argument, the appeal
> to CIF1 does not prove it either.
> Moving on to your second assertion about liberal readers and strict
> writers: while that philosophy has its adherents, an alternative
> philosophy also exists: readers should exit gracefully on standards
> violations.  I quote a recent Linux Weekly News article which bears on
> this discussion:
> "The notion that one should be liberal in what one accepts while being
> conservative in what one sends is often expressed in the networking
> field, but it shows up in a number of other areas as well. Often,
> though, it can make more sense to be conservative on the accepting
> side; the condition of many web pages would have been far better had
> early browsers not been so forgiving of bad HTML."
> (http://lwn.net/Articles/394175/)
> So: which approach you adopt to writing standards-conformant readers
> requires some thought, particularly given the possibility that liberal
> readers will encourage liberal writers.
> The final assertion about the consequences of being too strict might
> in theory be true, but it will require clear use-cases to support it
> rather than simply asserting it as a truism.  I would suggest that we
> are nowhere near the point of forcing incompatible dialects to emerge,
> given that the addition of UTF8 to the standard does not meaningfully
> restrict the choices offered to CIF users, and any other restrictions
> that we have introduced into CIF2 relative to CIF1 are very minor.
> Based on this observation, my expectation is that CIF2 will no more
> produce incompatible dialects than CIF1, *provided we have no optional
> behaviour*.
> I will address what I believe is the real source of disagreement on
> this point, which is my statement that "Standards-conformant readers
> must be able to read all files produced by standards-conformant
> writers", in an answer to John B's other post.
>> As for John's hashing scheme, I suspect some variation of it will find
>> significant use in major archives, just as associating MD5 checksums
>> with tarballs does for many software distributors, but that we also
>> will need some easier-to-generate-and-transfer _optional_ encoding
>> hint schemes, such as the accented "o's".  One simple way to handle
>> it would be:
>>  1.  Put some variant of the accented "o's" into the _optional_
>> magic number; and
>>  2.  Adopt the tarball approach to MD5 checksums by having it not
>> in the header but in a separate file, simply generating it from
>> a canonical UTF8 representation of the CIF2 file.
>> The accented o's are easy to carry along as an encoding hint, and
>> if you get the encoding hint right, then you will easily be able
>> to generate a canonical UTF8 file to validate the MD5 checksum against
>> if you wish for a critical file transfer, e.g. to an archive or a journal.
>> Regards,
>>   Herbert
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> cif2-encoding mailing list
> cif2-encoding@iucr.org
> http://scripts.iucr.org/mailman/listinfo/cif2-encoding