
Re: [ddlm-group] options/text vs binary/end-of-line

John B's analysis (here and elsewhere) is very pertinent and provides
a very helpful conceptual framework.

There are a few other considerations regarding encodings that come to
my mind.

First, any assertion of an encoding is at best a hint. Properly, a
trusted archive should validate any such assertion and correct it if
necessary. Validation may not always be possible through entirely
automatic procedures. Copying and pasting text from one system into
another may transfer bytes without transcoding the intended characters
(see use case suggested in the next paragraph). Transmission checksums
will help, but will not catch everything. Different levels of trust
will arise from experience with particular programs, operating systems
or perhaps authors.
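To make the "assertion is at best a hint" point concrete, here is a
minimal sketch (names and fallback policy are my own illustration, not
anything from a CIF specification) of an archive-side check that a
declared encoding actually matches the bytes received:

```python
# Sketch: treat a declared encoding as a hint and verify it against the
# bytes actually received. The UTF-8 fallback is one plausible repair
# policy, not a mandated one.

def check_declared_encoding(raw: bytes, declared: str) -> str:
    """Return the declared encoding if the bytes decode cleanly under it;
    otherwise try strict UTF-8; otherwise flag for human attention."""
    try:
        raw.decode(declared)
        return declared
    except (UnicodeDecodeError, LookupError):
        pass
    try:
        raw.decode("utf-8")
        return "utf-8"  # assertion was wrong, but the bytes are valid UTF-8
    except UnicodeDecodeError:
        return "unknown"  # not automatically resolvable

print(check_declared_encoding("\u0142".encode("utf-8"), "ascii"))  # prints "utf-8"
```

As the sketch suggests, the automatic path can only go so far: when
neither the declared encoding nor a strict fallback decodes cleanly,
the result is exactly the "not always possible through entirely
automatic procedures" case.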

Settling on a single mandated encoding for CIF doesn't solve all the
problems of CIF author/editing applications, assuming some
interaction with other text processing software is allowed or
expected. An author - let us say Polish, since we do currently publish
names with Polish accents - uploads a CIF submission to Chester via web
and subsequently emails to me a Word document saying "Please
replace my abstract by this revised version." I save the email
from my Solaris account to a Samba filesystem, open it in Word, copy
the text to the Windows clipboard, and paste it via my Windows X server
to the publCIF clipboard which I am running on a Linux machine via
an ssh session from my Solaris host. Does publCIF correctly write the
Unicode code point U+0142 ("Polish letter l with stroke") into
the CIF? If not, what has gone wrong and where, and (how) can it be
fixed?
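The kind of silent corruption that chain can produce is easy to
reproduce in miniature: if one hop writes the Polish letter under a
Windows code page and a later hop reinterprets the byte as Latin-1,
the character changes without any error being raised. (cp1250 and
latin-1 here stand in for whichever defaults the real systems use.)

```python
# A minimal reproduction of a clipboard-relay failure mode:
# "l with stroke" (U+0142) encoded under Windows cp1250, then
# misread as ISO 8859-1 (Latin-1).

original = "\u0142"                            # the Polish letter in question
as_windows_bytes = original.encode("cp1250")   # b'\xb3'
mis_read = as_windows_bytes.decode("latin-1")  # U+00B3, superscript three

print(hex(as_windows_bytes[0]), mis_read)      # silent corruption, no exception
```

Nothing along the way reports a failure; only a reader who knows what
the abstract should say will notice.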

(I have forgotten.) Are we also proposing to allow the "old" CIF1
character encoding scheme (\/l <=> Unicode U+0142) within CIF2, and
have we a procedure for distinguishing when it has been applied?
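The substitution itself is trivial, which is rather the point: the hard
problem is knowing whether the scheme was applied at all. A sketch,
using only the \/l mapping given above plus one analogous entry I have
assumed for illustration (this is nowhere near the full CIF1 table):

```python
# Illustrative fragment only: two entries of the CIF1 escape scheme.
# Detecting *whether* the scheme is in effect is the hard part; the
# replacement, given that knowledge, is mechanical.

CIF1_ESCAPES = {
    r"\/l": "\u0142",  # l with stroke, as cited above
    r"\/o": "\u00f8",  # o with stroke (assumed analogous entry)
}

def expand_cif1_escapes(text: str) -> str:
    for esc, ch in CIF1_ESCAPES.items():
        text = text.replace(esc, ch)
    return text

print(expand_cif1_escapes(r"Wis\/lawa"))  # prints "Wisława"
```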

The elephant in the room: the discussion has so far addressed
alphabetic character encodings. To retain at least the functionality
of CIF1, we will need to specify schemes for handling other constructs
such as sub/superscripts and possibly some other mathematical
constructs - though most of the CIF1 maths is at the level of
character representations which may be catered for by Unicode code
points. Will procedures for identifying which of these schemes apply
(possibly at the level of individual CIF data items) be orthogonal
to specification of character encodings in the CIF, will the two
interfere destructively, or can they somehow be handled in much the
same way in terms of attaching "metadata about encoding" of some sort
to the relevant objects?
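As one data point for that question: where Unicode code points do
exist, the CIF1 sub/superscript markup can be mapped onto them
mechanically, but only for a narrow slice of content. A sketch
(digits only; the regexes and scope are my own illustration):

```python
import re

# Sketch: translating CIF1 ^...^ (superscript) and ~...~ (subscript)
# markup into Unicode characters, where code points exist. Only digits
# are handled; "U~eq~" and the like are deliberately left untouched,
# since Unicode has no subscript letters for the general case.

SUP = str.maketrans("0123456789",
                    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079")
SUB = str.maketrans("0123456789",
                    "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089")

def markup_to_unicode(text: str) -> str:
    text = re.sub(r"\^(\d+)\^", lambda m: m.group(1).translate(SUP), text)
    text = re.sub(r"~(\d+)~", lambda m: m.group(1).translate(SUB), text)
    return text

print(markup_to_unicode("A^3^ and U~eq~"))
```

The residue (anything the mapping cannot express as plain code points)
is exactly what would need its own "metadata about encoding" attached
to the relevant data items.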


> [...] I maintain that a stream of bytes
> divorced from any explicit or implicit metadata about its encoding
> is binary, not text.  This complication of electronic text handling
> is not new, but it has assumed much more prominence as
> internationalization issues have gained importance.
> Implicit encoding metadata commonly takes the form of the text in
> question being encoded according to the default scheme for the
> system or tool.  It could, in one sense, also take the form of
> a requirement in the format specification, but that is meaningful
> only for tools specific to the format, which rather moots the
> text vs. binary question.  It could also take the form of local
> policy, such as "all CIFs in this archive are encoded in CESU-8,"
> which would be useful to tools configured for the relevant
> environment (e.g. a web server).
> Explicit metadata can be carried by the file itself or conveyed
> out-of-band.  XML's encoding attribute is an example of the former,
> and HTTP's content-type header is an example of the latter.  These
> are useful only to certain tools, specific to a particular format,
> environment, or exchange mechanism.
> One of the upshots of all this is that transcoding must in general be
> a routine aspect of text file exchange, as that can make explicit
> encoding metadata implicit.  As Simon has shown, transcoding is not
> automatic in many contexts, so it may require extra work on the
> receiving end.  To the extent that there is a current assumption and
> practice of CIFs being stored and forwarded byte-for-byte as received
> (i.e. without transcoding or explicit metadata), CIF is already being
> treated as a binary format.  In a sense, perhaps, it is being treated
> simultaneously as several distinct binary formats.
> ***
>> By extending the character set beyond ASCII, we have to accept that
>> not all general-purpose text tools are going to be applicable as CIF
>> editors/viewers.
> That's a valid perspective, but I would sharpen it: as part of
> extending the character set beyond ASCII, we abandon the premise that
> CIF is a text format, though under some circumstances it may still be
> possible to manipulate CIFs with tools designed for text.
> Alternatively, I have been advocating essentially this: by extending
> the character set beyond ASCII, we magnify the importance of
> exchanging and storing CIFs according to text conventions, including
> correctly communicating encodings as necessary and transcoding as
> appropriate.
> I hope the latter position adequately encompasses Herb's view as well.
> Each position carries additional baggage, which I have omitted to
> focus on the essential ideas.  If wider comment is sought, then I
> submit that these alternatives provide a suitable basis for soliciting
> such.
> Whichever position prevails, I should like to see something
> substantially similar to the corresponding position statement above
> inserted into the spec.
> Regards,
> John