Re: [ddlm-group] options/text vs binary/end-of-line

We have no need to disallow the previous set of 'encodings'.  Such
markup is relevant to a different level of the specification, as John
B points out.  How the contents of any given datavalue are dealt with
is a dictionary-level concern.
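As a rough illustration of that layering, here is a minimal Python sketch. The function names and the one-entry table are hypothetical, and only the \/l <=> U+0142 pairing comes from Brian's message below. The idea is that an application could expand the CIF1 markup after the parser has already delivered the data value, leaving the file-level character encoding untouched:

```python
# Illustrative subset of the CIF1 backslash markup. Only the pairing
# \/l <=> U+0142 is taken from this thread; a real table would be built
# from the CIF1 specification or the relevant dictionary.
CIF1_MARKUP = {
    r"\/l": "\u0142",  # LATIN SMALL LETTER L WITH STROKE
}


def expand_cif1_markup(value: str) -> str:
    """Replace known CIF1 markup sequences in an already-parsed data value."""
    for markup, char in CIF1_MARKUP.items():
        value = value.replace(markup, char)
    return value


def to_cif1_markup(value: str) -> str:
    """Inverse mapping: write known characters back as CIF1 markup."""
    for markup, char in CIF1_MARKUP.items():
        value = value.replace(char, markup)
    return value


print(expand_cif1_markup(r"Micha\/l") == "Micha\u0142")  # True
```

Whether a given data value should be passed through such a step at all is exactly the kind of thing a dictionary could declare per item.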

On Wed, Jun 30, 2010 at 10:07 PM, Brian McMahon <bm@iucr.org> wrote:
> John B's analysis (here and elsewhere) is very pertinent and provides
> a very helpful conceptual framework.
>
> There are a few other considerations regarding encodings that come to
> my mind.
>
> First, any assertion of an encoding is at best a hint. Properly, a
> trusted archive should validate any such assertion and correct it if
> necessary. Validation may not always be possible through entirely
> automatic procedures. Copying and pasting text from one system into
> another may transfer bytes without transcoding the intended characters
> (see use case suggested in the next paragraph). Transmission checksums
> will help, but will not catch everything. There will arise different
> levels of trust associated with experience of particular programs,
> operating systems or perhaps authors.
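To illustrate why an asserted encoding is only a hint, here is a minimal Python sketch (the helper name `decodes_cleanly` is hypothetical). A strict decode can refute an assertion, but it cannot confirm one, because some encodings accept every possible byte sequence:

```python
def decodes_cleanly(data: bytes, asserted: str) -> bool:
    """Return True if `data` is at least well-formed in the asserted encoding."""
    try:
        data.decode(asserted, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# U+0142 written in ISO-8859-2 is the single byte 0xB3.
polish_l_latin2 = "\u0142".encode("iso-8859-2")

print(decodes_cleanly(polish_l_latin2, "utf-8"))    # False: the assertion is refuted
print(decodes_cleanly(polish_l_latin2, "latin-1"))  # True, although the character is wrong
```

The second result is the troublesome case: the check passes, yet the text has silently changed, which is where human review or trust in the originating software has to take over.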
> Settling on a single mandated encoding for CIF doesn't solve all the
> problems of CIF author/editing applications, assuming some
> interaction with other text processing software is allowed or
> expected. An author - let us say Polish, since we do currently publish
> names with Polish accents - uploads a CIF submission to Chester via web
> and subsequently emails to me a Word document saying "Please
> replace my abstract by this revised version." I save the email
> from my Solaris account to a Samba filesystem, open it in Word, copy
> the text to the Windows clipboard, and paste it via my Windows X server
> to the publCIF clipboard which I am running on a Linux machine via
> an ssh session from my Solaris host. Does publCIF correctly write the
> Unicode code point U+0142 ("Polish letter l with stroke") into
> the CIF? If not, what has gone wrong and where, and (how) can it be
> fixed?
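One plausible failure in a chain like this can be reproduced in a few lines of Python. This is only a sketch: which step actually goes wrong in the scenario above is not known, and the choice of ISO-8859-2 and Latin-1 here is an assumption made for illustration.

```python
polish_l = "\u0142"  # LATIN SMALL LETTER L WITH STROKE

utf8_bytes = polish_l.encode("utf-8")         # two bytes: 0xC5 0x82
latin2_bytes = polish_l.encode("iso-8859-2")  # one byte: 0xB3

# A receiver that wrongly assumes Latin-1 gets different characters,
# and no error is raised at any point.
misread_utf8 = utf8_bytes.decode("latin-1")
misread_latin2 = latin2_bytes.decode("latin-1")

print(utf8_bytes, latin2_bytes)
print([hex(ord(c)) for c in misread_utf8])    # ['0xc5', '0x82']
print([hex(ord(c)) for c in misread_latin2])  # ['0xb3']
```

Once the misread text has been saved again, the original bytes are gone, so the fix has to be applied at the step where the wrong assumption was made.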
> (I have forgotten.) Are we proposing to allow also the "old" CIF1
> character encoding scheme (\/l <=> Unicode U+0142) within CIF2, and
> have we a procedure for distinguishing when it has been applied?
>
> The elephant in the room: the discussion has so far addressed
> alphabetic character encodings. To retain at least the functionality
> of CIF1, we will need to specify schemes for handling other constructs
> such as sub/superscripts and possibly some other mathematical
> constructs - though most of the CIF1 maths is at the level of
> character representations which may be catered for by Unicode code
> points. Will procedures for identifying which of these schemes apply
> (possibly at the level of individual CIF data items) be orthogonal
> to specification of character encodings in the CIF, will the two
> interfere destructively, or can they somehow be handled in much the
> same way in terms of attaching "metadata about encoding" of some sort
> to the relevant objects?
> Regards
> Brian
>> [...] I maintain that a stream of bytes
>> divorced from any explicit or implicit metadata about its encoding
>> is binary, not text.  This complication of electronic text handling
>> is not new, but it has assumed much more prominence as
>> internationalization issues have gained importance.
>>
>> Implicit encoding metadata commonly takes the form of the text in
>> question being encoded according to the default scheme for the
>> system or tool.  It could, in one sense, also take the form of
>> a requirement in the format specification, but that is meaningful
>> only for tools specific to the format, which rather moots the
>> text vs. binary question.  It could also take the form of local
>> policy, such as "all CIFs in this archive are encoded in CESU-8,"
>> which would be useful to tools configured for the relevant
>> environment (e.g. a web server).
>>
>> Explicit metadata can be carried by the file itself or conveyed
>> out-of-band.  XML's encoding attribute is an example of the former,
>> and HTTP's content-type header is an example of the latter.  These
>> are useful only to certain tools, specific to a particular format,
>> environment, or exchange mechanism.
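As a small illustration of out-of-band metadata, the sketch below reads the charset parameter from an HTTP-style Content-Type value using Python's standard `email.message` module. The helper name and the UTF-8 fallback are assumptions for the example, not anything proposed for CIF:

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "utf-8") -> str:
    """Extract the charset parameter from a Content-Type header value.

    The fallback `default` is an assumption of this sketch; what a tool
    should do when no charset is declared is a policy question.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset() or default


body = "\u0142".encode("iso-8859-2")
charset = charset_from_content_type("text/plain; charset=ISO-8859-2")
print(charset)                             # iso-8859-2
print(body.decode(charset) == "\u0142")    # True
```

The metadata is only useful to a tool that sees the header; once the body is saved to disk on its own, the declaration is lost unless something records it.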
>> One of the upshots of all this is that transcoding must in general be
>> a routine aspect of text file exchange, as that can make explicit
>> encoding metadata implicit.  As Simon has shown, transcoding is not
>> automatic in many contexts, so it may require extra work on the
>> receiving end.  To the extent that there is a current assumption and
>> practice of CIFs being stored and forwarded byte-for-byte as received
>> (i.e. without transcoding or explicit metadata), CIF is already being
>> treated as a binary format.  In a sense, perhaps, it is being treated
>> simultaneously as several distinct binary formats.
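For contrast with byte-for-byte forwarding, a transcoding step on receipt might look like the following Python sketch. The function name and the choice of UTF-8 as the local target encoding are assumptions for illustration:

```python
def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Decode bytes using the sender's declared encoding, then re-encode
    them in the receiver's local convention (assumed here to be UTF-8)."""
    return data.decode(source).encode(target)


received = b"\xb3"  # U+0142 as sent in ISO-8859-2
stored = transcode(received, "iso-8859-2")
print(stored == "\u0142".encode("utf-8"))  # True
```

After this step the encoding no longer needs to travel with the file, because it matches the receiving system's default; that is the sense in which explicit metadata becomes implicit.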
>> ***
>>> By extending the character set beyond ASCII, we have to accept that
>>> not all general-purpose text tools are going to be applicable as CIF
>>> editors/viewers.
>> That's a valid perspective, but I would sharpen it: as part of
>> extending the character set beyond ASCII, we abandon the premise that
>> CIF is a text format, though under some circumstances it may still be
>> possible to manipulate CIFs with tools designed for text.
>>
>> Alternatively, I have been advocating essentially this: by extending
>> the character set beyond ASCII, we magnify the importance of
>> exchanging and storing CIFs according to text conventions, including
>> correctly communicating encodings as necessary and transcoding as
>> appropriate.
>>
>> I hope the latter position adequately encompasses Herb's view as well.
>>
>> Each position carries additional baggage, which I have omitted to
>> focus on the essential ideas.  If wider comment is sought, then I
>> submit that these alternatives provide a suitable basis for soliciting
>> such.
>>
>> Whichever position prevails, I should like to see something
>> substantially similar to the corresponding position statement above
>> be inserted into the spec.
>> Regards,
>> John
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
