Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .... .. .. .. .. .

To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .... .. .. .. .. .
From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
Date: Wed, 30 Jun 2010 08:30:45 -0400 (EDT)
In-Reply-To: <20100630120755.GA7943@emerald.iucr.org>
References: <563298.52532.qm@web87005.mail.ird.yahoo.com><8F77913624F7524AACD2A92EAF3BFA54166122952C@SJMEMXMBS11.stjude.sjcrh.local><520427.68014.qm@web87001.mail.ird.yahoo.com><a06240800c84ac1b696bf@[192.168.2.104]><614241.93385.qm@web87016.mail.ird.yahoo.com><alpine.BSF.2.00.1006251827270.70846@epsilon.pair.com><663654.63888.qm@web87001.mail.ird.yahoo.com><8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local><33483.93964.qm@web87012.mail.ird.yahoo.com><8F77913624F7524AACD2A92EAF3BFA541661229533@SJMEMXMBS11.stjude.sjcrh.local><20100630120755.GA7943@emerald.iucr.org>

Brian asked about subscripts and superscripts.  There is a limited set
of subscript and superscript digits in Unicode, in addition to the
usual accents.  This capability is not, however, a replacement for
a markup language.
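
As a rough illustration of the point about Unicode subscript and
superscript digits, here is a minimal Python sketch.  The helper names
(to_subscript, to_superscript) are hypothetical and not part of any CIF
tool; the code points are the standard Unicode ones.  It also shows one
reason these characters fall short of markup: compatibility
normalization (NFKC) silently folds them back to plain digits.

```python
import unicodedata

# Superscript digits 0-9. Note that 1, 2 and 3 live in the Latin-1
# Supplement block, the rest in Superscripts and Subscripts.
SUPERSCRIPT_DIGITS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
# Subscript digits 0-9 are contiguous: U+2080 .. U+2089.
SUBSCRIPT_DIGITS = "".join(chr(0x2080 + d) for d in range(10))


def to_subscript(number: int) -> str:
    """Render a non-negative integer using Unicode subscript digits."""
    if number < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    return "".join(SUBSCRIPT_DIGITS[int(ch)] for ch in str(number))


def to_superscript(number: int) -> str:
    """Render a non-negative integer using Unicode superscript digits."""
    if number < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    return "".join(SUPERSCRIPT_DIGITS[int(ch)] for ch in str(number))


# A formula that digits alone can express.
water = "H" + to_subscript(2) + "O"

# NFKC normalization replaces the subscript with an ordinary digit, so
# the sub/superscript information does not survive every text pipeline.
flattened = unicodedata.normalize("NFKC", water)

if __name__ == "__main__":
    print(unicodedata.name(water[1]))   # name of the subscript character
    print(water.encode("utf-8"))
    print(flattened)
```

Only digits (and a handful of other signs) are available this way;
arbitrary subscripted text, nesting, or a subscript and superscript on
the same symbol still need some form of markup.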

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Wed, 30 Jun 2010, Brian McMahon wrote:

> John B's analysis (here and elsewhere) is very pertinent and provides
> a very helpful conceptual framework.
>
> There are a few other considerations regarding encodings that come to
> my mind.
>
> First, any assertion of an encoding is at best a hint. Properly, a
> trusted archive should validate any such assertion and correct it if
> necessary. Validation may not always be possible through entirely
> automatic procedures. Copying and pasting text from one system into
> another may transfer bytes without transcoding the intended characters
> (see use case suggested in the next paragraph). Transmission checksums
> will help, but will not catch everything. There will arise different
> levels of trust associated with experience of particular programs,
> operating systems or perhaps authors.
>
> Settling on a single mandated encoding for CIF doesn't solve all the
> problems of CIF author/editing applications, assuming some
> interaction with other text processing software is allowed or
> expected. An author - let us say Polish, since we do currently publish
> names with Polish accents - uploads a CIF submission to Chester via web
> and subsequently emails to me a Word document saying "Please
> replace my abstract by this revised version." I save the email
> from my Solaris account to a Samba filesystem, open it in Word, copy
> the text to the Windows clipboard, and paste it via my Windows X server
> to the publCIF clipboard which I am running on a Linux machine via
> an ssh session from my Solaris host. Does publCIF correctly write the
> Unicode code point U+0142 ("Polish letter l with stroke") into
> the CIF? If not, what has gone wrong and where, and (how) can it be
> fixed?
>
> (I have forgotten.) Are we proposing to allow also the "old" CIF1
> character encoding scheme (\/l <=> Unicode U+0142) within CIF2, and
> have we a procedure for distinguishing when it has been applied?
>
> The elephant in the room: the discussion has so far addressed
> alphabetic character encodings. To retain at least the functionality
> of CIF1, we will need to specify schemes for handling other constructs
> such as sub/superscripts and possibly some other mathematical
> constructs - though most of the CIF1 maths is at the level of
> character representations which may be catered for by Unicode code
> points. Will procedures for identifying which of these schemes apply
> (possibly at the level of individual CIF data items) be orthogonal
> to specification of character encodings in the CIF, will the two
> interfere destructively, or can they somehow be handled in much the
> same way in terms of attaching "metadata about encoding" of some sort
> to the relevant objects?
>
> Regards
> Brian
>
>
>> [...] I maintain that a stream of bytes
>> divorced from any explicit or implicit metadata about its encoding
>> is binary, not text.  This complication of electronic text handling
>> is not new, but it has assumed much more prominence as
>> internationalization issues have gained importance.
>>
>> Implicit encoding metadata commonly takes the form of the text in
>> question being encoded according to the default scheme for the
>> system or tool.  It could, in one sense, also take the form of
>> a requirement in the format specification, but that is meaningful
>> only for tools specific to the format, which rather moots the
>> text vs. binary question.  It could also take the form of local
>> policy, such as "all CIFs in this archive are encoded in CESU-8,"
>> which would be useful to tools configured for the relevant
>> environment (e.g. a web server).
>>
>> Explicit metadata can be carried by the file itself or conveyed
>> out-of-band.  XML's encoding attribute is an example of the former,
>> and HTTP's content-type header is an example of the latter.  These
>> are useful only to certain tools, specific to a particular format,
>> environment, or exchange mechanism.
>>
>> One of the upshots of all this is that transcoding must in general be
>> a routine aspect of text file exchange, as that can make explicit
>> encoding metadata implicit.  As Simon has shown, transcoding is not
>> automatic in many contexts, so it may require extra work on the
>> receiving end.  To the extent that there is a current assumption and
>> practice of CIFs being stored and forwarded byte-for-byte as received
>> (i.e. without transcoding or explicit metadata), CIF is already being
>> treated as a binary format.  In a sense, perhaps, it is being treated
>> simultaneously as several distinct binary formats.
>>
>> ***
>>
>>> By extending the character set beyond ASCII, we have to accept that
>>> not all general-purpose text tools are going to be applicable as CIF
>>> editors/viewers.
>>
>> That's a valid perspective, but I would sharpen it: as part of
>> extending the character set beyond ASCII, we abandon the premise that
>> CIF is a text format, though under some circumstances it may still be
>> possible to manipulate CIFs with tools designed for text.
>>
>> Alternatively, I have been advocating essentially this: by extending
>> the character set beyond ASCII, we magnify the importance of
>> exchanging and storing CIFs according to text conventions, including
>> correctly communicating encodings as necessary and transcoding as
>> appropriate.
>>
>> I hope the latter position adequately encompasses Herb's view as well.
>> Each position carries additional baggage, which I have omitted to
>> focus on the essential ideas.  If wider comment is sought, then I
>> submit that these alternatives provide a suitable basis for soliciting
>> such.
>>
>> Whichever position prevails, I should like to see something
>> substantially similar to the corresponding position statement above
>> be inserted into the spec.
>>
>> Regards,
>> John
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
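
The copy-and-paste scenario and the remark that an asserted encoding is
"at best a hint" can be made concrete with a small Python sketch.  It
is illustrative only: the function name validate_claimed_encoding is
hypothetical, and the choice of ISO-8859-2 and ISO-8859-1 as the two
legacy encodings involved is an assumption, not something stated in
the thread.

```python
L_STROKE = "\u0142"  # Polish small letter l with stroke


def validate_claimed_encoding(raw: bytes, claimed: str) -> bool:
    """Return True if raw can be decoded under the claimed encoding.

    This is a necessary check, not a sufficient one: a successful
    decode does not prove the claimed encoding is the intended one.
    """
    try:
        raw.decode(claimed)
    except UnicodeDecodeError:
        return False
    return True


# The same character as bytes under two different encodings.
utf8_bytes = L_STROKE.encode("utf-8")        # two bytes
latin2_bytes = L_STROKE.encode("iso8859_2")  # one byte, 0xB3

# Bytes copied without transcoding, then read under the wrong encoding:
# the ISO-8859-2 byte decodes "successfully" as ISO-8859-1, but to a
# different character (superscript three).
misread = latin2_bytes.decode("latin-1")

# Automatic validation catches some wrong assertions ...
caught = not validate_claimed_encoding(latin2_bytes, "utf-8")
# ... but not others: every byte sequence is valid ISO-8859-1.
missed = validate_claimed_encoding(latin2_bytes, "latin-1")

if __name__ == "__main__":
    print(utf8_bytes, latin2_bytes)
    print(repr(misread), caught, missed)
```

The last two checks show why validation cannot always be fully
automatic: a strict UTF-8 decode rejects the stray byte, whereas a
single-byte encoding accepts anything, so the error only shows up to a
reader who knows what the text was meant to say.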
