Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .... .. .. .. .. .
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .... .. .. .. .. .
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Wed, 30 Jun 2010 08:30:45 -0400 (EDT)
- In-Reply-To: <20100630120755.GA7943@emerald.iucr.org>
- References: <563298.52532.qm@web87005.mail.ird.yahoo.com><8F77913624F7524AACD2A92EAF3BFA54166122952C@SJMEMXMBS11.stjude.sjcrh.local><520427.68014.qm@web87001.mail.ird.yahoo.com><a06240800c84ac1b696bf@[192.168.2.104]><614241.93385.qm@web87016.mail.ird.yahoo.com><alpine.BSF.2.00.1006251827270.70846@epsilon.pair.com><663654.63888.qm@web87001.mail.ird.yahoo.com><8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local><33483.93964.qm@web87012.mail.ird.yahoo.com><8F77913624F7524AACD2A92EAF3BFA541661229533@SJMEMXMBS11.stjude.sjcrh.local><20100630120755.GA7943@emerald.iucr.org>
Brian asked about sub and superscripts. There is a limited set of
subscript and superscript digits in Unicode in addition to the usual
accents. This capability is not, however, a replacement for a markup
language.

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Wed, 30 Jun 2010, Brian McMahon wrote:

> John B's analysis (here and elsewhere) is very pertinent and provides
> a very helpful conceptual framework.
>
> There are a few other considerations regarding encodings that come to
> my mind.
>
> First, any assertion of an encoding is at best a hint. Properly, a
> trusted archive should validate any such assertion and correct it if
> necessary. Validation may not always be possible through entirely
> automatic procedures. Copying and pasting text from one system into
> another may transfer bytes without transcoding the intended characters
> (see use case suggested in the next paragraph). Transmission checksums
> will help, but will not catch everything. There will arise different
> levels of trust associated with experience of particular programs,
> operating systems or perhaps authors.
>
> Settling on a single mandated encoding for CIF doesn't solve all the
> problems of CIF author/editing applications, assuming some
> interaction with other text processing software is allowed or
> expected. An author - let us say Polish, since we do currently publish
> names with Polish accents - uploads a CIF submission to Chester via web
> and subsequently emails to me a Word document saying "Please
> replace my abstract by this revised version."
> I save the email from my Solaris account to a Samba filesystem, open
> it in Word, copy the text to the Windows clipboard, and paste it via
> my Windows X server to the publCIF clipboard which I am running on a
> Linux machine via an ssh session from my Solaris host. Does publCIF
> correctly write the Unicode code point U+0142 ("Polish letter l with
> stroke") into the CIF? If not, what has gone wrong and where, and
> (how) can it be fixed?
>
> (I have forgotten.) Are we proposing to allow also the "old" CIF1
> character encoding scheme (\/l <=> Unicode U+0142) within CIF2, and
> have we a procedure for distinguishing when it has been applied?
>
> The elephant in the room: the discussion has so far addressed
> alphabetic character encodings. To retain at least the functionality
> of CIF1, we will need to specify schemes for handling other constructs
> such as sub/superscripts and possibly some other mathematical
> constructs - though most of the CIF1 maths is at the level of
> character representations which may be catered for by Unicode code
> points. Will procedures for identifying which of these schemes apply
> (possibly at the level of individual CIF data items) be orthogonal
> to specification of character encodings in the CIF, will the two
> interfere destructively, or can they somehow be handled in much the
> same way in terms of attaching "metadata about encoding" of some sort
> to the relevant objects?
>
> Regards
> Brian
>
>
>> [...] I maintain that a stream of bytes
>> divorced from any explicit or implicit metadata about its encoding
>> is binary, not text. This complication of electronic text handling
>> is not new, but it has assumed much more prominence as
>> internationalization issues have gained importance.
>>
>> Implicit encoding metadata commonly takes the form of the text in
>> question being encoded according to the default scheme for the
>> system or tool.
>> It could, in one sense, also take the form of
>> a requirement in the format specification, but that is meaningful
>> only for tools specific to the format, which rather moots the
>> text vs. binary question. It could also take the form of local
>> policy, such as "all CIFs in this archive are encoded in CESU-8,"
>> which would be useful to tools configured for the relevant
>> environment (e.g. a web server).
>>
>> Explicit metadata can be carried by the file itself or conveyed
>> out-of-band. XML's encoding attribute is an example of the former,
>> and HTTP's content-type header is an example of the latter. These
>> are useful only to certain tools, specific to a particular format,
>> environment, or exchange mechanism.
>>
>> One of the upshots of all this is that transcoding must in general be
>> a routine aspect of text file exchange, as that can make explicit
>> encoding metadata implicit. As Simon has shown, transcoding is not
>> automatic in many contexts, so it may require extra work on the
>> receiving end. To the extent that there is a current assumption and
>> practice of CIFs being stored and forwarded byte-for-byte as received
>> (i.e. without transcoding or explicit metadata), CIF is already being
>> treated as a binary format. In a sense, perhaps, it is being treated
>> simultaneously as several distinct binary formats.
>>
>> ***
>>
>>> By extending the character set beyond ASCII, we have to accept that
>>> not all general-purpose text tools are going to be applicable as CIF
>>> editors/viewers.
>>
>> That's a valid perspective, but I would sharpen it: as part of
>> extending the character set beyond ASCII, we abandon the premise that
>> CIF is a text format, though under some circumstances it may still be
>> possible to manipulate CIFs with tools designed for text.
>>
>> Alternatively, I have been advocating essentially this: by extending
>> the character set beyond ASCII, we magnify the importance of
>> exchanging and storing CIFs according to text conventions, including
>> correctly communicating encodings as necessary and transcoding as
>> appropriate.
>>
>> I hope the latter position adequately encompasses Herb's view as well.
>> Each position carries additional baggage, which I have omitted to
>> focus on the essential ideas. If wider comment is sought, then I
>> submit that these alternatives provide a suitable basis for soliciting
>> such.
>>
>> Whichever position prevails, I should like to see something
>> substantially similar to the corresponding position statement above
>> be inserted into the spec.
>>
>> Regards,
>> John
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
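
Editorial illustration (not part of the original message): a minimal Python sketch of three points raised in this thread, under stated assumptions. It shows the Unicode subscript digits mentioned above, the CIF1 code \/l for U+0142 quoted by Brian, and why a claimed encoding is only a hint that cannot always be validated automatically. The function names (to_subscript, expand_cif1_l_stroke, check_claimed_encoding) are hypothetical and are not taken from publCIF or any CIF library.

```python
# Illustrative sketch only; all names here are hypothetical.

# Unicode subscript digits 0-9 (U+2080 to U+2089).
SUBSCRIPT_DIGITS = "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089"
_TO_SUB = str.maketrans("0123456789", SUBSCRIPT_DIGITS)


def to_subscript(text: str) -> str:
    """Replace ASCII digits with Unicode subscript digits.

    Only digits are handled; arbitrary subscripted text still needs markup.
    """
    return text.translate(_TO_SUB)


def expand_cif1_l_stroke(text: str) -> str:
    """Expand the CIF1 code \\/l to U+0142, the one mapping quoted in the thread."""
    return text.replace("\\/l", "\u0142")


def check_claimed_encoding(data: bytes, claimed: str):
    """Return the decoded text if `data` is valid under `claimed`, else None.

    A successful decode does not prove the claim was right: single-byte
    encodings such as Latin-1 accept every byte sequence.
    """
    try:
        return data.decode(claimed, errors="strict")
    except (UnicodeDecodeError, LookupError):
        return None


if __name__ == "__main__":
    raw = "\u0142".encode("utf-8")                      # b'\xc5\x82'
    print(to_subscript("H2O"))
    print(expand_cif1_l_stroke("\\/l") == "\u0142")
    print(check_claimed_encoding(raw, "ascii"))         # rejected: None
    print(repr(check_claimed_encoding(raw, "latin-1"))) # accepted, but wrong text
```

The last call is the interesting case: the bytes decode without error under Latin-1 yet yield two unrelated characters, which is one way the copy-and-paste scenario described above can go wrong without any tool reporting a failure.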