Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .... .. .. .
- To: Group finalising DDLm and associated dictionaries <[email protected]>
- Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .... .. .. .
- From: "Herbert J. Bernstein" <[email protected]>
- Date: Thu, 24 Jun 2010 13:53:30 -0400
Dear Colleagues,
It is an unfortunate reality that we seem unable to agree on this issue
and perhaps others related to CIF2 and DDLm. Perhaps we need a meeting.
If enough of us are at the ACA meeting in Chicago, and a few others
could join in via Skype, maybe we could make some progress.
Right now we seem to be going in circles.
Regards,
Herbert
At 12:24 PM -0500 6/24/10, Bollinger, John C wrote:
>On Wednesday, June 23, 2010 8:24 PM, SIMON WESTRIP wrote:
>>I've attempted to take a step back and look at the encoding problem
>>from the perspective of my working experience.
>
>Fair enough.
>
>[...]
>
>>To start with, please indulge me by putting aside the
>>philosophical/respectful ('internationalization') considerations.
>>What are the short/medium term benefits of extending CIF beyond ASCII text?
>>
>>1) With regard to the promise of DDLm (all ASCII) - none?
>
>I'm insufficiently informed to respond to that one.
>
>>2) With regard to processing crystallographic data output by e.g.
>>refinement software - none?
>
>As far as I know, no current refinement software outputs non-ASCII
>CIF content, except by using the limited and somewhat arcane system
>of ASCII elides described among the CIF 1.1 "Common Semantic
>Features" (and which technically is not part of the CIF 1.1 spec).
>If there are any that do otherwise then the files they produce may
>not conform to CIF 1.1. Any existing processing software that
>consumes CIFs therefore either will assume the character set to be
>restricted to ASCII, or will make some specific local provision for
>handling non-standard CIFs. Some such software may be able to
>immediately take advantage of the larger character repertoire
>afforded by Unicode, but a lot of software will need to be updated
>to make any use of it.
>
>I'm not sure any of that answers the question, though. What
>behaviors count as "processing"? To the extent that few
>crystallographic computations can be performed on non-numeric data,
>I see no special benefit for that kind of processing.
>
>On the other hand, I do see certain advantages to CIF being able to
>represent personal names without transliteration, as variant
>transliteration approaches applied to the same name sometimes
>produce different results. If the "processing" in question involves
>storing CIF data in a database then there are searching and
>normalization advantages to having names, at least, written in their
>native script. (The elide system covers many of these cases, at
>least for European names, but not all possible cases.)
>
>>3) With regard to richer content within data values - minimal?
>
>Again, names.
>
>Also, deprecating the elide system -- I understand that it is
>designed to be mnemonic, and it *is* easier to read than Unicode
>escape codes would be, but it's still limited and hard to read. I
>contend that this is one thing that is broken in CIF1 (whether you
>characterize the problem as an insufficient character repertoire or
>as an insufficient elide system).
>
>Plus, there are various non-ASCII characters in routine use in
>crystallography and related fields that it would be nice to
>represent directly, among them the degree symbol and many upper- and
>lower-case Greek letters. The elide system currently covers these,
>but again, it's uncomfortable and not an official standard.
>
>Furthermore, if there is some hope or expectation of CIF2 as an
>electronic representation of non-English manuscripts, then that
>virtually requires direct support for all the characters of the
>scripts in which such manuscripts will be written. The elide system
>is workable for short pieces of text, but only via machine
>translation could it be comfortable for longer texts.
>
>I think these amount to more than a minimal advantage for Unicode in
>data values.
>
>>In the latter case an extended character set can be represented
>>using an ASCII representation of Unicode (\\[ux]xxxxxx). Based on
>>my experience (and in light of the issues we've been discussing),
>>it will probably be considerably easier for a user to adapt to a
>>few extra ASCII control sequences than asking them to pay any
>>attention to the underlying text encodings. The same applies from a
>>developer's point of view - i.e. it's far easier to accept extended
>>ASCII control sequences than to try to determine the text encoding
>>(unless of course the encodings are unambiguously identifiable).
>
>Java / Python-style Unicode escapes have the advantages of covering
>all of Unicode, of providing an unambiguous encoding of an
>underlying Unicode text model, and of embedding that encoding in an
>ASCII-based host format.
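
A minimal sketch of the property described above, using Python's own escape notation rather than any notation proposed for CIF2. The sample string is made up for illustration. It shows arbitrary Unicode text carried through a pure-ASCII byte stream and recovered unchanged:

```python
# Sample text mixing Latin-1, Greek and a non-BMP character (illustrative only).
name = "\u00c5ngstr\u00f6m \u03b1 \U0001d54a"

# Encode to ASCII, replacing every non-ASCII character with a backslash escape.
escaped = name.encode("ascii", "backslashreplace")

# Every byte is now plain ASCII, so the host format stays ASCII-based.
all_ascii = all(b < 128 for b in escaped)

# Decoding the escapes restores the original text exactly.
restored = escaped.decode("unicode_escape")

print(escaped)
print(all_ascii, restored == name)  # True True
```

The round trip is only unambiguous here because the sample contains no literal backslashes; a real format would also have to say how a literal backslash is written.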
>
>They have the disadvantages of being difficult for a human to
>directly read or edit, and of introducing their own set of issues.
>For example, consider the following potential CIF2 fragment:
>
> _foo \u000A;bar\u000A;
>
>What is the value assigned to data name _foo? If the Unicode
>escapes are processed according to the Java model (i.e. as if
>replaced by the corresponding character prior to lexical analysis),
>then the value is bar. If the escapes are processed later, then the
>value is <LF>;bar<LF>;, apparently a "simple data value" as CIF 1.1
>calls them, but containing <LF> characters (in fact, this particular
>value cannot be represented in CIF 1 at all).
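
The two readings of the fragment above can be sketched in Python. This is a toy illustration only: `toy_parse` and `expand_escapes` are hypothetical helpers written for this example, not part of any CIF library, and the "lexer" understands just enough to tell a text field from a whitespace-delimited value.

```python
import re

ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")

def expand_escapes(text):
    """Replace each backslash-u escape with the character it names."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

def toy_parse(source):
    """Toy stand-in for a CIF lexer; returns (name, value) for one item.

    A value starting with <LF>; is read as a text field that ends at the
    next <LF>; otherwise the value is the next whitespace-delimited token.
    """
    name, rest = source.strip(" ").split(" ", 1)
    if rest.startswith("\n;"):
        end = rest.index("\n;", 2)
        return name, rest[2:end]
    return name, rest.split()[0]

fragment = r"_foo \u000A;bar\u000A;"

# Java model: escapes are expanded before lexical analysis.
early = toy_parse(expand_escapes(fragment))

# Alternative: lex first, then expand escapes inside the value.
name, raw = toy_parse(fragment)
late = (name, expand_escapes(raw))

print(early)  # ('_foo', 'bar')
print(late)   # ('_foo', '\n;bar\n;')
```

The same bytes give two different values for _foo, which is why the order of escape processing would have to be fixed by the spec.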
>
>These issues do not by any means block Unicode escapes from being
>adopted for CIF, but they do mean that taking such an approach
>requires some additional details to be settled, and that there will
>be interesting gotchas involved in adapting some existing CIF1
>software for CIF2.
>
>>Furthermore, extending the character set (however represented) does
>>not address issues such as representing mathematical
>>content in a CIF data value, nor images (imgCIF will not be fully
>>compliant with CIF2 - but please correct me if I'm wrong). There
>>are yet unexplored alternatives to enabling richer publication and
>>archival content using CIF, but they do not concern the fundamental
>>syntax/encoding.
>
>By "mathematical content" I suppose you mean formulae. I agree,
>formulae, images, and various other content types that might be of
>interest are not supported by a Unicode character model alone,
>however encoded. It was never my understanding that supporting such
>content types was a reason for switching to a Unicode character
>model, however much (or little) it might be advantageous to imgCIF.
>
>>So the leading ('forward thinking') motivation for basing CIF2 on
>>unicode lies in 'internationalization'. In the short/medium term I
>>don't imagine that introducing an extended character set through
>>unicode or multiple encodings is going to lead to any one/group
>>adopting the new CIF2 as the basis of their private/public data
>>archive/retrieval system. Hopefully they will take advantage of
>>what DDLm has to offer, though most likely by using third-party
>>software.
>
>I think that's missing the point. CIF already has to deal with
>internationalization issues, which it does, as best it can, via the
>elide system. Even in English it has to in some way provide a
>character model that extends beyond ASCII.
>
>>At this point in my train of thought, I might say stick to ASCII as
>>'internationalization' has not been widely called for by the
>>community and has minimal benefits at this time.
>
>As a practical matter, CIF already goes beyond ASCII. The usual
>manner in which it does so, however, is explicitly NOT standardized.
>Personally, I find this a sorry state of affairs indeed.
>
>> However, I think CIF should move forward in this respect. So how
>>do we achieve this? Unicode is the accepted answer? Unicode was
>>designed for this and has some established unambiguous encodings?
>
>I think Unicode or (almost) equivalently, ISO-10646, is indeed the
>accepted answer, at least inasmuch as ISO-10646 is an international
>standard. As far as I know, there is no competing standard of
>comparable scope.
>
>> The majority (including Microsoft) recommend adopting UTF-8 in
>>preference to other encodings?
>
>XML gives special status to UTF-8 as the encoding to assume in the
>absence of internal or external metadata directing otherwise.
>Nevertheless, XML also requires conformant processors to be able to
>recognize and handle UTF-16 (though not necessarily UTF-16LE,
>UTF-16BE, or other variants). I believe Microsoft NT-based
>operating systems internally use UCS-2 or UTF-16 for file names,
>depending on OS version and patch level. Microsoft and many others
>provide decent support for creating, reading, and editing Unicode
>text files encoded in UTF-8, but this frequently is not the default
>encoding. I am not aware of Microsoft in particular promoting UTF-8
>above locale-specific code pages, but it is my general, personal
>perception that UTF-8 use is broad, expanding, and widely
>recommended. However, I do not see UTF-8 or any other encoding ever
>being preferred over all others for all purposes.
>
>>So in the light of current CIF practice (i.e. unspecified-encoding
>>of ASCII text, where the encoding has never to my knowledge been a
>>problem), why not specify UTF-8 only, don't accommodate any
>>non-ASCII code points in the dictionaries (which is what is
>>proposed anyway?), and see what happens? :-) At worst a few users
>>will find that existing software will not handle the non-ASCII text
>>they have diligently included in their UTF-8 CIF (but this is
>>inevitable once you extend beyond ASCII). At best their text will
>>be handled as UTF-8 by CIF2 software.
>
>That is a possible way forward, and indeed, it is basically what is
>in the current spec. The main problem I see with it is that in
>practice, many people will create, use, and exchange (successfully
>or not) "CIFs" that are not UTF-8 encoded, regardless of what the
>spec says about that. Although it is certainly possible to declare
>that such files are not compliant CIFs, I don't see how that
>provides any benefit.
>
>>So what about the issue of accessing archived UTF-8 CIFs? Make it
>>clear to the recipient that the CIF will be encoded in UTF-8; if
>>for some reason they have trouble reading the CIF, point them at
>>appropriate UTF-8 software (preferably provide them with a fully
>>compliant CIF2 editor/viewer that introduces them to the benefits
>>of CIF2 and its support for unicode:-)
>
>And that is exactly the same thing that would be done if CIF2 did
>not specify a particular encoding.
>
>>Similarly, with day-to-day transmission of a CIF, if the CIF
>>doesn't contain any characters beyond the ASCII set, the chances
>>are there won't be any issues (there haven't been in the past?). If a
>>diligent user has followed the spec and prepared a UTF-8 CIF, again
>>the chances are it will be interpreted as UTF-8 (very few modern
>>systems struggle with UTF-8?).
>
>I'm not in a position to know how many encoding-related issues there
>may have been in the past. UTF-16 variants and EBCDIC variants are
>the only encodings I know that are in wide use and might present an
>interchange problem for CIF 1.1 compliant CIFs. They would present
>exactly the same problems if used to encode ASCII-only CIF2 text.
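
A small Python check of the point above, assuming a made-up two-line ASCII-only CIF fragment. The codec names are Python's; `cp037` is just one EBCDIC variant that happens to ship with Python.

```python
# An ASCII-only fragment (illustrative, not taken from a real CIF).
text = "data_example\n_cell.length_a 10.5\n"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")
ebcdic_bytes = text.encode("cp037")  # one EBCDIC code page

# UTF-8 is byte-for-byte identical to ASCII for ASCII-only text.
print(utf8_bytes == ascii_bytes)    # True

# UTF-16 and EBCDIC produce different bytes for the very same characters.
print(utf16_bytes == ascii_bytes)   # False
print(ebcdic_bytes == ascii_bytes)  # False
```

So a reader that assumes ASCII succeeds on the UTF-8 file by construction, and fails on the other two whether the file is called CIF 1.1 or CIF2.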
>
>>I fully expect to be 'shot down' on any number of my thoughts -
>>but, given the number of emails it has generated, I don't think it
>>is unreasonable to put this issue in the context of perceived
>>current practice (however narrow the viewpoint - others have
>>referred to CIF systems that I have no idea about)?
>
>It is not my goal to "shoot you down", or anyone else. I am not
>debating for the sake of the debate. I want CIF2 to be as
>technically sound and as practically useful as possible, and I don't
>foresee a lot of latitude for tweaking or revising it after it is
>adopted.
>
>I started by probing several areas where the draft spec seemed to
>give too little consideration to the implications of expanding the
>CIF character repertoire to all of Unicode. For the most part these
>have been resolved easily, but the issue of embedded U+FEFF
>characters was contentious (and still has not been resolved). That
>led into the related area of character encoding and text vs. binary,
>which has become such a brouhaha.
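
For reference, a short Python sketch of why embedded U+FEFF is a separate question from a leading byte-order mark. The sample content is invented, and the behaviour shown is that of Python's `utf-8-sig` codec only; it says nothing about what a CIF2 parser ought to do.

```python
import codecs

# A leading BOM followed by content that also contains U+FEFF mid-stream.
payload = "data_a\ufeff_b 1\n"
raw = codecs.BOM_UTF8 + payload.encode("utf-8")

# The utf-8-sig codec strips a BOM only at the very start of the stream.
decoded = raw.decode("utf-8-sig")

print(decoded.startswith("\ufeff"))  # False
print("\ufeff" in decoded)           # True
```

The decoder removes the leading mark and passes the embedded one through as an ordinary character, so its treatment is left to whatever consumes the decoded text.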
>
>Much of the disagreement over these contentious issues arises from
>CIF's split-personality design. It has always been promoted as a
>human-readable text format, yet it is intended largely to be
>produced and primarily to be consumed by computers. Humans and
>computers have different requirements, and it is not always possible
>to align them. XML followed a similar path, and nowadays the
>prevailing opinion seems to be that XML isn't well suited to direct
>human reading or modification. Opinion of CIF has not reached that
>point yet, and it's unclear whether it ever will do.
>
>Best,
>
>John
>--
>John C. Bollinger, Ph.D.
>Department of Structural Biology
>St. Jude Children's Research Hospital
>
>
>
>
>Email Disclaimer: www.stjude.org/emaildisclaimer
>_______________________________________________
>ddlm-group mailing list
>[email protected]
>http://scripts.iucr.org/mailman/listinfo/ddlm-group
--
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
[email protected]
=====================================================

