Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .... .. .. .
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .... .. .. .
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Thu, 24 Jun 2010 13:53:30 -0400
- In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA541661229529@SJMEMXMBS11.stjude.sjcrh.local>
- References: <AANLkTilyJE2mCxprlBYaSkysu1OBjY7otWrXDWm3oOT9@mail.gmail.com><AANLkTilolZk4SzLF8mzqOz4EagFJcEHDKOAblGMnoqpW@mail.gmail.com><alpine.BSF.2.00.1006212120510.91069@epsilon.pair.com><AANLkTiklvzlKquqlRQIrpPGZjJfuRzLqiv2E6Stcq6wd@mail.gmail.com><alpine.BSF.2.00.1006212241210.4105@epsilon.pair.com><AANLkTilACXxnPRtJXEjGD39eleDl9dxlAcwar8j9MBPr@mail.gmail.com><alpine.BSF.2.00.1006220753471.87930@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54166122951E@SJMEMXMBS11.stjude.sjcrh.local> <AANLkTikih0j6-vyLDPMOqcTkoiK545yE28y4fU9JTUa2@mail.gmail.com><20100623103310.GD15883@emerald.iucr.org><8F77913624F7524AACD2A92EAF3BFA541661229521@SJMEMXMBS11.stjude.sjcrh.local> <alpine.BSF.2.00.1006231033360.56372@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA541661229523@SJMEMXMBS11.stjude.sjcrh.local> <alpine.BSF.2.00.1006231406010.30894@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA541661229526@SJMEMXMBS11.stjude.sjcrh.local> <alpine.BSF.2.00.1006231550410.30894@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA541661229527@SJMEMXMBS11.stjude.sjcrh.local> <a06240802c848414681ef@[192.168.2.104]><381469.52475.qm@web87004.mail.ird.yahoo.com><8F77913624F7524AACD2A92EAF3BFA541661229529@SJMEMXMBS11.stjude.sjcrh.local>
Dear Colleagues,

It is an unfortunate reality that we seem unable to agree on this
issue and perhaps others related to CIF2 and DDLm. Perhaps we
need a meeting. If enough of us are at the ACA meeting in Chicago,
and a few others could join in via Skype, maybe we could make some
progress. Right now we seem to be going in circles.

Regards,
Herbert

At 12:24 PM -0500 6/24/10, Bollinger, John C wrote:
>On Wednesday, June 23, 2010 8:24 PM, SIMON WESTRIP wrote:
>>I've attempted to take a step back and look at the encoding problem
>>from the perspective of my working experience.
>
>Fair enough.
>
>[...]
>
>>To start with, please indulge me by putting aside the
>>philosophical/respectful ('internationalization') considerations.
>>What are the short/medium term benefits of extending CIF beyond ASCII text?
>>
>>1) With regard to the promise of DDLm (all ASCII) - none?
>
>I'm insufficiently informed to respond to that one.
>
>>2) With regard to processing crystallographic data output by e.g.
>>refinement software - none?
>
>As far as I know, no current refinement software outputs non-ASCII
>CIF content, except by using the limited and somewhat arcane system
>of ASCII elides described among the CIF 1.1 "Common Semantic
>Features" (and which technically is not part of the CIF 1.1 spec).
>If there are any that do otherwise then the files they produce may
>not conform to CIF 1.1. Any existing processing software that
>consumes CIFs therefore either will assume the character set to be
>restricted to ASCII, or will make some specific local provision for
>handling non-standard CIFs. Some such software may be able to
>immediately take advantage of the larger character repertoire
>afforded by Unicode, but a lot of software will need to be updated
>to make any use of it.
>
>I'm not sure any of that answers the question, though. What
>behaviors count as "processing"?
>To the extent that few crystallographic computations can be performed
>on non-numeric data, I see no special benefit for that kind of
>processing.
>
>On the other hand, I do see certain advantages to CIF being able to
>represent personal names without transliteration, as variant
>transliteration approaches applied to the same name sometimes
>produce different results. If the "processing" in question involves
>storing CIF data in a database then there are searching and
>normalization advantages to having names, at least, written in their
>native script. (The elide system covers many of these cases, at
>least for European names, but not all possible cases.)
>
>>3) With regard to richer content within data values - minimal?
>
>Again, names.
>
>Also, deprecating the elide system -- I understand that it is
>designed to be mnemonic, and it *is* easier to read than Unicode
>escape codes would be, but it's still limited and hard to read. I
>contend that this is one thing that is broken in CIF1 (whether you
>characterize the problem as an insufficient character repertoire or
>as an insufficient elide system).
>
>Plus, there are various non-ASCII characters in routine use in
>crystallography and related fields that it would be nice to
>represent directly, among them the degree symbol and many upper- and
>lower-case Greek letters. The elide system currently covers these,
>but again, it's uncomfortable and not an official standard.
>
>Furthermore, if there is some hope or expectation of CIF2 as an
>electronic representation of non-English manuscripts, then that
>virtually requires direct support for all the characters of the
>scripts in which such manuscripts will be written. The elide system
>is workable for short pieces of text, but only via machine
>translation could it be comfortable for longer texts.
>
>I think these amount to more than a minimal advantage for Unicode in
>data values.
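To make the readability trade-off of the elide system concrete, here is a minimal Python sketch that decodes a handful of elide codes into Unicode. The mapping is a small, partial subset written from memory of the CIF 1.1 "Common Semantic Features" and should be checked against that document before use; the names `decode_elides`, `ACCENTS`, and `SYMBOLS` are hypothetical and not part of any CIF library.

```python
import re
import unicodedata

# Assumed, partial subset of CIF 1.1 elide codes (verify against the spec).
# Accent codes precede the letter they modify, e.g. \"u for u-umlaut.
ACCENTS = {
    "'": "\u0301",   # acute
    "`": "\u0300",   # grave
    "^": "\u0302",   # circumflex
    '"': "\u0308",   # umlaut / diaeresis
    "~": "\u0303",   # tilde
    ",": "\u0327",   # cedilla
}
# A few stand-alone codes: three Greek letters and the degree sign.
SYMBOLS = {"a": "\u03b1", "b": "\u03b2", "g": "\u03b3", "%": "\u00b0"}

_ELIDE = re.compile(r"\\(['`^\"~,])([A-Za-z])|\\([abg%])")


def decode_elides(text):
    """Replace the elide codes known to this sketch with Unicode characters."""
    def repl(match):
        if match.group(1):
            # Letter followed by a combining mark, composed to a single code point.
            combined = match.group(2) + ACCENTS[match.group(1)]
            return unicodedata.normalize("NFC", combined)
        return SYMBOLS[match.group(3)]
    return _ELIDE.sub(repl, text)


name = decode_elides(r"M\"uller")
angle = decode_elides(r"\b = 101.2\%")
```

Under these assumptions `name` holds the author name in its native spelling and `angle` holds a Greek beta followed by a value with a degree sign. Anything outside the small table above is left untouched.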
>
>>In the latter case an extended character set can be represented
>>using an ASCII representation of Unicode (\\[ux]xxxxxx). Based on
>>my experience (and in light of the issues we've been discussing),
>>it will probably be considerably easier for a user to adapt to a
>>few extra ASCII control sequences than asking them to pay any
>>attention to the underlying text encodings. The same applies from a
>>developer's point of view - i.e. it's far easier to accept extended
>>ASCII control sequences than to try to determine the text encoding
>>(unless of course the encodings are unambiguously identifiable).
>
>Java / Python-style Unicode escapes have the advantages of covering
>all of Unicode, of providing an unambiguous encoding of an
>underlying Unicode text model, and of embedding that encoding in an
>ASCII-based host format.
>
>They have the disadvantages of being difficult for a human to
>directly read or edit, and of introducing their own set of issues.
>For example, consider the following potential CIF2 fragment:
>
>  _foo \u000A;bar\u000A;
>
>What is the value assigned to data name _foo? If the Unicode
>escapes are processed according to the Java model (i.e. as if
>replaced by the corresponding character prior to lexical analysis),
>then the value is bar. If the escapes are processed later, then the
>value is <LF>;bar<LF>;, apparently a "simple data value" as CIF 1.1
>calls them, but containing <LF> characters (in fact, this particular
>value cannot be represented in CIF 1 at all).
>
>These issues do not by any means block Unicode escapes from being
>adopted for CIF, but they do mean that taking such an approach
>requires some additional details to be settled, and that there will
>be interesting gotchas involved in adapting some existing CIF1
>software for CIF2.
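The ambiguity in the `_foo` fragment can be reproduced with a short Python sketch. It is only an illustration: `parse_item` is a hypothetical toy parser that understands a single `_name value` pair, not a CIF implementation, and `expand_escapes` handles only four-digit `\uXXXX` escapes.

```python
import re

_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def expand_escapes(text):
    """Replace each \\uXXXX escape with the character it names."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def parse_item(text):
    """Toy parser for one '_name value' pair (hypothetical, for illustration).

    A value that starts with <LF>; is read as a semicolon-delimited text
    field; anything else is returned as a simple value.
    """
    name, rest = text.split(" ", 1)
    if rest.startswith("\n;"):
        end = rest.index("\n;", 2)   # closing <LF>; delimiter
        return name, rest[2:end]
    return name, rest


fragment = r"_foo \u000A;bar\u000A;"

# Model A (Java-like): expand escapes before lexical analysis.
name_a, value_a = parse_item(expand_escapes(fragment))

# Model B: tokenize first, then expand escapes inside the value.
name_b, raw_b = parse_item(fragment)
value_b = expand_escapes(raw_b)
```

Under Model A the value is the text field `bar`; under Model B it is a seven-character simple value beginning and ending with `<LF>;`. The same input therefore yields two different data values depending only on when the escapes are expanded.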
>
>>Furthermore, extending the character set (however represented) does
>>not address issues such as representing mathematical
>>content in a CIF data value, nor images (imgCIF will not be fully
>>compliant with CIF2 - but please correct me if I'm wrong). There
>>are yet unexplored alternatives to enabling richer publication and
>>archival content using CIF, but they do not concern the fundamental
>>syntax/encoding.
>
>By "mathematical content" I suppose you mean formulae. I agree,
>formulae, images, and various other content types that might be of
>interest are not supported by a Unicode character model alone,
>however encoded. It was never my understanding that supporting such
>content types was a reason for switching to a Unicode character
>model, however much (or little) it might be advantageous to imgCIF.
>
>>So the leading ('forward thinking') motivation for basing CIF2 on
>>unicode lies in 'internationalization'. In the short/medium term I
>>don't imagine that introducing an extended character set through
>>unicode or multiple encodings is going to lead to any one/group
>>adopting the new CIF2 as the basis of their private/public data
>>archive/retrieval system. Hopefully they will take advantage of
>>what DDLm has to offer, though most likely by using third-party
>>software.
>
>I think that's missing the point. CIF already has to deal with
>internationalization issues, which it does, as best it can, via the
>elide system. Even in English it has to in some way provide a
>character model that extends beyond ASCII.
>
>>At this point in my train of thought, I might say stick to ASCII as
>>'internationalization' has not been widely called for by the
>>community and has minimal benefits at this time.
>
>As a practical matter, CIF already goes beyond ASCII. The usual
>manner in which it does so, however, is explicitly NOT standardized.
>Personally, I find this a sorry state of affairs indeed.
>
>> However, I think CIF should move forward in this respect.
>>So how do we achieve this? Unicode is the accepted answer? Unicode was
>>designed for this and has some established unambiguous encodings?
>
>I think Unicode or (almost) equivalently, ISO-10646, is indeed the
>accepted answer, at least inasmuch as ISO-10646 is an international
>standard. As far as I know, there is no competing standard of
>comparable scope.
>
>> The majority (including Microsoft) recommend adopting UTF-8 in
>>preference to other encodings?
>
>XML gives special status to UTF-8 as the encoding to assume in the
>absence of internal or external metadata directing otherwise.
>Nevertheless, XML also requires conformant processors to be able to
>recognize and handle UTF-16 (though not necessarily UTF-16LE,
>UTF-16BE, or other variants). I believe Microsoft NT-based
>operating systems internally use UCS-2 or UTF-16 for file names,
>depending on OS version and patch level. Microsoft and many others
>provide decent support for creating, reading, and editing Unicode
>text files encoded in UTF-8, but this frequently is not the default
>encoding. I am not aware of Microsoft in particular promoting UTF-8
>above locale-specific code pages, but it is my general, personal
>perception that UTF-8 use is broad, expanding, and widely
>recommended. However, I do not see UTF-8 or any other encoding ever
>being preferred over all others for all purposes.
>
>>So in the light of current CIF practice (i.e. unspecified-encoding
>>of ASCII text, where the encoding has never to my knowledge been a
>>problem), why not specify UTF-8 only, don't accommodate any
>>non-ASCII code points in the dictionaries (which is what is
>>proposed anyway?), and see what happens? :-) At worst a few users
>>will find that existing software will not handle the non-ASCII text
>>they have diligently included in their UTF-8 CIF (but this is
>>inevitable once you extend beyond ASCII). At best their text will
>>be handled as UTF-8 by CIF2 software.
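The "UTF-8 only" suggestion leans on two byte-level properties of UTF-8: ASCII-only text encodes to exactly the same bytes in ASCII and in UTF-8, and bytes written in a legacy single-byte encoding usually fail strict UTF-8 decoding once non-ASCII characters are involved. The Python sketch below illustrates both. The data names and values are made-up sample content, and `is_valid_utf8` is a hypothetical helper, not part of any CIF toolkit.

```python
def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8 (strict mode)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# An ASCII-only CIF-like fragment (made-up sample content).
ascii_cif = "data_example\n_cell_length_a 10.5\n"
same_bytes = ascii_cif.encode("utf-8") == ascii_cif.encode("ascii")

# The same non-ASCII text written under two different encodings.
author = "_publ_author_name 'M\u00fcller'\n"
as_utf8 = author.encode("utf-8")
as_latin1 = author.encode("latin-1")

utf8_ok = is_valid_utf8(as_utf8)
latin1_ok = is_valid_utf8(as_latin1)
```

Here `same_bytes` and `utf8_ok` are true while `latin1_ok` is false, so a reader that insists on UTF-8 accepts existing ASCII-only files unchanged and can at least detect this particular Latin-1 file. The check is not a general encoding detector: some non-UTF-8 byte sequences do happen to decode as valid UTF-8.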
>
>That is a possible way forward, and indeed, it is basically what is
>in the current spec. The main problem I see with it is that in
>practice, many people will create, use, and exchange (successfully
>or not) "CIFs" that are not UTF-8 encoded, regardless of what the
>spec says about that. Although it is certainly possible to declare
>that such files are not compliant CIFs, I don't see how that
>provides any benefit.
>
>>So what about the issue of accessing archived UTF-8 CIFs? Make it
>>clear to the recipient that the CIF will be encoded in UTF-8; if
>>for some reason they have trouble reading the CIF, point them at
>>appropriate UTF-8 software (preferably provide them with a fully
>>compliant CIF2 editor/viewer that introduces them to the benefits
>>of CIF2 and its support for unicode:-)
>
>And that is exactly the same thing that would be done if CIF2 did
>not specify a particular encoding.
>
>>Similarly, with day-to-day transmission of a CIF, if the CIF
>>doesn't contain any characters beyond the ASCII set, the chances
>>are there won't be any issues (there haven't been in the past?). If a
>>diligent user has followed the spec and prepared a UTF-8 CIF, again
>>the chances are it will be interpreted as UTF-8 (very few modern
>>systems struggle with UTF-8?).
>
>I'm not in a position to know how many encoding-related issues there
>may have been in the past. UTF-16 variants and EBCDIC variants are
>the only encodings I know that are in wide use and might present an
>interchange problem for CIF 1.1 compliant CIFs. They would present
>exactly the same problems if used to encode ASCII-only CIF2 text.
>
>>I fully expect to be 'shot down' on any number of my thoughts -
>>but, given the amount of emails it has generated, I don't think it
>>is unreasonable to put this issue in the context of perceived
>>current practice (however narrow the viewpoint - others have
>>referred to CIF systems that I have no idea about)?
>
>It is not my goal to "shoot you down", or anyone else.
>I am not debating for the sake of the debate. I want CIF2 to be as
>technically sound and as practically useful as possible, and I don't
>foresee a lot of latitude for tweaking or revising it after it is
>adopted.
>
>I started by probing several areas where the draft spec seemed to
>give too little consideration to the implications of expanding the
>CIF character repertoire to all of Unicode. For the most part these
>have been resolved easily, but the issue of embedded U+FEFF
>characters was contentious (and still has not been resolved). That
>led into the related area of character encoding and text vs. binary,
>which has become such a brouhaha.
>
>Much of the disagreement over these contentious issues arises from
>CIF's split-personality design. It has always been promoted as a
>human-readable text format, yet it is intended largely to be
>produced and primarily to be consumed by computers. Humans and
>computers have different requirements, and it is not always possible
>to align them. XML followed a similar path, and nowadays the
>prevailing opinion seems to be that XML isn't well suited to direct
>human reading or modification. Opinion of CIF has not reached that
>point yet, and it's unclear whether it ever will do.
>
>Best,
>
>John
>--
>John C. Bollinger, Ph.D.
>Department of Structural Biology
>St. Jude Children's Research Hospital
>
>_______________________________________________
>ddlm-group mailing list
>ddlm-group@iucr.org
>http://scripts.iucr.org/mailman/listinfo/ddlm-group

--
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================