Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .... .. .
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .... .. .
- From: James Hester <jamesrhester@gmail.com>
- Date: Fri, 25 Jun 2010 12:47:22 +1000
- In-Reply-To: <a06240801c84949b70cb7@192.168.27.100>
- References: <AANLkTilyJE2mCxprlBYaSkysu1OBjY7otWrXDWm3oOT9@mail.gmail.com><AANLkTilolZk4SzLF8mzqOz4EagFJcEHDKOAblGMnoqpW@mail.gmail.com><alpine.BSF.2.00.1006212120510.91069@epsilon.pair.com><AANLkTiklvzlKquqlRQIrpPGZjJfuRzLqiv2E6Stcq6wd@mail.gmail.com><alpine.BSF.2.00.1006212241210.4105@epsilon.pair.com><AANLkTilACXxnPRtJXEjGD39eleDl9dxlAcwar8j9MBPr@mail.gmail.com><alpine.BSF.2.00.1006220753471.87930@epsilon.pair.com><AANLkTikih0j6-vyLDPMOqcTkoiK545yE28y4fU9JTUa2@mail.gmail.com><20100623103310.GD15883@emerald.iucr.org><alpine.BSF.2.00.1006231033360.56372@epsilon.pair.com><alpine.BSF.2.00.1006231406010.30894@epsilon.pair.com><a06240802c848414681ef@192.168.2.104><381469.52475.qm@web87004.mail.ird.yahoo.com><a06240801c84949b70cb7@192.168.27.100>
I don't think we are quite going around in circles; but it is very time-consuming exploring every point that is made to determine its value and relevance. Such methodical work can be done in a more considered fashion by email, or even better with a wiki page. To that end, I plan to collect together a summary of all the points of view that have been expressed so far, as a basis for further discussion.

On Fri, Jun 25, 2010 at 3:53 AM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
> Dear Colleagues,
>
> It is an unfortunate reality that we seem unable to agree on this issue
> and perhaps others related to CIF2 and DDLm. Perhaps we need a meeting.
> If enough of us are at the ACA meeting in Chicago, and a few others
> could join in via Skype, maybe we could make some progress.
>
> Right now we seem to be going in circles.
>
> Regards,
> Herbert
>
> At 12:24 PM -0500 6/24/10, Bollinger, John C wrote:
>>On Wednesday, June 23, 2010 8:24 PM, SIMON WESTRIP wrote:
>>>I've attempted to take a step back and look at the encoding problem
>>>from the perspective of my working experience.
>>
>>Fair enough.
>>
>>[...]
>>
>>>To start with, please indulge me by putting aside the
>>>philosophical/respectful ('internationalization') considerations.
>>>What are the short/medium term benefits of extending CIF beyond ASCII text?
>>>
>>>1) With regard to the promise of DDLm (all ASCII) - none?
>>
>>I'm insufficiently informed to respond to that one.
>>
>>>2) With regard to processing crystallographic data output by e.g.
>>>refinement software - none?
>>
>>As far as I know, no current refinement software outputs non-ASCII
>>CIF content, except by using the limited and somewhat arcane system
>>of ASCII elides described among the CIF 1.1 "Common Semantic
>>Features" (and which technically is not part of the CIF 1.1 spec).
>>If there are any that do otherwise then the files they produce may
>>not conform to CIF 1.1.
>>Any existing processing software that
>>consumes CIFs therefore either will assume the character set to be
>>restricted to ASCII, or will make some specific local provision for
>>handling non-standard CIFs. Some such software may be able to
>>immediately take advantage of the larger character repertoire
>>afforded by Unicode, but a lot of software will need to be updated
>>to make any use of it.
>>
>>I'm not sure any of that answers the question, though. What
>>behaviors count as "processing"? To the extent that few
>>crystallographic computations can be performed on non-numeric data,
>>I see no special benefit for that kind of processing.
>>
>>On the other hand, I do see certain advantages to CIF being able to
>>represent personal names without transliteration, as variant
>>transliteration approaches applied to the same name sometimes
>>produce different results. If the "processing" in question involves
>>storing CIF data in a database then there are searching and
>>normalization advantages to having names, at least, written in their
>>native script. (The elide system covers many of these cases, at
>>least for European names, but not all possible cases.)
>>
>>>3) With regard to richer content within data values - minimal?
>>
>>Again, names.
>>
>>Also, deprecating the elide system -- I understand that it is
>>designed to be mnemonic, and it *is* easier to read than Unicode
>>escape codes would be, but it's still limited and hard to read. I
>>contend that this is one thing that is broken in CIF1 (whether you
>>characterize the problem as an insufficient character repertoire or
>>as an insufficient elide system).
>>
>>Plus, there are various non-ASCII characters in routine use in
>>crystallography and related fields that it would be nice to
>>represent directly, among them the degree symbol and many upper- and
>>lower-case Greek letters. The elide system currently covers these,
>>but again, it's uncomfortable and not an official standard.
>>
>>Furthermore, if there is some hope or expectation of CIF2 as an
>>electronic representation of non-English manuscripts, then that
>>virtually requires direct support for all the characters of the
>>scripts in which such manuscripts will be written. The elide system
>>is workable for short pieces of text, but only via machine
>>translation could it be comfortable for longer texts.
>>
>>I think these amount to more than a minimal advantage for Unicode in
>>data values.
>>
>>>In the latter case an extended character set can be represented
>>>using an ASCII representation of Unicode (\\[ux]xxxxxx). Based on
>>>my experience (and in light of the issues we've been discussing),
>>>it will probably be considerably easier for a user to adapt to a
>>>few extra ASCII control sequences than asking them to pay any
>>>attention to the underlying text encodings. The same applies from a
>>>developer's point of view - i.e. it's far easier to accept extended
>>>ASCII control sequences than to try to determine the text encoding
>>>(unless of course the encodings are unambiguously identifiable).
>>
>>Java / Python-style Unicode escapes have the advantages of covering
>>all of Unicode, of providing an unambiguous encoding of an
>>underlying Unicode text model, and of embedding that encoding in an
>>ASCII-based host format.
>>
>>They have the disadvantages of being difficult for a human to
>>directly read or edit, and of introducing their own set of issues.
>>For example, consider the following potential CIF2 fragment:
>>
>>   _foo \u000A;bar\u000A;
>>
>>What is the value assigned to data name _foo? If the Unicode
>>escapes are processed according to the Java model (i.e. as if
>>replaced by the corresponding character prior to lexical analysis),
>>then the value is bar.
>>If the escapes are processed later, then the
>>value is <LF>;bar<LF>;, apparently a "simple data value" as CIF 1.1
>>calls them, but containing <LF> characters (in fact, this particular
>>value cannot be represented in CIF 1 at all).
>>
>>These issues do not by any means block Unicode escapes from being
>>adopted for CIF, but they do mean that taking such an approach
>>requires some additional details to be settled, and that there will
>>be interesting gotchas involved in adapting some existing CIF1
>>software for CIF2.
>>
>>>Furthermore, extending the character set (however represented) does
>>>not address issues such as representing mathematical
>>>content in a CIF data value, nor images (imgCIF will not be fully
>>>compliant with CIF2 - but please correct me if I'm wrong). There
>>>are yet unexplored alternatives to enabling richer publication and
>>>archival content using CIF, but they do not concern the fundamental
>>>syntax/encoding.
>>
>>By "mathematical content" I suppose you mean formulae. I agree,
>>formulae, images, and various other content types that might be of
>>interest are not supported by a Unicode character model alone,
>>however encoded. It was never my understanding that supporting such
>>content types was a reason for switching to a Unicode character
>>model, however much (or little) it might be advantageous to imgCIF.
>>
>>>So the leading ('forward thinking') motivation for basing CIF2 on
>>>unicode lies in 'internationalization'. In the short/medium term I
>>>don't imagine that introducing an extended character set through
>>>unicode or multiple encodings is going to lead to any one/group
>>>adopting the new CIF2 as the basis of their private/public data
>>>archive/retrieval system. Hopefully they will take advantage of
>>>what DDLm has to offer, though most likely by using third-party
>>>software.
>>
>>I think that's missing the point.
>>CIF already has to deal with
>>internationalization issues, which it does, as best it can, via the
>>elide system. Even in English it has to in some way provide a
>>character model that extends beyond ASCII.
>>
>>>At this point in my train of thought, I might say stick to ASCII as
>>>'internationalization' has not been widely called for by the
>>>community and has minimal benefits at this time.
>>
>>As a practical matter, CIF already goes beyond ASCII. The usual
>>manner in which it does so, however, is explicitly NOT standardized.
>>Personally, I find this a sorry state of affairs indeed.
>>
>>> However, I think CIF should move forward in this respect. So how
>>>do we achieve this? Unicode is the accepted answer? Unicode was
>>>designed for this and has some established unambiguous encodings?
>>
>>I think Unicode or (almost) equivalently, ISO-10646, is indeed the
>>accepted answer, at least inasmuch as ISO-10646 is an international
>>standard. As far as I know, there is no competing standard of
>>comparable scope.
>>
>>> The majority (including Microsoft) recommend adopting UTF-8 in
>>>preference to other encodings?
>>
>>XML gives special status to UTF-8 as the encoding to assume in the
>>absence of internal or external metadata directing otherwise.
>>Nevertheless, XML also requires conformant processors to be able to
>>recognize and handle UTF-16 (though not necessarily UTF-16LE,
>>UTF-16BE, or other variants). I believe Microsoft NT-based
>>operating systems internally use UCS-2 or UTF-16 for file names,
>>depending on OS version and patch level. Microsoft and many others
>>provide decent support for creating, reading, and editing Unicode
>>text files encoded in UTF-8, but this frequently is not the default
>>encoding. I am not aware of Microsoft in particular promoting UTF-8
>>above locale-specific code pages, but it is my general, personal
>>perception that UTF-8 use is broad, expanding, and widely
>>recommended.
>>However, I do not see UTF-8 or any other encoding ever
>>being preferred over all others for all purposes.
>>
>>>So in the light of current CIF practice (i.e. unspecified-encoding
>>>of ASCII text, where the encoding has never to my knowledge been a
>>>problem), why not specify UTF-8 only, don't accommodate any
>>>non-ASCII code points in the dictionaries (which is what is
>>>proposed anyway?), and see what happens? :-) At worst a few users
>>>will find that existing software will not handle the non-ASCII text
>>>they have diligently included in their UTF-8 CIF (but this is
>>>inevitable once you extend beyond ASCII). At best their text will
>>>be handled as UTF-8 by CIF2 software.
>>
>>That is a possible way forward, and indeed, it is basically what is
>>in the current spec. The main problem I see with it is that in
>>practice, many people will create, use, and exchange (successfully
>>or not) "CIFs" that are not UTF-8 encoded, regardless of what the
>>spec says about that. Although it is certainly possible to declare
>>that such files are not compliant CIFs, I don't see how that
>>provides any benefit.
>>
>>>So what about the issue of accessing archived UTF-8 CIFs? Make it
>>>clear to the recipient that the CIF will be encoded in UTF-8; if
>>>for some reason they have trouble reading the CIF, point them at
>>>appropriate UTF-8 software (preferably provide them with a fully
>>>compliant CIF2 editor/viewer that introduces them to the benefits
>>>of CIF2 and its support for unicode:-)
>>
>>And that is exactly the same thing that would be done if CIF2 did
>>not specify a particular encoding.
>>
>>>Similarly, with day-to-day transmission of a CIF, if the CIF
>>>doesn't contain any characters beyond the ASCII set, the chances
>>>are there won't be any issues (there haven't been in the past?). If a
>>>diligent user has followed the spec and prepared a UTF-8 CIF, again
>>>the chances are it will be interpreted as UTF-8 (very few modern
>>>systems struggle with UTF-8?).
>>
>>I'm not in a position to know how many encoding-related issues there
>>may have been in the past. UTF-16 variants and EBCDIC variants are
>>the only encodings I know that are in wide use and might present an
>>interchange problem for CIF 1.1 compliant CIFs. They would present
>>exactly the same problems if used to encode ASCII-only CIF2 text.
>>
>>>I fully expect to be 'shot down' on any number of my thoughts -
>>>but, given the amount of emails it has generated, I don't think it
>>>is unreasonable to put this issue in the context of perceived
>>>current practice (however narrow the viewpoint - others have
>>>referred to CIF systems that I have no idea about)?
>>
>>It is not my goal to "shoot you down", or anyone else. I am not
>>debating for the sake of the debate. I want CIF2 to be as
>>technically sound and as practically useful as possible, and I don't
>>foresee a lot of latitude for tweaking or revising it after it is
>>adopted.
>>
>>I started by probing several areas where the draft spec seemed to
>>give too little consideration to the implications of expanding the
>>CIF character repertoire to all of Unicode. For the most part these
>>have been resolved easily, but the issue of embedded U+FEFF
>>characters was contentious (and still has not been resolved). That
>>led into the related area of character encoding and text vs. binary,
>>which has become such a brouhaha.
>>
>>Much of the disagreement over these contentious issues arises from
>>CIF's split-personality design. It has always been promoted as a
>>human-readable text format, yet it is intended largely to be
>>produced and primarily to be consumed by computers. Humans and
>>computers have different requirements, and it is not always possible
>>to align them. XML followed a similar path, and nowadays the
>>prevailing opinion seems to be that XML isn't well suited to direct
>>human reading or modification. Opinion of CIF has not reached that
>>point yet, and it's unclear whether it ever will do.
>>
>>Best,
>>
>>John
>>--
>>John C. Bollinger, Ph.D.
>>Department of Structural Biology
>>St. Jude Children's Research Hospital
>>
>>
>>Email Disclaimer: www.stjude.org/emaildisclaimer
>>_______________________________________________
>>ddlm-group mailing list
>>ddlm-group@iucr.org
>>http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
>
> --
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  yaya@dowling.edu
> =====================================================
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
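
The question raised in the quoted message about the fragment _foo \u000A;bar\u000A; can be illustrated with a minimal Python sketch. It is not a CIF parser: unescape() and lex_value() are hypothetical helpers invented for this illustration, and the toy lexer recognises only a data name followed by either a semicolon-delimited text field or a whitespace-delimited simple value. Under those assumptions it shows how the two processing orders give different values for _foo.

```python
import re

# Matches a four-hex-digit escape such as \u000A (assumed escape syntax).
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")
# A data name followed by the whitespace that separates it from its value.
NAME = re.compile(r"(_\S+)([ \t\r\n]+)")


def unescape(text):
    # Replace each escape with the character it names.
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def lex_value(text):
    # Toy lexer (hypothetical): return the value that follows one data name.
    match = NAME.match(text)
    separator, rest = match.group(2), text[match.end():]
    if separator.endswith("\n") and rest.startswith(";"):
        # Text field: opened by ';' at the start of a line and
        # closed by the next line that starts with ';'.
        body = rest[1:]
        return body[: body.index("\n;")]
    # Otherwise treat the next whitespace-delimited token as a simple value.
    return rest.split()[0]


fragment = r"_foo \u000A;bar\u000A;"

# Java-like model: expand escapes first, then run the lexer.
early = lex_value(unescape(fragment))

# Alternative model: run the lexer first, then expand escapes in the value.
late = unescape(lex_value(fragment))

print(repr(early))
print(repr(late))
```

With early expansion the lexer sees a text field whose content is bar; with late expansion it sees a single token whose expanded form is <LF>;bar<LF>;, which matches the two readings described in the quoted message. Which order a real CIF2 processor should use is exactly the detail the message says would need to be settled.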