
Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .... .. .

I don't think we are quite going around in circles; but it is very
time-consuming exploring every point that is made to determine its
value and relevance.  Such methodical work can be done in a more
considered fashion by email, or even better with a wiki page.  To that
end, I plan to collect together a summary of all the points of view
that have been expressed so far, as a basis for further discussion.

On Fri, Jun 25, 2010 at 3:53 AM, Herbert J. Bernstein
<yaya@bernstein-plus-sons.com> wrote:
> Dear Colleagues,
>
>   It is an unfortunate reality that we seem unable to agree on this issue
> and perhaps others related to CIF2 and DDLm.  Perhaps we need a meeting.
> If enough of us are at the ACA meeting in Chicago, and a few others
> could join in via Skype, maybe we could make some progress.
>
>   Right now we seem to be going in circles.
>
>   Regards,
>     Herbert
>
>
>
> At 12:24 PM -0500 6/24/10, Bollinger, John C wrote:
>>On Wednesday, June 23, 2010 8:24 PM, SIMON WESTRIP wrote:
>>>I've attempted to take a step back and look at the encoding problem
>>>from the perspective of my working experience.
>>
>>Fair enough.
>>
>>[...]
>>
>>>To start with, please indulge me by putting aside the
>>>philosophical/respectful ('internationalization') considerations.
>>>What are the short/medium term benefits of extending CIF beyond ASCII text?
>>>
>>>1) With regard to the promise of DDLm (all ASCII) - none?
>>
>>I'm insufficiently informed to respond to that one.
>>
>>>2) With regard to processing crystallographic data output by e.g.
>>>refinement software - none?
>>
>>As far as I know, no current refinement software outputs non-ASCII
>>CIF content, except by using the limited and somewhat arcane system
>>of ASCII elides described among the CIF 1.1 "Common Semantic
>>Features" (and which technically is not part of the CIF 1.1 spec).
>>If there are any that do otherwise then the files they produce may
>>not conform to CIF 1.1.  Any existing processing software that
>>consumes CIFs therefore either will assume the character set to be
>>restricted to ASCII, or will make some specific local provision for
>>handling non-standard CIFs.  Some such software may be able to
>>immediately take advantage of the larger character repertoire
>>afforded by Unicode, but a lot of software will need to be updated
>>to make any use of it.
>>
>>I'm not sure any of that answers the question, though.  What
>>behaviors count as "processing"?  To the extent that few
>>crystallographic computations can be performed on non-numeric data,
>>I see no special benefit for that kind of processing.
>>
>>On the other hand, I do see certain advantages to CIF being able to
>>represent personal names without transliteration, as variant
>>transliteration approaches applied to the same name sometimes
>>produce different results.  If the "processing" in question involves
>>storing CIF data in a database then there are searching and
>>normalization advantages to having names, at least, written in their
>>native script.  (The elide system covers many of these cases, at
>>least for European names, but not all possible cases.)
>>
>>>3) With regard to richer content within data values - minimal?
>>
>>Again, names.
>>
>>Also, deprecating the elide system -- I understand that it is
>>designed to be mnemonic, and it *is* easier to read than Unicode
>>escape codes would be, but it's still limited and hard to read.  I
>>contend that this is one thing that is broken in CIF1 (whether you
>>characterize the problem as an insufficient character repertoire or
>>as an insufficient elide system).
>>
>>Plus, there are various non-ASCII characters in routine use in
>>crystallography and related fields that it would be nice to
>>represent directly, among them the degree symbol and many upper- and
>>lower-case Greek letters.  The elide system currently covers these,
>>but again, it's uncomfortable and not an official standard.
>>
>>Furthermore, if there is some hope or expectation of CIF2 as an
>>electronic representation of non-English manuscripts, then that
>>virtually requires direct support for all the characters of the
>>scripts in which such manuscripts will be written.  The elide system
>>is workable for short pieces of text, but only via machine
>>translation could it be comfortable for longer texts.
>>
>>I think these amount to more than a minimal advantage for Unicode in
>>data values.
>>
>>>In the latter case an extended character set can be represented
>>>using an ASCII representation of Unicode (\\[ux]xxxxxx). Based on
>>>my experience (and in light of the issues we've been discussing),
>>>it will probably be considerably easier for a user to adapt to a
>>>few extra ASCII control sequences than asking them to pay any
>>>attention to the underlying text encodings. The same applies from a
>>>developer's point of view - i.e. it's far easier to accept extended
>>>ASCII control sequences than to try to determine the text encoding
>>>(unless of course the encodings are unambiguously identifiable).
>>
>>Java / Python-style Unicode escapes have the advantages of covering
>>all of Unicode, of providing an unambiguous encoding of an
>>underlying Unicode text model, and of embedding that encoding in an
>>ASCII-based host format.
>>
>>They have the disadvantages of being difficult for a human to
>>directly read or edit, and of introducing their own set of issues.
>>For example, consider the following potential CIF2 fragment:
>>
>>         _foo \u000A;bar\u000A;
>>
>>What is the value assigned to data name _foo?  If the Unicode
>>escapes are processed according to the Java model (i.e. as if
>>replaced by the corresponding character prior to lexical analysis),
>>then the value is bar.  If the escapes are processed later, then the
>>value is <LF>;bar<LF>;, apparently a "simple data value" as CIF 1.1
>>calls them, but containing <LF> characters (in fact, this particular
>>value cannot be represented in CIF 1 at all).
>>
>>These issues do not by any means block Unicode escapes from being
>>adopted for CIF, but they do mean that taking such an approach
>>requires some additional details to be settled, and that there will
>>be interesting gotchas involved in adapting some existing CIF1
>>software for CIF2.
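
A minimal Python sketch of the two processing orders described above, for the fragment `_foo \u000A;bar\u000A;`. This is not a CIF parser: the function names are hypothetical, only four-digit `\uXXXX` escapes are handled, and the "lexer" is reduced to recognising a semicolon-delimited text field or a whitespace-delimited token.

```python
import re

# Hypothetical, heavily simplified stand-ins for a real CIF lexer.
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")
TEXT_FIELD = re.compile(r"\n;(.*?)\n;", re.DOTALL)


def expand_escapes(text):
    """Replace each \\uXXXX escape with the character it names."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def value_escapes_first(fragment):
    """Java-style model: expand escapes, then do lexical analysis."""
    expanded = expand_escapes(fragment)
    match = TEXT_FIELD.search(expanded)
    # After expansion the lexer sees a <LF>;...<LF>; text field.
    return match.group(1) if match else expanded.split(None, 1)[1]


def value_escapes_last(fragment):
    """Alternative model: tokenize first, expand escapes inside the value."""
    raw_token = fragment.split(None, 1)[1]
    return expand_escapes(raw_token)


fragment = r"_foo \u000A;bar\u000A;"
early_value = value_escapes_first(fragment)
late_value = value_escapes_last(fragment)

print(repr(early_value))  # 'bar'
print(repr(late_value))   # '\n;bar\n;'
```

The same input yields two different values, so a specification that adopts escapes would need to state which order applies.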
>>
>>>Furthermore, extending the character set (however represented) does
>>>not address issues such as representing mathematical
>>>content in a CIF data value, nor images (imgCIF will not be fully
>>>compliant with CIF2 - but please correct me if I'm wrong). There
>>>are yet unexplored alternatives to enabling richer publication and
>>>archival content using CIF, but they do not concern the fundamental
>>>syntax/encoding.
>>
>>By "mathematical content" I suppose you mean formulae.  I agree,
>>formulae, images, and various other content types that might be of
>>interest are not supported by a Unicode character model alone,
>>however encoded.  It was never my understanding that supporting such
>>content types was a reason for switching to a Unicode character
>>model, however much (or little) it might be advantageous to imgCIF.
>>
>>>So the leading ('forward thinking') motivation for basing CIF2 on
>>>Unicode lies in 'internationalization'. In the short/medium term I
>>>don't imagine that introducing an extended character set through
>>>Unicode or multiple encodings is going to lead to any one/group
>>>adopting the new CIF2 as the basis of their private/public data
>>>archive/retrieval system. Hopefully they will take advantage of
>>>what DDLm has to offer, though most likely by using third-party
>>>software.
>>
>>I think that's missing the point.  CIF already has to deal with
>>internationalization issues, which it does, as best it can, via the
>>elide system.  Even in English it has to in some way provide a
>>character model that extends beyond ASCII.
>>
>>>At this point in my train of thought, I might say stick to ASCII as
>>>'internationalization' has not been widely called for by the
>>>community and has minimal benefits at this time.
>>
>>As a practical matter, CIF already goes beyond ASCII.  The usual
>>manner in which it does so, however, is explicitly NOT standardized.
>>Personally, I find this a sorry state of affairs indeed.
>>
>>>  However, I think CIF should move forward in this respect. So how
>>>do we achieve this? Unicode is the accepted answer? Unicode was
>>>designed for this and has some established unambiguous encodings?
>>
>>I think Unicode or (almost) equivalently, ISO-10646, is indeed the
>>accepted answer, at least inasmuch as ISO-10646 is an international
>>standard.  As far as I know, there is no competing standard of
>>comparable scope.
>>
>>>  The majority (including Microsoft) recommend adopting UTF-8 in
>>>preference to other encodings?
>>
>>XML gives special status to UTF-8 as the encoding to assume in the
>>absence of internal or external metadata directing otherwise.
>>Nevertheless, XML also requires conformant processors to be able to
>>recognize and handle UTF-16 (though not necessarily UTF-16LE,
>>UTF-16BE, or other variants).  I believe Microsoft NT-based
>>operating systems internally use UCS-2 or UTF-16 for file names,
>>depending on OS version and patch level.  Microsoft and many others
>>provide decent support for creating, reading, and editing Unicode
>>text files encoded in UTF-8, but this frequently is not the default
>>encoding.  I am not aware of Microsoft in particular promoting UTF-8
>>above locale-specific code pages, but it is my general, personal
>>perception that UTF-8 use is broad, expanding, and widely
>>recommended.  However, I do not see UTF-8 or any other encoding ever
>>being preferred over all others for all purposes.
>>
>>>So in the light of current CIF practice (i.e. unspecified-encoding
>>>of ASCII text, where the encoding has never to my knowledge been a
>>>problem), why not specify UTF-8 only, don't accommodate any
>>>non-ASCII code points in the dictionaries (which is what is
>>>proposed anyway?), and see what happens? :-) At worst a few users
>>>will find that existing software will not handle the non-ASCII text
>>>they have diligently included in their UTF-8 CIF (but this is
>>>inevitable once you extend beyond ASCII). At best their text will
>>>be handled as UTF-8 by CIF2 software.
>>
>>That is a possible way forward, and indeed, it is basically what is
>>in the current spec.  The main problem I see with it is that in
>>practice, many people will create, use, and exchange (successfully
>>or not) "CIFs" that are not UTF-8 encoded, regardless of what the
>>spec says about that.  Although it is certainly possible to declare
>>that such files are not compliant CIFs, I don't see how that
>>provides any benefit.
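
As a rough illustration of what "UTF-8 only" would mean for a reader, the Python sketch below decodes strictly and reports whether a byte stream is valid UTF-8. The helper names are hypothetical and the sample line (using the core data name `_publ_author_name`) is invented for the example; it is a sketch of one possible check, not a description of any existing CIF software.

```python
def decode_strict_utf8(data: bytes) -> str:
    """Decode as UTF-8, raising UnicodeDecodeError on any invalid sequence."""
    return data.decode("utf-8", errors="strict")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes form well-formed UTF-8."""
    try:
        decode_strict_utf8(data)
    except UnicodeDecodeError:
        return False
    return True


# Invented sample content containing one non-ASCII character (u-umlaut).
line = "_publ_author_name 'M\u00fcller, A.'\n"
utf8_bytes = line.encode("utf-8")
latin1_bytes = line.encode("latin-1")

utf8_ok = is_valid_utf8(utf8_bytes)
latin1_ok = is_valid_utf8(latin1_bytes)
print(utf8_ok, latin1_ok)  # True False
```

A check like this catches many files saved in a legacy code page, but not all: pure-ASCII content passes under either encoding, and some byte sequences from other encodings can also happen to be valid UTF-8.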
>>
>>>So what about the issue of accessing archived UTF-8 CIFs? Make it
>>>clear to the recipient that the CIF will be encoded in UTF-8; if
>>>for some reason they have trouble reading the CIF, point them at
>>>appropriate UTF-8 software (preferably provide them with a fully
>>>compliant CIF2 editor/viewer that introduces them to the benefits
>>>of CIF2 and its support for Unicode :-)
>>
>>And that is exactly the same thing that would be done if CIF2 did
>>not specify a particular encoding.
>>
>>>Similarly, with day-to-day transmission of a CIF, if the CIF
>>>doesn't contain any characters beyond the ASCII set, the chances
>>>are there won't be any issues (there haven't been in the past?). If a
>>>diligent user has followed the spec and prepared a UTF-8 CIF, again
>>>the chances are it will be interpreted as UTF-8 (very few modern
>>>systems struggle with UTF-8?).
>>
>>I'm not in a position to know how many encoding-related issues there
>>may have been in the past.  UTF-16 variants and EBCDIC variants are
>>the only encodings I know that are in wide use and might present an
>>interchange problem for CIF 1.1 compliant CIFs.  They would present
>>exactly the same problems if used to encode ASCII-only CIF2 text.
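
The point about UTF-16 and EBCDIC can be seen with a short Python sketch that encodes the same ASCII-only text several ways. The sample content is invented, and `cp037` is used here only as one representative EBCDIC code page; other variants differ in detail.

```python
# Invented ASCII-only sample content.
text = "data_example\n_cell_length_a 5.43\n"

as_ascii = text.encode("ascii")
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")   # two bytes per character here
as_ebcdic = text.encode("cp037")      # one EBCDIC code page, as an example

# Which encodings are byte-for-byte identical to ASCII for this text?
same_as_ascii = {
    "utf-8": as_utf8 == as_ascii,
    "utf-16-le": as_utf16 == as_ascii,
    "cp037": as_ebcdic == as_ascii,
}
print(same_as_ascii)
```

For ASCII-only content, UTF-8 is indistinguishable from ASCII at the byte level, whereas the UTF-16 and EBCDIC forms are not, whether the file is labelled CIF 1.1 or CIF2.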
>>
>>>I fully expect to be 'shot down' on any number of my thoughts -
>>>but, given the number of emails it has generated, I don't think it
>>>is unreasonable to put this issue in the context of perceived
>>>current practice (however narrow the viewpoint - others have
>>>referred to CIF systems that I have no idea about)?
>>
>>It is not my goal to "shoot you down", or anyone else.  I am not
>>debating for the sake of the debate.  I want CIF2 to be as
>>technically sound and as practically useful as possible, and I don't
>>foresee a lot of latitude for tweaking or revising it after it is
>>adopted.
>>
>>I started by probing several areas where the draft spec seemed to
>>give too little consideration to the implications of expanding the
>>CIF character repertoire to all of Unicode.  For the most part these
>>have been resolved easily, but the issue of embedded U+FEFF
>>characters was contentious (and still has not been resolved).  That
>>led into the related area of character encoding and text vs. binary,
>>which has become such a brouhaha.
>>
>>Much of the disagreement over these contentious issues arises from
>>CIF's split-personality design.  It has always been promoted as a
>>human-readable text format, yet it is intended largely to be
>>produced and primarily to be consumed by computers.  Humans and
>>computers have different requirements, and it is not always possible
>>to align them.  XML followed a similar path, and nowadays the
>>prevailing opinion seems to be that XML isn't well suited to direct
>>human reading or modification.  Opinion of CIF has not reached that
>>point yet, and it's unclear whether it ever will do.
>>
>>Best,
>>
>>John
>>--
>>John C. Bollinger, Ph.D.
>>Department of Structural Biology
>>St. Jude Children's Research Hospital
>>
>>
>>
>>
>>Email Disclaimer:  www.stjude.org/emaildisclaimer
>>_______________________________________________
>>ddlm-group mailing list
>>ddlm-group@iucr.org
>>http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
>
> --
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  yaya@dowling.edu
> =====================================================
>



-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

