Discussion List Archives


Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .... .. .

There is _no_ perfect solution, but for XML and HTML, having both a BOM and a
clear indicator in the body of the text seems to have worked out 
reasonably well.  If you add my transmission check field you will 
then catch some
of the cases that are now missed, including the one Simon just cited.

At 5:41 PM +0000 6/25/10, SIMON WESTRIP wrote:
>It's using a field for specifying the encoding that worries me.
>Who is to make such a declaration in the CIF - an author who may be 
>blissfully unaware of the encoding they're using?
>Or an author who is preparing a new CIF by editing an old one, again 
>unaware that the text editor they are using is about to save
>the CIF in some other encoding? At least with UTF BOMs we have a
>fighting chance - I'd rather only accept these.
>We're also further restricting the number of non-CIF-aware programs 
>that can be used to read the text.
>You've also mentioned that we should learn from HTML - just because 
>HTML has an encoding declaration does not mean it is correct,
>which is why browsers seem to apply their own heuristics to
>determine the encoding.
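The "fighting chance" a BOM offers can be made concrete. Below is a minimal, library-agnostic sketch of BOM sniffing (the byte sequences are the standard Unicode BOMs; the function name is illustrative, not from any CIF toolkit):

```python
# Sketch: detect a Unicode byte-order mark (BOM) at the start of a file.
# Anything without a recognized BOM returns None, meaning the encoding
# must be guessed or declared out of band.

BOMS = [
    # UTF-32 BOMs must be tested before UTF-16, because the UTF-16-LE
    # BOM is a prefix of the UTF-32-LE BOM.
    (b"\xff\xfe\x00\x00", "utf-32-le"),
    (b"\x00\x00\xfe\xff", "utf-32-be"),
    (b"\xef\xbb\xbf", "utf-8-sig"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

Note the ordering: because the UTF-16-LE BOM is a prefix of the UTF-32-LE BOM, the longer patterns are checked first.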
>From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
>Sent: Friday, 25 June, 2010 18:08:26
>Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. 
>.. .. .. .. .. .
>To help you understand how really difficult this problem is, let us
>just consider an ordinary 7-bit ASCII file made in the USA with
>a string with the three characters "{|}" being read using the ISO
>"-B" encoding for each of the following countries.  What the user
>will see will be:
>     Belgium-B  e-acute, ij, e-grave
>     France-B  e-acute, u-grave, e-grave
>     Italy-B    a-grave, e-acute, e-grave
>     Sweden    a-umlaut, o-umlaut, a-ring
>and many, many, more possibilities.
>So, even staying with 7 bit "ASCII" is not an answer.  If you want files
>to be read reliably, you really do have to specify the encoding used
>to write it, clearly, unambiguously, and tied to the file.  That is
>why XML has that field for encodings.
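Herbert's national-variant effect can be reproduced programmatically. A minimal sketch follows; Python ships no codecs for the ISO 646 national variants, so the two substitution tables below are written out by hand from the French (NF Z 62-010) and Swedish allocations that match the readings listed above:

```python
# Sketch: the same three bytes "{|}" (0x7B 0x7C 0x7D) decoded under two
# ISO 646 national variants.  Only the positions that differ from ASCII
# are tabulated; all other bytes decode as plain ASCII.

FRENCH = {0x7B: "\u00e9", 0x7C: "\u00f9", 0x7D: "\u00e8"}   # é ù è
SWEDISH = {0x7B: "\u00e4", 0x7C: "\u00f6", 0x7D: "\u00e5"}  # ä ö å

def decode_iso646(data: bytes, table: dict) -> str:
    """Decode bytes as ASCII, except where the national variant differs."""
    return "".join(table.get(b, chr(b)) for b in data)

raw = b"{|}"
print(decode_iso646(raw, FRENCH))   # the French reading:  éùè
print(decode_iso646(raw, SWEDISH))  # the Swedish reading: äöå
```

The bytes never change; only the reader's assumed table does, which is exactly why the encoding has to travel with the file.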
>At 9:12 AM -0700 6/25/10, SIMON WESTRIP wrote:
>>Hi James, my experiments were fairly unsophisticated I'm afraid -
>>I created a UTF-8 encoded text file containing a string of accented
>>characters using an appropriate editor on Linux. Mailed it to myself. Switched
>>my machine over to
>>Windows (XP), then tried opening the file with a selection of text
>>editors (Notepad, Wordpad, MinGW (an editor for C++), Word2000 and
>>Word2010). MinGW and Word2000 failed to recognize UTF-8; Word2010 prompted me
>>for an encoding. Notepad recognized UTF-8 but couldn't handle the
>>unix line feeds.
>>I then used Notepad and the Windows character map tool to create a
>>similar file, but just saved it as the default Windows encoding (no
>>choice in this
>>using Notepad), then mailed this to myself, switched back to linux
>>and tried opening it with a couple of editors - they failed to
>>recognize windows cp-1252
>>(i.e. defaulted to UTF-8).
>>While back in Linux I created a similar file but encoded using one
>>of the Cyrillic code pages - mailed it to myself, then switched to
>>windows, and tried it with the same editors.
>>In this case none of them managed to recognize the encoding.
>>These aren't very sophisticated tests, I know, but they do demonstrate,
>>to me anyway, that once we extend beyond ASCII, users will have to be
>>aware that many basic text editors
>>may no longer be suitable for work with such CIFs.
>>I didn't encounter any transmission errors, as far as I am aware,
>>though the mail tool I use (web-based) didn't render the windows
>>cp-1252 text appropriately in its viewing pane,
>>but when I downloaded it and opened it in Kate (linux) and manually
>>requested cp-1252, all seemed well - i.e. nothing had been corrupted
>>by transmission.
>>In many respects I don't think these issues with existing text
>>editors not being able to render text encodings beyond the default
>>should necessarily stop us from adopting
>>e.g. UTF-8. It would be fairly trivial these days to create a basic
>>text editor that could read/write utf-8 and convert from the native
>>encoding to utf-8. Not so trivial to determine other encodings
>>though, as demonstrated by Word2010 prompting for the encoding.
>>I'll have a think about the transmission issue - my tests weren't
>>really looking at this - to be honest I do not entirely understand
>>why file transfer should be so problematic.
>>From: James Hester <jamesrhester@gmail.com>
>>To: Group finalising DDLm and associated dictionaries 
>>Sent: Friday, 25 June, 2010 14:32:25
>>Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. ..
>>.. .. .. .. .. .
>>Hi Simon - I was actually going to suggest that we organise ourselves
>>a little experiment in passing around files containing non-ASCII
>>characters using different mechanisms (ftp, http, email) to satisfy
>>ourselves (or me, at least) that files get mangled.  Perhaps you could
>>document for the record exactly what you did and how it went wrong?
>>On Fri, Jun 25, 2010 at 10:25 PM, SIMON WESTRIP
>>>   I think this is a very good idea.
>>>   The more we discuss this issue, the more I realize just how 
>>>fundamental this
>>>   proposed change is.
>>>   As CIF users are used to being able to work with CIF in its 
>>>'raw' form using
>>>   any text editor, and
>>>   many text editors default to the system encoding, and provide no means of
>>>   switching encoding,
>>>   the likelihood is that any non-ASCII text will not be rendered properly if
>>>   the CIF is encoded in anything but
>>>   the system encoding. I've been experimenting with passing 
>>>variously encoded
>>>   text files between linux and windows - in the majority of cases the text
>>>   editor failed to render the text correctly.
>>>   So by specifying one particular encoding, or any number of encodings, we
>>>   will be asking
>>>   CIF users to change the way they treat CIF quite fundamentally, requiring
>>>   them to use a narrower range of software to
>>>   edit/view CIFs. Assuming they accept this restriction, should we 
>>>then burden
>>>   them further by saying you will also have to
>>>   be prepared to accept a number of encodings, so your software will either
>>>   need to be able to confidently identify those encodings
>>>   or provide you with a means to switch between them until you find one that
>>>   appears to render correctly or matches the declared
>>>   encoding (which could be in error anyway)? Or do we lessen the burden by
>>>   specifying only one encoding?
>>>   Cheers
>>>   Simon
>>>   ________________________________
>>>   From: James Hester 
>>>   To: Group finalising DDLm and associated dictionaries
>>>   Sent: Friday, 25 June, 2010 3:47:22
>>>   Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. 
>>>.. .. .. ..
>>>   .. .. .
>>>   I don't think we are quite going around in circles; but it is very
>>>   time-consuming exploring every point that is made to determine its
>>>   value and relevance.  Such methodical work can be done in a more
>>>   considered fashion by email, or even better with a wiki page.  To that
>>>   end, I plan to collect together a summary of all the points of view
>>>   that have been expressed so far, as a basis for further discussion.
>>>   On Fri, Jun 25, 2010 at 3:53 AM, Herbert J. Bernstein
>>>>   Dear Colleagues,
>>>>     It is an unfortunate reality that we seem unable to agree on this issue
>>>>   and perhaps others related to CIF2 and DDLm.  Perhaps we need a meeting.
>>>>   If enough of us are at the ACA meeting in Chicago, and a few others
>>>>   could join in via Skype, maybe we could make some progress.
>>>>     Right now we seem to be going in circles.
>>>>     Regards,
>>>>       Herbert
>>>>   At 12:24 PM -0500 6/24/10, Bollinger, John C wrote:
>>>>>On Wednesday, June 23, 2010 8:24 PM, SIMON WESTRIP wrote:
>>>>>>I've attempted to take a step back and look at the encoding problem
>>>>>>from the perspective of my working experience.
>>>>>Fair enough.
>>>>>>To start with, please indulge me by putting aside the
>>>>>>philosophical/respectful ('internationalization') considerations.
>>>>>>What are the short/medium term benefits of extending CIF beyond ASCII
>>>>>>   text?
>>>>>>1) With regard to the promise of DDLm (all ASCII) - none?
>>>>>I'm insufficiently informed to respond to that one.
>>>>>>2) With regard to processing crystallographic data output by e.g.
>>>>>>refinement software - none?
>>>>>As far as I know, no current refinement software outputs non-ASCII
>>>>>CIF content, except by using the limited and somewhat arcane system
>>>>>of ASCII elides described among the CIF 1.1 "Common Semantic
>>>>>Features" (and which technically is not part of the CIF 1.1 spec).
>>>>>If there are any that do otherwise then the files they produce may
>>>>>not conform to CIF 1.1.  Any existing processing software that
>>>>>consumes CIFs therefore either will assume the character set to be
>>>>>restricted to ASCII, or will make some specific local provision for
>>>>>handling non-standard CIFs.  Some such software may be able to
>>>>>immediately take advantage of the larger character repertoire
>>>>>afforded by Unicode, but a lot of software will need to be updated
>>>>>to make any use of it.
>>>>>I'm not sure any of that answers the question, though.  What
>>>>>behaviors count as "processing"?  To the extent that few
>>>>>crystallographic computations can be performed on non-numeric data,
>>>>>I see no special benefit for that kind of processing.
>>>>>On the other hand, I do see certain advantages to CIF being able to
>>>>>represent personal names without transliteration, as variant
>>>>>transliteration approaches applied to the same name sometimes
>>>>>produce different results.  If the "processing" in question involves
>>>>>storing CIF data in a database then there are searching and
>>>>>normalization advantages to having names, at least, written in their
>>>>>native script.  (The elide system covers many of these cases, at
>>>>>least for European names, but not all possible cases.)
>>>>>>3) With regard to richer content within data values - minimal?
>>>>>Again, names.
>>>>>Also, deprecating the elide system -- I understand that it is
>>>>>designed to be mnemonic, and it *is* easier to read than Unicode
>>>>>escape codes would be, but it's still limited and hard to read.  I
>>>>>contend that this is one thing that is broken in CIF1 (whether you
>>>>>characterize the problem as an insufficient character repertoire or
>>>>>as an insufficient elide system).
>>>>>Plus, there are various non-ASCII characters in routine use in
>>>>>crystallography and related fields that it would be nice to
>>>>>represent directly, among them the degree symbol and many upper- and
>>>>>lower-case Greek letters.  The elide system currently covers these,
>>>>>but again, it's uncomfortable and not an official standard.
>>>>>Furthermore, if there is some hope or expectation of CIF2 as an
>>>>>electronic representation of non-English manuscripts, then that
>>>>>virtually requires direct support for all the characters of the
>>>>>scripts in which such manuscripts will be written.  The elide system
>>>>>is workable for short pieces of text, but only via machine
>>>>>translation could it be comfortable for longer texts.
>>>>>I think these amount to more than a minimal advantage for Unicode in
>>>>>data values.
>>>>>
>>>>>>In the latter case an extended character set can be represented
>>>>>>using an ASCII representation of Unicode (\\[ux]xxxxxx). Based on
>>>>>>my experience (and in light of the issues we've been discussing),
>>>>>>it will probably be considerably easier for a user to adapt to a
>>>>>>few extra ASCII control sequences than asking them to pay any
>>>>>>attention to the underlying text encodings. The same applies from a
>>>>>>developer's point of view - i.e. it's far easier to accept extended
>>>>>>ASCII control sequences than to try to determine the text encoding
>>>>>>(unless of course the encodings are unambiguously identifiable).
>>>>>Java / Python-style Unicode escapes have the advantages of covering
>>>>>all of Unicode, of providing an unambiguous encoding of an
>>>>>underlying Unicode text model, and of embedding that encoding in an
>>>>>ASCII-based host format.
>>>>>They have the disadvantages of being difficult for a human to
>>>>>directly read or edit, and of introducing their own set of issues.
>>>>>For example, consider the following potential CIF2 fragment:
>>>>>           _foo \u000A;bar\u000A;
>>>>>What is the value assigned to data name _foo?  If the Unicode
>>>>>escapes are processed according to the Java model (i.e. as if
>>>>>replaced by the corresponding character prior to lexical analysis),
>>>>>then the value is bar.  If the escapes are processed later, then the
>>>>>value is <LF>;bar<LF>;, apparently a "simple data value" as CIF 1.1
>>>>>calls them, but containing <LF> characters (in fact, this particular
>>>>>value cannot be represented in CIF 1 at all).
>>>>>These issues do not by any means block Unicode escapes from being
>>>>>adopted for CIF, but they do mean that taking such an approach
>>>>>requires some additional details to be settled, and that there will
>>>>>be interesting gotchas involved in adapting some existing CIF1
>>>>>software for CIF2.
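John's two readings of the `_foo \u000A;bar\u000A;` fragment can be sketched directly. The helper below is illustrative only (no CIF parser is assumed): "early" processing expands escapes before tokenization, as Java does, so a lexer would then see real line feeds; "late" processing tokenizes first and expands afterwards, leaving the <LF> characters inside a single token:

```python
# Sketch: the order of Unicode-escape processing changes what a lexer sees.
import re

LINE = r"_foo \u000A;bar\u000A;"

def expand_escapes(s: str) -> str:
    """Replace \\uXXXX escapes with the corresponding characters."""
    return re.sub(r"\\u([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), s)

# Early (Java-style): expand first.  The result contains real newlines,
# so a subsequent lexer would see a semicolon-delimited text field.
early = expand_escapes(LINE)

# Late: tokenize first (here, a naive whitespace split), expand after.
# The value token now *contains* <LF> characters.
late = [expand_escapes(tok) for tok in LINE.split()]
# late == ["_foo", "\n;bar\n;"]
```

The two pipelines produce genuinely different data, which is the ambiguity that would have to be settled in the spec.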
>>>>>>Furthermore, extending the character set (however represented) does
>>>>>>not address issues such as representing mathematical
>>>>>>content in a CIF data value, nor images (imgCIF will not be fully
>>>>>>compliant with CIF2 - but please correct me if I'm wrong). There
>>>>>>are yet unexplored alternatives to enabling richer publication and
>>>>>>archival content using CIF, but they do not concern the fundamental
>>>>>By "mathematical content" I suppose you mean formulae.  I agree,
>>>>>formulae, images, and various other content types that might be of
>>>>>interest are not supported by a Unicode character model alone,
>>>>>however encoded.  It was never my understanding that supporting such
>>>>>content types was a reason for switching to a Unicode character
>>>>>model, however much (or little) it might be advantageous to imgCIF.
>>>>>>So the leading ('forward thinking') motivation for basing CIF2 on
>>>>>>unicode lies in 'internationalization'. In the short/medium term I
>>>>>>don't imagine that introducing an extended character set through
>>>>>>unicode or multiple encodings is going to lead to any one/group
>>>>>>adopting the new CIF2 as the basis of their private/public data
>>>>>>archive/retrieval system. Hopefully they will take advantage of
>>>>>>what DDLm has to offer, though most likely by using third-party
>>>>>I think that's missing the point.  CIF already has to deal with
>>>>>internationalization issues, which it does, as best it can, via the
>>>>>elide system.  Even in English it has to in some way provide a
>>>>>character model that extends beyond ASCII.
>>>>>>At this point in my train of thought, I might say stick to ASCII as
>>>>>>'internationalization' has not been widely called for by the
>>>>>>community and has minimal benefits at this time.
>>>>>As a practical matter, CIF already goes beyond ASCII.  The usual
>>>>>manner in which it does so, however, is explicitly NOT standardized.
>>>>>Personally, I find this a sorry state of affairs indeed.
>>>>>>   However, I think CIF should move forward in this respect. So how
>>>>>>do we achieve this? Unicode is the accepted answer? Unicode was
>>>>>>designed for this and has some established unambiguous encodings?
>>>>>I think Unicode or (almost) equivalently, ISO-10646, is indeed the
>>>>>accepted answer, at least inasmuch as ISO-10646 is an international
>>>>>standard.  As far as I know, there is no competing standard of
>>>>>comparable scope.
>>>>>>   The majority (including Microsoft) recommend adopting UTF-8 in
>>>>>>preference to other encodings?
>>>>>XML gives special status to UTF-8 as the encoding to assume in the
>>>>>absence of internal or external metadata directing otherwise.
>>>>>Nevertheless, XML also requires conformant processors to be able to
>>>>>recognize and handle UTF-16 (though not necessarily UTF-16LE,
>>>>>UTF-16BE, or other variants).  I believe Microsoft NT-based
>>>>>operating systems internally use UCS-2 or UTF-16 for file names,
>>>>>depending on OS version and patch level.  Microsoft and many others
>>>>>provide decent support for creating, reading, and editing Unicode
>>>>>text files encoded in UTF-8, but this frequently is not the default
>>>>>encoding.  I am not aware of Microsoft in particular promoting UTF-8
>>>>>above locale-specific code pages, but it is my general, personal
>>>>>perception that UTF-8 use is broad, expanding, and widely
>>>>>recommended.  However, I do not see UTF-8 or any other encoding ever
>>>>>being preferred over all others for all purposes.
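One practical reason "assume UTF-8 unless told otherwise" works as a default is that UTF-8 is self-validating: strict decoding rejects most byte streams produced under single-byte encodings such as cp1252, which accept nearly any byte. A minimal sketch using Python's built-in codecs (the function name is illustrative):

```python
# Sketch: strict UTF-8 decoding as a cheap encoding sanity check.
# Single-byte code pages like cp1252 will happily "decode" almost any
# bytes, so the reverse test is far less reliable.

def looks_like_utf8(data: bytes) -> bool:
    """True if the bytes form a well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False
```

Note the asymmetry: ASCII-only text passes (ASCII is a subset of UTF-8), while accented text saved under cp1252 is typically rejected, so mislabelled files fail loudly instead of being silently mangled.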
>>>>>>So in the light of current CIF practice (i.e. unspecified-encoding
>>>>>>of ASCII text, where the encoding has never to my knowledge been a
>>>>>>problem), why not specify UTF-8 only, don't accommodate any
>>>>>>non-ASCII code points in the dictionaries (which is what is
>>>>>>proposed anyway?), and see what happens? :-) At worst a few users
>>>>>>will find that existing software will not handle the non-ASCII text
>>>>>>they have diligently included in their UTF-8 CIF (but this is
>>>>>>inevitable once you extend beyond ASCII). At best their text will
>>>>>>be handled as UTF-8 by CIF2 software.
>>>>>That is a possible way forward, and indeed, it is basically what is
>>>>>in the current spec.  The main problem I see with it is that in
>>   >>>practice, many people will create, use, and exchange (successfully
>>>>>or not) "CIFs" that are not UTF-8 encoded, regardless of what the
>>>>>spec says about that.  Although it is certainly possible to declare
>>>>>that such files are not compliant CIFs, I don't see how that
>>>>>provides any benefit.
>>>>>>So what about the issue of accessing archived UTF-8 CIFs? Make it
>>>>>>clear to the recipient that the CIF will be encoded in UTF-8; if
>>>>>>for some reason they have trouble reading the CIF, point them at
>>>>>>appropriate UTF-8 software (preferably provide them with a fully
>>>>>>compliant CIF2 editor/viewer that introduces them to the benefits
>>>>>>of CIF2 and its support for unicode:-)
>>>>>And that is exactly the same thing that would be done if CIF2 did
>>>>>not specify a particular encoding.
>>>>>>Similarly, with day-to-day transmission of a CIF, if the CIF
>>>>>>doesn't contain any characters beyond the ASCII set, the chances
>>>>>>are there won't be any issues (there haven't been in the past?). If a
>>>>>>diligent user has followed the spec and prepared a UTF-8 CIF, again
>>>>>>the chances are it will be interpreted as UTF-8 (very few modern
>>>>>>systems struggle with UTF-8?).
>>>>>I'm not in a position to know how many encoding-related issues there
>>>>>may have been in the past.  UTF-16 variants and EBCDIC variants are
>>>>>the only encodings I know that are in wide use and might present an
>>>>>interchange problem for CIF 1.1 compliant CIFs.  They would present
>>>>>exactly the same problems if used to encode ASCII-only CIF2 text.
>>>>>>I fully expect to be 'shot down' on any number of my thoughts -
>>>>>>but, given the number of emails it has generated, I don't think it
>>>>>>is unreasonable to put this issue in the context of perceived
>>>>>>current practice (however narrow the viewpoint - others have
>>>>>>referred to CIF systems that I have no idea about)?
>>>>>It is not my goal to "shoot you down", or anyone else.  I am not
>>>>>debating for the sake of the debate.  I want CIF2 to be as
>>>>>technically sound and as practically useful as possible, and I don't
>>>>>foresee a lot of latitude for tweaking or revising it after it is
>>>>>finalized.
>>>>>I started by probing several areas where the draft spec seemed to
>>>>>give too little consideration to the implications of expanding the
>>>>>CIF character repertoire to all of Unicode.  For the most part these
>>>>>have been resolved easily, but the issue of embedded U+FEFF
>>>>>characters was contentious (and still has not been resolved).  That
>>>>>led into the related area of character encoding and text vs. binary,
>>>>>which has become such a brouhaha.
>>>>>Much of the disagreement over these contentious issues arises from
>>>>>CIF's split-personality design.  It has always been promoted as a
>>>>>human-readable text format, yet it is intended largely to be
>>>>>produced and primarily to be consumed by computers.  Humans and
>>>>>computers have different requirements, and it is not always possible
>>>>>to align them.  XML followed a similar path, and nowadays the
>>>>>prevailing opinion seems to be that XML isn't well suited to direct
>>>>>human reading or modification.  Opinion of CIF has not reached that
>>>>>point yet, and it's unclear whether it ever will do.
>>>>>John C. Bollinger, Ph.D.
>>>>>Department of Structural Biology
>>>>>St. Jude Children's Research Hospital
>>>>>ddlm-group mailing list
>>>>   --
>>>>   =====================================================
>>>>   Herbert J. Bernstein, Professor of Computer Science
>>>>     Dowling College, Kramer Science Center, KSC 121
>>>>           Idle Hour Blvd, Oakdale, NY, 11769
>>>>                   +1-631-244-3035
>>>>   =====================================================
>>>   --
>>>   T +61 (02) 9717 9907
>>>   F +61 (02) 9717 3145
>>>   M +61 (04) 0249 4148

