
Re: [ddlm-group] options/text vs binary/end-of-line

Dear all

I've attempted to take a step back and look at the encoding problem from the perspective of my own working experience.
This will undoubtedly differ from most of yours (for example, I'm relatively new to dealing with CIF as a developer, though I have a number of years' experience working with CIFs). I have to stress that the following are just my perceptions of the use of CIF for publication purposes -
no doubt there are workflows that use CIF entirely differently.

To start with, please indulge me by putting aside the philosophical/respectful ('internationalization') considerations.
What are the short/medium-term benefits of extending CIF beyond ASCII text?

1) With regard to the promise of DDLm (all ASCII) - none?

2) With regard to processing crystallographic data output by e.g. refinement software - none?

3) With regard to richer content within data values - minimal?

In the latter case an extended character set can be represented using an ASCII representation of Unicode (\\[ux]xxxxxx). Based on my experience (and in light of the issues we've been discussing), it will probably be considerably easier for a user to adapt to a few extra ASCII control sequences than to ask them to pay any attention to the underlying text encodings. The same applies from a developer's point of view - i.e. it's far easier to accept extended ASCII control sequences than to try to determine the text encoding (unless of course the encodings are unambiguously identifiable).
Furthermore, extending the character set (however represented) does not address issues such as representing mathematical content in a CIF data value, nor images (imgCIF will not be fully compliant with CIF2 - but please correct me if I'm wrong). There are as yet unexplored alternatives for enabling richer publication and archival content using CIF, but they do not concern the fundamental syntax/encoding.
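
(As a rough sketch of the sort of convention I mean - the \u{...} escape syntax below is hypothetical, purely for illustration, not the actual CIF2 proposal - a few lines of Python show the round trip:)

    import re

    def to_ascii_escapes(text):
        # Replace each non-ASCII character with an ASCII escape sequence.
        return "".join(ch if ord(ch) < 128 else "\\u{%06X}" % ord(ch)
                       for ch in text)

    def from_ascii_escapes(text):
        # Expand \u{XXXXXX} escapes back into Unicode characters.
        return re.sub(r"\\u\{([0-9A-Fa-f]{1,6})\}",
                      lambda m: chr(int(m.group(1), 16)), text)

    s = "Kristallographie f\u00fcr alle"
    print(to_ascii_escapes(s))   # Kristallographie f\u{0000FC}r alle
    assert from_ascii_escapes(to_ascii_escapes(s)) == s

The file itself stays pure ASCII, so no reader ever has to guess its encoding.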

So the leading ('forward thinking') motivation for basing CIF2 on Unicode lies in 'internationalization'. In the short/medium term I don't imagine that introducing an extended character set through Unicode or multiple encodings is going to lead to anyone (or any group) adopting the new CIF2 as the basis of their private/public data archive/retrieval system. Hopefully they will take advantage of what DDLm has to offer, though most likely by using third-party software.

At this point in my train of thought, I might say stick to ASCII, as 'internationalization' has not been widely called for by the community and has minimal benefits at this time. However, I think CIF should move forward in this respect. So how do we achieve this? Unicode is the accepted answer? Unicode was designed for this and has established unambiguous encodings? The majority (including Microsoft) recommend adopting UTF-8 in preference to other encodings? So in light of current CIF practice (i.e. unspecified encoding of ASCII text, where the encoding has never to my knowledge been a problem), why not specify UTF-8 only, not accommodate any non-ASCII code points in the dictionaries (which is what is proposed anyway?), and see what happens? :-) At worst a few users will find that existing software will not handle the non-ASCII text they have diligently included in their UTF-8 CIF (but this is inevitable once you extend beyond ASCII). At best their text will be handled as UTF-8 by CIF2 software.
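
(A side benefit of specifying UTF-8 only is that conformance is mechanically checkable - most byte streams produced under other encodings fail strict UTF-8 decoding, while pure-ASCII files pass, being valid UTF-8 already. A minimal Python sketch, with a hypothetical file name:)

    def is_valid_utf8(path):
        # Strict decoding rejects byte sequences that are not
        # well-formed UTF-8; ASCII-only files always pass.
        with open(path, "rb") as f:
            data = f.read()
        try:
            data.decode("utf-8", errors="strict")
            return True
        except UnicodeDecodeError:
            return False

    print(is_valid_utf8("example.cif"))   # "example.cif" is hypothetical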

So what about the issue of accessing archived UTF-8 CIFs? Make it clear to the recipient that the CIF will be encoded in UTF-8; if for some reason they have trouble reading the CIF, point them at appropriate UTF-8 software (preferably provide them with a fully compliant CIF2 editor/viewer that introduces them to the benefits of CIF2 and its support for Unicode :-)

Similarly, with day-to-day transmission of a CIF: if the CIF doesn't contain any characters beyond the ASCII set, the chances are there won't be any issues (there haven't been in the past?). If a diligent user has followed the spec and prepared a UTF-8 CIF, again the chances are it will be interpreted as UTF-8 (very few modern systems struggle with UTF-8?).

I fully expect to be 'shot down' on any number of my thoughts - but, given the number of emails it has generated, I don't think it is unreasonable to put this issue in the context of perceived current practice (however narrow the viewpoint - others have referred to CIF systems that I have no idea about)?

Cheers

Simon


From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Thursday, 24 June, 2010 0:13:03
Subject: Re: [ddlm-group] options/text vs binary/end-of-line

The solution I propose is tuned to files that were intentionally created
as UTF-8 CIF2 files but, because they were created on a system with
a different default encoding, were transferred as if they were in
that different encoding with automatic conversion to the default
encoding on the receiving end.  There would be no problem if the CIF
had been created in the default encoding for the originating system:
it would simply be received in the default encoding for the receiving
system, which could reliably convert the file to UTF-8 or whatever
other encoding it wishes to use.  The case I raised is particularly tricky
because the sender has no way to see their own error.  The file
looks like and is a valid UTF-8 file at their end.  If the receiver
sends it back by the reverse path, the copy that comes back will
compare perfectly.  Without the transmission check for the receiver
to use, the result of enforcing a UTF-8 canonical encoding on
machines that are not fully UTF-8 aware is to produce an undetectable
and therefore uncorrectable error.
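
(A minimal Python sketch of why the round trip compares perfectly at the
sender's end - Latin-1 is assumed here as the intermediate single-byte
encoding, purely for illustration:)

    # The scenario in miniature: valid UTF-8 bytes are misread as a
    # single-byte encoding, shipped back by the reverse path, and
    # compared at the sender's end - the corruption never surfaces.
    original = "d\u00e9j\u00e0 vu".encode("utf-8")   # valid UTF-8 bytes
    garbled  = original.decode("latin-1")   # receiver misinterprets them
    returned = garbled.encode("latin-1")    # reverse path, same rule
    print(returned == original)   # True: the sender's copy compares perfectly
    print(garbled)                # yet the receiver saw mojibake, not UTF-8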

Users of KOI8-R just use other code pages for text that goes beyond
that code page, just as I use CP1251 when I am working on a code-page
based system and need to send Cyrillic.  It is a major nuisance having
to switch code pages -- that is why Unicode is a better idea, and why
a user on a KOI8-R or CP1251-based system is virtually certain to use
special editors to work with Unicode, and to get trapped precisely by
the case I suggested when they try to send that UTF-8 file as a text
file.  They really need the TC field.

If you really want to stick to code pages with chunks of text in
multiple encodings, you are likely to end up using multi-part
MIME with different parts in different encodings.  Then we
might want to look at multiple transmission check strings.
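
(A minimal Python sketch of the transmission check at work, using the
accented o's U+00F2-U+00F6 discussed below as the TC string; the
encodings are chosen only for illustration:)

    # A known non-ASCII string whose bytes, decoded under the wrong
    # encoding, no longer match the expected value.
    TC = "\u00f2\u00f3\u00f4\u00f5\u00f6"   # the accented o's
    raw = TC.encode("utf-8")                # bytes as written to the file

    print(raw.decode("utf-8") == TC)        # True: correct interpretation
    print(raw.decode("koi8_r") == TC)       # False: misread is detected
    print(raw.decode("cp1251") == TC)       # False: misread is detected

    # And, as John notes below, KOI8-R has no codes for these characters
    # at all, so the same TC string cannot even be written in KOI8-R:
    try:
        TC.encode("koi8_r")
    except UnicodeEncodeError:
        print("TC not representable in KOI8-R")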




At 5:11 PM -0500 6/23/10, Bollinger, John C wrote:
>On Wednesday, June 23, 2010 3:00 PM, Herbert J. Bernstein wrote:
>
>>    John is mistaken about the value of the transmission check for most
>>code pages.  Most of the Cyrillic code pages have quite distinctive
>>printable characters for these character values, and none of them would
>>return the transmission check as the intended characters.  At worst you
>>would get :: in place of the :<tc>: (e.g. with KOI0 or KOI7).  Try it with
>>a few encodings and you will see it works pretty well.  I chose the
>>accented o's to get in a region of the code tables that are well
>>populated, so for most encodings you would really see the problem,
>>but even the case of :: is pretty clear.
>
>I think you misunderstand my point, but perhaps I'm just not getting
>it.  Here's what I think I understand about the transmission check
>proposal: the idea is to
>
>1) include a fixed, well-known code at a recognizable position in
>the magic number.
>
>2) The code will be constructed using a sufficient number of chosen
>non-ASCII characters such that
>
>3) decoding the byte stream according to the wrong encoding will
>result in a transmission check string that differs from the
>expected, well-known one.
>
>Have I got it?
>
>Supposing that I understand the proposal, I quite agree that if a
>CIF encoded in UTF-8 and bearing the proposed transmission check
>signature were misinterpreted as being encoded in pretty much any
>other encoding, then the TC would reveal the mistake.  And that
>would be valuable.
>
>The proposed scheme and specific signature would also work for a CIF
>encoded in ISO-8859-1, ISO-8859-15, and perhaps a couple others of
>the ISO-8859 family, except that it would not distinguish these one
>from another.  As I understand it, however, the same procedure
>could not be used for CIFs using most other encodings.  At best,
>different TC codes would be required for most encodings.  For
>example, KOI8-R does not have a way to express *any* of the
>characters having Unicode code points U+00F2 - U+00F6, so CIF text
>encoded in KOI8-R simply could not include the TC signature at all.
>There are no codes for the requisite characters.  That does not stop
>detection of UTF-8 misinterpreted as KOI8-R; rather, it makes it
>impossible to even attempt the reverse.
>
>I am unaware of a single non-ASCII character that is representable
>in substantially all encodings we might reasonably expect to be used
>for CIF text.  A different TC signature could be chosen to use for
>KOI8-R, ISO-8859-various, Shift-JIS, etc.  Each would provide a
>decent check that a CIF encoded in the corresponding form was not
>misinterpreted as being in some other form, provided that the
>intended encoding were also known (for looking up the appropriate TC
>code).  That's doable, but it's suddenly a lot more complicated,
>requiring a lot of bookkeeping.
>
>As I said, I do not oppose the transmission check idea.  If UTF-8
>were indeed designated the canonical encoding for CIF among many
>alternatives, then I would expect TC to be applicable to a large
>number of the important cases.  I doubt there is a simple
>alternative that is significantly more general, but that's what I
>would prefer if such could be found.
>
>
>Regards,
>
>John
>--
>John C. Bollinger, Ph.D.
>Department of Structural Biology
>St. Jude Children's Research Hospital
>
>
>Email Disclaimer:  www.stjude.org/emaildisclaimer
>
>_______________________________________________
>ddlm-group mailing list
>ddlm-group@iucr.org
>http://scripts.iucr.org/mailman/listinfo/ddlm-group


--
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
