Re: [ddlm-group] [THREAD 4] UTF8

To: [email protected], Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] [THREAD 4] UTF8
From: "Herbert J. Bernstein" <[email protected]>
Date: Thu, 29 Oct 2009 07:59:38 -0400 (EDT)
In-Reply-To: <C70E10D8.1221C%[email protected]>
References: <C70E10D8.1221C%[email protected]>

Dear Nick,

   Unlike the rest of the group, I am not much for mandatory standards, so
I expect that CBF will muddle along with a mixture of UTF-8 and UCS-2 and
its pure binary forms as well, with only the UTF-8 versions being 
officially a CIF.  Inasmuch as the UCS-2 versions will not be officially
CIFs, there is no harm in them having BOMs (which are an essential part
of UCS-2 files) as well as the readable encoding comment as the second
line, which is what I suspect people will actually do.

   In 5 or 10 years, when there has been some experience with this and the
world as a whole has decided what to do with the C world of UTF-8 versus
the Java world of UCS-2, we can revisit the issue.  For the moment, I
would suggest we turn a blind eye to the non-CIF, but CIF-like variants
of imgCIF files, just as we turn a blind eye to the not-exactly-CIF
practices at the IUCr and the PDB, so people can get their work done.

   Bottom line:  A CIF is in UTF-8 and does not have a BOM or encoding
comment, and something with a BOM or encoding comment (or both) may
be useful and may have a transliteration into a CIF, but is not itself
a CIF, and should use a file extension other than ".cif", such as ".cbf".
As in the past, the cbf files will continue to have multiple alternate
representations of the same information.
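
The encoding half of the bottom line above can be sketched as a check in
Python.  This is only an illustration: the function name and the list of
BOMs are assumptions, and it does not look for an encoding comment,
because the form of that comment is not given in this message.

```python
import codecs

# Byte-order marks to reject; this particular list is an assumption
# made for the sketch, not something fixed by the discussion.
_BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def looks_like_cif_encoding(data: bytes) -> bool:
    """Return True if data has no leading BOM and decodes as strict UTF-8.

    Hypothetical helper: it checks the encoding only, not CIF syntax.
    """
    if data.startswith(_BOMS):
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A file that fails this check might still be a useful CBF-style variant,
but under the rule above it would not be called a CIF.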

   Regards,
     Herbert
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  [email protected]
=====================================================

On Wed, 28 Oct 2009, Nick Spadaccini wrote:

> OK. Well I think this discussion has died its death. One of the benefits (to
> Herb mostly) is that with UCS-2 encoding, the CBF comes in to the imgCIF
> fold. Herb's group has a binary->UCS-2 encoding with a 7% added footprint so
> it is efficient. At the moment (or at least shortly) the alternative is to
> go binary->BASE64 which has a 33% added footprint. I guess binary->UTF-8
> would have the same footprint. By being able to switch encodings Herb can
> achieve a single file for his work. The price is that we have to adopt a
> switching encoding option - not attractive to most people here apparently.
>
> It seems the consensus is that UTF-8 is the encoding standard and everything
> else is a broken file. Notwithstanding that any application may parse it,
> attempt to determine the encoding and produce a corrected file. That is
> nice, but not a required part of the standard (which is the standard, to
> quote David, quoting me).
>
> It seems to me Herb, the easiest thing you can do with regard to CBF (since
> it can't be a real CIF) is to adopt everything from CIF-2 except that CBF
> must be encoded as UCS-2. That may give you a bigger footprint per ASCII
> character, but since these are a small number of data values in a CBF it may
> well be worth the price. That way at least the binary part is efficiently
> encoded.
>
> If in the DDLm we manage to flag order dependency in the loop_ header (along
> with a flag to indicate row ordering is important in the loop_ - something
> John W wanted) then CBF will essentially be CIF-2 except that the encoding
> is UCS-2.
>
>
> On 26/10/09 10:43 PM, "David Brown" <[email protected]> wrote:
>
>> James has asked for the views of those of us who have just been watching this
>> discussion (it seems to go by faster than I can follow).
>>
>> For what it is worth, my stance is strongly the same as James'.  He has laid
>> out all the arguments very succinctly - I will just briefly reinforce those I
>> think most important.
>>
>> 1. Permitting one or more other encodings (presumably these must constitute a
>> well defined list as their names must be recognized) immediately invites
>> people to use them.  To quote Nick, if we have a standard it should be a
>> standard.  If people decide to bend the standard they do so at their own
>> peril.
>>
>> 2. If other encodings are deprecated (yes, that word again) in order to
>> encourage people to use the real standard, they are likely to be unaware that
>> their program has used a deprecated encoding and therefore has failed to
>> identify it.  If the writer of the CIF knows that it is using a deprecated
>> standard it can make the conversion.  Only if it does not know it is using a
>> deprecated standard and therefore neither converts nor identifies the encoding
>> will the deprecated encoding get through.
>>
>> 3. Providing a space (at this stage) for identifying encodings that may (or
>> may not) later become part of the standard is unnecessary since it can be
>> added if and when such other encodings are allowed.  UTF-8 then becomes the
>> default.
>>
>> 4. Having a single standard requires that the readers need only consider one
>> encoding and the writers need only support a conversion from the native to the
>> CIF2 standard.  Allowing 5 or 10 other encodings makes life easier for the
>> writer since it does not have to provide any conversion, but there is a price.
>> Every reader must be able to read 5 or 10 different encodings because it is
>> not allowed to reject any of the deprecated standards.  With a single
>> standard the IUCr may or may not decide that they will handle different
>> encodings, but that is their choice.  Making different encodings legal removes
>> that choice from the reader: it has to handle all possibilities, which is only
>> likely to discourage people writing local programs for occasional use.
>>
>> David
>>
>>
>> _______________________________________________
>> ddlm-group mailing list
>> [email protected]
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
> cheers
>
> Nick
>
> --------------------------------
> Associate Professor N. Spadaccini, PhD
> School of Computer Science & Software Engineering
>
> The University of Western Australia    t: +61 (0)8 6488 3452
> 35 Stirling Highway                    f: +61 (0)8 6488 1089
> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
> MBDP  M002
>
> CRICOS Provider Code: 00126G
>
> e: [email protected]
>
>
>
>
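
Two of the footprint figures in Nick's message above can be reproduced
with a short Python sketch: the roughly 33% growth of binary->BASE64 and
the doubling of ASCII text under UCS-2.  The payload and sample text are
made up for illustration, and the 7% binary->UCS-2 scheme is not
reproduced because it is not described in this thread.

```python
import base64

# Stand-in for binary image data; the content is arbitrary and the
# length (30720 bytes) is a multiple of 3, so Base64 needs no padding.
payload = bytes(range(256)) * 120

# Base64 emits 4 output characters for every 3 input bytes.
b64 = base64.b64encode(payload)
b64_overhead = len(b64) / len(payload) - 1  # fraction added by encoding

# ASCII-only sample text (hypothetical, not taken from a real file).
text = "data_example\n_item_value 1.234\n"
utf8_size = len(text.encode("utf-8"))
# UTF-16-LE without a BOM; for these characters it matches UCS-2:
# two bytes per character.
ucs2_size = len(text.encode("utf-16-le"))

print(f"Base64 overhead: {b64_overhead:.1%}")
print(f"UTF-8 bytes: {utf8_size}, UCS-2 bytes: {ucs2_size}")
```

The Base64 figure here is for output without line breaks; wrapping the
encoded text into lines would add a little more.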

