Re: [ddlm-group] [THREAD 4] UTF8

Dear Nick,

   Unlike the rest of the group, I am not much for mandatory standards, so
I expect that CBF will muddle along with a mixture of UTF-8 and UCS-2 and
its pure binary forms as well, with only the UTF-8 versions being 
officially a CIF.  Inasmuch as the UCS-2 versions will not be officially
CIFs, there is no harm in them having BOMs (which are an essential part
of UCS-2 files) as well as the readable encoding comment as the second
line, which is what I suspect people will actually do.

   In 5 or 10 years, when there has been some experience with this and the
world as a whole has decided what to do with the C world of UTF-8 versus
the Java world of UCS-2, we can revisit the issue.  For the moment, I
would suggest we turn a blind eye to the non-CIF, but CIF-like variants
of imgCIF files, just as we turn a blind eye to the not-exactly-CIF
practices at the IUCr and the PDB, so people can get their work done.

   Bottom line:  A CIF is in UTF-8 and does not have a BOM or encoding
comment, and something with a BOM or encoding comment (or both) may
be useful and may have a transliteration into a CIF, but is not itself
a CIF, and should use a file extension other than ".cif", such as ".cbf".
As in the past, the CBF files will continue to have multiple alternate
representations of the same information.

   Regards,
     Herbert
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================
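
The bottom-line rule in the message above (a CIF is UTF-8 and carries no BOM) can be sketched as a small check. This is only an illustration, not part of any CIF specification or library: the function name `looks_like_utf8_cif` is hypothetical, and the encoding-comment test is left out because the thread does not give the comment's syntax.

```python
import codecs

# Byte-order marks that would disqualify a file under the rule above.
# The UCS-2 BOMs are the same byte sequences as the UTF-16 ones.
_BOMS = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def looks_like_utf8_cif(data: bytes) -> bool:
    """Return True if data has no leading BOM and decodes as strict UTF-8.

    Hypothetical helper: it checks only the two byte-level conditions,
    not CIF syntax and not the presence of an encoding comment.
    """
    if data.startswith(_BOMS):
        return False
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A check like this is necessary but not sufficient: BOM-less UCS-2 holding only ASCII characters would still pass, because the interleaved NUL bytes are themselves valid UTF-8.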

On Wed, 28 Oct 2009, Nick Spadaccini wrote:

> OK. Well I think this discussion has died its death. One of the benefits (to
> Herb mostly) is that with UCS-2 encoding, the CBF comes into the imgCIF
> fold. Herb's group has a binary->UCS-2 encoding with a 7% added footprint so
> it is efficient. At the moment (or at least shortly) the alternative is to
> go binary->BASE64 which has a 33% added footprint. I guess binary->UTF-8
> would have the same footprint. By being able to switch encodings Herb can
> achieve a single file for his work. The price is that we have to adopt a
> switching encoding option - not attractive to most people here apparently.
>
> It seems the consensus is that UTF-8 is the encoding standard and everything
> else is a broken file. Notwithstanding that, any application may parse it,
> attempt to determine the encoding, and produce a corrected file. That is
> nice, but not a required part of the standard (which is the standard, to
> quote David, quoting me).
>
> It seems to me Herb, the easiest thing you can do with regard to CBF (since
> it can't be a real CIF) is to adopt everything from CIF-2 except that CBF
> must be encoded as UCS-2. That may give you a bigger footprint per ASCII
> character, but since these are a small number of data values in a CBF it may
> well be worth the price. That way at least the binary part is efficiently
> encoded.
>
> If in the DDLm we manage to flag order dependency in the loop_ header (along
> with a flag to indicate row ordering is important in the loop_ - something
> John W wanted) then CBF will essentially be CIF-2 except that the encoding
> is UCS-2.
>
>
> On 26/10/09 10:43 PM, "David Brown" <idbrown@mcmaster.ca> wrote:
>
>> James has asked for the views of those of us who have just been watching this
>> discussion (it seems to go by faster than I can follow).
>>
>> For what it is worth, my stance is strongly the same as James'.  He has laid
>> out all the arguments very succinctly - I will just briefly reinforce those I
>> think most important.
>>
>> 1. Permitting one or more other encodings (presumably these must constitute a
>> well defined list as their names must be recognized) immediately invites
>> people to use them.  To quote Nick, if we have a standard it should be a
>> standard.  If people decide to bend the standard they do so at their own
>> peril.
>>
>> 2. If other encodings are deprecated (yes, that word again) in order to
>> encourage people to use the real standard, they are likely to be unaware that
>> their program has used a deprecated encoding and therefore has failed to
>> identify it.  If the writer of the CIF knows that it is using a deprecated
>> standard it can make the conversion.  Only if it does not know it is using a
>> deprecated standard and therefore neither converts nor identifies the encoding
>> will the deprecated encoding get through.
>>
>> 3. Providing a space (at this stage) for identifying encodings that may (or
>> may not) later become part of the standard is unnecessary since it can be
>> added if and when such other encodings are allowed.  UTF-8 then becomes the
>> default.
>>
>> 4. Having a single standard requires that the readers need only consider one
>> encoding and the writers need only support a conversion from the native to the
>> CIF2 standard.  Allowing 5 or 10 other encodings makes life easier for the
>> writer since it does not have to provide any conversion, but there is a price.
>> Every reader must be able to read 5 or 10 different encodings because it is
>> not allowed to reject any of the deprecated standards.  With a single
>> standard the IUCr may or may not decide that they will handle different
>> encodings, but that is their choice.  Making different encodings legal removes
>> that choice from the reader: it has to handle all possibilities, which is only
>> likely to discourage people writing local programs for occasional use.
>>
>> David
>>
>>
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
> cheers
>
> Nick
>
> --------------------------------
> Associate Professor N. Spadaccini, PhD
> School of Computer Science & Software Engineering
>
> The University of Western Australia    t: +61 (0)8 6488 3452
> 35 Stirling Highway                    f: +61 (0)8 6488 1089
> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
> MBDP  M002
>
> CRICOS Provider Code: 00126G
>
> e: Nick.Spadaccini@uwa.edu.au
>
>
>
>
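
The 33% figure for binary->BASE64 quoted above follows from Base64 emitting 4 output characters for every 3 input bytes. A minimal sketch to confirm the arithmetic, assuming plain Base64 with no line wrapping (the function name `base64_overhead` is hypothetical, and the 7% figure for the binary->UCS-2 scheme is not reproduced here because the thread does not describe that encoding):

```python
import base64


def base64_overhead(n_bytes: int) -> float:
    """Return the fractional size increase of Base64 over n_bytes of raw data."""
    payload = bytes(n_bytes)  # the content does not affect the encoded length
    encoded = base64.b64encode(payload)
    return len(encoded) / n_bytes - 1.0


# 3000 raw bytes become 4000 Base64 characters, i.e. one third larger.
print(f"{base64_overhead(3000):.1%}")
```

Wrapping the output into fixed-width lines, as MIME-style encoders do, adds the newline characters on top of this.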