Discussion List Archives

RE: Advice on COMCIFS policy regarding compatibility of CIF syntax with other domains...

Dear Colleagues,

   Unfortunately, John Bollinger, in his desire to help clarify the
current CIF2 proposal with respect to encoding, has overstated
the rules in his summary.  What the change document currently says
is:

"CIF2 files are standard variable length plain text files, which for 
compatibility with older processing systems will have a maximum line 
length of 2048 characters. As discussed above and below, however, there 
are some restrictions on the character set for token delimiters, 
separators and data names. For compatibility with CIF1 behaviour, there is 
no formal restriction on the encoding of CIF2 files, providing they 
contain only code points from the ASCII range. If a CIF2 file contains 
characters equivalent to Unicode code points greater than U+007F (127 
decimal), then the particular encoding used must either be UTF8 or 
algorithmically identifiable from the CIF2 file itself. Acceptable 
identification algorithms will be published as necessary as annexes to 
this standard (see description of magic code and encoding disambiguation 
in Change 1). Annexes notwithstanding, (i) a CIF2 file containing 
characters outside the ASCII range with no BOM and no disambiguation 
signature will be a UTF8 file, and (ii) a CIF2 file containing characters 
outside the ASCII range with a valid UTF8 or UTF16 BOM and no 
disambiguation signature, will be a Unicode file written in the indicated 
encoding.

In keeping with XML restrictions we allow the characters

U+0009  U+000A U+000D
U+0020 -- U+007E
U+00A0 -- U+D7FF
U+E000 -- U+FDCF
U+FDF0 -- U+FFFD 
U+10000 -- U+10FFFD

In addition, character U+FEFF and characters U+xFFFE or U+xFFFF where x is 
any hexadecimal digit are disallowed. Unicode reserves the code points 
E000 -- F8FF for private use. The IUCr and only the IUCr may specify what 
characters are assigned to these code points in the context of CIF2.

Reasoning: There is growing demand for the wider character set afforded by 
Unicode to be made available in applications, especially those where 
internationalisation is an issue."
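
As an illustration only, the character restrictions quoted above can be
sketched in Python. The names ALLOWED_RANGES, is_allowed and
first_disallowed are hypothetical and are not part of the change document.
The sketch assumes the text has already been decoded and that any leading
BOM has been removed beforehand.

```python
# Code point ranges copied from the list in the draft text quoted above.
ALLOWED_RANGES = [
    (0x0009, 0x000A),    # tab, line feed
    (0x000D, 0x000D),    # carriage return
    (0x0020, 0x007E),    # printable ASCII
    (0x00A0, 0xD7FF),
    (0xE000, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0x10FFFD),
]


def is_allowed(ch: str) -> bool:
    """Return True if the single character ch falls in the quoted ranges."""
    cp = ord(ch)
    # U+FEFF and every U+xFFFE / U+xFFFF are explicitly disallowed.
    if cp == 0xFEFF or (cp & 0xFFFF) >= 0xFFFE:
        return False
    return any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)


def first_disallowed(text: str):
    """Return the index of the first disallowed character, or None."""
    for i, ch in enumerate(text):
        if not is_allowed(ch):
            return i
    return None
```

For example, first_disallowed("data_x\x0b") reports index 6, because
U+000B (vertical tab) is not in the list.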

=====================================================

In particular, the statement

> The only CIF 2.0 mechanisms currently supported for
> including literal characters that have no ASCII mapping are (1) to
> encode the whole document in UTF-8 with or without a UTF-8 BOM, or (2)
> to encode the whole document in UTF-16 with a UTF-16 BOM.

implies the encoding issue is settled.  That is not what the draft change 
document says, and what is in the change document is certainly not the 
last word on encodings.

I would urge those who have ideas on the subject to feel free to express 
them, especially because the specification of "disambiguation signatures" 
is an open, unresolved issue in the change document, and the concept of a 
Unicode BOM admits a much wider range of encodings than just UTF-8 and 
UTF-16.
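
To make the BOM point concrete, here is a minimal Python sketch of BOM
sniffing. The function names sniff_bom and decode_by_draft_rules are
hypothetical. The UTF-32 entries are included only to show that a BOM can
identify encodings beyond UTF-8 and UTF-16; the draft text names only UTF8
and UTF16 BOMs. Disambiguation signatures are not handled, since they are
not yet specified.

```python
import codecs

# UTF-32 BOMs are tested first: the UTF-32-LE BOM (FF FE 00 00) begins
# with the UTF-16-LE BOM (FF FE). Treating FF FE 00 00 as UTF-32 assumes
# the file does not start with U+0000, which the draft ranges exclude.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_by_draft_rules(data: bytes) -> str:
    """Decode using the BOM if present, otherwise fall back to UTF-8.

    The fallback mirrors rule (i) of the draft: no BOM and no
    disambiguation signature means the file is taken to be UTF8.
    """
    name, skip = sniff_bom(data)
    if name is None:
        name = "utf-8"
    return data[skip:].decode(name)
```

A pure ASCII file decodes identically under the UTF-8 fallback, which is
consistent with the draft placing no formal encoding restriction on files
that contain only ASCII code points.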

I have found what has been said thus far very helpful and educational, and 
hope that the discussion will continue.

Regards,
   Herbert
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Thu, 10 Mar 2011, Bollinger, John C wrote:

>
> On Thursday, March 10, 2011 3:22 AM, Matthew Towler wrote:
>
>> I will agree with many of the points made by Peter.  I believe the 
>> decision on the byte order markings (BOM) should be made having 
>> considered what type of format CIF should be.  As I see it there are 
>> two options.
>>
>> 1) An easily human editable, text based format, as CIF 1.1 is 
>> presently.  [...]
>>
>> 2) A machine editable or non-text format, such as XML or PDF or a text 
>> file with non-standard encoding.  [...]
>
> [...]
>
>> In summary, I feel that creating a non-standard-standard will impede 
>> the usage of the new files, and therefore the best choice is to use 
>> standard Unicode files.
>
> I would like to point out that the DDLm technical subcommittee devoted 
> considerable time and energy to character encoding and related topics, 
> to the extent that we prevailed upon IUCr to provide a discussion list 
> specifically for that contentious debate.  You will find the early part 
> of the discussion among the archives of the main DDLm list 
> (http://www.iucr.org/__data/iucr/lists/ddlm-group/), and you will find 
> the later, larger part of the discussion, including the genesis of our 
> ultimate compromise, in the archives of the cif2-encoding list 
> (http://www.iucr.org/__data/iucr/lists/cif2-encoding/).
>
> A specification documenting the differences between CIF 1.1 and CIF 2.0 
> (http://www.iucr.org/__data/assets/pdf_file/0004/47434/cif2_syntax_changes_jrh20101115.pdf) 
> was previously approved by COMCIFS.  Inasmuch as the CIF 2.0 syntax 
> discussion continues, however, the changes already approved could yet be 
> modified.  I encourage those interested in the topic of character 
> encoding to read the "Change 2" section of the changes document to find 
> how CIF 2.0, as currently constituted, will address those issues.  To 
> summarize, however, the approved CIF 2.0 changes attempt to address the 
> text-based historical legacy of CIF -- recognizing that "text" is a 
> poorly-defined and system-dependent term -- while simultaneously looking 
> forward to Unicode.  The only CIF 2.0 mechanisms currently supported for 
> including literal characters that have no ASCII mapping are (1) to 
> encode the whole document in UTF-8 with or without a UTF-8 BOM, or (2) 
> to encode the whole document in UTF-16 with a UTF-16 BOM.
>
> I am certain that COMCIFS would be interested in hearing from anyone who 
> believes the compromise to be flawed or unreasonable, or that it would 
> hinder adoption of CIF 2.0.  I do hope to avoid repeating the debate 
> that the DDLm group already conducted on the topic, however.
>
>
> Regards,
>
> John
>
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
>
>
> Email Disclaimer:  www.stjude.org/emaildisclaimer
>
> _______________________________________________
> comcifs mailing list
> comcifs@iucr.org
> http://scripts.iucr.org/mailman/listinfo/comcifs
>

