
Re: Advice on COMCIFS policy regarding compatibility of CIF syntax with other domains

To: "Discussion list of the IUCr Committee for the Maintenance of the CIF Standard (COMCIFS)" <[email protected]>
Subject: Re: Advice on COMCIFS policy regarding compatibility of CIF syntax with other domains
From: Peter Murray-Rust <[email protected]>
Date: Fri, 4 Mar 2011 16:50:41 +0000
In-Reply-To: <[email protected]>
References: <[email protected]><[email protected]><[email protected]><[email protected]><[email protected]><[email protected]>

On Fri, Mar 4, 2011 at 3:30 PM, Herbert J. Bernstein <[email protected]> wrote:

Dear Peter,

  There is a misunderstanding here. All CIF2 documents are _not_
required to use UTF-8. The current draft proposal is written
in terms of Unicode, but the proposal explicitly says:

My apologies again.

"For compatibility with CIF1 behaviour, there is no formal
restriction on the encoding of CIF2 files, providing they contain
only code points from the ASCII range. If a CIF2 file contains
characters equivalent to Unicode code points greater than U+007F
(127 decimal), then the particular encoding used
must either be UTF8 or algorithmically identifiable from the CIF2 file itself.
Acceptable identification algorithms will be published as necessary
as annexes to this standard (see description of magic code and
encoding disambiguation in Change 1). Annexes notwithstanding, (i) a
CIF2 file containing characters outside the ASCII range with no BOM
and no disambiguation signature will be a UTF8 file, and (ii) a CIF2
file containing characters outside the ASCII range with a valid UTF8
or UTF16 BOM and no disambiguation signature, will be a Unicode file
written in the indicated encoding."

This seems reasonable. I interpret it as meaning that a CIF1 document uses only characters from U+0020 to U+007F, so it is compatible with any encoding. Presumably processing software may then create higher Unicode code points from appropriate escape sequences? In that case it should label the output document with the encoding used.
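As a rough illustration of how a reader might apply the rule quoted above, here is a minimal Python sketch. It is not part of the draft: the function name guess_cif2_encoding is hypothetical, the return values are Python codec names rather than anything the proposal defines, and the "disambiguation signature" case is left out because no signature has been agreed.

```python
import codecs


def guess_cif2_encoding(data: bytes) -> str:
    """Guess a Python codec name for raw CIF2 bytes (illustrative only)."""
    # Case (ii) of the quoted text: a valid UTF-8 or UTF-16 BOM
    # indicates the encoding the file is written in.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes UTF-8 and drops the BOM
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # the codec reads the BOM to pick the byte order
    # Only ASCII-range bytes: the draft places no formal restriction
    # on the encoding, so plain ASCII decoding is sufficient here.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    # Case (i): non-ASCII content, no BOM, no signature -> UTF-8.
    return "utf-8"


if __name__ == "__main__":
    sample = "_cell_length_a 5.43 # \u00c5".encode("utf-8")
    encoding = guess_cif2_encoding(sample)
    print(encoding, sample.decode(encoding))
```

A file that claims to be UTF-8 by this rule but is not will still fail at decode time, which is arguably the desired behaviour.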

We have not yet been able to come to agreement on the "disambiguation
signatures to be used". We have space reserved on the first line.
Any suggestions?

I would suggest that encodings should only be taken from http://www.iana.org/assignments/character-sets, and that the encoding (including UTF-8) should be recorded in the first line of the file using only ASCII characters so that other software can recognise it. I haven't followed the discussions on syntax but would suggest

encoding="FooBar1234"

as being compatible with XML and therefore most easily human-interpretable. I would not rely on the BOM as I expect that cut-and-paste will often destroy it.
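To make the suggestion concrete, here is a minimal Python sketch of how software might read such a declaration. It rests on several assumptions that are not agreed anywhere: that the declaration sits somewhere on the first line in the form encoding="NAME", that the first line itself is ASCII, and that Python's codec registry stands in for the IANA list (the two overlap but are not identical). The function name declared_encoding and the sample first line are hypothetical.

```python
import codecs
import re
from typing import Optional

# Hypothetical declaration syntax: encoding="NAME" on the first line.
_DECLARATION = re.compile(rb'encoding="([A-Za-z0-9._:-]+)"')


def declared_encoding(data: bytes) -> Optional[str]:
    """Return the codec named on the first line, or None if absent/unknown."""
    first_line = data.split(b"\n", 1)[0]
    match = _DECLARATION.search(first_line)
    if match is None:
        return None
    name = match.group(1).decode("ascii")
    try:
        # codecs.lookup normalises aliases, e.g. "UTF-8" -> "utf-8".
        return codecs.lookup(name).name
    except LookupError:
        # Unknown name: the caller must fall back to some other rule.
        return None


if __name__ == "__main__":
    cif = b'#\\#CIF_2.0 encoding="UTF-8"\ndata_example\n'
    print(declared_encoding(cif))
```

Because the declaration is matched on raw bytes before any decoding, it survives cut-and-paste in a way that a BOM may not.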

I found http://www.opentag.com/xfaq_enc.htm quite a useful resource...

P.

  Regards,
    Herbert

At 2:07 PM +0000 3/4/11, Peter Murray-Rust wrote:
>On Fri, Mar 4, 2011 at 11:47 AM, James Hester <[email protected]> wrote:
>
>Thanks Peter for your comments. While you may not be a voting member
>of COMCIFS, you and other COMCIFS members fulfill an important
>advisory role and I would encourage everybody to take the opportunity
>to provide their perspectives.
>
>I assume you have no particular disagreement with the principles that
>you haven't commented on explicitly?
>
>
>None at all - it's just that I haven't been as heavily engaged in
>CIF recently and so wouldn't have meaningful comments.
>
>
>I've added some comments in response to your comments, inserted below:
>
>  >
>  > I found the original ASCII escapes difficult/tedious for some code points
>  > and would urge full Unicode support (with numeric values).
>
>I perhaps wasn't clear that we have already taken this step. The
>current CIF2 draft envisions full Unicode support using UTF-8
>encoding. Some provision has been made for allowing other encodings
>in the future. The point of the example was to show how this decision
>to adopt Unicode was justifiable in terms of these principles.
>
>
>It's really important to manage encoding. I am completely
>supportive of UTF-8 but we don't mandate it in CML as XML can manage
>different encodings. The problem comes when non-conformant tools are
>used, and this is particularly common with Microsoft tools which use
>CP-1252. This means that for any code points above 127 a
>cut-and-paste is likely to corrupt characters.
>
>So if I have understood correctly all CIF documents MUST use UTF-8
>and I'd strongly support this. It might be useful to announce this
>in the document (similarly to XML's <?xml version="1.0" encoding="UTF-8"?>). This is
>so that non-CIF tools can recognise the encoding.
>
>It does put requirements on the toolchain. If an author receives a
>CIF with high codepoints, pastes bits of it into (say) Windows and
>re-saves there is a good chance that characters will become
>corrupted. Anglophones often do not realise this as they do not have
>diacritics and high code points. (I applaud the removal of the
>separate escaped diacritic that CIF originally had).
>
>P.
>
>
>--
>Peter Murray-Rust
>Reader in Molecular Informatics
>Unilever Centre, Dep. Of Chemistry
>University of Cambridge
>CB2 1EW, UK
>+44-1223-763069
>

>_______________________________________________
>comcifs mailing list
>[email protected]
>http://scripts.iucr.org/mailman/listinfo/comcifs

--
=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 [email protected]
=====================================================

--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
