Re: Advice on COMCIFS policy regarding compatibility of CIF syntax with other domains

On Fri, Mar 4, 2011 at 3:30 PM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
Dear Peter,

   There is a misunderstanding here.  All CIF2 documents are _not_
required to use UTF-8.  The current draft proposal is written
in terms of Unicode, but the proposal explicitly says:

My apologies again.

"For compatibility with CIF1 behaviour, there is no formal
restriction on the encoding of CIF2 files, providing they contain
only code points from the ASCII range. If a CIF2 file contains
characters equivalent  to Unicode code points greater than U+007F
(127 decimal), then the particular encoding used
must either be UTF8 or algorithmically identifiable from the CIF2 file itself.
Acceptable identification algorithms will be published as necessary
as annexes to this standard (see description of magic code and
encoding disambiguation in Change 1). Annexes notwithstanding, (i) a
CIF2 file containing characters outside the ASCII range with no BOM
and no disambiguation signature will be a UTF8 file, and (ii) a CIF2
file containing characters outside the ASCII range with a  valid UTF8
or UTF16 BOM and no disambiguation signature, will be a Unicode file
written in the indicated encoding."
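
The default rules in the quoted paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `identify_cif2_encoding` is hypothetical, the "disambiguation signature" is not handled because its form has not been agreed, and only the UTF-8 and UTF-16 BOMs named in the draft text are checked.

```python
import codecs


def identify_cif2_encoding(data: bytes) -> str:
    """Guess a Python codec name for raw CIF2 bytes (hypothetical helper).

    Follows only the default rules quoted from the draft; disambiguation
    signatures are ignored because their syntax is still undecided.
    """
    # Rule (ii): a valid UTF-8 or UTF-16 BOM indicates the encoding.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes UTF-8 and drops the BOM
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # this codec reads the BOM to pick the byte order
    # Only bytes in the ASCII range: the draft places no formal
    # restriction on the encoding, so plain ASCII is a safe reading.
    if all(byte < 0x80 for byte in data):
        return "ascii"
    # Rule (i): non-ASCII content with no BOM and no signature is UTF-8.
    return "utf-8"


raw = "_cell_length_a 5.43 # \u00c5ngstr\u00f6m".encode("utf-8")
text = raw.decode(identify_cif2_encoding(raw))
```

A file that fails to decode under the returned codec would not be a conforming file under these default rules; how to report that is left open here.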

This seems reasonable. I interpret it as meaning that a CIF1 document uses only characters from U+0020 to U+007F and is therefore compatible with any encoding. Presumably processing software may then create higher Unicode code points from appropriate escape sequences? In that case it should label the output document with the encoding used.

We have not yet been able to come to agreement on the "disambiguation
signatures" to be used.  We have space reserved on the first line.
Any suggestions?

I would suggest that encodings should only be taken from http://www.iana.org/assignments/character-sets, and that the encoding (including UTF-8) should be recorded in the first line of the file using only ASCII characters so that other software can recognise the encoding. I haven't followed the discussions on syntax but would suggest


as being compatible with XML and therefore most easily human-interpretable. I would not rely on the BOM as I expect that cut-and-paste will often destroy it.
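
The cut-and-paste hazard can be illustrated with a short Python sketch. It assumes a tool that reads UTF-8 bytes as CP-1252 and then re-saves the result as UTF-8; the helper name `misread_as_cp1252` is hypothetical.

```python
def misread_as_cp1252(text: str) -> str:
    """Simulate a tool that interprets UTF-8 bytes as CP-1252 (hypothetical)."""
    # errors="replace" covers the few byte values CP-1252 leaves undefined.
    return text.encode("utf-8").decode("cp1252", errors="replace")


ascii_only = "data_example"
accented = "\u00e9"  # e with acute accent, two bytes in UTF-8

# Code points up to U+007F survive the round trip unchanged.
assert misread_as_cp1252(ascii_only) == ascii_only

# A code point above 127 is split into two unrelated characters.
corrupted = misread_as_cp1252(accented)
assert corrupted == "\u00c3\u00a9"  # A-tilde followed by the copyright sign

# Re-saving the corrupted text as UTF-8 bakes the damage into the file.
resaved = corrupted.encode("utf-8")
assert len(resaved) == 4
```

Only the ASCII range is safe in this scenario, which is why a declaration written purely in ASCII on the first line would survive such handling where a BOM might not.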

I found http://www.opentag.com/xfaq_enc.htm quite a useful resource...



At 2:07 PM +0000 3/4/11, Peter Murray-Rust wrote:
>On Fri, Mar 4, 2011 at 11:47 AM, James Hester
><jamesrhester@gmail.com> wrote:
>Thanks Peter for your comments.  While you may not be a voting member
>of COMCIFS, you and other COMCIFS members fulfill an important
>advisory role and I would encourage everybody to take the opportunity
>to provide their perspectives.
>I assume you have no particular disagreement with the principles that
>you haven't commented on explicitly?
>None at all - it's just that I haven't been as heavily engaged in
>CIF recently and so wouldn't have meaningful comments.
>I've added some comments in response to your comments, inserted below:
>  >
>  > I found the original ASCII escapes difficult/tedious for some code points
>  > and would urge full Unicode support (with numeric values).
>I perhaps wasn't clear that we have already taken this step.  The
>current CIF2 draft envisions full Unicode support using UTF-8
>encoding.  Some provision has been made for allowing other encodings
>in the future.  The point of the example was to show how this decision
>to adopt Unicode was justifiable in terms of these principles.
>It's really important to  manage encoding. I am completely
>supportive of UTF-8 but we don't mandate it in CML as XML can manage
>different encodings. The problem comes when non-conformant tools are
>used and this is particularly common with Microsoft tools which use
>CP-1252. This means that for any code points above 127 a
>cut-and-paste is likely to corrupt characters.
>So if I have understood correctly all CIF documents MUST use UTF-8
>and I'd strongly support this. It might be useful to announce this
>in the document (similarly to XML's <? encoding="UTF-8"?>). This is
>so that non-CIF tools can recognise the encoding.
>It does put requirements on the toolchain. If an author receives a
>CIF with high codepoints, pastes bits of it into (say) Windows and
>re-saves there is a good chance that characters will become
>corrupted. Anglophones often do not realise this as they do not have
>diacritics and high-code points. (I applaud the removal of the
>separate escaped diacritic that CIF originally had).
>Peter Murray-Rust
>Reader in Molecular Informatics
>Unilever Centre, Dep. Of Chemistry
>University of Cambridge
>CB2 1EW, UK

 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769


Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
