Discussion List Archives

Re: Advice on COMCIFS policy regarding compatibility of CIF syntax with other domains

Dear Peter,

    There is a misunderstanding here.  CIF2 documents are _not_ all
required to use UTF-8.  The current draft proposal is written
in terms of Unicode, but the proposal explicitly says:

"For compatibility with CIF1 behaviour, there is no formal 
restriction on the encoding of CIF2 files, providing they contain 
only code points from the ASCII range. If a CIF2 file contains 
characters equivalent to Unicode code points greater than U+007F 
(127 decimal), then the particular encoding used must either be UTF8 
or algorithmically identifiable from the CIF2 file itself.
Acceptable identification algorithms will be published as necessary 
as annexes to this standard (see description of magic code and 
encoding disambiguation in Change 1). Annexes notwithstanding, (i) a 
CIF2 file containing characters outside the ASCII range with no BOM 
and no disambiguation signature will be a UTF8 file, and (ii) a CIF2 
file containing characters outside the ASCII range with a valid UTF8 
or UTF16 BOM and no disambiguation signature, will be a Unicode file 
written in the indicated encoding."
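
To make the fallback rules (i) and (ii) concrete, here is a rough Python
sketch of how a reader might pick a decoder for the raw bytes of a CIF2
file.  It is only an illustration under stated assumptions, not part of
the draft: the function name guess_cif2_encoding is hypothetical, the
disambiguation signatures are not handled because none have been agreed,
and only the UTF8 and UTF16 BOMs named in the draft are checked.

```python
import codecs


def guess_cif2_encoding(data: bytes) -> str:
    """Return a Python codec name for raw CIF2 bytes.

    Hypothetical helper: covers only the BOM and no-BOM fallback rules,
    not any disambiguation signature (none has been agreed yet).
    """
    # Rule (ii): a valid UTF8 or UTF16 BOM selects the indicated encoding.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes UTF-8 and drops the BOM
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"  # Python's codec reads the BOM to pick byte order

    # Only ASCII code points: no formal restriction on the encoding,
    # so plain ASCII decoding is sufficient.
    if all(byte < 0x80 for byte in data):
        return "ascii"

    # Rule (i): non-ASCII content, no BOM, no signature: treat as UTF8.
    return "utf-8"


# Usage: decode a UTF16 file that starts with a BOM.
raw = "data_caf\u00e9\n".encode("utf-16")
text = raw.decode(guess_cif2_encoding(raw))
```

Checking the BOM first keeps the ASCII test simple, since a UTF16 file
always contains bytes outside the ASCII range.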

We have not yet been able to come to agreement on the "disambiguation
signatures" to be used.  We have space reserved on the first line. 
Any suggestions?


At 2:07 PM +0000 3/4/11, Peter Murray-Rust wrote:
>On Fri, Mar 4, 2011 at 11:47 AM, James Hester
><jamesrhester@gmail.com> wrote:
>Thanks Peter for your comments.  While you may not be a voting member
>of COMCIFS, you and other COMCIFS members fulfill an important
>advisory role and I would encourage everybody to take the opportunity
>to provide their perspectives.
>I assume you have no particular disagreement with the principles that
>you haven't commented on explicitly?
>None at all - it's just that I haven't been as heavily engaged in
>CIF recently and so wouldn't have meaningful comments.
>I've added some comments in response to your comments, inserted below:
>  >
>  > I found the original ASCII escapes difficult/tedious for some code points
>  > and would urge full Unicode support (with numeric values).
>I perhaps wasn't clear that we have already taken this step.  The
>current CIF2 draft envisions full Unicode support using UTF-8
>encoding.  Some provision has been made for allowing other encodings
>in the future.  The point of the example was to show how this decision
>to adopt Unicode was justifiable in terms of these principles.
>It's really important to  manage encoding. I am completely
>supportive of UTF-8 but we don't mandate it in CML as XML can manage
>different encodings. The problem comes when non-conformant tools are
>used and this is particularly common with Microsoft tools which use
>CP-1252. This means that for any code points above 127 a
>cut-and-paste is likely to corrupt characters.
>So if I have understood correctly all CIF documents MUST use UTF-8
>and I'd strongly support this. It might be useful to announce this
>in the document (similarly to XML's <? encoding="UTF-8"?>). This is
>so that non-CIF tools can recognise the encoding.
>It does put requirements on the toolchain. If an author receives a
>CIF with high codepoints, pastes bits of it into (say) Windows and
>re-saves there is a good chance that characters will become
>corrupted. Anglophones often do not realise this as they do not have
>diacritics and high-code points. (I applaud the removal of the
>separate escaped diacritic that CIF originally had).
>Peter Murray-Rust
>Reader in Molecular Informatics
>Unilever Centre, Dep. Of Chemistry
>University of Cambridge
>CB2 1EW, UK

  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

