Discussion List Archives


Re: Advice on COMCIFS policy regarding compatibility of CIF syntax with other domains

On Fri, Mar 4, 2011 at 11:47 AM, James Hester <jamesrhester@gmail.com> wrote:
Thanks Peter for your comments.  While you may not be a voting member
of COMCIFS, you and other COMCIFS members fulfill an important
advisory role and I would encourage everybody to take the opportunity
to provide their perspectives.

I assume you have no particular disagreement with the principles that
you haven't commented on explicitly?

None at all - it's just that I haven't been as heavily engaged in CIF recently and so wouldn't have meaningful comments.

I've added some comments in response to your comments, inserted below:
> I found the original ASCII escapes difficult/tedious for some code points
> and would urge full Unicode support (with numeric values).

I perhaps wasn't clear that we have already taken this step.  The
current CIF2 draft envisions full Unicode support using UTF-8
encoding.  Some provision has been made for allowing other encodings
in the future.  The point of the example was to show how this decision
to adopt Unicode was justifiable in terms of these principles.

It's really important to manage encoding. I am completely supportive of UTF-8, but we don't mandate it in CML as XML can manage different encodings. The problem comes when non-conformant tools are used, and this is particularly common with Microsoft tools which use CP-1252. This means that for any code points above 127 a cut-and-paste is likely to corrupt characters.
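
As a minimal sketch of the corruption described above (Python and the sample character are arbitrary choices for illustration, not anything taken from the CIF2 draft), this simulates a tool that reads UTF-8 bytes as if they were CP-1252 and then saves the result:

```python
# Illustrative only: simulate a tool that misreads UTF-8 bytes as CP-1252.
original = "\u00e9"                     # e with acute accent, a code point above 127
utf8_bytes = original.encode("utf-8")   # two bytes: 0xC3 0xA9
misread = utf8_bytes.decode("cp1252")   # each byte is shown as its own character
resaved = misread.encode("utf-8")       # the damage is written back as valid UTF-8

# Plain ASCII (code points 0-127) is unaffected, which is why the
# problem can go unnoticed in English-only text.
ascii_text = "cell_length_a"
ascii_roundtrip = ascii_text.encode("utf-8").decode("cp1252")

print(repr(original), "->", repr(misread))
print(len(utf8_bytes), "bytes before,", len(resaved), "bytes after")
print(ascii_roundtrip == ascii_text)
```

Note that the re-saved bytes are themselves well-formed UTF-8, so the corruption cannot be detected afterwards by a validity check alone.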

So if I have understood correctly, all CIF documents MUST use UTF-8, and I'd strongly support this. It might be useful to announce this in the document (similarly to XML's <?xml version="1.0" encoding="UTF-8"?>). This is so that non-CIF tools can recognise the encoding.
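
A tool that wants to enforce a UTF-8-only rule could reject input that does not decode strictly. The following is a rough sketch under that assumption; `is_valid_utf8` is a hypothetical helper name and not part of any CIF library:

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


sample = "caf\u00e9"
good = sample.encode("utf-8")    # the accented letter becomes two bytes
bad = sample.encode("cp1252")    # the accented letter becomes the single byte 0xE9

print(is_valid_utf8(good), is_valid_utf8(bad))
```

A check like this catches files saved directly in CP-1252, but it says nothing about text that was mis-decoded and then re-saved as UTF-8.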

It does put requirements on the toolchain. If an author receives a CIF with high code points, pastes bits of it into (say) Windows, and re-saves, there is a good chance that characters will become corrupted. Anglophones often do not realise this as they do not have diacritics and high code points. (I applaud the removal of the separate escaped diacritic that CIF originally had).


Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dept. of Chemistry
University of Cambridge
