Discussion List Archives


Re: Advice on COMCIFS policy regarding compatibility of CIF syntax with other domains



On Fri, Mar 4, 2011 at 11:47 AM, James Hester <[email protected]> wrote:
Thanks Peter for your comments. While you may not be a voting member
of COMCIFS, you and other COMCIFS members fulfill an important
advisory role and I would encourage everybody to take the opportunity
to provide their perspectives.

I assume you have no particular disagreement with the principles that
you haven't commented on explicitly?

None at all - it's just that I haven't been as heavily engaged in CIF recently and so wouldn't have meaningful comments.

I've added some comments in response to your comments, inserted below:
>
> I found the original ASCII escapes difficult/tedious for some code points
> and would urge full Unicode support (with numeric values).

I perhaps wasn't clear that we have already taken this step. The
current CIF2 draft envisions full Unicode support using UTF-8
encoding. Some provision has been made for allowing other encodings
in the future. The point of the example was to show how this decision
to adopt Unicode was justifiable in terms of these principles.


It's really important to manage encoding. I am completely supportive of UTF-8, but we don't mandate it in CML, as XML can manage different encodings. The problem comes when non-conformant tools are used, and this is particularly common with Microsoft tools, which use CP-1252. This means that for any code point above 127, a cut-and-paste is likely to corrupt characters.
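As a rough illustration of that failure mode, here is a minimal Python sketch. The sample string is invented for the example and is not taken from any real CIF; the only assumption is that a tool reads UTF-8 bytes as if they were CP-1252.

```python
# Hypothetical sample text containing two code points above 127.
original = "Müller, 1.54 Å"

# A conformant tool writes the text as UTF-8 bytes.
utf8_bytes = original.encode("utf-8")

# A tool that assumes CP-1252 interprets each UTF-8 byte as one character,
# so every non-ASCII character turns into two unrelated characters.
garbled = utf8_bytes.decode("cp1252")

print(original)
print(garbled)

# The ASCII portion survives unchanged; only the high code points are damaged.
print(garbled.startswith("M"), garbled.endswith("ller, 1.54 Å"))
```

Pure ASCII text passes through unchanged, which is why the problem tends to go unnoticed until a diacritic or symbol appears.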

So if I have understood correctly, all CIF documents MUST use UTF-8, and I'd strongly support this. It might be useful to announce this in the document (similarly to XML's <?xml version="1.0" encoding="UTF-8"?>), so that non-CIF tools can recognise the encoding.

It does put requirements on the toolchain. If an author receives a CIF with high code points, pastes bits of it into (say) Windows and re-saves, there is a good chance that characters will become corrupted. Anglophones often do not realise this, as they do not have diacritics and high code points. (I applaud the removal of the separate escaped diacritic that CIF originally had.)
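One way a toolchain could guard against this is to refuse input that is not valid UTF-8. The sketch below is only an illustration under that assumption: `is_valid_utf8` is a hypothetical helper name, not part of any CIF library, and the sample strings are invented.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict decoding raises on malformed sequences
    except UnicodeDecodeError:
        return False
    return True


# Text saved correctly as UTF-8 passes the check.
good = "Müller".encode("utf-8")

# The same text re-saved by a tool that writes CP-1252 does not:
# the single byte for "ü" is not a valid UTF-8 sequence here.
bad = "Müller".encode("cp1252")

print(is_valid_utf8(good), is_valid_utf8(bad))
```

A check like this catches files that were re-saved in CP-1252, but it cannot detect text that was already garbled and then written back out as valid UTF-8.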

P.


--
Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
CB2 1EW, UK
+44-1223-763069
