Discussion List Archives

Re: Advice on COMCIFS policy regarding compatibility of CIF syntax with other domains

I add some comments arising out of my own experience with XML/CML which may be useful. I don't think I am a full member of COMCIFS, so feel free to ignore any or all of them. I comment after significant paragraphs.

On Fri, Mar 4, 2011 at 6:03 AM, James Hester <jamesrhester@gmail.com> wrote:

1. A feature should only be added to CIF syntax if all of the
following are satisfied:

(i) implementation or use of equivalent behaviour at dictionary level
is either significantly more cumbersome or not possible;
(ii) the feature provides significant new functionality that is widely
applicable to most scientific domains
(iii) reliable transfer and archiving of data is not compromised
(iv) there is no simpler way of achieving the desired behaviour

I would add:
* a feature should only be added if it has been shown that it can be implemented with "reasonable ease", in the spirit of "rough consensus and running code".

Example 2: Unicode support in CIF2.  This is broadly useful, given the
international nature of science and range of symbols used in
scientific papers.  It could have been implemented in dictionaries
using ASCII escapes, but this would have been cumbersome to use, so it
satisfies Principle 1.  We have adopted Unicode (rather than created
our own international character set) and copied the XML character
ranges (Principle 2)

I found the original ASCII escapes difficult/tedious for some code points and would urge full Unicode support (with numeric values).
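
To illustrate what I mean by numeric values, here is a minimal Python sketch that round-trips non-ASCII characters through XML-style numeric character references (&#xHHHH; or &#DDDD;). The function names are hypothetical and this is not CIF syntax; it only shows that a numeric form covers every code point without a hand-maintained table of escapes.

```python
import re

# Matches XML-style numeric character references, hexadecimal or decimal.
_NUMERIC_REF = re.compile(r"&#x([0-9A-Fa-f]+);|&#([0-9]+);")


def to_numeric_refs(text):
    """Replace every non-ASCII character with a hexadecimal reference."""
    return "".join(
        ch if ord(ch) < 128 else "&#x{:X};".format(ord(ch)) for ch in text
    )


def from_numeric_refs(text):
    """Expand hexadecimal and decimal references back into characters."""

    def expand(match):
        hex_digits, dec_digits = match.group(1), match.group(2)
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        return chr(int(dec_digits))

    return _NUMERIC_REF.sub(expand, text)


encoded = to_numeric_refs("1.54 Å")
decoded = from_numeric_refs(encoded)
```

Any code point can be written this way, which is what made the numeric form less tedious for me than symbolic escapes.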

Example 3: Space-separated lists in CIF2.  Lists, especially matrices,
are important in science and cumbersome to implement in dictionaries
(but possible) so lists satisfy principle 1.  Using space separators
is probably less mainstream than using commas - if we had chosen to
use both we would have definitely satisfied rule 2.  I think rule 2
would argue that we should allow both space and comma, but principle
1(iv) would argue choosing one or the other.

We use whitespace-separated strings (i.e. including newline, tab, etc.) by default in CML for numeric arrays and matrices. It works well. However, for lists of general strings, dates, etc. we allow the author to choose a delimiter which they know is not present in the strings.
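
A rough Python sketch of both conventions follows. It is an illustration only, not the CML or CIF API: the function names and the list of candidate delimiters are my assumptions.

```python
def parse_numeric_array(text):
    """Split on any run of whitespace (space, tab, newline) and convert."""
    return [float(token) for token in text.split()]


def parse_matrix(text, rows, cols):
    """Read rows*cols whitespace-separated numbers into a list of rows."""
    values = parse_numeric_array(text)
    if len(values) != rows * cols:
        raise ValueError(
            "expected %d values, found %d" % (rows * cols, len(values))
        )
    return [values[r * cols:(r + 1) * cols] for r in range(rows)]


def choose_delimiter(items, candidates=("|", ";", ",", "\t")):
    """Return the first candidate that occurs in none of the strings."""
    for delimiter in candidates:
        if all(delimiter not in item for item in items):
            return delimiter
    raise ValueError("no candidate delimiter is absent from all items")


def join_strings(items):
    """Join general strings, returning the delimiter used and the text."""
    delimiter = choose_delimiter(items)
    return delimiter, delimiter.join(items)


identity = parse_matrix("1 0\n0 1", rows=2, cols=2)
delimiter, joined = join_strings(["Smith, J.", "Jones; A."])
recovered = joined.split(delimiter)
```

The delimiter has to travel with the data (in CML we record it alongside the list) so that the reader can split on the same character.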

Some locales (e.g. DE) use a comma as the decimal separator, and this is often applied by the operating system's number formatting. Thus the two values 1.23,3.45 could be emitted as 1,23,3,45, which can no longer be split unambiguously. It is possible, but tedious, to refactor code so that it always uses a period as the decimal point.
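
The ambiguity can be shown with a short Python sketch. To keep it runnable everywhere, the comma-decimal output is simulated by string replacement rather than by switching the process locale; the function names are hypothetical.

```python
def emit_with_comma_decimals(values):
    """Simulate comma-decimal output joined by commas (the problem case)."""
    return ",".join(repr(v).replace(".", ",") for v in values)


def emit_portable(values):
    """Emit whitespace-separated numbers with a period as decimal point.

    repr() of a Python float does not consult the locale, so the
    decimal point is always a period.
    """
    return " ".join(repr(v) for v in values)


def parse_portable(text):
    """Read back whitespace-separated, period-decimal numbers."""
    return [float(token) for token in text.split()]


values = [1.23, 3.45]

ambiguous = emit_with_comma_decimals(values)   # "1,23,3,45"
ambiguous_tokens = ambiguous.split(",")        # four tokens, not two values

portable = emit_portable(values)               # "1.23 3.45"
round_tripped = parse_portable(portable)
```

With whitespace as the separator and a fixed decimal point, the reader recovers exactly the two values that were written.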

I would also support the use of dictionaries for extending human and machine semantics.


Peter Murray-Rust
Reader in Molecular Informatics
Unilever Centre, Dep. Of Chemistry
University of Cambridge
