Re: Advice on COMCIFS policy regarding compatibility of CIF syntax with other domains

Here is a set of three principles that I think are worth discussing here
on COMCIFS, and probably should have been discussed before we embarked
on CIF2.


Principles guiding development of CIF syntax

Preamble: The CIF syntax describes a human-readable, syntactic
container for scientific data.  CIF syntax aims to be as simple as
possible.  The domain dictionaries are the primary location of
semantic information in the Crystallographic Information Framework.

1. A feature should only be added to CIF syntax if all of the
following are satisfied:

(i) implementation or use of equivalent behaviour at dictionary level
is either significantly more cumbersome or not possible;
(ii) the feature provides significant new functionality that is widely
applicable to most scientific domains;
(iii) reliable transfer and archiving of data is not compromised;
(iv) there is no simpler way of achieving the desired behaviour.

2. As long as the requirements in (1) are satisfied, the CIF framework should:
 (i) behave in a way that is consistent with common usage
 (ii) align with pre-existing standards where those standards provide
the required behaviour. CIF1 can be considered a pre-existing standard
for CIF2 in this context.

3. Non-technical issues should be dealt with in non-technical arenas.


Justifications for these principles:


"CIF aims to have as simple a syntax as possible": this is desirable
for two reasons: human readability, and maximising the flexibility of
the data model with which the dictionary definition languages will
work.  The syntax makes relatively few assumptions about the most
appropriate way to describe scientific data, meaning that the DDL
language has a broad scope for creating data structures.

Principle 1

If we wish to have a simple syntax, we need to avoid complicating it
if at all possible, without excluding new features which are generally
useful and significantly more efficiently implemented in syntax.

Principle 2

We should not make CIF less accessible than necessary, and should not
make more work for ourselves and others if other standards already
meet our needs.

Principle 3

We have a syntax standard (the technical arena).  We also have mailing
lists, committees, documentation, Wikipedia, journal policy and
various other avenues for disseminating information and countering
misconceptions.  Where principles (1) and (2) conflict, long-term
maintenance of a standard that meets the goals in the preamble
requires that principle (1) should be the priority and therefore that
other avenues should be used to address the non-technical issues.  For
example, concerns about use of delimiters being inconsistent with the
use in some other domain could be addressed by explicit notes in
documentation or comments in CIF template files, depending on the
persons likely to be confused and the expected magnitude of the


Example 1: An idiosyncratic characteristic of CIF1 files was that quote
delimiters could appear within strings delimited by the same quotes,
provided they were not followed by whitespace.  This "feature"
provided marginal extra functionality compared to the simpler rule of
no delimiters in a string, so fails principles 1(ii) and 1(iv); and is
inconsistent with mainstream usage, rule 2(i).  It has been removed
from CIF2.
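
To make the difference concrete, here is a minimal Python sketch of the two
rules.  It is only an illustration, not taken from any CIF parser: the
function names are hypothetical, and it assumes a single-line string whose
first character is the opening quote.

```python
def read_cif1_quoted(text, start=0):
    """CIF1-style rule: a quote closes the string only if it is followed
    by whitespace (or by the end of the text).  Hypothetical helper."""
    quote = text[start]
    if quote not in "'\"":
        raise ValueError("not a quoted string")
    i = start + 1
    while i < len(text):
        at_end = i + 1 == len(text)
        if text[i] == quote and (at_end or text[i + 1].isspace()):
            return text[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated quoted string")


def read_simple_quoted(text, start=0):
    """Simpler rule: the first matching quote always closes the string.
    Hypothetical helper."""
    quote = text[start]
    if quote not in "'\"":
        raise ValueError("not a quoted string")
    end = text.find(quote, start + 1)
    if end == -1:
        raise ValueError("unterminated quoted string")
    return text[start + 1:end], end + 1


line = "'don't stop' next"
cif1_value, _ = read_cif1_quoted(line)      # embedded quote is kept
simple_value, _ = read_simple_quoted(line)  # string ends at the first quote
```

Under the first rule the embedded apostrophe is part of the value; under
the second the string ends at it, which is what most readers coming from
other languages would expect.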

Example 2: Unicode support in CIF2.  This is broadly useful, given the
international nature of science and range of symbols used in
scientific papers.  It could have been implemented in dictionaries
using ASCII escapes, but this would have been cumbersome to use, so it
satisfies Principle 1.  We have adopted Unicode (rather than created
our own international character set) and copied the XML character
ranges (Principle 2).

Example 3: Space-separated lists in CIF2.  Lists, especially matrices,
are important in science and cumbersome to implement in dictionaries
(but possible) so lists satisfy principle 1.  Using space separators
is probably less mainstream than using commas; had we allowed both, we
would definitely have satisfied principle 2.  I think principle 2 would
argue that we should allow both space and comma, but principle 1(iv)
would argue for choosing one or the other.

Example 4: Triple-quoted strings in CIF2. In the current draft these
provide no new functionality beyond the ability to quote
semicolon-delimited strings, so should probably be rejected unless new
functionality can be added.  Such new functionality would be the
ability to quote arbitrary strings (this may be exaggerating the
"significant" in principle 1(ii)).  In keeping with principle 2(i),
the eliding mechanism should be <backslash><delimiter> as this is the
most widespread approach and not markedly more complex than the
current proposal of using <backslash><eol>.  In keeping with principle
1(i) and 1(ii), no other escape sequences should be defined as they
are easily definable at dictionary level (if needed) and do not
provide behaviour that is generally needed.  In keeping with principle
(3), if there are concerns relating to user acceptance or user
confusion, they should be addressed in documentation and by providing
reference software (for example).
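
As a rough illustration of how small a <backslash><delimiter> elide rule
could be, here is a Python sketch.  It rests on assumptions that are mine
and not part of any draft: that <delimiter> means the full three-character
delimiter, that an elided delimiter appears in the value without its
backslash, and that every other backslash is kept literally.  The function
name is hypothetical, and the sketch does not address how a value ending
in a backslash would be written.

```python
def read_triple_quoted(text, start=0):
    """Read a triple-quoted string in which <backslash><delimiter> is the
    only escape.  Hypothetical helper; the rule itself is an assumption."""
    delim = text[start:start + 3]
    if delim not in ('"""', "'''"):
        raise ValueError("not a triple-quoted string")
    out = []
    i = start + 3
    while i < len(text):
        if text.startswith("\\" + delim, i):
            # Elided delimiter: keep the delimiter, drop the backslash.
            out.append(delim)
            i += 1 + len(delim)
        elif text.startswith(delim, i):
            # Unescaped delimiter closes the string.
            return "".join(out), i + len(delim)
        else:
            # Everything else, including other backslashes, is literal.
            out.append(text[i])
            i += 1
    raise ValueError("unterminated triple-quoted string")


source = '"""a \\""" b\\n"""'   # the characters: """a \""" b\n"""
value, end = read_triple_quoted(source)
```

Here the sequence backslash-n stays as two literal characters, since no
escape other than the elided delimiter is recognised; any further
interpretation would be left to the dictionary level.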


On Tue, Mar 1, 2011 at 3:12 PM, James Hester <jamesrhester@gmail.com> wrote:
> Dear COMCIFS members:
> The DDLm group is currently engaging in developing an elide mechanism
> for the CIF2 standard.  Our deliberations have reached something of an
> impasse due to disagreement around the use of triple quotes as a
> string delimiter.  Python is a popular programming language that also
> uses triple quotes to delimit strings. One side of the discussion
> considers that use of triple quotes as a string delimiter means that
> all escape sequences recognised by Python should also be recognised by
> CIF, in order to avoid confusion and improve consistency with
> mainstream (i.e. Python) practice.  The other side of the discussion
> sees little benefit to CIF from including the additional ten or so
> escape sequences and advocates leaving them out of the CIF2 standard,
> instead adopting the minimal number of escape sequences to allow
> eliding.
> We would like COMCIFS participants to provide some input as to the
> appropriate policy to be followed in this situation: should we seek
> maximum consistency with other usage of identical syntactical
> constructs, despite the imposition of unnecessary technical baggage?
> Or should we produce a standard as simple and streamlined as possible,
> despite the potential for confusion and unorthodox behaviour?
> Details of discussions so far can be found at
> http://www.iucr.org/__data/iucr/lists/ddlm-group/
> James.
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148

