Discussion List Archives

Re: [ddlm-group] Moving on to DDLm

On Monday, March 21, 2011 12:57 AM, James Hester wrote:
JH>It is apparent to me that we are not as close as I had hoped to
JH>finalising CIF2 syntax.  I believe that the remaining issues revolve
JH>largely around basic CIF2 semantics, and are limited to:
JH>(1) Choice of elide mechanism for triple-quoted strings
JH>(2) Inclusion of python-style backslash sequences in triple-quoted strings
JH>(3) Meaning of "1.23" vs 1.23
JH>(4) Meaning of <period>, <question mark> and quoted versions thereof
JH>While all of these issues need to be resolved, they are not critical
JH>to the operation of DDLm or CIF2 in the sense that failsafe strategies
JH>exist to avoid the issues.  I would therefore like to propose that we
JH>adopt the following strategy:
JH>(1) Resolve that *only* the above-listed items remain under discussion
JH>as far as CIF2 syntax and basic semantics are concerned;
JH>(2) Vote to adopt DDLm
JH>(3) Vote to adopt dREL
JH>(4) Finalise the remaining syntax/semantic issues

To the extent that resolution of James's list of remaining CIF issues does not require opening any broader questions -- and I don't foresee that it would -- I think leaving only those aspects of universal CIF syntax and semantics on the table is reasonable.

I have devoted some thought over the weekend to the design principle discussion, and I should like to offer it to you now, while it is still fresh.  I shall afterward remain silent on the topic at least until COMCIFS is prepared to take it back up.  My apologies for the length:

On Friday, March 18, 2011 5:20 AM, Herbert J. Bernstein wrote:
HJB>No, I do not see a problem with separate syntax and semantics documents any more than I see a problem with separate productions in a grammar.
HJB>I do see a problem with _considering_ the design or impact of either syntax or semantics in isolation from each other.  I firmly believe that the result of a purely "bottom-up" syntax-first design in isolation from a "top-down" semantics design or of a top-down semantics-first design in isolation from a syntax design is inefficient and likely to walk us into dead-ends.  I derive this view from decades of literature on software engineering on failed approaches to software designs, and the continuing success of "the Scandinavian method" or "participatory design" in which work on internal design is intertwined with design of externals.
HJB>As for the requested example -- I already gave one -- the design of the numeric types in CIF 1.0 and CIF 1.1, in which the equivalence classes of numbers (i.e. that 13.45 and 1.345E1 are the "same" number) is simply an assumed semantic feature intimately coupled to the syntax.  To give another, much more subtle equivalence class issue, the equivalence of "abc" and abc and 'abc' but the inequivalences of "123" and 123 and of "." and "?" from . and ? are semantic issues intimately coupled to the syntax.  The original design document for CIF was a semantics document with a bit of syntax intertwined.  DDL came later and the pure syntax and semantics documents came long after the intertwined approach, _after_ everybody had a clear view of the interaction of CIF1 syntax and semantics.  I, for one, do not yet have a clear understanding of those interactions for CIF2 and DDLm.

Herbert's points are well made.  I agree that the design of numeric types is a soft spot in the CIF 1.x specifications, and that it relies on a close relationship between syntax and semantics.  The special data values . and ? depend even more on such a relationship.  Herbert is right that CIF 2.0 syntax cannot be designed in isolation from CIF 2.0 semantics, and that these issues in particular should be addressed.
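
To make the equivalence classes in Herbert's example concrete, here is a minimal Python sketch.  It is only an illustration of the CIF 1.x behaviour he describes, not text from any CIF specification: the function name value_class, the number pattern, and the tuple labels are my own assumptions, and standard uncertainties such as 1.234(5) are deliberately ignored.

```python
import re
from decimal import Decimal

# Hypothetical, simplified pattern for an unquoted numeric token.
# Assumption: no standard-uncertainty suffix such as 1.234(5).
_NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def value_class(token: str):
    """Return a key such that two tokens denote the 'same' value
    exactly when their keys compare equal (illustrative only)."""
    # A quoted token is a character string, whatever it contains.
    if len(token) >= 2 and token[0] == token[-1] and token[0] in "'\"":
        return ("string", token[1:-1])
    # Unquoted . and ? are the special data values.
    if token in (".", "?"):
        return ("special", token)
    # An unquoted token that looks like a number is compared numerically,
    # so different spellings of one number fall into one class.
    if _NUMBER.fullmatch(token):
        return ("number", Decimal(token))
    # Any other unquoted token is a character string.
    return ("string", token)


if __name__ == "__main__":
    print(value_class("13.45") == value_class("1.345E1"))
    print(value_class('"abc"') == value_class("abc"))
    print(value_class('"123"') == value_class("123"))
    print(value_class('"."') == value_class("."))
```

The point of the sketch is that every branch is a semantic decision keyed off a purely syntactic property (presence of quotes, spelling of the token), which is exactly the coupling at issue.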

The discussion has drifted rather far afield from the original and pressing question, however, which was "[With respect to triple quote syntax,] should we seek maximum consistency with other usage of identical syntactical constructs, despite the imposition of unnecessary technical baggage? Or should we produce a standard as simple and streamlined as possible, despite the potential for confusion and unorthodox behaviour?"  I would be happy for COMCIFS to issue broader guidance, as has been suggested, but I hope that decision will not be unduly delayed by a detour into minutiae such as the division and interplay between CIF 2.0 syntax and semantics.

In pursuit of broad rather than narrow guidance, therefore, I suggest a change in the terms of discussion.  Rather than syntax vs. semantics, it may be more useful to partition CIF into 'base' CIF 2.0, which all CIF 2.0 processors are expected to accept and interpret equivalently, and 'domain-level' CIF encompassing those aspects of CIF semantics and convention that are defined via the dictionary system.  The base contains CIF syntax and the common semantics, whereas domain-level CIF adds ontology, constraints, controlled vocabulary, etc.  The key distinction between these layers is, of course, which features "all CIF 2.0 processors are expected to accept and interpret equivalently."  It is fitting that this dovetails with some of the technical arguments about the triple quote syntax.  Base CIF is, I think, equivalent to "the common syntax and semantics of the CIF language" in Herbert's latest proposed principles.

On that basis, I offer this re-couching of the proposed design principles:

Principles guiding development of Base CIF 2.0


CIF is a framework for exchanging and archiving scientific
data, featuring a human-readable, machine-parseable, electronic format
designed to serve as an exchange and archive medium.  "Base" CIF
comprises the definitions and constraints that underlie CIF and apply
to all CIF files; those aspects defining the CIF file format are
documented in the CIF Syntax specification and the CIF Common Semantic
Features specification.

Base CIF aims to remain as simple as possible by delegating
considerations such as ontology, vocabulary, data relationships,
and complex and rich data types to domain dictionaries and the DDL
formalisms by which those dictionaries are defined.  In the following,
the phrase 'domain level' refers to such documents (though it is not
anticipated that DDLs will be domain-specific).  Definitions
and constraints at domain level apply to a particular CIF file only
as declared by that file or as required by a particular CIF processor
in a particular context.


The design of base CIF 2.0 is guided by these principles:

1. A feature should be added to or changed in base CIF only if all of
 the following are satisfied:

 (i) Implementation of the desired behaviour by changes at the
 domain level is not feasible, or else such changes, while feasible,
 would significantly reduce human readability;
 (ii) the change provides significant new functionality that is widely
 applicable to most scientific domains;
 (iii) reliable transfer and archiving of data is not compromised;
 (iv) there is no simpler way of achieving the desired behaviour;
 (v) it has been shown possible to implement the change at a cost
 commensurate with its benefits, as demonstrated in part by a rough
 consensus and running code.

2. As long as the requirements in (1) are satisfied, base CIF should:
 (i) behave in a way that is consistent with common usage
 (ii) align with pre-existing standards where those standards provide
 the required behaviour. CIF 1.1 can be considered a pre-existing
 standard for CIF 2.0 in this context.

3. Non-technical issues should be dealt with in non-technical arenas.

4. Draft changes to base CIF will be made available on the IUCr website
 for public comment for a period of at least 6 weeks, following which
 COMCIFS voting members, after consideration of any objections raised,
 can vote to accept the change. A change will be accepted if 3/4 of
 COMCIFS voting members approve it.

Best Regards,


John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer:  www.stjude.org/emaildisclaimer
