Discussion List Archives

RE: Advice on COMCIFS policy regarding compatibility of CIF syntax with other domains...

On Wednesday, March 16, 2011 7:35 AM, Herbert J. Bernstein wrote:

>   I am glad we are getting closer, but now please consider what you have written, and what it really means in practical terms:
>> Preamble: The CIF syntax describes a human-readable, syntactic
>> container for scientific data.
>The word syntactic is misplaced here, and the "human-readable" constraint was lost years ago with the creation of mmCIF.  As we have just agreed, the semantics is an important part of the language.  Also, in practice, one of the most important contributions of CIF to our science has been the controlled vocabulary it has provided, independent of the form of expression:  tag-value, XML, HDF5, etc.  In addition, for the PDB, the important issue is _not_ the human readability, but the preservation of all the essential information of a scientific experiment, and, if you glance through some Acta C entries, you will see that even for small molecules, the days of human readable CIFs are far behind us.  When we make a change, we need to bear all of that in mind.

Inasmuch as human readability has proven too subjective a criterion to guide the DDLm group on such technical details as it has lately considered, I am happy for that criterion to be rejected.  At the same time, I disagree that human readability as a desired and often achieved result is a lost cause for CIF.  Some people can and do successfully edit CIF by hand, which is possible only because of CIF's human readability.  Maintaining that capability was held by Herbert, me, and others as an important general principle directing some of the CIF 2 work performed by the DDLm group.  If that principle is rejected outright then much of the work to date on CIF 2.0 details will have been influenced by a false premise.  In particular, the DDLm group's compromise recommendation about character encoding makes less sense if human readability is not an important factor.

Herbert writes:

>I would recommend starting with a clearer expression of what CIF is:
>CIF is a language for the management of scientific data.  It combines a controlled vocabulary with a simple, human-readable form of expression (the CIF syntax) backed by rules clarifying the meaning of the language (the CIF semantics).  The overarching goal of CIF is to ensure that the data of the relevant domains can be generated, transformed, transmitted and archived in ways that facilitate doing the science involved in ways that both serve the individual scientific domains and ensure that different domains can share information reliably.

I take that as a proposed insertion at the beginning of the preamble.  It describes the overall CIF system, from data model through syntax and common semantics, up to and including dictionaries.  Although the document's intended audience (COMCIFS and the DDLm group) already has a firm grasp of that information, it would not be harmful to include it.

Herbert quotes and comments on James's latest text:

>> CIF syntax aims to be as simple as
>> possible.  The domain dictionaries are the primary location of
>> semantic information in the Crystallographic Information Framework.
>> In the following, the phrase 'dictionary level' refers either to the
>> domain dictionaries, the DDL language in which the domain dictionaries
>> are written, or the CIF2 common semantic features specification which
>> imposes minimum requirements on the semantics specified by
>> dictionaries and DDLs.
>Given that much modified goals, this next paragraph becomes an inappropriate strait jacket, misallocating responsibilities.

James's paragraph is not at odds with Herbert's description of the broader CIF picture.  It is simply a policy statement about the part of that picture on which our present work is focused (i.e. CIF syntax), followed by some definitions of terms.  As part of the definitions, it acknowledges the presence and global scope of a separate common semantic features specification.  Whether the policy statement ("CIF *syntax* aims to be as simple as possible" (emphasis added)) appropriately allocates responsibilities is the crux of the current debate.  A COMCIFS decision will be required here to settle the question at this level, and there will perhaps be a similar policy decision to be made about the common semantic features.

Herbert continues,

>  I would suggest we return to what the real practice has been:
>The CIF language tries for an appropriate balance between simplicity and sufficient expressive ability to meet the needs of the scientific domains involved, and changes to the existing syntax and common semantics should only be made for good reason.  If it is possible to make a needed change by simply defining a new term in the controlled vocabulary, in one of the domain dictionaries, then that option should be considered first, especially because the controlled vocabulary is used in other forms of expression, such as XML and HDF5.  This is what we will call a change "at the dictionary level".  However, there are times, e.g. with the introduction of a new dictionary definition language, when changes are needed in the common syntax and semantics that apply to all domains.

Even if Herbert's characterization of the historic scope of dictionary-level changes were accepted, that would not imply that continuing such a policy must be the best choice.  However, such a limited characterization ("defining a new term in the controlled vocabulary") does not seem to capture historic practice and intent.  For example, DDL2 provides support for much finer-grained data types than CIF 1.1 natively provides, and mmCIF indeed defines such data types, independent of any particular data name.  Consider also the CIF 1.1 "Common Semantic Features" document:

Paragraph 25, speaking about the character markup conventions used in CIF 1, says "The specification is silent on which fields should be interpreted according to these markup conventions, but the published examples suggest that they may be used in any character field in a CIF data file except as prohibited by a dictionary directive. It is intended that the next CIF version specification shall formally declare where such markup may be used."  Thus, documented CIF 1.1 principles allow dictionaries to control which markup conventions apply to the values of defined items.

Paragraph 37 says "If it is necessary to convey more complex typographic information than is permitted by these special character codes and conventions, the entire text field should be of a richer content type allowing detailed typographic markup."  Thus CIF 1.1 supposes that special semantic rules may be defined -- presumably in a dictionary -- for the values of certain items.

Overall, it looks like the real practice with CIF 1.1 has indeed been to favor simplicity and stability of the syntax and, to a lesser degree, of the common semantics, delegating considerable control to the dictionary system.  COMCIFS has no obligation to continue that policy, and I urge you to decide based on a consideration of policy goals.  Maintaining consistent policy is only one possible goal among several non-exclusive ones.  Among other possible goals are easing the transition to CIF 2.0, some particular desired degree of backwards compatibility, encouraging development of CIF 2.0 software, and maintaining CIF 2.0's generality and domain-independence.



John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital

Email Disclaimer:  www.stjude.org/emaildisclaimer