(83) Discussion mode; BNF; mmCIF extensions from EBI

To: COMCIFS@iucr
Subject: (83) Discussion mode; BNF; mmCIF extensions from EBI
Dear Colleagues

I welcome John Faber of the International Center for Diffraction Data (ICDD)
to our distribution list. John will be helping ICDD to develop techniques
for managing and archiving data in pdCIF format in due course.

This mailing contains a comment on the current mode of discussion (D83.1),
which I've put first because of its brevity; a summary of, and some comments
on, the BNF discussion (D82.1); and some comments on the proposed mmCIF
extensions (properly numbered D82.3, and not D89.3 as it appeared in the
last mailing!). Contributors are: David Brown (D>), Peter Murray-Rust (P>),
Herbert Bernstein (H>) and Nick Spadaccini (N>).

D83.1 Discussion group versus moderated digest
----------------------------------------------
D> 	I notice that the Comcifs deliberations are turning into a
D> discussion group which is contrary to our traditional mode of operation. 
D> However, it may be that the committee would prefer to operate in this mode
D> and the time may have come to transfer our business to a formal discussion
D> group of the kind you have set up for the coreDMG.  Otherwise we need to
D> remind members that postings to the whole group should normally be
D> arranged through the secretary so we can keep some semblance of order in
D> the discussions. 

Comments are welcomed. I think the coreDMG group is working pretty well,
though it does require discipline from the members in posting on specific
threads. The structured archive is generally felt to be useful, and when
time permits I'll try to build in some full-text indexing and searching
facilities on the archive. I certainly appreciate not having to spend time
assembling and editing mailings. 

I also think the discussions on BNF, which were carried out in
discussion-group mode and which I reprint below for archival completeness,
advanced much more rapidly than they would have through the traditional
fortnightly or monthly mailings.

It might be useful, though, if it is generally accepted that the Chairman or
list owner may intervene to direct the course of a discussion that is
wandering off-topic, or to terminate a particular line of discourse. In
theory I could always say "This correspondence is now closed" in the
traditional COMCIFS mailings; though in practice most threads of
correspondence expired after a finite lifetime.

==============================================================
Existing discussion threads
---------------------------

D82.1 BNF description of STAR/CIF/dictionary CIF format
-------------------------------------------------------
With the exception of a few comments of mine at the end about SGML (that may
safely be skipped!) the following discussion is a conflation of the various
mailings that have used the COMCIFS email address as a convenient
mail-expander.

If I read the tone of the discussion correctly, the bottom line is that
although a formal specification encompassing STAR and its CIF and DDL
sub-variants would be very nice, we don't yet feel up to that; so Nick will
write four BNFs (for STAR, CIF data files, DDL1 and DDL2 dictionaries)
that we can post on the web and include in IT Vol. G.  Right :-) ?

> The "Backus-Naur Form" is a formal description of the syntax of a computer
> language. ... I invite a proposal for a CIF form, and also for the data
> dictionary format (question from a computer science viewpoint - are one or
> two different BNF specifications required for DDL1 and DDL2 dictionaries?
> The DDL1 dictionaries conform fully to the CIF syntax; DDL2 dictionaries
> also permit save frames).

N> To be complete the BNFs should be written for CIF, DDL1 and DDL2. In the
N> case of the latter two, the BNF can be quite specific since the syntax
N> of the allowed datanames is known.

P> I have been working with CIFs recently and found that this problem - of not
P> knowing the precise nature of various 'CIFs' - creates enough uncertainty
P> to make writing code a non-relaxing exercise.  It is critical in my
P> opinion that the actual syntax(es) are precisely identified, as
P> otherwise implementations will vary according to people's perceptions
P> of what is current (this is a common problem with evolving standards).
P> 
P> My understanding is that there are 3 syntaxes:
P> 	Full STAR (not used in crystallographic applications??)
P> 	CIF (which is used for DDL1.0, dictionaries that conform to DDL1.0 and
P> 		all data files)
P> 	save-enhanced-CIF (not full STAR, but a superset of CIF. Used in mmCIF
P> dictionaries)
P> 
P> If I'm wrong on this, then it shows how it is possible to make mistakes
P> :-). Assuming there are 3 syntaxes, then there must be three formal
P> specifications. It may be possible to combine them into one, with
P> different 'top-level' points. Without this, it isn't
P> formally possible to create interchangeable documents.

N> I make it 4 syntaxes. STAR and CIF are distinct, though the latter is a
N> subset of the former. DDL1 and DDL2 are not identical, bar the introduction
N> of the save_frame.  DDL1's grammar is CIFish but is a subset yet again.
N> There you can restrict the grammar further because the data names can be
N> specified as terminal classes, i.e. rather than
N> 
N> <DATA_NAME>                 ::= _ <non_blank_char>+
N> 
N> followed by the other necessary definitions, the DDL1 can be
N> 
N> <DATA_NAME>                 ::=  _category |
N>                                  _name     |
N>                                  _type     |
N>                                  .....     |
N> 
N> Similarly for DDL2. They are 4 distinct formal specifications and should
N> remain so. The hope that we create one, and each of the subsets has
N> different entry points is not a good way to go. I know of no formal
N> mechanism to do this in a BNF but I would think it would tend to 
N> confuse things, rather than clarify - god knows we need more of the latter
N> and less of the former!
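
The distinction Nick draws can be sketched as two recognisers for data
names: a generic one, and a restricted one in which the permitted names are
enumerated. This is only an illustration in Python. The three DDL1 names are
the ones quoted above; the rest of the list is elided in the original and is
not filled in here. The function and constant names are hypothetical.

```python
import re

# Generic rule: <DATA_NAME> ::= _ <non_blank_char>+
GENERIC_DATA_NAME = re.compile(r"_\S+")

# DDL1-style rule: data names form an enumerated terminal class.
# Only the names quoted in the discussion are listed; the real set is larger.
DDL1_DATA_NAMES = frozenset({"_category", "_name", "_type"})


def is_generic_data_name(token: str) -> bool:
    """True if the token is an underscore followed by non-blank characters."""
    return GENERIC_DATA_NAME.fullmatch(token) is not None


def is_ddl1_data_name(token: str) -> bool:
    """True only if the token is one of the enumerated names."""
    return token in DDL1_DATA_NAMES
```

With the generic rule any well-formed name is accepted and its meaning is
left to a dictionary; with the enumerated rule an unknown name is already a
syntax error.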

P> I have spent the last year helping with the development of XML, the 
P> emerging markup language of the WWW. Many of the problems that CIF faces are
P> generic to structured documents and are often more serious and more difficult
P> than are thought at first sight. I hope my comments do not sound didactic, 
P> but they are the distillation of many document experts during this process.
P> 
P>> 3). Comments. There are two possible ways of treating comments. The
P>> characters from the initial '#' to the last character on the line
P>> (inclusive) can simply be thrown away during lexical analysis.
P>> (This is the treatment implied by the 1994 JCICS paper.)
P>> However, [sometimes it is]
P>> useful to retain comments in their original places in the output file, as
P>> far as possible.
P> 
P> This is a critical point and subtler than appears. We spent a lot of time
P> in XML discussing whether information of this sort should be retained
P> during the passage of a document through a system. Ultimately you either
P> have to make some clear decisions about what information is thrown away
P> or you have to implement something similar to the W3C's
P> DocumentObjectModel (DOM), www.w3.org
P> 
P> For example, are the following CIFs equivalent?
P> 
P> # file 1
P> data_cell
P> # non-standard setting!
P> 	_cell_length_b 12.34
P> 	_cell_length_a 11.34
P> 	_cell_length_c 10.00
P> 
P> # file 2
P> data_cell
P> 	_cell_length_a 11.34
P> 	_cell_length_b 12.34
P> 	_cell_length_c 10.00
P> 
P> I may be rusty on the documentation but my understanding is that 
P> *the order of components in a CIF was meaningless* 
P> rather than the more constraining 
P> *CIF components can be in any order*
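
One way to make Peter's question concrete is to read both files into an
unordered structure and compare the results. The following is a minimal
sketch in Python, assuming comments are thrown away and that the files
contain only data_ headers and single-line name/value pairs; quoted strings,
loop_ and text fields are deliberately not handled, and parse_simple_cif is
a hypothetical name.

```python
def parse_simple_cif(text: str) -> dict:
    """Parse a very restricted CIF: data_ headers and one 'name value' pair per line.

    Comments (from '#' to end of line) and blank lines are discarded.
    Quoted strings, loop_ and multi-line text fields are NOT handled.
    """
    blocks = {}
    current = None
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # throw the comment away
        if not line:
            continue
        if line.lower().startswith("data_"):
            current = blocks.setdefault(line[5:], {})
        elif line.startswith("_") and current is not None:
            name, value = line.split(None, 1)
            current[name] = value.strip()
    return blocks


FILE_1 = """# file 1
data_cell
# non-standard setting!
\t_cell_length_b 12.34
\t_cell_length_a 11.34
\t_cell_length_c 10.00
"""

FILE_2 = """# file 2
data_cell
\t_cell_length_a 11.34
\t_cell_length_b 12.34
\t_cell_length_c 10.00
"""

# Dictionaries compare equal regardless of the order in which items were added.
same_content = parse_simple_cif(FILE_1) == parse_simple_cif(FILE_2)
```

Under this reading the two files are equivalent, and both the item order and
the "non-standard setting!" remark are lost. A tool that has to preserve
either would need a richer model than a dictionary of values.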

N> I for one am interested in you elaborating on this fine distinction, Peter.
N> You have posed an interesting view, but those of us who know you, know
N> your thinking and ours are often at a tangent.

P> I have assumed that comments are throwaways (on the assumption
P> that if something matters it should be an explicit item).  However I expect
P> that some people will get upset if their comments disappear.

N> I share your viewpoint; that is why one shouldn't hide explicit data in
N> comments, nor should one hide explicit data in datanames, as was the
N> case several years ago with the embedding of units of a dataitem in
N> its dataname.

P> In general it is very time consuming and error-prone to preserve the
P> lexical state of an input (i.e. all the whitespace, what types of
P> quotes are used, etc.). This is, however, important for the authoring
P> (or more precisely the editing) process. For that reason writing a
P> CIF editor which takes in general CIF input is not trivial.

N> Peter K[eller]'s view was that comments should be part of the formal spec
N> in case other people had opinions different from the prevailing forces.
N> Retention of the lexical state is messy but if there are those who wish
N> to attempt it, they should have the specification at hand.

P>> 4). Whitespace. This is a potential minefield for precise definition,
P>> since which whitespace characters are allowed at particular places in a
P>> STAR file (or indeed any text file or stream) is system and protocol
P>> dependent.
P> 
P> On XML we have probably spent 1000 mail messages on the topic of
P> whitespace. It is one of the most difficult areas of all (acknowledged
P> by Knuth). If people are expecting whitespace to be preserved, have
P> a common semantic meaning, etc. they will be disappointed.
P> 
P> [...]
P>> Note, though, that the DDL version 2.x, and mmCIF, restrict the characters
P>> which may appear in data items (check the contents of the ITEM_TYPE_LIST
P>> category in the dictionary). The above example only became legal in a
P>> macromolecular CIF at version 0.7.28 of mmCIF (1995-10-6), and then only
P>> for certain types of data.
P> 
P> Before reading Brian's reply I found it difficult to find a precise
P> definition of the allowed syntax here...
P> 
P> [...]
P>> 
P>> 8). Loops. It is a requirement of STAR that a loop structure must have
P>> complete loop packets
P> 
P> This is essentially a validity constraint, rather than a syntactic one and
P> is quite difficult to express other than in prose. It will be important
P> to define what validity constraints there are in CIFs - and what action
P> should be taken on encountering a violation. Again, the XML group
P> spent > 1000 messages on how to handle errors - it is not trivial.
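
Although the complete-packet rule is awkward to state in a grammar, it is
easy to check once a loop has been read. A minimal sketch in Python follows,
assuming a single-level loop (nested STAR loops are not covered) and
treating a loop with no values at all as incomplete, which is an assumption
rather than something stated above. The function name is hypothetical.

```python
def loop_is_complete(names: list, values: list) -> bool:
    """Return True if the values fill a whole number of packets.

    A packet holds one value per data name, so for a single-level loop the
    value count must be a non-zero multiple of the name count.
    """
    if not names:
        return False
    return len(values) > 0 and len(values) % len(names) == 0
```

What a program should do when the check fails (reject the file, drop the
partial packet, warn and continue) is exactly the error-handling question
Peter raises, and the sketch does not answer it.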


N> Issues concerning semantics and validity are not specified in a BNF, only
N> the syntax is. Peter makes the very valid point that such issues are
N> usually expressed in prose, because it is difficult to write or understand
N> anything more formal, like denotational semantics for instance.
N> 
N> Restrictions on structure, length of lexemes etc are very difficult to
N> formally specify, though I see Herb (coming up) suggests we do this in the
N> DDL and dictionaries.

H> At the risk of confusing a more than sufficiently complex discussion, I
H> would like to suggest a somewhat different view than the one put forward by
H> Peter Murray-Rust.
H> 
H> I believe we would all be better served by adhering to the view that the
H> formal syntax of all the files under discussion is defined by the STAR
H> syntax, and that each of the particular uses of STAR (CIF,  mmCIF, the NMR
H> format, etc.) is to be a further specification based upon a DDL and one or
H> more dictionaries.  This is parallel to the approach used with the various
H> SGML languages, such as HTML and XML.  It allows for rigor without the
H> confusion of multiple syntaxes.

N> The DDL and associated dictionaries were always for the purpose of providing
N> additional information about the DATA contained within a DATA file. Trying
N> to formally specify semantics and rules of validation in the 
N> DDL/dictionaries is going to be difficult and not what it was intended for.
N> I don't know how Herb intends to handle this formal specification, but I
N> suspect it will end up being (for all intents and purposes) prose by some
N> other name.
N> 
N> Herb's reference to it being done this way in the SGML extensions, is
N> interesting and maybe Peter can elaborate (since he is intimate with
N> things *ML). My understanding of the DTDs (SGML's equivalent dictionaries)
N> was that they specified the allowed syntax, how many arguments a tag
N> could have etc. I didn't realise you could specify structure.


P> I understand what Herb is proposing and in one sense it is the 'right'
P> thing to do in that it is the most formally correct. When I developed
P> tkCIF some years ago I took this 3-level approach:
P>     - first parse the DDL (there was only one then :-)
P>     - then parse the relevant dictionary *and* validate it against the DDL.
P>       There was only one dictionary format then (no save_). This was quite
P>       a lot of work because it threw up a number of very tricky semantic
P>       concerns which I don't believe were ever fully resolved. A major
P>       example is the use of defaults - should they be automatically
P>       expanded into the file.
P>     - then parse the data file and validate/expand against the dictionary.
P>       Again non-trivial.
P> What I want to emphasise is that this is a lot of work and throws up
P> semantic concerns that are problematic. It is extremely unstable against
P> a moving target which is why I dropped it. It also requires a serious
P> commitment to the project. This might be possible if there was a
P> community-wide agreement to do this and to create a single fully robust
P> set of communally available tools - in other words the tools define the
P> semantics where there are problems.
P>  
P> However it disenfranchises those people who want to write simple
P> parsers for particular purposes. One problem of complexity is that it
P> limits the number of people who can be involved. I value the global
P> communal approach to software development in the crystallographic world
P> and worry that increases in sophistication will limit implementors
P> to a very small group.
P> 
P> [This has been a real concern in the XML world where there is a natural
P> tension between those who want it to be able to do everything and those
P> who want the 'anyone can play' approach. I have championed the latter,
P> and do so here :-). XML has the image of the 'desperate perl hacker'
P> - i.e. someone who gets an XML file and wants to write an application
P> in an afternoon. I think we *have* to value the CIF-DPH as well - the
P> person who gets a CIF and wishes to extract the molecular information.
P> For example, we do not have a CIF interface to RasMOL - the most widely
P> used molecular viewer. I spent some time 2-3 years ago with Roger [Sayle,
P> the author of RasMOL] looking at how we could interface to mmCIF,
P> but the target moved too much at that stage :-)]

H> For this view to work, then we need to add the restrictions imposed by CIF
H> on STAR to the DDL's and the dictionaries, i.e. we need to add the 80
H> column restriction, the restriction to loop levels, and handling of
H> save_frames, as properly parsable features of the DDLs, by adding tags
H> which control these features, and then placing the appropriate values into
H> the dictionaries.

P> My personal feeling is that this places too much work on a developer. It
P> implies they have to write a full STAR parser/API. The parsing isn't
P> the problem - it's how to manage the results of the parsing [we've
P> been through this in XML and managed to create early versions of 
P> APIs for both event stream models and structured documents [DOM] - STAR 
P> will probably only need the latter, but it isn't trivial and probably needs
P> much of the W3C's DOM model.]

H> With this minor addition to the DDL's, a core CIF would be a file
H> conforming to STAR syntax, written using the core dictionary and
H> (optionally) additional dictionaries conforming to the DDL1 definition, and
H> an mmCIF would be a file conforming to the STAR syntax, written using the
H> mmCIF dictionary and (optionally) additional dictionaries conforming to the
H> DDL2 definition.  It would make relaxation of these limitations a "simple"
H> dictionary issue, rather than taking us into yet another syntax variation.

P> There is a difference between the simplicity in defining this in prose in
P> the dictionary and implementing it algorithmically. The attraction of CIF
P> (as I see it) is that so long as the BNF is clearly defined (and that is
P> what I was asking for) it is relatively easy to write a parser for it.
P> The validation (e.g. loop_ value counts) will have to be put in
P> explicitly, but it's not a big deal. My CIF parser is 2-3 pages
P> of Java. Extending it to STAR and requiring a DDL to control the parsing
P> would take it to 50+.

N>HB> For this view to work, we need to add the restrictions imposed by CIF
N>HB> on STAR to the DDL's and the dictionaries, ...
N> 
N> All this and then ....
N> 
N>HB> With this minor addition to the DDL's, a core CIF would be a file
N>               ^^^^^
N> I love the way you make such throw-away statements.
N> 
N>HB> conforming to STAR syntax, written using the core dictionary and
N>HB> (optionally) additional dictionaries conforming to the DDL1 definition,
N>HB> and an mmCIF would be a file conforming to the STAR syntax, written using
N>HB> the mmCIF dictionary and (optionally) additional dictionaries conforming
N>HB> to the DDL2 definition....
N> 
N> I suspect we will find this a much bigger can of worms than we will want to
N> deal with in the future. The syntax can be easily and separately specified.
N> As for semantics and pragmatics I take my lead from almost all programming
N> languages ... has anybody read the formal specification for the semantics
N> of C, Fortran ...? We all know what is supposed to happen because we have all
N> read the prose, but has anybody ever seen a formal specification of the
N> semantics (much less understood it!)?

H> There seems to be a need for clarification of the role of DTD's in SGML.
H> I think the creation of multiple, overlapping, "official" BNFs for STAR
H> and its children, rather than some system which preserves the
H> relationships among the members of the family would be an invitation to
H> unintended misinterpretations and divergence.  The use of DTDs in SGML has
H> reduced such problems for that community.  I believe we have a similar
H> need.  Explanatory text is always useful, but as much as possible, the 
H> data for parsers should come from parsable data.  Here is an extract from
H> SGML Open's view of the role of DTD's:
H>  
H> The following information comes from "SGML Open - Information on SGML",
H> written by the staff and members of SGML Open, the SGML vendor consortium.
H>  
H> "Structure and presentation
H>  
H> "Documents comprise three types of information: data, structure, and 
H> format.
H>     "* The data in a document may include text, graphics, images and even
H>        multimedia objects such as video and sound. The data may also
H>        include information that does not itself appear on the printed
H>        page. For example, a particular graphic of a machine component may
H>        have hidden data about what class of machinery uses it, what the
H>        tolerances of the component are, and who manufactures it.
H>     "* The structure of a document refers to the relationship among the
H>        data elements. For example, in this document there are
H>        subheadings, paragraphs, and bullet lists. These are each elements
H>        of the structure of the document. In a parts catalog, there might
H>        also be part numbers, product illustrations, and inventory
H>        numbers, some of which might be hidden elements.
H>     "* The format of a document is its appearance. In this document,
H>        subheadings are printed in boldface text flush against the left
H>        margin. They might also be italicized and centered without
H>        affecting the structure of the document.
H>        
H>   "SGML recognizes that data, structure, and format are separable
H>    elements. It preserves the data and structure, but does not specify
H>    the format of the document -- recognizing that format should be
H>    optimized to user requirements at the time of delivery.
H> 
H> "...   
H> 
H> "Document Type Definition (DTD)
H> 
H>   "An SGML document has an associated document type definition (DTD) that
H>    specifies the rules for the structure of the document; for example, a
H>    DTD might specify that the document must have a chapter title and
H>    cannot have any part numbers that are not immediately followed by a
H>    paragraph describing the part. Several industries have standardized on
H>    various DTDs for the different types of documents that they share.
H>    ..."

I have a couple of comments about SGML, based on our experience of it within
an academic publishing house. First, SGML is a meta-language, permitting a
very rich description of document structure and content (and even, pace the
statement quoted above by Herbert, format, though this latter purpose is an
abuse of the underlying intent). The author of a particular SGML
application has great freedom - in principle - to change even such things as
the tag delimiting characters: the familiar <tag> notation with angle
brackets is, strictly, a mutable convention. However, angle brackets are
described in a reference concrete syntax, and this syntax is almost
universally applied. The reason is that it is very, very difficult to build
robust tools that can meet the demands of the full SGML specification, and
coping with different syntactic notations is an unwelcome overhead. Even for
implementations using the reference concrete syntax, it has taken nearly a
couple of decades to produce robust and affordable authoring and editing
tools. Even with the tools, it is taking a very long time for publishers to
work consistently and correctly to the specs. And even where this is
achieved, the data that are tagged form a minimal subset of the information
that might be tagged - there are rich descriptions of the names, addresses,
titles and affiliations of each author of a paper, but rarely any
description of the information in the body of the paper (other than "a
collection of paragraphs, perhaps with mathematics, within sections").
I hasten to add that there are noble exceptions, such as Peter Murray-Rust's
Chemical Markup Language (CML), but it's generally true to say that the take
up of SGML within the publishing industry has been slower than its more
optimistic proponents would have wished.

In brief, richness brings complexity and complexity means complications. CIF
has always walked a tightrope between the simple and the complex, and I
would second Peter's view that there is much to be gained by leaning towards
the level of complexity that allows access to the community as a whole.



D82.2 "Magic number" to identify CIF format internally
------------------------------------------------------
P> Having now dealt with a number of CIFs I re-iterate the importance of
P> identifying the file type, especially for dictionaries. 



D82.3 mmCIF dictionary extension
--------------------------------
D> First some general comments
D> ---------------------------
D> 1. Since we have adopted the term 'standard uncertainty' rather than
D> 'estimated standard deviation', shouldn't our data names now reflect this
D> new policy by replacing _esd by _su in newly defined names?
D> 
D> 2. In our previous dictionary definitions, we have carefully avoided
D> building in references to the requirements of specific programs, or
D> recognising private dictionaries.  This proposal contains a large number
D> of aliases to the EBI dictionary which is not a Comcifs approved
D> dictionary.  According to my understanding, these aliases should not be in
D> a Comcifs dictionary, rather there should be a way of pointing to the
D> mmCIF from the EBI dictionary.  My concern is that if we start including
D> such aliases other private dictionaries will want to be included and there
D> will be no end to the number of aliases that we must include.  The problem
D> with this route is that it will encourage the development of local
D> dialects of cif, something that we have been most anxious to avoid.  If an
D> item is defined in a cif, it should use a cif dataname.  Pseudocifs
D> written with a private dictionary should be read using the private
D> dictionary which could alias the cif datanames or point to included cif
D> dictionaries.  It should not be necessary for cif dictionaries to include
D> aliases to non-cif dictionary names. 

This reminds me that the proposals for maintaining dictionaries have not yet
been brought to COMCIFS for formal approval, and for that I apologise. I
shall try to assemble and present a final report in the near future. The
proposed extensions in this case are following the draft protocol that
tries to provide for a well-controlled transition from datanames developed
locally to the datanames in the public dictionaries. The idea is that
someone (in this case Kim Henrick for the European Bioinformatics Institute
(EBI)) has been working with CIFs that include data names defined for
local use and incorporating a registered prefix (look at
http://www.iucr.org/cgi-bin/reserve.pl to confirm that "ebi" has been duly
registered). Some of these local data names are felt to have more general
application to the community, and have been submitted for inclusion in the
next revision of the public mmCIF dictionary. The mmCIF dictionary
management group have drawn up a provisional dictionary in which the
definitions are given new data names appropriate for public use, but the
link to the original local designations is maintained through the "alias"
entries. These should disappear from the public dictionary upon final
approval. More details of the proposed scheme can be found in the
archive of discussions of the dictionary maintenance working group
http://www.iucr.org/cif/comcifs/wg1/

D> 	In the same context, the term _refine_ls_restr_type.U_sigma_wghts
D> appears to be an item that is only used by one specific program
D> (RESTRAIN).  If this item is of general interest it should be defined
D> without reference to a particular program.  I assume that this item is
D> included here because it is deemed to be of sufficient importance that it
D> might well be used by other programs, and this should be reflected in the
D> description.  The program RESTRAIN could be used as an example of where
D> and how it might be used.
D> 
D> Some smaller points and typos
D> -----------------------------
D> 1. 'Thermal parameters', 'thermal factors' and 'tempfactor' should be
D> replaced by 'atomic displacement parameters' or 'ADPs' to conform to IUCr
D> usage.
D> 
D> 2.  The reference to 'Cruickshank' in
D> _refine.overall_esu_R_cruickshanks_DPI and the following item is
D> inadequate to the point of being useless.  At the very least the author's
D> initials, and the date and place of the study weekend should be given if
D> the user is to have even a sporting chance of locating the reference. 
D> 
D> 3.  In save_refine.overall_fom_free_rset the _item_name is given
D> incorrectly as *_work_rset (also the alias).
D> 
D> 4.  The expression for 'drest' in _refine_analyze.RG_work_free_ratio is
D> missing a closing parenthesis.
D> 
D> 5.  In the example for the category refine_ls_restr_type there is a
D> dataname that includes '_ebi', presumably in error.  The same error is
D> found in the description of _refine_ls_restr_type.U_sigma_wghts.  The
D> abbreviation used for 'weights' elsewhere in mmCIF is 'wt', perhaps this
D> name should also conform to this convention, i.e. '*.U_sigma_wt' or
D> '*.U_sigma_wts'.
D> 
D> 6.  In _refine.correlation_coeff_Fo_to_Fc, are the expressions between {
D> and } in the denominator of the definition of R_corr correct?  It seems to
D> me that the 'sum' should be divided by the number of reflections in the
D> sum before <F>^2 is subtracted, i.e. that these expressions should be of
D> the form SQRT{<Fo^2> - <Fo>^2}.
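
The form David describes is that of the usual linear (Pearson) correlation
coefficient. Written out in LaTeX, on the assumption that this is the
quantity the dictionary intends, and with angle brackets denoting a mean
over the N reflections in the sum (the numerator is not quoted in the
comments above and is given here only as the standard one):

```latex
R_{\mathrm{corr}} =
  \frac{\langle F_o F_c \rangle - \langle F_o \rangle \langle F_c \rangle}
       {\sqrt{\langle F_o^2 \rangle - \langle F_o \rangle^2}\,
        \sqrt{\langle F_c^2 \rangle - \langle F_c \rangle^2}},
\qquad
\langle x \rangle = \frac{1}{N} \sum x
```

Each factor in the denominator then has the form SQRT{<F^2> - <F>^2}, so the
sums are divided by the number of reflections before the squared mean is
subtracted.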

Here are my own comments, prepared independently of David's remarks, so I
apologise if there is any duplication. I also had a list of minor typos
and stylistic changes, which I shall send separately to the mmCIF dictionary
maintenance group via Paula Fitzgerald, and, upon request, to anyone else
who might be interested.

------------------------------------------------------------------------------
save__phasing_MIR_der.power_acentric
The meaning of the following phrase is not clear:
"Phasing power is <FH / Lack_of_closure>." 
    - what is FH?
    - what is Lack_of_closure? If a general descriptive phrase, the
      underscores may be omitted (cf "Isomorphous difference" in
      save__phasing_MIR_der.R_cullis_acentric). If a single symbolic
      reference, it should be properly defined.
    - do the angle brackets have any meaning (e.g. "expectation value")?
      If not, should they be dropped altogether? Replaced by parentheses?
      Is the placement correct? (i.e. not <FH>/<Lack_of_closure>?)

------------------------------------------------------------------------------
save__phasing_MIR_der.R_cullis_acentric
Same comments as above regarding "Lack_of_closure" and angle brackets.

The meaning of this sentence is not clear:
"NB:  This is tabulated for acentric and anomalous terms,
      extending the former definition."
    - what is the former definition? Presumably it is the equation

                           sum| |Fph~obs~ +/- Fp~obs~| - Fh~calc~ |
               R~cullis~ = ----------------------------------------
                                   sum|Fph~obs~ - Fp~obs~|

      given in save__phasing_MIR_der_shell.R_cullis, in which case
      reference should be made to the location of this definition.
    - is there a reference to the tabulation for acentric and anomalous
      terms?
    - should the literature reference to the paper of Cullis et al.:
               Ref: Cullis, A. F., Muirhead, H., Perutz, M. F., Rossmann, M. G.
                    & North, A. C. T. (1961). Proc. R. Soc. A265, 15-38.
      appear in any case?
    - i.e. is the quantity still properly named as a "Cullis R factor"?
Here is a suggested alternative definition:
    _item_description.description
;              Residual factor R~cullis~ for acentric reflections in this
               derivative.

               The Cullis R factor is the ratio of the lack of closure and
               isomorphous difference terms. It is defined in terms of an
               analytical formula for centric reflections, but must be
               extracted or interpolated for acentric and anomalous
               reflections from the tabulation of XXXXX XXXXX XXXXX XXXX.

               Ref: Cullis, A. F., Muirhead, H., Perutz, M. F., Rossmann, M. G.
                    & North, A. C. T. (1961). Proc. R. Soc. A265, 15-38.
;

------------------------------------------------------------------------------
save__phasing_MIR_der.R_cullis_anomalous
The quantity is defined twice:
"Cullis R factor is <Lack_of_closure>/<Isomorphous difference>."
"Cullis  Rfactor is <Lack_of_closure>/<Anomalous difference>"
    - have two definitions been conflated?

I find the nomenclature confusing and not fully explained - what are FPHi(+)
and FPHi(-)? What is FHi"? The various Dano terms? RC(ano) should be
called something else - R~cullis~ or R~cullis~^ano^ perhaps?

Small point - "Sum" in the equation should be "sum" for consistency with
other such expressions.

Is the Cullis et al. literature reference appropriate here too?

------------------------------------------------------------------------------
save__phasing_MIR_shell.reflns_anomalous
Doesn't exist - should it? (i.e. there are _centric and _acentric
definitions, but no _anomalous.)

------------------------------------------------------------------------------
REFLN_SYS_ABS
Is this entire category not a candidate for the Core dictionary?

------------------------------------------------------------------------------
save__refln_sys_abs.index_l
Typo: "Miller index h of the reflection." should of course be "Miller index l
of the reflection."

------------------------------------------------------------------------------
save__refine.overall_ESU_B
"Overall estimated standard uncertainties" should be "Overall standard
uncertainty". Presumably this is the quantity sigma_B (the equation gives
(sigma_B)^2).

------------------------------------------------------------------------------
save__refine.overall_ESU_ML
Seems identical to the sigma_B definition apart from the numerical factor
(3/8 versus 8). Is this correct?

------------------------------------------------------------------------------
save__refine.overall_ESU_R_Cruickshanks_DPI
Better named as "_refine.overall_ESU_R_Cruickshank_DPI" (i.e. without the s
at the end of Cruickshank)?

What does DPI stand for?

------------------------------------------------------------------------------
save__refine.overall_ESU_Rfree
Better named as "_refine.overall_ESU_R_free" (i.e. with an _ after the R)?

------------------------------------------------------------------------------
save__refine.overall_FOM_free_Rset
Better named as "_refine.overall_FOM_free_R_set" (i.e. with an _ after the R)?

_item.name is incorrectly given as '_refine.overall_FOM_work_Rset' (and
likewise _item_aliases.alias_name as '_refine.ebi_overall_FOM_work_Rset').

------------------------------------------------------------------------------
save__refine.overall_FOM_work_Rset
Better named as "_refine.overall_FOM_work_R_set" (i.e. with an _ after the R)?

------------------------------------------------------------------------------
save__refine_analyze.RG_d_res_high
refers to "__refine_analyze.ls_RG_free" - typo? Also in 
save__refine_analyze.RG_d_res_low.

------------------------------------------------------------------------------
save__refine_analyze.RG_work_free_ratio
would be better as "_refine_analyze.RG_free_work_ratio" (to match the
definition as the free/work ratio).

------------------------------------------------------------------------------
save__refine_funct_minimized.numterms
would be better as "save__refine_funct_minimized.number_terms"

------------------------------------------------------------------------------
save__refine_ls_restr.type
The new RESTRAIN labels are verbose - not necessarily a problem, though
there may be an implication to a casual user that any old sentence or phrase
could go in here. 

------------------------------------------------------------------------------
save__refine_ls_restr.U_sigma_wghts
Should we go for the extra two letters of .U_sigma_weights for clarity?

"The expected r.m.s. differences in thermal parameter, either Uiso or Uaniso,
are listed for each shell in  _refine_ls_restr.ebi_rmsdev_dictionary."
    - _refine_ls_restr.ebi_rmsdev_dictionary is not defined in this batch of
      data names
    - is it a generally useful data name or does it have application only to
      the RESTRAIN program? 
    - the same question might indeed be asked of the .U_sigma_weights
      data name itself.
    - the name .ebi_rmsdev_dictionary presumably refers to a tabulation of
      values that are to be regarded as a set of standards characterising
      the particular structure under investigation. This use of the term
      "dictionary" (also used elsewhere in protein structural science
      for standard tabulations, if I am not mistaken) is unfortunate
      within the nomenclature of CIF and its associated data dictionaries.
      Is there a suitable synonym acceptable to the macromolecular community?

"...in both cases, WU is the value stored in _refine_ls_restr.U_sigma_wghts."
    - _refine_ls_restr.U_sigma_wghts (or _weights) is not defined: is this a
      typo for _refine_ls_restr_type.U_sigma_wghts? Likewise the definition
      begins with a reference to refine_ls_restr.ebi_U_sigma_wghts - should
      this be _refine_ls_restr_type.ebi_U_sigma_wghts (note also the missing
      initial underscore)?
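
Cross-references of this kind could be checked mechanically. The following is
a rough Python sketch, under the assumptions that each definition opens with
a "save__category.item" line and that descriptions sit in CIF text fields
delimited by a semicolon in the first column. The function name and the
sample text are hypothetical, constructed for illustration and not quoted
from the EBI dictionary.

```python
import re

# A definition is assumed to open with "save_" followed by the data name.
SAVE_FRAME = re.compile(
    r"^save_(_[A-Za-z0-9_]+\.[A-Za-z0-9_]+)\s*$", re.MULTILINE
)
# CIF text fields start and end with ";" as the first character of a line.
TEXT_FIELD = re.compile(r"^;(.*?)^;", re.MULTILINE | re.DOTALL)
# A data name in running text: "_category.item", not part of a longer word.
DATA_NAME = re.compile(r"(?<![A-Za-z0-9_])(_[A-Za-z][A-Za-z0-9_]*\.[A-Za-z0-9_]+)")


def undefined_references(dictionary_text):
    """List data names mentioned in text fields that have no save frame."""
    # Data names are compared without regard to case.
    defined = {name.lower() for name in SAVE_FRAME.findall(dictionary_text)}
    referenced = set()
    for field in TEXT_FIELD.findall(dictionary_text):
        referenced.update(DATA_NAME.findall(field))
    return sorted(n for n in referenced if n.lower() not in defined)


# Hypothetical fragment: one frame, whose description cites two names.
sample = """
save__refine_ls_restr_type.U_sigma_wghts
    _item_description.description
;   WU is the value stored in _refine_ls_restr.U_sigma_wghts, see also
    _refine_ls_restr_type.U_sigma_wghts.
;
save_
"""

missing = undefined_references(sample)
print(missing)
```

Only names inside text fields are examined, so DDL tags such as
_item_description.description are not reported. Names defined in another
dictionary would still be flagged and would need to be checked by hand.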
------------------------------------------------------------------------------


With good wishes to all who are planning an Easter vacation,

Regards
Brian