(47) Extended Core: last round before Seattle

To: [email protected]
Subject: (47) Extended Core: last round before Seattle
From: bm
Date: Mon, 5 Aug 1996 14:53:45 +0100

Dear Colleagues

I had hoped for final approval of this version of the dictionary by the
time of the Seattle meeting. We haven't quite made it, but we're very
close, and I expect the last few remaining queries to be ironed out in the
next few weeks. There should be opportunities to discuss any queries or
problems during the Seattle Congress.

The current mailing contains extensive contributions from Syd Hall ("S>"
below), and some contributions from Gotzon Madariaga (G), Nick Spadaccini
(N), David Brown (D) and David Watkin (DW). The only change of substance
that I have implemented is the redefinition of _refine_ls_number_reflns
according to Syd in D45.23 below - please check that you approve this.

I now consider most of these discussions closed. A summary of issues still
unresolved will come round soon after Seattle.

D40.1 F(000) - oops
-------------------
Somehow David Watkin's comment got mangled in transmission. He comments
also:

DW> Phi(h) and (-h) are the same, though not necessarily 0 and 180
DW> even for centro structures.
DW> 
DW> Another thing I intended to mention as a virtue for [this calculated
DW> value of F(000)] is that it offers some information about the
DW> form-factors which were used. Since these are also used in the
DW> structure analysis, this information might be useful. The information
DW> content of the [conventional formulation of] F000 is almost zero.


D44.1 _atom_site_aniso_B_*
--------------------------
S> I would oppose any move to drop "B" items as acceptable
S> data (and of course they must remain in future dictionaries
S> to maintain back-compatibility). The cif dictionary is
S> not an efficient instrument to effect changes in nomenclature
S> and I have already pointed this out to Sidney Abrahams.
S> What his commission has to do is make sure that all software
S> packages use only U's...and Acta will help with this. In the
S> meantime we should list these only as alternate definitions.

D44.2 _atom_site_disorder_*
---------------------------
S> I was involved in trying to accommodate John Davies' treatment
S> of disorder....it is a bit "shelx"y but I think it should work.
S> Perhaps all that was missing was a good example and the one 
S> you have given should do the job. Best ask a novice though!
S> 
S> If a description needs to be very detailed, perhaps the original
S> codes are a mite obtuse. Does anyone have a better scheme?

John Davies has been away for a while, but has promised to review the
existing descriptions and comment.

D44.3 _atom_site_scat_versus_stol_list
--------------------------------------
S> This is a kluge for those cases where very special sf tables
S> are needed. I have never seen it used but I can imagine cases
S> where it may be. My vote is not to restrict the layout of this
S> text field. As Nick would say, this is an application not
S> a definition matter. I think David's layout would work in
S> most cases but perhaps someone might need to also put in the 
S> dispersion terms....and so on.

S> Here I side with David's argument. Gotzon's suggestions assume a
S> uniform step size and this is often not the case. I would counter
S> the machine argument (tongue placed firmly in the left cheek) with
S> the belief that "intelligent" software can parse "anything", the
S> intelligence of those chips is hardly worth a cracker! Roll on
S> the nested loops.

Not in this revision cycle!

D44.4 _audit_conform_[] examples
--------------------------------
S> Personally I was in favour of this example [of mixed MIF and CIF
S> dictionaries] ... but I am biased.
S> I think that the objective of mixing items from different
S> dictionaries is an important one...and perhaps the syntax 
S> argument should not really be used here. After all many
S> practitioners of this black art think that cifs will 
S> ultimately acquire the more sophisticated airs of their
S> mif cousins....and perhaps this will be an encouragement!

Well, I agreed with David's request to pull out this example until we have
a working demonstration. But I hope to discuss it with the CCDC people,
who have some requirement to mix CIF and MIF terms, and see if they can
explore such applications.

D44.5 _fax and _phone numbers
-----------------------------
S> Here I put on my pedantic hat and say that there was 
S> considerable debate about phone formats and the agreed
S> upon style  <country>(<area>)<local>x<extension> was 
S> decided on. I think that the examples should conform to
S> this.
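
For checking examples against the agreed style, a minimal Python sketch
follows. The pattern is only one strict reading of
<country>(<area>)<local>x<extension>; the digit limits, the optional
extension, the tolerance of spaces or hyphens in the local number, and the
names PHONE_RE and parse_phone are all assumptions made for illustration,
not part of any dictionary definition.

```python
import re

# Assumed reading of the style <country>(<area>)<local>x<extension>:
# a numeric country code, an area code in parentheses, a local number that
# may contain spaces or hyphens, and an optional extension after "x".
PHONE_RE = re.compile(
    r"^(?P<country>\d{1,3})"
    r"\((?P<area>\d+)\)"
    r"(?P<local>\d[\d \-]*\d)"
    r"(?:x(?P<extension>\d+))?$"
)


def parse_phone(value):
    """Return the components as a dict, or None if the value does not match."""
    match = PHONE_RE.match(value.strip())
    return match.groupdict() if match else None
```

A value in some other layout (for instance one with a leading "+") is
reported as None, which is how a non-conforming example would be caught.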

D44.6  Expanded CELL category
-----------------------------
S> The original intention of categories was to formalise the
S> groupings of data names, AND to identify items that should 
S> appear in the same loop structures. The reasons why CELL and
S> CELL_MEASUREMENT items were separated do not spring to mind
S> and as far as I can see there would be no reason why the 
S> two can't be merged.

G> In the new CELL category there is no possibility of handling several
G> cells within the same data block. Should a cautious "_list both" be
G> added to these items? As you mentioned, it would imply a new pointer
G> to cell_measurement_refln.

This would indeed be the correct formalism. What we need to decide is
whether in practice different cells will be described in the same data
block, or whether it would make sense to have each different cell
described in a different data block, if the inter-block relationships
could be made explicit through the new _audit_linked_ conventions. David
has already suggested that he can envisage reasons for keeping them
together in a single data block. I shall hold off implementing anything
until I get a better sense of general agreement or disagreement with that
position.

D44.10, 11 DIFFRN... categories
-------------------------------
S> The definitions that concern me in particular are...
S>     _diffrn_measurement_details   and
S>     _diffrn_measurement_device_details 
S> These are not precise enough for use in many situations,
S> not the least of which is the description of area detector
S> data (something of immediate importance to Acta C). For
S> submission to C (and to B and D later) cifs will need to
S> contain details about specific measurement parameters.
S> So a number of other items will need to be added for 
S> this purpose...these items could be retained but they
S> should relate only to "special details".... the original 
S> intent of such loose items is that they contain or
S> describe only "unusual"/"unpredictable" data or general
S> descriptions.

S> Perhaps following the Seattle meeting there will be a better set of
S> diffraction measurement items. I really have not had a chance to formulate
S> what these might be yet but will try to do so before the
S> Acta C closed meeting. For submission purposes I suspect that
S> this level of specificity is essential....for Section D as well.

D44.12 _diffrn_orientation_matrix_*
-----------------------------------
S> I concur that *_type is needed because I suspect (but have no
S> real knowledge) that each manufacturer has a different 
S> definition. This should be investigated.

For now, I have added to the example the entry
    _diffrn_orient_matrix_type             'TEXSAN convention (MSC, 1989)'
and hope to persuade the TEXSAN people at Seattle to specify in future
releases of their software just what that convention is.


D44.13 schizophrenia
--------------------
S> TWO D44.13s

Well, at least somebody's awake.

D44.16 New definitions for *_symmetry entries
---------------------------------------------
S> Yep, agree with better definitions.

D45.1 Schedule for implementation of _type_construct
----------------------------------------------------
S> I concur with Brian M.'s response. What's missing from the _construct_
S> story so far is TIME. The stop-press news is that Nick and I 
S> appear to be within a whisker of getting a fellowship for someone
S> to work on the software etc. to implement some of these constructs.
S> They are in our view an essential aspect of future dictionaries.
S> 
S> I did once argue that a:z as a range made as much sense as -1:5 but 
S> was (quite rightly) put in my place by the pedanticists. Clearly
S> this is a function of the construct definition.

D45.2 Version of DDL
--------------------
S> Brian M.'s response is appropriate. The issue is more complicated 
S> than it might seem. DDL2 is conceptually more difficult but
S> potentially more powerful....Nick Spadaccini begs to differ on the
S> latter because he favours the OO approach over the relational
S> one and thinks that the latter will get us into longer term
S> and perhaps irreversible problems. The key is to let both 
S> developments continue to the point that applications show
S> that one is clearly superior to the other. The analogy between 
S> data definition and computing languages is appropriate..though
S> cross compatibility (compiling) may be easier for the DDL.

I recommend the text
     C. J. Date
     An Introduction to Database Systems  6th edition
     Addison Wesley               ISBN: 0-201-82458-2
as a very informative and comprehensive description of the relational (and
other) data models. I am still working my way through it, and might bring
it along on the plane to Seattle for some light reading. On the other
hand, it might fall foul of the excess baggage rules...

D45.3 _list both
----------------
S> I think that it was a mistake to have allowed "both" for _list.
S> If it has only been used for *_xyz I would be tempted to grep
S> the existing cifs for P1 structures and convert them all to 
S> _loop structures.

Hmmm. The fact remains that the _list both behaviour is sanctioned in the
original core dictionary, and so must be accommodated. Acta is not the
only source of CIFs (as people keep reminding me).

D45.6 Fragmentation of address fields
-------------------------------------
S> Yep, Brian M.'s argument was the original one...and it still holds.
S> Anyone who has been involved in setting up a database of 
S> addresses knows that a free format approach is the way to fly!
S> Perhaps the only exception is *_country, and even with this
S> there are problems in construct!

D45.7 _audit_link_block_code
----------------------------
G> There are some problems with _block_id. On one hand, _block_id should
G> be defined in a very open way (as _block.id in mmCIF) leaving the
G> possibility of further restrictions to the extensions (pd, ms, etc...).
G> However the implementation of such restrictions is not very clear.
G> Think for example that Brian Toby wanted to put his algorithm for
G> _block_id in his PD dictionary. He would be redefining
G> the item and therefore modifying the core definition. A different approach
G> would be to define different items _pd_block_id, _ms_block_id,
G> _symm_block_id, etc...
G> But in this case _audit_link_block_code should be defined as "The name of 
G> a data block (i.e. the value associated to a (_pd,_ms,_symm)_block_id 
G> item) in the current file...". I have no idea of how this last approach
G> could be formulated. I would be very happy if blocks were considered
G> in the core but I think that _audit_link_block_code should not point
G> to the string of a data block declaration. I would like to hear
G> B. Toby's opinions about this subject.

I'm glad this came up again, for I forgot to say in my last posting that I
wasn't happy about David's proposal in (46)D45.7 that the value of a
_block_id should be open to change when files were copied, because that
seems to me no better than using the data block code (an arbitrary and
potentially mutable string) for that purpose.

What I suggest is that we revert to the original intention for
_audit_link_block_code as a code listing related data block names.

Here is my reasoning. The datablock name ('xyz' in data_xyz) must be
unique within any file. If files with identical datablock names are
concatenated, the merging mechanism must resolve the name clash to ensure
the syntactic integrity of the result. It also has the responsibility of
updating datablock names referenced in _audit_link_block_code fields to
ensure the *semantic* integrity of the result. This allows applications
to define clusters of related blocks within a file - the application must
be able to keep track of what those clusters mean (perhaps assisted by the
information in the _audit_link_block_description field); but this is a
very general mechanism that can be utilised with care by any application.
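
The responsibilities of such a merging mechanism can be sketched in Python.
This is an illustration under stated assumptions, not an existing tool: each
file is modelled as a dict mapping datablock name to a block, each block
carries a "links" list standing for its _audit_link_block_code values, and
the function name merge_files and the "_2", "_3" suffix scheme are
hypothetical.

```python
def merge_files(first, second):
    """Concatenate two files (dicts of block name -> block) without clashes.

    Each block is a dict whose "links" list holds its
    _audit_link_block_code values. Blocks from `second` whose names clash
    with `first` are renamed, and link codes inside `second` are rewritten
    to follow the renames.
    """
    merged = {name: {**block, "links": list(block["links"])}
              for name, block in first.items()}
    taken = set(first) | set(second)  # names a renamed block must avoid
    renames = {}
    for name in second:
        if name in first:
            n = 2
            while f"{name}_{n}" in taken:
                n += 1
            renames[name] = f"{name}_{n}"
            taken.add(renames[name])
    for name, block in second.items():
        # Semantic integrity: links follow the renamed blocks.
        links = [renames.get(code, code) for code in block["links"]]
        merged[renames.get(name, name)] = {**block, "links": links}
    return merged
```

The renaming step preserves syntactic integrity (unique names), and the
rewriting of the link lists preserves the semantic integrity of the
clusters. What the clusters mean remains the application's business.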

Brian T.'s _pd_dataset_id has a more global objective - to define a unique
label for a set of data relating to a specific experiment. It is indeed
possible to envisage the same _pd_dataset_id occurring in more than one
data block if the result file were partitioned into different logical
groupings of data.

I see merit in having an application-general mechanism for stating links
between data blocks, and an application-specific protocol for identifying
one or more data blocks in accordance with the requirements of the
relevant discipline.

Let me try out one more example to show what I'm getting at. Suppose Acta
Cryst. has an archive file like this:

data_aaaa_text
 loop_ _audit_link_block_code _audit_link_block_description
   .             'Text of paper comparing powder and single-crystal studies'
   aaaa_pow      'Powder data for phase 1 of brianite'
   aaaa_xtal     'Single-crystal data for phase 1 of brianite'

data_aaaa_pow
   _pd_dataset_id   Bri1|NSLS|A.N.Other|96-02-29|08:45
 loop_ _audit_link_block_code _audit_link_block_description
   aaaa_text     'Text of paper comparing powder and single-crystal studies'
   .             'Powder data for phase 1 of brianite'
   aaaa_xtal     'Single-crystal data for phase 1 of brianite'

data_aaaa_xtal
 loop_ _audit_link_block_code _audit_link_block_description
   aaaa_text     'Text of paper comparing powder and single-crystal studies'
   aaaa_pow      'Powder data for phase 1 of brianite'
   .             'Single-crystal data for phase 1 of brianite'

Among the files that the author used to collate the contents of his paper
was one relating to a day's work on the various phases of brianite:

data_phase_1
   _pd_dataset_id   Bri1|NSLS|A.N.Other|96-02-29|08:45
data_phase_2
   _pd_dataset_id   Bri2|NSLS|A.N.Other|96-02-29|09:55

Conceivably also, Acta may have stored only the derived results in its
'aaaa_pow' datablock; the raw data may be placed in a datablock
data_bbbb_pow of a supplementary data file, but that data block will also
include the unique _pd_dataset_id value of Bri1|NSLS|A.N.Other|96-02-29|08:45
that identifies the originating experiment.
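
To make the role of the experiment-level label concrete, here is a small
Python sketch that indexes data blocks by their _pd_dataset_id, whichever
file they came from. The in-memory layout (a dict of block code to data
items) and the function name blocks_by_dataset are assumptions for the
purpose of illustration.

```python
from collections import defaultdict


def blocks_by_dataset(blocks):
    """Group block codes by the _pd_dataset_id value they carry, if any."""
    index = defaultdict(list)
    for code, items in blocks.items():
        dataset_id = items.get("_pd_dataset_id")
        if dataset_id is not None:
            index[dataset_id].append(code)
    return dict(index)


# Blocks gathered from the archive file, a supplementary file and the
# author's working file, using the identifiers from the example.
blocks = {
    "aaaa_text": {},
    "aaaa_pow": {"_pd_dataset_id": "Bri1|NSLS|A.N.Other|96-02-29|08:45"},
    "bbbb_pow": {"_pd_dataset_id": "Bri1|NSLS|A.N.Other|96-02-29|08:45"},
    "phase_2": {"_pd_dataset_id": "Bri2|NSLS|A.N.Other|96-02-29|09:55"},
}
index = blocks_by_dataset(blocks)
```

Here the two blocks sharing the Bri1 identifier are recognised as coming
from the same experiment even though their block names differ and no
_audit_link_block_code entry connects them.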

Both types of identification are needed.

Syd's comments on this reflect my view (though with the caveat that in
many ways the CIF definitions do specify the 'rules' for the discipline of
crystallography):

S> If data block links have to be used, I favour no rules...this
S> is an application and discipline matter.

D45.8 Inheritance across data blocks
------------------------------------
Gotzon pointed out the formal difficulty of wishing to validate some
data value against a related item (a parent, say) listed in a different
data block. This is not possible as a STAR property, but the way remains
open to permit it as a convention in some crystallographic application(s):

N> What is it exactly that is wanted here? I can see it being a CIF issue,
N> by which I mean what you require is application dependent, but I don't
N> see it as a STAR issue. You see the Crystallographic community can
N> interpret the data in a CIF file any way it likes, and therefore the data
N> can actually be an instruction to the application about what to do. eg.
N> 
N> _some_CIF_data "http://www.syd.is.working.his.butt.off" # he says ...
N> 
N> Now the CIF application/community can choose to dereference that URL
N> and do something with it at access time. BUT THIS HAS NOTHING TO DO
N> WITH THE DICTIONARY aka STAR.
N> 
N> I think many people have difficulty differentiating between the DDL
N> dictionary and the CIF dictionary. The DDL is generic and covers ALL
N> current and future discipline data dictionaries. The scoping rules have
N> been very thoroughly worked out, and since moving to computer science
N> I am amazed at how right we got them.
N> 
N> Scope is one of the single most important features of a well defined
N> language. A data item with a link_parent makes sense only if the
N> link_parent data item is within its scope. I really can't see how it
N> would work if the child data item is in a block, and the parent item is
N> in a saveframe! There could be any number of parent items matching across
N> several saveframes that are defined within the same data block. Now you
N> want to do this across data blocks themselves! Fine if it is your
N> application that is going to do it, but for it to be a feature of the
N> DDL is messy and contravenes good language design. In the dictionary
N> definition of the data item you can define the link-parent data name,
N> but how are you going to enumerate the block to look in (and you have
N> to do that in the dictionary remember), unless there is a data item in
N> every data file which does the enumeration. Let me give you an analogy.
N> Suppose in C you required every compiler (generic and equivalent to the
N> starbase parser), every time you want to compile your C code, to read the
N> first X lines of your program, where you re-define the scoping rules of
N> extern, static and local declarations.

S> I share Nick's concern about this...it's a direct consequence
S> of not permitting save frames in cif. I am as much a pragmatist
S> as the next programmer (and some of my more theoretical colleagues
S> use this label as an insult) but implying relationships outside of 
S> a data block via the dictionary is a no-no as far as I am concerned 
S> because it has long term and unpredictable consequences. E.g. what
S> happens if only one data block is extracted from a file....the 
S> relationship will be lost....and this is certain to be a common
S> occurrence. This is why data blocks are expected to be standalone
S> objects. I wonder if Gotzon was thinking in terms of global_ data?

I followed this up with a further round of correspondence with Nick (what I
wrote is prefaced just by ">", Nick's response by "N>"):

> I've gone back to look at what Gotzon's done, to see why he might have raised
> this issue. I think it's straightforward enough; he has
> 
> data_COMMON_INFO
>      loop_ _atom_type_symbol   C  S  O  N   H
> 
> data_REFERENCE_STRUCTURE
>      loop_ _atom_site_label _atom_site_type_symbol _atom_site_fract_x _. _..
>        S1    S     0.1234  ...   ...
>  
> data_MODULATED_STRUCTURE
>      loop_ _atom_site_label _atom_site_type_symbol _atom_site_fract_x _. _..
>        S1    S     0.2345  ...   ...
> 
> and he doesn't want to repeat the common _atom_type_symbol (the parent)
> values in every block. (This is a trivial example - there's no real reason
> not to in this case, but you can imagine it being extended to more
> substantial items of information.) He wants to use the fact that there is a
> _list_link_parent/child relation in the dictionary for these items to
> validate them across these various blocks.

N> You know to really protect the integrity of your data it makes sense to
N> repeat it in the blocks. This added redundancy could be very useful in
N> cases where a "cock-up" occurs and data is lost.

> Two questions: the definition for _list_link_parent is "Identifies a data
> item by name which must have a value which matches that of the defined item,
> and which must be present in the same data block as the defined item". Does
> "must be present..." mean "must be present (if at all)...", i.e. is a data
> block without the parent "invalid" against the dictionary (or just in some
> sense incomplete)?

N> No, the reading should be: must be present in the sense that if child
N> data needs the parent data then the link has to be present. A link
N> becomes vital ONLY when access requires the link to be de-referenced.

> Second, if it IS permitted for the parent to be absent from the same data
> block but (at the discretion of the application) present in some other data
> block, do you see it as workable for the application to locate that other
> data block and validate the parent/child relationship by following block
> codes listed in the _audit_link_block_code ? (I mean potentially - working
> code isn't essential yet :-)

N> It's an application thing. I can see no reason why it can't be done that
N> way, but what about global_ for that file? It means all these blocks
N> have to be kept in the same file. (and while we're at it, why
N> not have saveframes and call CIF, STAR? ......).
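
As a rough illustration of the application-level lookup discussed in this
exchange, the following Python sketch validates child values against a
parent item found either in the same block or in a block named in
_audit_link_block_code. It is a convention an application might adopt, not
a STAR or DDL property. The data layout (block code -> data name -> list of
values), both function names, and the presence of link entries in the
blocks are assumptions; only one hop of links is followed.

```python
def find_parent_values(blocks, start, parent_name):
    """Look for parent_name in the start block, then in blocks it links to.

    `blocks` maps block code -> dict of data name -> list of values.
    Returns (block_code, values) for the first block holding the parent
    item, or (None, []) if neither the start block nor a linked block does.
    """
    links = blocks[start].get("_audit_link_block_code", [])
    # "." denotes the current block in the link list, so skip it.
    for code in [start] + [c for c in links if c != "." and c != start]:
        values = blocks.get(code, {}).get(parent_name)
        if values is not None:
            return code, values
    return None, []


def unmatched_children(blocks, start, child_name, parent_name):
    """Return child values in `start` that have no matching parent value."""
    _, parents = find_parent_values(blocks, start, parent_name)
    return [v for v in blocks[start].get(child_name, []) if v not in parents]
```

If a single block is extracted from the file, the lookup finds no parent
and every child value is reported as unmatched, which is exactly the loss
of relationship that Syd warns about.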

It is clear that there is pressure from several directions to explore the
transition to a full STAR implementation. Again, I feel the time isn't quite
right for this: we should wrap up this revision of the core, and get at
least the powder and mmCIF dictionaries launched. But a useful next stage in
development will be an investigation of broadening the syntax rules to allow
more (or all) of the currently forbidden STAR constructs. Perhaps Syd's
fellowship, if it materialises, can find time to address some of these
issues.

D45.9 _chemical_conn_atom_type_symbol
-------------------------------------
S> Bit concerned at Gotzon's use of the phrase "declared previously".
S> I hope he meant "declared" as there is no precedence requirement
S> for data in a cif.

I'm sure he meant "declared".

D45.12 _diffrn_orient_refln_angle_
----------------------------------
S> Additional angles are OK.

D45.14 _diffrn_reflns_number
----------------------------
S> I guess there is no reason why this definition should not state
S> "due to translational symmetry in the crystal unit cell" but
S> not that many data collection routines do include these and it
S> might worry a lot of users. I don't feel strongly about it.

D45.16 _diffrn_standard_decay_%
-------------------------------
S> I am not sure that Gotzon's addition would help 99.99% of the
S> users....more likely to worry them. Perhaps a mite too much
S> hair splitting...though I have nothing personally against the
S> addition.

D45.18/28 *_symmetry links to _symmetry_equiv_pos_id
----------------------------------------------------
S> I strongly oppose making _symmetry_equiv_pos_id mandatory and
S> therefore I see no need for these relationships. Perhaps this
S> rationale will change when the construct expressions are 
S> properly developed....clearly the _site_symmetry terms will
S> need to refer to component items. But again these need not
S> be mandatory.

D45.23 _refine_ls_number_reflns
-------------------------------
S> Gotzon is correct. Strictly speaking this description is not
S> appropriate for present applications. This item distinguishes
S> between the actual number of unique reflections used in the
S> ls process and the _reflns_number_total (unique diffraction
S> points after averaging) and the _diffrn_reflns_number (the
S> total number of diffraction points measured). All three of
S> these numbers are likely to be different, and usually are.
S> The definition should better read...
S> 
S>    The number of unique reflections contributing to the 
S>    least-squares refinement calculation.

I have changed the existing definition to this. OK with everyone?
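
The distinction between the three counts can be illustrated with a short
Python sketch. It assumes that each measurement has already been assigned
to its symmetry-unique index, and it leaves the criterion for inclusion in
the refinement to a caller-supplied predicate; the function name
reflection_counts and the sample cutoff are hypothetical.

```python
from collections import defaultdict


def reflection_counts(measurements, include):
    """Count measured, unique and refined reflections.

    measurements: list of (unique_hkl, intensity) pairs, where unique_hkl
    is assumed to be the symmetry-unique index of the measured point.
    include: predicate on the averaged intensity deciding whether a unique
    reflection contributes to the least-squares refinement.
    """
    groups = defaultdict(list)
    for hkl, intensity in measurements:
        groups[hkl].append(intensity)
    averaged = {hkl: sum(v) / len(v) for hkl, v in groups.items()}
    return {
        "_diffrn_reflns_number": len(measurements),
        "_reflns_number_total": len(averaged),
        "_refine_ls_number_reflns": sum(
            1 for mean in averaged.values() if include(mean)),
    }


measurements = [
    ((1, 0, 0), 10.0), ((1, 0, 0), 12.0),
    ((0, 1, 0), 0.5),
    ((0, 0, 2), 30.0), ((0, 0, 2), 28.0),
]
counts = reflection_counts(measurements, include=lambda mean: mean > 1.0)
```

With these five measurements of three unique reflections, one of which
falls below the illustrative cutoff, all three counts differ.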

D45.27 Need for new symmetry categories
---------------------------------------
S> Perhaps the converse argument might be put. Since when does a
S> standard in widespread use "collide" with a new proposal?

My only comment here is that the mechanism for implementing a
comprehensive description of symmetry will be high on our list of
items for informal discussion at Seattle.

D45.28 _symmetry_equiv_pos_id
-----------------------------
S> I would be happy about sequence "number", but NOT with it
S> being a mandatory item. It is important to stress that the
S> ORDER that xyz values are listed is TOTALLY dependent on 
S> the user and software....and just as the order of space
S> groups in IT has no special meaning, nor has the order of
S> the symops (OTHER than to their use in the *_symmetry_site
S> items).

D45.29  Definition of su
------------------------
In response to Gotzon's suggestion that the dictionary should contain a
note about standard uncertainties, rather than a series of remarks in many
definitions explaining their equivalence to e.s.d.'s, I asked him where
such a note might (formally) appear.

G> A direct answer to your direct question: I do not know! There is no way
G> of specifying such a remark in a dictionary unless one uses comments, and
G> comments are completely inappropriate. Then forget my suggestion.

Syd also acknowledges the awkwardness of the change of name:

S> Good question....and it highlights the difficulties that 
S> will be associated with the transition!


D> D46.1 _citation_journal_coden_CAS
D> ---------------------------------
D>         If the definition is correct the name of this item should be:
D> 
D>         _citation_abstract_coden_CAS
D> 
D> However, I suspect the name is OK and the definition is wrong.  It should 
D> match the definitions in the other _citation_journal_* fields.

I'm still waiting for clarification on the original intention from Paula.

D> D46.2 _units for angles
D> -----------------------
D>         These units should be given.  Fortran assumes all angles are in 
D> radians so there is a chance for confusion.

I've added the _units deg field to all relevant quantities.
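
The scope for confusion is easy to demonstrate. Python's math library, like
the Fortran intrinsics David mentions, takes angles in radians, so a value
recorded in degrees must be converted before use. The helper name
cos_cell_angle below is hypothetical.

```python
import math


def cos_cell_angle(angle_deg):
    """Cosine of a cell angle recorded in degrees, as _units deg implies."""
    return math.cos(math.radians(angle_deg))


correct = cos_cell_angle(90.0)  # effectively zero
wrong = math.cos(90.0)          # treats 90 as radians: about -0.448
```

An explicit units field lets reading software apply the conversion rather
than guess.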

---------------
This is, therefore, the last mailing before the Congress. To those of you
who are attending, may I wish you a pleasant and safe journey, and I look
forward to seeing you there.

Best wishes
Brian