Discussion List Archives

(33) Modulated structures dictionary, R factors, DDL2, ACA abstracts

  • To: COMCIFS@iucr.ac.uk
  • Subject: (33) Modulated structures dictionary, R factors, DDL2, ACA abstracts
  • From: bm
  • Date: Thu, 4 May 1995 16:35:25 +0100
Dear Colleagues

Contrary to some rumours, I am still alive and well. I'm taking advantage
of this mailing to include some replies to specific points that might not
have been intended for general distribution. I hope no-one minds; but I
have tried to restrict this questionable practice to items that are of
general interest!

An item of news regarding the future movements of one of our colleagues:

B>    Let me pass on an announcement of a more personal nature. I would
B> imagine that many of you know Ted Prince at the Reactor Radiation Division
B> of NIST.  Ted will be retiring this spring and I have been hired to take
B> his position. I will be leaving Air Products in mid-May and I will start
B> at NIST sometime in late May or early June. My new e-mail address is
B> Brian.Toby@NIST.GOV. 

Congratulations, Brian.

I have put some more items in the comcifs ftp directory, which people are
welcome to look at. There is a copy of the modulated structures draft and
example (see below for more details). Although I'm not yet sending this out,
you might find it helpful to obtain a copy to follow all the details in
the related discussions below.

There is source code for a little syntax checker for CIFs. It's in the file
syncif.c, and should compile reasonably cleanly with old (non-ANSI) C
compilers under Unix. I shall try to put a DOS executable there too in the
near future. This tries to check CIF syntax as rigorously as possible, and
does its best to recover from certain errors so that it can struggle on and
report others. It will very soon be implemented in our automated checkcif
service, and is intended as the first step in a larger vcif project to
validate CIFs against dictionaries. (But if you're already working on such a
thing, don't wait for me!) Reports on syncif would be very welcome.

There has been a lively discussion in the last few weeks on the PDB
discussion list about the merits (or otherwise) of CIF as perceived
within the mmCIF community. Many of you will have seen this, but if not,
and if you are interested, there is a copy in the ftp area as the file
discuss.pdb.


Continuing discussions
======================

(30)D28.2, D28.3 R Factors
--------------------------
 
B>    David's disdain for the _all suffix is appropriate. I fear that I
B> dislike it even in the case that he finds it acceptable:
B> _proc_ls_R_factor_all is an oxymoron as one never includes the
B> "unobserved" reflections where background exceeds the peak intensity as
B> one might in a F^2^ refinement. There is no need to add qualifiers such as
B> _all to any of the powder R-factors. For the profile, it will be clear if
B> a point is included from the weight. If a reflection is included in the
B> file, it should be included in the reflection R-factor computation. 
B>  
B>    I have not fully resolved all of the issues related to the definitions
B> related to this "reflection R-factor", which Rietveld'ers commonly call
B> R~B~ (B for Bragg) because the community uses three different definitions.
B> Probably the best definition from a statistical point of view for this is
B> the one that I have included in the dictionary, based on the sum of
B> |I~obs~ - I~calc~| as either _proc_ls_I_R_factor or _refine_ls_I_R_factor
B> (post #30 is internally inconsistent). 
B>  
B>    Another choice bases R on |F~obs~ - F~calc~|. This is the same
B> definition as used for _refine_ls_R_factor. It has one merit -- it can be
B> directly compared to a single crystal measurement result. 
B>  
B>    We are still missing the expression for an unweighted R based on the
B> sum of |F^2^~obs~ - F^2^~calc~|. This was used more back in the "good old
B> days"  when all data, including so called "unobserved" reflections, were
B> included in a refinement. It is still used in the neutron community.
B> Therefore I propose we add one more R-factor to our list:
B> _refine_ls_F2_R_factor. (Note that I's and F^2^ differ by the Lp factor so
B> this is not the same as _refine_ls_I_R_factor.)
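For concreteness, and by analogy with the |F| and |I| residuals already in the
dictionary, I take it that the definition Brian has in mind is something like

            sum | F^2^(obs) - F^2^(calc) |
       R =  ------------------------------
                  sum | F^2^(obs) |

with the sum taken over the reflections included in the refinement. (This is my
reconstruction, offered for discussion rather than as a settled definition.)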

Apropos of this general discussion, Paula has pointed out to me the
following definition in the current working draft of her dictionary:

save__refine.ls_R_factor_obs
    _item_description.description
;              Residual factor R for reflections that satisfy the resolution
               limits established by _refine.ls_d_res_high and
               _refine.ls_d_res_low, that were flagged as observed (see
               _reflns.observed_criterion), and that were included in the
               refinement.
                 
                   sum | F(obs) - F(calc) |
               R = ------------------------
                        sum | F(obs) |
 
               F(obs)  = the observed structure factor amplitudes
               F(calc) = the calculated structure factor amplitudes
 
               sum is taken over the specified reflection data
;

That is, she has taken the opportunity of merging the core and mmCIF
dictionaries in DDL2 to revise some definitions. Here the core definition for
R~obs~ has been expanded to include a reference to the resolution shells in
mmCIF work, but also includes explicitly the phrase "that were included in
the refinement". Is this a case where we are entitled to modify the
definition?

D30.2 The New DDL
-----------------
B>    The extended discussion between JW & Brian M. raises a question. Why do
B> links between data blocks violate STAR, as John mentions in 30.2? If so
B> how can data blocks refer to each other and remain CIF compliant? This is
B> a very important issue to me. Here is an example that may bring this point
B> home to the MM folks: If I measure data sets on the material at several
B> wavelengths and perhaps on different instruments, I must put each set of
B> measurements in a different data block. However, if I use all the datasets
B> for a single refinement, when I am done I have a structure that represents
B> a result from the composite of all the data blocks. How do I indicate the
B> relation between the "data" data blocks and the "structure" data block? 

I suppose the idea is that scoping rules are applied much as in the writing
of a computer program - a data name in a data block is treated like a local
variable defined within a subroutine. There aren't mechanisms in CIF for
formally extending that scope (though there are extensions (through global_)
in the larger STAR framework). But I tend to take a more pragmatic viewpoint:
conventions can be established where appropriate. You will remember the
lengthy correspondence we had previously on how we might change datablock
names when we send material to CCDC. Your block_id proposals allow a pointer
into a related datablock, and appear to me a legitimate way of indexing
related data, so long as conforming applications understand the way in which
to use this.
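To make this concrete, a "structure" block might carry pointers of the kind
Brian describes in something like the following way (a sketch only: the
datanames here are modelled loosely on the block_id proposals and are not
taken from any approved dictionary):

   # illustrative sketch - datanames invented, not from any approved dictionary
   data_KNO3_structure
   _pd_block_id               'KNO3|structure|950504'
   loop_
      _pd_block_id_dataset    # blocks containing the measured data sets
      'KNO3|dataset_A|950504'
      'KNO3|dataset_B|950504'

Conforming applications would then know to look for the data blocks whose own
block id matches each value in the loop.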

It has been pointed out to me, though, that it is possible to include data
from different experiments in a single data block, by introducing extra
datanames indexing the various cases; something like:

loop_ _pd_experiment_id
      _refln_index_h _refln_index_k _refln_index_l _refln_observed_status
      _refln_F_squared_calc _refln_F_squared_meas _refln_phase_calc
A    1   0  -1  o   1097.425   894.563   180.
A    1   0   1  o   2377.211   2340.638    0.
A    2   0   0  o   541.626   172.256    0.
B    1   1   0  o   6531.787   6817.781   180.
B    0   1   1  o   317.545   517.333    0.
B    0   0   2  o   1965.235   1996.33   180.

You could then have looped lists of experimental parameters for A, B and so
forth. You may well feel that for your purposes this is too complex (and I
would have some sympathy with that viewpoint).

B>    My initial reaction to the DDL2 proposal was pretty negative as it
B> makes writing a CIF dictionary much harder. I personally am not willing to
B> invest the time to make the PD dictionary DDL2 compliant. The complexity
B> of the DDL is at a level where developing code for CIF will be beyond the
B> resources of most of the programmers who are writing crystallographic code
B> (excluding perhaps the MM community, which can support much greater
B> investments in infrastructure). Having made these points, I have come to
B> the conclusion that the levels of abstraction in DDL2 must be necessary,
B> or they would not have been created and, with a few reservations, I
B> support the migration in this direction, but only after we have been provided
B> with good dictionary authoring tools and CIF parsers so that "the internal
B> structure does not matter" as David put it.  Given this trend, I support
B> JW's proposal for matrix and vector objects (30.5). If we have made the
B> jump to DDL2, we really don't need to worry about how FORTRAN programmers
B> will handle the equivalent of nested _loops.

However, I have pointed out that the current formalism for matrix and vector
objects is not STAR compliant. That needs to be fixed before we can really
start developing this idea. As I understand matters, Paula has agreed to keep
these constructions out of the current version of the mmCIF dictionary. To be
STAR compliant, matrices must either be string objects with internal
structure (e.g.  "1  0  0  0  1  0  1  1  0") or (potentially) nested loops;
and there are currently objections to both of these solutions.
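For reference, the two possibilities look roughly like this (the dataname is
invented, and the nested form is written from my recollection of the STAR
papers, so treat the details with caution):

   # (a) a single quoted string with internal structure
   _model_matrix    '1 0 0   0 1 0   0 0 1'

   # (b) a nested loop - legal STAR, but beyond current CIF syntax
   loop_
      _model_matrix_row_label
      loop_
         _model_matrix_element
      row1   1 0 0  stop_
      row2   0 1 0  stop_
      row3   0 0 1  stop_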

D30.4 Dot separator in alias names
----------------------------------
Two points of view here:

B>    Where I do have problems is with proposal 30.4. There are really two
B> actions arising from 30.4: 
B>  (A) Allow CIF items to be given names of form category_id.item_name. 
B>  (B) assign a new name in this form and an alias to the original name for 
B>      all entries in the existing dictionaries. 
B> I don't really have a problem with 30.4(A), but I think that 30.4(B)
B> should be the other way round: cell_length.a should be an alias for
B> cell_length_a.  To do otherwise redefines CIF wholesale. In the printed
B> dictionary, cell_length_a should come first, and cell_length.a can appear
B> in parenthesis. In the powder dictionary, I don't want these
B> category-based names to even appear as it will only create confusion. I
B> would also not want to see entries displayed grouped by category. The
B> rules of data normalization required very inconvenient assignment of
B> categories. The hierarchy used in naming makes more sense to the reader of
B> the printed document. 

P> ... aliases.  Here I also come down with the idea of doing things the
P> new way.  I am in favor of this partly because I think the dots really help
P> emphasize the hierarchy in some of the very heavily treed parts of the
P> data structure - you are right that it is easy to figure out what is category
P> and what is specific item when you are looking at the dictionary with the
P> _item_category (or whatever it is) DDL.  But when you are just looking at a
P> data file and trying to figure things out, the dots can really be helpful.
P> 
P> But I also want to stick with the aliasing because it allows us to repair
P> some inconsistencies in the core itself and between the core and the mm
P> extensions.
P> A simple case is that (due to name length restrictions) we were forced to
P> always use _details instead of _special_details where needed.  Using the
P> alias mechanism, I am now free to change all of the core items from 
P> _special_details to _details (bet you hadn't noticed that, had you?)  There
P> are a couple of other areas in which I used the aliases to change data names,
P> all for the goal of making the whole data structure more consistent.

As I see it, the proposal for aliases is designed to allow DDL2-savvy
applications to handle CIF-1 and CIF-2 files with equal ease. All purely
syntax-based tools (like QUASAR, starbase) will handle both file types
identically. Request lists for QUASAR (or map files for ciftex) will need to
be loaded with both forms of the dataname to allow extraction from CIF-1 or
CIF-2 files; that's probably nothing more than a nuisance (though with
perhaps 2000 datanames in total, not an insignificant nuisance!).
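(In schematic terms, an extraction request would simply have to carry both
spellings of each name it wants, e.g.

      _cell_length_a
      _cell.length_a
      _cell_angle_alpha
      _cell.angle_alpha

I have not reproduced an actual request-list format here, only the principle.)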

I have pointed out to Paula that there are some problems with the aliasing as
currently set up in the mmCIF/DDL2 dictionary - for instance, _cell.length_a
is aliased to _cell_length_a, but the latter may have an e.s.d. (sorry,
s.u.!) appended in parentheses, whereas the DDL2 representation has
_cell.length_a_esd for the corresponding quantity. 
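To spell out the mismatch with invented numbers:

   # core style: value and s.u. combined in one field
   _cell_length_a        10.234(3)

   # mmCIF/DDL2 style: value and s.u. carried as separate items
   _cell.length_a        10.234
   _cell.length_a_esd     0.003

so a simple one-to-one aliasing of names does not, by itself, tell an
application how to split or recombine the two representations.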


D32.1  CIF-style publication mechanisms
---------------------------------------
There were a few comments in response to George's posting on the problems of
using CIF-style files for posting abstracts to the ACA.

S> I initially had similar concerns, and it took some digging in 
S> Phil's WWW files to come up with the dictionary and the example submission.
S> I was very very surprised that the relevant data names and example submission
S> did not appear in the ACA registration and abstract handbook -- goodness 
S> knows how you would make a submission without access to the WWW files -- and
S> as I have bleated on past occasions, one cannot assume that most (or even a 
S> significant proportion) of authors have easy and familiar access to WWW 
S> facilities. Phil and Peter and others have made that oversight on previous
S> occasions!

It can be very difficult to make sure that the right information gets to all
the right people at the right time, and in the right way. In a private mail
on this subject, David Brown has outlined to me some of the organisational
problems at the ACA that beset the writing of this year's Call for Papers.
It's clear that there is great danger in trying to provide too many routes to
electronic publication.

My own experience in submitting my abstract was rather different. I left it
literally to the very last minute, but I was able to locate Phil's data entry
form on WWW and enter the abstract direct into this. In consequence, I think,
mine is the only abstract actually available for general perusal on the Web
at the time of writing. The merit of this approach is that one can simply
"fill in the boxes" and be guaranteed to generate a syntactically correct
CIF-style file without being troubled by the technical details of the file
format. Obviously this is the sort of approach that is well worth developing.
(I'm not sure whether I'm creating an "abstract information file" or just a
HTML representation when I fill in Phil's form; but presumably it could be
either - or both.)

Syd's cautionary note on making assumptions about the availability of network
tools is valid; but there comes a time when one must judge that most people
will have some idea on how to use these facilities. In Britain - even in
Britain - there are now dozens of magazines on Internet issues, several
television programmes, and frequent coverage in the national press (Peter
Murray-Rust was extensively quoted in a half-page article in The Independent
newspaper this week on the Online Protein Structure Course: fame at last,
Peter - can fortune be far behind?).

Howard Flack has had experience in assembling a set of conference proceedings
online:

H>  I am very well aware that what I write below will probably not be to the 
H> taste of many members of COMCIFS. For sure, the following is just my personal
H> opinion but having read the e-mail from George, I must say that I do not
H> regard the type of problem that he describes as being one of 'teething'.
H> 
H>   For the Aperiodic '94 conference (120 participants - 193 abstracts) held
H> in September, I handled the publication side of the abstracts. The submission
H> of abstracts by e-mail in 'ASCII' was encouraged (some came as Latex or Tex 
H> files), and for the rest, hard copy was deemed acceptable. Diskettes were not
H> allowed. All abstracts were subsequently marked up in HTML and shown up on
H> the WWW. An abstract book was printed directly through the WWW-hypertext
H> browser. There are a few more details on URL 
H>      http://www.unige.ch/crystal/aperiodic/abstract-book.html
H>
H>   The scanning and character interpretation of the hardcopy abstracts was a 
H> failure. It was quicker and surer to retype them. The programme committee
H> chairman of ACA '95 and I have been in very frequent contact since he became
H> aware of our activities of abstract handling for Aperiodic '94. 
H> 
H>   Frankly, CIF is not a universal standard suitable or attractive for being 
H> used to submit 100-word abstracts to a conference. CIF-electronic abstract 
H> submission was heavily pushed in the run-up to ACA '95 and as is evident from
H> George's e-mail (other correspondence that I have had confirms this) the 
H> complications of preparing a CIF abstract have made more than one person turn
H> to hard-copy submission.

Certainly, we should remember that CIF is a crystallographic data file
format, not a document preparation one. But it's flexible enough to allow
document preparation in well-structured cases, such as Acta 'C'. Remember
that it's also an interchange format - Acta translates CIFs into TeX, but it
could in principle translate them into WordPerfect, MS-Word, FrameMaker
format (which, amusingly, is called MIF!) or anything else. 

CIF works for Acta because we have by now invested several man-years of
experience in it. It could be made to work for IUCr Congress Abstracts if
similar effort were ploughed into it. There are benefits, as David outlined
in his article in the ACA Call for Papers booklet - it's easy to collect
titles, author lists, addresses and so forth. But handling all these items
requires an investment of time in developing the appropriate document-handling
tools, and one needs to weigh the rewards against the necessary investment.

H>   A conference participant wishes to be able to write and submit an abstract 
H> with an absolute minimum of administrative overhead at the last minute, and
H> to be able to read abstracts in advance. To do so increases the
H> attractiveness of the conference organisation and persuades him of
H> the usefulness of paying his registration fee. Last-minute writing means
H> that one's usual word processor has to be used, perhaps hacking sentences
H> out of another abstract that one wrote for the local chemistry, physics
H> or biology conference. Complicated procedures for 
H> submission, registration and payment are a discouragement to conference 
H> participation and the payment of a registration fee. 
H>   'The customer is always right'
H> 
H>  As a consequence it is clear that a conference organiser must offer flexible
H> abstract submission, registration and payment procedures. One has to be 
H> prepared to accept abstracts in electronic form in a variety of the most 
H> popular formats. Electronic submission provides the possibility of language
H> correction, easing of communication between members of the programme
H> committee and preconference viewing. 
H> 
H>   None of these desiderata are satisfied by the use of a single externally-
H> imposed complicated-to-use standard for abstract submission.
H> 
H> > G> They also
H> > G> highlight the difficulties that have to be solved before we allow CIF
H> > G> submission of abstracts for the IUCr Meeting in Seattle
H> 
H>   We - the choice is only one for the organisers of the Seattle meeting.
H> 
H> > G though no doubt Phil would be happy to edit them all again !
H> 
H>   Don't be deceived - you'll be editing almost all of the abstracts you 
H>   receive in any case. Corrections to the English, corrections to the
H>   spelling of names and cities, standardising addresses etc etc. 
H> 
H> > G> 10. Does WWW use LATEX?
H> 
H>    No, it uses HTML, a specific dictionary for the public domain SGML 
H>    ISO-standard.
H> 
H> > If not why not ?
H> 
H>    Latex is a type-setting language and knows about fonts and pages of a 
H>    determined size. LATEX does not do hypertext.
H> 
H>    HTML marks up the logical context of a text and graphic document i.e.
H>    end of paragraphs, levels of headings, mathematical variables, hypertext
H>    links to other places in other documents. HTML says nothing about fonts
H>    and sizes of paper or screens because one has no idea on what sort of
H>    equipment an HTML document will be shown. It is the browser software on
H>    the specific equipment which (according to user-preferences) displays the
H>    document to the best of its ability.

One feature common to CIF, HTML (and SGML in general) and LaTeX is that they
are all capable of labelling strings of text with tags that identify their
purpose and function. Thus it is (in principle) easy to translate
automatically between these various formats in a manner which doesn't lose too
much information. The same isn't true when it comes to translating a typical
word-processed file into one of these formats. But there are some
developments which take us in this direction. Inspired by the popularity of
WWW, Microsoft now provides a filter for translating a MS-Word document into
HTML. If this type of facility becomes widespread, it may make the life of
the future electronic publisher easier. But note that HTML is far from being
a rich tagging format: its '<TITLE>' tag might equally map to an AIF
_ACA_abstract_title, to a CIF _publ_section_title, and to lots of other
things besides!
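To give a concrete (and invented) illustration of the point:

   CIF:    _publ_section_title    'Structure of a modulated phase of ...'
   AIF:    _ACA_abstract_title    'Structure of a modulated phase of ...'
   HTML:   <TITLE>Structure of a modulated phase of ...</TITLE>

Nothing in the HTML tag records which of the first two (or which of the many
other possible source tags) the string came from.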

Well, probably this is the place to say 'this correspondence is now closed'.
For those who want to discuss this further, Phil has this to say:

PEB> Hi Brian.. so much fun am I having that I have decided to devote a little
PEB> time to this at the workshop both from the results of my making mistakes
PEB> and how the user interprets CIF!


New topics
==========

D33.1 Modulated Structures dictionary
-------------------------------------
Dr Gotzon Madariaga has submitted a new draft of the modulated structures
dictionary, which I am placing in the comcifs ftp directory (as the
dictionary file ms_core.dic and the associated example file ms_example.cif).
I shall not, as yet, mail this to everyone because there are a few technical
issues that I think we should resolve first; but, on the other hand, I do
encourage you to download copies (or ask me to e-mail them to you) if you are
interested in these technical matters. I have already had some correspondence
with Dr Madariaga, part of which I reproduce below (my contributions are
prefaced with a single ">", Dr Madariaga's with a "G>"):

G> Dear Prof. Brown:
G> 
G>  I am submitting to COMCIFS a new draft of the future CIF Dictionary for 
G> modulated structures. I have tried to correct all the points you mentioned
G> in  your mail (6 months ago) looking for solutions already approved by
G> COMCIFS. After reading some COMCIFS discussions I was in a state of complete
G> confusion and the experience was, at least in the beginning, discouraging. I
G> had no answer to basic things like: what version of DDL should I use? 

> As the collator of these discussions, I must bear the blame if they are
> confusing to follow. However, it is the case that the rapid development of
> DDL in two divergent directions leaves the field in a state of some
> confusion. It is not clear to me whether we should adopt the untested DDL2
> that is being used in the mmCIF dictionary for all applications, or whether
> it is best, in the longer term, to have two different types of dictionary,
> those using DDL1 and those using DDL2, depending on the uses to which they
> are intended to be put. At this time there is an opportunity for the authors
> of current dictionaries to influence that decision. The primary motivation
> behind DDL version 2 is the presentation of information on datanames in a
> hierarchical form that is easily amenable to object-oriented programming
> techniques, and that allows relatively straightforward loading of relational
> databases using the machine-readable information in the dictionaries. The
> macromolecular (and NMR) dictionary authors consider this a very important
> consideration in their field; Brian Toby is at this stage much less
> convinced of the benefits of this approach in his more modest requirements
> for extending the powder definitions.

G> when I finally convinced myself of the benefits of using _include_file, I
G> received a new e-mail indicating that in the new DDL release such a facility
G> had been removed and it would possibly be substituted by a 'semantically-void
G> object loader' (??). Fortunately I could, at last, extract a lot of valuable 
G> information, especially from the ideas of Brian Toby.

> I consider it rather unfortunate that there have been so many changes in DDL
> over the last couple of years. Unfortunately, the people actively involved
> in developing this formalism are not able to commit themselves to working on
> the project full-time, and there tend to be occasional periods of frenetic
> activity, rather than a more gradual evolutionary process.

David Brown received a copy of this exchange, and wrote to me thus:

D> 	Madariaga is not the only one confused by comcifs business.  I 
D> suspect that you are privy to much more of the goings on than the rest of 
D> us.  Some of us will be familiar with some of the activities, others with 
D> other activities. A little bit of the history of DDL would not go 
D> amiss.  As I understand it, DDL was supposed to be the description of the 
D> STAR standard and it has been under construction during most of the life 
D> of comcifs, often causing us to defer decisions.  On the other hand, we 
D> have had the opportunity to influence its development.  Just as I 
D> received a copy of THE definitive paper that describes DDL, I suddenly 
D> discovered that there was a DDL2 which I had heard nothing about.  So do 
D> we have two STAR standards, a DDL1 and a DDL2?  Or is DDL2 an extension 
D> of DDL1, in that anything that conforms to DDL1 automatically conforms to 
D> DDL2?  This has not been made clear.  If the two standards are not 
D> mutually compatible (there was a hint of that in your letter to 
D> Madariaga) then how can we have CIFDICS that do not conform to the same 
D> standard?  If they are compatible, then we should abandon DDL1 (even 
D> before the ink is dry) and work with DDL2.
D> 
D> 	I also agree that DDL2 is itself pretty formidable (so much for 
D> the great simplicity of the STAR file structure!) and I have not had time 
D> to come to terms with all of its structure.  It does not help that, of 
D> necessity, no one has had time to write a simple introduction outlining 
D> the philosophy of the file, so most times we are left struggling with a 
D> dictionary that is itself written in the very structure it is attempting 
D> to define.  It is like reading a language dictionary without any prior 
D> knowledge of the language - all the definitions are circular!

I shall endeavour to put together an essay on the history of the DDL, which I
shall send in a later message. 


I shall break out some of the other points into separate sections for
discussion:

D33.2 Handling indeterminate numbers of datanames with _type_construct
----------------------------------------------------------------------
G> Modulated structures are normally described within the so-called
G> superspace formalism. Their diffraction patterns are then indexed using more
G> than three basic reciprocal vectors. The required additional vectors are the
G> modulation wave vectors, their number defining what is usually called the
G> dimension of the modulation. This modulation dimension is compound dependent
G> and, although normally below 3, it has no definite upper limit.
G>
G> 	One could then define the following data names
G> 	
G> 	_refln_index_h1 .  .  . 	_refln_index_h12
G>
G> and cover (with a wide margin of safety) any MS. 
G> 
G> 	But the question is what would COMCIFS do if somebody required 13 Miller
G> indices to index a diffraction pattern? Add new entries to the existing
G> dictionary? How many? I think that within the general philosophy of COMCIFS,
G> such a thing should never occur. I mean the dictionary definitions must be,
G> as much as possible, compound independent.

> Although I agree with the desirability of having a definition that is
> indefinitely extensible, it may be necessary to adopt the pragmatic
> viewpoint that defining twelve datanames is (probably) going to cover
> all cases, and simply extend to 13 or 14 if that remote possibility ever
> arises. However, if we can think of a better way to do it,
> then we should.

G>                                            The immediate consequence is that
G> the MS dictionary requires multi-valued data names just as those recently
G> introduced in DDL2.0. I have found this new approach really interesting but
G> given Brian's reticent comments, I decided (at least in this step) not to
G> include the new features of DDL2.0 in the MS dictionary.

> Though this is just the sort of case where you might ask yourself if the
> application does in fact need these new features. However, the expression
> of this idea in the current DDL is not compliant with the underlying STAR
> syntax, and needs to be modified before we can seriously consider it in CIF
> applications.

G> 	Nevertheless even using earlier versions of the DDL one has the
G> possibility of defining multi-valued data names without violating any STAR
G> rule. That could be done by associating a data name with a unique matchable
G> pattern specified by _type_construct. This pattern should be further decoded
G> by an intelligent parser. I suppose that this trick will be immediately
G> anathematized by COMCIFS, but it is one possibility for managing MS CIFs
G> efficiently, and I have included it in the new draft.

> 'anathematized' is a strong word - it makes COMCIFS sound like the Spanish
> Inquisition, which I hope we are not (although we do sometimes have to act
> the role of Devil's Advocate).
> 
> Personally, I am very enthusiastic about the power of _type_construct, and
> hope that it can be used in ways similar to those you suggest. But there are
> a couple of problems.
> 
> (1) The resulting value must be a single entity ('token', if you like) in
>     STAR terms. For example, your sample file contains the line
>        _diffrn_reflns_limit_indices_max  8  18  10  1
>     and this is incorrect, because the dataname must take only one value.
>        _diffrn_reflns_limit_indices_max  '8  18  10  1'
>     would be allowable purely in terms of syntax; but the applications
>     parser then needs to be intelligent enough to extract the individual
>     values from this string.
> 
> (2) In your constructions, you use expressions such as
>       ( *-?[0-9]+)( +-?[0-9]+){2}( +-?[0-9]+){(_cell_modulation_dimension)}
>     where the last term is meant to be understood as 'the value of the
>     dataname _cell_modulation_dimension'. This wasn't anticipated in the
>     original motivation for _type_construct, which was intended to supply
>     patterns that the data expression must match, rather than values. So
>     where I have used datanames within my examples for _type_construct, the
>     idea was to replace them by the regexp for that dataname (or by an
>     enumeration range).
> 
>     But your idea of substituting the value of that dataname is an
>     intriguing one, and I shall put it before COMCIFS as a possible
>     extension. It would, of course, make heavy demands on the intelligence
>     of the parser.

Let me amplify this last remark to allow for more discussion. The
_type_construct supplies a regular expression defining the possible patterns
a data value may take; for instance, the _type_construct for a date should
be something like  [0-9][0-9]-[0-9][0-9]-[0-9][0-9], where the square
brackets mean "any character in the range". So this would imply that
95-05-03 conforms to the allowed pattern for a date (though not necessarily
that it is a valid date; 95-95-95 also satisfies the pattern). What has
previously been suggested is that we replace some of the components of the
expression by datanames, with the understanding that we replace those
datanames in turn by their _type_construct patterns. So dates might be
given as (_year)-(_month)-(_day). Where we see _day, we go and look up the
_type_construct for _day, which is, let us say, [0123][0-9] (i.e. a 2-digit
number beginning with 0, 1, 2, 3 - this tightens up the range of values which
are considered valid).
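To take the same example a step further: if _month carried a _type_construct
of, say, [01][0-9] (an invented value, purely for illustration), the pattern

   (_year)-(_month)-(_day)

would expand to

   [0-9][0-9]-[01][0-9]-[0123][0-9]

so that 95-05-03 still matches but 95-95-95 is now rejected.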

Gotzon's suggestion is that, in some cases, we replace the _day (or whatever)
not by its permitted range, but by its actual value in the current data set.
Here's a specific example: he wants to define a set of Miller indices for a
reflection through a _type_construct looking something like this:
 '( *-?[0-9]+)( +-?[0-9]+){2}( +-?[0-9]+){(_cell_modulation_dimension)}'
       ^         ^         ^   ^                                   ^
       |         \________/     \_________________________________/
       |             |                            |
       |             |            and this means 'n  more of these', where
       |             |          n is the value of _cell_modulation_dimension
       |             \
       |            This means '2 more of the same, separated by spaces'
       |
   This means an integer with one or more digits and optional minus sign

Here's an analogy to try to show the sort of thing we're trying to do. In Unix,
you can type "cat ls" and the "cat" command will print to the screen the
contents of the file "ls" (if such exists). But if you type "cat `ls`", the
shell will first evaluate the command "ls" (which means "list the
names of the files in the current directory") and then print to the screen the
contents of all those files. Probably we should introduce some special
operator into the _type_construct (like the grave accents ` in the Unix
example) to indicate when we are required to substitute the current value of
the data item referenced.
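To tie this back to Gotzon's own example: suppose the data set declares

   _cell_modulation_dimension        1

(which is what the four indices in his sample value suggest). The final group
of the construct is then to be repeated once, so the whole expression demands
3 + 1 integers, and the corrected form quoted above,

   _diffrn_reflns_limit_indices_max  '8  18  10  1'

matches it; against a data set declaring a modulation dimension of 2 it would
not.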

So: is it feasible to undertake this dynamic substitution of values into
_type_construct? Is it desirable?


D33.3 Other _type_construct queries
-----------------------------------
G> 	While constructing the regular expressions I used for the
G> _type_construct fields, I have found (I am not very familiar with REGEX)
G> questions as the following:
G>
G> 	- could REGEX eliminate the use of quotation marks in those character
G>           expressions that contain embedded blanks?

> No. This is my point (1) above. The data file MUST adhere to the STAR
> syntax, so that it can be parsed without reference to any dictionary (this
> is the idea behind the so-called 'semantically void' applications). This was
> the fatal flaw in the first attempts to define multivalued data types in DDL2.
> It may be possible to generate loop_'s of data values; but if the dataname
> itself appears in a loop, the CIF syntax would have to be extended to
> include nested loops (these are valid in STAR files). This is, however, a
> very difficult construction to program, and there are good reasons not to
> attempt it at this stage.

G> 	- is <newline>, within the CIF conventions, a REGEX special character?
G> 	- How should a REGEX that extends over more than one line be written? I
G>           have used a \ as line terminator, but it seems not to be very orthodox.

> The idea of regular expressions (regex) is to provide a pattern-matching
> language. In general it will be effected through a specific library
> implementation, so I think that the best approach will be to actually
> write a parser that will expand and match _type_construct regular
> expressions using, say, the GNU regex library. 
> 
> We will probably need to adopt some extra conventions for handling long
> expressions that extend over more than one line, and the use of \ does
> seem appropriate (there are parallels in applications like Unix shell
> programs). Nick Spadaccini is thinking of how to implement the
> _type_construct idea, and we should perhaps talk to him about this.
> 
> Note, however, that if you write a multi-line regex expression in the
> dictionary, it MUST be entered as a text field, e.g.
>     _type_construct
> ;        (( *-?[0-9]+)\
>          ( +-?[0-9]+){2}\
>          ( +-?[0-9]+){(_cell_modulation_dimension)})\
>          {(_cell_reciprocal_basis_vect_numb)}
> ;


G> 	I have also used REGEX to define block identifiers. These _block_id
G> are, as in the powder CIF Dictionary, necessary for MS's. Therefore
G> I think that their format (at least in part) should be standardized.

> Sounds fine to me.

I remind you that the POSIX regex document is available in the comcifs ftp
area as the PostScript file regex.ps.

D33.4 Parsing formulae in CIFs
------------------------------
G> 	More questions to COMCIFS. The modulation of the atomic parameters is
G> always described by periodic functions which, by definition, are expandable
G> as Fourier series. Normally the number of trigonometric terms present in the
G> series is low. However some modulated structures have been solved using
G> strongly anharmonic modulation functions (based on sawtooth functions) that
G> require more suitable parametrizations. These special functions are
G> considered in the MS dictionary but their description will not be
G> operational if a method of parsing formulas is not included in CIFs.

> This is an interesting question which I shall put up for general discussion.
> There has been some limited discussion of how to embed parsable formulae
> within CIF dictionaries; ideally, the techniques used to do this would also
> be employed in the parsing of formulae in the CIF data files.

If I understand correctly, the Fourier representation is done in this way:

_refine_ls_mod_funct_description
;
Displacive modulation. Fourier series. Modulation of SeO4 group described in
terms of rigid translations and rotations.
;

loop_
_refine_ls_Fourier_term
_refine_ls_Fourier_term_code
1 1
loop_
_atom_site_label
_atom_site_Fourier_label_disp
_atom_site_Fourier_coeff_disp
K1   Acz  0.0080(4)
K1   Asz -0.0106(5)
K2   Acz  0.0159(4)
K2   Asz  0.0071(6)
SeO4 Rcx -4.2(1)
SeO4 Rsx  0.91(3)
SeO4 Rcy  4.3(1)
SeO4 Rsy -5.5(2)
SeO4 Tcz -0.0089(2)
SeO4 Tsz -0.0058(2)

where the last loop effectively gives coefficients of Fourier terms
representing displacive modulations. Terms like Rcx mean 'the real part of
the complex amplitude of a rigid-group translation in the x direction'.

What is required is a more general way of parameterising the displacement
functions where these are not harmonic series.

D33.5 Units
-----------
Some questions regarding units.

First, in the draft modulated structures dictionary there is currently the
possibility of datanames referring to quantities that may be measured in
different units:

G>         Another important question concerns the labelling of parameters. The
G> excessively verbose labelling of the first draft has been strongly reduced by
G> means of the appropriate _type_construct. However my impression is that
G> they should be, at least partially, split. The problem is just a matter of
G> units. For example the _atom_site_Fourier_coeff_disp associated with a
G> certain _atom_site_Fourier_label_disp could correspond to a rigid rotation
G> (frequently measured in degrees) or a displacement (usually a dimensionless
G> quantity, although it could be expressed in angstroms). Such differences
G> could be included in the label, but the parser would not know anything
G> about the units conversion (between degrees and radians in the mentioned
G> case of rigid rotations). One possibility is to make the units explicit in
G> the definition, but I do not know if it is the best solution.

My feeling is that we have tended almost everywhere to ensure that the
quantity referred to by a dataname is expressed in a single unit. Are there
any precedents for doing otherwise? For example, in the core,
'_diffrn_standards_interval_count' and '_diffrn_standards_interval_time'
are defined as separate items to reflect the two different ways of specifying
the interval between measurements of standard reflections.

Note that there is something of a precedent for embedding the units in the
dataname (through the use of the 'units extension', so that _cell_length_a_pm
is the cell a dimension in picometres, as opposed to _cell_length_a, which
gives the value in angstroms). However, this is increasingly regarded as poor
style, and Syd is even considering dropping this provision from the published
DDL specification. I should say here that I do not see a major problem in
dropping it from the DDL, although we need to think hard about whether we are
able to drop all mention of such datanames as _cell_length_a_pm from the
dictionary. There are several cases in the IUCr CIF archive where authors
have used _exptl_absorpt_coefficient_mu_cm and a few instances of distances
measured in pm and suffixed accordingly.
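For anyone who has not met the units extension, the practical effect is that
the same cell edge can legitimately appear in the archive in either of the
following forms (values invented):

   _cell_length_a          5.432(2)     # angstroms, the default unit
   _cell_length_a_pm     543.2(2)       # picometres, via the units extension

and any software reading the archive must be prepared to recognise both.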

Paula has also run up against problems with units in her work with the PDB on
archiving structure factors:

P> I've been working with Brookhaven to try and produce a
P> draft of a set of CIF data items to use in reorganizing their archive of
P> structure factors (a much easier task than the coordinates).  Things were
P> going well until a moment ago, when I realized we had a problem.
P> It is that the data items that we are using from
P> the CIF core (_refln.F_calc, _refln.F_meas, for instance) are defined such 
P> that the units are required to be electrons.  This is clearly
P> not the case with the data in the archive - I'm sure that in
P> 95% of the cases, the structure factors are on an arbitrary scale.
P> 
P> The only clean way out of this that I see is a new set of data items, where
P> the units are specified to be arbitrary.  Any thoughts?

I think Paula's right. Does anyone see a better way around this?


D33.6 Family names
------------------
P> Another little question that has come up is about what the CIF assumption is
P> about suffixes in names (i.e., Jr., Sr., III).  The example that we carry
P> in all of the _name data items is
P> 
P>     _item_examples.case          'Bleary, Percival R.'
P>                                  "O'Neil, F.K."
P>                                  'Van den Bossche, G.'
P>                                  'Yang, D.-L.'
P>                                  'Simonov, Yu.A'
P> 
P> but none of these give a style for suffixes.  Any thoughts?

The WDC9 data dictionary has

data_name_family
    _name                       '_name_family'
    _type                        char
    _list                        yes
    loop_ _example               'Epelboin' 'Van Dyk' 'von Graben' "O'Neill"
    _definition
;              Family name. The name used for primary identification.
               Multiple components should be hyphenated or capitalised
               according to local custom. If quote delimiters are used,
               they should not conflict with embedded diacritical marks.
                 ...
               Dynastic components ("Jr", "III") may also be given.
;

data_name_other
    _name                       '_name_other'
    _type                        char
    _list                        yes
    _list_identifier            '_name_family'
    loop_ _example              'Prof. Yves' 'Dr Richard J.' 'Mr Zhong-he'
    _definition
;              Given names or initials. Title may precede the name, and should
               be shortened with a trailing point if it is an abbreviation (e.g.
               "Prof." for "Professor") and without a trailing point if it is
               a contraction ("Dr" for "Doctor").
;

Taking these together would imply that a composite name should take the
form

    '(surname) (dynastic modifier), (title) (forenames)'

to be compatible with the WDC conventions (and why not?). So

     'Bleary Jr, Percival R.'   'Churchill III, Mr Winston S.'  etc.

Has anyone any problems with this?


Regards
Brian