Discussion List Archives

(25) _local_ (again), radiation types, MIME, HTML, DDL, dict's

  • To: COMCIFS@uk.ac.iucr
  • Subject: (25) _local_ (again), radiation types, MIME, HTML, DDL, dict's
  • From: bm@uk.ac.iucr (Brian McMahon)
  • Date: Thu, 14 Jul 94 17:12:36 BST

Dear Colleagues

Just a brief circular this time :-)  Life is a little busy at present,
so it's possible I've forgotten some important pending matters - please feel
free to jog my memory. Note that there remain a large number of open issues,
and continuing comment on any of these is always welcome.

D25.1 - was (12)D4.1, also (8, 10, 11, 17) _local_
--------------------------------------------------
Just to illustrate that putting one's head in the sand doesn't always work...

B>    Have we ever come to a conclusion on the use of _local_*? I would
B> like to see COMCIFS declare that CIF will never have any definitions
B> beginning _local_ so that organizations are free to use _local_ for
B> their own private definitions without fear that at any time such usage
B> will interfere with a CIF dictionary. I would prefer to see that
B> COMCIFS declare that all local definitions use _local_, but I don't
B> think that will pass.

I think our lengthy debate on this was rather inconclusive. However, Brian
Toby's suggestion here seems to show the way out of the morass. If we agree
that COMCIFS will never include datanames beginning _local_ in official
dictionaries, then users may devise data names with this construction without
fear of colliding with official datanames. There is, of course, the
potential for collision with other users' invented _local_ data names,
and the user must realise this and accept the consequences. Paula, I
recall, saw no merit in this approach, and nor do I, especially in the
context of the register of unique user prefixes which we have agreed to, in
principle. (Another loose end to be tied up.) But we can hardly stop people
using the prefix if they are sure that it will have no impact upon their
projected uses.

So I call for agreement on the statement that

COMCIFS will not permit the construction of datanames beginning with the
string _local_ in dictionaries intended for general use.
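
As an illustration only, a dictionary-maintenance script could enforce
this reservation with a check like the following Python sketch. The
function names and the sample data names are hypothetical, and comparing
the prefix without regard to case is an assumption.

```python
RESERVED_PREFIX = "_local_"


def is_reserved_for_local_use(dataname):
    """Return True if the data name begins with the reserved _local_ prefix."""
    # Assumption: the prefix comparison ignores case.
    return dataname.lower().startswith(RESERVED_PREFIX)


def offending_names(datanames):
    """List the names that a general-use dictionary should not define."""
    return [name for name in datanames if is_reserved_for_local_use(name)]


# Hypothetical candidate entries for an official dictionary.
candidates = ["_cell_length_a", "_local_sample_colour", "_diffrn_radiation_type"]
print(offending_names(candidates))
```

Such a check says nothing about collisions between different users'
_local_ names; those remain the users' own responsibility.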


D25.2 New X-ray edge nomenclature
---------------------------------
B>   IUPAC in its infinite (or was that infinitesimal) wisdom has created
B> a new naming system for x-ray wavelengths [Jenkins et al., Pure & Appl.
B> Chem. 63, 735-746 (1991).] While they have eliminated greek letters,
B> they have kept (in this day of ASCII!) subscripts. I suggest we add 
B> two new entries to the core to deal with this. I am suggesting that we
B> decouple the target from the symbol based on David's anti-parsing 
B> principle. (Note that there are many more possible IUPAC symbols, but I 
B> am too lazy to type in enumerations that no one will ever use.)
B> 
B> data_diffrn_radiation_xray_symbol
B>     _name                      '_diffrn_radiation_xray_symbol'
B>     _category                    diffrn_radiation
B>     _type                        char
B>     _list                        both
B>     _list_reference            '_diffrn_radiation_wavelength_id'
B>     loop_ _enumeration _enumeration_detail
B>           K-L~3~        'K\a~1~ in older Siegbahn notation'
B>           K-L~2~        'K\a~2~ in older Siegbahn notation' 
B>           K-M~3~        'K\b~1~ in older Siegbahn notation' 
B>           K-L~2,3~      'use where K-L~3~ and K-L~2~ are not resolved'
B>     _definition
B> ;              The IUPAC symbol for the x-ray wavelength for probe
B>                radiation. 
B> ;
B> 
B> data_diffrn_radiation_xray_target
B>     _name                      '_diffrn_radiation_xray_target'
B>     _category                    diffrn_radiation
B>     _type                        char
B>     _list                        both
B>     _list_reference            '_diffrn_radiation_wavelength_id'
B>     loop_ _enumeration 
B>         H  He  Li  Be  B  C  N  O  F  Ne  Na  Mg  Al  Si  P  S  Cl  
B>         Ar  K  Ca  Sc  Ti  V  Cr  Mn  Fe  Co  Ni  Cu  Zn  Ga  Ge  
B>         As  Se  Br  Kr  Rb  Sr  Y  Zr  Nb  Mo  Tc  Ru  Rh  Pd  Ag  
B>         Cd  In  Sn  Sb  Te  I  Xe  Cs  Ba  La  Ce  Pr  Nd  Pm  Sm 
B>         Eu  Gd  Tb  Dy  Ho  Er  Tm  Yb  Lu  Hf  Ta  W  Re  Os  Ir  
B>         Pt  Au  Hg  Tl  Pb  Bi  Po  At  Rn  Fr  Ra  Ac  Th  Pa  U  
B>         Np  Pu  Am  Cm  Bk  Cf  Es  Fm  Md  No  Lr
B>     _definition
B> ;              The chemical element symbol for the x-ray target
B>                (usually the anode) used for x-ray generation.
B> ;

These definitions might be used in place of the existing
_diffrn_radiation_type. If there are no objections, I shall add these two
definitions to the Core.

D25.3 _pd_instr_radiation_probe
-------------------------------
Brian has another question relating to _diffrn_radiation_type, which,
you will recall, is defined as:

    _name                      '_diffrn_radiation_type'
    _category                    diffrn_radiation
    _type                        char
    _list                        both
    _list_reference            '_diffrn_radiation_wavelength_id'
    loop_ _example               CuK\a      neutron     electron
    _definition
;              The nature of the radiation.
;

B>  [David] asked if there is a need for both _pd_instr_radiation_probe and
B> _diffrn_radiation_type. I have come to the conclusion that I am not happy 
B> with a non-enumerated way to determine the type of radiation used for a 
B> dataset as this is one of the most important values in the CIF. Note 
B> that there are examples of neutron and electron, but x-rays
B> must be assumed if any other value is specified. I would be happy to see 
B> _pd_instr_radiation_probe moved to the core. If this is done it might 
B> make sense to modify the examples for _diffrn_radiation_type to remove 
B> neutron and electron.

The definition of _pd_instr_radiation_probe is currently:

    _name                      '_pd_instr_radiation_probe'
    _category                    pd_instr   
    _type                        char
    loop_ _enumeration           x-ray
                                 neutron
                                 electron
    _definition
;             Code for the type of radiation used. It is strongly encouraged
              that this field be specified for all powder diffraction data
              so that the probe radiation can be simply determined.
;

I am a little unhappy about having two entries in the core that are so
similar, differing in effect only by the presence or absence of a list
of enumerated values. But I don't have strong feelings. Brian is keen to
have this settled one way or the other, so your opinions are welcomed.
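
To illustrate what the enumeration buys a program, here is a rough Python
sketch. The enumerated values are those listed above; the function names
and the fallback rule for free-text values are only assumptions about how
a reader of _diffrn_radiation_type might have to behave.

```python
PROBE_ENUMERATION = {"x-ray", "neutron", "electron"}


def probe_from_enumerated(value):
    """Validate a _pd_instr_radiation_probe value against its enumeration."""
    if value not in PROBE_ENUMERATION:
        raise ValueError("not an enumerated probe: %r" % value)
    return value


def probe_from_free_text(value):
    """Guess the probe from a free-text _diffrn_radiation_type value."""
    lowered = value.lower()
    if lowered in ("neutron", "electron"):
        return lowered
    # Any other value (CuK\a and so on) can only be assumed to mean x-rays.
    return "x-ray"


print(probe_from_enumerated("neutron"))
print(probe_from_free_text(r"CuK\a"))
```

With the free-text item a misspelling such as 'nuetron' is silently taken
as x-rays, whereas the enumerated item allows it to be rejected.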


D25.4  MIME types and the spread of CIF
---------------------------------------
PMR> CIF Colleagues:
PMR> 	Here are some points that I'd be grateful if you would consider.  
PMR> Several of them arise out  of my project to mark-up the dictionaries in 
PMR> html (see below) but they are more general.
PMR> 
PMR> 1.	CIF is now becoming established in certain areas *outside* 
PMR> crystallography (partly as a result of 'advertising' at Molecular 
PMR> Graphics and other meetings).  I am aware of its adoption by:
PMR> 	- Roche (for molecular modelling).  I have their dictionary which I 
PMR>         will post here shortly as it is publicly available
PMR> 	- at least two labs working in the area of biomolecular sequence
PMR> 	and I am hopeful that it will become a standard tool for the
PMR> 	relaunched CCP11 project at Daresbury.
PMR> 
PMR> 2.	The proposal that Henry Rzepa and I put to the MIME committee looks
PMR> like gaining support.

(see D20.2)

PMR> The final form of the syntax is still undecided but we
PMR> expect that it will be chemical/.  There are about 20 types proposed in 
PMR> this standard, including cif, mif (and pdb, fdat, cdc).  This means that 
PMR> anyone mailing or viewing a *.cif can expect it to be transported and 
PMR> viewed without corruption.
PMR> 
PMR> 	This has implications for COMCIFS as, unless there is objection, 
PMR> *.cif will become a *standard suffix*.  I would strongly urge the 
PMR> dictionary distributors to adopt this standard as the dictionaries are, 
PMR> of course, CIFs.  (I appreciate that the DOS file limit makes this 
PMR> difficult, but I think it will cause problems unless a nomenclature is 
PMR> [adopted] which is consistent with MIME.)  There is a very strong move 
PMR> internationally to the use of MIME types for public documents.  With this 
PMR> convention it becomes possible for people browsing a CIF site to view any 
PMR> CIFs with a CIF viewer (see below).
PMR> 
PMR> 	I believe that the MIME recognition will give a great deal of 
PMR> welcome publicity to CIF.

A brief reminder here that the idea behind this is to define standard file
types in terms of their filename extensions, so that you can configure a
gopher or WWW client to do something specific to CIFs (recognised as any
file with a name ending in '.cif'). This is a class of application tuned
to current requirements; I see its effect as being to encourage authors and
archive sites to name their files as <something>.cif, but it is *not* an
extension to the standard, requiring that CIFs be named with file extension
'.cif'.

Note, by the way, that there is also a file format known as CIF (for Caltech
Intermediate Form) used in VLSI CAD applications; and that MIF (Maker
Interchange Format) is used in the FrameMaker document package. Henry's
proposals, if accepted, would pre-empt Caltech and Frame Corp. from
using their (rather well established) abbreviations as globally accepted
filename extensions. This might not please them. I don't know if this worries
people - I would hope we don't need to hunt for registration of trademarks to
go along with our patent and copyright baggage!

I would suggest to Peter that STAR dictionaries NOT be given a 'cif'
extension. Not only are they not CIFs, but the typical end-user application
would not wish to 'view' them in the same way (I was talking to Henry last
week, and it's clear that his vision of the way this will work is to load a
CIF directly into a graphics package). 'dic' would be a sensible convention:
whether it needs to be a MIME recognised type is arguable (though it might
be well to register it against future use).
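
The kind of client configuration described above amounts to a lookup from
filename extension to action. A minimal Python sketch follows; the action
descriptions are invented for illustration, and no particular MIME type
string is assumed.

```python
import os.path

# Hypothetical actions that a gopher or WWW client might be configured with.
ACTIONS = {
    ".cif": "load into graphics viewer",
    ".dic": "display as dictionary text",
}


def action_for(filename, default="save to disk"):
    """Choose an action from the filename extension alone."""
    extension = os.path.splitext(filename)[1].lower()
    return ACTIONS.get(extension, default)


for name in ("structure.cif", "cif_mm.dic", "notes.txt"):
    print(name, "->", action_for(name))
```

Because the decision rests only on the name, a CIF stored under any other
extension is still a valid CIF; it simply will not trigger the viewer.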

Any other comments?

D25.5 Distributed dictionaries in hyperspace
--------------------------------------------
PMR> 3.	I have now automatically marked up the ddl, core and mm 
PMR> dictionaries from Chester, into html.
PMR> This is imprecise for several reasons:
PMR> - Comments are used to structure some of the files, but comments have
PMR> 	no semantic import.

Not a critical problem - mostly the comments are used to ease the eye
when scanning a dictionary listing. According to previously agreed
principles, we shall work hard to eliminate all information of substance
from comments.

PMR> - There is frequent use of underscored terms, especially in definitions, 
PMR>    which are not defined.  Thus '_audit_author_' is used 
PMR>    implicitly to refer to _audit_author_* (i.e. all items 
PMR>    starting with '_audit_author_') or perhaps to the category
PMR> 	_audit_author (not necessarily the same) or perhaps to
PMR> 	_audit_author_[] (again not necessarily the same).
PMR> - the new dictionary_definition is welcome, but there is only an
PMR> 	*implied* semantic connection with the ?category that it 
PMR> 	relates to.  As you will see, all these items are naturally
PMR> 	classified in dictionary_definition category, rather than linked
PMR> 	to the one they describe.  Perhaps a '_category_described'
PMR> 	field would be useful.

Syd? I'm not sure that this is very important - the dictionary_definition
entries are aimed at the human reader (though there have been some proposals
to build in more of the properties of the items in the associated categories
in a machine-readable way). I would incline not to add another DDL term
for this at present, at least until we have seen how effective machine
parsing of everything else turns out to be!

PMR> - I have not fully solved the problem of linked names (e.g. 
PMR>    _cell_length_*) but this shouldn't be a problem.
PMR> Ideally it would be valuable to be able to mark up all 
PMR> underscored terms in the dictionary, although I can see that _nm, etc 
PMR> require a manually created glossary.
PMR> 
PMR> 4.	The markup exercise has very exciting possibilities for 
PMR> distributed dictionaries.  The html world (especially in latex2html) has 
PMR> anticipated this problem and starts to solve it by each site having a
PMR> downloadable index file with crossreferences in it.  Thus we might wish 
PMR> IUCr to maintain an index.cif which contained a list of all the 
PMR> IUCr-supported CIF names, whilst other sites (e.g. CCP11) might maintain 
PMR> subsidiary ones.  The user can then devise her dictionary based on 
PMR> publicly available ones without having to download (obviously IUCr terms 
PMR> have highest priority).  
PMR> 	I shall explore whether html can easily support a hierarchy of 
PMR> dictionaries, because this is an obvious present need.
PMR> 
PMR> -------------------------------------------------------------------------
PMR> 
PMR> I'd value comments on the dictionaries marked up in html:
PMR> http://www.dl.ac.uk/CBMT/cif/HOME.html
PMR> 
PMR> It's very experimental and is based primarily on categories, so that the 
PMR> entries are put into subdirectories based on these.  Where dictionaries 
PMR> have items in the same category these are bundled.  I would appreciate a 
PMR> standard term to identify dictionaries with (e.g. [mm], mm93 or whatever) 
PMR> because until then this is likely to be inconsistent.  Please let me have 
PMR> comments.
PMR> 
PMR> There is also a crude search (grep -i) on the names of entries, so that 
PMR> you can search for everything with the field 'cell' in it.  Your browser 
PMR> has to support forms for this (Mosaic or lynx are what I use).
PMR> 
PMR> PS.  I also intend to mount tkCIF (my tk based CIF hypertext browser and 
PMR> proto-editor) very shortly and I'd welcome people to test it.  The editing 
PMR> is semi-intelligent - e.g. if you put a space in a word it appends quotes 
PMR> at the ends if required.

I can say that I have very briefly glanced at the html dictionaries, and
these look very exciting. However, the structuring does not at first glance
look correct - I'll study these again. Other folk are encouraged to look at
these, especially as Peter is keen to announce their existence to a public
forum such as the bionet newsgroup.


D25.6 DDL revisited
-------------------
(a) Validating the form of a data item
--------------------------------------
Although the DDL has been 'almost' frozen for some time, Syd is still
proving amenable to constructive suggestions on its refinement, and one of
the potentially useful outcomes of our meetings in Atlanta was the provision
of a new dictionary dataname, '_type_construct', which allows intelligent
parsers to validate the form of a data expression against a supplied
pattern. The pattern uses the 'regular expression' (regexp) syntax that is
familiar to programmers in Unix and similar environments, and may also
contain pointers to datanames defined elsewhere in the STAR dictionaries; the
process is therefore recursive.

Some of us have been working on implementing this, and progress seems to have
been made. Here, for instance, are a couple of examples from the STAR
core dictionary (mentioned in D21.7), to illustrate how this may work.
The final form of the pattern-matching language to be used is still to
be decided, but it will be similar to this example.

data_audit_creation_date
    _name                      '_audit_creation_date'
    _category                  audit
    _type                      char
    _type_construct
;
    {_chronology_year}\                         # year must occur
        { {-{_chronology_month}}?\              # month only if year
          { {-{_chronology_day}}?\              # day only if month
            { {T{_chronology_hour}}?\           # hour only if day
              { {:{_chronology_minute}}?\       # minute only if hour
                {:{_chronology_second}}?\       # second only if minute
               }?\
              {[+-]{_chronology_timezone}}?\    # timezone if any time
             }?\
           }?\
         }?
;
    loop_ _example    1990-07-12   1994-07-12T11:10:12   1994-02-22T11+08
    _definition
;              A date that the CIF was created. 
;

data_chronology_[]
    _name                      '_chronology_[]'
    _category                  dictionary_definition
    _type                      null
    _definition
;              Data names in the _chronology_ category may be used as
               components of date values for other datanames. THEY MAY
               NOT STAND ALONE AS DATA NAMES IN ANY DATA FILE.
;


data_chronology_year
    _name                     '_chronology_year'
    _category                  chronology
    _type                      numb
    _type_construct            [0-9][0-9][0-9][0-9]
    _example                   1994
    _enumeration_range         1:
    _definition
;              The year component of a date, in years anno domini of the
               Christian calendar.
;


Incidentally, on a point of style, Syd dislikes the indented layout which
seeks to display levels of nesting. Below is a more compact alternative.
Either will be equally readable by a program - the nested layout *may*
be more easily readable by a human, but does that matter? This is not
too important an issue, but any opinions are welcome.

    _type_construct
;
    (_chronology_year)\                    # year must occur
    ((-(_chronology_month))?\              # month only if year
    ((-(_chronology_day))?\                # day only if month
    ((T(_chronology_hour))?\               # hour only if day
    ((:(_chronology_minute))?\             # minute only if hour
    (:(_chronology_second))?)?\            # second only if minute
    ([+-](_chronology_timezone))?)?)?)?    # timezone if any time
;
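
For anyone wishing to experiment, the recursive expansion can be sketched
in a few lines of Python using the compact layout. This is only a rough
illustration, not the agreed pattern language: just the _chronology_year
pattern is taken from the dictionary extract, the other component patterns
are assumed, and the helper names are hypothetical.

```python
import re

# Only the _chronology_year pattern comes from the dictionary extract;
# the other component patterns are assumptions made for this illustration.
COMPONENTS = {
    "_chronology_year": "[0-9][0-9][0-9][0-9]",
    "_chronology_month": "[0-9][0-9]",
    "_chronology_day": "[0-9][0-9]",
    "_chronology_hour": "[0-9][0-9]",
    "_chronology_minute": "[0-9][0-9]",
    "_chronology_second": "[0-9][0-9]",
    "_chronology_timezone": "[0-9][0-9]",
}

CONSTRUCT = r"""
    (_chronology_year)\                    # year must occur
    ((-(_chronology_month))?\              # month only if year
    ((-(_chronology_day))?\                # day only if month
    ((T(_chronology_hour))?\               # hour only if day
    ((:(_chronology_minute))?\             # minute only if hour
    (:(_chronology_second))?)?\            # second only if minute
    ([+-](_chronology_timezone))?)?)?)?    # timezone if any time
"""

# A parenthesised data name, e.g. (_chronology_year), is a reference.
REFERENCE = re.compile(r"\((_[A-Za-z0-9_]+)\)")


def join_lines(construct):
    """Drop # comments, continuation backslashes and layout whitespace."""
    parts = []
    for line in construct.splitlines():
        line = line.split("#", 1)[0].strip()
        parts.append(line.rstrip("\\"))
    return "".join(parts)


def expand(pattern, components, depth=0):
    """Recursively replace each (_dataname) reference by the named pattern."""
    if depth > 20:
        raise ValueError("references nested too deeply (circular definition?)")

    def substitute(match):
        inner = expand(components[match.group(1)], components, depth + 1)
        return "(" + inner + ")"

    return REFERENCE.sub(substitute, pattern)


DATE = re.compile(expand(join_lines(CONSTRUCT), COMPONENTS))

for value in ("1990-07-12", "1994-07-12T11:10:12", "1994-02-22T11+08", "94-07-12"):
    print(value, bool(DATE.fullmatch(value)))
```

Under these assumptions the three dictionary examples are accepted and the
two-digit year is rejected. The depth limit is a crude guard against
component definitions that refer to one another in a circle.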

I have one suggestion that I would like to hear people's opinions on:
basic type constituents (such as _chronology_year) will be common
components to many data value constructions, but should not appear alone
in any data file. One way to enforce this would be to assign them to
category 'null' (rather than category 'chronology' etc). So items in
category 'null' would be special, in a similar way to items of type 'null'
(i.e. the dictionary_definition stuff). These new items should be allowed to
be of *type* numb (or char) to allow them to have enumeration ranges and
other attributes. Any thoughts on this?

(b) Fundamental data types
--------------------------
The introduction of _type_construct allows intelligent parsers to
check for most, if not all, of the extended data types requested by
database implementors. We had some prolonged discussions over dinner in
Atlanta on this topic. The outcome is that Syd remains steadfast on retaining
the minimum number possible of basic data types: numb (data that can be
interpreted numerically), char (textual data), null (for data items, such as
the explanatory material in dictionaries, that may not appear in data files).

The DDL term _type_conditions allows for certain numb or char fields to have
additional properties - allowed values for this are none (no additional
interpretation of the value based on its structure), esd (a number in
parentheses trailing another number is to be interpreted as an e.s.d.),
seq (the characters : and , delimit values in a list or sequence),
incl (for STAR-compliant files that must be included in the calling file
at the point of invocation) and xdat (for external data files).
We have had some discussions over the relationship between this and
_type_construct, say for the case of e.s.d.'s. However, I concede that
there is a difference in the nature of these terms - _type_conditions esd
defines the *interpretation* of the esd value; a putative
   _type_construct   "[+-]?[0-9]*\.[0-9]+(\([0-9]?\))?"
would (honest) define the allowed form of a number with e.s.d., but
would give no indication of the meaning of the term in parentheses.

So I see the existence of these three _type_ datanames as permitting data
typing to be done according to a hierarchy of complexity. Simple programs
(such as ciftex) benefit from being able to differentiate between numb, char
and null. More complex programs that need to interpret and operate upon the
contents of certain fields will use the _type_conditions (and probably
programmers will need to hard-code the handling of fields possessing these
type conditions). The most powerful programs can do pattern-matching to
verify the form of a very wide range of data constructions. And programmers
may ignore any levels of complexity higher than are appropriate to their
application. (This last remark is slightly disingenuous. A simple CIF writer
ought to output dates in the date-compliant format described by the
relevant _type_construct. In practice, though, the author of such a program
is more likely to write his output date format after reading the redundant
but simple textual information in the _definition or _example fields
than to construct it dynamically from the _type_construct DDL.)

(c) _include_file
-----------------
First, I must correct a mis-statement of fact I made in circular no. 23,
where I said "_include_file may (uniquely) appear in dictionaries and data
files". It may not - it is an integral part of DDL, and so should appear in
dictionaries only. However, data files may include other STAR files
through datanames of _type_conditions 'incl'.

The function of _include_file is therefore to build up the hierarchy of
dictionaries against which a data file may be validated. The data file
itself should contain a pointer to the highest-level dictionary of relevance;
this pointer will be of _type_conditions 'incl', and I suggest

data_audit_conformant_dictionary
    _name            '_audit_conformant_dictionary'
    _category        audit
    _type            char
    _type_conditions incl
    _definition
;              The highest-level dictionary defining datanames used in
               this STAR file.
;

Then a macromolecular CIF may have an entry
     _audit_conformant_dictionary    cif_mm.dic
pointing to the dictionary of macromolecular terms. Then (in Syd's words,
as already outlined in circular no. 21):

S> My construction for dictionary file names will be:
S>             <dictionary type>_<dictionary class>.dic
S> So for the DDL core dictionary the generic name is "ddl_core.dic".
S> 
S> ddl_ext.dic      [ddl extension file(s)]
S> ddl_core.dic     [ddl core definitions]
S> star_core.dic    [primitive definitions common to all star dictionaries]
S> --------  additional dictionaries are application specific (e.g. CIFmm)
S> cif_core_ext.dic   [cif extensions to core definitions]
S> cif_core.dic       [cif core definitions]
S> cif_mm_ext.dic     [cif extensions to macromolecular definitions]
S> cif_mm.dic         [cif macromolecular definitions]
S> 
S> So a dictionary would always contain at the front an _include_file
S> which inserts the dictionary file higher in the tree. 
S> 
S> For example...
S>     ### MACROMOLECULAR CIF DICTIONARY
S>         data_on_this_dictionary
S>             _dictionary_name                cif_mm.dic
S>         data_include_dependent_dictionaries
S>             _include_file                   cif_mm_ext.dic
S>           ...          etc. etc.
S> 
S> and the included file looks like this....
S>     ### MACROMOLECULAR CIF EXTENSION DICTIONARY
S>         data_on_this_dictionary
S>             _dictionary_name                cif_mm_ext.dic
S>         data_include_dependent_dictionaries
S>             _include_file                   cif_core.dic
S>           ...          etc. etc.
S> 
S> and the included file looks like this....
S>     ### CORE CIF DICTIONARY
S>         data_on_this_dictionary
S>             _dictionary_name                cif_core.dic
S>         data_include_dependent_dictionaries
S>             _include_file                   cif_core_ext.dic
S>           ...          etc. etc.
S> 
S> and the included file looks like this....
S>      ### CORE CIF EXTENSION DICTIONARY
S>         data_on_this_dictionary
S>             _dictionary_name                cif_core_ext.dic
S>         data_include_dependent_dictionaries
S>             _include_file                   star_core.dic
S>           ...          etc. etc.
S> 
S> and the included file looks like this....
S>     ### PRIMITIVE STAR DICTIONARY
S>         data_on_this_dictionary
S>             _dictionary_name                star_core.dic
S>         data_include_dependent_dictionaries
S>             _include_file                   ddl_core.dic 
S>           ...          etc. etc.
S> 
S> and the included file looks like this....
S> 
S> OK, you get the idea. This way you really do not have to be even aware
S> of the hierarchy for your application -- just the dictionary above you
S> in the tree. But of course the parsing software must be able to handle
S> nested inclusions.

Note in this example that the mmCIF dictionary includes the file
cif_mm_ext.dic, which is supposed to be some extension to the mmCIF
dictionary definitions. If this did not exist, the value of _include_file
would be 'cif_core.dic', and likewise for other extension dictionaries
that might be created. It is the responsibility of the included
dictionary to supply a pointer to the next (lower-level) link in the chain.
At first, it seems counter-intuitive to assign the 'extension dictionaries'
as 'lower-level' in this scheme, but it does work.
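
For parser writers, following the chain might look like this Python
sketch. The table simply records the _include_file value shown at the
front of each dictionary in Syd's example; the function name is
hypothetical, and a real parser would of course read the values from the
dictionary files rather than from a table.

```python
# _include_file value found at the front of each dictionary in the example.
INCLUDES = {
    "cif_mm.dic": "cif_mm_ext.dic",
    "cif_mm_ext.dic": "cif_core.dic",
    "cif_core.dic": "cif_core_ext.dic",
    "cif_core_ext.dic": "star_core.dic",
    "star_core.dic": "ddl_core.dic",
    # The example stops here, so ddl_core.dic is treated as the last link.
}


def dictionary_chain(start, includes):
    """Follow _include_file pointers from the starting dictionary."""
    chain = [start]
    while chain[-1] in includes:
        following = includes[chain[-1]]
        if following in chain:
            raise ValueError("circular inclusion at " + following)
        chain.append(following)
    return chain


# Start from the value of _audit_conformant_dictionary in the data file.
print(dictionary_chain("cif_mm.dic", INCLUDES))
```

Each dictionary needs to know only the next link, so removing an extension
dictionary means changing a single _include_file value in the dictionary
that pointed to it.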

D25.7 Structuring the front-of-dictionary
-----------------------------------------
The stuff reported above is intended more for information than debate,
but Syd's plans to use _include_file in the way indicated are dependent on
CIF dictionaries being structured in a particular way, as exemplified by:

##############################################################################
#                                                                            #
#                            STAR CORE DICTIONARY                            #
#                                                                            #
##############################################################################

data_on_this_dictionary
    _dictionary_name           star_core.dic
    _dictionary_version        0.1
    _dictionary_update         1994-07-12
    _dictionary_history
;
   1994-07-12  Created from CIF Core Dictionary. BMcM
;

data_include_dependent_dictionaries
    _include_file              ddl_core.dic

global_
    _list                      no
    _list_mandatory            no
    _list_level                1
    _type_conditions           none
    _type_construct            .*

data_audit_[]          # Now the real definitions begin...

We agreed at the famous Atlanta dinner to allow global_'s in dictionaries
(but CIF dictionaries would not inherit other general STAR attributes, like
nested loops!). Dictionaries should by convention begin with a data block
data_on_this_dictionary containing the _dictionary_ terms describing
itself; then a data block, data_include_dependent_dictionaries, containing
the pointer to the next dependent dictionary as outlined above; then the
global_ declarations pertinent to the current dictionary. Note that the
global_ values in the dependent dictionary are inherited, and must where
appropriate be over-ridden at this stage!
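
The inheritance rule can be pictured as a simple merge, as in the Python
sketch below. The star_core values are the global_ declarations from the
extract above; the 'inherited' values are purely hypothetical, since the
global_ declarations of the dependent dictionary are not shown here.

```python
def effective_globals(chain_of_globals):
    """Merge global_ values, from the included dictionary to the current one."""
    merged = {}
    for declared in chain_of_globals:
        merged.update(declared)  # later dictionaries override earlier ones
    return merged


# Hypothetical values assumed to be declared in the dependent dictionary.
inherited = {"_list": "both", "_type_conditions": "none"}

# global_ values from the STAR core dictionary extract above.
star_core = {
    "_list": "no",
    "_list_mandatory": "no",
    "_list_level": "1",
    "_type_conditions": "none",
    "_type_construct": ".*",
}

print(effective_globals([inherited, star_core]))
```

Any value that the current dictionary does not restate is carried over
unchanged, which is why the over-riding must be done explicitly.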

Although '_include_file' is not strictly dependent on this structure, Syd
is unwilling to implement it except within this agreed framework. It would
appear to me an adequately structured approach, and so I shall put this
up for formal acceptance as a COMCIFS decision:

The prefatory material in CIF Dictionaries will be constructed in the
fashion outlined above.

Regards
Brian