(25) _local_ (again), radiation types, MIME, HTML, DDL, dict's
- To: COMCIFS@uk.ac.iucr
- Subject: (25) _local_ (again), radiation types, MIME, HTML, DDL, dict's
- From: bm@uk.ac.iucr (Brian McMahon)
- Date: Thu, 14 Jul 94 17:12:36 BST
Dear Colleagues

Just a brief circular this time :-) Life is a little busy at present, so
it's possible I've forgotten some important pending matters - please feel
free to jog my memory. Note that there remain a large number of open
issues, and continuing comment on any of these is always welcome.

D25.1 - was (12)D4.1, also (8, 10, 11, 17) _local_
--------------------------------------------------
Just to illustrate that putting one's head in the sand doesn't always
work...

B> Have we ever come to a conclusion on the use of _local_*? I would
B> like to see COMCIFS declare that CIF will never have any definitions
B> beginning _local_ so that organizations are free to use _local_ for
B> their own private definitions without fear that at any time such usage
B> will interfere with a CIF dictionary. I would prefer to see that
B> COMCIFS declare that all local definitions use _local_, but I don't
B> think that will pass.

I think our lengthy debate on this was rather inconclusive. However, Brian
Toby's suggestion here seems to show the way out of the morass. If we agree
that COMCIFS will never include datanames beginning _local_ in official
dictionaries, then users may devise data names with this construction
without fear of colliding with official datanames. There is, of course, the
potential for collision with other users' invented _local_ data names, and
the user must realise this and accept the consequences.

Paula, I recall, saw no merit in this approach, and nor do I, especially in
the context of the register of unique user prefixes which we have agreed
to, in principle. (Another loose end to be tied up.) But we can hardly stop
people using the prefix if they are sure that it will have no impact upon
their projected uses. So I call for agreement on the statement that COMCIFS
will not permit the construction of datanames beginning with the string
_local_ in dictionaries intended for general use.

D25.2 New X-ray edge nomenclature
---------------------------------
B> IUPAC in its infinite (or was that infinitesimal) wisdom has created
B> a new naming system for x-ray wavelengths [Jenkins et al., Pure & Appl.
B> Chem. 63, 735-746 (1991).] While they have eliminated greek letters,
B> they have kept (in this day of ASCII!) subscripts. I suggest we add
B> two new entries to the core to deal with this. I am suggesting that we
B> decouple the target from the symbol based on David's anti-parsing
B> principle. (Note that there are many more possible IUPAC symbols, but I
B> am too lazy to type in enumerations that no one will ever use.)
B>
B> data_diffrn_radiation_xray_symbol
B>     _name            '_diffrn_radiation_xray_symbol'
B>     _category        diffrn_radiation
B>     _type            char
B>     _list            both
B>     _list_reference  '_diffrn_radiation_wavelength_id'
B>     loop_ _enumeration _enumeration_detail
B>         K-L~3~    'K\a~1~ in older Siegbahn notation'
B>         K-L~2~    'K\a~2~ in older Siegbahn notation'
B>         K-M~3~    'K\b~1~ in older Siegbahn notation'
B>         K-L~2,3~  'use where K-L~3~ and K-L~2~ are not resolved'
B>     _definition
B> ;   The IUPAC symbol for the x-ray wavelength for probe
B>     radiation.
B> ;
B>
B> data_diffrn_radiation_xray_target
B>     _name            '_diffrn_radiation_xray_target'
B>     _category        diffrn_radiation
B>     _type            char
B>     _list            both
B>     _list_reference  '_diffrn_radiation_wavelength_id'
B>     loop_ _enumeration
B>         H He Li Be B C N O F Ne Na Mg Al Si P S Cl
B>         Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge
B>         As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag
B>         Cd In Sn Sb Te I Xe Cs Ba La Ce Pr Nd Pm Sm
B>         Eu Gd Tb Dy Ho Er Tm Yb Lu Hf Ta W Re Os Ir
B>         Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Ac Th Pa U
B>         Np Pu Am Cm Bk Cf Es Fm Md No Lr
B>     _definition
B> ;   The chemical element symbol for the x-ray target
B>     (usually the anode) used for x-ray generation.
B> ;

These definitions might be used in place of the existing
_diffrn_radiation_type. If there are no objections, I shall add these two
definitions to the Core.

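A small sketch, in Python, of what the decoupling in D25.2 buys a program
reading the proposed items: the symbol is checked by simple lookup against
the enumeration rather than by parsing a combined string such as CuK\a. The
table is copied from the proposed _enumeration list; the function names are
illustrative assumptions and do not come from any existing CIF software.

```python
# Enumeration values copied from the proposed _diffrn_radiation_xray_symbol
# definition; the Siegbahn equivalents come from its _enumeration_detail.
XRAY_SYMBOLS = {
    "K-L~3~": r"K\a~1~",
    "K-L~2~": r"K\a~2~",
    "K-M~3~": r"K\b~1~",
    "K-L~2,3~": None,  # unresolved doublet: no single Siegbahn line is listed
}


def check_xray_symbol(value):
    """Return True if value is one of the enumerated IUPAC symbols."""
    return value in XRAY_SYMBOLS


def siegbahn_equivalent(value):
    """Return the older Siegbahn notation for an enumerated symbol, or None."""
    return XRAY_SYMBOLS[value]


if __name__ == "__main__":
    for candidate in ("K-L~3~", "K-L~2,3~", r"CuK\a"):
        print(candidate, check_xray_symbol(candidate))
```

The target element would be validated in the same way against the
_diffrn_radiation_xray_target enumeration.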
D25.3 _pd_instr_radiation_probe
-------------------------------
Brian has another question relating to _diffrn_radiation_type, which, you
will recall, is defined as:

    _name            '_diffrn_radiation_type'
    _category        diffrn_radiation
    _type            char
    _list            both
    _list_reference  '_diffrn_radiation_wavelength_id'
    loop_ _example   CuK\a
                     neutron
                     electron
    _definition
;   The nature of the radiation.
;

B> [David] asked if there is a need for both _pd_instr_radiation_probe and
B> _diffrn_radiation_type. I have come to the conclusion that I am not happy
B> with a non-enumerated way to determine the type of radiation used for a
B> dataset as this is one of the most important values in the CIF. Note
B> that there are examples of neutron and electron, but x-rays
B> must be assumed if any other value is specified. I would be happy to see
B> _pd_instr_radiation_probe moved to the core. If this is done it might
B> make sense to modify the examples for _diffrn_radiation_type to remove
B> neutron and electron.

The definition of _pd_instr_radiation_probe is currently:

    _name               '_pd_instr_radiation_probe'
    _category           pd_instr
    _type               char
    loop_ _enumeration  x-ray
                        neutron
                        electron
    _definition
;   Code for the type of radiation used. It is strongly
    encouraged that this field be specified for all powder
    diffraction data so that the probe radiation can be
    simply determined.
;

I am a little unhappy about having two entries in the core that are so
similar, differing in effect only by the presence or absence of a list of
enumerated values. But I don't have strong feelings. Brian is keen to have
this settled one way or the other, so your opinions are welcomed.

D25.4 MIME types and the spread of CIF
---------------------------------------
PMR> CIF Colleagues:
PMR> Here are some points that I'd be grateful if you would consider.
PMR> Several of them arise out of my project to mark-up the dictionaries in
PMR> html (see below) but they are more general.
PMR>
PMR> 1.
PMR>    CIF is now becoming established in certain areas *outside*
PMR> crystallography (partly as a result of 'advertising' at Molecular
PMR> Graphics and other meetings). I am aware of its adoption by:
PMR> - Roche (for molecular modelling). I have their dictionary which I
PMR> will post here shortly as it is publicly available
PMR> - at least two labs working in the area of biomolecular sequence
PMR> and I am hopeful that it will become a standard tool for the
PMR> relaunched CCP11 project at Daresbury.
PMR>
PMR> 2. The proposal that Henry Rzepa and I put to the MIME committee looks
PMR> like gaining support. (see D20.2)
PMR> The final form of the syntax is still undecided but we
PMR> expect that it will be chemical/. There are about 20 types proposed in
PMR> this standard, including cif, mif (and pdb, fdat, cdc). This means that
PMR> anyone mailing or viewing a *.cif can expect it to be transported and
PMR> viewed without corruption.
PMR>
PMR> This has implications for COMCIFs as, unless there is objection,
PMR> *.cif will become a *standard suffix*. I would strongly urge the
PMR> dictionary distributors to adopt this standard as the dictionaries are,
PMR> of course, CIFs. (I appreciate that the DOS file limit makes this
PMR> difficult, but I think it will cause problems unless a nomenclature is
PMR> [adopted] which is consistent with MIME. There is a very strong move
PMR> internationally to the use of MIME types for public documents. With this
PMR> convention it becomes possible for people browsing a CIF site to view any
PMR> CIFs with a CIF viewer (see below).
PMR>
PMR> I believe that the MIME recognition will give a great deal of
PMR> welcome publicity to CIF.

A brief reminder here that the idea behind this is to define standard file
types in terms of their filename extensions, so that you can configure a
gopher or WWW client to do something specific to CIFs (recognised as any
file with a name ending in '.cif').
This is a class of application tuned to current requirements; I see its
effect as being to encourage authors and archive sites to name their files
as <something>.cif, but it is *not* an extension to the standard, requiring
that CIFs be named with file extension '.cif'.

Note, by the way, that there is also a file format known as CIF (for
Caltech Intermediate Form) used in VLSI CAD applications; and that MIF
(Maker Interchange Format) is used in the FrameMaker document package.
Henry's proposals, if accepted, would pre-empt Caltech and Frame Corp. from
using their (rather well established) abbreviations as globally accepted
filename extensions. This might not please them. I don't know if this
worries people - I would hope we don't need to hunt for registration of
trademarks to go along with our patent and copyright baggage!

I would suggest to Peter that STAR dictionaries NOT be given a 'cif'
extension. Not only are they not CIFs, but the typical end-user application
would not wish to 'view' them in the same way (I was talking to Henry last
week, and it's clear that his vision of the way this will work is to load a
CIF directly into a graphics package). 'dic' would be a sensible
convention: whether it needs to be a MIME recognised type is arguable
(though it might be well to register it against future use).

Any other comments?

D25.5 Distributed dictionaries in hyperspace
--------------------------------------------
PMR> 3. I have now automatically marked up the ddl, core and mm
PMR> dictionaries from Chester, into html.
PMR> This is imprecise for several reasons:
PMR> - Comments are used to structure some of the files, but comments have
PMR> no semantic import.

Not a critical problem - mostly the comments are used to ease the eye when
scanning a dictionary listing. According to previously agreed principles,
we shall work hard to eliminate all information of substance from comments.

PMR> - There is frequent use of underscored terms, especially in definitions,
PMR> which are not defined. Thus '_audit_author_' is used
PMR> implicitly to refer to _audit_author_* (i.e. all items
PMR> starting with '_audit_author_' or perhaps to the category
PMR> _audit_author (not necessarily the same) or perhaps to
PMR> _audit_author_[] (again not necessarily the same).
PMR> - the new dictionary_definition is welcome, but there is only an
PMR> *implied* semantic connection with the ?category that it
PMR> relates to. As you will see, all these items are naturally
PMR> classified in dictionary_definition category, rather than linked
PMR> to the one they describe. Perhaps a '_category_described'
PMR> field would be useful. Syd?

I'm not sure that this is very important - the dictionary_definition
entries are aimed at the human reader (though there have been some
proposals to build in more of the properties of the items in the associated
categories in a machine-readable way). I would incline not to add another
DDL term for this at present, at least until we have seen how effective
machine parsing of everything else turns out to be!

PMR> - I have not fully solved the problem of linked names (e.g.
PMR> _cell_length_*) but this shouldn't be a problem.
PMR> Ideally it would be valuable to be able to mark up all
PMR> underscored terms in the dictionary, although I can see that _nm, etc
PMR> require a manually created glossary.
PMR>
PMR> 4. The markup exercise has very exciting possibilities for
PMR> distributed dictionaries. the html world (especially in latex2html) has
PMR> anticipated this problem and starts to solve it by each site having a
PMR> downloadable index file with crossreferences in it. Thus we might wish
PMR> IUCr to maintain an index.cif which contained a list of all the
PMR> IUCr-supported CIF names, whilst other sites (e.g. CCP11 might maintain
PMR> subsidiary ones).
PMR> The user can then devise her dictionary based on
PMR> publicly available ones without having to download (obviously IUCr terms
PMR> have highest priority).
PMR> I shall explore whether html can easily support a hierarchy of
PMR> dictionaries, because this is an obvious present need.
PMR>
PMR> -------------------------------------------------------------------------
PMR>
PMR> I'd value comments on the dictionaries marked up in html:
PMR> http://www.dl.ac.uk/CBMT/cif/HOME.html
PMR>
PMR> It's very experimental and is based primarily on categories, so that the
PMR> entries are put into subdirectories based on these. Where dictionaries
PMR> have items in the same category these are bundled. I would appreciate a
PMR> standard term to identify dictionaries with (e.g. [mm], mm93 or whatever)
PMR> because until then this is likely to be inconsistent. Please let me have
PMR> comments.
PMR>
PMR> There is also a crude search (grep -i) on the names of entries, so that
PMR> you can search for everything with the field 'cell' in it. Your browser
PMR> has to support forms for this (Mosaic or lynx are what I use).
PMR>
PMR> PS. I also intend to mount tkCIF (my tk based CIF hypertext browser and
PMR> proto-editor very shortly and I'd welcome people to test it. The editing
PMR> is semi-intelligent - e.g. if you put a space in a word it appends quotes
PMR> at the ends if required.

I can say that I have very briefly glanced at the html dictionaries, and
these look very exciting. However, the structuring does not at first glance
look correct - I'll study these again. Other folk are encouraged to look at
these, especially as Peter is keen to announce their existence to a public
forum such as the bionet newsgroup.

D25.6 DDL revisited
------------------

(a) Validating the form of a data item
--------------------------------------
Although the DDL has been 'almost' frozen for some time, Syd is still
proving amenable to constructive suggestions on its refinement, and one of
the potentially useful outcomes of our meetings in Atlanta was the
provision of a new dictionary dataname, '_type_construct', which allows
intelligent parsers to validate the form of a data expression against a
supplied pattern. The pattern uses the 'regular expression' (regexp) syntax
that is familiar to programmers in Unix and similar environments, and may
also contain pointers to datanames defined elsewhere in the STAR
dictionaries; the process is therefore recursive.

Some of us have been working on implementing this, and progress seems to
have been made. Here, for instance, are a couple of examples from the STAR
core dictionary (mentioned in D21.7), to illustrate how this may work. The
final form of the pattern-matching language to be used is still to be
decided, but it will be similar to this example.

data_audit_creation_date
    _name            '_audit_creation_date'
    _category        audit
    _type            char
    _type_construct
; {_chronology_year}\                      # year must occur
  { {-{_chronology_month}}?\               # month only if year
    { {-{_chronology_day}}?\               # day only if month
      { {T{_chronology_hour}}?\            # hour only if day
        { {:{_chronology_minute}}?\        # minute only if hour
          {:{_chronology_second}}?\        # second only if minute
        }?\
        {[+-]{_chronology_timezone}}?\     # timezone if any time
      }?\
    }?\
  }?
;
    loop_ _example   1990-07-12
                     1994-07-12T11:10:12
                     1994-02-22T11+08
    _definition
;   A date that the CIF was created.
;

data_chronology_[]
    _name            '_chronology_[]'
    _category        dictionary_definition
    _type            null
    _definition
;   Data names in the _chronology_ category may be used as
    components of date values for other datanames.
    THEY MAY NOT STAND ALONE AS DATA NAMES IN ANY DATA FILE.
;

data_chronology_year
    _name               '_chronology_year'
    _category           chronology
    _type               numb
    _type_construct     [0-9][0-9][0-9][0-9]
    _example            1994
    _enumeration_range  1:
    _definition
;   The year component of a date, in years anno domini of the
    Christian calendar.
;

Incidentally, on a point of style, Syd dislikes the indented layout which
seeks to display levels of nesting. Below is a more compact alternative.
Either will be equally readable by a program - the nested layout *may* be
more easily readable by a human, but does that matter? This is not too
important an issue, but any opinions are welcome.

    _type_construct
;   (_chronology_year)\                     # year must occur
    ((-(_chronology_month))?\               # month only if year
    ((-(_chronology_day))?\                 # day only if month
    ((T(_chronology_hour))?\                # hour only if day
    ((:(_chronology_minute))?\              # minute only if hour
    (:(_chronology_second))?)?\             # second only if minute
    ([+-](_chronology_timezone))?)?)?)?     # timezone if any time
;

I have one suggestion that I would like to hear people's opinions on: basic
type constituents (such as _chronology_year) will be common components to
many data value constructions, but should not appear alone in any data
file. One way to enforce this would be to assign them to category 'null'
(rather than category 'chronology' etc). So items in category 'null' would
be special, in a similar way to items of type 'null' (i.e. the
dictionary_definition stuff). These new items should be allowed to be of
*type* numb (or char) to allow them to have enumeration ranges and other
attributes. Any thoughts on this?

(b) Fundamental data types
--------------------------
The introduction of _type_construct allows intelligent parsers to check for
most, if not all, of the extended data types requested by database
implementors. We had some prolonged discussions over dinner in Atlanta on
this topic.
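
As an illustration of the kind of check that _type_construct makes
possible, here is a minimal Python sketch of the recursive expansion of
dataname pointers into a single regular expression. It follows the intent
stated in the comments of the date pattern (month only if year, day only if
month, and so on) rather than the literal DDL text, whose final syntax is
still undecided. Only the _chronology_year pattern is taken from the
dictionary extract; the two-digit patterns for the other components, the
{_name} pointer notation with ordinary parentheses for grouping, and the
function names are assumptions made for the sketch.

```python
import re

# Patterns keyed by dataname. Only _chronology_year is taken from the
# dictionary extract; the two-digit patterns below are assumptions.
CONSTRUCTS = {
    "_chronology_year": "[0-9][0-9][0-9][0-9]",
    "_chronology_month": "[0-9][0-9]",
    "_chronology_day": "[0-9][0-9]",
    "_chronology_hour": "[0-9][0-9]",
    "_chronology_minute": "[0-9][0-9]",
    "_chronology_second": "[0-9][0-9]",
    "_chronology_timezone": "[0-9][0-9]",
    # Simplified notation: {_name} is a pointer, (...) is plain grouping.
    "_audit_creation_date": (
        "{_chronology_year}"
        "(-{_chronology_month}"
        "(-{_chronology_day}"
        "(T{_chronology_hour}"
        "(:{_chronology_minute}(:{_chronology_second})?)?"
        "([+-]{_chronology_timezone})?"
        ")?)?)?"
    ),
}

POINTER = re.compile(r"\{(_\w+)\}")


def expand(name):
    """Recursively replace {_dataname} pointers by the patterns they name."""
    return POINTER.sub(
        lambda m: "(?:" + expand(m.group(1)) + ")", CONSTRUCTS[name]
    )


def conforms(name, value):
    """Return True if the whole of value matches the expanded pattern."""
    return re.fullmatch(expand(name), value) is not None


if __name__ == "__main__":
    for value in ("1990-07-12", "1994-07-12T11:10:12", "1994-02-22T11+08"):
        print(value, conforms("_audit_creation_date", value))
```

The three values tried are the _example entries given for
_audit_creation_date. A pattern of this kind checks form only: it would
accept a month of 13, which is where an enumeration range on the component
item would come in.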
The outcome is that Syd remains steadfast on retaining the minimum number
possible of basic data types: numb (data that can be interpreted
numerically), char (textual data), null (for data items, such as the
explanatory material in dictionaries, that may not appear in data files).

The DDL term _type_conditions allows for certain numb or char fields to
have additional properties - allowed values for this are none (no
additional interpretation of the value based on its structure), esd (a
number in parentheses trailing another number is to be interpreted as an
e.s.d.), seq (the characters : and , delimit values in a list or sequence),
incl (for STAR-compliant files that must be included in the calling file at
the point of invocation) and xdat (for external data files).

We have had some discussions over the relationship between this and
_type_construct, say for the case of e.s.d.'s. However, I concede that
there is a difference in the nature of these terms - _type_conditions esd
defines the *interpretation* of the esd value; a putative _type_construct
"[+-]?[0-9]*\.[0-9]+(\([0-9]?\))?" would (honest) define the allowed form
of a number with e.s.d., but would give no indication of the meaning of the
term in parentheses.

So I see the existence of these three _type_ datanames as permitting data
typing to be done according to a hierarchy of complexity. Simple programs
(such as ciftex) benefit from being able to differentiate between numb,
char and null. More complex programs that need to interpret and operate
upon the contents of certain fields will use the _type_conditions (and
probably programmers will need to hard-code the handling of fields
possessing these type conditions). The most powerful programs can do
pattern-matching to verify the form of a very wide range of data
constructions. And programmers may ignore the higher levels of complexity
than are appropriate to their application. (This last remark is slightly
disingenuous.
A simple CIF writer ought to output dates in the date-compliant format
described by the relevant _type_construct. In practice, though, the author
of such a program is more likely to write his output date format after
reading the redundant but simple textual information in the _definition or
_example fields than to construct it dynamically from the _type_construct
DDL.)

(c) _include_file
-----------------
First, I must correct a mis-statement of fact I made in circular no. 23,
where I said "_include_file may (uniquely) appear in dictionaries and data
files". It may not - it is an integral part of DDL, and so should appear in
dictionaries only. However, data files may include other STAR files through
datanames of _type_conditions 'incl'.

The function of _include_file is therefore to build up the hierarchy of
dictionaries against which a data file may be validated. The data file
itself should contain a pointer to the highest-level dictionary of
relevance; this pointer will be of _type_conditions 'incl', and I suggest

data_audit_conformant_dictionary
    _name             '_audit_conformant_dictionary'
    _category         audit
    _type             char
    _type_conditions  incl
    _definition
;   The highest-level dictionary defining datanames used in
    this STAR file.
;

Then a macromolecular CIF may have an entry

    _audit_conformant_dictionary    cif_mm.dic

pointing to the dictionary of macromolecular terms. Then (in Syd's words,
as already outlined in circular no. 21):

S> My construction for dictionary file names will be:
S>     <dictionary type>_<dictionary class>.dic
S> So for the DDL core dictionary the generic name is "ddl_core.dic".
S>
S> ddl_ext.dic      [ddl extension file(s)]
S> ddl_core.dic     [ddl core definitions]
S> star_core.dic    [primitive definitions common to all star dictionaries]
S> -------- additional dictionaries are application specific (e.g.
S> CIFmm)
S> cif_core_ext.dic [cif extensions to core definitions]
S> cif_core.dic     [cif core definitions]
S> cif_mm_ext.dic   [cif extensions to macromolecular definitions]
S> cif_mm.dic       [cif macromolecular definitions]
S>
S> So a dictionary would always contain at the front an _include_file
S> which inserts the dictionary file higher in the tree.
S>
S> For example...
S> ### MACROMOLECULAR CIF DICTIONARY
S> data_on_this_dictionary
S>     _dictionary_name cif_mm.dic
S> data_include_dependent_dictionaries
S>     _include_file cif_mm_ext.dic
S> ... etc. etc.
S>
S> and the included file looks like this....
S> ### MACROMOLECULAR CIF EXTENSION DICTIONARY
S> data_on_this_dictionary
S>     _dictionary_name cif_mm_ext.dic
S> data_include_dependent_dictionaries
S>     _include_file cif_core.dic
S> ... etc. etc.
S>
S> and the included file looks like this....
S> ### CORE CIF DICTIONARY
S> data_on_this_dictionary
S>     _dictionary_name cif_core.dic
S> data_include_dependent_dictionaries
S>     _include_file cif_core_ext.dic
S> ... etc. etc.
S>
S> and the included file looks like this....
S> ### CORE CIF EXTENSION DICTIONARY
S> data_on_this_dictionary
S>     _dictionary_name cif_core_ext.dic
S> data_include_dependent_dictionaries
S>     _include_file star_core.dic
S> ... etc. etc.
S>
S> and the included file looks like this....
S> ### PRIMITIVE STAR DICTIONARY
S> data_on_this_dictionary
S>     _dictionary_name star_core.dic
S> data_include_dependent_dictionaries
S>     _include_file ddl_core.dic
S> ... etc. etc.
S>
S> and the included file looks like this....
S>
S> OK, you get the idea. This way you really do not have to be even aware
S> of the hierarchy for your application -- just the dictionary above you
S> in the tree. But of course the parsing software must be able to handle
S> nested inclusions.

Note in this example that the mmCIF dictionary includes the file
cif_mm_ext.dic, which is supposed to be some extension to the mmCIF
dictionary definitions.
If this did not exist, the value of _include_file would be 'cif_core.dic',
and likewise for other extension dictionaries that might be created. It is
the responsibility of the included dictionary to supply a pointer to the
next (lower-level) link in the chain. At first, it seems counter-intuitive
to assign the 'extension dictionaries' as 'lower-level' in this scheme, but
it does work.

D25.7 Structuring the front-of-dictionary
----------------------------------------
The stuff reported above is intended more for information than debate, but
Syd's plans to use _include_file in the way indicated are dependent on CIF
dictionaries being structured in a particular way, as exemplified by:

##############################################################################
#                                                                            #
#                           STAR CORE DICTIONARY                             #
#                                                                            #
##############################################################################

data_on_this_dictionary
    _dictionary_name     star_core.dic
    _dictionary_version  0.1
    _dictionary_update   1994-07-12
    _dictionary_history
;   1994-07-12 Created from CIF Core Dictionary. BMcM
;

data_include_dependent_dictionaries
    _include_file        ddl_core.dic

global_
    _list                no
    _list_mandatory      no
    _list_level          1
    _type_conditions     none
    _type_construct      .*

data_audit_[]
# Now the real definitions begin...

We agreed at the famous Atlanta dinner to allow global_'s in dictionaries
(but CIF dictionaries would not inherit other general STAR attributes, like
nested loops!). Dictionaries should by convention begin with a data block
data_on_this_dictionary containing the _dictionary_ terms describing
itself; then a data block, data_include_dependent_dictionaries, containing
the pointer to the next dependent dictionary as outlined above; then the
global_ declarations pertinent to the current dictionary. Note that the
global_ values in the dependent dictionary are inherited, and must where
appropriate be over-ridden at this stage!
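
The chain of _include_file pointers and the inheritance of global_ values
can be sketched in Python as follows. This is a minimal illustration under
stated assumptions, not a description of Syd's software: the dictionary
files are replaced by an in-memory table built from the example chain in
D25.6(c), the chain is cut off at ddl_core.dic because the circular does
not show what that file includes, the global_ values for star_core.dic are
those of the front-of-dictionary example, and the cif_core.dic override is
purely hypothetical.

```python
# Stand-in for reading the _include_file value out of each dictionary file.
INCLUDES = {
    "cif_mm.dic": "cif_mm_ext.dic",
    "cif_mm_ext.dic": "cif_core.dic",
    "cif_core.dic": "cif_core_ext.dic",
    "cif_core_ext.dic": "star_core.dic",
    "star_core.dic": "ddl_core.dic",
    "ddl_core.dic": None,  # what follows is not shown in the circular
}

# global_ declarations per dictionary.
GLOBALS = {
    "star_core.dic": {
        "_list": "no",
        "_list_mandatory": "no",
        "_list_level": "1",
        "_type_conditions": "none",
        "_type_construct": ".*",
    },
    "cif_core.dic": {"_list": "both"},  # hypothetical override, for illustration
}


def include_chain(top):
    """Follow _include_file pointers from top; return the names in order."""
    chain, seen = [], set()
    name = top
    while name is not None:
        if name in seen:
            raise ValueError(f"circular _include_file reference at {name}")
        seen.add(name)
        chain.append(name)
        name = INCLUDES[name]
    return chain


def effective_globals(top):
    """Merge global_ values so that including dictionaries override included ones."""
    merged = {}
    for name in reversed(include_chain(top)):
        merged.update(GLOBALS.get(name, {}))
    return merged


if __name__ == "__main__":
    print(" -> ".join(include_chain("cif_mm.dic")))
    print(effective_globals("cif_mm.dic"))
```

Each dictionary needs to know only the next link, as Syd says; the walking
code is what has to cope with the nesting, and a check for circular
references is a cheap safeguard.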
Although '_include_file' is not strictly dependent on this structure, Syd
is unwilling to implement it except within this agreed framework. It would
appear to me an adequately structured approach, and so I shall put this up
for formal acceptance as a COMCIFS decision:

    The prefatory material in CIF Dictionaries will be constructed in the
    fashion outlined above.

Regards
Brian