Dale Tronrud asks some good questions about the presentation of
information in the mmCIF dictionary. The conventions employed are for the
most part common to the core and other CIF dictionaries, so I feel I might
chip in my two cents worth from that broader context.

> While I realize that descriptions are free format text and
> the DDL does not restrict their content I am curious as to any
> conventions that have been or might be adopted for the
> descriptions in the official mmCIF dictionary.
>
> For example, if someone submitted a tag definition in Spanish
> would it be accepted or must mmCIF descriptions be in English?
> If English is required is there a preference in spelling
> convention (US vrs UK)?

There's no rule that restricts the text part of the definitions to be
English, though one would prefer for consistency that the master
dictionary were in English - for better or worse, the current lingua
franca of science - throughout. The US/UK English divide is rarely a
matter of heated argument, though there are some differences in style
between the core and mmCIF dictionaries (and these are of course apparent
in the core "image" that is embedded in mmCIF). Definition writers are
encouraged to adopt as mid-Atlantic a style as possible.

Notice, however, that components of data names are constrained to some
extent, and new data names are best selected in accordance with existing
conventions: components relating to colour should appear as "_colour_"
and not "_color_", for example. COMCIFS has put together a list of
abbreviations found in data name components (in a file found from the
IUCr CIF home page http://www.iucr.org/iucr-top/cif/) - would it be
useful to have a similar file of approved word components?

> A similar issue is the (un)desirability of HTML formating tokens.
> I received a tag definition which contained "&nbsp;"'s. Clearly
> the person had simply cut the definition out of a web page.
> My question is; does the mmCIF to HTML converters pass through these
> tokens or "escape" them out and make them visible to the reader?
> Should they be avoided or used? While a non-breaking space is
> rather boring, there are other characters, such as a proper Angstrom
> symbol, that could be incorporated if HTML were allowed in mmCIF
> descriptions.
>
> My last question has to do with embedded mathematics. I find
> it rather difficult to read typewriter math. While I can figure it
> out, usually, I find a nicely typeset equation much easier. If
> the mmCIF to HTML converter was to incorporate some of the code
> from LaTeX2HTML one could enter equations into the description in
> LaTeX and, when viewed in a browser, see a GIF image of a nicely
> formatted equation. The down side to this is that LaTeX does require
> some study and practise to write while typewriter math can be banged
> out pretty easily, and the raw mmCIF dictionary would be less
> accessible because the unprocessed math would be harder to read. Is
> there a place for LaTeX in mmCIF?

These two questions address essentially the same point. The decision was
made at the beginning of CIF to have minimal coding for non-ASCII
characters (the actual codings permitted are listed in the IUCr Guide to
CIF for Authors, also available through the CIF Home Page). This permits
the angstrom symbol to be coded as \%A, for example, but is not very
attractive to the eye; and can't handle complex maths at all.

We've been thinking for some time of how to address this. The cleanest
way at present is probably to have multiple renderings of each definition
(in text with typewriter maths, in HTML and in TeX, say), and each
rendering should be selected for its designed display purpose - HTML for
a web page, TeX for the typeset version, ASCII for a "glass teletype".
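
To make the idea of multiple renderings a little more concrete, here is a
minimal sketch in Python. It is not anything COMCIFS has adopted; the
names CONCORDANCE and render are hypothetical, and only the \%A entry
comes from the discussion above. The Greek-letter entries are assumptions
that would need checking against the Guide to CIF for Authors.

```python
import re

# Illustrative concordance: one CIF escape code, three renderings.
# Only the \%A entry is taken from the message above; the \a and \b
# entries are assumptions to be verified against the IUCr guide.
CONCORDANCE = {
    r"\%A": {"ascii": "Angstrom", "html": "&Aring;", "tex": r"\AA{}"},
    r"\a": {"ascii": "alpha", "html": "&alpha;", "tex": r"$\alpha$"},
    r"\b": {"ascii": "beta", "html": "&beta;", "tex": r"$\beta$"},
}

# One alternation of all codes, longest first, so that the text is
# scanned once and substituted output is never re-scanned.
_CODES = re.compile(
    "|".join(re.escape(c) for c in sorted(CONCORDANCE, key=len, reverse=True))
)


def render(text, target):
    """Return `text` with each known CIF code replaced by its rendering
    for `target`, which must be "ascii", "html" or "tex"."""
    return _CODES.sub(lambda m: CONCORDANCE[m.group(0)][target], text)
```

Calling render(r"1.54 \%A", "html") would give the HTML form of the same
definition text, and likewise for "tex" or "ascii". The single table is
the point of the sketch: each permitted code has exactly one agreed
rendering per display format.
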
It's not quite as straightforward as it may seem - HTML and LaTeX are,
strictly, structured document markup schemes, and to be used entirely
properly, the entire dictionary would need to be in HTML or LaTeX. Of
course the intention is just to use the relevant subset of the markup
that makes it easy to render an Angstrom or an integral sign; but there
would need to be a proper and properly maintained concordance of the
codings permitted in each scheme. Because SGML (of which HTML is an
application) and TeX (on which LaTeX is built as a macro package) both
allow markup to be redefined, the meanings associated with any coding
string (&nbsp;, \int, whatever) could in principle change with a
different declaration - it's even possible to run TeX with an
instruction set that recognises codes of the HTML-like form <BODY> as
meaningful codes.

Well, you've probably dropped off to sleep by now. Suffice it to say that
if this discussion raises significant interest, I'd be very happy to help
explore the questions of maintaining multiple formats further.

_______________________________________________________________________________
Brian McMahon                                   tel: +44 1244 342878
Research and Development Officer                fax: +44 1244 314888
International Union of Crystallography       e-mail: bm@iucr.ac.uk
5 Abbey Square, Chester CH1 2HU, England             bm@iucr.org
(Coordinating Secretary, COMCIFS)