# Re: Accent escape sequences

Brian McMahon wrote:
> Dear Joe
>
> We have recently exchanged a few messages off-list, and it is
> clear that you have an interest in, and perhaps some time for,
> working on CIF-based applications. It would be great if you would
> introduce yourself to the list with a brief indication of your
> current interests.
Recently, I have been working on some tools for data management in
macromolecular programming, with an interest in combining force-field
development with crystallography. The software idea is to create a
framework for modular programming. Most applications are tied together
into one big package, which makes individual experimentation difficult
without digging through a lot of source code. It also typically means
that individual contributors give up ownership, as when a lot of
community effort on programs like CNS got sucked into Accelrys, where
the scientific development pretty much dies. My
plan involves an "in-memory database", where modular units access
molecular data using memory pointer look-ups by name. Then, a module
programmer can (for example) add atom properties without modifying
compiled data structures in the core code. It should also provide a
natural way to tie in to scripting tools.
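
A minimal sketch of the name-based look-up idea, written in Python rather
than Fortran purely for brevity. The class and method names (MoleculeStore,
add_property, get) are hypothetical and only illustrate how a module could
attach a new per-atom property without touching a compiled core structure.

```python
class MoleculeStore:
    """Toy in-memory store: per-atom property arrays looked up by name."""

    def __init__(self, n_atoms):
        self.n_atoms = n_atoms
        self._props = {}  # property name -> list with one value per atom

    def add_property(self, name, default=None):
        # Any module may register a new property at run time.
        if name in self._props:
            raise KeyError(f"property {name!r} already exists")
        self._props[name] = [default] * self.n_atoms

    def get(self, name):
        # Returns the live list, so all modules share one copy of the data.
        return self._props[name]


store = MoleculeStore(n_atoms=3)
store.add_property("element", default="C")
# A separately written module adds its own property without changing the core.
store.add_property("partial_charge", default=0.0)
store.get("partial_charge")[1] = -0.4
```

Because properties are found by name at run time, a scripting layer could
use the same look-up to read or set them.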

As for CIF format, it is a fairly good fit to the molecular database
concept. I realized that there seem to be no decent Fortran tools. The
available Fortran code seems to be mostly inflexible F77 spaghetti code.
Also, most of the C/C++ code is generally oriented towards
multi-structure databases. I also want to keep things very simple, where
no CIF dictionary is needed, with float/int types automatically
recognized and stored as such. So, I decided to implement my own
Fortran95 CIF library. In the process, I realized that some parts of CIF
and mmCIF are a bit ill-defined. Now that many people have used CIF, it
seems like a good time to work out some of the unfinished details.
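
To illustrate what dictionary-free type recognition could look like, here
is a small Python sketch (not the Fortran95 library itself). The function
name parse_cif_value and the two patterns are assumptions made for the
example; values carrying a standard uncertainty such as 1.234(5) are
deliberately left as strings here.

```python
import re

# Plain integers and plain decimal/exponent floats only (an assumption of
# this sketch); anything else is kept as text.
_INT = re.compile(r"[+-]?\d+\Z")
_FLOAT = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?\Z")


def parse_cif_value(token):
    """Return an int, a float, or the original string for a bare token."""
    if _INT.match(token):
        return int(token)
    if _FLOAT.match(token):
        return float(token)
    return token
```

With this rule the CIF placeholders "." and "?" stay strings, since neither
matches a numeric pattern.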

>
> Regarding the untidy typographic markup conventions in CIF text
> fields, what we currently have arises from the pragmatic
> requirements of our early 1991 (prehistoric!) CIF-handling
> procedures in Acta Cryst. We used TeX as a formatter, so
> the markup (initially) was somewhat TeX-like; but there was
> pressure on us not to rely on TeX, especially as many of our
> authors would have no experience of it. Thus a minimal set
> of markup was devised, requiring very little learning from
> authors, that covered most markup that in practice we came
> across in Acta C papers (which have rather little
> mathematical content). Very few additional codes were
> introduced; and, for example, the relatively recent <i> and
> <b> markup for italic and bold was chosen because
> non-specialist authors were beginning to become familiar
> with such codes in HTML markup.
>
> The current arrangement is, in my opinion, very inelegant,
> but it is supported by publCIF, the IUCr's own CIF editor,
> and is workable within that tool's reasonably user-friendly
> interface.
>
> To provide better formatting abilities, I think it would be
> preferable to allow text fields to contain markup in various
> different standard formats, suitably identified, and to
> pass the fields to appropriate handlers. The simplest way to
> do so would be to have a 'magic number' introducing each text
> field. There's an undocumented example of this inasmuch as
> ciftex, the old cif->TeX translator, passes through unchanged
> any text field beginning
> ;%T   (i.e. it treats it as containing pure TeX markup).
> The 'magic number' might be a simple character sequence
> (%T for TeX, %L for LaTeX, %H html, %R RTF, %U Unicode...)
> or could be a more general, but more verbose, signature
> ;
> Content-Type: application/tex
> (this mimics the approach for embedding binary data in imgCIF files).
Something along those lines sounds good. One problem with the current
multi-line text is that the text fields often are indented, with one
less character in the first line to offset the semicolon. I think the
multi-line format would be much simpler if the begin and end semicolons
were both required to be the only character on a line, i.e. the
text-block delimiter is "<eol>;<eol>" instead of just "<eol>;". With
that rule, a line within the multi-line text that merely starts with a
semicolon is no longer a problem. A content-type tag could be placed on the line with the
starting semicolon. A multi-line pattern would then be:
<eol>;<content-type><eol><multi-line text><eol>;<eol>
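
A rough Python sketch of a reader for that pattern, assuming the input has
already been split into lines and that whatever follows the opening
semicolon is taken as the content-type tag. The function name
read_text_field and the tag value in the example are illustrative only.

```python
def read_text_field(lines, start):
    """Read one proposed-style text field beginning at lines[start].

    Returns (content_type, text, index_of_next_line).
    """
    opener = lines[start]
    if not opener.startswith(";"):
        raise ValueError("text field must open with ';' in column one")
    content_type = opener[1:].strip() or None  # None when no tag is given
    body = []
    i = start + 1
    while i < len(lines):
        if lines[i] == ";":  # closing delimiter: a semicolon alone on a line
            return content_type, "\n".join(body), i + 1
        body.append(lines[i])
        i += 1
    raise ValueError("unterminated text field")


example = [
    ";application/tex",
    "  an indented first line",
    ";a line that starts with a semicolon but is still text",
    ";",
]
ctype, text, next_index = read_text_field(example, 0)
```

Only a line consisting of a single semicolon ends the field, so the
indentation of the body lines is preserved exactly as written.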

>
> There's nothing fundamentally wrong with extending the existing
> special character sequences, and I'm happy to consider a
> specific proposal in terms of whether we could easily provide
> publCIF support for it. The problem is that the more one offers
> to the author, the more the author will want to do, and the more
> unwieldy an ad-hoc markup will become. (And recall that even
> TeX, which is unparalleled for mathematics, does not offer as
> primitives anywhere near all the symbols that our authors do
> use.)
>
I think the current set IS fundamentally flawed. Any proper set of
'escape' codes should be able to display the escape characters
literally. Currently, there is no way to display a literal backslash or
caret without it potentially being read as an escape-code character.

I thought that the CIF codes were rather ad hoc, but then realized that
similar code sequences have been used elsewhere. The advantage of the
current codes is that they are simple enough to be read fairly well in
plain text form. For an archival format, I think that is a good thing.

My proposal is not just to make a huge list of character codes, but to
define some simple rules that keep things from getting ad-hoc.
Personally, I would not have included <I> and <B>. It would be a better
fit to use old-style /italics/ and *bold*, specifically because CIF
markup is not HTML.

Here is my idea. Note that the second rule provides the unescaped form
of any special character by using a blank second character.

special character sequence          result

\<alphabetic>                       Greek letter
\<not alpha><char>                  combination of 2 chars
\\<one or more alpha chars><space>  named code

style rules:
superscript text:  ~text~
subscript text:    ^text^
italic text:       /text/
bold text:         *text*
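
To check that the three escape rules can be told apart mechanically, here
is a small Python tokenizer sketch. It assumes the named-code rule takes
precedence over the two-character rule (both can start with a double
backslash) and that a blank second character yields the literal first
character. The token labels and the function name tokenize are made up for
the illustration.

```python
import re

# Alternatives are tried in order: named code, Greek letter, two-char combo.
_TOKEN = re.compile(
    r"\\\\(?P<named>[A-Za-z]+) "               # \\name<space>
    r"|\\(?P<greek>[A-Za-z])"                  # \<alphabetic>
    r"|\\(?P<first>[^A-Za-z])(?P<second>.)",   # \<not alpha><char>
    re.DOTALL,
)


def tokenize(text):
    """Split text into (kind, value) pairs according to the three rules."""
    tokens = []
    pos = 0
    for m in _TOKEN.finditer(text):
        if m.start() > pos:
            tokens.append(("text", text[pos:m.start()]))
        if m.group("named") is not None:
            tokens.append(("named", m.group("named")))
        elif m.group("greek") is not None:
            tokens.append(("greek", m.group("greek")))
        elif m.group("second") == " ":
            # Blank second character: the first character is meant literally.
            tokens.append(("literal", m.group("first")))
        else:
            tokens.append(("combine", m.group("first") + m.group("second")))
        pos = m.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens
```

For example, a backslash followed by a caret and a space comes out as a
literal caret, and a double backslash followed by a space as a literal
backslash.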

Some of the existing named 'by convention' rules might be better written
with the combined-character trigraph:
\\leftarrow   to  \\<-
\\rightarrow  to  \\->
\\simeq       to  \\~=
\\square      to  \\[]

I also think that the bare codes should be changed. How do I write "---"
and not mean single bond?
--    to  \\--
+-    to  \\+-
-+    to  \\-+
---   to  \\sb

Single bond could also be "\\--", but only if other bond types are also
visual.

Also, the italic and bold style suggestion would interfere a bit with
equations if not written with separating spaces. But the caret sequence
also conflicts with its use as an exponentiation operator, and nobody
seems to mind the lack of a caret escape.
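
One way the separating-space convention could be enforced is sketched
below in Python. It assumes a style marker only counts when it has
whitespace (or the start or end of the text) on its outer side, which is my
reading of the idea rather than a settled rule; style_pattern is a
hypothetical helper name.

```python
import re


def style_pattern(marker):
    """Match marker-delimited text only when set off by whitespace."""
    m = re.escape(marker)
    # Outer side of each marker must be whitespace or a text boundary;
    # the enclosed text must start and end with a non-space character.
    return re.compile(rf"(?<!\S){m}(\S(?:.*?\S)?){m}(?!\S)")


ITALIC = style_pattern("/")
BOLD = style_pattern("*")

italic_hits = ITALIC.findall("an /in situ/ study")
division = ITALIC.findall("x = a/b/c")
bold_hits = BOLD.findall("y = a*b*c and *note* this")
```

Under this assumption, a/b/c and a*b*c are left alone, while /in situ/ and
*note* are picked up as styled text. Styled text followed directly by
punctuation would be missed, so a real rule would need to say more.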

Joe
_______________________________________________
comcifs mailing list
comcifs@iucr.org
http://scripts.iucr.org/mailman/listinfo/comcifs

