Discussion List Archives


Re: Accent escape sequences

Brian McMahon wrote:
>> The advantage of a simple escape mechanism, like the current scheme, is
>> that it is fairly easy to read directly. The disadvantage is that it has
>> limited abilities. With MIME, the multipart/alternative could be used,
>> where simple ASCII escapes are combined with a more accurate version
>> that is not directly readable. This gives the advantages of both forms.
> In principle, this is a great idea. Consider the CIF dictionaries,
> where the pure-text _definition field sometimes carries inventive
> representations of maths (e.g.
> http://www.iucr.org/iucr-top/cif/cifdic_html/1/cif_core.dic/Irefine_ls_restrained_S_gt.html )
> that have to be reverse-engineered into something more useful (e.g. TeX)
> when typesetting these for International Tables. It would make it
> easier to keep these representations in sync if they were both
> transported as multipart/alternative content in the same text field.
> But ... this does come at the expense of significantly more
> complexity in applications that need to do something with the
> content of text fields. Most scientific CIF applications (the
> ones that work on the data) won't be affected - they just skip
> over text fields. The others will need to have the ability to
> parse and extract MIME content (not too difficult), but also
> to *write* proper multipart content, and that's not necessarily
> so easy if you're to provide tools that ingest content from
> different input streams (TeX-savvy editors, html editors,
> clipboards...). In practice the Acta office doesn't see a
> critical mass of content provision to justify this complexity
> at this stage (it's still really only Acta C and E that use
> CIF text fields extensively, and they're catered for through
> publCIF). Having said which, there's no harm in working through
> the details of how such a system could operate.
As long as the multi-part processing is optional, it should not be a
problem. The extra effort then only needs to be done for those cases
where the content is sufficiently complex that the software is already
dealing with the extra complexity.
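
As a rough illustration of what "optional" could look like, here is a minimal Python sketch using the standard email package. The function names, the text/x-tex subtype, and the rule that a MIME-aware field starts with a MIME-Version or Content-Type header are all assumptions made for this example, not part of any CIF specification. A field without those headers is passed through untouched, so simple software never pays the cost.

```python
from email import message_from_string, policy
from email.message import EmailMessage


def build_field(plain: str, tex: str) -> str:
    """Pack a readable ASCII form and a richer form into one text field."""
    msg = EmailMessage()
    msg.set_content(plain)                      # text/plain part, listed first
    msg.add_alternative(tex, subtype="x-tex")   # hypothetical richer alternative
    return msg.as_string()


def readable_text(field: str) -> str:
    """Return the plain-text form; ordinary fields are returned unchanged."""
    head = field.lstrip().lower()
    # Assumed convention: only fields that begin with MIME headers are parsed.
    if not head.startswith(("mime-version:", "content-type:")):
        return field
    msg = message_from_string(field, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    return body.get_content() if body is not None else field
```

A reader that only wants the simple form calls readable_text() on every text field; only a typesetting tool would need to look at the other alternatives.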

> Going back to Joe's original wishes to rationalise and perhaps
> extend the existing CIF markup, it's important also to remember
> that some data items will also occasionally require markup for
> simple string fields - e.g. how to mark up the "alpha" Wyckoff
> position in the symmetry CIF dictionary? The use of
> the '\a' digraph in
> http://www.iucr.org/iucr-top/cif/cifdic_html/2/cif_sym.dic/Ispace_group_Wyckoff.letter.html
> clearly derives from the "usual" CIF markup for alpha, but that is
> nowhere made formally clear. It looks like we need unambiguous
> markup rules in these cases too.
Are you saying that the current CIF markup is defined only for
multi-line text? If so, the description sentence is an example where
'\a' needs to represent the character sequence in the non-markup form
(not converted to '<alpha>').

> (I'm hoping to see our publCIF developer later this week so that
> we can discuss the specifics of the proposal Joe posted recently.)
> Brian
When I first looked at this, I thought it would be sufficient to cover
the Latin1 and Latin2 character sets. But, these do not include the
over-bar already defined. I also realized that RFC-1345 covers a lot of
this. It defines 2-character sequences for most Latin characters, 3-4 in
some cases, and longer sequences for languages like Japanese. Maybe it
would be a good goal to cover all of the Latin characters from the
2-letter set from RFC-1345?

Most of those 2-letter codes have the alphabetic character first, then
the modifier. These would be quite similar to the CIF markup by swapping
the two characters, and with a few differences in the modifiers, such as
zero instead of % for ring-above. It also adds Hook (2) and Horn (9)
modifiers. It would be nice to use the RFC-1345 set of modifier codes
for increased standardization. Any chance of having CIF markup "version
2" with some incompatible changes? Maybe it is OK in the context of
including Content-Type headers?
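
To make the swap concrete, here is a small Python sketch that turns a 2-character RFC-1345 mnemonic (letter, then modifier) into the existing CIF form (backslash, modifier, letter). The function name and the modifier table are illustrative assumptions: only the zero/% ring-above difference is taken from the text above, and the other entries are written from memory and should be checked against both specifications before being relied on. Hook (2) and Horn (9) have no current CIF equivalent, so they return None.

```python
# Assumed, partial mapping: RFC-1345 modifier -> CIF markup modifier.
RFC1345_TO_CIF_MODIFIER = {
    "'": "'",   # acute
    "!": "`",   # grave
    ">": "^",   # circumflex
    ":": '"',   # diaeresis (umlaut)
    "?": "~",   # tilde
    ",": ",",   # cedilla
    "0": "%",   # ring above: zero in RFC-1345, % in CIF
}


def rfc1345_to_cif(mnemonic: str):
    """Swap letter and modifier; return None if there is no CIF equivalent."""
    if len(mnemonic) != 2:
        return None
    letter, modifier = mnemonic
    if not (letter.isascii() and letter.isalpha()):
        return None
    cif_modifier = RFC1345_TO_CIF_MODIFIER.get(modifier)
    if cif_modifier is None:
        return None          # e.g. Hook (2) and Horn (9)
    return "\\" + cif_modifier + letter
```

For example, rfc1345_to_cif("e'") gives \'e and rfc1345_to_cif("u0") gives \%u, while rfc1345_to_cif("o9") gives None.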

In the case of 2 alphanumeric codes it is simple to map RFC-1345 to
'word based' CIF codes, such as "\\ae " for the ae ligature.

Joe Krahn
