Discussion List Archives


Re: Accent escape sequences

The practical reality is that the "usual CIF markup" has been used
in a very large number of existing CIFs on the assumption that
plain text actually means text with that markup.  I think we
should formalize that assumption for all quoted strings, making
the current usage "legal".  Then for anything else:

   true plain text
   CBF binary
   TeX
   XML
   HTML
   ...

we would use text fields with appropriate MIME headers.  As with email,
the burden would not fall on general CIF parsers, except to the
limited extent of being able to reliably find the end of a text field
and to deliver whatever was found within it to the higher-level
application to parse further, if it chooses to do so.  CIF writers
could invert the path, requiring applications to deliver properly
wrapped MIME ready to be put out as text fields.  The CIF writer
would add the semicolon quotes, the MIME boundary markers and the
content header.  The application would be responsible for any
additional headers and for the actual content as a byte stream.
This division of labor would leave the write logic of most CIF
software unchanged and require only modest changes to its read
logic, while still allowing applications that need to deliver, and
be aware of, complex internal semantics to do so in a CIF
environment instead of having to move over to XML.
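
To make that concrete, a text field carrying multipart/alternative
content might look something like the sketch below.  The data name,
boundary string and content types are purely illustrative (none of
this is agreed syntax); the point is the division of labor: the CIF
writer supplies the semicolon quotes, the boundary markers and the
content header, while the body of each part comes from the
application as an opaque byte stream.

_demo_text_item
;
Content-Type: multipart/alternative; boundary="==cif-demo=="

--==cif-demo==
Content-Type: text/plain; charset=us-ascii

the \a relaxation angle

--==cif-demo==
Content-Type: application/x-tex; charset=us-ascii

the $\alpha$ relaxation angle

--==cif-demo==--
;

A general CIF parser would only need to find the closing semicolon;
everything between the boundary markers would pass straight through
to whatever higher-level application chooses to interpret it.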


At 4:30 PM -0500 3/5/07, Joe Krahn wrote:
>Brian McMahon wrote:
>>>  The advantage of a simple escape mechanism, like the current scheme, is
>>>  that it is fairly easy to read directly. The disadvantage is that it has
>>>  limited abilities. With MIME, the multipart/alternative could be used,
>>>  where simple ASCII escapes are combined with a more accurate version
>>>  that is not directly readable. This gives the advantages of both forms.
>>
>>  In principle, this is a great idea. Consider the CIF dictionaries,
>>  where the pure-text _definition field sometimes carries inventive
>>  representations of maths (e.g.
>>  http://www.iucr.org/iucr-top/cif/cifdic_html/1/cif_core.dic/Irefine_ls_restrained_S_gt.html)
>>  that have to be reverse-engineered into something more useful (e.g. TeX)
>>  when typesetting these for International Tables. It would make it
>>  easier to keep these representations in sync if they were both
>>  transported as multipart/alternative content in the same text field.
>>
>>  But ... this does come at the expense of significantly more
>>  complexity in applications that need to do something with the
>>  content of text fields. Most scientific CIF applications (the
>>  ones that work on the data) won't be affected - they just skip
>>  over text fields. The others will need to have the ability to
>>  parse and extract MIME content (not too difficult), but also
>>  to *write* proper multipart content, and that's not necessarily
>>  so easy if you're to provide tools that ingest content from
>>  different input streams (TeX-savvy editors, html editors,
>>  clipboards...). In practice the Acta office doesn't see a
>>  critical mass of content provision to justify this complexity
>>  at this stage (it's still really only Acta C and E that use
>>  CIF text fields extensively, and they're catered for through
>>  publCIF). Having said which, there's no harm in working through
>>  the details of how such a system could operate.
>As long as the multi-part processing is optional, it should not be a
>problem. The extra effort then only needs to be done for those cases
>where the content is sufficiently complex that the software is already
>dealing with the extra complexity.
>
>>
>>  Going back to Joe's original wishes to rationalise and perhaps
>>  extend the existing CIF markup, it's important also to remember
>>  that some data items will also occasionally require markup for
>>  simple string fields - e.g. how to mark up the "alpha" Wyckoff
>>  position in the symmetry CIF dictionary? The use of
>>  the '\a' digraph in
>>  http://www.iucr.org/iucr-top/cif/cifdic_html/2/cif_sym.dic/Ispace_group_Wyckoff.letter.html
>>  clearly derives from the "usual" CIF markup for alpha, but that is
>>  nowhere made formally clear. It looks like we need unambiguous
>>  markup rules in these cases too.
>Are you saying that the current CIF markup is defined only for
>multi-line text? If so, the description sentence is an example where
>'\a' needs to represent the character sequence in the non-markup form
>(not converted to '<alpha>').
>
>>
>>  (I'm hoping to see our publCIF developer later this week so that
>>  we can discuss the specifics of the proposal Joe posted recently.)
>>
>>  Brian
>When I first looked at this, I thought it would be sufficient to convert
>the Latin1 and Latin2 character sets. But these do not include the
>over-bar already defined. I also realized that RFC-1345 covers a lot of
>this. It defines 2-character sequences for most Latin characters, 3-4 in
>some cases, and longer sequences for languages like Japanese. Maybe it
>would be a good goal to cover all of the Latin characters in the
>2-letter set from RFC-1345?
>
>Most of those 2-letter codes have the alphabetic character first, then
>the modifier. These would be quite similar to the CIF markup with the two
>characters swapped, apart from a few differences in the modifiers, such as
>zero instead of % for ring-above. It also adds Hook (2) and Horn (9)
>modifiers. It would be nice to use the RFC-1345 set of modifier codes
>for increased standardization. Any chance of having CIF markup "version
>2" with some incompatible changes? Maybe it is OK in the context of
>including Content-Type headers?
>
>In the case of codes made up of two alphanumeric characters, it is simple
>to map RFC-1345 to 'word-based' CIF codes, such as "\\ae " for the ae ligature.
>
>Joe Krahn
>_______________________________________________
>comcifs mailing list
>comcifs@iucr.org
>http://scripts.iucr.org/mailman/listinfo/comcifs
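
Coming back to Joe's point about RFC-1345: just to make the swapping
idea concrete, a mapping along the following lines would cover the
simple accent cases.  This is only a rough sketch in Python; the
modifier table is partial and written from memory, so the individual
codes would need to be checked against RFC-1345 and against the
published CIF markup conventions before anyone relied on them.

# Sketch only: turn an RFC-1345 two-character mnemonic (base letter
# followed by a modifier) into a CIF-style code (backslash, then
# modifier, then letter).  The modifier table is partial and unchecked.

RFC1345_TO_CIF_MODIFIER = {
    "'": "'",   # acute
    "!": "`",   # grave
    ">": "^",   # circumflex
    ":": '"',   # diaeresis / umlaut
    "?": "~",   # tilde
    "-": "=",   # macron / over-bar
    "0": "%",   # ring above (zero in RFC-1345, % in CIF)
}

def rfc1345_to_cif(mnemonic):
    """Map e.g. "e'" to "\\'e"; return None for codes not handled here."""
    if len(mnemonic) != 2:
        return None
    letter, modifier = mnemonic
    cif_modifier = RFC1345_TO_CIF_MODIFIER.get(modifier)
    if cif_modifier is None:
        return None
    return "\\" + cif_modifier + letter

print(rfc1345_to_cif("e'"))   # \'e
print(rfc1345_to_cif("a0"))   # \%a

Two-letter codes such as "ae", and the Hook and Horn modifiers, would
need decisions of their own, but for the plain accents the
correspondence really is just a swap plus a small translation table.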


