CIF line folding/reassembly protocol

  • Subject: CIF line folding/reassembly protocol
  • From: Brian McMahon <bm@xxxxxxxx>
  • Date: Tue, 24 Sep 2002 14:45:08 +0100 (BST)

Here is a draft description of the line-folding protocol mentioned earlier
that I wish to add to the semantics document as part of the CIF 1.1
specification. It is a slightly modified version of a proposal elaborated by
Herbert Bernstein. Note that the specific aim of this proposal is to
introduce a technique for folding lines within a text field or comment that
exceed the CIF line-length limit into lines within that limit. The result is
a syntactically valid CIF from which the semantic information in the long
lines can be recovered without loss by applying the unfolding part of the
protocol.

This requirement has always been present, and has in the past been handled
in various ad hoc ways (Acta Cryst. implemented something similar within
ciftex); this proposal formalises a specific approach that may be used 
robustly by content handlers of text fields.

As a beneficial corollary, it also facilitates mechanical interconversion
between CIF 1.0 and 1.1 files.

Brian

PS: To protect the unwary, I'll draw your attention to a couple of specific
things in the example folded CIF below. One is the way that a
quote-delimited text string has been converted into a folded-multiline
text field where the terminal newline is elided; the second is that one of
the folded lines carries a colon into the first column of the next line - be 
careful not to see that as a semicolon!

==============================================================================

A line-folding/reassembly protocol
----------------------------------

It must be emphasized that most CIF software and applications need not be
concerned with line folding. However, if one has software for CIF 1.0 and
a dataset with long lines, it is useful to have a consistent way in which to
convert the data to conform to CIF 1.0. Line folding using backslashes
allows us to do this.

In order to permit such a folding we define a special semantics for use of
the backslash. It is important to understand that this does not change the
syntax of CIF 1.0. All existing CIFs conforming to the CIF 1.0 
specification can be viewed as having exactly the same semantics as
they now have. Use of these transformational semantics is optional, but
recommended.

In order to avoid confusion between CIFs that have undergone these
transformations and those that have not, two special markers are reserved.
The special comment consisting of a hash mark immediately followed by a
backslash (#\) as the last non-blank characters on a line is reserved to
mark the beginning of comments created by folding long-line comments. The
special text field beginning with the sequence line-termination, semicolon,
backslash (<eol>;\) as the only non-blank characters on a line is reserved
to mark the beginning of text fields created by folding long-line text
fields.

The backslash character is used to fold long lines in character strings and
comments. Consider a comment which extends beyond column 80. In order to
provide a comment with the same meaning which can be fitted into 80
character lines, prefix the comment with the special comment consisting of a
hash mark followed by a backslash (#\) and the line terminator. Then on new
lines take appropriate fragments of the original comment, beginning each
fragment with a hash mark and ending all but the last fragment with a
backslash. In doing this conversion, check for an original line that ends
with a backslash followed only by blanks or tabs. To preserve that backslash
in the conversion, add another backslash after it. If the next lexical token
(not counting blanks or tabs) is another comment, to avoid fusing this
comment with the next comment, be sure to insert a line with just a hash
mark.
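
The comment-folding steps above can be sketched in Python roughly as follows.
This is only an illustration under stated assumptions, not part of the
proposal: the name fold_comment and its width parameter are hypothetical, the
input is taken to be a single comment line starting in column 1, and the
protective lone hash mark is always emitted after a doubled backslash (the
text only requires it when another comment follows). The extra backslash is
placed at the very end of the line so that any trailing blanks survive.

```python
def fold_comment(comment: str, width: int = 80) -> list[str]:
    """Fold one '#' comment line into physical lines of at most `width` characters."""
    if not comment.startswith("#"):
        raise ValueError("expected a comment line starting with '#'")
    if width < 3:
        raise ValueError("width must leave room for '#', one character and a backslash")
    body = comment[1:]
    # A backslash followed only by blanks or tabs would be taken for a fold
    # mark on reassembly, so it has to be protected.
    ends_in_backslash = body.rstrip(" \t").endswith("\\")
    if len(comment) <= width and not ends_in_backslash:
        return [comment]                      # nothing to fold or protect
    chunk = width - 2                         # room for leading '#' and trailing backslash
    pieces = [body[i:i + chunk] for i in range(0, len(body), chunk)]
    lines = ["#\\"]                           # the special marker comment
    lines += ["#" + piece + "\\" for piece in pieces[:-1]]
    if ends_in_backslash:
        lines.append("#" + pieces[-1] + "\\")  # doubled backslash preserves the original
        lines.append("#")                      # lone hash mark stops fusion with a later comment
    else:
        lines.append("#" + pieces[-1])
    return lines
```

For a comment of 101 characters and the default width this gives the marker
line, one 80-character fragment ending in a backslash, and a final short
fragment.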

Similarly, for a character string that extends beyond column 80,

 - first convert it to be a text field delimited by line-termination-semicolon
   (<eol>;) sequences;
 - then change the initial line-termination-semicolon (<eol>;) sequence to
   line-termination-semicolon-backslash-line-termination (<eol>;\<eol>);
 - and break all subsequent lines that do not fit within 80 columns with a
   trailing backslash.

In the course of doing the translation, check for any original text lines
that end with a backslash followed only by blanks or tabs. To preserve that
backslash in the conversion, add another backslash after it, and then an
empty line.

(More formally, the line folding should be done separately and directly on
single-line, non-semicolon-delimited character strings to allow for
recognition of the fact that no terminal line-termination is intended -- see
below).
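
A possible Python sketch of the text-field folding just described is given
here. It rests on several assumptions of mine rather than on the proposal:
the function name fold_text_field and its parameters are hypothetical, the
input is the list of logical content lines (the first being whatever followed
the opening semicolon), and keep_final_eol=False models a value that came
from a quote-delimited string, where no terminal line-termination is
intended. The sketch also avoids breaking a line so that a semicolon lands in
column 1, since that would end the text field prematurely.

```python
def fold_text_field(lines: list[str], width: int = 80,
                    keep_final_eol: bool = True) -> list[str]:
    """Return the physical lines of a folded text field, delimiters included."""
    if width < 3:
        raise ValueError("width is too small to fold anything")
    out = [";\\"]                         # the <eol>;\<eol> opening marker
    for n, line in enumerate(lines):
        if line.startswith(";"):
            raise ValueError("a content line may not begin with ';' in column 1")
        is_last = n == len(lines) - 1
        # A line ending in a backslash (plus blanks or tabs) must be protected.
        protect = line.rstrip(" \t").endswith("\\")
        rest = line
        while len(rest) > width - 1:      # always keep one column for a backslash
            cut = width - 1
            while rest[cut] == ";":       # never start a physical line with ';'
                cut -= 1
            if cut == 0:
                raise ValueError("cannot fold a run of semicolons this long")
            out.append(rest[:cut] + "\\")
            rest = rest[cut:]
        drop_eol = is_last and not keep_final_eol
        if protect:
            out.append(rest + "\\")       # doubled backslash preserves the original one
            # An empty line ends the continuation; a lone backslash also
            # suppresses the final line-termination when that is wanted.
            out.append("\\" if drop_eol else "")
        elif drop_eol:
            out.append(rest + "\\")       # trailing backslash: no terminal <eol>
        else:
            out.append(rest)
    out.append(";")
    return out
```

Called as fold_text_field([" X-RAY DIFFRACTION "], keep_final_eol=False),
this yields the three lines ";\", " X-RAY DIFFRACTION \" and ";".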

To illustrate this scheme, suppose the lines of the CIF fragment (1) below
were considered too long; we could then transform them as shown in (2):

(1) Initial CIF ==============================================================

###################################################
#                                                 #
#   Converted from PDB format to CIF format by    #
#   pdb2cif version 2.3.1           24 Aug 96     #
#                       by                        #
# P.E. Bourne, H.J. Bernstein and F.C. Bernstein  #
#                                                 #
###################################################


data_1DIN

_entry.id        1DIN

loop_
_struct.entry_id
_struct.title
  1DIN
;      DIENELACTONE HYDROLASE AT 2.8 ANGSTROMS                     
  Compound::
       MOL_ID: 1;                                                  
       MOLECULE: DIENELACTONE HYDROLASE;                          
       CHAIN: NULL;                                               
       SYNONYM: DLH;                                              
       EC: 3.1.1.45;                                              
       ENGINEERED: YES                                            
  Source::
       MOL_ID: 1;                                                  
       ORGANISM_SCIENTIFIC: PSEUDOMONAS SP.;                      
       STRAIN: B13;                                               
       EXPRESSION_SYSTEM: EXPRESSED UNDER OWN PROMOTER;           
       EXPRESSION_SYSTEM_PLASMID: PDC100;                         
       EXPRESSION_SYSTEM_GENE: CLC D                              
; 
_exptl.entry_id 1DIN
_exptl.method ' X-RAY DIFFRACTION '

(2) Transformed CIF ==========================================================

#\
##########################\
##########################
#                                                 #
#\
#   Converted from PDB format\
# to CIF format by    #
#   pdb2cif version 2.3.1           24 Aug 96     #
#                       by                        #
# P.E. Bourne, H.J. Bernstein and F.C. Bernstein  #
#                                                 #
###################################################


data_1DIN

_entry.id        1DIN

loop_
_struct.entry_id
_struct.title
  1DIN
;\
      DIENELACTONE HYDROLASE\
 AT 2.8 ANGSTROMS                     
  Compound:\
:
       MOL_ID: 1;                                                  
       MOLECULE: DIENELACTONE HYDROLASE;                          
       CHAIN: NULL;                                               
       SYNONYM: DLH;                                              
       EC: 3.1.1.45;                                              
       ENGINEERED: YES                                            
  Source::
       MOL_ID: 1;                                                  
       ORGANISM_SCIENTIFIC: PSEUDOMONAS SP.;                      
       STRAIN: B13;                                               
       EXPRESSION_SYSTEM:\
 EXPRESSED UNDER OWN PROMOTER;           
       EXPRESSION_SYSTEM_PLASMID: PDC100;                         
       EXPRESSION_SYSTEM_GENE: CLC D                              
; 
_exptl.entry_id 1DIN
_exptl.method 
;\
 X-RAY DIFFRACTION \
;

==============================================================================

In making the transformation from the backslash-folded form to long
lines, it is very important to strip trailing blanks before attempting
to recognize a backslash as the last character. When reassembling
text field lines, no reassembly should be done except in text fields
that begin with the special sequence described above,
line-termination-semicolon-backslash-line-termination (<eol>;\<eol>),
so that text fields which happen to contain backslashes, but which were not
created by folding long lines, are not changed. It is also important to
remove the trailing backslashes when reassembling long lines. The final
line-termination-semicolon sequence of a text field takes priority over the
reassembly process and ends it, but a trailing backslash on the last line of
a text field very nicely conveys the information that no trailing line
termination is intended to be included within the character string.
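
The text-field reassembly rules above might be sketched in Python as
follows. The function name unfold_text_field is hypothetical, and the input
convention is an assumption made for illustration: the list holds the
physical content lines of one text field, the first element being whatever
follows the opening semicolon on its line, with the closing semicolon line
left out.

```python
def unfold_text_field(content: list[str]) -> list[str]:
    """Reassemble the logical lines of a text field folded with backslashes."""
    # Only fields opened by ';\' (trailing blanks ignored) were created by
    # folding; any other field is returned untouched.
    if not content or content[0].rstrip(" \t") != "\\":
        return list(content)
    logical = []
    current = ""
    pending = False                       # True while a fold is still open
    for line in content[1:]:
        stripped = line.rstrip(" \t")     # strip blanks before testing for a backslash
        if stripped.endswith("\\"):
            current += stripped[:-1]      # drop the fold mark and keep joining
            pending = True
        else:
            logical.append(current + line)
            current = ""
            pending = False
    if pending:
        # The closing <eol>; ended the reassembly: the value has no
        # terminal line-termination.
        logical.append(current)
    return logical
```

With this convention, ["\\", " X-RAY DIFFRACTION \\"] comes back as the
single logical line " X-RAY DIFFRACTION ", while a field whose first element
is not a lone backslash is passed through unchanged.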

Similarly, when reassembling long-line comments, the reassembly begins with
a comment of the form hash-backslash-line-termination. The initial hash mark
is retained and then a forward scan is made through line-terminations and
blanks for the next comment, from which the initial hash mark is stripped
and then the contents of the comment are appended. If that comment ends with
a backslash, the trailing backslash is stripped and the process
repeats. Note that the process will be ended by intervening tags, values,
data blocks or other non-whitespace information, and that the process will
not start at all without the special hash-backslash-line-termination
comment.
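
A rough Python sketch of the comment reassembly follows. It is deliberately
simplified and the simplifications are mine: the function name
unfold_comments is hypothetical, only whole-line comments are handled (a
comment following a value on the same line is not), and the caller is
assumed to have excluded lines that lie inside text fields, where a hash
mark does not start a comment.

```python
def unfold_comments(lines: list[str]) -> list[str]:
    """Reassemble long comments that were folded behind a '#\\' marker line."""
    out = []
    i = 0
    while i < len(lines):
        if lines[i].strip(" \t") != "#\\":
            out.append(lines[i])          # not a marker line: copy through
            i += 1
            continue
        joined = "#"                      # the initial hash mark is retained
        i += 1
        more = True
        while more:
            # Scan forward through blank lines for the next comment.
            j = i
            while j < len(lines) and lines[j].strip(" \t") == "":
                j += 1
            if j == len(lines) or not lines[j].lstrip(" \t").startswith("#"):
                break                     # a tag, value or data block ends the process
            fragment = lines[j].lstrip(" \t")[1:]      # strip the initial hash mark
            stripped = fragment.rstrip(" \t")
            more = stripped.endswith("\\")
            joined += stripped[:-1] if more else fragment
            i = j + 1
        out.append(joined)
    return out
```

Blank lines that are scanned over between fragments are dropped; blank lines
that precede a tag or value ending the process are kept.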

Since there are very few, if any, CIFs which contain text fields and
comments beginning this way, in most cases, it is reasonable to adopt the
policy of doing this processing unless it is disabled.

Here is another example of folding. The following three text fields would be
equivalent:


;C:\foldername\filename
;

;\
C:\foldername\filename
;

and

;\
C:\foldername\file\
name
;

but the next example would be a two-line value where the first line had the
value "C:\foldername\file\" and the second had the value "name": 

;
C:\foldername\file\
name
;

When these line-folding transformations are performed on long-line CIFs, and
when long tags are replaced with aliases no longer than 75 characters, it is
then simple to fold the entire CIF into lines of no more than 80 characters,
making it conform to CIF 1.0 specifications. Note that backslashes should
not be used to fold lines outside of comments and text fields. That would
introduce extraneous characters into the CIF and violate the basic syntax
rules. In any case, such an action is not necessary.

Note that the line folding and reassembly mechanism has been introduced to
allow folding of long-line CIFs to the 80-character maximum width of the CIF
1.0 specification; but it is a general mechanism that may be used to fold
lines into any width imposed or required by applications or transmission
mechanisms (for example, some older mail transfer agents fold lines in text
files at 72 characters).

==============================================================================
