Discussion List Archives

Re: CIF line folding/reassembly protocol

  • Subject: Re: CIF line folding/reassembly protocol
  • From: "Herbert J. Bernstein" <yaya@xxxxxxxxxxxxxxxxxxxxxxx>
  • Date: Tue, 24 Sep 2002 20:45:57 +0100 (BST)
The only problem with this proposal is a conflict with the recent proposal
to somehow recognize a final end-of-line within all text fields.  The
first step in handling long character fields is to convert them to
semicolon-delimited text fields.  That worked under the original semantics
for semicolon-delimited text fields, but under the newly proposed
semantics it can create confusion about the number of lines intended.  The
simplest resolution would be to go back to the original semantics.  If
that is not desired, then, to allow consistent folding and unfolding, the
portion of the document that reads:

> Similarly, for a character string that extends beyond column 80,
>
>  - first convert it to be a text field delimited by line-termination-semicolon
>    (<eol>;) sequences
>  -  then change the initial line-termination-semicolon
>     (<eol>;) sequence to line-termination-semicolon-backslash-line-termination
>     (<eol>;\<eol>)
>  - and break all subsequent lines that do not fit within 80
>    columns with a trailing backslash. In the course of doing the translation,
>       * check for any original text lines that end with a backslash
>         followed only by blanks or tabs.
>       *  To preserve that backslash in the conversion, add another
>          backslash after it, and then an empty line.

should be changed to read

> Similarly, for a character string that extends beyond column 80,
>
>  - first convert it to be a text field delimited by line-termination-semicolon
>    (<eol>;) sequences
>  -  then change the initial line-termination-semicolon
>     (<eol>;) sequence to line-termination-semicolon-backslash-line-termination
>     (<eol>;\<eol>)
>  - and break all subsequent lines that do not fit within 80
>    columns with a trailing backslash. In the course of doing the translation,
>       * check for any original text lines that end with a backslash
>         followed only by blanks or tabs.
>       *  To preserve that backslash in the conversion, add another
>          backslash after it, and then an empty line.
> *** if the original character string was not a multiline text field
>    (i.e. a single-quoted or double-quoted character string), then append
>     one more backslash to the last line of the text field (before the
>     terminal <eol>;) making sure to fold to 80 columns as above, if
>     necessary.
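
A minimal Python sketch of the amended steps may make the intent easier to
check. It is not taken from any existing CIF tool: the function name
fold_text_field, the quoted flag and the width parameter are all
hypothetical. It also assumes that the protective backslash goes at the very
end of the line (after any trailing blanks), and that a fold point is moved
left when it would otherwise put a semicolon in column 1.

```python
def fold_text_field(value: str, width: int = 80, quoted: bool = False) -> str:
    """Fold `value` into a text field whose lines fit within `width` columns.

    `quoted=True` means the value came from a single- or double-quoted
    string, so the last line gets the extra trailing backslash.
    """
    if width < 3:
        raise ValueError("width is too small to fold into")
    out = [";\\"]  # opening <eol>;\<eol> sequence
    lines = value.split("\n")
    for index, rest in enumerate(lines):
        if rest.startswith(";"):
            # Assumption: such a line would read as the closing delimiter.
            raise ValueError("a text-field line cannot start with ';'")
        no_eol = quoted and index == len(lines) - 1

        def needs_backslash(text: str) -> bool:
            # True when the emitted line must end in one added backslash.
            return no_eol or text.rstrip(" \t").endswith("\\")

        while len(rest) + needs_backslash(rest) > width:
            cut = width - 1
            # Never let a continuation line start with ';' in column 1.
            while cut > 0 and rest[cut] == ";":
                cut -= 1
            if cut == 0:
                raise ValueError("cannot fold a run of ';' characters")
            out.append(rest[:cut] + "\\")
            rest = rest[cut:]
        if no_eol:
            out.append(rest + "\\")        # elides the terminal <eol>
        elif needs_backslash(rest):
            out.extend([rest + "\\", ""])  # protect an original backslash
        else:
            out.append(rest)
    out.append(";")
    return "\n".join(out)
```

Calling fold_text_field(" X-RAY DIFFRACTION ", quoted=True) gives the three
lines used for _exptl.method in the transformed example quoted below: the
opening semicolon-backslash, the string with one trailing backslash, and the
closing semicolon.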

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 020
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya@dowling.edu
=====================================================

On Tue, 24 Sep 2002, Brian McMahon wrote:

> Here is a draft description of the line-folding protocol mentioned earlier
> that I wish to add to the semantics document as part of the CIF 1.1
> specification. It is a slightly modified version of a proposal elaborated by
> Herbert Bernstein. Note that the specific aim of this proposal is to
> introduce a technique for folding lines within a text field or comment that
> exceed the CIF line-length limit into lines within that limit, for the
> purpose of producing a syntactically valid CIF where the semantic
> information within the long lines can be recovered without loss by applying
> the unfolding part of the protocol.
>
> This requirement has always been present, and has in the past been handled
> in various ad hoc ways (Acta Cryst. implemented something similar within
> ciftex); this proposal formalises a specific approach that may be used
> robustly by content handlers of text fields.
>
> As a beneficial corollary, it also facilitates mechanical interconversion
> between CIF 1.0 and 1.1 files.
>
> Brian
>
> PS: To protect the unwary, I'll draw your attention to a couple of specific
> things in the example folded CIF below. One is the way that a
> quote-delimited text string has been converted into a folded-multiline
> text field where the terminal newline is elided; the second is that one of
> the folded lines carries a colon into the first column of the next line - be
> careful not to see that as a semicolon!
>
> ==============================================================================
>
> A line-folding/reassembly protocol
> ----------------------------------
>
> It must be emphasized that most CIF software and applications need not be
> concerned with line folding. However, if one has software for CIF 1.0 and
> a dataset with long lines, it is useful to have a consistent way in which to
> convert the data to conform to CIF 1.0. Line folding using backslashes
> allows us to do this.
>
> In order to permit such a folding we define a special semantics for use of
> the backslash. It is important to understand that this does not change the
> syntax of CIF 1.0. All existing CIFs conforming to the CIF 1.0
> specification can be viewed as having exactly the same semantics as
> they now have. Use of these transformational semantics is optional, but
> recommended.
>
> In order to avoid confusion between CIFs that have undergone these
> transformations and those that have not, the special comment beginning with
> a hash mark immediately followed by a backslash (#\) as the last non-blank
> characters on a line is reserved to mark the beginning of comments created
> by folding long-line comments, and the special text field beginning with the
> sequence line-termination, semicolon, backslash (<eol>;\) as the only
> non-blank characters on a line is reserved to mark the beginning of text
> fields created by folding long-line text fields.
>
> The backslash character is used to fold long lines in character strings and
> comments. Consider a comment which extends beyond column 80. In order to
> provide a comment with the same meaning which can be fitted into 80
> character lines, prefix the comment with the special comment consisting of a
> hash mark followed by a backslash (#\) and the line terminator. Then on new
> lines take appropriate fragments of the original comment, beginning each
> fragment with a hash mark and ending all but the last fragment with a
> backslash. In doing this conversion, check for an original line that ends
> with a backslash followed only by blanks or tabs. To preserve that backslash
> in the conversion, add another backslash after it. If the next lexical token
> (not counting blanks or tabs) is another comment, to avoid fusing this
> comment with the next comment, be sure to insert a line with just a hash
> mark.
>
> Similarly, for a character string that extends beyond column 80,
>
>  - first convert it to be a text field delimited by line-termination-semicolon
>    (<eol>;) sequences
>  -  then change the initial line-termination-semicolon
>     (<eol>;) sequence to line-termination-semicolon-backslash-line-termination
>     (<eol>;\<eol>)
>  - and break all subsequent lines that do not fit within 80
>    columns with a trailing backslash. In the course of doing the translation,
>       * check for any original text lines that end with a backslash
>         followed only by blanks or tabs.
>       *  To preserve that backslash in the conversion, add another
>          backslash after it, and then an empty line.
>
> (More formally, the line folding should be done separately and directly on
> single-line, non-semicolon-delimited character strings to allow for
> recognition of the fact that no terminal line-termination is intended -- see
> below).
>
> In order to understand this scheme, suppose the CIF fragment (1) below were
> considered to have long lines, then we could transform them as follows (2):
>
> (1) Initial CIF ==============================================================
>
> ###################################################
> #                                                 #
> #   Converted from PDB format to CIF format by    #
> #   pdb2cif version 2.3.1           24 Aug 96     #
> #                       by                        #
> # P.E. Bourne, H.J. Bernstein and F.C. Bernstein  #
> #                                                 #
> ###################################################
>
>
> data_1DIN
>
> _entry.id        1DIN
>
> loop_
> _struct.entry_id
> _struct.title
>   1DIN
> ;      DIENELACTONE HYDROLASE AT 2.8 ANGSTROMS
>   Compound::
>        MOL_ID: 1;
>        MOLECULE: DIENELACTONE HYDROLASE;
>        CHAIN: NULL;
>        SYNONYM: DLH;
>        EC: 3.1.1.45;
>        ENGINEERED: YES
>   Source::
>        MOL_ID: 1;
>        ORGANISM_SCIENTIFIC: PSEUDOMONAS SP.;
>        STRAIN: B13;
>        EXPRESSION_SYSTEM: EXPRESSED UNDER OWN PROMOTER;
>        EXPRESSION_SYSTEM_PLASMID: PDC100;
>        EXPRESSION_SYSTEM_GENE: CLC D
> ;
> _exptl.entry_id 1DIN
> _exptl.method ' X-RAY DIFFRACTION '
>
> (2) Transformed CIF ==========================================================
>
> #\
> ##########################\
> ##########################
> #                                                 #
> #\
> #   Converted from PDB format\
> # to CIF format by    #
> #   pdb2cif version 2.3.1           24 Aug 96     #
> #                       by                        #
> # P.E. Bourne, H.J. Bernstein and F.C. Bernstein  #
> #                                                 #
> ###################################################
>
>
> data_1DIN
>
> _entry.id        1DIN
>
> loop_
> _struct.entry_id
> _struct.title
>   1DIN
> ;\
>       DIENELACTONE HYDROLASE\
>        AT 2.8 ANGSTROMS
>   Compound:\
> :
>        MOL_ID: 1;
>        MOLECULE: DIENELACTONE HYDROLASE;
>        CHAIN: NULL;
>        SYNONYM: DLH;
>        EC: 3.1.1.45;
>        ENGINEERED: YES
>   Source::
>        MOL_ID: 1;
>        ORGANISM_SCIENTIFIC: PSEUDOMONAS SP.;
>        STRAIN: B13;
>        EXPRESSION_SYSTEM:\
> EXPRESSED UNDER OWN PROMOTER;
>        EXPRESSION_SYSTEM_PLASMID: PDC100;
>        EXPRESSION_SYSTEM_GENE: CLC D
> ;
> _exptl.entry_id 1DIN
> _exptl.method
> ;\
>  X-RAY DIFFRACTION \
> ;
>
> ==============================================================================
>
> In making the transformation from the backslash folded form to long
> lines, it is very important to strip trailing blanks before attempting
> to recognize a backslash as the last character. When re-assembling
> text field lines, no reassembly should be done except in text fields
> that begin with the special sequence described above,
> line-termination-semicolon-backslash-line-termination, (<eol>;\<eol>),
> so that text fields which happen to contain backslashes, but which were not
> created by folding long lines, are not changed. It is also important to
> remove the trailing backslashes when reassembling long lines. The final
> line-termination-semicolon sequence of a text field takes priority over the
> reassembly process and ends it, but a trailing backslash on the last line of
> a text field very nicely conveys the information that no trailing line
> termination is intended to be included within the character string.
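
A rough Python sketch of the text-field reassembly just described. The
function name unfold_text_field and its return shape are hypothetical. The
input is assumed to be exactly the text from the opening semicolon to the
closing semicolon line, and an empty first line in an unfolded field is
dropped, which is an assumption made to match the two-line example further
down.

```python
def unfold_text_field(field: str) -> tuple[str, bool]:
    """Return (value, eol_elided) for one semicolon-delimited text field.

    `eol_elided` is True when the last line ended in a folding backslash,
    i.e. no terminal line-termination is meant to be part of the value.
    """
    lines = field.split("\n")
    if len(lines) < 2 or not lines[0].startswith(";") or lines[-1] != ";":
        raise ValueError("expected text from opening ';' to closing ';'")
    first, body = lines[0][1:], lines[1:-1]
    if first.rstrip(" \t") != "\\":
        # Not marked as folded: backslashes in the text are left untouched.
        parts = body if first == "" else [first] + body
        return "\n".join(parts), False
    out: list[str] = []
    current, pending = "", False
    for line in body:
        stripped = line.rstrip(" \t")  # strip trailing blanks first
        if stripped.endswith("\\"):
            current += stripped[:-1]   # drop the folding backslash and join
            pending = True
        else:
            out.append(current + line)
            current, pending = "", False
    if pending:
        # The closing <eol>; ends reassembly even after a trailing backslash.
        out.append(current)
    return "\n".join(out), pending
```

For the folded _exptl.method field in the transformed example, this returns
the string " X-RAY DIFFRACTION " together with True, signalling that the
value came from a string with no terminal line-termination.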
>
> Similarly, when reassembling long-line comments, the reassembly begins with
> a comment of the form hash-backslash-line-termination. The initial hash mark
> is retained and then a forward scan is made through line-terminations and
> blanks for the next comment, from which the initial hash mark is stripped
> and then the contents of the comment are appended. If that comment ends with
> a backslash, the trailing backslash is stripped and the process
> repeats. Note that the process will be ended by intervening tags, values,
> data blocks or other non-whitespace information, and that the process will
> not start at all without the special hash-backslash-line-termination
> comment.
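
The comment reassembly can be sketched the same way. The function name
unfold_comments is hypothetical, and the sketch works on a list of lines
under two simplifying assumptions: the lines lie outside any text field, and
comments occupy whole lines (a comment that trails a value on the same line
is not handled).

```python
def unfold_comments(lines: list[str]) -> list[str]:
    """Reassemble folded comments in a list of CIF lines."""
    out: list[str] = []
    i = 0
    while i < len(lines):
        line = lines[i]
        i += 1
        if line.strip(" \t") != "#\\":
            out.append(line)  # not the special marker: copy unchanged
            continue
        text = "#"            # the initial hash mark is retained
        more = True
        while more:
            # Scan forward through blank lines for the next comment.
            j = i
            while j < len(lines) and lines[j].strip(" \t") == "":
                j += 1
            if j == len(lines) or not lines[j].lstrip(" \t").startswith("#"):
                break         # a tag, value or data block ends the process
            fragment = lines[j].lstrip(" \t")[1:]  # strip its hash mark
            i = j + 1
            stripped = fragment.rstrip(" \t")
            if stripped.endswith("\\"):
                text += stripped[:-1]  # strip the backslash and repeat
            else:
                text += fragment
                more = False
        out.append(text)
    return out
```

Applied to the first five lines of the transformed example, this gives back
the 51-character row of hash marks and the "Converted from PDB format to CIF
format by" line of the initial CIF, with the unfolded comment between them
left as it was.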
>
> Since there are very few, if any, CIFs which contain text fields and
> comments beginning this way, in most cases, it is reasonable to adopt the
> policy of doing this processing unless it is disabled.
>
> Here is another example of folding. The following three text fields would be
> equivalent:
>
>
> ;C:\foldername\filename
> ;
>
> ;\
> C:\foldername\filename
> ;
>
> and
>
> ;\
> C:\foldername\file\
> name
> ;
>
> but the next example would be a two-line value where the first line had the
> value "C:\foldername\file\" and the second had the value "name":
>
> ;
> C:\foldername\file\
> name
> ;
>
> When these line-folding transformations are performed on long-line CIFs, and
> when long tags are replaced with aliases no longer than 75 characters, it is
> then simple to fold the entire CIF into lines of no more than 80 characters,
> making it conform to CIF 1.0 specifications. Note that backslashes should
> not be used to fold lines outside of comments and text fields. That would
> introduce extraneous characters into the CIF and violate the basic syntax
> rules. In any case, such an action is not necessary.
>
> Note that the line folding and reassembly mechanism has been introduced to
> allow folding of long-line CIFs to the 80-character maximum width of the CIF
> 1.0 specification; but it is a general mechanism that may be used to fold
> lines into any width imposed or required by applications or transmission
> mechanisms (for example, some older mail transfer agents fold lines in text
> files at 72 characters).
>
> ==============================================================================
>


International Union of Crystallography
