CIFFOLD

CIFFOLD 0.5.4 Pre-Release
1 February 2006
by Kostadin Mitev, Georgi Todorov and Herbert J. Bernstein

User's Manual

Copyright (c) Kostadin Mitev 2005, 2006

Work funded in part by the International Union of Crystallography under a grant to Dowling College.

 1. Copyright and Distribution
 2. Introduction
 3. Installation
 4. Using CIFFOLD
 5. List of Options
 6. Default Options
 7. Logical integrity checks
 8. Terse Formatting
 9. Non-terse Formatting
10. MAP
11. Command-line Arguments
12. How are files folded/wrapped
13. How are files unfolded/unwrapped
14. OTHER SOURCES
15. Change Log
16. Known Bugs

1. Copyright and Distribution

This software is covered by the GNU General Public License.

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

2. INTRODUCTION

Until recently, information in Crystallographic Information File (CIF) format was limited to 80 characters per line and there was no way to represent longer data items and comments faithfully. With the release of CIF version 1.1, the maximum line size has been increased to 2048 characters and a protocol has been specified for folding and unfolding text fields and comments that exceed any given maximum line size. The C/C++ program CIFFOLD implements this line folding/unfolding protocol without loss of the semantic information in the files.
This allows new, long-line CIF 1.1 files to be converted to a form suitable for processing by existing software for 80-character line CIF 1.0 files, and allows long-line CIF 1.1 files to be recovered from CIFs produced by CIF 1.0 software. In addition to folding and unfolding, the software performs logical integrity checks and lets the user set a variety of options that control the tradeoff between faithful and compact representations.

3. INSTALLATION

You must first obtain a copy of the source kit of CIFFOLD, CIFFOLD.tar.gz. To unpack the file on a UNIX machine type the command

    gunzip CIFFOLD.tar.gz

and then the command

    tar -xvf CIFFOLD.tar

to extract the files into a subdirectory named CIFFOLD_0.5.4 under the current directory. To create the executable, run

    make

in the CIFFOLD_0.5.4 directory, which will create the executable named "ciffold". To run the program interactively, simply type the command

    ./ciffold -g

and hit enter.

4. USING CIFFOLD

To run ciffold's GUI from the UNIX prompt, type ./ciffold -g in the CIFFOLD directory and you will be shown the startup menu. The menu consists of several windows that are shown one by one. The top frame of each window contains the option, while the bottom one contains either the available choices, from which you have to select one, or the prompt "Enter:", after which you have to enter your choice and hit enter. You select a choice by using the up and down arrow keys to highlight it and then hitting enter.

5. LIST OF OPTIONS:

* ENTER INPUT FILE: This is the first window that you will see after you run the program. You have to enter the name of the input file after the prompt "Enter:" and hit enter. If the file does not exist you will be shown an error message giving a choice to either "exit" the program or "continue" by reentering the name of the input file.

* ENTER THE OUTPUT FILE: This is the second window that will be shown after you have entered the name of the input file.
  You have to enter the name of the output file here. If the name of the output file coincides with the name of the input file, an error message will be shown asking you to either "exit" the program or "continue" by reentering the name of the output file. If the name of the output file you have entered coincides with a file in the directory the program is being run from, you will be warned and allowed to either "continue" using the same name or "change" the name of the output file.

* File version: This option allows you to insert or change the file version given by the special comment "#\#CIF_filevers", where filevers is the version number of the file. Using the arrow keys, you can choose from the following options:
  a) 1.0 - the file version will be 1.0
  b) 1.1 - the file version will be 1.1
  c) Do not change - the file version will not be changed, or will not be given if one does not exist.

* Folding (Yes) or Unfolding (No): This option allows you to choose between folding and unfolding. If "yes" is selected then the given input file will be folded according to the folding/unfolding protocol of cifs (see "How are files folded/wrapped" below for a detailed description of the process). If you choose "no" then the input file will be unfolded.

* Minimal Folding (Yes or No): This option allows you to suppress the reformatting of loops. If no other options are selected, this results in a minimal amount of folding, so that the output file is organized the same as the input file, except where long lines or text fields containing long lines are encountered.

* Create a MAP ?: If you have chosen folding then you will be asked whether to create a MAP of the input file. Select "yes" to create one or "no" not to (see "MAP" below for further information).

* Terse Folding/Unfolding?: This option allows you to choose between terse formatting (see "Terse formatting" below) and nonterse formatting (see "Nonterse formatting" below).
  To choose terse select "yes", otherwise select "no".

* Terse formatting on loops?: This option allows you to choose between terse and nonterse formatting on loops. To choose terse select "yes", otherwise select "no". Terse formatting on loops attempts to format the data items of loops whose number of items (data + tags) is at least a number you specify, according to the terse formatting rules (see "Terse formatting" below).

* How many items is a big loop? : This option is displayed after selecting "yes" for the "Terse formatting on loops ?" option. It allows you to specify how many items (tags + data) a loop must have to be considered large. The input is an integer between 5 and 2^32 - 1. For example, if the value entered is 70 then the data items of every loop whose number of data items plus number of tags is greater than or equal to 70 will be tersely formatted (see "Terse formatting" below).

* Preserve leading blanks ?: Allows you to choose between preserving and disregarding the leading blanks in the file. To preserve them select "yes", otherwise select "no". Leading blanks are the blanks at the beginning of each line. The blanks at the beginning of a line inside a text field or folded comment are not considered leading blanks and are preserved regardless of your choice.

* Process the entire file ?: This option allows you to choose between processing the entire file and processing only portions of the file (file chunks). To process the entire file select "yes", otherwise select "no".

* Enter chunk pairs or END to continue: This option is displayed if you have selected "no" for the "Process the entire file ?" option. At the prompt "Enter:" you can enter pairs one at a time, each of the form n-n, where n is a nonnegative integer in the range 0 to 2^32 - 1. After you have entered a pair, hit enter to enter the next one. The first integer of a pair has to be bigger than the integers in any previously entered pair. The second integer of a pair has to be bigger than the first integer in the pair.
  To finish entering the pairs type "end" and hit enter.

* Format only comments? : This option allows you to format only the comments in the file. To format only comments select "yes", otherwise select "no". If you select "yes" then only the comments of the file are folded/unfolded, while the rest of the file is not formatted.

* Format everything except comments ?: This option allows you to format only the data of the input file (without the comments). If you want to format only the data select "yes", otherwise select "no". If you select "yes", only the data of the input file will be folded/unfolded; the comments will not be formatted.

* Is this a dictionary file ?: This option allows you to specify whether the input file is a dictionary or not. If it is, select "yes", otherwise select "no". As of the current version of CIFFOLD this option is not fully implemented and does not have any effect on the processing of the file.

* Output the warning messages ?: The warning messages are the ones that warn you about changes made by the program, such as changing the delimiter of a string (see "Logical integrity checks" below).

* Output the error messages ?: Select "yes" if you want the error messages to be written at the end of the file as special comments, i.e. comments that start with #_#. Error messages contain the type of error that occurred in the logical integrity checks of the file (see "Logical integrity checks" below).

* Read from a MAP?: This option allows you to specify whether or not the file should be formatted according to its MAP file (see "MAP" below). To use a MAP file select "yes", otherwise select "no". The MAP file should be at the end of the input file, with each line of the MAP file prefixed with the special comment #_M# . If the MAP file does not exist, or it does not conform to its specification, then the default options for file unfolding are used to format the rest of the file (see "Default options" below).
* The column with respect to which the data should be aligned: This option is provided only when unfolding files. It allows you to left justify the data associated with a tag with respect to the column you specify. If you do not want to left justify the data, either enter "0" and press enter or just press enter. The option is useful when unfolding a tersely formatted file without using a MAP, and allows the data to be laid out in a more structured way that is easier to read and understand. If a column is specified then the program will attempt to left justify each data field with respect to the given column; if this is not possible then at least a single space will be used to separate the tag from the data, so that a valid cif file is produced.

* Specify the maximum line length?: This option lets you specify the maximum line length of the output file, and it appears only if you choose folding. It takes as input a positive integer between 60 and 2048. As of version 0.3 of CIFFOLD this option is implemented only when folding files. The maximum line length is forced to 2048 for unfolding.

6. DEFAULT OPTIONS

CIFFOLD has default values for the options that have not been selected. These defaults are also used if something goes wrong during processing of the file: for example, if the file should be formatted according to a MAP but it does not contain a MAP, or the MAP becomes invalid at some point, then the default options will be used and the user will be warned.
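The fallback described above can be pictured with a small sketch. It is written in Python for illustration only and is not taken from the CIFFOLD sources: the names split_map and choose_layout_source and the wording of the warning are hypothetical. It relies only on the convention stated in this manual that the MAP lines sit at the end of the input file and are prefixed with #_M#, it assumes that nothing follows the MAP, and it does not try to validate the MAP contents.

```python
MAP_PREFIX = "#_M#"  # prefix of MAP lines, as described in this manual


def split_map(lines):
    """Separate trailing '#_M#' lines from the rest of the file.

    Returns (body_lines, map_lines). map_lines has the prefix removed and
    is empty when the file does not end with a MAP.
    """
    end = len(lines)
    # Walk backwards over the block of MAP lines at the end of the file.
    while end > 0 and lines[end - 1].startswith(MAP_PREFIX):
        end -= 1
    body = lines[:end]
    map_lines = [line[len(MAP_PREFIX):] for line in lines[end:]]
    return body, map_lines


def choose_layout_source(lines, want_map):
    """Decide whether unfolding can follow a MAP or must use the defaults.

    Returns (body_lines, map_lines_or_None, warning_or_None).
    """
    body, map_lines = split_map(lines)
    if want_map and not map_lines:
        # Hypothetical wording; the real program issues its own warning.
        return body, None, "no MAP found, using the default options"
    return body, (map_lines if want_map else None), None
```

For a file that ends with the two lines "#_M#d" and "#_M#d's7d", split_map returns those lines without the prefix; for a file without any MAP lines, choose_layout_source returns no MAP together with a warning, which corresponds to falling back to the default options.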
The program uses the following default options:

* For folding:
  a) Do not change the file version
  b) Do not fold the file tersely
  c) Do not fold tersely large loops
  d) Preserve the leading blanks
  e) Process the entire file
  f) Do not format only comments
  g) Do not format everything except comments
  h) The file is not a dictionary file
  i) Do not output the warning messages
  j) Output the error messages
  k) The maximum line length is set to 80
  l) Do reorganize loops

* For unfolding:
  a) Do not change the file version
  b) Do not fold the file tersely
  c) Do not fold tersely large loops
  d) Preserve the leading blanks
  e) Process the entire file
  f) Do not format only comments
  g) Do not format everything except comments
  h) Do not specify a column with respect to which the data should be left justified (0)
  i) The file is not a dictionary file
  j) Do not output the warning messages
  k) Output the error messages

7. LOGICAL INTEGRITY CHECKS

CIFFOLD checks the file for some basic logical integrity errors and generates warnings about them. The checks performed are:

* is there data corresponding to each tag
* do two tags have the same name within one datablock
* do two data-block headers have the same name
* are there non-delimited deprecated tags such as global_, start_ and stop_
* is there a non-delimited occurrence of save_ if the file is not a dictionary
* are there nested loops
* is there a line longer than the maximum allowed length (2048)
* is there a delimited data field whose opening delimiter does not match its closing delimiter
* is there a datablock with no name
* are there tags longer than 80 characters
* is the number of data items in each loop an exact multiple of the number of tags

In addition to the logical integrity checks, CIFFOLD will detect and change the delimiter of a string with the following peculiarity: the same character as the delimiter appears right after the opening delimiter or right before the closing delimiter.
The delimiter of such a string will be changed to its alternative, for example " to ' and vice versa, so the string "rambo"" will be changed to 'rambo"'. A warning will be issued about the change, and if the option "Output the warning messages ?" is selected then the warning will be written as a special comment at the end of the output file.

A warning will also be issued if a non-delimited reserved character (such as [, ], _, etc.) is present.

8. TERSE FORMATTING

If the option terse is chosen then the program will attempt to reduce the amount of white space to a minimum by putting as much information as possible on one line, while keeping the file a valid cif. The result is not considered user friendly; the option is meant for reducing the size and length of the file.

If a string is delimited with a single/double quote and there is another single/double quote immediately after the opening delimiter or immediately before the closing delimiter, then the delimiter is changed to its alternative, which is the single quote for the double quote and vice versa. For example, a string of the type ""rambo" will be converted to a string of the type '"rambo'. This is done to avoid ambiguity and improve the clarity of the content of cif files.

Any single hash mark will be put on a new line.

9. NON TERSE FORMATTING

If the option Nonterse is selected then the program uses the following rules to format the input file:

* every tag is put on a new line
* the data corresponding to a tag is put on the same line as the tag if it will fit
* every special tag such as loop_, data_ etc. is also put on a new line
* if the data in a loop can be aligned in columns and rows such that one row holds as many data items as there are tags in the loop, this is done. If the data cannot be aligned in this way then the original formatting is preserved as much as possible.
* if a string is delimited with a single/double quote and there is another single/double quote immediately after the opening delimiter or immediately before the closing delimiter, then the delimiter is changed to its alternative, which is the single quote for the double quote and vice versa. For example, a string of the type ""rambo" will be converted to a string of the type '"rambo'. This is done to avoid ambiguity and improve the clarity of the content of cif files.
* if there is a long data field that is not a text field (i.e. not delimited by ";") then it is converted to a text field and folded.

10. MAP

The optional MAP is used to save information on the original positions of the items in a file when the file is folded. The MAP is a file built from the following codes:

* "dh" for data, where h is the delimiter of the data: ;, ', " or nothing if no delimiter is used
* "sn" for spaces, where n is the number of spaces
* "tn" for tabs, where n is the number of tabs

For each line of the input file there is a line in the MAP file that shows the layout of that line. For example, d's7d shows that there is data delimited by a single quote, followed by 7 spaces and nondelimited data. The MAP file is then concatenated to the output file, with each line prefixed by #_M# to indicate that the line is part of the MAP file. The MAP file is useful if a file is folded and it is later necessary to reconstruct exactly the same file by unfolding it.

WARNING: As of version 0.3 of CIFFOLD the lines of the MAP may be of arbitrary length. This means that if there are 60 separate items on a single line of the input file, the corresponding line in the MAP file will be more than 60 characters long, and if the maximum line length has been set to 60 the MAP will exceed it.

11. Command-line Arguments

ciffold [-i input_cif] [-o output_cif] [-x n-n,n-n] [-l n] [-m n] [-C n] [-p a[w][e]] [-v file_vers] [-c] [-d] [-e] [-g] [-w [-n]] [-u] [-L] [-t] [-h] [-M] [-V]

To run CIFFOLD with the options specified on the command line, type "./ciffold" followed by the options and hit enter. The options provided are:

[-i input_cif]
    corresponds to "ENTER INPUT FILE:" (see above)
    for command line use, a "-" indicates standard input
    input_cif defaults to stdin

[-o output_cif]
    corresponds to "ENTER THE OUTPUT FILE:" (see above)
    for command line use, a "-" indicates standard output
    output_cif defaults to stdout

[-d]
    corresponds to "Is this a dictionary file ?:" with a value of "yes" (see above)

[-u]
    corresponds to "Folding (Yes) or Unfolding (No):" with a value of "no" (see above)

[-w]
    corresponds to "Folding (Yes) or Unfolding (No):" with a value of "yes" (see above)

[-n]
    corresponds to "Minimal Folding (Yes or No):" with a value of "yes" (see above)

[-m maxline]
    corresponds to "Specify the maximum line length?:" (see above)
    Note: this option is considered only when folding files. When unfolding, the maximum line length is forced to be 2048.

[-v file_version]
    corresponds to "File version:" (see above)
    valid file_versions are 1.0 or 1.1

[-t]
    corresponds to "Terse Folding/Unfolding?:" with a value of "yes" (see above)

[-l integer]
    corresponds to "Terse formatting on loops?:" with a value of "yes"; integer corresponds to "How many items is a big loop? :" (see above)

[-L]
    corresponds to "Preserve leading blanks ?:" with a value of "yes" (see above)

[-c]
    corresponds to "Format only comments? :" with a value of "yes" (see above)

[-e]
    corresponds to "Format everything except comments ?:" with a value of "yes" (see above)

[-C integer]
    corresponds to "The column with respect to which the data should be aligned:" (see above)

[-p character]
    Valid values for "character" are:
    "a" - corresponds to "Output the error messages ?:" with a value of "yes" and "Output the warning messages ?:" with a value of "yes"
    "w" - corresponds to "Output the warning messages ?:" with a value of "yes"
    "e" - corresponds to "Output the error messages ?:" with a value of "yes"
    (see above)

[-g]
    Takes no values and invokes the GUI interface.

[-M]
    If folding, corresponds to "Create a MAP ?:" with a value of "yes". If unfolding, corresponds to "Read from a MAP?:" with a value of "yes" (see above)

[-h]
    Takes no values. Prints a help message and exits.

[-x n-n,n-n]
    corresponds to "Process the entire file ?:" with a value of "no". Each n-n corresponds to a pair entered at "Enter chunk pairs or END to continue:", where the first n is the starting integer and the second n is the ending integer.
    Example: to format only the chunks 9-10 and 40-70 you would specify -x 9-10,40-70

[-V]
    Takes no values. Prints the current version and exits.

12. How are files folded/wrapped

CIFFOLD will make two passes through the file. On the first pass it will perform the logical integrity checks, issue the appropriate warnings and error messages, and create a temporary file in which the input file is stored. If the MAP option is selected it will also create a MAP for the file and a temporary file for the MAP. Some additional information about the file is gathered as well.

On the second pass CIFFOLD will actually fold/wrap the file according to the following rules:

Lines will be folded/wrapped only if they exceed the maximum line length.
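As a rough illustration of this rule, the sketch below leaves short lines alone and splits only the lines that exceed the maximum line length, ending every piece except the last with a backslash as in the CIF 1.1 line folding convention. It is a minimal Python sketch under stated assumptions, not CIFFOLD's C/C++ code: the names fold_line, fold_lines and unfold_lines are hypothetical, breaking after a blank when one is available follows the change log entry for release 0.5, and lines whose own content already ends with a backslash are not handled.

```python
def fold_line(line, max_len):
    """Split one line into pieces of at most max_len characters.

    Every piece except the last ends with a backslash, marking a line
    that continues on the next one. Short lines are returned unchanged.
    """
    if max_len < 2:
        raise ValueError("max_len must leave room for text and a backslash")
    pieces = []
    while len(line) > max_len:
        cut = max_len - 1                 # keep one column for the backslash
        blank = line.rfind(" ", 1, cut)   # prefer to break after a blank
        if blank > 0:
            cut = blank + 1
        pieces.append(line[:cut] + "\\")
        line = line[cut:]
    pieces.append(line)
    return pieces


def fold_lines(lines, max_len=80):
    """Fold only those lines that exceed max_len."""
    folded = []
    for line in lines:
        folded.extend(fold_line(line, max_len))
    return folded


def unfold_lines(lines):
    """Join every line that ends with a backslash to the line after it."""
    out, pending = [], ""
    for line in lines:
        stripped = line.rstrip()          # blanks after the backslash are ignored
        if stripped.endswith("\\"):
            pending += stripped[:-1]
        else:
            out.append(pending + line)
            pending = ""
    if pending:
        out.append(pending)
    return out
```

Under these assumptions, unfold_lines(fold_lines(lines, 80)) gives back the original lines. In a real CIF a folded text field is also opened with ;\ rather than a bare ; so that a reader knows the trailing backslashes are fold marks; the sketch leaves the delimiters out.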
In particular, a text field whose lines are all shorter than the maximum allowed line length will not be folded/wrapped.

Strings that are shorter than the maximum allowed line length but end beyond the column of the maximum allowed line length will either be moved to the left by deleting blank characters or, if that is not possible, be placed on a new line.

The loops will be formatted according to the following rules: Every tag is placed on a new line. If possible, the data tokens in the loop will be aligned into rows and columns such that each row contains as many data tokens as there are tags. If such alignment is not possible, the original formatting will be preserved as much as possible. The option to preserve the leading blanks will not preserve the leading blanks of tokens that fall within a loop.

Trailing blanks will be deleted unless they fall within a text field.

When processing is finished, the temporary files are deleted.

13. How are files unfolded/unwrapped

CIFFOLD will make two passes through the file. On the first pass it will perform the logical integrity checks, issue the appropriate warnings and error messages, and create a temporary file in which the input file is stored. If the MAP option is selected and the input file contains a MAP, it will also create a temporary MAP file to hold the MAP of the input file. Some additional information about the file is gathered as well.

On the second pass CIFFOLD will actually unfold/unwrap the file according to the following rules (if the default options are used):

Every tag will be placed on a new line. A data item associated with a tag will be placed on the same line as the tag if the resulting line length does not exceed the maximum allowed line length and there is no more than one newline character between the tag and the data.
Example:

    _a_tag data

and

    _a_tag
    data

will be unfolded/unwrapped as:

    _a_tag data

but:

    _a_tag

    data

will be unfolded/unwrapped with the data left on a line of its own, separate from the tag.

The loops will be formatted according to the following rules unless the -n option has been selected: Every tag is placed on a new line. If possible, the data tokens in the loop will be aligned into rows and columns such that each row contains as many data tokens as there are tags. If such alignment is not possible, the original formatting will be preserved as much as possible.

Trailing blanks will be deleted unless they fall within a text field. The option to preserve the leading blanks will not preserve the leading blanks of tokens that fall within a loop. The only way the original file can be exactly recovered is by using the MAP option.

When processing is finished, the temporary files are deleted.

14. OTHER SOURCES:

For information about cif files visit: http://www.iucr.org/.

For information about the folding/unfolding protocol of cifs visit: http://www.iucr.org/iucr-top/lists/cif-developers/msg00147.html

15. Change Log

* Release 0.5.4, 1 February 2006 KM+HJB
  Correct text field blank stripping and unfolding of text fields to quoted strings.

* Release 0.5.3, 30 September 2005 HJB
  Add command line option -n for minimal folding, suppressing loop reformatting.

* Release 0.5.2, 1 August 2005 HJB
  Changed handling of folded quoted strings to end with backslash in the text field to avoid an extra newline, and improved handling of embedded semicolons.

* Release 0.5.1, 25 July 2005 GT+HJB
  Updated output of -h option and redirected that output to cerr.
  Updated version number in all source code file headers.

* Release 0.5, 23 July 2005 HJB
  Added code to fold comments and text fields on a blank when available.
  Moved temporary files to /tmp
  Cleaned up some white space in FUCIF.c (HJB)

* Release 0.4.5, 11 July 2005 KM, 22 July 2005 HJB
  Correction in FUCIF.c to correct infinite loop on some terminal comments discovered by I. Awuah Asiamah. (KM)
  Cleanup of bad characters in README and addition of logo. (HJB)

* Release 0.4.4, 31 May 05 KM
  Corrections made in FUCIF.c in outputTextField to properly terminate if end of file is reached before the closing delimiter of the text field.
  Corrections made in FUCIF.c in formatLoop to insert a new line before a comment that falls in a loop when it is the only item of the input line and there is a data item on the output line.

* Post Release 0.4.3, 14 May 05 KM+HJB
  Added manual sections on folding and unfolding.

* Release 0.4.3, 7 May 2005 KM+HJB
  Changed MAP to one that has line length restricted to the maximum allowed and introduced the character 'n' to represent new line
  Fixed some bugs including the output of ambiguous closing text delimiter and scrambling of comments when format everything except comments is used
  Added local definition of isblank for systems that do not have it in ctype
  Revised Makefile to try /usr/local/... for ncurses

* Release 0.4.2, 29 April 2005 KM
  Corrected failure to output opening and closing delimiters when converting a non-delimited string to a folded text field, and fixed the handling of unfolding according to a MAP, where it would output a single space before converting the folded text field back to a nondelimited string.

* Release 0.4.1, 28 April 2005 GT
  Corrected "then then" and changed "formating" to "formatting" in README..., ReadFile.cpp, getOpt.cpp.
* Release 0.4, 27 April 2005 KM, GT
  - fixed concatenation of closing text delimiter with the next token
  - fixed some segmentation fault problems occurring when the input file is incorrect
  - in menus, updated the version of the program and fixed a bug that set the maxlength of a line in folding using -g after the user's input (overwriting the user input and making the option useless)
  - in getOpts, made the program exit upon invalid input, -V or -h (basically every argument except the valid ones will print the help and exit)

* Release 0.3, 23 April 2005 KM
  Changes in the command line options:
  - -x to be used instead of -h
  - -h used to print a help message
  - -C instead of -r
  - -V used to print the current version and exit
  - -M instead of -g for creating/using a MAP
  - -g used to invoke the GUI
  - -L instead of -p for preserve leading blanks
  - -p used to print the error messages and warnings
  - -m to take values within the range 60-2048
  - the file chunks to be in the form n-n,n-n instead of n-n-n-n
  Corrections in code in FUCIF.c to preserve empty lines in a more consistent way.
  Corrections in ReadFile.cpp to handle the appearance of "global_" within loops
  Changes in menus.cpp allowing ciffold to be used without any arguments by opening stdin and stdout and passing stdin to stdout without altering it.
  Fixed the MAP option to not be selected by default.
  Disabled the warning "#_#WARNING: AMBIGUOUS STRING DELIMETER CHANGED TO AN ALTERNATIVE ONE(\' to \" or \"to \')"
  Made the error messages and warnings go to stderr.

* Release 0.2, 19 April 2005 HJB
  Corrections to handling of command line input, enabling - as an indicator for standard input or standard output to allow the use of ciffold as a filter.

* Release 0.1, 16 April 2005 KM, GT and HJB
  Initial pre-release.

16. Known Bugs

* In some cases, mapped files are not reconstructed correctly. Use of the -M option is not recommended at this time.

* The temporary file is not always cleaned up.
* The line length of the MAP file is not restricted and can exceed the maximum allowed line length.

* The CIFFOLD 0.3 release appears to fold and unfold correctly formatted CIFs but, in some cases, invalid CIFs cause segmentation faults instead of producing validation messages. The known cases have been addressed in CIFFOLD 0.4, but caution is advised.

* Some combinations of the options -M -x -e -c will format the file incorrectly, which does not necessarily result in an invalid cif.

Written by K. Mitev, 15 April 2005, revised, H. J. Bernstein, 16 April 2005, 19 April 2005, K. Mitev, 22 April 2005, H. J. Bernstein, K. Mitev, G. Todorov, 27 April 2005, G. Todorov 28 April 2005, K. Mitev 29 April 2005, K. Mitev 6 May 2005, H. J. Bernstein, 7 May 2005, K. Mitev, H. J. Bernstein 14 May 2005, K. Mitev 31 May 2005, K. Mitev 11 July 2005, H. J. Bernstein 22 July 2005, H. J. Bernstein 23 July 2005, G. Todorov, H. J. Bernstein 25 July 2005, H. J. Bernstein 1 August 2005, H. J. Bernstein 30 September 2005, K. Mitev, H. J. Bernstein 1 February 2006