Re: Revised draft of CIF 1.1 syntax document

  • Subject: Re: Revised draft of CIF 1.1 syntax document
  • From: Brian McMahon <bm@xxxxxxxx>
  • Date: Tue, 24 Sep 2002 11:32:25 +0100 (BST)

Following the latest round of discussions during this review, I intend to
make the following changes to the draft specification document. It may take
a little while to implement them, but input is welcomed on the proposals in
the meantime.

1. Remove the STAR *use* of stop_ as a loop or loop header delimiter, but
   retain it as a reserved word.

   Reason: while there is some support for using stop_, there is even more
   passionate argument against it. I am particularly persuaded by Brian
   Toby's comment that adding a feature to CIF that is formally
   unnecessary will have the practical effect of giving people more ways
   to break the files.

2. Change the definition of semicolon-delimited text values to *include* the 
   terminal newline.

   Reason: I am swayed by John Bollinger's reference to the STAR emphasis on
   lines of text. For those to whom it matters, the semicolons allow a
   distinction to be drawn between inline and line-delimited
   strings. Specific applications may if desired choose to elide the terminal
   newline - I see now that effectively that is what ciftex does.

   However, there is a possible way of excluding the terminal newline, which 
   I shall refer to in a different context below.
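
   As a rough illustration of proposal (2), here is a minimal Python sketch.
   The function name and the keep_terminal_newline flag are hypothetical and
   not part of the draft; the sketch also assumes the field is supplied as
   the raw lines from the opening semicolon to the closing one, with nothing
   following the closing semicolon.

```python
def text_field_value(raw: str, keep_terminal_newline: bool = True) -> str:
    """Return the value of one semicolon-delimited text field.

    `raw` runs from the opening ';' line to the closing ';' line.
    Hypothetical helper for illustration only.
    """
    lines = raw.split("\n")
    # A trailing newline after the closing ';' leaves an empty last element.
    if lines and lines[-1] == "":
        lines.pop()
    if len(lines) < 2 or not lines[0].startswith(";") or lines[-1] != ";":
        raise ValueError("not a semicolon-delimited text field")
    # Text may start on the opening line, immediately after the ';'.
    body = [lines[0][1:]] + lines[1:-1]
    value = "\n".join(body)
    # Proposal (2): the newline before the closing ';' belongs to the value.
    # An application may still elide it, as ciftex effectively does.
    return value + "\n" if keep_terminal_newline else value


field = ";first line\nsecond line\n;\n"
print(repr(text_field_value(field)))
print(repr(text_field_value(field, keep_terminal_newline=False)))
```

   The first call keeps the terminal newline; the second shows the elision
   that an application may choose to perform.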

3. Amend the productions for number values to permit 1e5 as valid. Greg
   Shields has pointed out to me that this was an error already flagged that I
   had forgotten to correct.

4. Review the representations for floating point numbers in scientific
   notation. I wish to exclude the version that permits an exponent to be
   identified solely by an embedded +/- sign. (Reason: Greg has pointed
   out that 12-14, which is often entered erroneously by authors intending
   to specify a range of values, could be parsed as a (rather small) number.)

   I would prefer to retain only the 'e' notation to express an exponential
   with machine-independent precision.

   Reason: if we retain the 'd' notation also, the assumption would be that
   a distinction should be made between e and d in the manner of the IEEE 754
   standard (referenced below) for floating-point representations. This then
   makes rather specific statements about machine storage (and also raises
   such questions as whether NaN should be included as a valid string value
   for a floating-point number representation).

   I am however troubled by the fact that Herbert is already using 'd', and
   would like to know more about how it affects internal storage in his
   applications. I am amenable to further debate on this point.
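
   To make the effect of proposals (3) and (4) concrete, here is a minimal
   Python sketch. The regular expression is an assumption for illustration
   and is not the draft's actual production: it requires any exponent to be
   introduced by 'e' or 'E', and it omits other details of CIF number
   values, such as a standard uncertainty in parentheses.

```python
import re

# Assumed pattern, for illustration only: optional sign, digits with an
# optional decimal point, and an optional exponent that must start with
# 'e' or 'E'. A bare embedded +/- sign or a 'd' does not mark an exponent.
NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def is_number(token: str) -> bool:
    """True if the whole token matches the assumed number pattern."""
    return NUMBER.fullmatch(token) is not None


for token in ["1e5", "1.2e-3", "12-14", "1d5"]:
    print(token, is_number(token))
```

   Under this pattern 1e5 is accepted, while 12-14 and 1d5 are rejected and
   would be treated as ordinary character strings.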

      The IEEE Floating Point Standard (IEEE 754) is an IEEE standard, used
      by many CPUs and FPUs, which defines formats for representing floating
      point numbers; representations of special values (i.e. zero, infinity,
      very small values (denormal? numbers), and bit combinations that
      don't represent a number (NaN)); five exceptions, when they occur, and
      what happens when they do occur; four rounding modes; and a set of
      floating-point operations that will work identically on any conforming
      system.
 
      IEEE 754 specifies four formats for representing floating-point values:
      single-precision (32-bit), double-precision (64-bit), single-extended
      precision (>= 43-bit, not commonly used) and double-extended precision
      (>= 79-bit, usually implemented with 80 bits). Only 32-bit values are
      required by the standard; the others are optional. Many languages
      specify that they implement IEEE arithmetic, although sometimes it is
      optional. The C programming language for example allows but does not
      require IEEE arithmetic. IEEE is commonly used in C where float
      implements IEEE single precision and double implements IEEE double
      precision.
 
      Also known as IEEE Standard for Binary Floating-Point Arithmetic
      (ANSI/IEEE Std 754-1985) and IEC 559: "Binary floating-point arithmetic
      for microprocessor systems".


5. Lastly, I want to introduce into the *semantics* document a protocol for
escaping newlines in text fields (and in comments). This idea has been
discussed before by COMCIFS members and has had a chequered history.
Nevertheless, I think the current work that CCDC are doing on their 
CIF editor demonstrates again its usefulness. The idea is that for a text
field or a comment line, a convention is introduced that allows the
end-of-line to be escaped (i.e. ignored) and the text on the following line
to be concatenated to the current line.

Why is this useful?
 - It allows one to preprocess a CIF 1.1 file with long lines of text
   (> 80 characters) and fold them into the 80-character limit of CIF 1.0
   without loss of information. Thus the 'folded' file can be processed by
   older CIF 1.0 software with 80-character line buffers. If needed, a
   postprocessor can reconstitute the longer lines.
 - Similar processing can wrap text into still narrower columns
   (sometimes needed even today, as text files are autowrapped by certain
   mailers to 72 characters or fewer).
 - Even with the more generous line lengths in CIF 1.1, it may be
   necessary to handle strings longer than 2048 characters (for example a
   protein amino acid sequence or a very complex systematic chemical name).
   The protocol then allows wrapping into the 2048-character buffer.
 - The CCDC editor may be required to import a text stream from a word
   processor document where embedded newlines are not used.

The arguments against this convention in an earlier round of discussion had
to do with the burden of accommodating such additional processing within any 
CIF application. However, only applications that really need to handle (in
some sense 'understand') the contents of text fields have to worry about
this. For many applications text fields are not parsed for content, and can
simply be passed through the application unchanged. Standalone utilities
will be provided to perform the line wrapping or unwrapping.
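
The wrapping and unwrapping described above can be sketched in Python as
follows. This message does not define the escape convention, so the sketch
assumes, purely for illustration, that a backslash as the last character of a
line means "join with the next line". The function names fold and unfold are
hypothetical. A line that already ends in a backslash would be ambiguous
under this simple scheme, and a real protocol would need to deal with that.

```python
def fold(text: str, width: int = 80, esc: str = "\\") -> str:
    """Break lines longer than `width`, marking each break with `esc`.

    Assumed convention for illustration: a trailing `esc` escapes the
    end-of-line. Requires width >= 2.
    """
    out = []
    for line in text.split("\n"):
        while len(line) > width:
            # Leave room for the escape character within the width limit.
            out.append(line[: width - 1] + esc)
            line = line[width - 1:]
        out.append(line)
    return "\n".join(out)


def unfold(text: str, esc: str = "\\") -> str:
    """Rejoin lines that fold() broke, removing the escape characters."""
    out = []
    pending = ""
    for line in text.split("\n"):
        if line.endswith(esc):
            # Escaped end-of-line: drop the marker and keep accumulating.
            pending += line[: -len(esc)]
        else:
            out.append(pending + line)
            pending = ""
    if pending:
        out.append(pending)
    return "\n".join(out)


long_text = "A" * 200 + "\nshort line"
folded = fold(long_text, width=80)
print(max(len(line) for line in folded.split("\n")))
print(unfold(folded) == long_text)
```

The same pair of functions would serve for the 2048-character case by
passing width=2048.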

I shall send out a more complete description of the proposal separately,
because it should be discussed in the context of semantics - the meaning of 
the content of a text field - rather than syntax. I mention it here
because it does provide a method for eliding the terminal newline of
a text field at a semantic level, if such a result is needed in the
light of proposal (2) above.

Regards
Brian
