[ddlm-group] options/text vs binary/end-of-line

Now to deal with the real issues: should CIF2 allow multiple
optional representations?  Is CIF2 a binary file or a text file?  And
how do we treat end-of-line?

The code point for the end of line in a "normal" unix-style UTF-8 file is
U+000A (LF or NL), but all of the following are also used as line
terminators (see http://en.wikipedia.org/wiki/Newline):

   U+000C (FF)
   U+000D (CR)
   U+000D U+000A (CR LF)
   U+0085 NEL
   U+2028 LS
   U+2029 PS

There are system-dependent problems and conflicts with some of these
characters:  for example, the byte 0x85 that encodes NEL in ISO-8859-1
is an ellipsis character in the Windows-1252 code page.
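
As a rough illustration of the list above (a sketch only, not part of
any CIF2 proposal), the following Python checks which of those code
points the built-in str.splitlines() treats as a line break, and shows
how the single byte 0x85 decodes under two common single-byte encodings.
The names TERMINATORS and splits_line are made up for this example.

```python
# Code points listed above as line terminators.
# CR LF is a two-code-point sequence treated as one terminator.
TERMINATORS = [
    "\u000A",        # LF
    "\u000C",        # FF
    "\u000D",        # CR
    "\u000D\u000A",  # CR LF
    "\u0085",        # NEL
    "\u2028",        # LS
    "\u2029",        # PS
]


def splits_line(term: str) -> bool:
    """Return True if str.splitlines() treats `term` as exactly one line break."""
    return ("a" + term + "b").splitlines() == ["a", "b"]


# Map each terminator to whether Python splits on it.
results = {term: splits_line(term) for term in TERMINATORS}

# The same byte means different things in different legacy encodings:
# ISO-8859-1 maps 0x85 to U+0085 (NEL), Windows-1252 maps it to U+2026.
nel_as_latin1 = b"\x85".decode("latin-1")
nel_as_cp1252 = b"\x85".decode("cp1252")

if __name__ == "__main__":
    for term, ok in results.items():
        print([f"U+{ord(c):04X}" for c in term], ok)
    print(f"latin-1: U+{ord(nel_as_latin1):04X}  cp1252: U+{ord(nel_as_cp1252):04X}")
```

Python is only one example of a text processing system; other tools
recognize different subsets of these terminators, which is the
system dependence at issue here.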

The proponents of a rigid binary CIF2 format for the actual files,
as opposed to going back to CIF being a text file with multiple
system-dependent encodings, need to consider whether they are going
to restrict "valid" CIF2 to the world of unix.  Or shall we perhaps
allow people working with text editors on MS windows machines and
Macs to produce "valid" CIF2 files directly?  That would mean bending
a little and, instead of mandating the external representation of a
CIF2 so rigidly, allowing some reasonable range of text files that map
cleanly to and from the sequences of unicode code points currently
specified in the proposal.

To be specific, I propose that the paragraph that now reads:

"CIF2 files are standard variable length binary files, but for historical 
reasons will have a maximum record length of 2048 bytes. In a general 
sense the contents of the file are characters that are encoded in UTF8, 
however there are some restrictions on the character set for token 
delimiters, separators and for data names."

be changed to read

"CIF2 is a specification for the interchange of text files.  Text files
have many possible system-dependent representations and encodings.  To
ensure clarity in the specification of CIF2, this document is written
in terms of a sequence of unicode code points, and all fully compliant
CIF2 processing systems should, at a minimum, be able to process
text files as unicode code points represented in UTF-8, subject to the
XML-based restrictions below.  This approach is not meant to prevent
people from preparing valid CIF2 files with non-UTF-8-based text
editors, but, if a non-UTF-8 file format is produced, it is important
to clearly specify the intended mapping to UTF-8.  This is particularly
important in dealing with end-of-line indicators (see 
http://en.wikipedia.org/wiki/Newline).  When handling CIF2 files
produced under MS windows, CR-LF sequences should be accepted as
an alternative to LF, and when handling CIF2 files produced under
Mac OS, CR should be accepted as an alternative to LF.  This document
will only refer to LF as a line terminator and will assume that some
appropriate system-dependent text processing system will handle
the necessary conversion.

To ensure compatibility with older Fortran text processing software,
lines in CIF2 files should be restricted to no more than 2048
code points in length, not including the line terminator itself.
Note that the UTF-8 encoding of such a line may well be much longer."
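
To make the proposed wording concrete, here is a minimal Python sketch
of the two rules in it: accept CR LF and lone CR as alternatives to LF,
and limit lines to 2048 code points excluding the terminator.  The
function names (normalize_eol, overlong_lines, utf8_length) are
hypothetical, chosen for this example, and the sketch assumes the file
has already been decoded to a Python str, whose length is counted in
code points.

```python
from typing import List

MAX_LINE_CODE_POINTS = 2048  # limit taken from the proposed paragraph


def normalize_eol(text: str) -> str:
    """Map CR LF and lone CR to LF; CR LF is handled first so it yields one LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def overlong_lines(text: str, limit: int = MAX_LINE_CODE_POINTS) -> List[int]:
    """Return the 1-based numbers of lines longer than `limit` code points.

    The line terminator itself is not counted, as in the proposed text.
    """
    lines = normalize_eol(text).split("\n")
    return [i for i, line in enumerate(lines, start=1) if len(line) > limit]


def utf8_length(line: str) -> int:
    """Return the number of bytes `line` occupies when encoded as UTF-8."""
    return len(line.encode("utf-8"))


if __name__ == "__main__":
    sample = "x" * 2048 + "\r\n" + "y" * 2049 + "\r" + "short\n"
    print(overlong_lines(sample))            # line numbers over the limit
    print(utf8_length("\U0001D11E" * 2048))  # bytes for 2048 four-byte code points
```

Since UTF-8 uses at most four bytes per code point, a line at the
2048 code point limit can occupy up to 8192 bytes, which is the sense
in which the encoded line "may well be much longer".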






=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group