Discussion List Archives

Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. ...

On first reading, I am tempted to support this, recognizing that
this issue is going to require a deal of effort to accommodate current
practice whatever is specified, and hoping that considerable support
(documentation and software utilities) will be available to both users and developers
(e.g. along the lines of Brian's recent contribution).

However, I would like UTF-8 to be established as the default encoding that all
CIF processors should be able to handle and should assume in the absence of
any pointers to the contrary (but then this could be championed within the supporting
material if it cannot find a place in the formal specification).

Cheers

Simon



From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@iucr.org>
Sent: Friday, 17 September, 2010 19:34:08
Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .. .

It may help this discussion to refer to the CIF 1.1 syntax specification,
which says:

Character set

22. Characters within a CIF are restricted to certain printable or
white-space characters. Specifically, these are the ones located in
the ASCII character set at decimal positions 09 (HT or horizontal
tab), 10 (LF or line feed), 13 (CR or carriage return) and the
letters, numerals and punctuation marks at positions 32-126.

The ASCII characters at decimal positions 11 (VT or vertical tab) and
12 (FF or form feed), often included in library implementations as
white space characters, are explicitly excluded from the CIF
character set at this revision.
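
As an illustration only (it is not part of the CIF 1.1 text), the
character set of paragraph 22 could be checked along the following lines.
The function name cif11_violations is hypothetical, and the sketch assumes
the file is examined as raw bytes.

```python
# Code positions permitted by CIF 1.1 paragraph 22: HT (9), LF (10), CR (13)
# and the printable ASCII range 32-126. VT (11) and FF (12) are excluded.
CIF11_ALLOWED = frozenset({9, 10, 13}) | frozenset(range(32, 127))


def cif11_violations(data: bytes):
    """Return (offset, byte value) pairs for bytes outside the CIF 1.1 set."""
    return [(i, b) for i, b in enumerate(data) if b not in CIF11_ALLOWED]


if __name__ == "__main__":
    # VT and FF are reported even though many libraries treat them as white space.
    print(cif11_violations(b"a\x0bb\x0c"))
```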

23. The reference to the ASCII character set is specifically to
identify characters in an established and widely available standard.
It is understood that CIFs may be constructed and maintained on
computer platforms that implement other character-set encodings.
However, for maximum portability only the characters identified in
the section above may be used. Other printable characters, even if
available in an accessible character set such as Unicode, must be
indicated by some encoding mechanism using only the permitted
characters. At this revision, only the encoding convention detailed
in paragraphs 30-37 of the document Common semantic features is
recognised for this purpose.

To end this promptly and get on with actually using CIF2, I formally
propose a vote on the following wording, which combines what has
already been put forth in "CIF Changes to the specification
05 July 2010" with the beginning of the CIF 1.1 syntax specification
paragraph 23, and I propose that we leave all the remaining details on
how best to deal with multiple character encodings for future discussion.

===============================================================

Proposed position on CIF2 character encodings submitted to
COMCIFS for a vote as an interim agreement on what can be
agreed thus far, subject to extension and refinement in
the future.

===============================================================

Reference to character(s) means abstract characters assigned code
points by Unicode.  Specific characters are referenced according to
Unicode convention, U+xxxx[x[x]], where xxxx[x[x]] is the four- to
six-digit hexadecimal representation of the assigned code point.
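
For illustration, the U+xxxx[x[x]] notation can be produced from a
character as sketched here. The helper name u_plus is hypothetical; the
only assumption is the usual Unicode convention of at least four
upper-case hexadecimal digits.

```python
def u_plus(ch: str) -> str:
    """Format one character as U+ followed by four to six hex digits."""
    # :04X pads to four digits and grows naturally to five or six.
    return f"U+{ord(ch):04X}"


if __name__ == "__main__":
    for ch in ("A", "\u00e9", "\U0001F600"):
        print(u_plus(ch))
```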

The designated character encoding for CIF2 is UTF-8, which is the
preferred concrete representation of the information in a CIF2 document.

Reference to ASCII characters means characters U+0000 through U+007F, or,
equivalently, the first 128 characters of the ISO-8859-1 (LATIN-1)
character set.

Reference to newline or \n means the sequence that conventionally
terminates a line record (which is environment dependent).
Reference to whitespace means the characters ASCII space (U+0020),
ASCII horizontal tab (U+0009) and the newline characters. Without
regard to local convention, the various other characters that
Unicode classifies as whitespace (character categories Zs and Zp) do
not constitute whitespace for the purposes of CIF2.
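
A minimal sketch of the practical consequence, assuming newline has
already been reduced to LF and/or CR: a processor cannot rely on a
language's built-in notion of whitespace. In Python, str.split() with no
argument also splits on characters such as U+00A0, so an explicit
character class is used below. The function name is hypothetical, and
this is not a CIF tokenizer (it ignores quoting).

```python
import re

# Only space, horizontal tab and the newline characters count as whitespace.
_CIF2_WS = re.compile(r"[ \t\r\n]+")


def split_on_cif2_whitespace(text: str):
    """Split on CIF2 whitespace only, leaving other Unicode spaces intact."""
    return [tok for tok in _CIF2_WS.split(text) if tok]


if __name__ == "__main__":
    sample = "a\u00a0b c\td\n"
    print(split_on_cif2_whitespace(sample))  # U+00A0 stays inside the first token
    print(sample.split())                    # built-in split also breaks at U+00A0
```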

CIF2 files are standard variable length text files, which for
compatibility with older processing systems will have a maximum line
length of 2048 characters. As discussed above and below, however,
there are some restrictions on the character set for token
delimiters, separators and data names.
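
As a sketch only, the line-length limit could be checked as follows. It
assumes that "characters" means Unicode code points rather than bytes,
that the terminating newline is not counted, and that newlines have
already been normalised to LF; none of this is settled by the wording
above. The name overlong_lines is hypothetical.

```python
MAX_LINE_LENGTH = 2048  # characters, as stated in the proposed wording


def overlong_lines(text: str, limit: int = MAX_LINE_LENGTH):
    """Return the 1-based numbers of lines longer than `limit` characters."""
    # Lines are split on LF only; the newline itself is not counted.
    return [
        number
        for number, line in enumerate(text.split("\n"), start=1)
        if len(line) > limit
    ]


if __name__ == "__main__":
    print(overlong_lines("short\n" + "x" * 2049 + "\nshort"))
```

Counting code points rather than bytes matters for UTF-8: a line of 2048
accented letters is within the limit under this reading even though it
occupies more than 2048 bytes.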

References to Unicode and UTF-8 are specifically to identify characters
and a concrete representation of those characters in an established and
widely available standard.  It is understood that CIF2 documents may
be constructed and maintained on computers that implement other character
encodings.  However, for maximum portability, only the clearly
identified equivalents to the Unicode characters identified above and
below should be used, and use of UTF-8 for a concrete representation is
highly recommended.

A CIF2 file is uniquely identified by a required magic code at the
beginning of its first line. The code is #\#CIF_2.0, followed
immediately by whitespace.  The addition of further information
to assist in disambiguation among multiple character sets is
under discussion.  Encodings such as UTF-16, which prefix a file
with a BOM (byte order mark) or other encoding disambiguation
prefix, are not precluded.  In such a case, the magic code should
follow the encoding disambiguation prefix.
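
A minimal sketch of magic-code detection, assuming the file content has
already been decoded to text and that a leading U+FEFF is the only
encoding disambiguation prefix in play. The name has_cif2_magic is
hypothetical, and the strict reading that at least one whitespace
character must follow the code is an assumption.

```python
MAGIC = "#\\#CIF_2.0"       # the ten characters  #\#CIF_2.0
CIF2_WHITESPACE = " \t\n\r"  # space, tab and the newline characters


def has_cif2_magic(text: str) -> bool:
    """True if text starts with the magic code followed by whitespace."""
    if text.startswith("\ufeff"):  # skip a byte order mark, if present
        text = text[1:]
    if not text.startswith(MAGIC):
        return False
    following = text[len(MAGIC):len(MAGIC) + 1]
    # An empty string would pass a bare "in" test, so check it explicitly.
    return following != "" and following in CIF2_WHITESPACE


if __name__ == "__main__":
    print(has_cif2_magic("#\\#CIF_2.0\ndata_example\n"))
    print(has_cif2_magic("#\\#CIF_1.1\ndata_example\n"))
```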

In keeping with XML restrictions we allow the characters

U+0009 U+000A U+000D
U+0020 -- U+007E
U+00A0 -- U+D7FF
U+E000 -- U+FDCF
U+FDF0 -- U+FFFD
U+10000 -- U+10FFFD

In addition, character U+FEFF and characters U+xFFFE or U+xFFFF, where
x is any hexadecimal digit, are disallowed. Unicode reserves the code
points U+E000 -- U+F8FF for private use. The IUCr and only the IUCr may
specify what characters are assigned to these code points in the context
of CIF2.
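
For illustration, the allowed ranges and the additional exclusions can be
expressed as a code point test. This is a sketch under the stated
wording, not a reference implementation; the names ALLOWED_RANGES,
is_allowed_cif2_char and first_disallowed are hypothetical.

```python
# Inclusive code point ranges copied from the list of allowed characters.
ALLOWED_RANGES = [
    (0x0009, 0x000A),
    (0x000D, 0x000D),
    (0x0020, 0x007E),
    (0x00A0, 0xD7FF),
    (0xE000, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0x10FFFD),
]


def is_allowed_cif2_char(ch: str) -> bool:
    """True if the single character ch is permitted by the ranges above."""
    cp = ord(ch)
    # Explicit exclusions: U+FEFF and every U+xFFFE / U+xFFFF.
    if cp == 0xFEFF or (cp & 0xFFFF) >= 0xFFFE:
        return False
    return any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)


def first_disallowed(text: str):
    """Return the index of the first disallowed character, or None."""
    for index, ch in enumerate(text):
        if not is_allowed_cif2_char(ch):
            return index
    return None


if __name__ == "__main__":
    print(first_disallowed("data_x\n_name 'caf\u00e9'\n"))
    print(first_disallowed("bad\x0cform feed"))
```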

CIF2 processors are required to treat <U+000A>, <U+000D> and
<U+000D><U+000A> as newline characters, by normalising them to
<U+000A> on read. No other characters or character sequences may
represent newline. In particular, CIF2 processors should not
interpret the Unicode characters U+2028 (line separator) or U+2029
(paragraph separator) as newline.
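
A minimal sketch of the normalisation on read, assuming the content is
already decoded text. The function names are hypothetical. Note that
Python's str.splitlines() is unsuitable here because it also breaks
lines at U+2028, U+2029, VT and FF, so the sketch splits on LF
explicitly.

```python
def normalise_newlines(text: str) -> str:
    """Map CR LF and lone CR to LF; leave every other character alone."""
    # Replace the two-character sequence first so it yields a single LF.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def split_cif2_lines(text: str):
    """Split into lines on the three recognised newline forms only."""
    return normalise_newlines(text).split("\n")


if __name__ == "__main__":
    sample = "one\r\ntwo\rthree\u2028still three\nfour"
    print(split_cif2_lines(sample))
```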



--
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================
_______________________________________________
cif2-encoding mailing list
cif2-encoding@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif2-encoding
