Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .... .
- To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@xxxxxxxx>
- Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .... .
- From: "Herbert J. Bernstein" <yaya@xxxxxxxxxxxxxxxxxxxxxxx>
- Date: Fri, 17 Sep 2010 14:34:08 -0400
- In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDCF@SJMEMXMBS11.stjude.sjcrh.local>
- References: <AANLkTilyJE2mCxprlBYaSkysu1OBjY7otWrXDWm3oOT9@mail.gmail.com><8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local> <AANLkTikTee4PicHKjnnbAdipegyELQ6UWLXz9Zm08aVL@mail.gmail.com><8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local> <AANLkTinZ4KNsnREOOU6sVFdGYR_aQHcjdWr_ko648NGm@mail.gmail.com><8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local> <AANLkTintziXhwVCEFD0yUtTDo9KG8ut=oL4OgmkjmEBe@mail.gmail.com><alpine.BSF.2.00.1008240629120.23114@epsilon.pair.com><AANLkTi=+qZQrWJ3duOzWyPq5H=w1GOVbeKRfFLTR8u5a@mail.gmail.com><alpine.BSF.2.00.1008240920580.23114@epsilon.pair.com><AANLkTikRLKp6oREvD4KcgUd-H-Cu6xoOrGWgQE1zUyx7@mail.gmail.com><alpine.BSF.2.00.1009022333190.52468@epsilon.pair.com><AANLkTimLUnUjNuS9EmMbtTurxB3MGtGvM6gWxZw6aRLE@mail.gmail.com><alpine.BSF.2.00.1009030735110.95035@epsilon.pair.com><AANLkTinxkquC5cY0m23yzBVgm7afmYYfh6+2yMz=Hr_w@mail.gmail.com><alpine.BSF.2.00.1009100711070.59446@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local> <AANLkTikuoQEU-rv9GkTqqc0u0qgd1ugf+cGTfqF77j-E@mail.gmail.com><8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local> <AANLkTiks-tEAU9T_ygwvNhs_YpzE1+ZVb=K_=0DT8UuK@mail.gmail.com><8F77913624F7524AACD2A92EAF3BFA5416659DEDCF@SJMEMXMBS11.stjude.sjcrh.local>
It may help this discussion to refer to the CIF 1.1 syntax specification, which says:

  Character set

  22. Characters within a CIF are restricted to certain printable or white-space characters. Specifically, these are the ones located in the ASCII character set at decimal positions 09 (HT or horizontal tab), 10 (LF or line feed), 13 (CR or carriage return) and the letters, numerals and punctuation marks at positions 32-126. The ASCII characters at decimal positions 11 (VT or vertical tab) and 12 (FF or form feed), often included in library implementations as white space characters, are explicitly excluded from the CIF character set at this revision.

  23. The reference to the ASCII character set is specifically to identify characters in an established and widely available standard. It is understood that CIFs may be constructed and maintained on computer platforms that implement other character-set encodings. However, for maximum portability only the characters identified in the section above may be used. Other printable characters, even if available in an accessible character set such as Unicode, must be indicated by some encoding mechanism using only the permitted characters. At this revision, only the encoding convention detailed in paragraphs 30-37 of the document Common semantic features is recognised for this purpose.

To end this promptly and get on with actually using CIF2, I formally propose a vote on the following wording, which combines what has already been put forth in "CIF Changes to the specification 05 July 2010" with the beginning of the CIF 1.1 syntax specification paragraph 23, and I propose that we leave all the remaining details on how best to deal with multiple character encodings for future discussion.

===============================================================
Proposed position on CIF2 character encodings submitted to COMCIFS for a vote as an interim agreement on what can be agreed thus far, subject to extension and refinement in the future.
===============================================================

Reference to character(s) means abstract characters assigned code points by Unicode. Specific characters are referenced according to Unicode convention, U+xxxx[x[x]], where xxxx[x[x]] is the four- to six-digit hexadecimal representation of the assigned code point. The designated character encoding for CIF2 is UTF-8 as the preferred concrete representation of the information in a CIF2 document.

Reference to ASCII characters means characters U+0000 through U+007F, or, equivalently, the first 128 characters of the ISO-8859-1 (LATIN-1) character set.

Reference to newline or \n means the sequence that conventionally terminates a line record (which is environment dependent).

Reference to whitespace means the characters ASCII space (U+0020), ASCII horizontal tab (U+0009) and the newline characters. Without regard to local convention, the various other characters that Unicode classifies as whitespace (character categories Zs and Zp) do not constitute whitespace for the purposes of CIF2.

CIF2 files are standard variable length text files, which for compatibility with older processing systems will have a maximum line length of 2048 characters. As discussed above and below, however, there are some restrictions on the character set for token delimiters, separators and data names.

References to Unicode and UTF-8 are specifically to identify characters and a concrete representation of those characters in an established and widely available standard. It is understood that CIF2 documents may be constructed and maintained on computers that implement other character encodings. However, for maximum portability only the clearly identified equivalents to the Unicode characters identified above and below should be used, and use of UTF-8 for a concrete representation is highly recommended.

A CIF2 file is uniquely identified by a required magic code at the beginning of its first line.
The code is #\#CIF_2.0, followed immediately by whitespace. The addition of further information to assist in disambiguation among multiple character sets is under discussion. Encodings, such as UTF-16, which prefix a file by a BOM (byte-order mark) or other encoding disambiguation prefix are not precluded. In such a case, the magic code should follow the encoding disambiguation prefix.

In keeping with XML restrictions we allow the characters

  U+0009
  U+000A
  U+000D
  U+0020 -- U+007E
  U+00A0 -- U+D7FF
  U+E000 -- U+FDCF
  U+FDF0 -- U+FFFD
  U+10000 -- U+10FFFD

In addition, character U+FEFF and characters U+xFFFE or U+xFFFF, where x is any hexadecimal digit, are disallowed. Unicode reserves the code points E000 - F8FF for private use. The IUCr and only the IUCr may specify what characters are assigned to these code points in the context of CIF2.

CIF2 processors are required to treat <U+000A>, <U+000D> and <U+000D><U+000A> as newline characters, by normalising them to <U+000A> on read. No other characters or character sequences may represent newline. In particular, CIF2 processors should not interpret the Unicode characters U+2028 (line separator) or U+2029 (paragraph separator) as newline.

--
=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya@dowling.edu
=====================================================
_______________________________________________
cif2-encoding mailing list
cif2-encoding@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif2-encoding
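
As an illustration only, the magic-code, allowed-character and newline rules in the proposed wording above might be sketched in Python roughly as follows. This is not part of the proposal and is not taken from any existing CIF software: the function names (normalize_newlines, has_magic_code, is_allowed, find_disallowed) are hypothetical, the input is assumed to be text that has already been decoded (for example from UTF-8), and any leading BOM is assumed to have been removed by the decoder.

```python
# Illustrative sketch of the proposed CIF2 character rules.
# All names here are hypothetical; input is assumed to be decoded text.

MAGIC = "#\\#CIF_2.0"          # the magic code, i.e. the characters #\#CIF_2.0
CIF2_WHITESPACE = " \t\n"      # space, horizontal tab, normalised newline

# Allowed code point ranges, copied from the proposed wording.
ALLOWED_RANGES = [
    (0x0009, 0x0009),
    (0x000A, 0x000A),
    (0x000D, 0x000D),
    (0x0020, 0x007E),
    (0x00A0, 0xD7FF),
    (0xE000, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0x10FFFD),
]


def normalize_newlines(text):
    """Map <CR><LF> and lone <CR> to <LF>; nothing else is a newline."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def has_magic_code(text):
    """True if text starts with the magic code followed by whitespace."""
    text = normalize_newlines(text)
    if not text.startswith(MAGIC):
        return False
    following = text[len(MAGIC):len(MAGIC) + 1]
    return following != "" and following in CIF2_WHITESPACE


def is_allowed(ch):
    """True if the single character ch is in the proposed character set."""
    cp = ord(ch)
    # U+FEFF and every U+xFFFE / U+xFFFF are explicitly disallowed.
    if cp == 0xFEFF or (cp & 0xFFFF) in (0xFFFE, 0xFFFF):
        return False
    return any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)


def find_disallowed(text):
    """Return (index, character) pairs for every disallowed character."""
    return [(i, ch) for i, ch in enumerate(text) if not is_allowed(ch)]
```

A reader would presumably call normalize_newlines first and then find_disallowed on the result. Note that under this sketch U+2028 and U+2029 pass the character check but are never treated as newlines, which is how I read the last paragraph of the wording.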