From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@iucr.org>
Sent: Thursday, 30 September, 2010 14:40:21
Subject: [Cif2-encoding] Revised Motion
Dear Colleagues,
James and I had a good e-meeting and came up with the following
revised wording. If anybody objects to this motion, please speak
up now. We intend to bring this to the DDLm group and then
COMCIFS, and to add annexes on particular encoding-disambiguation
algorithms and signatures later.
Regards,
Herbert
===============================================================
Proposed position on CIF2 character encodings submitted to
COMCIFS for a vote as an interim agreement on what can be
agreed thus far, subject to extension and refinement in
the future.
===============================================================
Reference to character(s) means abstract characters assigned code
points by Unicode. Specific characters are referenced according to
Unicode convention, U+xxxx[x[x]], where xxxx[x[x]] is the four- to
six-digit hexadecimal representation of the assigned code point.
The designated character encoding for CIF2 is UTF-8 as the preferred
concrete representation of the information in a CIF2 document.
Reference to ASCII characters means characters U+0000 through U+007F, or,
equivalently, the first 128 characters of the ISO-8859-1 (LATIN-1)
character set.
Reference to newline or \n means the sequence that conventionally
terminates a line record (which is environment dependent).
Reference to whitespace means the characters ASCII space (U+0020),
ASCII horizontal tab (U+0009) and the newline characters. Without
regard to local convention, the various other characters that
Unicode classifies as whitespace (character categories Zs and Zp) do
not constitute whitespace for the purposes of CIF2.
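
As a rough illustration of the whitespace rule above, here is a minimal Python sketch. The names CIF2_WHITESPACE and is_cif2_whitespace are hypothetical, and treating both U+000A and U+000D as "the newline characters" is an assumption drawn from the newline paragraph later in this motion.

```python
# Hypothetical helper: the only characters CIF2 treats as whitespace.
# Assumption: "the newline characters" are U+000A and U+000D.
CIF2_WHITESPACE = frozenset(" \t\n\r")


def is_cif2_whitespace(ch: str) -> bool:
    """Return True only for space, horizontal tab, LF and CR."""
    return ch in CIF2_WHITESPACE


# Python's own str.isspace() follows Unicode, so it is broader than CIF2:
# U+00A0 (no-break space) is whitespace for Python but not for CIF2.
print(is_cif2_whitespace(" "), is_cif2_whitespace("\u00a0"))
print(" ".isspace(), "\u00a0".isspace())
```

A parser written this way should avoid str.isspace() and str.split() with no argument, since both use the wider Unicode definition.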
CIF2 files are standard variable length text files, which for
compatibility with older processing systems will have a maximum line
length of 2048 characters. As discussed above and below, however,
there are some restrictions on the character set for token
delimiters, separators and data names.
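
The line-length limit could be checked along these lines. This is only a sketch: the function name is hypothetical, and it assumes that newlines have already been normalised to U+000A and that the limit counts code points excluding the line terminator, neither of which the motion spells out.

```python
MAX_LINE_LENGTH = 2048  # limit stated in the motion


def overlong_lines(text: str) -> list[int]:
    """Return 1-based numbers of lines longer than MAX_LINE_LENGTH.

    Assumptions: newlines are already normalised to U+000A, and length
    is counted in code points, not bytes, without the terminator.
    """
    return [
        number
        for number, line in enumerate(text.split("\n"), start=1)
        if len(line) > MAX_LINE_LENGTH
    ]
```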
References to Unicode and UTF-8 are specifically to identify characters
and a concrete representation of those characters in an established and
widely available standard. It is understood that CIF2 documents may
be constructed and maintained on computers that implement other character
encodings. However, for maximum portability only the clearly
identified equivalents to the Unicode characters identified above and
below should be used and use of UTF-8 for a concrete representation is
highly recommended.
If a CIF2 file contains characters equivalent to Unicode code points
greater than U+007E (126 decimal), then the particular encoding used
must either be UTF-8 or be algorithmically identifiable from the CIF2
file itself. UTF-16 with a BOM conforms to this requirement. The use
of a BOM for Unicode encodings, including UTF-8, is recommended.
Acceptable identification algorithms will be published as necessary
as annexes to this standard (see discussion of magic code and
encoding-disambiguation below).
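
To make the BOM case concrete, here is a minimal Python sketch of one possible identification step. It is not one of the algorithms to be published in the annexes; the function name decode_cif2 is hypothetical, and falling back to strict UTF-8 when no BOM is present is an assumption.

```python
import codecs

# BOMs for the encodings the motion names explicitly.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
)


def decode_cif2(data: bytes) -> str:
    """Decode CIF2 bytes, using a leading BOM if there is one.

    The BOM is stripped from the result. Without a BOM the data is
    assumed to be UTF-8; invalid UTF-8 raises UnicodeDecodeError.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(name)
    return data.decode("utf-8")
```

Strict decoding is deliberate here: a file in some other 8-bit encoding with characters above U+007E will usually fail to decode as UTF-8, which is the situation the motion says must be resolvable from the file itself.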
A CIF2 file is uniquely identified by a required magic code at the
beginning of its first line. The code is
#\#CIF_2.0 followed
immediately by whitespace. The immediately following space on this
line is reserved for encoding-disambiguation signatures (see above).
If there is a BOM, the magic code should follow the BOM.
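
A check for the magic code might look like the following Python sketch. The function name is hypothetical, and it assumes the text has already been decoded. Rejecting a file that ends immediately after the code, with no whitespace at all, is one reading of "followed immediately by whitespace".

```python
# The backslash is a literal character of the magic code.
MAGIC_CODE = "#\\#CIF_2.0"


def has_cif2_magic(text: str) -> bool:
    """Return True if text starts with the CIF2 magic code.

    An optional leading BOM (U+FEFF) is skipped first, and the code must
    be followed by a space, tab or newline character.
    """
    if text.startswith("\ufeff"):
        text = text[1:]
    if not text.startswith(MAGIC_CODE):
        return False
    following = text[len(MAGIC_CODE):len(MAGIC_CODE) + 1]
    return following in (" ", "\t", "\n", "\r")
```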
In keeping with XML restrictions we allow the characters
U+0009 U+000A U+000D
U+0020 -- U+007E
U+00A0 -- U+D7FF
U+E000 -- U+FDCF
U+FDF0 -- U+FFFD
U+10000 -- U+10FFFD
In addition, character U+FEFF and characters U+xFFFE or U+xFFFF where
x is any hexadecimal digit are disallowed. Unicode reserves the code
points U+E000 -- U+F8FF for private use. The IUCr and only the IUCr may
specify what characters are assigned to these code points in the
context of CIF2.
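
The allowed-character rules above can be written as a range table. The Python sketch below uses hypothetical names. The mask test for U+xFFFE and U+xFFFF is one reading of "x is any hexadecimal digit": it rejects the last two code points of every plane.

```python
# Inclusive code point ranges listed in the motion.
ALLOWED_RANGES = (
    (0x0009, 0x000A),
    (0x000D, 0x000D),
    (0x0020, 0x007E),
    (0x00A0, 0xD7FF),
    (0xE000, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0x10FFFD),
)


def is_allowed_cif2_char(ch: str) -> bool:
    """Return True if the single character ch is permitted in CIF2."""
    cp = ord(ch)
    # U+FEFF and the U+xFFFE / U+xFFFF code points are excluded outright.
    if cp == 0xFEFF or (cp & 0xFFFF) >= 0xFFFE:
        return False
    return any(low <= cp <= high for low, high in ALLOWED_RANGES)


def first_disallowed(text: str) -> int:
    """Return the index of the first disallowed character, or -1."""
    for index, ch in enumerate(text):
        if not is_allowed_cif2_char(ch):
            return index
    return -1
```

Note that the table already excludes U+007F through U+009F and the surrogate range U+D800 through U+DFFF, so only U+FEFF and the plane-final code points need the separate test.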
CIF2 processors are required to treat <U+000A>, <U+000D> and
<U+000D><U+000A> as newline characters, by normalising them to
<U+000A> on read. No other characters or character sequences may
represent newline. In particular, CIF2 processors should not
interpret the Unicode characters U+2028 (line separator) or U+2029
(paragraph separator) as newline.
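
Newline normalisation on read could be sketched in Python as follows. The function names are hypothetical.

```python
def normalize_newlines(text: str) -> str:
    """Map <U+000D><U+000A> and lone <U+000D> to <U+000A>.

    The two-character sequence is replaced first so that it becomes one
    newline rather than two.
    """
    return text.replace("\r\n", "\n").replace("\r", "\n")


def split_cif2_lines(text: str) -> list[str]:
    """Split text into lines using only the CIF2 newline sequences."""
    return normalize_newlines(text).split("\n")
```

Python's str.splitlines() is avoided on purpose: it also breaks lines at U+2028, U+2029 and several other characters, which the motion says must not be interpreted as newline.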
--
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
yaya@dowling.edu
=====================================================
_______________________________________________
cif2-encoding mailing list
cif2-encoding@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif2-encoding
Re: [Cif2-encoding] Revised Motion
- To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@xxxxxxxx>
- Subject: Re: [Cif2-encoding] Revised Motion
- From: SIMON WESTRIP <simonwestrip@xxxxxxxxxxxxxx>
- Date: Thu, 30 Sep 2010 08:13:56 -0700 (PDT)
I do not object :-)
Cheers
Simon