Re: [Cif2-encoding] How we wrap this up

Dear John,

   The topic I wish to discuss is how to move CIF2 forward.
The norm I am used to in standards work is one of continuity
and gradual change.  I am used to the approach of taking
existing features and deprecating them for some period of time
(often a few years) rather than dropping them abruptly.  I
find many aspects of the current approach to CIF2 troubling
because of the discontinuity from CIF1 practice without
such a transition.

   My motion is an effort to bring the transition to
CIF2 more in line with the traditional standards approach
of gradual change, deprecating non-Unicode encodings
rather than trying to wipe them out abruptly.  XML has not
managed to complete that transition after many years of
favoring UTF-8 and UTF-16, and we have a much weaker
support infrastructure than XML does.

   So, the point of the meeting is not so much to refight
well-discussed technical issues, but to do the critical
work of finding a process to allow the community to
move forward with CIF2 in a way that actually works.  I
think my motion does that.  I would suggest you reread it.
It stops just short of formally deprecating non-Unicode
encodings -- very similar to the current approach in
XML.  My guess is that that is as far as we can go right
now.  Maybe by next summer it would be possible to
actually formally deprecate non-Unicode encodings.  I
doubt it, but I could turn out to be wrong.

   Having a meeting should help to clarify this.

   Regards,
     Herbert

===============================================================

Proposed position on CIF2 character encodings submitted to
COMCIFS for a vote as an interim agreement on what can be
agreed thus far, subject to extension and refinement in
the future.

===============================================================

Reference to character(s) means abstract characters assigned code
points by Unicode.  Specific characters are referenced according to
Unicode convention, U+xxxx[x[x]], where  xxxx[x[x]] is the four- to
six-digit hexadecimal representation of the assigned code point.

The designated character encoding for CIF2 is UTF-8, the preferred
concrete representation of the information in a CIF2 document.

Reference to ASCII characters means characters U+0000 through U+007F, or,
equivalently, the first 128 characters of the ISO-8859-1 (LATIN-1)
character set.

Reference to newline or \n means the sequence that conventionally
terminates a line record (which is environment dependent).
Reference to whitespace means the characters ASCII space (U+0020),
ASCII horizontal tab (U+0009) and the newline characters. Without
regard to local convention, the various other characters that
Unicode classifies as whitespace (character categories Zs and Zp) do
not constitute whitespace for the purposes of CIF2.
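
As a rough illustration of the whitespace definition above, here is a
minimal Python sketch.  The names CIF2_WHITESPACE and is_cif2_whitespace
are hypothetical, and the sketch assumes newline is represented by
U+000A or U+000D (the characters named in the newline rule later in
this proposal).

```python
# Hypothetical helper, not part of the proposal text.
# Assumption: newline characters are U+000A and U+000D.
CIF2_WHITESPACE = frozenset({"\u0020", "\u0009", "\u000A", "\u000D"})


def is_cif2_whitespace(ch: str) -> bool:
    """Return True only for the characters CIF2 treats as whitespace."""
    return ch in CIF2_WHITESPACE
```

Note that this deliberately differs from Python's str.isspace(), which
also accepts characters such as U+00A0 and U+2029.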

CIF2 files are standard variable length text files, which for
compatibility with older processing systems will have a maximum line
length of 2048 characters. As discussed above and below, however,
there are some restrictions on the character set for token
delimiters, separators and data names.
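
The line-length limit could be checked along the following lines.  This
is only a sketch: the function name overlong_lines is hypothetical, and
it assumes the limit counts characters (not bytes) and excludes the line
terminator, neither of which is spelled out above.

```python
MAX_LINE_LENGTH = 2048  # limit stated in the proposal


def overlong_lines(text: str) -> list[int]:
    """Return 1-based numbers of lines longer than MAX_LINE_LENGTH.

    Assumption: length is measured in characters, without the terminator.
    """
    # Treat LF, CR and CRLF as line terminators before splitting.
    lines = text.replace("\r\n", "\n").replace("\r", "\n").split("\n")
    return [n for n, line in enumerate(lines, start=1)
            if len(line) > MAX_LINE_LENGTH]
```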

References to Unicode and UTF-8 are specifically to identify characters
and a concrete representation of those characters in an established and
widely available standard.  It is understood that CIF2 documents may
be constructed and maintained on computers that implement other character
encodings.  However, for maximum portability, only the clearly
identified equivalents to the Unicode characters identified above and
below should be used, and use of UTF-8 for a concrete representation is
highly recommended.

A CIF2 file is uniquely identified by a required magic code at the
beginning of its first line. The code is #\#CIF_2.0, followed
immediately by whitespace.  The addition of further information
to assist in disambiguation among multiple character sets is
under discussion.  Encodings such as UTF-16, which prefix a file
with a BOM (byte-order mark) or other encoding disambiguation
prefix are not precluded.  In such a case, the magic code should
follow the encoding disambiguation prefix.
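
A minimal sketch of the magic-code test, in Python, might look like
this.  The function name has_cif2_magic is hypothetical.  The sketch
assumes the file has already been decoded to text, that a BOM survives
decoding as a leading U+FEFF, and that a file ending immediately after
the code (with no whitespace) does not qualify; the text above does not
settle that last case.

```python
MAGIC = "#\\#CIF_2.0"  # the literal characters  #\#CIF_2.0


def has_cif2_magic(text: str) -> bool:
    """Check decoded text for the CIF2 magic code plus trailing whitespace."""
    # Skip an encoding disambiguation prefix (here: only a BOM).
    if text.startswith("\ufeff"):
        text = text[1:]
    if not text.startswith(MAGIC):
        return False
    following = text[len(MAGIC):len(MAGIC) + 1]
    # Assumption: whitespace means space, tab, or a newline character.
    return following in (" ", "\t", "\n", "\r")
```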

In keeping with XML restrictions, we allow the characters:

U+0009 U+000A U+000D
U+0020 -- U+007E
U+00A0 -- U+D7FF
U+E000 -- U+FDCF
U+FDF0 -- U+FFFD
U+10000 -- U+10FFFD

In addition, character U+FEFF and characters U+xFFFE or U+xFFFF where
x is any hexadecimal digit are disallowed. Unicode reserves the code
points E000 - F8FF for private use. The IUCr and only the IUCr may specify
what characters are assigned to these code points in the context of
CIF2.
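
The allowed ranges and the additional exclusions can be expressed as a
simple membership test.  The following Python sketch is illustrative
only; ALLOWED_RANGES and is_allowed_char are hypothetical names, and the
exclusion of U+xFFFE and U+xFFFF is applied to every plane.

```python
# Inclusive code point ranges copied from the list above.
ALLOWED_RANGES = [
    (0x0009, 0x000A),
    (0x000D, 0x000D),
    (0x0020, 0x007E),
    (0x00A0, 0xD7FF),
    (0xE000, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0x10FFFD),
]


def is_allowed_char(ch: str) -> bool:
    """Return True if ch is a character permitted in a CIF2 file."""
    cp = ord(ch)
    # U+FEFF and the last two code points of each plane are disallowed.
    if cp == 0xFEFF or (cp & 0xFFFF) >= 0xFFFE:
        return False
    return any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)
```

A processor could apply this test to each character after decoding,
independently of the encoding the file was stored in.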

CIF2 processors are required to treat <U+000A>, <U+000D> and
<U+000D><U+000A> as newline characters, by normalising them to
<U+000A> on read. No other characters or character sequences may
represent newline. In particular, CIF2 processors should not
interpret the Unicode characters U+2028 (line separator) or U+2029
(paragraph separator) as newline.
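
The normalisation on read can be sketched in a few lines of Python.  The
function name normalize_newlines is hypothetical, and the sketch assumes
the input has already been decoded to text.

```python
def normalize_newlines(text: str) -> str:
    """Map CRLF and lone CR to LF; leave every other character untouched."""
    # Replace CRLF first so that it does not become two newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

A Python implementation would probably want to avoid str.splitlines()
for this purpose, since that method also breaks lines at U+2028, U+2029
and several other characters that the rule above says must not be
treated as newline.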


=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Tue, 28 Sep 2010, Bollinger, John C wrote:

> Dear Herb,
>
> On Monday, September 27, 2010 5:19 PM, Herbert J. Bernstein wrote:
>
>> I hope the rest of you will have the
>> courtesy to participate in a Skype meeting.  Perhaps no new facts
>> or logic will come to light.  Perhaps something will come to light
>> that leads to better common understanding and consensus.  We'll never know
>> unless we try.  I for one think that most of us are open minded and
>> willing to try to reach an accommodation that serves the community well.
>
> I apologize for any discourtesy you perceive, and I assure you that none 
> is intended.  At the same time, I am confident that the opportunities 
> for careful consideration, research, and revision inherent in the 
> written form and extended time frame of our discussion to date have 
> already afforded all of us ample opportunity to communicate our 
> positions, to explore each other's, and to attempt to reach a consensus 
> compromise.
>
> I have no general objection to conference calls or face-to-face meetings 
> as forums for discussion, and I have participated in many of both.  I do 
> not, however, see anything to be gained by moving this discussion to 
> such a venue at this point.  How will there be anything other than more 
> of the "endless repetition" of positions you so deplore?  What topic do 
> you wish to discuss that we have not already covered in detail?  If 
> there is such a topic then we can save the discussion for a call, but it 
> will go better if we all have the opportunity to prepare.
>
> As for being "open minded and willing to try to reach an accommodation 
> that serves the community well," I point to the record of our 
> discussion.  If my efforts to find such an accommodation are not plainly 
> evident, if my attempts to understand the various positions and 
> viewpoints are not clearly visible, and if my willingness to consider 
> alternative approaches is not a matter of record, then I shall have to 
> accept your insinuation.  I deeply regret that I have left that 
> impression on you.
>
>
> Regards,
>
> John
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
>
>
> Email Disclaimer:  www.stjude.org/emaildisclaimer
>
> _______________________________________________
> cif2-encoding mailing list
> cif2-encoding@iucr.org
> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>