Re: [Cif2-encoding] Drafting issues

To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@xxxxxxxx>
Subject: Re: [Cif2-encoding] Drafting issues
From: James Hester <jamesrhester@xxxxxxxxx>
Date: Fri, 1 Oct 2010 22:59:42 +1000
In-Reply-To: <[email protected]>
References: <[email protected]><[email protected]><[email protected]><[email protected]>
Herbert, you have proposed an entirely reasonable rewriting of what I
proposed with an entirely reasonable justification.  I'm happy to
accept your new wording.

The worst is behind us, and we are currently mopping up.  After making
it through the mountain pass, surely you didn't expect to just fall
off a cliff to the meadows below?  Perhaps that should be a haiku:

Crunching through a snowlit pass
A distant eagle floats above the sunny meadows
Ah! The roads of the air.

On Fri, Oct 1, 2010 at 9:53 PM, Herbert J. Bernstein
<[email protected]> wrote:
> Sigh, here we go again into the loop.
>
> Yes, I object to saying
>>
>> "UTF8 and UTF16 [are] the only acceptable CIF2 encodings where
>> non-ASCII codepoints are present and in the absence of a
>> disambiguation signature?"
>
> Because there are BOMs for _all_ the various Unicode encodings
> (and there are a lot), and we have not resolved any of the
> disambiguation signatures, so I suggest changing
>
> "In the absence of an encoding
>>
>> disambiguation signature, it is safe to assume that the encoding of a
>> CIF2 file containing characters outside the ASCII range is either UTF8
>> or UTF16."
>
>
> to
>
> "A CIF2 file containing characters outside the ASCII range with no BOM
> and no disambiguation signature will be a UTF8 file.  A CIF2 file
> containing characters outside the ASCII range with a valid UTF8 or
> UTF16 BOM and no disambiguation signature will be a Unicode file
> written in the indicated encoding."
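
For illustration only, the revised wording above could be sketched in Python roughly as follows. The function name guess_cif2_encoding is hypothetical, and disambiguation signatures are ignored because none has been defined yet; the sketch looks only at a leading BOM and otherwise falls back to UTF8.

```python
import codecs


def guess_cif2_encoding(data: bytes) -> str:
    """Pick a Python codec name from a leading BOM, defaulting to UTF-8.

    Hypothetical helper: it covers only the UTF8 and UTF16 BOMs named in
    the proposed wording and knows nothing about disambiguation signatures.
    """
    if data.startswith(codecs.BOM_UTF8):
        # "utf-8-sig" decodes UTF-8 and drops the leading BOM.
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # Python's "utf-16" codec uses the BOM to choose the byte order.
        return "utf-16"
    # No BOM: the proposed wording says the file is UTF8.
    return "utf-8"


raw = codecs.BOM_UTF16_BE + "caf\u00e9".encode("utf-16-be")
text = raw.decode(guess_cif2_encoding(raw))
```

Note that a UTF-32 little-endian BOM begins with the same two bytes as the UTF16 little-endian BOM, so this sketch would misreport such a file; the wording above does not address that case either.
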
>
> Frankly, even though this revised sentence is reasonable, after
> the fuss we have had over equally reasonable sentences, I think
> it is a mistake to include either version at all.  I can just hear
> somebody raising the issue that, e.g., "...but, but, but, you
> did not resolve the open question of the various canonicalized
> encodings..." or "...but, but, but you did not deal with UCS2 versus
> UTF16..."
>
> As we have discovered, the encoding issue is a quagmire: the more
> specific we try to get, the more we struggle, the more we get stuck,
> and the more CIF2 gets delayed.
>
>
> Could we please, please, please, put an end to this!!!!!
>
> Regards,
>  Herbert
>
>
>
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  [email protected]
> =====================================================
>
> On Fri, 1 Oct 2010, James Hester wrote:
>
>> I have included below revised text, this time in plain text format in
>> case the HTML format of the previous email was a problem.
>>
>> I have made the following changes:
>>
>> (1) In the first paragraph of the TERMINOLOGY section I have written that
>> UTF8 is the 'preferred' encoding of CIF2 rather than the 'designated'
>> encoding.  This is in keeping with the newfound status of UTF16 as an
>> acceptable encoding.
>> (2) In Change 1, I've slightly altered the text on encoding
>> disambiguation that Herbert and I added and changed the commentary on
>> BOM for clarity.
>> (3) From the 3rd sentence to the end of the first paragraph of CHANGE
>> 2, I have incorporated the paragraph that Herbert and I worked on, and
>> largely removed the preceding paragraph of Herbert's motion.  The
>> removal of the preceding paragraph is an attempt to avoid confusion
>> and because it was largely discussion, rather than specification.  You
>> will also note some small changes to Herbert's and my text, which I
>> hope is acceptable under the rubric of cleanup.  I've also added a
>> clarificatory comment about UTF8 and UTF16.
>> In general this paragraph is quite scrappy and I believe could be
>> better focussed.  In particular, does anybody have an objection to
>> UTF8 and UTF16 being the only acceptable CIF2 encodings where
>> non-ASCII codepoints are present and in the absence of a
>> disambiguation signature?  If we accept this, then we can chop out the
>> algorithmic determination part (which is really just a statement of
>> principle, and by itself not something that you can write a program
>> based on).
>> (4) I have changed 'text' in the first line of that same paragraph to
>> read 'plain text'.  I hope this is acceptable as a proxy for a more
>> long-winded definition.  If it is not, then we need to add a full
>> definition of 'text' to the definitions section, and I believe that
>> there are several floating around that are acceptable to everybody.
>>
>> If anybody has an objection to these changes, please identify the
>> change, state your objection *precisely*, and give an alternative that
>> would be satisfactory to you.
>>
>> James.
>>
>> ==========================================================================
>> CIF - Changes to the specification
>>
>> 01 October 2010
>>
>> This document specifies changes to the syntax of CIF. We refer to the
>> current syntax specification of CIF as CIF1, and the new specification
>> as CIF2. To date all archival CIFs are CIF1.
>>
>> The changes to syntax are necessitated by the adoption of new
>> dictionary functionalities that introduce several extensions,
>> including new data types, and method definitions using dREL.
>>
>> It is assumed the reader has a thorough understanding of the CIF1
>> specification.
>>
>> TERMINOLOGY
>>
>> Reference to character(s) means abstract characters assigned code
>> points by Unicode.  Specific characters are referenced according to
>> Unicode convention, U+xxxx[x[x]], where xxxx[x[x]] is the four- to
>> six-digit hexadecimal representation of the assigned code point. The
>> preferred character encoding for CIF2 is UTF-8.
>>
>> Reference to ASCII characters means characters U+0000 through
>> U+007F, or, equivalently, the first 128 characters of the ISO 8859-1
>> (LATIN-1) character set.
>>
>> Reference to newline or \n means the sequence that conventionally
>> terminates a line record (which is environment dependent).  See Change
>> 3.
>>
>> Reference to whitespace means the characters ASCII space (U+0020),
>> ASCII horizontal tab (U+0009) and the newline characters. Without
>> regard to local convention, the various other characters that Unicode
>> classifies as whitespace (character categories Zs and Zp) do not
>> constitute whitespace for the purposes of CIF2.
>>
>> PREAMBLE
>>
>> CIF2 significantly extends CIF1 functionality, primarily through new
>> dictionary features. CIF2 is not fully backwards-compatible with CIF1:
>> many files compliant with CIF1 are also compliant with CIF2, but some
>> are not (see especially change 5, below).  The CIF1 standard will
>> continue to operate for the foreseeable future in parallel with CIF2.
>>
>> CHANGE 1 - NEW (MAGIC CODE)
>>
>> A CIF2 file is uniquely identified by a required magic code at the
>> beginning of its first line. The code is,
>>
>> #\#CIF_2.0
>>
>> followed immediately by whitespace.  The immediately following space
>> on this line is reserved for encoding disambiguation signatures.  Note
>> that where a Unicode BOM is used, it would appear prior to the magic
>> code in the byte stream and does not form part of the CIF text.
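
As a rough sketch of the magic-code check described above (not a definitive implementation): the function name has_cif2_magic is hypothetical, the input is assumed to be already decoded to a Python str, and carriage return is assumed to count as a newline character alongside line feed.

```python
MAGIC = "#\\#CIF_2.0"  # the literal characters  #\#CIF_2.0

# Assumption: space, tab, and the usual newline characters count as
# whitespace for this check.
CIF2_WHITESPACE = (" ", "\t", "\n", "\r")


def has_cif2_magic(text: str) -> bool:
    """Return True if text opens with the magic code followed by whitespace."""
    # A BOM, if one survived decoding, precedes the magic code and is not
    # part of the CIF text, so skip it.
    if text.startswith("\ufeff"):
        text = text[1:]
    if not text.startswith(MAGIC):
        return False
    following = text[len(MAGIC):len(MAGIC) + 1]
    # An empty string (file ends right after the code) is not whitespace.
    return following in CIF2_WHITESPACE
```

Under this reading, a file consisting of the magic code alone, with no trailing whitespace, is rejected; whether that is the intended behaviour is for the specification to settle.
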
>>
>> CHANGE 2 - NEW (CHARACTER SET)
>>
>> CIF2 files are standard variable length plain text files, which for
>> compatibility with older processing systems will have a maximum line
>> length of 2048 characters. As discussed above and below, however,
>> there are some restrictions on the character set for token delimiters,
>> separators and data names. For compatibility with CIF1 behaviour,
>> there is no formal restriction on the encoding of CIF2 files providing
>> they contain only code points from the ASCII range.  If a CIF2 file
>> contains characters equivalent to Unicode code points greater than
>> U+007F (127 decimal), then the particular encoding used must either be
>> UTF8 or algorithmically identifiable from the CIF2 file itself. Note
>> that UTF16 with a BOM conforms to this requirement.  The use of a BOM
>> for Unicode encodings, including UTF8, is recommended.  Acceptable
>> identification algorithms will be published as necessary as annexes to
>> this standard (see description of magic code and encoding
>> disambiguation in Change 1).  In the absence of an encoding
>> disambiguation signature, it is safe to assume that the encoding of a
>> CIF2 file containing characters outside the ASCII range is either UTF8
>> or UTF16.
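
As a rough illustration of two checks implied by this paragraph, the following sketch uses hypothetical helper names and makes several assumptions: the file has already been decoded to a Python str, line length is counted in code points, and the line terminator itself does not count toward the 2048 limit.

```python
from typing import List

MAX_LINE_LENGTH = 2048  # maximum line length stated in Change 2


def overlong_lines(text: str) -> List[int]:
    """Return the 1-based numbers of lines longer than MAX_LINE_LENGTH."""
    numbers = []
    for number, line in enumerate(text.split("\n"), start=1):
        # Assumption: a trailing carriage return belongs to the terminator.
        if len(line.rstrip("\r")) > MAX_LINE_LENGTH:
            numbers.append(number)
    return numbers


def is_ascii_only(text: str) -> bool:
    """True if every code point is in the ASCII range U+0000 to U+007F."""
    return all(ord(ch) <= 0x7F for ch in text)
```

A file for which is_ascii_only returns True falls under the "no formal restriction on the encoding" clause; any other file is subject to the UTF8-or-identifiable requirement.
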
>>
>> In keeping with XML restrictions we allow the characters
>>
>> U+0009 U+000A U+000D
>> U+0020 - U+007E
>> U+00A0 - U+D7FF
>> U+E000 - U+FDCF
>> U+FDF0 - U+FFFD
>> U+10000 - U+10FFFD
>>
>> In addition, character U+FEFF and characters U+xFFFE or U+xFFFF where
>> x is any hexadecimal digit are disallowed. Unicode reserves the code
>> points E000 - F8FF for private use. The IUCr and only the IUCr may
>> specify what characters are assigned to these code points in the
>> context of CIF2.
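
The allowed and disallowed ranges above can be sketched as a simple membership test. This is one reading of the list, not a definitive implementation: the names are hypothetical, and "U+xFFFE or U+xFFFF" is interpreted here as the last two code points of every plane.

```python
# Individually allowed control characters: tab, line feed, carriage return.
ALLOWED_SINGLE = frozenset({0x0009, 0x000A, 0x000D})

# Inclusive ranges copied from the list above.
ALLOWED_RANGES = (
    (0x0020, 0x007E),
    (0x00A0, 0xD7FF),
    (0xE000, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0x10FFFD),
)


def is_allowed(ch: str) -> bool:
    """Return True if the single character ch is permitted in a CIF2 file."""
    cp = ord(ch)
    # Explicit exclusions: U+FEFF, and (by assumption) the final two code
    # points of each plane, i.e. any code point ending in FFFE or FFFF.
    if cp == 0xFEFF or (cp & 0xFFFF) >= 0xFFFE:
        return False
    if cp in ALLOWED_SINGLE:
        return True
    return any(low <= cp <= high for low, high in ALLOWED_RANGES)


def first_disallowed(text: str):
    """Return the index of the first disallowed character, or None."""
    for index, ch in enumerate(text):
        if not is_allowed(ch):
            return index
    return None
```
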
>> Reasoning: There is growing demand for the wider character set
>> afforded by Unicode to be made available in applications, especially
>> those where internationalisation is an issue.
>>
>> On Fri, Oct 1, 2010 at 2:44 PM, James Hester <[email protected]>
>> wrote:
>>>
>>> Before I post my revised text, I have only just realised (upon close
>>> perusal of the two texts) that Herbert's motion is substantially the same as
>>> the 'Changes' document, just without the headings etc, so we are discussing
>>> almost the same document.  My apologies for the confusion.
>>>
>>> James.
>>>
>>> On Fri, Oct 1, 2010 at 2:37 PM, James Hester <[email protected]>
>>> wrote:
>>>>
>>>> Dear Group,
>>>>
>>>> As I think we have reached a consensus in principle, and are now moving
>>>> into discussion of precise definitions, let us have wording arguments only
>>>> once (that is, for a single document).  I think that our base document must
>>>> be the one that the DDLm group agreed on - the link once again is
>>>> http://www.iucr.org/__data/assets/pdf_file/0017/41426/cif2_syntax_changes_jrh20100705.pdf
>>>> - simply because it will be unnecessarily confusing for the DDLm group to
>>>> deal with two documents at once, and the 'Changes' document is admirably
>>>> precise.  I reiterate once again that I am happy with the motion that
>>>> Herbert presented, with the proviso that one paragraph is rewritten as I
>>>> have recently proposed.  Herbert - if you would like to negotiate that
>>>> paragraph with me by Skype, I'm happy to do that too.
>>>>
>>>> I have appended a text version of what I consider to be the relevant
>>>> sections of the 'changes' document to this message.  I am happy to provide
>>>> the complete document in OpenOffice format to anybody who would like it.
>>>> Herbert - if you think any of the non-encoding discussion in your motion is
>>>> not already covered in the 'Changes' document, please advise.
>>>>
>>>> I will be posting my own suggestion, largely based on parts of the
>>>> motion that Herbert and I drafted yesterday, in a reply to this email.
>>>>
>>>> CIF - Changes to the specification
>>>>
>>>> 05 July 2010
>>>>
>>>> This document specifies changes to the syntax of CIF. We refer to the
>>>> current syntax specification of CIF as CIF1, and the new specification as
>>>> CIF2. To date all archival CIFs are CIF1.
>>>>
>>>> The changes to syntax are necessitated by the adoption of new dictionary
>>>> functionalities that introduce several extensions, including new data types,
>>>> and method definitions using dREL.
>>>>
>>>> It is assumed the reader has a thorough understanding of the CIF1
>>>> specification.
>>>>
>>>> TERMINOLOGY
>>>>
>>>> Reference to character(s) means abstract characters assigned code points
>>>> by Unicode. Specific characters are referenced according to Unicode
>>>> convention, U+xxxx[x[x]], where xxxx[x[x]] is the four- to six-digit
>>>> hexadecimal representation of the assigned code point. The designated
>>>> character encoding for CIF2 is UTF-8.
>>>>
>>>> Reference to ASCII characters means characters U+0000 through
>>>> U+007F, or, equivalently, the first 128 characters of the ISO 8859-1
>>>> (LATIN-1) character set.
>>>>
>>>> Reference to newline or \n means the sequence that conventionally
>>>> terminates a line record (which is environment dependent). See Change 3.
>>>>
>>>> Reference to whitespace means the characters ASCII space (U+0020),
>>>> ASCII horizontal tab (U+0009) and the newline characters. Without regard
>>>> to local convention, the various other characters that Unicode classifies as
>>>> whitespace (character categories Zs and Zp) do not constitute whitespace for
>>>> the purposes of CIF2.
>>>>
>>>> PREAMBLE
>>>>
>>>> CIF2 significantly extends CIF1 functionality, primarily through new
>>>> dictionary features. CIF2 is not fully backwards-compatible with CIF1: many
>>>> files compliant with CIF1 are also compliant with CIF2, but some are not
>>>> (see especially change 5, below). The CIF1 standard will continue to operate
>>>> for the foreseeable future in parallel with CIF2.
>>>>
>>>> CHANGE 1 - NEW (MAGIC CODE)
>>>>
>>>> A CIF2 file is uniquely identified by a required magic code at the
>>>> beginning of its first line. The code is,
>>>>
>>>> #\#CIF_2.0
>>>>
>>>> followed immediately by whitespace.
>>>>
>>>> CHANGE 2 - NEW (CHARACTER SET)
>>>>
>>>> CIF2 files are standard variable length text files, which for
>>>> compatibility with older processing systems will have a maximum line length
>>>> of 2048 characters. As discussed above and below, however, there are some
>>>> restrictions on the character set for token delimiters, separators and data
>>>> names.
>>>>
>>>> In keeping with XML restrictions we allow the characters
>>>>
>>>> U+0009 U+000A U+000D
>>>> U+0020 - U+007E
>>>> U+00A0 - U+D7FF
>>>> U+E000 - U+FDCF
>>>> U+FDF0 - U+FFFD
>>>> U+10000 - U+10FFFD
>>>>
>>>> In addition, character U+FEFF and characters U+xFFFE or U+xFFFF where x
>>>> is any hexadecimal digit are disallowed. Unicode reserves the code points
>>>> E000 - F8FF for private use. The IUCr and only the IUCr may specify what
>>>> characters are assigned to these code points in the context of CIF2.
>>>>
>>>> Reasoning: There is growing demand for the wider character set afforded
>>>> by Unicode to be made available in applications, especially those where
>>>> internationalisation is an issue.
>>>>
>>>> --
>>>> T +61 (02) 9717 9907
>>>> F +61 (02) 9717 3145
>>>> M +61 (04) 0249 4148
>>>
>>>
>>>
>>> --
>>> T +61 (02) 9717 9907
>>> F +61 (02) 9717 3145
>>> M +61 (04) 0249 4148
>>
>>
>>
>> --
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>> _______________________________________________
>> cif2-encoding mailing list
>> [email protected]
>> http://scripts.iucr.org/mailman/listinfo/cif2-encoding



-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148