Re: [ddlm-group] options/text vs binary/end-of-line. .. .

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .
From: SIMON WESTRIP <[email protected]>
Date: Mon, 21 Jun 2010 13:09:26 -0700 (PDT)
In-Reply-To: <a06240803c845518a843e@[192.168.2.104]>

I'm fairly happy with a 'text'-based description. However, I'm struggling with the following sentence:

"This approach is not meant to prevent
people from preparing valid CIF2 files with non-UTF-8-based text
editors, but, if a non-UTF-8 file format is produced, it is important
to clearly specify the intended mapping to UTF-8."

By accepting this it would seem that, at best, I'm agreeing to multiple encodings for Unicode,
while at worst, I'm agreeing to 'any old text encoding' as long as it's been specified somehow.

Forgive me if I've misinterpreted this, but in particular the phrase
"clearly specify the intended mapping to UTF-8"
worries me. How is this specification to be made?

Cheers

Simon

From: Herbert J. Bernstein <[email protected]>
To: Group finalising DDLm and associated dictionaries <[email protected]>
Sent: Monday, 21 June, 2010 18:44:45
Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .

Dear Colleagues,

The IUCr is an international organization. Is it really politically
wise to insist that CIF2 tags be restricted to unaccented roman letters?

Before we go much further, may we please have a vote on explicitly
changing CIF2 from the current draft wording, which says it is a binary
format, to the wording I suggested, which makes it a text format? Most of
the rest of the issues we are dealing with hinge on that basic decision.

The wording I proposed was:

"CIF2 is a specification for the interchange of text files. Text files
have many possible system-dependent representations and encodings. To
ensure clarity in the specification of CIF2, this document is written
in terms of a sequence of Unicode code points, and all fully compliant
CIF2 processing systems should, at a minimum, be able to process
text files as Unicode code points represented in UTF-8, subject to the
XML-based restrictions below. This approach is not meant to prevent
people from preparing valid CIF2 files with non-UTF-8-based text
editors, but, if a non-UTF-8 file format is produced, it is important
to clearly specify the intended mapping to UTF-8. This is particularly
important in dealing with end-of-line indicators (see
http://en.wikipedia.org/wiki/Newline). When handling CIF2 files
produced under MS Windows, CR-LF sequences should be accepted as
an alternative to LF, and when handling CIF2 files produced under
Mac OS, CR should be accepted as an alternative to LF. This document
will only refer to LF as a line terminator and will assume that some
appropriate system-dependent text processing system will handle
the necessary conversion.

To ensure compatibility with older Fortran text processing software,
lines in CIF2 files should be restricted to no more than 2048
code points in length, not including the line terminator itself.
Note that the UTF-8 encoding of such a line may well be much longer."
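
To make the proposed wording concrete, here is a minimal Python sketch of
the two rules it states: accepting CR-LF and lone CR as alternatives to LF,
and limiting lines to 2048 code points, excluding the terminator. The
function names (normalize_newlines, overlong_lines, decode_cif2) and the
constant name are hypothetical and appear in no CIF2 draft. The sketch
assumes the input is already UTF-8 and does not address other encodings or
the XML-based character restrictions.

```python
# Illustrative sketch only: all names here are hypothetical, not part of CIF2.

MAX_LINE_CODE_POINTS = 2048  # limit taken from the proposed wording


def normalize_newlines(text):
    """Map CR-LF and lone CR to LF so later processing sees only LF."""
    # Replace CR-LF first so the pair does not turn into two line breaks.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def overlong_lines(text, limit=MAX_LINE_CODE_POINTS):
    """Return 1-based numbers of lines longer than `limit` code points.

    A Python 3 str is a sequence of code points, so len() counts code
    points rather than UTF-8 bytes. The line terminator is not counted.
    """
    lines = normalize_newlines(text).split("\n")
    return [n for n, line in enumerate(lines, start=1) if len(line) > limit]


def decode_cif2(raw):
    """Decode UTF-8 bytes (an assumption of this sketch) and normalize newlines."""
    return normalize_newlines(raw.decode("utf-8"))


if __name__ == "__main__":
    line = "\u00e9" * 2048  # 2048 code points, two UTF-8 bytes each
    print(overlong_lines(line))       # no line exceeds the limit
    print(len(line.encode("utf-8")))  # byte length is larger than 2048
```

The last two lines illustrate the closing remark of the wording: a line
within the code-point limit can still occupy many more bytes once encoded
in UTF-8.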

If anybody objects to some specific wording in this text, let us
settle on revised wording. We need to get this basic issue
clarified in writing or we will be going in circles forever.

Regards,
Herbert

At 11:30 AM -0500 6/21/10, Bollinger, John C wrote:
>On Monday, June 21, 2010 1:13 AM, James Hester wrote:
>
>>I prefer the XML treatment of newline (ie translated to 0x000A for
>>processing purposes). I would be in favour of restricting newline to
>><0x000A>, <0x000D> or <0x000D 0x000A>, which means that only these
>>combinations have the syntactic significance of a newline.
>
>I would be satisfied with that approach.
>
>>From memory, this significance is restricted to:
>>
>>1. end of comment
>>2. whitespace
>>3. use in <eol><semicolon> digraph
>
>The significance also extends to 'single'- and "double"-quote
>delimited data values, in that these cannot contain end-of-line.
>
>>I would also restrict the appearance of the remaining Unicode newline
>>characters to delimited datavalues, to maintain consistent display of
>>data files.
>
>I'm seeing more and more upside to restricting *all* non-ASCII
>characters to delimited data values. I don't have any objection to
>restricting U+0085, U+2028, and U+2029 (did I miss any?) to such
>contexts.
>
>
>John
>--
>John C. Bollinger, Ph.D.
>Department of Structural Biology
>St. Jude Children's Research Hospital
>
>
>
>
>Email Disclaimer: www.stjude.org/emaildisclaimer
>
>_______________________________________________
>ddlm-group mailing list
>[email protected]
>http://scripts.iucr.org/mailman/listinfo/ddlm-group

--
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
[email protected]
=====================================================
