From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Wednesday, 23 June, 2010 19:35:59
Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .
All that is required to avoid the trap of unintended text transformations
of UTF-8 as if it were, say, Latin-1, is to add any string from the
Latin-1 Supplement block of the Unicode BMP. I would suggest
:#x00F2#x00F3#x00F4#x00F5#x00F6:
which as UTF-8 bytes would be
:#x00c3#x00b2#x00c3#x00b3#x00c3#x00b4#x00c3#x00b5#x00c3#x00b6:
which would come out as 5 accented lower case o's running through the
full set of accents (grave, acute, circumflex, tilde, diaeresis) if
transmitted correctly, but as capital A-tildes alternating with
SUPERSCRIPT TWO, SUPERSCRIPT THREE, ACUTE ACCENT, MICRO SIGN, PILCROW SIGN
in the most likely mis-transmission of a UTF-8 file as a Latin-1 file.
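
The effect can be reproduced in a few lines. What follows is only a minimal
Python sketch, not part of the proposal itself; the variable names are
illustrative. It encodes the five suggested code points as UTF-8 and then
reads those bytes back as if they were Latin-1.

```python
import unicodedata

# The five code points suggested for the check: U+00F2 through U+00F6.
tc = "\u00f2\u00f3\u00f4\u00f5\u00f6"

# Correct UTF-8 encoding: each code point becomes byte 0xC3 followed by
# one of 0xB2..0xB6.
utf8_bytes = tc.encode("utf-8")
print(" ".join(f"{b:02x}" for b in utf8_bytes))

# Simulate the mis-transmission: the UTF-8 bytes are read as Latin-1.
garbled = utf8_bytes.decode("latin-1")
for ch in garbled:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The first line printed is "c3 b2 c3 b3 c3 b4 c3 b5 c3 b6", and the character
names that follow alternate LATIN CAPITAL LETTER A WITH TILDE with
SUPERSCRIPT TWO, SUPERSCRIPT THREE, ACUTE ACCENT, MICRO SIGN and
PILCROW SIGN.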
Let us call the code-point sequence #x00F2#x00F3#x00F4#x00F5#x00F6
the transmission check <tc>. Then the proposed magic number would be
#\#CIF_2.0:<encoding>:<tc>:
Both the encoding and the tc would be optional, but highly recommended.
This might not allow fully automated decoding, but it would at least
provide a decent error check for many of the most common cases that
cause trouble, and would actually give us an edge over the XML
convention (which only gives the encoding) in terms of reliability.
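
As a rough illustration of how a reader might use such a first line, here is
a hedged Python sketch. Several things in it are assumptions rather than part
of the proposal: the function name check_magic and its status strings are
hypothetical, the fields are taken to be positional and colon-separated, an
omitted encoding is taken to mean UTF-8, and only encodings that are
supersets of ASCII are handled.

```python
MAGIC = b"#\\#CIF_2.0"                 # the bytes  #\#CIF_2.0
TC = "\u00f2\u00f3\u00f4\u00f5\u00f6"  # expected transmission check

def check_magic(first_line):
    """Return (declared_encoding, status) for the first line of a file.

    status is "no-magic", "no-tc", "ok" or "mismatch" (hypothetical labels).
    """
    line = first_line.rstrip(b"\r\n")
    if not line.startswith(MAGIC):
        return ("", "no-magic")
    fields = line[len(MAGIC):].split(b":")
    if fields[0] != b"":
        # Something other than ':' directly follows the magic string.
        return ("", "no-magic")
    # Assumption: a missing or empty encoding field means UTF-8.
    encoding = "UTF-8"
    if len(fields) > 1 and fields[1]:
        encoding = fields[1].decode("ascii", errors="replace")
    if len(fields) < 3 or not fields[2]:
        return (encoding, "no-tc")
    try:
        ok = fields[2].decode(encoding) == TC
    except (LookupError, UnicodeDecodeError):
        # Unknown encoding name, or bytes invalid in the declared encoding.
        ok = False
    return (encoding, "ok" if ok else "mismatch")

good = b"#\\#CIF_2.0:UTF-8:" + TC.encode("utf-8") + b":\n"
# Same line after a transfer that treated the UTF-8 bytes as Latin-1
# and converted them to UTF-8 again.
damaged = good.decode("latin-1").encode("utf-8")

print(check_magic(good))
print(check_magic(damaged))
print(check_magic(b"#\\#CIF_2.0\n"))
```

Under these assumptions the three calls report "ok", "mismatch" and "no-tc"
respectively, which is the error check described above: the damage is
detected, although not automatically repaired.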
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
yaya@dowling.edu
=====================================================
On Wed, 23 Jun 2010, Bollinger, John C wrote:
>
> On Wednesday, June 23, 2010 9:47 AM, Herbert J. Bernstein wrote:
>
>> If we impose a non-text canonical UTF-8 encoding that does not contain an
>> internal encoding signature, and that file is transmitted as text and
>> not binary from a machine for which, say, ASCII with code pages for, say,
>> western europe, is the native encoding, and the transmission converts
>> the UTF-8 characters as if they were accented characters in Latin-1,
>> then what is received may appear plausible at the receiving end, just
>> wrong.
>
> Surely that is a general issue with exchanging encoded text. It is not caused by designating a canonical encoding, and it would not be solved either by declining to designate a canonical encoding or by mandating UTF-8 as the only allowed encoding.
>
>> Therefore, I would suggest that we be very careful to make such a
>> canonical UTF-8 CIF self-identifying, by including not only a BOM,
>> but also some text in the range #x80-#xFF in the magic
>> number to help in detecting such unintended transmission conversions.
>
> It would definitely ease encoding detection / correction if the magic number contained non-ASCII characters. Doing so, however, either will require CIF2 to be a hybrid binary/text format, or will effectively restrict CIF to be used only with encodings that support the chosen characters. (Or am I missing something?) I disfavor the former, and I think the latter is a serious restriction indeed.
>
>> In addition, I would suggest that, just as the first line of an XML
>> document specifies its encoding in plain text, that we add the same
>> information to our magic number.
>
> I have been giving some consideration to exactly that possibility. It works for all encodings that are supersets of ASCII. Other encodings would need to be detected some other way (e.g. byte-order mark, analysis of the encoded magic number), but they are not at such risk of encoding confusion.
>
> The signature of a CIF2 might then be something like these:
>
> #\#CIF_2.0
> #\#CIF_2.0:UTF-8
> #\#CIF_2.0:KOI8-R
> #\#CIF_2.0:ISO-8859-1
>
> where the first two mean the same thing. If we do choose to not require UTF-8 then I favor this approach.
>
>
> John
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
>
>
>
>
> Email Disclaimer: www.stjude.org/emaildisclaimer
>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .
- From: SIMON WESTRIP <simonwestrip@btinternet.com>
- Date: Wed, 23 Jun 2010 20:04:59 +0000 (GMT)
- In-Reply-To: <alpine.BSF.2.00.1006231406010.30894@epsilon.pair.com>
- References: <AANLkTilyJE2mCxprlBYaSkysu1OBjY7otWrXDWm3oOT9@mail.gmail.com><alpine.BSF.2.00.1006212018430.91069@epsilon.pair.com><AANLkTilolZk4SzLF8mzqOz4EagFJcEHDKOAblGMnoqpW@mail.gmail.com><alpine.BSF.2.00.1006212120510.91069@epsilon.pair.com><AANLkTiklvzlKquqlRQIrpPGZjJfuRzLqiv2E6Stcq6wd@mail.gmail.com><alpine.BSF.2.00.1006212241210.4105@epsilon.pair.com><AANLkTilACXxnPRtJXEjGD39eleDl9dxlAcwar8j9MBPr@mail.gmail.com><alpine.BSF.2.00.1006220753471.87930@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54166122951E@SJMEMXMBS11.stjude.sjcrh.local><AANLkTikih0j6-vyLDPMOqcTkoiK545yE28y4fU9JTUa2@mail.gmail.com><20100623103310.GD15883@emerald.iucr.org><8F77913624F7524AACD2A92EAF3BFA541661229521@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1006231033360.56372@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA541661229523@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1006231406010.30894@epsilon.pair.com>
"Both the encoding and the tc would be optional, but highly recommended.
This might not allow fully automated decoding, but it would at least
provide a decent error check for many of the most common cases that
cause trouble, and would actually give us an edge over the XML
convention (which only gives the encoding) in terms of reliability."
Agreed, anything that gives us a hint about the encoding would be useful.
I think the analogy with XML should be used with care though. In the XML world, the majority of users never work
with raw XML, while the majority of CIF users work with raw CIF (core CIFs anyway; mmCIF is subject to less 'raw' editing, I think).
I suspect that this difference is why we (at the IUCr at least) will encounter far more problems by allowing multiple encodings
than if we were introducing CIF2 from scratch (i.e. if we didn't already have an established user base).
I note also that XML's 'core' or 'desired' encoding is restricted; other encodings are allowed if the signatures are correct, but I suspect
XML was not intended to be used 'casually' in its raw form, so the chances are that such a specification will be respected by the developers of XML systems (rather than end users).
Cheers
Simon