Re: [ddlm-group] UTF-8 BOM
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] UTF-8 BOM
- From: James Hester <jamesrhester@gmail.com>
- Date: Fri, 18 Jun 2010 11:01:01 +1000
- In-Reply-To: <alpine.BSF.2.00.1006172025070.91418@epsilon.pair.com>
I suggest you look again (perhaps you found 0xFFFE instead?). Unicode
hexadecimal code point 0xFEFF is Zero Width Non-Breaking Space (ZWNBSP).
Previous recent emails have discussed this at some length.

On Fri, Jun 18, 2010 at 10:55 AM, Herbert J. Bernstein
<yaya@bernstein-plus-sons.com> wrote:
> Dear Colleagues,
>
> As I said, I reject the false trichotomy presented, and vote to reject
> this binary approach to CIF2. Asking what should be done if the
> Unicode code point 0xFEFF is encountered in the text stream. FFFE is
> not a Unicode text character (I just checked the latest Unicode standard,
> and it is still not a character, explicitly called a "noncharacter") so
> a properly functioning text system simply will not deliver it as text
> to an application, just as in older ASCII-based systems, characters such as
> NUL and SYN are stripped before delivery of text to an application.
>
> Regards,
> Herbert
>
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
> Dowling College, Kramer Science Center, KSC 121
> Idle Hour Blvd, Oakdale, NY, 11769
>
> +1-631-244-3035
> yaya@dowling.edu
> =====================================================
>
> On Fri, 18 Jun 2010, James Hester wrote:
>
>> Herbert and Simon: regardless of your concerns about what encodings
>> should be acceptable for CIF2, I would invite you to vote on the
>> treatment of Unicode code point 0xFEFF when encountered in the decoded
>> text stream. If you think an initial BOM should not be part of the
>> decoded text, then you are deciding how to treat code point 0xFEFF as
>> the first character in a CIF2 file, and the only consistent stance
>> would be that such a file is non-conformant, as the magic number
>> convention is violated.
>>
>> On Thu, Jun 17, 2010 at 9:21 PM, SIMON WESTRIP
>> <simonwestrip@btinternet.com> wrote:
>>>
>>> Dear all
>>>
>>> I've been watching this thread with the viewpoint that whatever is
>>> decided for the spec, I am going to have to be aware that CIFs may
>>> contain mixed encoding or encoding that isn't as specified. We meet
>>> this situation elsewhere, especially with text uploaded from web forms.
>>>
>>> So I quite like Herbert's latest description and would prefer to hold
>>> back from voting until I've considered this in more detail.
>>>
>>> Cheers
>>>
>>> Simon
>>>
>>> ________________________________
>>> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>>> To: Group finalising DDLm and associated dictionaries
>>> <ddlm-group@iucr.org>
>>> Sent: Wednesday, 16 June, 2010 12:29:41
>>> Subject: Re: [ddlm-group] UTF-8 BOM
>>>
>>> Dear Colleagues,
>>>
>>> As I said in my last message, I am proposing that we do what
>>> most of the world really does with unicode -- treat a CIF2 as
>>> a text file in which the information presented is a sequence
>>> of valid printable unicode code points no matter what the
>>> encoding.
>>>
>>> For convenience in interchange, I am proposing that all
>>> CIF2 processing software working on systems that provide
>>> support for UTF-8 must provide support for that particular
>>> encoding, but if someone happens to be working in a system
>>> that only supports a UTF-7 or a UTF-16 or an old code-page-based
>>> encoding then I see no reason to declare what they produce
>>> erroneous in any way -- just a reason to require that they
>>> clearly identify the encoding used so that one of the
>>> many reliable encoding conversion programs that are available
>>> may be passed over their file when it needs to be handled
>>> in the preferred encoding. I happen to use cyclone on my
>>> mac for that purpose.
>>>
>>> The use of a BOM is just a quick, simple way to clearly
>>> specify an encoding if the file encoding a text file
>>> is a unicode file, but it really is not part of the text
>>> itself.
>>>
>>> I, the strong proponent of supporting binary with CIF,
>>> am proposing that we return to the original approach
>>> to CIF -- that it really is a text file, not a binary file.
>>> I do so precisely to help me support the handling of
>>> binary with CIF.
>>>
>>> Regards,
>>> Herbert
>>> =====================================================
>>> Herbert J. Bernstein, Professor of Computer Science
>>> Dowling College, Kramer Science Center, KSC 121
>>> Idle Hour Blvd, Oakdale, NY, 11769
>>>
>>> +1-631-244-3035
>>> yaya@dowling.edu
>>> =====================================================
>>>
>>> On Wed, 16 Jun 2010, James Hester wrote:
>>>
>>>> Dear Herbert,
>>>>
>>>> Would you mind enlarging a little on what you are responding to here,
>>>> as I don't follow your thinking.
>>>> Perhaps I was not clear: I am not in favour of allowing a variety of
>>>> encodings to be included within the CIF2 standard. I am advocating
>>>> UTF8 only. Is this what you are responding to, or are you discussing
>>>> the suggestion of allowing a variety of encodings?
>>>>
>>>> On Wed, Jun 16, 2010 at 12:33 PM, Herbert J. Bernstein
>>>> <yaya@bernstein-plus-sons.com> wrote:
>>>>>
>>>>> Dear Colleagues,
>>>>>
>>>>> This is quite a disruptive change. Until now CIF has always had
>>>>> machine-dependent encoding changes assumed. I am in favor of
>>>>> working the entire world towards a common representation of text,
>>>>> and the use of multiple Unicode representations supported on
>>>>> current systems is going to be a large positive step. I think
>>>>> it is a little premature (by about 10 years) to assume a
>>>>> world of UTF-8 purity. We ain't there yet.
>>>>>
>>>>> You are essentially making CIF2 into a binary format instead
>>>>> of a text format. That is a truly disruptive change.
>>>>> I think it is a serious mistake that will discourage use of CIF as an
>>>>> interchange format, not encourage it.
>>>>>
>>>>> Regards,
>>>>> Herbert
>>>>>
>>>>> =====================================================
>>>>> Herbert J. Bernstein, Professor of Computer Science
>>>>> Dowling College, Kramer Science Center, KSC 121
>>>>> Idle Hour Blvd, Oakdale, NY, 11769
>>>>>
>>>>> +1-631-244-3035
>>>>> yaya@dowling.edu
>>>>> =====================================================
>>>>>
>>>>> On Wed, 16 Jun 2010, James Hester wrote:
>>>>>
>>>>>> My concern with opening up the suite of possible CIF encodings is
>>>>>> that we need to maintain a guarantee that any CIF2-conformant
>>>>>> writer will produce files that any CIF2-conformant reader can read.
>>>>>> As we are a data transfer and archiving standard, this is a core
>>>>>> guarantee that we make, so we cannot specify optional behaviour.
>>>>>> Note that we are not restricted to someone transferring files
>>>>>> between computers at a single point in time, when some negotiation
>>>>>> of encoding protocol could take place; we may be talking about a
>>>>>> third party retrieving a file archived some years ago by someone
>>>>>> else in the local university repository.
>>>>>>
>>>>>> What people are and have always been free to do is to encapsulate
>>>>>> and encode CIFs in whatever way they wish, as long as the result is
>>>>>> not touted as being 'CIF2 conformant'. The optional UTF8 BOM that
>>>>>> we have more or less agreed to is purely in deference to
>>>>>> poorly-written text editors, rather than an encoding signature as
>>>>>> such.
>>>>>>
>>>>>> On Tue, Jun 15, 2010 at 6:09 AM, Bollinger, John C
>>>>>> <John.Bollinger@stjude.org> wrote:
>>>>>> On Monday, June 14, 2010 9:26 AM, Brian McMahon wrote:
>>>>>>
>>>>>>> I'm coming to this late, I fear, but I would prefer that the spec
>>>>>>> be kept as simple as possible.
>>>>>>> I note the following comments in
>>>>>>> the Unicode FAQ document referenced by John B
>>>>>>> (http://www.unicode.org/faq/utf_bom.html):
>>>>>>>
>>>>>>> "Where UTF-8 is used transparently in 8-bit environments, the use
>>>>>>> of a BOM will interfere with any protocol or file format that expects
>>>>>>> specific ASCII characters at the beginning, such as the use of "#!"
>>>>>>> of at the beginning of Unix shell scripts."
>>>>>>
>>>>>> Well yes, but that applies to protocols defined in terms of 8-bit,
>>>>>> ASCII-derived character sets ("8-bit environments"). It does not
>>>>>> argue for BOMs to be forbidden in Unicode environments such as CIF2.
>>>>>> Of course, neither does it require that BOMs be accepted or
>>>>>> recognized in Unicode environments.
>>>>>>
>>>>>>> "In the absence of a protocol supporting its use as a BOM and when
>>>>>>> not at the beginning of a text stream, U+FEFF should normally not
>>>>>>> occur."
>>>>>>
>>>>>> I'm disappointed that you truncated the quote there. It continues
>>>>>> with "For backwards compatibility it should be treated as ZERO WIDTH
>>>>>> NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the
>>>>>> file or string." It goes on to advocate using U+2060 instead, and (in
>>>>>> the interest of full disclosure) it closes by commenting that a
>>>>>> language or protocol can specify that U+FEFF is unsupported in the
>>>>>> middle of a file.
>>>>>>
>>>>>>> I suggest the CIF specification deprecate the use of U+FEFF so that
>>>>>>> *any* occurrence of it be treated formally as an error. However, a
>>>>>>> note should acknowledge that U+FEFF is permitted according to the
>>>>>>> Unicode standard at the start of a data stream, and that therefore a
>>>>>>> CIF reading application may at its discretion accept U+FEFF followed
>>>>>>> by #\#CIF2.0 as a valid magic number at the start of a file.
>>>>>>
>>>>>> I don't see what is gained by forbidding U+FEFF from appearing inside
>>>>>> data values, where one might arrive via any number of innocent means.
>>>>>> As it currently stands, the draft permits this. It is somewhat
>>>>>> problematic to allow it at the beginning or end of a
>>>>>> whitespace-delimited value, but U+FEFF is by no means the only
>>>>>> character that is allowed but problematic at such a position.
>>>>>>
>>>>>> On the other hand, it is viable to specify that CIF itself does not
>>>>>> (directly) include a BOM. That's where we started. (Pedantic note:
>>>>>> "initial BOM" is redundant. As the term is used in relation to
>>>>>> Unicode, a BOM necessarily appears at the beginning of a data stream;
>>>>>> anywhere else, U+FEFF is just U+FEFF.) If CIF does not formally allow
>>>>>> a BOM then an otherwise well-formed CIF stream headed by a BOM would
>>>>>> then need to be interpreted either
>>>>>>
>>>>>> 1) as an unrecognized file, or
>>>>>>
>>>>>> 2) as an ill-formed CIF, or
>>>>>>
>>>>>> 3) as a well-formed CIF (any version) encapsulated in another
>>>>>> protocol. Such "another protocol" does not need to be the concern of
>>>>>> CIF.
>>>>>>
>>>>>>> The idea is that any fully-conformant CIF writer will never write an
>>>>>>> initial UTF-8 BOM, and so any software designed to handle only fully
>>>>>>> conformant CIFs will not be troubled by it.
>>>>>>
>>>>>> I could live with that. I can't imagine writing a CIF processor
>>>>>> limited to that mode of operation, nor would I want to use one, but I
>>>>>> can handle CIF's formal scope being limited in that way.
>>>>>>
>>>>>> In that case, however, let's carry it to the logical conclusion.
>>>>>> Rather than put one particular encoding detail outside CIF's scope,
>>>>>> why not put character encoding out of scope altogether? CIF can
>>>>>> easily be defined simply in terms of "Unicode characters".
>>>>>> Perhaps instead of anointing UTF-8 as the One True Encoding for CIF,
>>>>>> it would be better to make encoding an entirely separate concern.
>>>>>>
>>>>>> Practically speaking, you're going to have that anyway. Even
>>>>>> disregarding imgCIF, does anyone really expect never to hear "it's a
>>>>>> CIF, except encoded in <FOO-13> instead of UTF-8"? Does anyone really
>>>>>> think they need the authority of the CIF specification to require that
>>>>>> CIFs be delivered to them in a particular encoding? How is that
>>>>>> qualitatively different from requiring particular CIF content, as most
>>>>>> programs do?
>>>>>>
>>>>>>> Of course the world does
>>>>>>> contain CIFs created other than by fully-conformant CIF writers. To
>>>>>>> an extent the community should decide for itself how best to attempt
>>>>>>> to handle deviations from full conformance. It would help, perhaps,
>>>>>>> if those of us writing CIF readers would document specific practices
>>>>>>> that the software takes to accommodate such deviations. Ideally, such
>>>>>>> software should have a verbose logging mode that can be activated
>>>>>>> whenever surprising behaviour in reading CIFs is encountered by
>>>>>>> the user.
>>>>>>
>>>>>> I think it's exceedingly optimistic to expect "the community" to
>>>>>> arrive at and abide by a single, consistent set of best practices.
>>>>>> The best you can hope for is that a small number of organizations and
>>>>>> / or programs will exert enough influence to establish their own de
>>>>>> facto standards.
>>>>>>
>>>>>> We can exert some influence there, however. Either the CIF spec or a
>>>>>> companion spec could establish conformance requirements for CIF
>>>>>> *processors*, including, for example, the ability to diagnose
>>>>>> particular malformations. The XML spec does this, as do some
>>>>>> programming language specs.
>>>>>>
>>>>>> Such a document could also establish, perhaps, that CIF processors
>>>>>> must be able to accept the UTF-8 encoding, and maybe even that they
>>>>>> must assume UTF-8 by default. That would establish the baseline and a
>>>>>> guaranteed interoperability mode that we would otherwise lose by
>>>>>> pushing character encoding outside the format specification.
>>>>>>
>>>>>>> Notice that naive concatenation of CIFs will remain a bad idea for
>>>>>>> all sorts of reasons - beyond the purely syntactic issues, one will
>>>>>>> get multiple "data_TOZ" declarations for example. Undoubtedly this
>>>>>>> will continue to happen, but perhaps increasing the number of
>>>>>>> occasions when blindly concatenating files triggers software errors
>>>>>>> will help to raise awareness and/or the use of better software tools.
>>>>>>
>>>>>> You are preaching to the choir with that as far as I am concerned. It
>>>>>> has never been altogether safe or reliable to assemble CIFs by
>>>>>> concatenation of fragments or complete CIFs, and I don't see why CIF2
>>>>>> needs to make special accommodation for behavior that was never
>>>>>> correct in the first place. No matter what treatment is chosen for
>>>>>> U+FEFF, people who exercise due care will still be able to assemble
>>>>>> well-formed CIF2 files from fragments, even by using 'cat' if they do
>>>>>> so shrewdly.
>>>>>>
>>>>>> John
>>>>>> --
>>>>>> John C. Bollinger, Ph.D.
>>>>>> Department of Structural Biology
>>>>>> St. Jude Children's Research Hospital
>>>>>>
>>>>>> Email Disclaimer: www.stjude.org/emaildisclaimer
>>>>>>
>>>>>> _______________________________________________
>>>>>> ddlm-group mailing list
>>>>>> ddlm-group@iucr.org
>>>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>>>
>>>>>> --
>>>>>> T +61 (02) 9717 9907
>>>>>> F +61 (02) 9717 3145
>>>>>> M +61 (04) 0249 4148
>>>>>>
>>>>> _______________________________________________
>>>>> ddlm-group mailing list
>>>>> ddlm-group@iucr.org
>>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>>
>>>>
>>>> --
>>>> T +61 (02) 9717 9907
>>>> F +61 (02) 9717 3145
>>>> M +61 (04) 0249 4148
>>>> _______________________________________________
>>>> ddlm-group mailing list
>>>> ddlm-group@iucr.org
>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>
>>> _______________________________________________
>>> ddlm-group mailing list
>>> ddlm-group@iucr.org
>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>
>>
>> --
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
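
The distinction drawn in this message between U+FEFF and U+FFFE, and the effect of a leading BOM on the magic number check, can be sketched in Python. This is an illustrative sketch only, not anything proposed in the thread: the helper name `classify_start` is hypothetical, and the magic string is spelled as it appears in the quoted text (`#\#CIF2.0`), which may differ from whatever the final specification adopts.

```python
import codecs
import unicodedata

# Magic string as spelled in the quoted message; treat it as an assumption.
MAGIC = "#\\#CIF2.0"


def classify_start(raw: bytes) -> str:
    """Hypothetical helper: report how a UTF-8 byte stream begins."""
    # The plain "utf-8" codec keeps a leading U+FEFF in the decoded text;
    # the "utf-8-sig" codec would silently drop it instead.
    text = raw.decode("utf-8")
    if text.startswith(MAGIC):
        return "magic"
    if text.startswith("\ufeff" + MAGIC):
        return "bom+magic"
    return "unrecognized"


plain = MAGIC.encode("utf-8") + b"\ndata_example\n"
with_bom = codecs.BOM_UTF8 + plain

print(classify_start(plain))              # magic
print(classify_start(with_bom))           # bom+magic

# U+FEFF is an assigned character; U+FFFE is a noncharacter (category Cn).
print(unicodedata.name("\ufeff"))         # ZERO WIDTH NO-BREAK SPACE
print(unicodedata.category("\ufffe"))     # Cn

# Decoding with "utf-8-sig" removes the BOM before the application sees it.
print(with_bom.decode("utf-8-sig").startswith(MAGIC))  # True
```

Whether the "bom+magic" case should be accepted, rejected as non-conformant, or treated as a CIF encapsulated in another protocol is the policy question under discussion; the sketch only makes the three cases distinguishable.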