Re: [ddlm-group] UTF-8 BOM
- To: Group finalising DDLm and associated dictionaries <[email protected]>
- Subject: Re: [ddlm-group] UTF-8 BOM
- From: James Hester <[email protected]>
- Date: Wed, 16 Jun 2010 13:54:32 +1000
Dear Herbert,

Would you mind enlarging a little on what you are responding to here,
as I don't follow your thinking. Perhaps I was not clear: I am not in
favour of allowing a variety of encodings to be included within the
CIF2 standard. I am advocating UTF8 only. Is this what you are
responding to, or are you discussing the suggestion of allowing a
variety of encodings?

On Wed, Jun 16, 2010 at 12:33 PM, Herbert J. Bernstein <[email protected]> wrote:
> Dear Colleagues,
>
>   This is quite a disruptive change. Until now CIF has always had
> machine-dependent encoding changes assumed. I am in favor of
> working the entire world towards a common representation of text,
> and the use of multiple Unicode representations supported on
> current systems is going to be a large positive step. I think
> it is a little premature (by about 10 years) to assume a
> world of UTF-8 purity. We ain't there yet.
>
>   You are essentially making CIF2 into a binary format instead
> of a text format. That is a truly disruptive change. I think
> it is a serious mistake that will discourage use of CIF as an
> interchange format, not encourage it.
>
>   Regards,
>     Herbert
>
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  [email protected]
> =====================================================
>
> On Wed, 16 Jun 2010, James Hester wrote:
>
>> My concern with opening up the suite of possible CIF encodings is that we
>> need to maintain a guarantee that any CIF2-conformant writer will produce
>> files that any CIF2-conformant reader can read. As we are a data transfer
>> and archiving standard, this is a core guarantee that we make, so we cannot
>> specify optional behaviour. Note that we are not restricted to someone
>> transferring files between computers at a single point in time, when some
>> negotiation of encoding protocol could take place; we may be talking about a
>> third party retrieving a file archived some years ago by someone else in the
>> local university repository.
>>
>> What people are and have always been free to do is to encapsulate and encode
>> CIFs in whatever way they wish, as long as the result is not touted as being
>> 'CIF2 conformant'. The optional UTF8 BOM that we have more or less agreed
>> to is purely in deference to poorly-written text editors, rather than an
>> encoding signature as such.
>>
>> On Tue, Jun 15, 2010 at 6:09 AM, Bollinger, John C
>> <[email protected]> wrote:
>> On Monday, June 14, 2010 9:26 AM, Brian McMahon wrote:
>>
>> >I'm coming to this late, I fear, but I would prefer that the spec
>> >be kept as simple as possible.
>> >I note the following comments in
>> >the Unicode FAQ document referenced by John B
>> >(http://www.unicode.org/faq/utf_bom.html):
>> >
>> >    "Where UTF-8 is used transparently in 8-bit environments, the use
>> >    of a BOM will interfere with any protocol or file format that expects
>> >    specific ASCII characters at the beginning, such as the use of "#!"
>> >    of at the beginning of Unix shell scripts."
>>
>> Well yes, but that applies to protocols defined in terms of 8-bit,
>> ASCII-derived character sets ("8-bit environments"). It does not
>> argue for BOMs to be forbidden in Unicode environments such as CIF2.
>> Of course, neither does it require that BOMs be accepted or
>> recognized in Unicode environments.
>>
>> >    "In the absence of a protocol supporting its use as a BOM and when
>> >    not at the beginning of a text stream, U+FEFF should normally not
>> >    occur."
>>
>> I'm disappointed that you truncated the quote there. It continues
>> with "For backwards compatibility it should be treated as ZERO WIDTH
>> NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the
>> file or string." It goes on to advocate using U+2060 instead, and (in
>> the interest of full disclosure) it closes by commenting that a
>> language or protocol can specify that U+FEFF is unsupported in the
>> middle of a file.
>>
>> >I suggest the CIF specification deprecate the use of U+FEFF so that
>> >*any* occurrence of it be treated formally as an error. However, a
>> >note should acknowledge that U+FEFF is permitted according to the
>> >Unicode standard at the start of a data stream, and that therefore a
>> >CIF reading application may at its discretion accept U+FEFF followed
>> >by #\#CIF2.0 as a valid magic number at the start of a file.
>>
>> I don't see what is gained by forbidding U+FEFF from appearing inside
>> data values, where one might arrive via any number of innocent means.
>> As it currently stands, the draft permits this. It is somewhat
>> problematic to allow it at the beginning or end of a
>> whitespace-delimited value, but U+FEFF is by no means the only
>> character that is allowed but problematic at such a position.
>>
>> On the other hand, it is viable to specify that CIF itself does not
>> (directly) include a BOM. That's where we started. (Pedantic note:
>> "initial BOM" is redundant. As the term is used in relation to
>> Unicode, a BOM necessarily appears at the beginning of a data stream;
>> anywhere else, U+FEFF is just U+FEFF.) If CIF does not formally allow
>> a BOM then an otherwise well-formed CIF stream headed by a BOM would
>> then need to be interpreted either
>>
>> 1) as an unrecognized file, or
>>
>> 2) as an ill-formed CIF, or
>>
>> 3) as a well-formed CIF (any version) encapsulated in another
>> protocol. Such "another protocol" does not need to be the concern of
>> CIF.
>>
>> >The idea is that any fully-conformant CIF writer will never write an
>> >initial UTF-8 BOM, and so any software designed to handle only fully
>> >conformant CIFs will not be troubled by it.
>>
>> I could live with that. I can't imagine writing a CIF processor
>> limited to that mode of operation, nor would I want to use one, but I
>> can handle CIF's formal scope being limited in that way.
>>
>> In that case, however, let's carry it to the logical conclusion.
>> Rather than put one particular encoding detail outside CIF's scope,
>> why not put character encoding out of scope altogether? CIF can
>> easily be defined simply in terms of "Unicode characters". Perhaps
>> instead of anointing UTF-8 as the One True Encoding for CIF, it would
>> be better to make encoding an entirely separate concern.
>>
>> Practically speaking, you're going to have that anyway. Even
>> disregarding imgCIF, does anyone really expect never to hear "it's a
>> CIF, except encoded in <FOO-13> instead of UTF-8"?
>> Does anyone really
>> think they need the authority of the CIF specification to require that
>> CIFs be delivered to them in a particular encoding? How is that
>> qualitatively different from requiring particular CIF content, as most
>> programs do?
>>
>> >Of course the world does
>> >contain CIFs created other than by fully-conformant CIF writers. To
>> >an extent the community should decide for itself how best to attempt
>> >to handle deviations from full conformance. It would help, perhaps, if
>> >those of us writing CIF readers would document specific practices that
>> >the software takes to accommodate such deviations. Ideally, such
>> >software should have a verbose logging mode that can be activated
>> >whenever surprising behaviour in reading CIFs is encountered by
>> >the user.
>>
>> I think it's exceedingly optimistic to expect "the community" to
>> arrive at and abide by a single, consistent set of best practices.
>> The best you can hope for is that a small number of organizations and
>> / or programs will exert enough influence to establish their own de
>> facto standards.
>>
>> We can exert some influence there, however. Either the CIF spec or a
>> companion spec could establish conformance requirements for CIF
>> *processors*, including, for example, the ability to diagnose
>> particular malformations. The XML spec does this, as do some
>> programming language specs.
>>
>> Such a document could also establish, perhaps, that CIF processors
>> must be able to accept the UTF-8 encoding, and maybe even that they
>> must assume UTF-8 by default. That would establish the baseline and a
>> guaranteed interoperability mode that we would otherwise lose by
>> pushing character encoding outside the format specification.
>>
>> >Notice that naive concatenation of CIFs will remain a bad idea for
>> >all sorts of reasons - beyond the purely syntactic issues, one will
>> >get multiple "data_TOZ" declarations for example. Undoubtedly this
>> >will continue to happen, but perhaps increasing the number of
>> >occasions when blindly concatenating files triggers software errors
>> >will help to raise awareness and/or the use of better software tools.
>>
>> You are preaching to the choir with that as far as I am concerned. It
>> has never been altogether safe or reliable to assemble CIFs by
>> concatenation of fragments or complete CIFs, and I don't see why CIF2
>> needs to make special accommodation for behavior that was never
>> correct in the first place. No matter what treatment is chosen for
>> U+FEFF, people who exercise due care will still be able to assemble
>> well-formed CIF2 files from fragments, even by using 'cat' if they do
>> so shrewdly.
>>
>> John
>> --
>> John C. Bollinger, Ph.D.
>> Department of Structural Biology
>> St. Jude Children's Research Hospital
>>
>> Email Disclaimer: www.stjude.org/emaildisclaimer
>>
>> _______________________________________________
>> ddlm-group mailing list
>> [email protected]
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>
>> --
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
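
For readers following the thread, here is a minimal sketch of the reader-side behaviour under discussion: a UTF-8-only reader that tolerates an optional UTF-8 BOM ahead of the `#\#CIF2.0` magic number. Only the BOM and the magic number come from the messages above. The function name `classify_cif_stream` and its return labels are hypothetical, chosen for illustration, and are not taken from any CIF specification or library.

```python
import codecs

# Magic number as quoted in the thread: '#', '\', '#', then "CIF2.0".
CIF2_MAGIC = "#\\#CIF2.0"


def classify_cif_stream(data: bytes) -> str:
    """Classify the start of a byte stream (hypothetical labels).

    Returns "cif2" for UTF-8 text that starts with the magic number,
    "cif2-with-bom" if a UTF-8 BOM precedes the magic number, and
    "unrecognized" for anything else, including non-UTF-8 input.
    """
    has_bom = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    try:
        text = body.decode("utf-8")
    except UnicodeDecodeError:
        # A UTF-8-only reader has no basis for guessing another encoding.
        return "unrecognized"
    if not text.startswith(CIF2_MAGIC):
        return "unrecognized"
    return "cif2-with-bom" if has_bom else "cif2"
```

Whether the BOM case should be accepted, rejected as ill-formed, or treated as encapsulation in another protocol is exactly the choice being debated; the sketch only makes the three outcomes distinguishable.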