Re: [ddlm-group] UTF-8 BOM
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] UTF-8 BOM
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Tue, 15 Jun 2010 22:33:35 -0400 (EDT)
- In-Reply-To: <AANLkTikeIbft9SKfvpgTpGZVpo47Vg_acYBbXi-eUvU-@mail.gmail.com>
- References: <alpine.BSF.2.00.1005111250250.60002@epsilon.pair.com><4BEB2CE6.3060900@niehs.nih.gov><8F77913624F7524AACD2A92EAF3BFA54165DF337DB@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005131228500.12350@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54165DF337DD@SJMEMXMBS11.stjude.sjcrh.local><AANLkTimlen0jl2p5SsvvizSNN37HZmMs2XOCc0KW7RMG@mail.gmail.com><alpine.BSF.2.00.1005180700530.27091@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54165DF337E1@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005181330210.38662@epsilon.pair.com><AANLkTimOLbOkIqCwqgsKJ36eVctlZccsAN4XAjYDr4Qd@mail.gmail.com><20100614142541.GA356@emerald.iucr.org><8F77913624F7524AACD2A92EAF3BFA54165DF3381E@SJMEMXMBS11.stjude.sjcrh.local><AANLkTikeIbft9SKfvpgTpGZVpo47Vg_acYBbXi-eUvU-@mail.gmail.com>
Dear Colleagues,

This is quite a disruptive change. Until now CIF has always had
machine-dependent encoding changes assumed. I am in favor of working
the entire world towards a common representation of text, and the use
of multiple Unicode representations supported on current systems is
going to be a large positive step. I think it is a little premature
(by about 10 years) to assume a world of UTF-8 purity. We ain't there
yet.

You are essentially making CIF2 into a binary format instead of a text
format. That is a truly disruptive change. I think it is a serious
mistake that will discourage use of CIF as an interchange format, not
encourage it.

Regards,
Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Wed, 16 Jun 2010, James Hester wrote:

> My concern with opening up the suite of possible CIF encodings is that
> we need to maintain a guarantee that any CIF2-conformant writer will
> produce files that any CIF2-conformant reader can read. As we are a
> data transfer and archiving standard, this is a core guarantee that we
> make, so we cannot specify optional behaviour. Note that we are not
> restricted to someone transferring files between computers at a single
> point in time, when some negotiation of encoding protocol could take
> place; we may be talking about a third party retrieving a file
> archived some years ago by someone else in the local university
> repository.
>
> What people are and have always been free to do is to encapsulate and
> encode CIFs in whatever way they wish, as long as the result is not
> touted as being 'CIF2 conformant'. The optional UTF8 BOM that we have
> more or less agreed to is purely in deference to poorly-written text
> editors, rather than an encoding signature as such.
>
> On Tue, Jun 15, 2010 at 6:09 AM, Bollinger, John C
> <John.Bollinger@stjude.org> wrote:
>
> > On Monday, June 14, 2010 9:26 AM, Brian McMahon wrote:
> >
> > > I'm coming to this late, I fear, but I would prefer that the spec
> > > be kept as simple as possible. I note the following comments in
> > > the Unicode FAQ document referenced by John B
> > > (http://www.unicode.org/faq/utf_bom.html):
> > >
> > >   "Where UTF-8 is used transparently in 8-bit environments, the use
> > >   of a BOM will interfere with any protocol or file format that
> > >   expects specific ASCII characters at the beginning, such as the
> > >   use of "#!" at the beginning of Unix shell scripts."
> >
> > Well yes, but that applies to protocols defined in terms of 8-bit,
> > ASCII-derived character sets ("8-bit environments"). It does not
> > argue for BOMs to be forbidden in Unicode environments such as CIF2.
> > Of course, neither does it require that BOMs be accepted or
> > recognized in Unicode environments.
> >
> > >   "In the absence of a protocol supporting its use as a BOM and
> > >   when not at the beginning of a text stream, U+FEFF should
> > >   normally not occur."
> >
> > I'm disappointed that you truncated the quote there. It continues
> > with "For backwards compatibility it should be treated as ZERO WIDTH
> > NON-BREAKING SPACE (ZWNBSP), and is then part of the content of the
> > file or string." It goes on to advocate using U+2060 instead, and (in
> > the interest of full disclosure) it closes by commenting that a
> > language or protocol can specify that U+FEFF is unsupported in the
> > middle of a file.
> >
> > > I suggest the CIF specification deprecate the use of U+FEFF so that
> > > *any* occurrence of it be treated formally as an error. However, a
> > > note should acknowledge that U+FEFF is permitted according to the
> > > Unicode standard at the start of a data stream, and that therefore a
> > > CIF reading application may at its discretion accept U+FEFF followed
> > > by #\#CIF2.0 as a valid magic number at the start of a file.
> >
> > I don't see what is gained by forbidding U+FEFF from appearing inside
> > data values, where one might arrive via any number of innocent means.
> > As it currently stands, the draft permits this. It is somewhat
> > problematic to allow it at the beginning or end of a
> > whitespace-delimited value, but U+FEFF is by no means the only
> > character that is allowed but problematic at such a position.
> >
> > On the other hand, it is viable to specify that CIF itself does not
> > (directly) include a BOM. That's where we started. (Pedantic note:
> > "initial BOM" is redundant. As the term is used in relation to
> > Unicode, a BOM necessarily appears at the beginning of a data stream;
> > anywhere else, U+FEFF is just U+FEFF.) If CIF does not formally allow
> > a BOM then an otherwise well-formed CIF stream headed by a BOM would
> > then need to be interpreted either
> >
> > 1) as an unrecognized file, or
> >
> > 2) as an ill-formed CIF, or
> >
> > 3) as a well-formed CIF (any version) encapsulated in another
> >    protocol. Such "another protocol" does not need to be the concern
> >    of CIF.
> >
> > > The idea is that any fully-conformant CIF writer will never write an
> > > initial UTF-8 BOM, and so any software designed to handle only fully
> > > conformant CIFs will not be troubled by it.
> >
> > I could live with that. I can't imagine writing a CIF processor
> > limited to that mode of operation, nor would I want to use one, but I
> > can handle CIF's formal scope being limited in that way.
> >
> > In that case, however, let's carry it to the logical conclusion.
> > Rather than put one particular encoding detail outside CIF's scope,
> > why not put character encoding out of scope altogether? CIF can
> > easily be defined simply in terms of "Unicode characters". Perhaps
> > instead of anointing UTF-8 as the One True Encoding for CIF, it would
> > be better to make encoding an entirely separate concern.
> >
> > Practically speaking, you're going to have that anyway. Even
> > disregarding imgCIF, does anyone really expect never to hear "it's a
> > CIF, except encoded in <FOO-13> instead of UTF-8"? Does anyone really
> > think they need the authority of the CIF specification to require that
> > CIFs be delivered to them in a particular encoding? How is that
> > qualitatively different from requiring particular CIF content, as most
> > programs do?
> >
> > > Of course the world does contain CIFs created other than by
> > > fully-conformant CIF writers. To an extent the community should
> > > decide for itself how best to attempt to handle deviations from
> > > full conformance. It would help, perhaps, if those of us writing
> > > CIF readers would document specific practices that the software
> > > takes to accommodate such deviations. Ideally, such software should
> > > have a verbose logging mode that can be activated whenever
> > > surprising behaviour in reading CIFs is encountered by the user.
> >
> > I think it's exceedingly optimistic to expect "the community" to
> > arrive at and abide by a single, consistent set of best practices.
> > The best you can hope for is that a small number of organizations and
> > / or programs will exert enough influence to establish their own de
> > facto standards.
> >
> > We can exert some influence there, however. Either the CIF spec or a
> > companion spec could establish conformance requirements for CIF
> > *processors*, including, for example, the ability to diagnose
> > particular malformations. The XML spec does this, as do some
> > programming language specs.
> >
> > Such a document could also establish, perhaps, that CIF processors
> > must be able to accept the UTF-8 encoding, and maybe even that they
> > must assume UTF-8 by default. That would establish the baseline and a
> > guaranteed interoperability mode that we would otherwise lose by
> > pushing character encoding outside the format specification.
> >
> > > Notice that naive concatenation of CIFs will remain a bad idea for
> > > all sorts of reasons - beyond the purely syntactic issues, one will
> > > get multiple "data_TOZ" declarations for example. Undoubtedly this
> > > will continue to happen, but perhaps increasing the number of
> > > occasions when blindly concatenating files triggers software errors
> > > will help to raise awareness and/or the use of better software tools.
> >
> > You are preaching to the choir with that as far as I am concerned. It
> > has never been altogether safe or reliable to assemble CIFs by
> > concatenation of fragments or complete CIFs, and I don't see why CIF2
> > needs to make special accommodation for behavior that was never
> > correct in the first place. No matter what treatment is chosen for
> > U+FEFF, people who exercise due care will still be able to assemble
> > well-formed CIF2 files from fragments, even by using 'cat' if they do
> > so shrewdly.
> >
> > John
> > --
> > John C. Bollinger, Ph.D.
> > Department of Structural Biology
> > St. Jude Children's Research Hospital
> >
> > Email Disclaimer: www.stjude.org/emaildisclaimer
> >
> > _______________________________________________
> > ddlm-group mailing list
> > ddlm-group@iucr.org
> > http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
_______________________________________________ ddlm-group mailing list ddlm-group@iucr.org http://scripts.iucr.org/mailman/listinfo/ddlm-group
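
Editorial illustration (not part of the original message): the reader
behaviour discussed in the thread above, where a CIF reader may at its
discretion accept a UTF-8 BOM followed by the #\#CIF2.0 magic number
while leaving any later U+FEFF alone as ordinary content, could be
sketched roughly as follows. This is only a sketch under stated
assumptions: the function names check_cif2_header and decode_cif2 and
the accept_bom flag are hypothetical, and nothing here is taken from
the CIF2 draft or from any existing CIF library.

```python
import codecs

# Magic number quoted in the thread: the nine characters  #\#CIF2.0
CIF2_MAGIC = b"#\\#CIF2.0"
UTF8_BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def check_cif2_header(data: bytes, accept_bom: bool = True) -> str:
    """Classify the start of a byte stream (hypothetical helper).

    Returns "cif2", "cif2-with-bom", or "unrecognized".
    """
    if data.startswith(CIF2_MAGIC):
        return "cif2"
    if data.startswith(UTF8_BOM + CIF2_MAGIC):
        # A strict reader (accept_bom=False) treats this as not CIF2.
        return "cif2-with-bom" if accept_bom else "unrecognized"
    return "unrecognized"


def decode_cif2(data: bytes, accept_bom: bool = True) -> str:
    """Decode a CIF2 byte stream as UTF-8 (hypothetical helper).

    Only a leading BOM is stripped; U+FEFF elsewhere is kept as content.
    """
    kind = check_cif2_header(data, accept_bom)
    if kind == "unrecognized":
        raise ValueError("stream does not start with the CIF2 magic number")
    if kind == "cif2-with-bom":
        data = data[len(UTF8_BOM):]
    # Strict decoding: malformed UTF-8 raises UnicodeDecodeError.
    return data.decode("utf-8")
```

Calling decode_cif2 with accept_bom=False corresponds to the strict
reading in which a BOM-headed stream is simply not recognized as CIF2;
the default corresponds to the lenient reader that tolerates the BOM.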