[ddlm-group] imgCIF versus CIF2
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: [ddlm-group] imgCIF versus CIF2
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Mon, 24 May 2010 12:08:17 -0400 (EDT)
- In-Reply-To: <AANLkTimOLbOkIqCwqgsKJ36eVctlZccsAN4XAjYDr4Qd@mail.gmail.com>
- References: <8F77913624F7524AACD2A92EAF3BFA54165DF337D5@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005111250250.60002@epsilon.pair.com><4BEB2CE6.3060900@niehs.nih.gov><8F77913624F7524AACD2A92EAF3BFA54165DF337DB@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005131228500.12350@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54165DF337DD@SJMEMXMBS11.stjude.sjcrh.local><AANLkTimlen0jl2p5SsvvizSNN37HZmMs2XOCc0KW7RMG@mail.gmail.com><alpine.BSF.2.00.1005180700530.27091@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54165DF337E1@SJMEMXMBS11.stjude.sjcrh.local><alpine.BSF.2.00.1005181330210.38662@epsilon.pair.com><AANLkTimOLbOkIqCwqgsKJ36eVctlZccsAN4XAjYDr4Qd@mail.gmail.com>
Allow me to clarify the relationship between imgCIF/CBF and CIF2.

imgCIF is an add-on to mmCIF for synchrotron images. It exists in
multiple representations: pure ASCII CIFs, pure binary CBFs, mixed
ASCII CIFs with binutf "binary" sections, and NeXus files
(HDF5/HDF5/XML files). The binutf files use UCS-2/UTF-16 to carry the
binary information of a CBF with only a 7% overhead. While the pure
binary form is better, the binutf form is an important compromise in
working in the XML world, just as the CIF ASCII form has been an
important compromise in working in the DDL2 CIF world.

-- Herbert
=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya@dowling.edu
=====================================================

On Mon, 24 May 2010, James Hester wrote:

> To run through the alternatives and some of the arguments so far:
>
> (i) treating an embedded BOM as an ordinary character runs against the
> Unicode recommendations. If we wish our standard to be respected, I think
> we should at least respect other standards and the thinking that has gone
> into them
>
> (ii) treating an embedded BOM as whitespace is OK with the Unicode standard,
> but means that a non-ASCII character now has syntactic meaning in the CIF.
> I think this would be completely inconsistent on our part, as an invisible
> character (when displayed) can actually be used to delimit strings. This is
> my least preferred solution, as it goes against the human-readability
> expected of CIFs
>
> (iii) ignoring embedded BOMs is bad because they can be a 'tip off to a
> serious problem'.
>
> (iv) treating embedded BOMs as syntax errors will cause issues when CIF2
> files are naively concatenated
>
> I think the only viable alternatives are to choose (iii) or (iv).
>
> So: why exactly is ignoring a BOM a problem? If the embedded BOM is the
> leading BOM from a UTF16 file that has been naively concatenated, it will
> have bytes 0xFE 0xFF. This byte sequence (and the reverse) is not
> acceptable UTF8, leading to a decoding error from the UTF8 decoding step.
> The subsequent bytes will be UTF16, which should cause a decoding failure in
> any case. So I deduce that we are simply discussing how to treat a UTF8
> BOM, which can only find its way into a CIF file by naive concatenation of
> UTF8-encoded files written by certain programs.
>
> If the embedded BOM is a UTF-8 BOM, then ignoring it would be OK, as I don't
> see that it is indicative of any problems beyond misguided choice of text
> editor.
>
> So I would advocate ignoring (and removing) UTF8-BOMs in the input stream,
> and treating all other BOMs as syntax errors. Individual applications may
> wish to give users the option of interpreting U+FEFF as the deprecated
> ZWNBSP (and translating to the correct character) on the understanding that
> if this occurs outside a delimited string it will cause a syntax error.
>
> James
>
> PS am I the only one who thinks it unlikely that Wordpad users would choose
> to use 'cat' to join file fragments together?
>
> On Wed, May 19, 2010 at 3:46 AM, Herbert J. Bernstein
> <yaya@bernstein-plus-sons.com> wrote:
>       Allow me to clarify my position, so there is no misunderstanding:
>
>       I believe that we will be dealing with a world with at least UTF-8
>       and UCS-2/UTF-16 encodings for many years to come. I have no
>       objection to CIF2 being specified solely in terms of UTF-8 for
>       simplicity and consistency, but if we are to write software that
>       people can use, we must have a reasonable position with respect
>       to the encodings people use, and that means that, at the very
>       least, we need to accept and process UTF-8 BOMs as harmless
>       additional text. Some of us will also be supporting UCS-2/UTF-16
>       directly in our applications. I don't mind if other applications
>       are only going to support UTF-8, but inasmuch as, as long as
>       we have Java and web browsers, we are going to encounter
>       UCS-2/UTF-16, we should do something sensible when a UCS-2/UTF-16
>       BOM pops up, either doing the internal translation if we so
>       choose, or, if that is not handled by a particular application,
>       issuing a polite warning suggesting the use of an external
>       translator if the application does not wish to handle
>       UCS-2/UTF-16.
>
>       BOMs will almost always appear in modern UCS-2/UTF-16 files, and
>       when they are converted to UTF-8 that will give us yet another
>       source of UTF-8 BOMs. I believe the sensible thing to do is to
>       recognize BOMs.
>
>       Regards,
>       Herbert
>
> On Tue, 18 May 2010, Bollinger, John C wrote:
>
> > Herbert Bernstein wrote:
> >> Let me see if I understand this correctly -- a user takes 2 perfectly
> >> good CIF2 files, edits each to clean up, say, some comments to keep
> >> straight where one begins and one ends, using a well-designed modern
> >> text editor that happens to put a BOM at the start of each file,
> >> concatenates the two files with cat to ship them into the IUCr, and
> >> suddenly they have a syntax error caused by a character that they
> >> cannot see!!!
> >>
> >> To me this seems pointless when it is trivial for software to
> >> recognize the character and handle it sensibly.
> >
> > And that is my principal rationale for preferring that embedded
> > U+FEFF be recognized as CIF whitespace. With that approach, the
> > concatenation of two well-formed CIF2 files is always a well-formed
> > CIF2 file, regardless of the presence or absence of BOMs in the
> > original files. Note, too, that such concatenation cannot produce a
> > mixed-encoding file because files encoded in UTF-16[BE|LE],
> > UTF-32[BE|LE], or any other encoding that can be distinguished from
> > UTF-8 are not well-formed CIF2 files to start. The file concatenation
> > scenario thus does not provide a use case for the CIF2 *specification*
> > to recognize embedded U+FEFF as an encoding marker.
> >
> > On the other hand, I again feel compelled to distinguish program
> > behaviors from the CIF2 format specification. None of the above would
> > prevent a CIF processor from recognizing and handling CIF-like
> > character streams encoded via schemes other than UTF-8, nor from
> > recognizing embedded U+FEFF code sequences in various encodings as
> > encoding switches, thereby handling mixed-encoding files. Indeed,
> > such a program or library would be invaluable for correcting
> > encoding-related errors. That does not, however, mean that such files
> > must be considered well-formed CIF2, no matter how likely they may (or
> > may not) be to arise.
> >
> >
> > James Hester wrote:
> >> I would be happy to call an embedded BOM a syntax error.
> >
> > In light of the possibility of U+FEFF appearing in a data value (for
> > example, from cutting text from a Unicode manuscript and pasting it
> > into a CIF), I need to refine my earlier blanket alternative of
> > treating embedded U+FEFF as a syntax error. I now think it would be
> > ok to treat U+FEFF as a syntax error *provided* that it appears
> > outside a delimited string. That's still not my preference, though,
> > and I feel confident that Herb will still disagree.
> >
> >
> > Regards,
> >
> > John
> > --
> > John C. Bollinger, Ph.D.
> > Computing and X-Ray Scientist
> > Department of Structural Biology
> > St. Jude Children's Research Hospital
> > John.Bollinger@StJude.org
> > (901) 595-3166 [office]
> > www.stjude.org
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
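
As an illustration only, the approach James Hester advocates in the quoted message (ignore and remove UTF-8 BOMs, reject any other BOM) can be sketched in Python roughly as follows. The names `CifEncodingError` and `decode_ignoring_utf8_boms` are hypothetical and do not come from any CIF library. The sketch relies on the point made in the thread: the bytes 0xFE and 0xFF of a UTF-16 BOM are never valid UTF-8, so a strict UTF-8 decode already rejects them.

```python
class CifEncodingError(ValueError):
    """Hypothetical error type: the input is not valid UTF-8 CIF2 text."""


def decode_ignoring_utf8_boms(data: bytes) -> str:
    """Decode bytes as strict UTF-8 and drop every U+FEFF, leading or embedded.

    UTF-16 and UTF-32 BOMs contain the bytes 0xFE and 0xFF, which never
    occur in well-formed UTF-8, so the strict decode below rejects them.
    """
    try:
        # The plain "utf-8" codec is strict by default and keeps BOMs
        # as U+FEFF characters instead of stripping them.
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise CifEncodingError(f"not valid UTF-8 at byte {exc.start}") from exc
    # Assumption of this sketch: U+FEFF is removed everywhere, including
    # inside delimited strings.
    return text.replace("\ufeff", "")


# Two UTF-8 files, each saved with a BOM, joined as 'cat' would join them.
first = b"\xef\xbb\xbfdata_a\n_x 1\n"
second = b"\xef\xbb\xbfdata_b\n_y 2\n"
joined_text = decode_ignoring_utf8_boms(first + second)
```

This sketch does not distinguish U+FEFF inside a delimited string from U+FEFF between tokens. An application offering the optional ZWNBSP interpretation mentioned in the thread, or the whitespace treatment John Bollinger prefers, would have to make that distinction in its tokenizer rather than at the decoding step.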
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
- References:
- [ddlm-group] UTF-8 BOM (Bollinger, John C)
- Re: [ddlm-group] UTF-8 BOM (Herbert J. Bernstein)
- Re: [ddlm-group] UTF-8 BOM (Joe Krahn)
- Re: [ddlm-group] UTF-8 BOM (Bollinger, John C)
- Re: [ddlm-group] UTF-8 BOM (Herbert J. Bernstein)
- Re: [ddlm-group] UTF-8 BOM (Bollinger, John C)
- Re: [ddlm-group] UTF-8 BOM (James Hester)
- Re: [ddlm-group] UTF-8 BOM (Herbert J. Bernstein)
- Re: [ddlm-group] UTF-8 BOM (Bollinger, John C)
- Re: [ddlm-group] UTF-8 BOM (Herbert J. Bernstein)
- Re: [ddlm-group] UTF-8 BOM (James Hester)