Discussion List Archives


[ddlm-group] imgCIF versus CIF2

Allow me to clarify the relationship between imgCIF/CBF and CIF2.

imgCIF is an add-on to mmCIF for synchrotron images.  It exists
in multiple representations:  pure ASCII CIFs, pure binary CBFs,
mixed ASCII CIFs with binutf "binary" sections, and NeXus files
(HDF5/HDF5/XML files).  The binutf files use UCS-2/UTF-16 to
carry the binary information of a CBF with only a 7% overhead.
While the pure binary form is better, the binutf form is an
important compromise in working in the XML world, just as the
CIF ASCII form has been an important compromise in working in
the DDL2 CIF world.

-- Herbert

  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769


On Mon, 24 May 2010, James Hester wrote:

> To run through the alternatives and some of the arguments so far:
> (i) treating an embedded BOM as an ordinary character runs against the
> Unicode recommendations.  If we wish our standard to be respected, I think
> we should at least respect other standards and the thinking that has gone
> into them
> (ii) treating an embedded BOM as whitespace is OK with the Unicode standard,
> but means that a non-ASCII character now has syntactic meaning in the CIF. 
> I think this would be completely inconsistent on our part, as an invisible
> character (when displayed) can actually be used to delimit strings.  This is
> my least preferred solution, as it goes against the human-readability
> expected of CIFs
> (iii) ignoring embedded BOMs is bad because they can be a 'tip off to a
> serious problem'.
> (iv) treating embedded BOMs as syntax errors will cause issues when CIF2
> files are naively concatenated
> I think the only viable alternatives are to choose (iii) or (iv).
> So: why exactly is ignoring a BOM a problem?  If the embedded BOM is the
> leading BOM from a UTF16 file that has been naively concatenated, it will
> have bytes 0xFE 0xFF.  This byte sequence (and the reverse) is not
> acceptable UTF8, leading to a decoding error from the UTF8 decoding step. 
> The subsequent bytes will be UTF16, which should cause a decoding failure in
> any case.   So I deduce that we are simply discussing how to treat a UTF8
> BOM, which can only find its way into a CIF file by naive concatenation of
> UTF8-encoded files written by certain programs.
> If the embedded BOM is a UTF-8 BOM, then ignoring it would be OK, as I don't
> see that it is indicative of any problems beyond misguided choice of text
> editor.
> So I would advocate ignoring (and removing) UTF8-BOMs in the input stream,
> and treating all other BOMs as syntax errors.  Individual applications may
> wish to give users the option of interpreting U+FEFF as the deprecated ZWNBSP
> (and translating to the correct character) on the understanding that if this
> occurs outside a delimited string it will cause a syntax error.
> James
> PS am I the only one who thinks it unlikely that Wordpad users would choose
> to use 'cat' to join file fragments together?
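
[Editorial sketch, not part of the original thread.] The policy James
proposes above (drop UTF-8 BOMs, reject every other BOM) could be
sketched roughly as follows. The names decode_cif2 and CifSyntaxError
are hypothetical and do not come from any CIF library. The sketch
assumes the whole file is available as bytes, and it relies on the
fact that the UTF-16 BOM bytes 0xFE 0xFF (or 0xFF 0xFE) are never
valid UTF-8, so a strict UTF-8 decode already rejects them.

```python
class CifSyntaxError(ValueError):
    """Hypothetical error type used only in this sketch."""


def decode_cif2(raw):
    """Decode bytes as UTF-8, dropping UTF-8 BOMs and rejecting other BOMs.

    Assumption: any byte sequence that is not valid UTF-8 (including a
    UTF-16 BOM left behind by naive concatenation) is reported as a
    syntax error rather than repaired.
    """
    try:
        # Strict decoding: 0xFE and 0xFF can never occur in valid UTF-8.
        text = raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        raise CifSyntaxError("input is not valid UTF-8: %s" % exc) from exc
    # A UTF-8 BOM (bytes EF BB BF) decodes to U+FEFF; remove every
    # occurrence, whether it is leading or embedded.
    return text.replace("\ufeff", "")


if __name__ == "__main__":
    part_a = "\ufeffdata_a\n".encode("utf-8")
    part_b = "\ufeffdata_b\n".encode("utf-8")
    # Simulates `cat a.cif b.cif` on two files that each start with a BOM.
    print(repr(decode_cif2(part_a + part_b)))
```

Note that this removes U+FEFF inside delimited strings as well; an
application that wants to offer the ZWNBSP interpretation mentioned
above would need to run after tokenizing rather than before.
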
> On Wed, May 19, 2010 at 3:46 AM, Herbert J. Bernstein
> <yaya@bernstein-plus-sons.com> wrote:
>       Allow me to clarify my position, so there is no misunderstanding:
>       I believe that we will be dealing with a world with at least UTF-8
>       and UCS-2/UTF-16 encodings for many years to come.  I have no
>       objection to CIF2 being specified solely in terms of UTF-8 for
>       simplicity and consistency, but if we are to write software that
>       people can use, we must have a reasonable position with respect
>       to the encodings people use, and that means that, at the very
>       least, we need to accept and process UTF-8 BOMs as harmless
>       additional text.  Some of us will also be supporting UCS-2/UTF-16
>       directly in our applications.  I don't mind if other applications
>       are only going to support UTF-8, but inasmuch as, as long as
>       we have Java and web browsers, we are going to encounter
>       UCS-2/UTF-16, we should do something sensible when a UCS-2/UTF-16
>       BOM pops up, either doing the internal translation if we so
>       choose, or, if that is not handled by a particular application,
>       issuing a polite warning suggesting the use of an external
>       translator.
>       BOMs will almost always appear in modern UCS-2/UTF-16 files, and
>       when they are converted to UTF-8 that will give us yet another
>       source of UTF-8 BOMs.  I believe the sensible thing to do is to
>       recognize BOMs.
>       Regards,
>           Herbert
>       =====================================================
>        Herbert J. Bernstein, Professor of Computer Science
>          Dowling College, Kramer Science Center, KSC 121
>               Idle Hour Blvd, Oakdale, NY, 11769
>                        +1-631-244-3035
>                        yaya@dowling.edu
>       =====================================================
> On Tue, 18 May 2010, Bollinger, John C wrote:
> > Herbert Bernstein wrote:
> >> Let me see if I understand this correctly -- a user takes 2 perfectly
> >> good CIF2 files, edits each to clean up, say, some comments to keep
> >> straight where one begins and one ends, using a well-designed modern
> >> text editor that happens to put a BOM at the start of each file,
> >> concatenates the two files with cat to ship them into the IUCr, and
> >> suddenly they have a syntax error caused by a character that they
> >> cannot see!!!
> >>
> >> To me this seems pointless when it is trivial for software to
> >> recognize the character and handle it sensibly.
> >
> > And that is my principal rationale for preferring that embedded
> > U+FEFF be recognized as CIF whitespace.  With that approach, the
> > concatenation of two well-formed CIF2 files is always a well-formed
> > CIF2 file, regardless of the presence or absence of BOMs in the
> > original files.  Note, too, that such concatenation cannot produce a
> > mixed-encoding file because files encoded in UTF-16[BE|LE],
> > UTF-32[BE|LE], or any other encoding that can be distinguished from
> > UTF-8 are not well-formed CIF2 files to start.  The file concatenation
> > scenario thus does not provide a use case for the CIF2 *specification*
> > to recognize embedded U+FEFF as an encoding marker.
> >
> > On the other hand, I again feel compelled to distinguish program
> > behaviors from the CIF2 format specification.  None of the above would
> > prevent a CIF processor from recognizing and handling CIF-like
> > character streams encoded via schemes other than UTF-8, nor from
> > recognizing embedded U+FEFF code sequences in various encodings as
> > encoding switches, thereby handling mixed-encoding files.  Indeed,
> > such a program or library would be invaluable for correcting
> > encoding-related errors.  That does not, however, mean that such files
> > must be considered well-formed CIF2, no matter how likely they may (or
> > may not) be to arise.
> >
> >
> > James Hester wrote:
> >> I would be happy to call an embedded BOM a syntax error.
> >
> > In light of the possibility of U+FEFF appearing in a data value (for
> > example, from cutting text from a Unicode manuscript and pasting it
> > into a CIF), I need to refine my earlier blanket alternative of
> > treating embedded U+FEFF as a syntax error.  I now think it would be
> > ok to treat U+FEFF as a syntax error *provided* that it appears
> > outside a delimited string.  That's still not my preference, though,
> > and I feel confident that Herb will still disagree.
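
[Editorial sketch, not part of the original thread.] John's refinement
(embedded U+FEFF is an error only outside a delimited string) needs
some notion of where delimited strings begin and end. The sketch below
is a deliberate simplification and not a CIF2 tokenizer: it assumes
that only single-quoted and double-quoted strings exist, and it ignores
comments, semicolon text fields, triple-quoted strings, and the rule
that a quote only opens a string at the start of a token. The function
name embedded_boms_outside_strings is hypothetical.

```python
BOM = "\ufeff"


def embedded_boms_outside_strings(text):
    """Return indices of U+FEFF found outside quoted strings.

    Simplifying assumptions (not the real CIF2 grammar): a string starts
    at any ' or " and ends at the next matching quote character. A BOM
    at index 0 is treated as the ordinary leading BOM and not reported.
    """
    positions = []
    open_quote = None  # the delimiter currently open, or None
    for index, char in enumerate(text):
        if open_quote is None:
            if char in ("'", '"'):
                open_quote = char
            elif char == BOM and index > 0:
                positions.append(index)
        elif char == open_quote:
            open_quote = None
    return positions


if __name__ == "__main__":
    sample = BOM + "data_a\n_x 'a" + BOM + "b'\n" + BOM + "data_b\n"
    # Only the BOM that precedes data_b is outside a string and embedded.
    print(embedded_boms_outside_strings(sample))
```

Under the "syntax error" option, a parser would reject the input when
this list is non-empty; under the "whitespace" option John prefers, it
would treat those positions as token separators instead.
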
> >
> >
> > Regards,
> >
> > John
> > --
> > John C. Bollinger, Ph.D.
> > Computing and X-Ray Scientist
> > Department of Structural Biology
> > St. Jude Children's Research Hospital
> > John.Bollinger@StJude.org
> > (901) 595-3166 [office]
> > www.stjude.org
> >
> >
> >
> > Email Disclaimer:  www.stjude.org/emaildisclaimer
> >
> > _______________________________________________
> > ddlm-group mailing list
> > ddlm-group@iucr.org
> > http://scripts.iucr.org/mailman/listinfo/ddlm-group
> >
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
International Union of Crystallography
