
[ddlm-group] imgCIF versus CIF2

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: [ddlm-group] imgCIF versus CIF2
From: "Herbert J. Bernstein" <[email protected]>
Date: Mon, 24 May 2010 12:08:17 -0400 (EDT)
In-Reply-To: <[email protected]>
References: <8F77913624F7524AACD2A92EAF3BFA54165DF337D5@SJMEMXMBS11.stjude.sjcrh.local><[email protected]><[email protected]><8F77913624F7524AACD2A92EAF3BFA54165DF337DB@SJMEMXMBS11.stjude.sjcrh.local><[email protected]><8F77913624F7524AACD2A92EAF3BFA54165DF337DD@SJMEMXMBS11.stjude.sjcrh.local><[email protected]><[email protected]><8F77913624F7524AACD2A92EAF3BFA54165DF337E1@SJMEMXMBS11.stjude.sjcrh.local><[email protected]><[email protected]>

Allow me to clarify the relationship between imgCIF/CBF and CIF2.

imgCIF is an add-on to mmCIF for synchrotron images.  It exists
in multiple representations:  pure ASCII CIFs, pure binary CBFs,
mixed ASCII CIFs with binutf "binary" sections, and NeXus files
(HDF4/HDF5/XML files).  The binutf files use UCS-2/UTF-16 to
carry the binary information of a CBF with only a 7% overhead.
While the pure binary form is better, the bin-utf form is an
important compromise for working in the XML world, just as the
CIF ASCII form has been an important compromise for working in
the DDL2 CIF world.

-- Herbert


=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  [email protected]
=====================================================

On Mon, 24 May 2010, James Hester wrote:

> To run through the alternatives and some of the arguments so far:
> 
> (i) treating an embedded BOM as an ordinary character runs against the
> Unicode recommendations.  If we wish our standard to be respected, I think
> we should at least respect other standards and the thinking that has gone
> into them
> 
> (ii) treating an embedded BOM as whitespace is OK with the Unicode standard,
> but means that a non-ASCII character now has syntactic meaning in the CIF.
> I think this would be completely inconsistent on our part, as an invisible
> character (when displayed) can actually be used to delimit strings.  This is
> my least preferred solution, as it goes against the human-readability
> expected of CIFs
> 
> (iii) ignoring embedded BOMs is bad because they can be a 'tip off to a
> serious problem'.
> 
> (iv) treating embedded BOMs as syntax errors will cause issues when CIF2
> files are naively concatenated
> 
> I think the only viable alternatives are to choose (iii) or (iv).
> 
> So: why exactly is ignoring a BOM a problem?  If the embedded BOM is the
> leading BOM from a UTF16 file that has been naively concatenated, it will
> have bytes 0xFE 0xFF.  This byte sequence (and the reverse) is not
> acceptable UTF8, leading to a decoding error from the UTF8 decoding step.
> The subsequent bytes will be UTF16, which should cause a decoding failure in
> any case.  So I deduce that we are simply discussing how to treat a UTF8
> BOM, which can only find its way into a CIF file by naive concatenation of
> UTF8-encoded files written by certain programs.
> 
> If the embedded BOM is a UTF-8 BOM, then ignoring it would be OK, as I don't
> see that it is indicative of any problems beyond misguided choice of text
> editor.
> 
> So I would advocate ignoring (and removing) UTF8-BOMs in the input stream,
> and treating all other BOMs as syntax errors.  Individual applications may
> wish to give users the option of interpreting U+FEFF as the deprecated ZWNBSP
> (and translating to the correct character) on the understanding that if this
> occurs outside a delimited string it will cause a syntax error.
> 
> James
> 
> PS am I the only one who thinks it unlikely that Wordpad users would choose
> to use 'cat' to join file fragments together?
> 
> On Wed, May 19, 2010 at 3:46 AM, Herbert J. Bernstein
> <[email protected]> wrote:
>       Allow me to clarify my position, so there is no
>       misunderstanding:
>
>       I believe that we will be dealing with a world with at least
>       UTF-8 and UCS-2/UTF-16 encodings for many years to come.  I
>       have no objection to CIF2 being specified solely in terms of
>       UTF-8 for simplicity and consistency, but if we are to write
>       software that people can use, we must have a reasonable
>       position with respect to the encodings people use, and that
>       means that, at the very least, we need to accept and process
>       UTF-8 BOMs as harmless additional text.  Some of us will also
>       be supporting UCS-2/UTF-16 directly in our applications.  I
>       don't mind if other applications are only going to support
>       UTF-8, but inasmuch as, as long as we have Java and web
>       browsers, we are going to encounter UCS-2/UTF-16, we should do
>       something sensible when a UCS-2/UTF-16 BOM pops up, either
>       doing the internal translation if we so choose, or, if that is
>       not handled by a particular application, issuing a polite
>       warning suggesting the use of an external translator if the
>       application does not wish to handle UCS-2/UTF-16.
>
>       BOMs will almost always appear in modern UCS-2/UTF-16 files,
>       and when they are converted to UTF-8 that will give us yet
>       another source of UTF-8 BOMs.  I believe the sensible thing to
>       do is to recognize BOMs.
>
>       Regards,
>           Herbert
>       =====================================================
>        Herbert J. Bernstein, Professor of Computer Science
>          Dowling College, Kramer Science Center, KSC 121
>               Idle Hour Blvd, Oakdale, NY, 11769
>
>                        +1-631-244-3035
>                        [email protected]
>       =====================================================
> 
> On Tue, 18 May 2010, Bollinger, John C wrote:
> 
> > Herbert Bernstein wrote:
> >> Let me see if I understand this correctly -- a user takes 2 perfectly good
> >> CIF2 files, edits each to clean up, say, some comments to keep straight where
> >> one begins and one ends, using a well-designed modern text editor that
> >> happens to put a BOM at the start of each file, concatenates the two files
> >> with cat to ship them into the IUCr, and suddenly they have a syntax error
> >> caused by a character that they cannot see!!!
> >>
> >> To me this seems pointless when it is trivial for software to recognize the
> >> character and handle it sensibly.
> >
> > And that is my principal rationale for preferring that embedded
> > U+FEFF be recognized as CIF whitespace.  With that approach, the
> > concatenation of two well-formed CIF2 files is always a well-formed
> > CIF2 file, regardless of the presence or absence of BOMs in the
> > original files.  Note, too, that such concatenation cannot produce a
> > mixed-encoding file because files encoded in UTF-16[BE|LE],
> > UTF-32[BE|LE], or any other encoding that can be distinguished from
> > UTF-8 are not well-formed CIF2 files to start.  The file concatenation
> > scenario thus does not provide a use case for the CIF2 *specification*
> > to recognize embedded U+FEFF as an encoding marker.
> >
> > On the other hand, I again feel compelled to distinguish program
> > behaviors from the CIF2 format specification.  None of the above would
> > prevent a CIF processor from recognizing and handling CIF-like
> > character streams encoded via schemes other than UTF-8, nor from
> > recognizing embedded U+FEFF code sequences in various encodings as
> > encoding switches, thereby handling mixed-encoding files.  Indeed,
> > such a program or library would be invaluable for correcting
> > encoding-related errors.  That does not, however, mean that such files
> > must be considered well-formed CIF2, no matter how likely they may (or
> > may not) be to arise.
> >
> >
> > James Hester wrote:
> >> I would be happy to call an embedded BOM a syntax error.
> >
> > In light of the possibility of U+FEFF appearing in a data value (for
> > example, from cutting text from a Unicode manuscript and pasting it
> > into a CIF), I need to refine my earlier blanket alternative of
> > treating embedded U+FEFF as a syntax error.  I now think it would be
> > ok to treat U+FEFF as a syntax error *provided* that it appears
> > outside a delimited string.  That's still not my preference, though,
> > and I feel confident that Herb will still disagree.
> >
> >
> > Regards,
> >
> > John
> > --
> > John C. Bollinger, Ph.D.
> > Computing and X-Ray Scientist
> > Department of Structural Biology
> > St. Jude Children's Research Hospital
> > [email protected]
> > (901) 595-3166 [office]
> > www.stjude.org
> >
> >
> >
> > Email Disclaimer:  www.stjude.org/emaildisclaimer
> >
> > _______________________________________________
> > ddlm-group mailing list
> > [email protected]
> > http://scripts.iucr.org/mailman/listinfo/ddlm-group
> >
> 
> 
> 
> 
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> 
>
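
As an illustration of the option James Hester describes in the quoted
message above (ignore and remove UTF-8 BOMs, treat all other BOMs as
errors), here is a minimal Python sketch.  The names decode_cif2 and
CIFSyntaxError are hypothetical and do not come from any CIF library.
The sketch relies on the observation in that message that the bytes of
a UTF-16 BOM cannot occur in valid UTF-8, so a strict UTF-8 decode
already rejects them.

```python
import codecs


class CIFSyntaxError(ValueError):
    """Hypothetical error type used only in this sketch."""


def decode_cif2(raw: bytes) -> str:
    """Decode bytes as UTF-8 and drop every U+FEFF (leading or embedded).

    Assumption: input that is not valid UTF-8 is reported as a syntax
    error.  This covers UTF-16 BOMs, because 0xFE and 0xFF are never
    legal bytes in a UTF-8 stream.
    """
    try:
        text = raw.decode("utf-8")  # strict decoding, no BOM handling
    except UnicodeDecodeError as exc:
        raise CIFSyntaxError(f"input is not valid UTF-8: {exc}") from exc
    # A UTF-8 BOM decodes to U+FEFF; remove it wherever it appears.
    return text.replace("\ufeff", "")


# Two BOM-prefixed UTF-8 files joined naively, as with 'cat'.
part_a = codecs.BOM_UTF8 + b"data_a\n"
part_b = codecs.BOM_UTF8 + b"data_b\n"
joined = decode_cif2(part_a + part_b)
```

Note that this sketch also strips U+FEFF inside delimited strings.  An
application offering the ZWNBSP interpretation mentioned in the quoted
message would need to make that decision in the parser, not in the
decoding step.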

