Discussion List Archives


Re: [ddlm-group] UTF-8 BOM

Dear Colleagues,

   People make CIFs out of pieces joined by cat or editors all the
time.  We cannot tell them that they can only make CIF2s using
a short list of applications, nor can we tell them that they
cannot pick up material from old CIF1s.  In most cases, if we
treat the BOMs reasonably, the concatenated CIFs will make sense --
and probably the sense that the user intended.

   We are going to have to deal with files that will have embedded
BOMs, and treating them as errors when the meaning is obvious
is not a great idea.

   I see no immediate harm in treating an embedded BOM as whitespace,
but also no specific need to do so.  The main thing is not to treat
it as a printing character and not to completely ignore it -- it
can be a tip-off to a serious problem.
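
   (Purely for illustration -- a rough Python sketch of that treatment,
not code from any existing CIF reader, and the names below are made up:)

    import warnings

    BOM = "\ufeff"

    def read_cif_text(path):
        """Read a CIF as UTF-8, tolerating a leading BOM and warning on embedded ones."""
        with open(path, encoding="utf-8") as stream:
            text = stream.read()
        # A BOM at the very start is harmless signature material: strip it.
        if text.startswith(BOM):
            text = text[1:]
        # Embedded BOMs are a tip-off that files were concatenated or re-encoded.
        if BOM in text:
            warnings.warn(f"{path}: embedded U+FEFF found; the file may have "
                          "been assembled from several sources")
            # Treat them as whitespace rather than as printing characters.
            text = text.replace(BOM, " ")
        return text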

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Thu, 13 May 2010, Bollinger, John C wrote:

> I agree with Joe that the best way to deal with character decoding and
> encoding is in the I/O layer, separate from and below the lexical
> analysis.  Previously, however, this group seems to have seen a potential
> performance advantage in deferring decoding until some later point, and
> therefore CIF2 is currently designed with the intention of making that
> possible.  (I may be missing something here, though, as I have had to
> rely on reading the archives after the fact.)  The present CIF2
> specification thus avoids assigning any special syntactic significance
> to any non-ASCII character, such as by designating U+FEFF to be
> whitespace.
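>
> (As an illustration of that layering only -- a hypothetical Python
> sketch, not anything this group has specified -- the lexer never sees
> bytes at all:)
>
>     import io
>
>     def open_cif(path):
>         # Decoding happens here, in the I/O layer; malformed byte
>         # sequences surface before any lexical analysis begins.
>         return io.TextIOWrapper(open(path, "rb"), encoding="utf-8")
>
>     def tokens(stream):
>         # A toy stand-in for a real lexer: it works purely on decoded
>         # characters (str) and knows nothing about the byte encoding.
>         for line in stream:
>             yield from line.split()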
>
> All of us have been in this business long enough, I think, to know how 
> little performance assertions are worth without the support of good 
> performance tests.  Here's one data point, therefore: on my personal 
> workstation, using a well-tuned, pure Java decoder, UTF-8 decoding of a 
> 100 KB html file written in French takes about 0.2 milliseconds. 
> Performance will vary with hardware, file content, and decoder 
> implementation, but based on that model, I feel comfortable asserting 
> that the decoding cost for CIFs should generally be on the order of 
> microseconds per kilobyte.  In absolute terms, then, I don't see much 
> improvement to be had from avoiding some of that cost, even when scaling 
> to thousands of large CIFs.
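>
> (Anyone who wants to repeat that kind of measurement on their own data
> can do so in a few lines; the sketch below is illustrative Python, not
> the Java test described above, and the file name is a placeholder:)
>
>     import time
>
>     def utf8_decode_cost(path, repeats=100):
>         data = open(path, "rb").read()
>         start = time.perf_counter()
>         for _ in range(repeats):
>             data.decode("utf-8")
>         elapsed = (time.perf_counter() - start) / repeats
>         # Average decoding cost, in microseconds per kilobyte of input.
>         return elapsed * 1e6 / (len(data) / 1024)
>
>     # e.g. print(utf8_decode_cost("example.cif"))  # placeholder name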
>
> Another relevant test would be the performance of a byte-based scanner 
> operating on the encoded byte stream, versus a character-based scanner 
> operating on the decoded stream.  This is a more difficult test to 
> perform (fairly), because it requires two separate, yet equivalent 
> scanners.  I'm not prepared to make such a measurement myself, but I am 
> also unwilling to make any assumption about what the results would show. 
> My guess -- not suitable for supporting a design decision -- would be 
> that with appropriate tuning, the two scanners' performance would be 
> quite similar.
>
> Why does all that matter?  Because where there is agreement on desirable 
> CIF2 processor behavior, I would much rather incorporate it into the 
> specification than relegate it to "best practice" status.  This will 
> yield better uniformity and fewer incompatibilities among CIF2 
> implementations, which is why I raised the issue of an initial UTF-8 BOM 
> in the first place.  As for an internal U+FEFF, I now see that it is 
> important to make an explicit decision on its meaning, and to put that, 
> too, in the spec.  I see three viable possibilities:
>
> 1) U+FEFF after the beginning of a CIF is an ordinary character.  This 
> is the current status, and with this interpretation, U+FEFF may appear 
> in data names and whitespace-delimited data values.  Character decoding 
> can be deferred under this option.  This is my least preferred of these 
> options, but it would work.
>
> 2) U+FEFF after the beginning of a CIF is whitespace.  This runs against 
> the principle that non-ASCII characters should not be assigned special 
> syntactic significance, but it would not require giving any such 
> significance to any other character.  It might still be possible to 
> defer character decoding with this option, at the cost of a more 
> complicated lexical scanner.  This would be my preference.  This 
> interpretation yields the behavior most likely to be expected of CIFs 
> formed by naively concatenating multiple well-formed, CIF2-compliant 
> files (but see below).
>
> 3) U+FEFF after the beginning of a CIF is not allowed.  This would be my 
> second choice.  In the event that naive concatenation of CIFs introduced 
> U+FEFF into the body of a CIF, a CIF2-compliant processor would be 
> obliged to treat it as an error (at which point the processor's further 
> behavior would be outside the scope of the CIF specification). It might 
> still be possible to defer character decoding with this option, at the 
> cost of a more complicated lexical scanner.
>
> Options (2) and (3) are both consistent with Unicode's recommendations; 
> option (1) is not.
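>
> (To make the difference between the three options concrete, here is a
> hypothetical classification routine -- illustrative Python only, not
> proposed specification text; the whitespace set is assumed for the
> purpose of the example:)
>
>     ASCII_WS = " \t\r\n"   # space, tab, CR, LF -- assumed for illustration
>     BOM = "\ufeff"
>
>     def classify(ch, option):
>         """Classify a character under options (1), (2) and (3) above."""
>         if ch == BOM:
>             if option == 1:
>                 return "ordinary"      # option (1): an ordinary character
>             if option == 2:
>                 return "whitespace"    # option (2): treated as whitespace
>             raise ValueError("U+FEFF not allowed after start of file")  # (3)
>         return "whitespace" if ch in ASCII_WS else "ordinary"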
>
> Overall, I think the possibility of U+FEFF being introduced into the 
> body of a CIF by naive concatenation of two CIFs is a bit of a straw 
> man, because CIF2 introduces much deeper problems with naive 
> concatenation of CIFs.  Because CIF2 is not 100% backwards compatible, a 
> CIF formed by concatenating a CIF2 file and a CIF1 file might not be 
> compliant with either spec, and if it were compliant, it might not have 
> the same meaning as the two component CIFs taken separately.  Moreover, 
> whether it was compliant, with which spec, and with what meaning could 
> depend on the order in which the files were concatenated.
>
> It would be possible to design CIF2 so that concatenation of two 
> CIF2-compliant files is always well-formed CIF2 (see option (2) above), 
> but that's already a break from naivete, as you have to distinguish 
> between CIF1 and CIF2.  I think it would be better to simply acknowledge 
> that with the introduction of CIF2, it is no longer safe to combine CIFs 
> using CIF-unaware tools.
>
>
> In Summary: CIF2 should explicitly specify a particular meaning for 
> U+FEFF in the body of a CIF, rather than leaving any doubt as to 
> compliant processor behavior in this area.  CIF2 already implicitly 
> provides such a specification, but either the current specification 
> should be made explicit, or an alternative should be chosen and 
> specified.  In evaluating which way to proceed, it should be recognized 
> that quite apart from the handling of U+FEFF, CIF2 already fails to 
> support naively concatenating CIFs.
>
>
> Best Regards,
>
> John
>
> P.S.: My, what a novel!  Sorry for being so long winded!  -- JCB
>
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
>
>
> -----Original Message-----
> From: ddlm-group-bounces@iucr.org [mailto:ddlm-group-bounces@iucr.org] On Behalf Of Joe Krahn
> Sent: Wednesday, May 12, 2010 5:34 PM
> To: Group finalising DDLm and associated dictionaries
> Subject: Re: [ddlm-group] UTF-8 BOM
>
> In general, CIF does not directly deal with encoding. It should be
> possible to allow a low-level I/O library to deal with all encoding
> issues. Therefore, supporting non-standard stream interpretation should
> be avoided.
>
> It is a good practical idea to allow mid-stream BOMs by interpreting
> them as the character U+FEFF, and to allow it as whitespace, with a
> warning.  It should not be a required feature, because it is
> non-standard and only exists in UTF-8 for backwards compatibility.
> Eventually, conforming I/O libraries will interpret them as invalid.
> Ideally, text-concatenation software will become BOM-aware.
>
> Interpreting mid-stream BOMs and allowing mixed encodings is a major
> hack, and is impractical for systems that deal with encoding at the I/O
> level.  It is reasonable to allow it as a non-standard extension, but it
> should always give a warning so that people realize that such files are
> likely to be broken elsewhere.
>
> In summary: Standard CIF2 needs to support standard UTF-8 BOMs at the
> beginning of a file. Anything else should be considered a non-standard
> extension. For practical reasons, CIF2 parsers should be encouraged but
> not required to allow mid-stream UTF-8 0xFEFF as whitespace.
>
> Joe
>
> Herbert J. Bernstein wrote:
>> Dear Colleagues,
>>
>>    While it is certainly prudent to tell people to either write a pure
>> UTF-8 file with no BOM or to prefix it with a BOM, and that is
>> how a compliant CIF writer should work, it is not practical to
>> insist that CIF readers should reject embedded BOMs.  Indeed, the
>> URL cited by John does not tell you they are illegal, but that
>> you should treat them as a zero-width non-breaking space.
>>
>>    The reason we cannot insist on readers demanding that BOMs occur
>> at the beginning is that users may concatenate whole CIFs or
>> build one CIF out of fragments of text, and this will very likely
>> result in embedded BOMs and possibly switches in encodings.  If
>> we fail to handle the BOMs, we are much more likely to garble
>> such files.  I strongly recommend the approach in my prior
>> message -- recognize BOMs at all times.
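>>
>>    (A concrete illustration of how that happens -- the file names and
>> contents below are placeholders, and the Python is only a demonstration,
>> not part of any CIF tool:)
>>
>>     from pathlib import Path
>>
>>     BOM = "\ufeff"
>>     # Each writer legitimately prefixes its own output with a BOM
>>     # (the "utf-8-sig" codec does exactly that).
>>     Path("a.cif").write_text("data_a\n_example.item 1\n", encoding="utf-8-sig")
>>     Path("b.cif").write_text("data_b\n_example.item 2\n", encoding="utf-8-sig")
>>
>>     # Naive byte-level concatenation (what `cat a.cif b.cif` does) then
>>     # leaves a second, embedded U+FEFF immediately before data_b.
>>     combined = Path("a.cif").read_bytes() + Path("b.cif").read_bytes()
>>     assert combined.count(BOM.encode("utf-8")) == 2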
>>
>>    Regards,
>>      Herbert
>>
>> =====================================================
>>   Herbert J. Bernstein, Professor of Computer Science
>>     Dowling College, Kramer Science Center, KSC 121
>>          Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                   +1-631-244-3035
>>                   yaya@dowling.edu
>> =====================================================
>>
>> On Tue, 11 May 2010, Bollinger, John C wrote:
>>
>>> Dear Colleagues,
>>>
>>> I think CIF processor behavior such as Herb describes would be
>>> outstanding, and I commend Herb for his dedication to providing such
>>> capable and robust software.  I do disagree about one of his specific
>>> points, however:
>>>
>>>> The
>>>> minimum to do with any BOM is:
>>> [...]
>>>
>>>>   1.  Accept it at any point in a character stream.
>>> It would be both unconventional and programmatically inconvenient to
>>> give special significance to U+FEFF anywhere other than at the very
>>> beginning of a file.  The Unicode consortium in fact addresses this exact
>>> question in its FAQ: http://www.unicode.org/faq/utf_bom.html#bom6.
>>> Although the Unicode consortium's comments do allow for protocol-specific
>>> support for accepting U+FEFF as a BOM other than at the beginning of the
>>> stream,
>>> I see little advantage to adding such a complication to the CIF2
>>> specifications.
>>>
>>> This all expands the scope of the topic far beyond what I had intended,
>>> however.  I think it is perhaps useful to recognize at this point,
>>> therefore, that the CIF2 language specification and the behavior of CIF2
>>> processors are separate questions.  This group has already decided that
>>> files compliant with CIF 2.0 are encoded in UTF-8, period.  I do not want
>>> to reopen that debate.  On the other hand, that in no way prevents CIF
>>> processors from -- as an extension -- recognizing and handling putative
>>> CIFs that violate the spec by employing character encodings different
>>> from UTF-8.  That sort of thing is generally heralded as beneficial for
>>> ease of use, and it is consistent with the good design principle of being
>>> relaxed about inputs but strict about outputs.  (And in that vein I would
>>> hope that any CIF 2.0 writer's normal behavior would be to encode in
>>> UTF-8.)
>>>
>>> My suggestion is slightly different, as I hope this restatement will
>>> show: *in light of the fact that spec-compliant CIF2 files are encoded in
>>> UTF-8*, I suggest that the spec allow a file beginning with a UTF-8 BOM
>>> to be spec-compliant (subject to the compliance of the rest of the
>>> contents).  Like Herb, I intend that my parsers will accept such CIFs
>>> whether they strictly comply with the spec or not, but the question is
>>> whether accepting such files should be a compliance requirement or an
>>> extension.  Either way, I think it will be valuable to document this
>>> decision in the spec, if only to draw attention to the issue.
>>>
>>>
>>> Best Regards,
>>>
>>> John
>>> --
>>> John C. Bollinger, Ph.D.
>>> Department of Structural Biology
>>> St. Jude Children's Research Hospital
>>>
>>>
>>> Email Disclaimer:  www.stjude.org/emaildisclaimer
>>>
>>> _______________________________________________
>>> ddlm-group mailing list
>>> ddlm-group@iucr.org
>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>
>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
>
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
