Discussion List Archives

Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics

Dear James,

> "Note that a CIF2-conformant character stream that forms part of a
> larger stream is not constrained to be in UTF8 encoding if the
> encoding of the CIF2 stream is specified in a standards-conformant
> manner within the enclosing stream.  For example, CIF2 content within
> an XML file is not constrained to be UTF8-encoded as standard XML
> attributes can be used to manage encoding."

is almost reasonable, but basically says that it will be easier to
handle CIF2 in almost any external container, rather than as itself.
I would suggest saying:

The description of a conformant CIF2 in terms of a UTF8 encoding
is intended to provide clarity in the description of a CIF2, not
to prevent use of CIF2 in terms of other encodings, such as UCS-2
Unicode or code-page-based encodings needed for editors on
particular systems, nor to prevent use of transformed CIF2 in other
containers such as HDF5 and XML or imgCIF/CBF, as long as the
decodings/encodings or other transformations that would be necessary to go
to and from a UTF8 CIF2 representation are clearly and unambiguously
defined.

This would bring us back essentially to where we have been for more than a 
decade with imgCIF/CBF and for nearly 2 decades with CIF1 itself.
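
As a rough illustration of what "clearly and unambiguously defined"
transformations to and from a UTF8 CIF2 representation could mean in
practice, here is a minimal Python sketch. It is not part of any CIF2
proposal. The function names, the sample content and the choice of codecs
are all assumptions made for the example, and Python's "utf-16-le" codec
stands in for UCS-2 (the two agree for characters in the Basic Multilingual
Plane). An encoding is treated as acceptable only if the content survives
the trip out of UTF8 and back again byte for byte.

```python
def to_utf8_cif2(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes in a declared encoding and re-encode them as UTF-8."""
    text = raw.decode(source_encoding)  # raises UnicodeDecodeError on bad input
    return text.encode("utf-8")


def from_utf8_cif2(utf8_bytes: bytes, target_encoding: str) -> bytes:
    """Re-encode UTF-8 content in another encoding (hypothetical helper)."""
    text = utf8_bytes.decode("utf-8")
    return text.encode(target_encoding)  # raises UnicodeEncodeError if lossy


def round_trips(utf8_bytes: bytes, encoding: str) -> bool:
    """True if UTF-8 -> encoding -> UTF-8 reproduces the original bytes."""
    try:
        other = from_utf8_cif2(utf8_bytes, encoding)
    except UnicodeEncodeError:
        # The target encoding cannot represent some character at all.
        return False
    return to_utf8_cif2(other, encoding) == utf8_bytes


# Made-up sample content containing one character outside Latin-1.
sample = "#\\#CIF_2.0\ndata_example\n_example.note 'α = 90°'\n".encode("utf-8")

for name in ("utf-16-le", "utf-16", "cp1252"):
    print(name, round_trips(sample, name))
```

In this sketch a UTF-16 form round-trips, while a Windows code page such
as cp1252 does so only when the content happens to stay within that code
page's repertoire, which is the kind of condition the wording above would
leave to the definition of each transformation.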

Regards,
   Herbert
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Fri, 10 Sep 2010, James Hester wrote:

> Thanks Herbert for this detailed information, which is a great help to
> me in forming an opinion.  Please understand that we are not even
> close to considering excluding imgCIF from CIF.  Rather, I am
> collecting information in order to form an opinion and work with
> everybody to find a solution which then goes back to the DDLm group
> and then on to COMCIFS regarding CIF2.  Speculation about potential
> consequences for imgCIF are just part of the information-gathering
> process.  In general terms, CIF is now a 'framework', which I think
> will make bringing XML and HDF5 developments under the CIF umbrella
> relatively simple.
>
> Please also understand that my comments about the usefulness of CBFlib
> were in the context of a typical beamline user wishing to handle their
> data, rather than from a programmer's point of view.  I was not
> casting aspersions on CBFlib, rather seeking more information (which
> you have provided).
>
> I am afraid that terminology here may be confusing me: I would like to
> talk about imgCIF as a pure ASCII format (eg IT Vol G p 40 para 15)
> and CBF as the binary equivalent.  However, your previous statements
> indicate that imgCIF could also be written in UTF16 encoding.  So:
> when you speak of the Dectris detector output as 'imgCIF', what
> encoding is used?
>
> The point you make about embedding imgCIF into a text-only format (in
> this case XML) is, I agree, a use-case that we have to consider.  I
> see merit in the position that 'CIF2 content' inside a container is
> not constrained by encoding, in those cases where the container is
> able to specify the encoding itself.  This is *pedantically* true
> already in that the 'header' of the container file as a whole is *not*
> the CIF2 magic header.  So: what does everyone think of the following
> statement being included in the standard?
>
> "Note that a CIF2-conformant character stream that forms part of a
> larger stream is not constrained to be in UTF8 encoding if the
> encoding of the CIF2 stream is specified in a standards-conformant
> manner within the enclosing stream.  For example, CIF2 content within
> an XML file is not constrained to be UTF8-encoded as standard XML
> attributes can be used to manage encoding."
>
> (Perhaps John B, who has shown superior wordsmithing capabilities,
> could polish this up a bit?)
>
> On Fri, Sep 3, 2010 at 11:10 PM, Herbert J. Bernstein
> <yaya@bernstein-plus-sons.com> wrote:
>> Here is more detail on the use of CBFlib.
>>
>> I know for sure that CBFlib is used directly by mosflm and adxv.  While XDS
>> uses code that was prototyped in the Fortran part of CBFlib, they work with
>> their own versions.  However, Kay Diederichs has also used the CBFlib C code
>> for work on simulations.  Paul Ellis started HKL2000 off with CBFlib, but I
>> don't know if they stayed with it.
>>
>> As a practical matter, whether or not someone uses CBFlib itself, it is an
>> essential part of the documentation that people use to understand how the
>> various compression schemes work, and they use the utility cif2cbf from the
>> package both as an external converter and as a validator and as a debugger
>> when they don't want to put all the functionality in their own code.  If you
>> have a funny CBF in any of the semi-infinite number of representations,
>> cif2cbf allows you to check it, get a hex dump of it or convert it to a
>> specific compression scheme or format that some other program needs to
>> process that file.
>>
>> In other words, CBFlib on its own _is_ useful.
>>
>> Sorry about not giving you a list re imgCIF use, I thought you were asking
>> me about CBFlib use -- every beamline that uses a Dectris Pilatus 6M
>> produces imgCIF as the default.  This had been a byte-offset compressed
>> binary with a mini-header.  Dectris has now moved up to writing a full
>> header.  There were some beamlines with some of the older smaller Dectris
>> detectors that were producing TIFF, but all currently delivered Dectris
>> detectors of all sizes produce imgCIF as the default.
>>
>> All the major detector manufacturers now offer CBF as an option except for
>> Bruker which is debugging an optional CBF output.  When I checked at the ACA
>> meeting in July they all also said that their processing packages can accept
>> CBF as an input.
>>
>> On the XML use, I would suggest a more broad-minded attitude. Judging from
>> the workshop I was at in January at ESRF, it has much broader support than
>> just from Diamond, especially for spectra which have smaller data volume
>> than images. HDF5 is the most widely accepted scientific binary data format
>> for the physics community, and XML is the easiest and most reliable way to
>> port smaller HDF5 datasets from site to site. The problem with XML is that
>> for large files such as crystallographic images ordinary straight-text XML
>> produces huge, impractical files.  binutf allows for a compromise in which
>> you have a true XML UCS-2 file but with the binary having only a 7%
>> overhead.
>>
>> I have no choice -- I _will_ (indeed already do) produce CIFs with UCS-2
>> binary sections.  If COMCIFS repeats the unfortunate decision of 1997 of
>> saying that what the synchrotron community needs can't be called CIF, we'll
>> just go back to calling it imgNCIF (which is an acronym for image-not-CIF),
>> but we will still have to produce it for the community. In 1998 after we had
>> a face-to-face discussion at a BNL workshop, that decision was reversed and
>> what the synchrotron community needed was folded under the CIF umbrella, and
>> imgNCIF became imgCIF.  I hope we can have discussions now to avoid the need
>> for a pointless schism.
>>
>> Your proposal on the relationship between CIF2 and imgCIF sounds like a
>> replay of the discussions we had in 1997, with CIF headers following one
>> standard and binary sections following another. You can make that work, but
>> it is clumsy and hard for users to work with.  It is better if we have one
>> simple, comprehensible standard for the files they work with as a whole.
>>
>> Let me be clear -- imgCIF is produced worldwide and used for thousands of
>> images daily.  These older "legacy" imgCIF images will be around for a long
>> time to come, and whatever new imgCIF (or if you force us to it, imgNCIF)
>> images we produce will need to be, and will be, supported by software that
>> handles both the legacy and the new images and has a clean interface to HDF5
>> and XML as well.  I would greatly prefer that this be coordinated with
>> COMCIFS and done in a way that helps the community to understand the
>> relationship between CIF and imgCIF, but if COMCIFS feels a need to return
>> to its 1997 position and exclude the data we work with from its charge, then
>> imgCIF can return to being imgNCIF.
>>
>> If we are to resolve this, then, as in 1998, we need a meeting or e-meeting.
>>  Once you have a web-cam, I would suggest you and I have a skype meeting to
>> frame the issues in dispute and organize a wider meeting.
>>
>> -- Herbert
>>
>> =====================================================
>>  Herbert J. Bernstein, Professor of Computer Science
>>   Dowling College, Kramer Science Center, KSC 121
>>        Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                 +1-631-244-3035
>>                 yaya@dowling.edu
>> =====================================================
>>
>> On Fri, 3 Sep 2010, James Hester wrote:
>>
>>>> On Fri, 3 Sep 2010, James Hester wrote:
>>>>
>>>>> Thanks Herbert for providing the imgCIF perspective.
>>>>>
>>>>> I am unfortunately severely restricted in my ability to attend
>>>>> overseas meetings at present, for family and work reasons.  I am also
>>>>> keen to have our discussions written down and available for perusal by
>>>>> those that will come later.
>>>>
>>>> How about an e-meeting?
>>>
>>> OK, I think we need to try online as my carefully crafted arguments
>>> seem to be misunderstood more often than not.
>>> Let me buy a web cam first!
>>>
>>>>> We need to discuss the relationship of imgCIF to CIF2 explicitly, if
>>>>> imgCIF is going to influence our decisionmaking.  Some questions for
>>>>> Herbert to answer for the record:
>>>>>
>>>>> 1. How widely used are non-CBF forms of imgCIF at present?  By "widely
>>>>> used" I mean both
>>>>>  (a) supported by software packages that allow one to do "useful
>>>>> work", most obviously to extract diffraction spots
>>>>
>>>> I assume by "non-CBF" you mean the forms that do the binary sections
>>>> in something that is not pure binary -- all software that uses CBFlib
>>>> supports them automatically for reading.  For writing, most software
>>>> chooses one representation for writing, usually byte-offset or
>>>> packed binary, except when we have to debug -- then the ascii
>>>> forms, esp. the hexdump form are very useful.
>>>
>>> You are correct in interpreting what I mean by "non-CBF".
>>>
>>> I understand that CBFlib supports everything, but CBFlib on its own is
>>> not useful. Do you know approximately what programs use CBFlib?  I
>>> know only of rasmol, but you presumably know of many more.
>>>
>>>>>  (b) provided as an output format (even optionally) by beamlines or
>>>>> detector manufacturers
>>>
>>>> See above
>>>
>>> I see nothing in your reply on the availability of imgCIF files from
>>> detectors or instruments.
>>>
>>>>> 2. What is the advantage of having "pure text" image files?  Why isn't
>>>>> a format like CBF more appropriate?
>>>>
>>>> While I agree, when we deal with people who like XML e.g. the NeXus
>>>> form of imgCIF, then we have no choice -- no binary is allowed, so
>>>> UCS-2 becomes important.  Don't ask me to defend XML.  It is simply a
>>>> fact of life.
>>>
>>> I am guessing that this NeXuS-XML requirement is coming from Diamond,
>>> and if this is what they want I can see why you are keen to integrate
>>> imgCIF into HDF5, so that HDF5-XML conversion can be carried out the
>>> standard HDF5 way, rather than encapsulating the entire imgCIF file as
>>> a NeXuS-XML dataset.  OK: so apart from this relatively recent and
>>> frankly crazy-weird use case, is there any other use-case for
>>> pure-text imgCIF?  Can we regard the "Diamond" case as a
>>> bureaucratically-driven kluge that will be resolved via your HDF5
>>> work, leaving no other reason to create a space-efficient CIF2 version
>>> of imgCIF?
>>>
>>>>> 3. What is the problem with a scenario where "pure text" imgCIF
>>>>> remains in its current CIF1 form, and CIF2 advances are incorporated
>>>>> into the CIF sections of CBF?
>>>>
>>>> I don't understand this question, nor the assumptions behind it.
>>>
>>> Let me be less obtuse:
>>> I envision a CBF2 format, which is a CBF file with CIF2 instead of
>>> CIF1 syntax.  A corresponding imgCIF2 format exists. We *do not care*
>>> about the space-efficiency of these imgCIF2 files. We recommend that
>>> all new crystallographic image-handling applications should target
>>> CBF2 only, rendering space-efficiency of imgCIF2 files irrelevant.
>>> Legacy applications, of which there are very few, will be restricted
>>> to the original imgCIF, which is very rarely produced in any case
>>> (anticipating your answers to my above questions).
>>>
>>> What are your (Herbert's, anybody else's) thoughts on such a plan?
>>>
>>>>> Herbert: your work merging a DDL2-based version with DDLm-like
>>>>> features in HDF5 format sounds interesting.  Are you planning to
>>>>> present a motivation and/or discussion of this work at some stage?
>>>>
>>>> This is the subject of some grant applications, so not appropriate for
>>>> detailed open discussion in this forum at this time.  The motivations
>>>> are simple -- to satisfy the demands of several major facilities for
>>>> easy integration of crystallographic synchrotron images into HDF5-based
>>>> data management systems while preserving access to metadata, and to
>>>> extend HDF5 with relational meta-data access.  This second aspect is an
>>>> increasingly critical need and will go forward in any case.  If we have
>>>> a meeting or e-meeting, I can explain better.
>>>
>>> OK, I think reading between the lines I see where this is coming from
>>> (read your CACM article as well, BTW).  It'd be good to discuss some
>>> of these plans at some stage.
>>>
>>>>>
>>>>> On Tue, Aug 24, 2010 at 11:31 PM, Herbert J. Bernstein
>>>>> <yaya@bernstein-plus-sons.com> wrote:
>>>>>>
>>>>>> Dear James,
>>>>>>
>>>>>>  I have not been at all reticent -- imgCIF will be very poorly
>>>>>> supported by CIF2 as currently proposed.  Of necessity, imgCIF changes
>>>>>> encodings internally -- that is why it uses MIME -- same problem as
>>>>>> email with images, same solution.
>>>>>>
>>>>>>  Any purely text version has at least a 7% overhead as compared to
>>>>>> pure binary.  Restricting to UTF-8 increases the overhead to at least
>>>>>> 50%.  We may get away with the 7% (UTF-16).  The 50% version (UTF-8)
>>>>>> will be ignored by the community as unworkable.  The most likely to be
>>>>>> used version will be the current DDL2-based version with embedded
>>>>>> compressed binaries that I am augmenting with DDLm-like features
>>>>>> and merging in with HDF5.
>>>>>>
>>>>>>  As I noted many months ago, the unfortunate reality is that the
>>>>>> current CIF2 effort will not merge well with imgCIF.  If avoiding
>>>>>> a split is important -- we need a meeting.  I would suggest
>>>>>> involving Bob Sweet and holding it at BNL in conjunction with
>>>>>> something relevant to NSLS-II.
>>>>>>
>>>>>>  Regards,
>>>>>>    Herbert
>>>>>>
>>>>>> =====================================================
>>>>>>  Herbert J. Bernstein, Professor of Computer Science
>>>>>>   Dowling College, Kramer Science Center, KSC 121
>>>>>>        Idle Hour Blvd, Oakdale, NY, 11769
>>>>>>
>>>>>>                 +1-631-244-3035
>>>>>>                 yaya@dowling.edu
>>>>>> =====================================================
>>>>>>
>>>>>> On Tue, 24 Aug 2010, James Hester wrote:
>>>>>>
>>>>>>> Hi Herbert: regarding imgCIF,  I agree that splitting it off is not a
>>>>>>> desirable outcome.  I would like to get an idea of how well imgCIF can
>>>>>>> be accommodated under the various encoding proposals currently
>>>>>>> floating around, as you have been rather reticent to bring it up.  My
>>>>>>> naive take on things is that a UTF8-only encoding scheme for CIF2
>>>>>>> would not pose significant issues for imgCIF, and a decorated UTF16
>>>>>>> encoding in the style of Scheme B would be even better, and quite
>>>>>>> adequate, so imgCIF is not actually presenting any problems and so was
>>>>>>> a red herring.
>>>>>>>
>>>>>>> I'm not sure that face-to-face or Skype discussions are necessarily
>>>>>>> going to be more productive.  Writing things down, while slower,
>>>>>>> allows me at least to collect my thoughts and those of other
>>>>>>> participants, and hopefully make a reasoned contribution (my apologies
>>>>>>> if I am too long-winded) and as an added bonus those thoughts are
>>>>>>> recorded for later reference.  For example, where would I now find the
>>>>>>> background on why a container format for imgCIF is such a bad idea?
>>>>>>> Presumably that was all thrashed out in face to face discussions, and
>>>>>>> no record now remains.
>>>>>>>
>>>>>>> On Tue, Aug 24, 2010 at 8:56 PM, Herbert J. Bernstein
>>>>>>> <yaya@bernstein-plus-sons.com> wrote:
>>>>>>>>
>>>>>>>> Dear Colleagues,
>>>>>>>>
>>>>>>>>   James' and John's last interchange is so voluminous, I doubt any of
>>>>>>>> us has been able to fully appreciate the rich complexity of ideas
>>>>>>>> contained therein.  For example, one of the suggestions far down in
>>>>>>>> the text is:
>>>>>>>>
>>>>>>>> (James now)  Indeed.  My intent with this specification was to ensure
>>>>>>>> that third parties would be able to recover the encoding. If imgCIF
>>>>>>>> is going to cause us to make such an open-ended specification, it is
>>>>>>>> probably a sign that imgCIF needs to be addressed separately.  For
>>>>>>>> example, should we think about redefining it as a container format,
>>>>>>>> with a CIF header and UTF16 body (but still part of the
>>>>>>>> "Crystallographic Information Framework")?
>>>>>>>>
>>>>>>>> The idea of an imgCIF "header" in CIF format and an image in another
>>>>>>>> is an old, well-established, thoroughly discussed, and mistaken idea,
>>>>>>>> rejected in 1998.  The handling of multiple images in a single file
>>>>>>>> (e.g. a jpeg thumbnail and crystal image and a full-size diffraction
>>>>>>>> image) requires the ability to switch among encodings within the
>>>>>>>> file -- something handled by the current DDL2 and MIME-based imgCIF
>>>>>>>> format and which would be a serious problem in CIF2 as currently
>>>>>>>> proposed, increasing the chances that we will have to move imgCIF
>>>>>>>> entirely into HDF5 and abandon the CIF representation entirely,
>>>>>>>> sharing only the dictionary and not the framework.
>>>>>>>>
>>>>>>>> If you look carefully, you will see a similar trend with mmCIF, in
>>>>>>>> which an XML representation sharing the dictionary plays a much more
>>>>>>>> important role than the CIF format.
>>>>>>>>
>>>>>>>> Is it really desirable to make the new CIF format so rigid and
>>>>>>>> unadaptable that major portions of macromolecular crystallography
>>>>>>>> end up migrating to very different formats, as they already are
>>>>>>>> doing?  Yes, there is great value in having a common dictionary,
>>>>>>>> but would there not be additional value in having a sufficiently
>>>>>>>> flexible common format to allow for more software sharing than
>>>>>>>> we now have?  Is it really desirable for us to continue in the
>>>>>>>> direction of a single macromolecular experiment having to
>>>>>>>> deal with HDF5 and CIF/DDL2/MIME representations of the image data
>>>>>>>> during collection, CCP4-style CIF representations during processing
>>>>>>>> and deposition and legacy PDB and PDBML representations in subsequent
>>>>>>>> community use?  If we could be a little bit more flexible, we might
>>>>>>>> be able to reduce the data interchange software burdens a little.
>>>>>>>> Right now, this discussion seems headed in the direction of simply
>>>>>>>> adding yet another data representation (DDLm/CIF2) to the mix,
>>>>>>>> increasing the chances of mistranslation and confusion, rather
>>>>>>>> than reducing them.
>>>>>>>>
>>>>>>>> Please, step back a bit from the detailed discussion of UTF8 and
>>>>>>>> look at the work-flow of doing and publishing crystallographic
>>>>>>>> experiments and let us try to make a contribution that simplifies
>>>>>>>> it, not one that makes it more complex than it needs to be.
>>>>>>>>
>>>>>>>> I suggest we need to meet and talk, either face-to-face, or by skype.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>   Herbert
>>>>>>>>
>>>>>>>> =====================================================
>>>>>>>>  Herbert J. Bernstein, Professor of Computer Science
>>>>>>>>    Dowling College, Kramer Science Center, KSC 121
>>>>>>>>         Idle Hour Blvd, Oakdale, NY, 11769
>>>>>>>>
>>>>>>>>                  +1-631-244-3035
>>>>>>>>                  yaya@dowling.edu
>>>>>>>> =====================================================
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> cif2-encoding mailing list
>>>>>>>> cif2-encoding@iucr.org
>>>>>>>> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> T +61 (02) 9717 9907
>>>>>>> F +61 (02) 9717 3145
>>>>>>> M +61 (04) 0249 4148
