Discussion List Archives


Re: [ddlm-group] Support for legacy files in DDLm

I have checked a typical CIF submission to Acta Cryst. to see if there are any obvious problems in reading these with a DDLm dictionary equipped with aliases.  The only one I have found so far (and this will be very common) is the absence of a list reference in the list of symops.  Originally we assumed they could be indexed according to the order in which they appear in the CIF, and most people still use this convention.  However, a list reference has more recently been defined and is sometimes used.  It is char, though for various default reasons it has more recently been defined as a number.  In Syd's proof-of-concept dictionary he assumes that it is present (it is required in DDLm) and that it is a cardinal number.  In those CIFs where the list reference is explicitly given, it is almost always a number.  I have not seen anything other than a number.  To accommodate defaults, in the latest changes to the CIF (DDL1) dictionary the identity operation is required to be '1' (which it almost always is).

What this means is that reading the symop loop will require the generation of the symop_id in the form of a number if it is not already present.  Since the dictionary will require that it be a cardinal number, the parser would presumably have to generate this code if it were not present, and to change it to a number if it were present but not a number (not a common problem).  Otherwise an error or warning would be needed.  The subsequent manipulation of these numbers should then proceed without problem unless the original CIF carried a non-numeric list reference into the definitions of distances and angles.  In this rare situation the CIF would probably have to be rewritten.
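
A minimal sketch of the id generation described above, written in Python purely for illustration. It assumes the symop loop has already been parsed into a list of dicts, one per row; the key names `id` and `xyz` and the function `normalise_symop_ids` are hypothetical and are not taken from any dictionary or existing parser. Missing ids are generated from row order, numeric strings become integers, and a non-numeric id is replaced by its row position with a warning. The returned map of old to new ids is what would be needed to rewrite any distance or angle entries that refer to the old ids.

```python
def normalise_symop_ids(rows, id_key="id"):
    """Return (new_rows, id_map, warnings) for a parsed symop loop.

    rows   : list of dicts, one per loop row (hypothetical in-memory form)
    id_key : key under which the list reference is stored, if present
    """
    new_rows, id_map, warnings = [], {}, []
    for position, row in enumerate(rows, start=1):
        raw = row.get(id_key)
        if raw is None:
            # No list reference given: fall back to order of appearance.
            new_id = position
        else:
            text = str(raw).strip()
            if text.isascii() and text.isdigit() and int(text) > 0:
                new_id = int(text)
            else:
                # Present but not a cardinal number: renumber and warn.
                new_id = position
                warnings.append(
                    f"non-numeric id {raw!r} replaced by {new_id}"
                )
            id_map[str(raw)] = new_id
        new_rows.append({**row, id_key: new_id})

    # Mixing given and generated ids could collide; report it, do not guess.
    ids = [r[id_key] for r in new_rows]
    if len(set(ids)) != len(ids):
        warnings.append("duplicate ids after normalisation")
    return new_rows, id_map, warnings
```

A file whose loop carries no ids at all would simply come back numbered 1, 2, 3, ... in row order with no warnings, which matches the original ordering convention.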

In addition to unrolled loops, there may occasionally be some looped items that were never intended to be looped.  This might have been adopted to meet a special need, e.g., the description of two different crystals, both of which were used in the experiment.  Technically these would be non-conforming and might have to be rewritten.  Again these would be rare.

David

James Hester wrote:
I've changed the subject as this requires a separate discussion.

On Thu, Apr 15, 2010 at 8:38 PM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:
I would appreciate a clarification of intent for DDL1 and DDL2 data
files in the transition to DDLm: 

1.  Please assume somebody has an existing data file conformant to the current COMCIFS-approved DDL1 dictionaries, esp. the core, what are the specific changes that will be required to those data files for them to be acceptable under the proposed new DDLm conformant dictionaries?

I think David may have some insight into the answer here.  While aliases would cover most cases where a dataname had to be redefined, if that dataname lost or gained loopable status, there may be some conversion work involved.  How likely that scenario is I will leave to those who have actually attempted a DDL1 dictionary rewrite.

2.  Please assume somebody has an existing data file conformant to the current COMCIFS-approved DDL2 dictionaries, esp. mmCIF and imgCIF, what are the specific changes that will be required to those data files for them to be acceptable under the proposed new DDLm conformant dictionaries?

Perhaps Herbert and John would like to have a look at how the category structure of their respective DDL2 dictionaries would need to be fiddled with to turn them into DDLm dictionaries.  My intuition is that any effect on the data file that can't be handled by aliases could only arise out of adjustments in the category-subcategory relationships and category types. If the only resulting difference in the datafile will be in disallowing unlooped one-row loops I will gladly retract my 'order of magnitude' remark.

Answers to these two questions would help to quantify the "order of
magnitude more work" we will have to do as per James' remark:


PDB mmCIF files are not an issue for DDLm *at all*, as the mmCIF data files
are written with respect to the DDL2 specification (not DDLm).  If and when
a DDLm version of mmCIF appears, conversion of legacy files will involve an
order of magnitude more work than just rolling up unrolled loops, so the
outcome of the present discussion will be by comparison background noise.


=====================================================
 Herbert J. Bernstein, Professor of Computer Science
  Dowling College, Kramer Science Center, KSC 121
       Idle Hour Blvd, Oakdale, NY, 11769

                +1-631-244-3035
                yaya@dowling.edu
=====================================================

On Thu, 15 Apr 2010, James Hester wrote:

Herbert, it seems to me that both of your issues are not relevant to this
discussion, in that they refer to situations for which DDLm is not used. 
First, a clarification.  When I talk about a dictionary being 'available' to
a program, I have in mind that it could be available at program writing time
(i.e. available to the programmer) and/or at program running time.  I hope
this corresponds with other people's usage.

On Wed, Apr 14, 2010 at 9:23 PM, Herbert J. Bernstein
<yaya@bernstein-plus-sons.com> wrote:
     Inasmuch as we appear to be discussing, rather than voting,
     please allow me to clarify my position:

     I am _not_ concerned about whether a DDLm-conformant dictionary
     does or does not have rules to say that a particular category is
     or is not allowed to be presented "unrolled".  I am concerned
     with how to handle two important cases:

      1.  Existing legacy data files that have "rolled" or "unrolled"
     loops that do not conform to the new dictionary rules; and


Those legacy data files were written with a particular dictionary in mind. 
If that dictionary DDL allows loop unrolling (i.e. DDL2) then any
application that presumes to read datafiles based on that dictionary will
need to support it.  But what we are discussing is how to specify the
construction of data files written with respect to a DDLm (not DDL2) based
dictionary.  So I don't see how your case (1) is relevant.

      2.  Applications that are confronted with a data file, portions
     of which are not in dictionaries to which that application has
     access.


If an application has no access to the dictionary relevant to a given
dataname, it cannot be compelled to issue an error or warning when
confronted with an unrolled loop, because it has no way of knowing that the
loop is unrolled.   In such a situation it would be bizarre to specify any
dictionary-derived behaviour, and I am not proposing to do so.  Likewise, if
a CIF-writing application has no dictionary information about a dataname
that it is writing, we are unable to impose any dictionary-based
behaviour.   The latter is a fairly 'Alice in Wonderland' situation: a
program writing a dataname that neither it nor the programmer knows
anything about...

     If an application has a dictionary handy and that dictionary
     says something relevant about the rolledness or unrolledness of
     a loop, then I am reluctantly willing to accept the DDLm
     specification requiring the issuance of a warning or an error
     message.  Some application writers may decide not to do that,
     but that is a different discussion.

     What I am concerned about is the very practical issues above --
     of doing something useful with user data that either does not
     conform to this stricture as presented in a dictionary or on
     which the dictionaries available to the application are silent.
      I am proposing that, rather than requiring an application to
     throw up its hands and die, we try to maximize the useful work
     to be accomplished and try to do something sensible with the
     data, i.e. roll that which is unrolled or unroll that which is
     rolled, if it allows the work of the application to get done.


If an application has no access to a dictionary for the datanames, it will
not be able to roll up an unrolled loop, as it won't know what datanames
should be in the loop.  So I would make a counter-suggestion that (in order
to get useful work done), we can help this dictionary-challenged program by
making sure all datafiles that it is presented with have their loop
structures left intact.


     I have yet to hear of a reason not to adopt that approach for
     the cases listed above.  Once we have those two cases settled, I
     would be happy to discuss the subtleties of whether the List
     attribute itself should be modified or not, but first, please,
     let us deal with this practical issue.

     Regards,
        Herbert

     P.S. to Nick:  It is the current DDLm specification that would
     require every application writer to read the dictionary in order
     to process a CIF, else we would have no way to tell whether
     rolled or unrolled presentation was in conformance with the
     dictionary.  The list attribute is in the dictionary, not the
     data file.  The discussion we are having is orthogonal to the
     question of whether the DDLm specification requires the reading
     of the dictionary.


I think this is the wrong way around.  *If* an application writer wants to
see if a key-value dataitem should be instead in a loop, *then* they will
need to read the dictionary.  If they can do useful work without knowing
this information, then I'm not standing in their way.  A program which
claims to validate a data file *cannot* do the work it was designed for
unless it reads the dictionary, and must flag unrolled loops as a violation
of the standard.  It may then offer to roll up the loops, to create a
conformant file.  What is the problem here?
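
The flag-then-roll-up behaviour described above can be sketched as follows, in Python purely for illustration. Everything here is an assumption: the function `roll_up`, the in-memory form of the data block (a dict of key-value items), and the two dictionary-derived inputs (`category_of`, mapping each known dataname to its category, and `list_categories`, the set of categories the dictionary declares as List). Datanames on which the dictionary is silent are left untouched, since without the dictionary there is no way to know which loop they belong to.

```python
from collections import defaultdict


def roll_up(pairs, category_of, list_categories):
    """Gather unrolled List-category items into one-row loops.

    pairs           : dict of dataname -> value found outside any loop
    category_of     : dict of dataname -> category (from the dictionary)
    list_categories : set of category names declared as List

    Returns (remaining_pairs, loops, violations), where loops maps each
    affected category to a list containing its single reconstructed row
    and violations names the categories that were presented unrolled.
    """
    remaining = {}
    grouped = defaultdict(dict)
    for name, value in pairs.items():
        category = category_of.get(name)  # None: dictionary is silent
        if category in list_categories:
            grouped[category][name] = value
        else:
            remaining[name] = value
    loops = {category: [row] for category, row in grouped.items()}
    violations = sorted(grouped)
    return remaining, loops, violations
```

A validating program would report `violations` and could then offer `loops` as the conformant form; a program with an empty `category_of` gets every item back unchanged in `remaining`.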


     P.S. to James:  I have read Nick's argument on the DDLm
     specification issue and stick to voting for 2.  If we change the
     specification then strict adherence will no longer require List
     categories to be presented as looped data, and no more or less
     dictionary reading will be required than is required by the
     current specification, but users will be annoyed by one less
     warning/error message they are not likely to understand or be
     able to do anything about.  However, no matter how that vote
     comes out, we really do need to deal with the practical issue
     above -- there are an awful lot of PDB mmCIF data files.


PDB mmCIF files are not an issue for DDLm *at all*, as the mmCIF data files
are written with respect to the DDL2 specification (not DDLm).  If and when
a DDLm version of mmCIF appears, conversion of legacy files will involve an
order of magnitude more work than just rolling up unrolled loops, so the
outcome of the present discussion will be by comparison background noise.

     =====================================================
      Herbert J. Bernstein, Professor of Computer Science
       Dowling College, Kramer Science Center, KSC 121
            Idle Hour Blvd, Oakdale, NY, 11769

                     +1-631-244-3035
                     yaya@dowling.edu
     =====================================================

On Wed, 14 Apr 2010, Nick Spadaccini wrote:


     This doesn't actually make things clearer or easier James.
     I will repeat it again.

     Strict adherence to the formal specification of DDLm
     REQUIRES List categories to be presented as looped data,
     even if there is only one row.

     If the IUCr wishes to universally adopt the case where
     instances of List categories that contain only one row may
     be presented as a Set category then it can do so as an
     accepted extension. This of course requires every
     application writer to necessarily read the dictionary to
     establish if the data is really a Set category or
     possibly a List category. Once the IUCr adopts this
     extension to DDLm within its implementation, I would
     assume every application writer would be required to
     adhere to it.

     On 14/04/10 3:29 PM, "James Hester" <jamesrhester@gmail.com> wrote:

          Dear all,

          Both John and Herb have come out in favour of allowing one-row
          loops to be unrolled.  Nick and I are both sceptical about the
          value of this idea.  We have a few options:

          1.  Disallow loop unrolling altogether (as in DDL1).
          2.  Allow loop unrolling for all DDLm dictionaries
          3.  Add a category-scope DDLm attribute stating that one-row
              loops in this category and child categories may be
              unrolled.  If it appears in the 'Head' category of the
              dictionary, it would mean that all categories in the
              dictionary could be unrolled.

          We have not discussed option 3: it basically means deferring
          the decision on loop unrolling to the dictionary writers.  It
          also means that programmers of generic CIF software will need
          to be prepared for either behaviour, so in that sense it is
          slightly more burdensome than option 2.

          Unless the silent majority would like to contribute further
          thoughts on this matter, I suggest that we vote and move on.
          I discern that the voting so far would be:

          Option 1: James, Nick
          Option 2: John, Herb
          Option 3: ?

          (Some comments on John's post are inserted below).

          James.
          On Thu, Apr 1, 2010 at 10:41 AM, John Westbrook <jwest@rcsb.rutgers.edu> wrote:
                Hi all,

                Coming in late on this in support of Herb's position.

                I have never understood the necessity of marking a
                category as a 'list' type in the dictionary in the early
                CIF DDL, and in DDLm I find this even more confusing.
                Given that DDLm supports a category key which provides a
                well defined basis for each category, this alone would
                seem to provide the appropriate expression of
                cardinality.


          Absolutely agree, my objection is not to the loss of some
          packet ordering information, this is explicitly excluded from
          the infoset produced by the parser in any case.


                The choice of exporting a category with a single row as
                a collection of keyword-value pairs or using a table
                format via a loop_ seems like a presentation style
                matter rather than a dictionary issue.


          It is more than a presentation issue, as you have lost the
          information that those key-value pairs belong together, and so
          you need to refer to your dictionary to reconstitute them as a
          group.  And if you allow the possibility of unrolling
          single-row loops for all categories, then significant extra
          work is done to check, and if necessary, transform the
          internal representation back to a canonical looped form.  This
          reconstruction of the canonical form is highly desirable in a
          DDLm context, where we often wish to apply dREL operations to
          all packets in a loop.
           

                As Herb has observed, the vast majority of DDL2 files
                opt for key-value output for any category with a single
                row.  I do not see what additional semantics are
                conveyed by regulating the manner of data presentation
                in these cases.


          See above - some semantic information is lost.
           

                John

              


_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group




--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148




begin:vcard
fn:I.David Brown
n:Brown;I.David
org:McMaster University;Brockhouse Institute for Materials Research
adr:;;King St. W;Hamilton;Ontario;L8S 4M1;Canada
email;internet:idbrown@mcmaster.ca
title:Professor Emeritus
tel;work:+905 525 9140 x 24710
tel;fax:+905 521 2773
version:2.1
end:vcard


International Union of Crystallography
