Discussion List Archives

Re: the dictionary merging protocol

  • Subject: Re: the dictionary merging protocol
  • From: Brian McMahon <bm@xxxxxxxx>
  • Date: Tue, 16 Jul 2002 15:25:29 +0100 (BST)
Hi Doug

> I hope it is okay to make a few comments here about the dictionary 
> overlay protocol as documented here: 

I'm very happy that the community is discussing this proposal here. Although
it has been approved by COMCIFS, I see it as still very much a
pen-and-paper description, and I'd be much happier to see it tested in an
implementation. If anyone on the list has a working implementation (or is
interested in writing one) I'd be very much interested to hear about it.

> I hope we can draw a distinction between "valid" and "conformant"
> with respect to the encouraged CIF data_block tags:   
>  ...
> My understanding/definition of conformant is 100% or nothing. The slightest 
> discrepancy at all means it is no longer conformant.
> With this definition the _audit tags above seem mislabeled, but I 
> will continue here assuming the intended meaning is "valid".

OK, we may need to work on a precise and consistent terminology. Perhaps
the current core definition for the category:

               Data items in the AUDIT_CONFORM category describe the
               dictionary versions against which the data names appearing in
               the current data block are conformant.

would be better recast as:
 
               Data items in the AUDIT_CONFORM category describe the
               dictionary versions against which the current data block 
               claims to be conformant.

Individual data items may be validated against their matching dictionary
definitions; the data block as a whole is conformant if all values for
which there is a dictionary definition are valid according to that
definition.

Notice that there are levels of "validity": a _cell_length_a value of
12.763(1) may be "valid" in the sense that it has a numeric value in the
permitted range; but it may be invalid inasmuch as there is a discrepancy
between its value and a cell volume determined from it. So there is
consistency checking to be done. But it may also be that the numbers are
all consistent, yet just plain wrong - a better experiment finds a rather
different value. So "validation" is performed against particular localised 
criteria.

For our current purposes I'm defining valid as obeying the constraints and 
relationships explicitly stated in the dictionary. (So the current core
dictionary can't catch the cell length/volume discrepancy because that's
not stated, at least in machine-readable form; but the work of the Perth
group will make that an achievable goal in future dictionaries.)
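
To make this concrete, here is a rough Python sketch of the kind of
check I have in mind. The regular expression, the function names and
the "positive definite" range are illustrative simplifications of my
own, not anything taken from the dictionary, and exponential notation
and other CIF number forms are ignored for brevity:

    import re

    # Illustrative sketch only: parse a CIF numeric value carrying a
    # standard uncertainty in parentheses, e.g. "12.763(1)", and then
    # apply the kind of range constraint a dictionary might state.
    NUMERIC = re.compile(r'^([+-]?\d+(?:\.(\d+))?)(?:\((\d+)\))?$')

    def parse_numeric(value):
        """Return (number, esd) or None if the value is not numeric."""
        m = NUMERIC.match(value)
        if m is None:
            return None
        number = float(m.group(1))
        decimals = len(m.group(2) or '')   # esd scales with the last digit
        esd = int(m.group(3)) * 10 ** -decimals if m.group(3) else None
        return number, esd

    def valid_cell_length(value):
        """Positive definite, as the (hypothetical) definition demands."""
        parsed = parse_numeric(value)
        return parsed is not None and parsed[0] > 0.0

    print(parse_numeric('12.763(1)'))      # (12.763, 0.001)
    print(valid_cell_length('12.763(1)'))  # True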

The fact that CIF has always allowed private data items means that a data
file can always contain items not in a dictionary, so I allow the notion
of "conformance" so long as there are no demonstrably invalid data values,
even though no conclusion can be drawn about items in the file that have
no matching dictionary definitions.
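
In code, the conformance rule might read as follows - a sketch under
the same assumptions, with data blocks and dictionaries reduced to
plain mappings, and reusing parse_numeric from the sketch above:

    # A block is conformant if every item that has a matching
    # dictionary definition validates against it; items with no
    # definition (e.g. private _xtal_ names) yield no conclusion
    # and do not break conformance.
    def block_conformant(block, dictionary, validate):
        for name, value in block.items():
            definition = dictionary.get(name)
            if definition is None:
                continue                  # private item: no conclusion
            if not validate(value, definition):
                return False              # demonstrably invalid value
        return True

    block = {'_cell_length_a': '12.763(1)', '_xtal_local_flag': 'yes'}
    core = {'_cell_length_a': {'type': 'numb'}}
    print(block_conformant(block, core,
                           lambda value, defn:
                               parse_numeric(value) is not None))  # True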

> From the point of view of CIF validation, the proposed dictionary merging 
> protocol looks functional enough. But the protocol itself seems to be 
> a set of externally based informal rules designed to be hard coded 
> into validation software. The commands for specifying how to create/
> assemble a dictionary to which a given CIF data block may or may not 
> be conformant (even though it may be valid) are actually embedded in 
> the CIF, or passed to the validation software as arguments.  
> 
> There is no support therein for fine-grained control over how
> individual data items and/or category classes may be totally replaced
> by, or appended to from, the separate disparate dictionaries.
> It is an all or nothing approach.
> 
> The currently envisaged dictionary construction mechanism does not
> yet permit specification of such PREPEND, APPEND, REPLACE modification
> attributes in the CIF data_block itself, so there is no way to retain
> this information across dictionary reconstruction invocations.

That is a fair criticism. I thought about how to carry along the
modification attributes in the CIF, but considered that would produce
a much greater overhead in both the writing and reading phases to get it
right. If people think it's useful, I'm willing to revisit the
possibility.

> The recent discussion of the CIF specification indicates that in CIF1.1
> dictionary style save_ frames will be permitted in purely data CIFs
> opening up the possibility of combined dictionaries and data. 
> I am not sure if this is the direction things are intended to go
> but it seems to me to be tooooo flexible for something that is 
> supposed to be a purely data archival format.

One doesn't even need save_ frames in DDL1 applications because the
dictionary definitions there live in data blocks. I see an analogy here
with SGML, where the DTD is usually an external file but can be carried
along (or modifications to a library DTD can be carried along) within the
file. My feeling is that the community doesn't want to travel in that
direction; this was reflected in the explicit definitions of "data file"
and "dictionary file" in paras 2.2 and 2.3 of the specification documents,
and in the classification of dictionaries as "external reference files" in
para 3 of the semantics document.

> It also seems counterproductive
> to the overall scheme of standardization because basically any
> CIF can create any dictionary it likes and say "hey, I am valid
> against this" (even if it doesn't conform).

Yes, there is a danger in that, but there is also advantage. For most
physical quantities, the CIF core dictionary is permissive in what it
considers "valid" - usually anything positive definite is allowed. But for 
the editorial purposes of Acta Cryst., certain ranges of values might be
excluded, while another journal might insist on a different range. The
proposed approach allows each journal to layer its own restrictive ranges 
on top of what is in the core. That's also an argument against building
too much specificity about validation criteria into the metadata carried
along within the data block: different external criteria might be applied
for different purposes.
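
By way of illustration only - both ranges below are invented, taken
from no dictionary or journal policy - the layering would behave like
this:

    # The core is permissive (positive definite); a journal overlay
    # narrows the range for its own editorial purposes.
    core_range = (0.0, None)         # core: anything > 0 is valid
    journal_range = (2.0, 1000.0)    # hypothetical journal policy

    def in_range(value, bounds):
        lo, hi = bounds
        return (lo is None or value > lo) and (hi is None or value <= hi)

    a = 1.2                            # a suspiciously short cell length
    print(in_range(a, core_range))     # True: valid against the core
    print(in_range(a, journal_range))  # False: rejected by the journal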

So why is my dictionary "better" than yours? So long as the dictionaries
are retrievable for inspection, they can be compared and criticised by
independent reviewers. It would be expected (or at least hoped) that the
dictionaries sanctioned by the IUCr would carry, if you will, a higher
level of trust than others, but the essence of the matter is to ensure
that the dictionaries are public and open to independent review.

So, to revisit your question of what were the guiding principles behind
this proposal, they included the following considerations.

1. Multiple dictionaries already exist (core, powder, msCIF, mmCIF and
others). It's important to have a way of addressing the several
dictionaries that might contribute to a data file. Of course, everything
could be brought into a single increasingly large dictionary, but keeping
them separate facilitates distributed authorship and management. Not in
itself a compelling argument perhaps, but a very useful thing to have in
practice.

2. Dictionaries of private data names can be constructed and employed for
validation in the same way as in the public arena. So if your local
archive files for Xtal have lots of _xtal_ data names you can in principle 
validate them with off-the-shelf software, without needing to add your
private data names to the public dictionary.

Both of these represent a sort of horizontal integration.

3. The desire to overwrite particular attributes in a public dictionary
for more specific validation purposes is addressed by the overlay mode.
If the previous cases were "horizontal", this is more of a "vertical"
integration (a rough sketch of both directions follows below).
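
As a rough sketch of the two directions, under semantics I am assuming
for the purpose of illustration (not the protocol text itself), with
dictionaries reduced to mappings from data names to attribute sets:

    # "Horizontal": combine separate dictionaries covering different
    # data names; a clash of definitions is treated as an error.
    def merge(*dictionaries):
        result = {}
        for d in dictionaries:
            for name, attrs in d.items():
                if name in result and result[name] != attrs:
                    raise ValueError('conflicting definitions for ' + name)
                result[name] = dict(attrs)
        return result

    # "Vertical": an overlay replaces individual attributes of an
    # existing definition, leaving everything else intact.
    def overlay(base, patch):
        result = {name: dict(attrs) for name, attrs in base.items()}
        for name, attrs in patch.items():
            result.setdefault(name, {}).update(attrs)
        return result

    core = {'_cell_length_a': {'type': 'numb', 'range': (0.0, None)}}
    journal = {'_cell_length_a': {'range': (2.0, 1000.0)}}
    print(overlay(core, journal))
    # {'_cell_length_a': {'type': 'numb', 'range': (2.0, 1000.0)}}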

Of course Acta Cryst. could write its own validation routines to satisfy
the Notes for Authors (and of course it has); but it seems attractive to
be able to carry out much of the validation using generic dictionary-based
tools. And it seems attractive to be able to overlay a small change, such
as modification of a single enumeration range, rather than to have to
make a complete copy of the official dictionary.

One thing to consider, of course, is that the generic "off-the-shelf"
validators I envisage will need to interact sensibly with specific
applications, and one might need to think about what types of error codes
or return values the validator should produce when invalid cases are
found. Perhaps numeric codes defined in a standard header file with
symbolic names like NON_NUMERIC, OUT_OF_RANGE, ILLEGAL_CODE ... ?
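
Something along those lines might look as follows; the symbolic names
come from the previous sentence, but the numeric values and the shape
of the interface are pure assumption on my part:

    from enum import Enum

    class ValidationError(Enum):
        NON_NUMERIC = 1     # value could not be parsed as a number
        OUT_OF_RANGE = 2    # numeric value outside the permitted range
        ILLEGAL_CODE = 3    # value not in the enumerated set of codes

    def report(data_name, error):
        """A validator might return such (name, code) pairs."""
        return '%s: %s' % (data_name, error.name)

    print(report('_cell_length_a', ValidationError.OUT_OF_RANGE))
    # _cell_length_a: OUT_OF_RANGE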

Of course now we are talking at an implementation level, and it's a topic
that is also relevant to the current thoughts about the syntax specification.
How should CIF parsers handle exceptions?

That's all I have time for at the moment, but there are other interesting
thoughts in Doug's messages that I would like to follow up later. It would 
also be interesting to hear other views from the community about the
perceived usefulness or otherwise of this protocol. Since we have survived 
for a decade without it, it may not be of critical importance. On the
other hand, I see it as potentially having substantial impact on the way
we develop and use dictionaries in the future.

Best wishes
Brian
