Re: Important CIF items for discussion

It may be useful to consider the relationship between SGML document type
definitions (DTDs) and CIF virtual dictionaries.

In SGML, which is an electronic publishing markup language, a document
must refer to the DTD that describes its structure. The DTD is itself
written in SGML. It may contain descriptions of the document
structure and/or references to other DTDs. An SGML parser reads an
SGML file, dereferences the calls to DTDs, and assembles a document
structural model against which the remaining contents of the file are
validated. (The "document structure" comprises valid tags, order in
which they may appear or be nested, allowed data types, and perhaps
discrete permitted values. There are many analogies with CIF/DDL.)
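
The dereference-and-assemble step described above can be sketched in a few
lines. This is only an illustration, not how any real SGML or DDLm parser is
written: the in-memory REPOSITORY, the dictionary names "core" and "powder",
the item names, and the functions assemble() and validate() are all
hypothetical, and "structure" is reduced here to a flat mapping from item name
to expected Python type.

```python
from typing import Any, Dict, List, Optional, Set

# Hypothetical repository: each dictionary lists the dictionaries it imports
# and the items it defines itself (item name -> expected Python type).
REPOSITORY: Dict[str, Dict[str, Any]] = {
    "core": {
        "imports": [],
        "items": {"cell_length_a": float, "space_group_name": str},
    },
    "powder": {
        "imports": ["core"],
        "items": {"pd_wavelength": float},
    },
}


def assemble(name: str, repository: Dict[str, Dict[str, Any]],
             seen: Optional[Set[str]] = None) -> Dict[str, type]:
    """Recursively merge a dictionary with everything it imports."""
    if seen is None:
        seen = set()
    if name in seen:  # guard against circular imports
        return {}
    seen.add(name)
    entry = repository[name]
    model: Dict[str, type] = {}
    for imported in entry["imports"]:
        model.update(assemble(imported, repository, seen))
    model.update(entry["items"])  # local definitions override imported ones
    return model


def validate(data: Dict[str, Any], model: Dict[str, type]) -> List[str]:
    """Check every item in the data against the assembled model."""
    errors: List[str] = []
    for tag, value in data.items():
        if tag not in model:
            errors.append(f"unknown item: {tag}")
        elif not isinstance(value, model[tag]):
            errors.append(
                f"wrong type for {tag}: expected {model[tag].__name__}")
    return errors
```

Calling assemble("powder", REPOSITORY) yields a single model containing the
items of both dictionaries, and validate() then reports any item that is
unknown to that model or has the wrong type.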

The SGML standard allows a DTD to be an external file (or set of files
imported and assembled recursively), or an integral part of the SGML
document. Again, the parallels with the DDLm dictionary being external
to, or integral with, the CIF are clear.

In practice, organizations managing SGML documents maintain a central
repository of standard DTDs, and the individual documents reference
them rather than import them physically. This allows the physical size
of the documents to be more compact. [Compare CIF, where a typical
data file for a structure is a few tens of kB, while the core
dictionary, even without methods, is nearly half a megabyte; the PDB
exchange dictionary is over 3 MB.]

For SGML, the location of a DTD is usually specified through a registry
of local directories and files rather than by URL (the SGML standard
predates the Web). This reduces the portability of SGML, although it
is considered adequate to allow, say, a typesetter and a publisher to
interact; both organisations invest so much in their activities that
the effort to create matching registries is considered acceptable.

If one wishes to transfer an SGML document to someone who does not
have access to the same registry, or if one wishes to ensure that a
single file can be archived without dependency on external resources,
then the full DTD can be inserted into the document instance.

Technically, then, we should permit both possibilities for DDLm
dictionaries. However, we might take the view that stable and reliable
URLs make it easier to provide a common "registry" accessible by all
users, and so implement URL-based references to DDLm dictionaries as
standard working practice. The IUCr would be reasonably happy to host
copies of specialist or private dictionaries, so long as they met
certain quality criteria.
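
One way to permit both possibilities while treating URLs as the normal case is
a resolver that tries the most self-contained source first. The sketch below
is a guess at what such a lookup might look like, not a description of any
existing CIF software: resolve_dictionary(), its parameters, and the example
URL are assumptions made for illustration, and the file and network access
are passed in as functions so that the lookup order can be tried offline.

```python
import urllib.request
from typing import Callable, Dict


def fetch_with_urllib(url: str) -> str:
    """Default network fetch (standard library only)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def read_local_file(path: str) -> str:
    """Default reader for entries in a local registry."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def resolve_dictionary(
    reference: str,
    embedded: Dict[str, str],
    local_registry: Dict[str, str],
    fetch_url: Callable[[str], str] = fetch_with_urllib,
    read_file: Callable[[str], str] = read_local_file,
) -> str:
    """Return the dictionary text for a reference.

    Lookup order (an assumption of this sketch):
    1. a copy embedded in the data file itself,
    2. a site-specific registry mapping references to local paths,
    3. the reference itself, if it is a URL.
    """
    if reference in embedded:
        return embedded[reference]
    if reference in local_registry:
        return read_file(local_registry[reference])
    if reference.startswith(("http://", "https://")):
        return fetch_url(reference)
    raise LookupError(f"cannot resolve dictionary reference: {reference}")
```

An archival copy of a data file would populate the embedded mapping; everyday
working files would leave it empty and rely on the URL.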

Of course, the IUCr site might one day cease to operate, but if all
the referenced dictionaries could be found at one location, it might
be easier to ensure that they are transferred to another authority, or
that other arrangements could be made to restore the integrity of the
data files that referenced that one site.

* * * *

It's also interesting to consider the history of SGML. The purpose of
a DTD is to impose a particular document structure and to validate a
document against that structure. The computational requirements of
document validation are high, and the management of multiple DTDs is
also a complex programming task. Very few completely independent SGML
systems have ever been written. Most software packages that
handle SGML depend ultimately on James Clark's SP parser. Fully
competent SGML software is complex, mostly available as expensive
commercial packages, and generally implemented within an organisation
that invests heavily in publishing or document management. [SGML is,
however, a powerful software system, and for many such organisations
the investment is fully repaid.]

SGML never made much headway in the outside world, but XML changed
that. XML may be considered a subset of SGML, but it offers a number
of simplifications. One was that DTDs were no longer mandatory. An XML
file could be considered acceptable so long as it was 'well formed'
(i.e. adhered to syntax and some simple structural standards). Of
course one could do less with a document that was not specified by a
DTD, but for many practical purposes a community could adopt certain
procedural rules or conventions, and gain the benefit they required
without needing to design and implement a full DTD. Semantic content
could be carried through schemas; presentation could be externalised
in style sheets. XML took off. It does still support DTDs (and I
suspect most XML DTD parsers are still built on Clark's SP), but few
applications use these.
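
The difference between checking well-formedness and enforcing a community
convention without a DTD can be shown with Python's standard library, whose
xml.etree.ElementTree parser checks syntax but does not validate against a
DTD. The REQUIRED_CHILDREN set and the element names below are hypothetical,
standing in for whatever procedural rules a community might agree on.

```python
import xml.etree.ElementTree as ET

# Hypothetical convention: every document root must contain these children.
REQUIRED_CHILDREN = {"title", "author"}


def is_well_formed(text: str) -> bool:
    """True if the text parses as XML; no DTD is consulted."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


def follows_convention(text: str) -> bool:
    """True if the root element has all the agreed child elements."""
    root = ET.fromstring(text)
    present = {child.tag for child in root}
    return REQUIRED_CHILDREN <= present
```

A document can pass the first check and fail the second, which is roughly the
position of a syntactically correct CIF that has not been checked against any
dictionary.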

So there is a historical inversion of what we are now proposing for CIF.
XML is analogous to the older flavours of CIF ('well-formed' corresponding
to syntactic correctness, schemas mapping to some extent onto DDL1 and
DDL2 dictionaries).

By analogy with SGML, I suspect that DDLm will only ever attract one
or two fully functional parser/validators, so these must be robust and
capable of being integrated into other application packages. Given
that, the complexity of handling multiple dictionaries will probably
be delegated to just one or two public domain libraries. It is likely
that only a few organisations will be able to, or want to, handle the
full complexity of CIF3 validation and processing (the IUCr, PDB, CCDC,
etc.). I do not foresee Jmol, for example, implementing a full DDLm
compiler (though with Bob Hanson at the helm, Jmol might just be the
application to do that...).

So it might be that functions such as "return missing values by
DDLm methods evaluation of the other content" move into the realm
of web services provided by the IUCr or other service providers,
rather than compiled-in functions of a crystallographic software
suite. If that's the case, then the problem of exposing intermediate
data items/values also becomes less acute. 
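
As a rough picture of what "return missing values by evaluation of the other
content" means, the sketch below returns a stored value if there is one and
otherwise derives it. It is plain Python, not dREL and not a DDLm method
evaluator: the METHODS table and the functions are hypothetical, the data
names are written in the dotted style but should be checked against the
dictionary in use, and the only derivation included is the standard formula
for the volume of a unit cell from its lengths and angles.

```python
import math
from typing import Callable, Dict


def cell_volume(block: Dict[str, float]) -> float:
    """Unit-cell volume from cell lengths and angles (in degrees)."""
    a, b, c = (block[f"_cell.length_{axis}"] for axis in "abc")
    cos_a, cos_b, cos_g = (
        math.cos(math.radians(block[f"_cell.angle_{name}"]))
        for name in ("alpha", "beta", "gamma")
    )
    return a * b * c * math.sqrt(
        1.0 - cos_a ** 2 - cos_b ** 2 - cos_g ** 2
        + 2.0 * cos_a * cos_b * cos_g
    )


# Hypothetical table of derivation methods, keyed by data name.
METHODS: Dict[str, Callable[[Dict[str, float]], float]] = {
    "_cell.volume": cell_volume,
}


def get_value(block: Dict[str, float], name: str) -> float:
    """Return the stored value, or derive it if a method is known."""
    if name in block:
        return block[name]
    method = METHODS.get(name)
    if method is None:
        raise KeyError(name)
    return method(block)
```

The caller sees only the final value; the intermediate cosines never leave
cell_volume(), which is the sense in which a service-style interface makes
the question of exposing intermediate items less pressing.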

Regards
Brian
_________________________________________________________________________
Brian McMahon                                       tel: +44 1244 342878
Research and Development Officer                    fax: +44 1244 314888
International Union of Crystallography            e-mail:  bm@iucr.org
5 Abbey Square, Chester CH1 2HU, England
