Re: Important CIF items for discussion
- Subject: Re: Important CIF items for discussion
- From: Brian McMahon <bm@xxxxxxxx>
- Date: Mon, 21 Jul 2008 11:36:55 +0100
- In-Reply-To: <48777D55.6050606@mcmaster.ca>
- References: <48777D55.6050606@mcmaster.ca>
It may be useful to consider the relationship between SGML document type definitions (DTDs) and CIF virtual dictionaries.

In SGML, which is an electronic publishing markup language, a document must refer to the DTD which describes its structure. The DTD is itself written in SGML. It may contain descriptions of the document structure and/or references to other DTDs. An SGML parser reads an SGML file, dereferences the calls to DTDs, and assembles a document structural model against which the remaining contents of the file are validated. (The "document structure" comprises valid tags, the order in which they may appear or be nested, allowed data types, and perhaps discrete permitted values. There are many analogies with CIF/DDL.)

The SGML standard allows a DTD to be an external file (or set of files imported and assembled recursively), or an integral part of the SGML document. Again, the parallels with the DDLm dictionary being external to, or integral with, the CIF are clear.

In practice, organizations managing SGML documents maintain a central repository of standard DTDs, and the individual documents reference them rather than import them physically. This keeps the documents themselves compact. [Compare CIF, where a typical data file for a structure is a few tens of kB, while the core dictionary, even without methods, is nearly half a megabyte; the PDB exchange dictionary is over 3 MB.]

For SGML, the location of a DTD is usually specified through a registry of local directories and files rather than by URL (the SGML standard predates the Web). This reduces the portability of SGML, although it is considered adequate to allow, say, a typesetter and a publisher to interact; both organisations invest so much in their activities that the effort to create matching registries is considered acceptable.
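The parser behaviour described above (dereference the imports, assemble one structural model, validate the rest of the file against it) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory `REGISTRY`, the dictionary names "core" and "powder", the data names and the functions `assemble` and `validate` are all invented for the example, and a real DTD or DDLm dictionary carries far more than a name-to-type mapping.

```python
# Hypothetical in-memory "registry": name -> definition with optional imports.
# Real DTDs or DDLm dictionaries are far richer; this only mimics the shape.
REGISTRY = {
    "core": {
        "imports": [],
        "items": {"_cell.length_a": float, "_cell.length_b": float},
    },
    "powder": {
        "imports": ["core"],
        "items": {"_pd.scan_method": str},
    },
}


def assemble(name, registry, seen=None):
    """Recursively merge a definition with everything it imports."""
    seen = set() if seen is None else seen
    if name in seen:  # guard against circular imports
        return {}
    seen.add(name)
    entry = registry[name]
    model = {}
    for imported in entry["imports"]:
        model.update(assemble(imported, registry, seen))
    model.update(entry["items"])  # local definitions override imported ones
    return model


def validate(data, model):
    """Return a list of problems found when checking data against the model."""
    problems = []
    for tag, value in data.items():
        if tag not in model:
            problems.append(f"unknown item: {tag}")
        elif not isinstance(value, model[tag]):
            problems.append(
                f"wrong type for {tag}: expected {model[tag].__name__}"
            )
    return problems


model = assemble("powder", REGISTRY)
data = {"_cell.length_a": 5.43, "_pd.scan_method": "step", "_foo.bar": 1}
print(validate(data, model))  # ['unknown item: _foo.bar']
```

In this sketch a definition carried inside the document itself could be handled by adding its entries to the registry mapping before calling `assemble`; whether the definitions come from a local registry, a URL or the file itself does not change the assembly step.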
If one wishes to transfer an SGML document to someone who does not have access to the same registry, or if one wishes to ensure that a single file can be archived without dependency on external resources, then the full DTD can be inserted into the document instance.

Technically, then, we should permit both possibilities for DDLm dictionaries. However, we might take the view that stable and reliable URLs make it easier to provide a common "registry" accessible by all users, and so implement URL-based references to DDLm dictionaries as standard working practice. The IUCr would be reasonably happy to host copies of specialist or private dictionaries, so long as they met certain quality criteria. Of course, the IUCr site might one day cease to operate, but if all the referenced dictionaries could be found at one location, it might be easier to ensure that they are transferred to another authority, or that other arrangements could be made to restore the integrity of the data files that referenced that one site.

* * * *

It's also interesting to consider the history of SGML. The purpose of a DTD is to impose a particular document structure and to validate a document against that structure. The computational requirements of document validation are high, and the management of multiple DTDs is also a complex programming task. Very few completely independent SGML systems have ever been written. Most software packages that handle SGML depend ultimately on James Clark's SP parser. Fully competent SGML software is complex, mostly available as expensive commercial packages, and generally implemented within an organisation that invests heavily in publishing or document management. [SGML is, however, a powerful software system, and for many such organisations the investment is fully repaid.]

SGML never made much headway in the outside world, but XML changed that. XML may be considered a subset of SGML, but it offers a number of simplifications.
One was that DTDs were no longer mandatory. An XML file could be considered valid so long as it was 'well formed' (i.e. adhered to syntax and some simple structural standards). Of course one could do less with a document that was not specified by a DTD, but for many practical purposes a community could adopt certain procedural rules or conventions, and gain the benefit they required without needing to design and implement a full DTD. Semantic content could be carried through schemas; presentation could be externalised in style sheets.

XML took off. It does still support DTDs (and I suspect most XML DTD parsers are still built on Clark's SP), but few applications use these.

So there is a historical inversion of what we are now proposing for CIF. XML is analogous to the older flavours of CIF ('well-formed' corresponding to syntactic correctness, schemas mapping to some extent onto DDL1 and DDL2 dictionaries).

By analogy with SGML, I suspect that DDLm will only ever attract one or two fully functional parser/validators, so these must be robust and capable of being integrated into other application packages. Given that, the complexity of handling multiple dictionaries will probably be delegated to just one or two public domain libraries. It is likely that only a few organisations will be able to, or want to, handle the full complexity of CIF3 validation and processing (the IUCr, PDB, CCDC etc.). I do not foresee Jmol, for example, implementing a full DDLm compiler (though with Bob Hanson at the helm, Jmol might just be the application to do that...).

So it might be that functions such as "return missing values by DDLm methods evaluation of the other content" move into the realm of web services provided by the IUCr or other service providers, rather than compiled-in functions of a crystallographic software suite. If that's the case, then the problem of exposing intermediate data items/values also becomes less acute.
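The distinction drawn above between a file that is merely well formed and one checked against a fuller definition can be illustrated with Python's standard `xml.etree.ElementTree`, which is a non-validating parser: it reports syntax errors but knows nothing of DTDs. The sketch layers a community "convention" on top of it. The element names, the `REQUIRED_CHILDREN` table and both helper functions are invented for the example and do not correspond to any real schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """True if the text parses as XML; no DTD or schema is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical community convention: a <cell> element should carry these children.
REQUIRED_CHILDREN = {"cell": {"length_a", "length_b", "length_c"}}


def missing_by_convention(text):
    """Map each constrained element tag to the required children it lacks."""
    root = ET.fromstring(text)
    missing = {}
    for elem in root.iter():
        required = REQUIRED_CHILDREN.get(elem.tag)
        if required:
            absent = required - {child.tag for child in elem}
            if absent:
                missing[elem.tag] = sorted(absent)
    return missing


good = "<crystal><cell><length_a>5.4</length_a></cell></crystal>"
bad = "<crystal><cell></crystal>"  # <cell> is never closed

print(is_well_formed(good))         # True
print(is_well_formed(bad))          # False
print(missing_by_convention(good))  # {'cell': ['length_b', 'length_c']}
```

The first check corresponds roughly to syntactic correctness of a CIF; the second stands in for the lightweight, convention-based checking that many applications settle for instead of full validation.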
Regards
Brian
_________________________________________________________________________
Brian McMahon                                       tel: +44 1244 342878
Research and Development Officer                    fax: +44 1244 314888
International Union of Crystallography           e-mail: bm@iucr.org
5 Abbey Square, Chester CH1 2HU, England
_______________________________________________
cif-developers mailing list
cif-developers@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif-developers
- Follow-Ups:
- Re: Important CIF items for discussion (Herbert J. Bernstein)