Discussion List Archives


Re: Important CIF items for discussion

Dear Colleagues,

   I, for one, find Brian's comments on SGML and XML to be on point.
Whether the IUCr wishes it or not, there is going to be significant
use of CIFs by programs that will not validate those CIFs against
any dictionaries.  The vocabulary control will be in the programs,
and unknown features will simply be ignored (think HTML).

   We need to talk this out carefully in Osaka.

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Mon, 21 Jul 2008, Brian McMahon wrote:

> It may be useful to consider the relationship between SGML document type
> definitions (DTDs) and CIF virtual dictionaries.
>
> In SGML, which is an electronic publishing markup language, a document
> must refer to the DTD which describes its structure. The DTD is
> itself written in SGML. It may contain descriptions of the document
> structure and/or references to other DTDs. An SGML parser reads an
> SGML file, dereferences the calls to DTDs, and assembles a document
> structural model against which the remaining contents of the file are
> validated. (The "document structure" comprises valid tags, order in
> which they may appear or be nested, allowed data types, and perhaps
> discrete permitted values. There are many analogies with CIF/DDL.)
>
> The SGML standard allows a DTD to be an external file (or set of files
> imported and assembled recursively), or an integral part of the SGML
> document. Again, the parallels with the DDLm dictionary being external
> to, or integral with, the CIF are clear.
>
> In practice, organizations managing SGML documents maintain a central
> repository of standard DTDs, and the individual documents reference
> them rather than import them physically. This allows the physical size
> of the documents to be more compact. [Compare CIF, where a typical
> data file for a structure is a few tens of kB, while the core
> dictionary, even without methods, is nearly half a megabyte; the PDB
> exchange dictionary is over 3 MB.]
>
> For SGML, the location of a DTD is usually specified through a registry
> of local directories and files rather than by URL (the SGML standard
> predates the Web). This reduces the portability of SGML, although it
> is considered adequate to allow, say, a typesetter and a publisher to
> interact; both organisations invest so much in their activities that
> the effort to create matching registries is considered acceptable.
>
> If one wishes to transfer an SGML document to someone who does not
> have access to the same registry, or if one wishes to ensure that a
> single file can be archived without dependency on external resources,
> then the full DTD can be inserted into the document instance.
>
> Technically, then, we should permit both possibilities for DDLm
> dictionaries. However, we might take the view that stable and reliable
> URLs make it easier to provide a common "registry" accessible by all
> users, and so implement URL-based references to DDLm dictionaries as
> standard working practice. The IUCr would be reasonably happy to host
> copies of specialist or private dictionaries, so long as they met
> certain quality criteria.
>
> Of course, the IUCr site might one day cease to operate, but if all
> the referenced dictionaries could be found at one location, it might
> be easier to ensure that they are transferred to another authority, or
> that other arrangements could be made to restore the integrity of the
> data files that referenced that one site.
>
> * * * *
>
> It's also interesting to consider the history of SGML. The purpose of
> a DTD is to impose a particular document structure and to validate a
> document against that structure. The computational requirements of
> document validation are high, and the management of multiple DTDs is
> also a complex programming task. Very few completely independent SGML
> systems have ever been written. Most software packages that
> handle SGML depend ultimately on James Clark's SP parser. Fully
> competent SGML software is complex, mostly available as expensive
> commercial packages, and generally implemented within an organisation
> that invests heavily in publishing or document management. [SGML is,
> however, a powerful software system, and for many such organisations
> the investment is fully repaid.]
>
> SGML never made much headway in the outside world, but XML changed
> that. XML may be considered a subset of SGML, but it offers a number
> of simplifications. One was that DTDs were no longer mandatory. An XML
> file could be considered valid so long as it was 'well formed'
> (i.e. adhered to syntax and some simple structural standards). Of
> course one could do less with a document that was not specified by a
> DTD, but for many practical purposes a community could adopt certain
> procedural rules or conventions, and gain the benefit they required
> without needing to design and implement a full DTD. Semantic content
> could be carried through schemas; presentation could be externalised
> in style sheets. XML took off. It does still support DTDs (and I
> suspect most XML DTD parsers are still built on Clark's SP), but few
> applications use these.
>
> So there is a historical inversion of what we are now proposing for CIF.
> XML is analogous to the older flavours of CIF ('well-formed' corresponding
> to syntactic correctness, schemas mapping to some extent onto DDL1 and
> DDL2 dictionaries).
>
> By analogy with SGML, I suspect that DDLm will only ever attract one
> or two fully functional parser/validators, so these must be robust and
> capable of being integrated into other application packages. Given
> that, the complexity of handling multiple dictionaries will probably
> be delegated to just one or two public domain libraries. It is likely
> that only a few organisations will be able to, or want to, handle the
> full complexity of CIF3 validation and processing (the IUCr, PDB, CCDC
> etc). I do not foresee Jmol, for example, implementing a full DDLm
> compiler (though with Bob Hanson at the helm, Jmol might just be the
> application to do that...).
>
> So it might be that functions such as "return missing values by
> DDLm methods evaluation of the other content" move into the realm
> of web services provided by the IUCr or other service providers,
> rather than compiled-in functions of a crystallographic software
> suite. If that's the case, then the problem of exposing intermediate
> data items/values also becomes less acute.
>
> Regards
> Brian
> _________________________________________________________________________
> Brian McMahon                                       tel: +44 1244 342878
> Research and Development Officer                    fax: +44 1244 314888
> International Union of Crystallography            e-mail:  bm@iucr.org
> 5 Abbey Square, Chester CH1 2HU, England
> _______________________________________________
> cif-developers mailing list
> cif-developers@iucr.org
> http://scripts.iucr.org/mailman/listinfo/cif-developers
>
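
The distinction Brian draws between a document that is merely 'well formed' and one validated against a DTD can be illustrated with a short Python sketch. It relies on the standard library's xml.etree.ElementTree, which is built on a non-validating parser and so checks syntax only. The element names (structure, cell, note) and the helper is_well_formed are invented for illustration and do not correspond to any real schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if text parses as XML; no DTD or schema is consulted."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical element names, used only to show the syntax check.
good = "<structure><cell a='5.43'/><note>free text</note></structure>"
bad = "<structure><cell a='5.43'></structure>"  # <cell> is never closed

print(is_well_formed(good))
print(is_well_formed(bad))
```

Nothing in this check says whether cell is a permitted element or whether a must be a number; that is the part a DTD, or by analogy a DDL dictionary, would add.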

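Herbert's remark that vocabulary control will live in the programs, with unknown features simply ignored, might look something like the following in practice. This is a minimal sketch, not a CIF parser: it assumes one tag and one value per line and does not handle loops, multi-line text fields, or quoted values containing spaces beyond stripping the outer quotes. The function name read_known_items and the tag _local_private_tag are hypothetical.

```python
# The program's own vocabulary; no dictionary file is read or validated against.
KNOWN_TAGS = {"_cell_length_a", "_cell_length_b", "_cell_length_c"}


def read_known_items(cif_text, known=KNOWN_TAGS):
    """Collect values for tags the program knows; silently skip the rest."""
    items = {}
    for line in cif_text.splitlines():
        line = line.strip()
        if not line.startswith("_"):
            continue  # data_ headers, comments, blank lines
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # value on a later line: out of scope for this sketch
        tag, value = parts
        tag = tag.lower()  # CIF data names are case-insensitive
        if tag in known:
            items[tag] = value.strip("'\"")
        # Unknown tags fall through here and are ignored (think HTML).
    return items


sample = """\
data_example
_cell_length_a   5.4310
_cell_length_b   5.4310
_cell_length_c   5.4310
_local_private_tag  'something this program does not know'
"""

print(read_known_items(sample))
```

A reader of this kind never fails on an unfamiliar item, which is convenient for the program but means a misspelt tag goes unnoticed; catching that is what dictionary-based validation is for.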
