Re: IUPAC workshop on XML and IChI

Dear David,

In German it is 'das Schema' in the singular and 'die Schemata' in the 
plural, so it will have been the same in the Greek original.

Best wishes, George



I. David Brown wrote:
> Dear Colleague,
> 
> 	I have just returned from a workshop dealing with chemistry XML
> and the IUPAC Chemical Identifier (IChI).  I have appended below a report
> on those aspects of the workshop that are likely to be of interest to
> members of IUCr committees.
> 
> 	I apologize to those of you who receive more than one copy of this
> email.  I am circulating it to four groups who might be interested and
> several of you will be members of more than one of these.
> 
> 	I will be following up this report with further suggestions for
> discussion by the coreCIFchem and phaseID groups, but those of you who
> belong to other groups may find this report interesting.
> 
> 			Best wishes
> 
> 				David
> 
> Keep scrolling - More below
> *****************************************************
> Dr.I.David Brown,  Professor Emeritus
> Brockhouse Institute for Materials Research,
> McMaster University, Hamilton, Ontario, Canada
> Tel: 1-(905)-525-9140 ext 24710
> Fax: 1-(905)-521-2773
> idbrown@mcmaster.ca
> *****************************************************
> 
> Report on the workshop on Chemical XML and the IUPAC Chemical
> Identifier (IChI) held at NIST 12-14 Nov. 2003.
> 
> I.D.Brown
> 
> Summary.
> --------
> There is currently no organization coordinating the XML
> ontologies being developed for the various branches of chemistry,
> even though several chemical specialties are developing detailed
> ontologies in their own disciplines.  However, a project to
> develop an IUPAC Chemical Identifier (IChI) in the form of an
> electronic character string that uniquely identifies a compound,
> is well advanced and shows promise as a search key.
> 
> Introduction
> ------------
> IUPAC has appointed a Committee on Printed and Electronic
> Publication (CPEP) which in turn has a subcommittee on Electronic
> Data Standards (EDS).  The latter has two projects that were the
> subject of a workshop held at NIST, Gaithersburg in November
> 2003.  The first is the development of a Chemical XML dictionary
> and the second the development of an IUPAC Chemical Identifier
> (IChI).  This document reports on this workshop for the benefit
> of interested groups in the International Union of
> Crystallography.
> 
> Chemical XML
> ------------
> Although the EDS would appear to be the IUPAC equivalent of
> COMCIFS, the two committees have very different mandates.  The
> primary role of EDS is to define XML schema or dictionaries that
> would allow IUPAC to produce web versions of its Gold Book
> (definitions of chemical terms) and Green Book (mathematical
> relations used in analytical chemistry).  This is equivalent to
> producing web versions of International Tables for
> Crystallography.  EDS is therefore interested in reproducing
> text, mathematical equations and chemical structure diagrams on
> the web using XML versions of the printed Gold and Green Books.
> EDS is explicitly not interested in (or believes it does not have
> the authority to) recommend or coordinate electronic ontologies
> for chemistry as a whole, including defining such items as
> chemical formulae that might be expected to appear in many
> different chemistry XML schema.  In its more limited role, EDS is
> proposing to express mathematical formulae using the existing
> MathML (a general mark-up language prepared by mathematicians),
> units using the similarly general UnitsML, and chemical diagrams
> in a form that would allow them to be printed using SVG.
> 
> Even though the scope of EDS is limited, the workshop received
> reports from several groups developing ontologies for specialist
> branches of chemistry (including the report on CIF that I gave).
> There was a general appreciation that the most important task is
> to define the ontologies (the contents of the dictionaries) and
> that one should not worry too much about the language in which
> they are expressed.  XML is the current flavour of the year, but
> XML might well be superseded by a different (better?) system in
> five or ten years' time.  A well designed ontology could easily
> migrate from one delivery system to another.
> 
> Among the 8 to 10 groups working on specialized chemical
> ontologies in the form of XML schema, ThermoML and SpectroML stood
> out as being well advanced.  Their schema (schemae?) are more
> directly comparable with CIF, in that they are designed to
> capture the results of experimental measurements in their
> respective disciplines.  ThermoML has been adopted by five of the
> leading thermodynamic journals (representing three different
> publishers), but rather than requiring authors to submit papers
> in ThermoML, the journals will continue to accept papers in
> traditional formats (90% are submitted in MSWord).  The mark-up
> into XML will be carried out by the publishers and XML versions
> of the results will be submitted to a thermodynamic database.
> Another group is producing a schema (a schemum?) for analytical
> measurements (AnIML) and a group in Prague is working on a Mark-up
> Language for chemical structures based on Graph Theory (GTML).
> Most of these projects are closely related to particular
> experimental techniques where the concepts are specialized.
> There is no group, either existing or proposed, that is charged
> with coordinating these efforts to ensure that the definitions do
> not conflict.
> 
> From the crystallographer's point of view the most interesting
> project is Peter Murray-Rust's Chemical Mark-up Language (CML)
> which aims to capture the chemical structures that are at the
> heart of any description of chemistry, specifically organic
> chemistry.  Peter has been working on this project for many years
> and his schema are well thought out and tested using software he
> has written.  A number of publishers and the European Patent
> Office have expressed interest in CML, and Peter has been working
> closely with the chemical modelling community to develop a
> version of CML for them.  The schema in CML are very general,
> specifying only that molecules are composed of atoms which are
> linked by bonds, but molecules, atoms and bonds are not defined,
> leaving it to the user to decide which atoms are bonded and
> therefore which atoms constitute a molecule.  One can see the
> reasons for such an open-ended approach, but the philosophy is
> very different from that adopted by CIF.  CML is not likely to
> give us much guidance as we extend CIF to include chemical (as
> opposed to crystallographic) concepts.  However, Peter has
> written programs that will convert DDL1 CIF to cifML and vice
> versa, cifML being a version of XML that explicitly employs CIF
> datanames and ontologies.
> 
> One attractive feature of XML that we might consider incorporating into
> CIF is the ability to avoid namespace collisions.  Two schema
> (dictionaries), foo and fee, that both use the name 'bond_order', though
> with different definitions, would give rise to items with names like
> foo:bond_order and fee:bond_order where 'foo' and 'fee' are prefixes bound
> to web URLs where the respective schema can be found.  This allows two XML
> files based on different schema to be concatenated, but it does not
> provide precise definitions for the values of 'bond_order' in the
> different schema.  They may be defined the same way or they may not.  A
> search across databases would retrieve both kinds of bond_orders, but a
> computer would have to assume that the quantities are unrelated.  The
> resulting different dialects of chemistry would make it difficult to
> synthesize information across different databases.
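The namespace behaviour described above can be illustrated with a minimal Python sketch using the standard library's xml.etree.ElementTree. The two namespace URLs and the element values are hypothetical, chosen only for illustration; no such schema are named in the report. The sketch shows that the two 'bond_order' items stay distinct after the files are combined, while nothing in the XML itself says whether their definitions agree.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URLs: stand-ins for the 'foo' and 'fee' schema.
FOO = "http://example.org/schema/foo"
FEE = "http://example.org/schema/fee"

# One document holding items from both schema, as after concatenation.
doc = f"""
<data xmlns:foo="{FOO}" xmlns:fee="{FEE}">
  <foo:bond_order>2</foo:bond_order>
  <fee:bond_order>1.5</fee:bond_order>
</data>
""".strip()


def bond_orders(root):
    """Map each fully qualified bond_order tag to its text value."""
    return {el.tag: el.text for el in root if el.tag.endswith("}bond_order")}


root = ET.fromstring(doc)
found = bond_orders(root)

# The parser expands each prefix to its URL, so the names do not collide.
for tag, value in found.items():
    print(tag, value)
```

Both items are retrieved by a search on the local name 'bond_order', but a program has only the differing namespace URLs to go on, so it cannot tell whether the two values measure the same quantity.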
> 
> When I asked the EDS where one could find IUPAC recommendations
> for an electronic coding of widely used chemical concepts such as
> chemical formulae, everybody in the room started pointing to
> someone else (the scene was reminiscent of Alice in Wonderland!),
> but the eventual consensus was that IUPAC has no mechanism for
> making recommendations at this level of detail, because if it
> did, the recommendations would probably be ignored by the
> chemical community.  This may have been the experience with IUPAC
> recommendations in the past, but a consortium of groups devising
> chemMLs would have a strong motivation to adopt compatible
> definitions for common chemical concepts.  At present it would
> appear that, apart from the sum_chemical_formula for which rules
> already exist, it is unlikely that the various chemMLs will adopt
> compatible definitions of key chemical concepts.  The feeling
> among the members of EDS is that it will be time enough to
> resolve these conflicts when they arise!
> 
> IChI (IUPAC Chemical Identifier)
> --------------------------------
> This inability to coordinate ontologies is perhaps why EDS set up
> the IUPAC Chemical Identifier (IChI) project which aims to
> recommend an identifier that would be able to locate the same
> compound in different databases.  This project was the subject of
> the second half of the workshop.  When the IChI group was set up,
> they approached the IUCr Nomenclature Commission for advice on
> how to identify different crystalline phases.  The chair of the
> Commission at the time, S.C.Abrahams, asked me to set up a working
> group to make recommendations that could be passed back to IChI.
> Our working group, acting independently of IChI, has discussed a
> number of possibilities which, fortunately, should be easy to
> incorporate into the recommended IChI identifier.
> 
> A proposal for the first version of the identifier covering
> mostly organic compounds is nearly ready, and the IChI working
> group has given thought to how a later version might cover a
> wider range of compounds.  The identifier is built up of a number
> of layers.  The top (first) layer contains only the chemical
> formula and will, for many compounds, be sufficient to identify
> the compound uniquely.  The second layer includes the chemical
> structure, i.e. a normalized description of the connectivity.
> The contents of this layer are determined by computer algorithms
> from a connectivity diagram supplied by the author.  Insofar as
> different authors may disagree on which atoms are bonded, the
> same compound may end up with different identifiers, but this
> layer of the identifier is made as robust as possible by ignoring
> hydrogen atoms, bond orders and charge assignments.  Hydrogen
> atoms are introduced at the third layer which can be ignored if
> one is not interested in a particular tautomer.  Still lower
> levels contain information about stereocenters and isotopes, and
> are included only if required.  Searches can be deep, returning
> only compounds with the same stereochemistry and isotopic
> content, or they can be restricted to higher levels if tautomers,
> stereochemistry and isotopes are not of interest.  Identification
> of the crystallographic phase by including, e.g., the space group
> number, can easily be added as yet a further layer.
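The idea of shallow and deep searches over a layered identifier can be sketched in a few lines of Python. This is only an illustration of the layering principle under stated assumptions: the layer names follow the order given above, the layer contents are placeholder strings rather than real IChI syntax, and the matches function is a hypothetical helper, not part of any IChI software.

```python
from typing import Sequence

# Layer order as described in the report; an identifier may stop early.
LAYERS = ("formula", "connectivity", "hydrogens", "stereo", "isotopes")


def matches(a: Sequence[str], b: Sequence[str], depth: int) -> bool:
    """Compare two identifiers on their first `depth` layers only."""
    return tuple(a[:depth]) == tuple(b[:depth])


# Placeholder layer contents (assumed for illustration, not real IChI).
# The two entries differ only in where the hydrogen atoms sit.
form_a = ("C3H6O", "connectivity-1", "hydrogens-A")
form_b = ("C3H6O", "connectivity-1", "hydrogens-B")

# Shallow search: formula and connectivity only, tautomers are not separated.
shallow = matches(form_a, form_b, depth=2)

# Deeper search: the hydrogen layer is included, so the two are told apart.
deep = matches(form_a, form_b, depth=3)

print(shallow, deep)
```

A crystallographic phase layer, such as a space group number, would in this picture simply be one more entry appended to the sequence and compared only in the deepest searches.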
> 
> Version 1 of IChI has impressed those who have been testing it.
> It works well, as might be expected, for organic compounds, but
> also for many inorganic and metallorganic compounds if the bonds
> to the metal atoms (or cations) are not included in the second
> layer.  They can be introduced in a lower layer if needed, e.g.,
> to distinguish between isomers with different metal coordination.
> At present the identifier is not designed to describe polymeric
> structures, clusters or disordered structures but the IChI group
> is interested in including these features in future versions.
> 
> We will probably wish to incorporate IChI into CIF when the final
> standard is approved.
> 
> I.D.Brown
> 2003-11-19
> 
> _______________________________________________
> coreDMG mailing list
> coreDMG@iucr.org
> http://scripts.iucr.org/mailman/listinfo/coredmg
> 
> 


-- 
Prof. George M. Sheldrick FRS
Dept. Structural Chemistry,
University of Goettingen,
Tammannstr. 4,
D37077 Goettingen, Germany
Tel. +49-551-39-3021 or -3068
Fax. +49-551-39-2582

