E1003

AUTOMATIC DETERMINATION OF STRUCTURAL SUBCLASSES. Shishan Guo, Suzanne Fortier, Janice I. Glasgow, Chemistry Department and Computing and Information Science, Queen's University, Kingston, Ontario, Canada K7L 3N6

With the rapid growth of crystallographic databases, fully automatic methods for mining knowledge from them are needed. Several classification algorithms are already incorporated into the databases. While these have greatly facilitated the analysis and classification of datasets, considerable user intervention is still required. For example, extensive examination of the dataset may be needed to select the clustering algorithm, data parameters, similarity measure, similarity threshold, stopping point, etc. Furthermore, different choices of algorithm and metric often yield different results. It is therefore important to evaluate the robustness of the results and to assess their possible dependence on artifacts of the approach used. A fully automated classification approach thus requires methods for both pre-classification data preview and post-classification result assessment. This contribution presents a method for the automatic determination of structural subclasses in datasets retrieved from the Cambridge Structural Database (CSD). Subclasses (clusters) are obtained through a comprehensive automated data preview, followed by the application of clustering algorithms and then by a post-clustering evaluation of the results. The automatic preview component is based on a comprehensive analysis of histograms and scattergrams generated for potential classification parameters. This process helps identify informative parameters and gives a preliminary clustering of the dataset. For post-classification evaluation, plots of a clustering similarity index are used to assess how the results are affected by different algorithms and by the introduction of random noise into the dataset. These plots help in understanding the nature of the datasets being analysed by revealing characteristic features associated with the degree of overlap among the subclasses and by identifying where maximum similarity occurs.
Application of the automatic classification approach to four representative datasets - valine, hexopyranose sugars, steroid side-chains and six-membered rings - will be presented.
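
The post-classification evaluation described above, in which a clustering similarity index is tracked as random noise is added to the data, can be illustrated with a minimal sketch. The abstract does not specify which similarity index or clustering algorithms are used, so everything here is an assumption made for illustration: the Rand index stands in for the similarity index, a simple one-dimensional gap-based splitting stands in for the clustering algorithm, and the names `rand_index`, `gap_clusters` and `noise_stability` as well as the sample values are hypothetical. The periodicity of real torsion angles is ignored.

```python
import random
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Rand index: fraction of point pairs on which two clusterings agree.

    Assumption: used here as a stand-in for the (unspecified) clustering
    similarity index mentioned in the abstract.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


def gap_clusters(values, threshold):
    """Hypothetical 1-D clustering: sort the values and start a new cluster
    wherever the gap between neighbouring values exceeds `threshold`."""
    if not values:
        return []
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    current = 0
    for prev, nxt in zip(order, order[1:]):
        if values[nxt] - values[prev] > threshold:
            current += 1
        labels[nxt] = current
    return labels


def noise_stability(values, threshold, sigma, trials=20, seed=0):
    """Mean Rand index between the clustering of the original data and the
    clusterings of copies perturbed by Gaussian noise of width `sigma`."""
    rng = random.Random(seed)
    base = gap_clusters(values, threshold)
    scores = []
    for _ in range(trials):
        noisy = [v + rng.gauss(0.0, sigma) for v in values]
        scores.append(rand_index(base, gap_clusters(noisy, threshold)))
    return sum(scores) / trials


if __name__ == "__main__":
    # Made-up parameter values with three well-separated groups.
    angles = [-65, -60, -58, 55, 60, 62, 175, 178, 180]
    for sigma in (0.0, 5.0, 20.0, 60.0):
        print(sigma, round(noise_stability(angles, 30, sigma), 3))
```

Plotting the returned index against the noise level gives a curve of the kind the abstract describes: one would expect well-separated subclasses to keep a high index until the noise becomes comparable to the gaps between them, and overlapping subclasses to degrade earlier.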