Discussion List Archives

Re: Adding datanames covering database information

  • To: Distribution list of the IUCr COMCIFS Core Dictionary Maintenance Group <coredmg@iucr.org>
  • Subject: Re: Adding datanames covering database information
  • From: James Hester <jamesrhester@gmail.com>
  • Date: Thu, 28 Jun 2018 17:24:38 +1000
  • In-Reply-To: <CAM+dB2e2LDe-aJDVYN64+C7FBJynB=WnPjfqe6tgrnQRgOe7Sg@mail.gmail.com>
  • References: <CAM+dB2e2LDe-aJDVYN64+C7FBJynB=WnPjfqe6tgrnQRgOe7Sg@mail.gmail.com>
Please see below some draft definitions for a new database_related category, as foreshadowed in my email of April 12th. Comments are welcome, and if any databases have been left off the initial list below, please suggest additions.

Note that I have chosen not to make these datanames aliases of the DATABASE_2 datanames in mmCIF, as the new category has a different key.

James.
=============================================================
#
#  Draft definitions for a new DATABASE_RELATED category
#

save_DATABASE_RELATED
_definition.id          DATABASE_RELATED
_definition.class       Loop
_definition.scope       Category
_definition.update      2018-06-29
_description.text
;

    A category of items recording entries in databases that describe
    the same or related data. Databases wishing to insert their own
    canonical codes when archiving and delivering data blocks should
    use items from the DATABASE category.
   
;
_name.category_id       PUBLICATION
_name.object_id         DATABASE_RELATED
_category_key.name      '_database_related.id'
save_

save_database_related.id
_definition.id          '_database_related.id'
_definition.update      2018-06-29
_description.text
;
       An identifier for this database reference.
;
_name.category_id       database_related
_name.object_id         id
_type.purpose           Key
_type.source            Recorded
_type.container         Single
_type.contents          Text
save_

save_database_related.database_id
_definition.id          '_database_related.database_id'
_definition.update      2018-06-29
_description.text
;
       An identifier for the database that contains the
       related dataset.
;
_name.category_id       database_related
_name.object_id         database_id
_type.purpose           State
_type.source            Recorded
_type.container         Single
_type.contents          Text
_import.get [{'save':database_list 'file':templ_enum.cif}]
save_

save_database_related.database_code
_definition.id          '_database_related.database_code'
_definition.update      2018-06-29
_description.text
;
       The code used by the database referred to in
       _database_related.database_id to identify the
       related dataset.
;
_name.category_id       database_related
_name.object_id         database_code
_type.purpose           Encode
_type.source            Recorded
_type.container         Single
_type.contents          Text

save_

save_database_related.relation
_definition.id          '_database_related.relation'
_definition.update      2018-06-29
_description.text
;
       The general relationship of the data in the data block
       to the dataset referred to in the database.
;
_name.category_id       database_related
_name.object_id         relation
_type.purpose           State
_type.source            Recorded
_type.container         Single
_type.contents          Text
loop_
   _enumeration_set.state
   _enumeration_set.detail
   Identical           'The dataset contents are identical'
   Subset              'The dataset contents are a proper subset of the contents of the data block'
   Superset            'The dataset contents include the contents of the data block'
   Derived             'The dataset contents are derivable from the contents of the data block'
   Common              'The dataset contents share a common source'
save_

save_database_related.special_details
_definition.id          '_database_related.special_details'
_definition.update      2018-06-29
_description.text                      
;
    Information about the external dataset and relationship not encoded
    elsewhere.
;
_name.category_id                       database_related
_name.object_id                         special_details
_type.purpose                           Describe
_type.source                            Recorded
_type.container                         Single
_type.contents                          Text

save_


#
# Contents to be added to templ_enum.cif listing database codes
#


save_database_list
loop_
    _enumeration_set.state
    _enumeration_set.detail
    CAS          'Chemical Abstracts'
    COD          'Crystallographic Open Database'
    CSD          'Cambridge Structural Database'
    ICSD         'Inorganic Crystal Structure Database'
    MDF          'Metals Data File'
    NDB          'Nucleic Acid Database'
    PDB          'Protein Data Bank'
    PDF          'Powder Diffraction File (JCPDS/ICDD)'
    RCSB         'Research Collaboratory for Structural Bioinformatics'
    EBI          'European Bioinformatics Institute'
save_
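
As an illustration of how the two enumerations above might be used, here is a minimal Python sketch that checks rows of the proposed category against them. The function name check_rows, the row layout (one dict per loop row, keyed by object id) and the example codes are all hypothetical, and comparing states case-sensitively is an assumption rather than something the draft specifies.

```python
# Enumerated states copied from the draft definitions in this message.
DATABASE_IDS = {"CAS", "COD", "CSD", "ICSD", "MDF", "NDB", "PDB", "PDF", "RCSB", "EBI"}
RELATIONS = {"Identical", "Subset", "Superset", "Derived", "Common"}


def check_rows(rows):
    """Return a list of problems found in DATABASE_RELATED rows.

    Each row is a dict keyed by object id (id, database_id, database_code,
    relation). This layout is an assumption made for the sketch.
    """
    problems = []
    seen = set()
    for row in rows:
        key = row["id"]
        # _database_related.id is the category key, so it must be unique.
        if key in seen:
            problems.append(f"duplicate id {key}")
        seen.add(key)
        # Case-sensitive matching is an assumption of this sketch.
        if row["database_id"] not in DATABASE_IDS:
            problems.append(
                f"id {key}: database_id {row['database_id']} is not an enumerated state"
            )
        if row["relation"] not in RELATIONS:
            problems.append(
                f"id {key}: relation {row['relation']} is not an enumerated state"
            )
    return problems


# Hypothetical rows; the database codes are placeholders, not real entries.
EXAMPLE_ROWS = [
    {"id": "1", "database_id": "COD", "database_code": "1234", "relation": "Identical"},
    {"id": "2", "database_id": "COD", "database_code": "6789", "relation": "Common"},
    {"id": "3", "database_id": "CCDC", "database_code": "qrst-12", "relation": "Common"},
]

for problem in check_rows(EXAMPLE_ROWS):
    print(problem)
```

The third row is flagged because the draft list identifies the Cambridge Structural Database as CSD and has no CCDC state.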


On 12 April 2018 at 15:59, James Hester <jamesrhester@gmail.com> wrote:
Dear Core CIF users and experts,

The current core CIF provides the DATABASE and DATABASE_CODE categories for identifying a database entry corresponding to the structure contained in the data block, for a variety of pre-determined databases.  These are both Set categories, that is, their datanames can only take a single value in a single data block.  This restriction is reasonable if the database content for that entry is seen as coincident with the data block contents, as has been the case for structural databases.

However, it is possible for multiple entries from a single database to be more broadly relevant to the contents of a data block. For example, multiple structures may correspond to a single topology.  So I would like you to consider the creation of a (looped) DATABASE_RELATED category that would simply list entry codes for databases in the same way as CITATION simply lists literature references.  Other categories in other dictionaries may then reference these entries for their own uses.  This is not intended to replace the current DATABASE categories, which would still be preferred for use by structural databases upon deposition and delivery of CIF files.  The new category would instead align with the mmCIF DATABASE_2 category.

The proposed data names are as follows, with short summaries of their meanings:

_database_related.id                'An arbitrary identifier for this entry'
_database_related.database_id       'An identifier for the database from an enumerated list (e.g. CCDC, PDB, ICSD, COD ...)'
_database_related.reference         'A code used by the database given in _database_related.database_id'
_database_related.relation          'The way in which the database entry is related to the contents of the data block, from an enumerated list. Initial suggestions include "identical", "component", "derived", "common source"'
_database_related.special_details   'Optional free-form description of the relationship between this entry and the data block contents'
 
An example of use in a data file would then be:

loop_
_database_related.id         
_database_related.database_id        
_database_related.reference 
_database_related.relation    
_database_related.special_details
1    COD     1234       identical          'As deposited structure'
2    COD     6789       'common source'    'Curated version of this structure'
3    CCDC    qrst-12    'common source'    'Curated version of this structure'
4    ICSD    lll-ppp    .                  'An earlier version of the structure with missing H atoms'
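
To make the looped behaviour concrete, the following Python sketch reads the four example rows above and confirms that each row has its own id while one database (COD) contributes more than one entry. It is only a rough illustration: shlex is used as a stand-in for a real CIF parser and would not cope with general CIF syntax, and the names LOOP_TEXT, NAMES and parse_rows are hypothetical.

```python
import shlex
from collections import Counter

# The example loop rows from this message, with whitespace normalised.
LOOP_TEXT = """\
1    COD     1234       identical          'As deposited structure'
2    COD     6789       'common source'    'Curated version of this structure'
3    CCDC    qrst-12    'common source'    'Curated version of this structure'
4    ICSD    lll-ppp    .                  'An earlier version of the structure with missing H atoms'
"""

# Object ids of the proposed data names, in loop order.
NAMES = ["id", "database_id", "reference", "relation", "special_details"]


def parse_rows(text):
    """Split each line into values, treating quoted strings as one value."""
    rows = []
    for line in text.splitlines():
        values = shlex.split(line)
        if len(values) != len(NAMES):
            raise ValueError(f"expected {len(NAMES)} values, got {len(values)}")
        rows.append(dict(zip(NAMES, values)))
    return rows


rows = parse_rows(LOOP_TEXT)
ids = [row["id"] for row in rows]
entries_per_database = Counter(row["database_id"] for row in rows)

# The key is unique per row, yet a single database may appear several times.
print(len(ids) == len(set(ids)))
print(entries_per_database["COD"])
```

A Set category could hold only one such row per data block, which is why a looped category is proposed here.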

Please provide your thoughts on this general scheme, and any further data names that you think might be useful in this context.  If there are no objections, I will prepare formal definitions and advise this group when they are ready for inclusion.

best wishes,
James Hester.
--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148



_______________________________________________
coreDMG mailing list
coreDMG@iucr.org
http://mailman.iucr.org/cgi-bin/mailman/listinfo/coredmg