Re: Additional update to core dictionary

To: "Discussion list of the IUCr Committee for the Maintenance of the CIF Standard (COMCIFS)" <[email protected]>
Subject: Re: Additional update to core dictionary
From: Brian McMahon <[email protected]>
Date: Mon, 28 Mar 2011 23:16:20 +0100
In-Reply-To: <[email protected]>
References: <[email protected]><1893558218.22554.1300989799369.JavaMail.open-xchange@oxapp1.inap.sea.dotster.net><[email protected]>

JK> I think the idea of a definition for a document DOI is uncontroversial.
JK>  It's pretty common these days to access a journal article by its DOI.
JK>  Given the increasing interest in curation of raw data, might this
JK> idea get more complex?  Would the article DOI also include the raw
JK> data, or would multiple DOIs need to be accommodated?  Say one for
JK> the paper and another for the raw data?

Just so.

MT> AFAIK some journals are now using separate DOI for the supplementary
MT> data, that said many use a single DOI for the article then the
MT> data are accessed as sub-pages from this.  If multiple DOI for a
MT> single article were supported  I suggest having a single DOI
MT> item for the article then a separate loop for any associated data e.g.
MT> 
MT> _journal_article_doi             ABC1234
MT> loop_
MT> _journal_article_related_doi
MT> ABC5678
MT> ABC8970

OK, since there's some interest in this, let me share with you our
thinking so far. The following notes are from our internal discussion
document at the Acta offices.

==============================================================================
PROPOSAL FOR INCORPORATING DOI IDENTIFIERS IN CIF
-------------------------------------------------

(1) "Journal housekeeping, citation and indexing entries"

The _journal_ items (not usually modified by an author) are currently
used to record the bibliographic information about the article published
from the current CIF. The simplest extension would be:

    _journal_paper_doi              '10.1107/S010876739101067X'

For supplementary materials, the category currently includes

    _journal_suppl_publ_number
    _journal_suppl_publ_pages

designed to record the old SUP number and number of pages, both of
which were traditionally published in the deposition footnote. These
items cannot however be looped. So one might introduce a new loop
allowing greater characterization of "supporting" documents (hence
_sup_ instead of _suppl_):

loop_
    _journal_sup_material_id
    _journal_sup_material_role
    _journal_sup_material_mime_type
    _journal_sup_material_doi

1  cif    chemical/x-cif   10.1107/S0108768110051050/wh5011sup1.cif 
2  hkl    text/plain       10.1107/S0108768110051050/wh5011Pbar1sup2.hkl
3  hkl    text/plain       10.1107/S0108768110051050/wh5011P21csup3.hkl
4  rtv    text/plain       10.1107/S0108768110051050/wh5011Pbar1sup4.rtv
5  rtv    text/plain       10.1107/S0108768110051050/wh5011P21csup5.rtv
6  extra  application/pdf  10.1107/S0108768110051050/wh5011sup6.pdf


Possible enumerations for _journal_sup_material_role:
   cif     'structural data model in CIF format'
   mmcif   'structural data model in mmCIF format'
                (or mcf if we want to promote standard filename extensions)
   hkl     'structure factors'
   rtv     'Rietveld powder data'
   extra   'additional article content (e.g. figures, tables, appendices)'
   data    'supporting data in a machine-parseable format'
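
As a rough illustration of how software might consume such a loop, here
is a minimal Python sketch. The tag names and role codes are those
proposed above; the helper functions (parse_sup_rows, unknown_roles) and
the whitespace-splitting approach are assumptions for illustration only,
not a real CIF parser.

```python
# Tag names as proposed above; everything else here is a hypothetical sketch.
SUP_TAGS = [
    "_journal_sup_material_id",
    "_journal_sup_material_role",
    "_journal_sup_material_mime_type",
    "_journal_sup_material_doi",
]

# Role codes copied from the "possible enumerations" list.
SUP_ROLES = {"cif", "mmcif", "hkl", "rtv", "extra", "data"}

# Three rows taken from the example loop.
LOOP_ROWS = """\
1  cif    chemical/x-cif   10.1107/S0108768110051050/wh5011sup1.cif
2  hkl    text/plain       10.1107/S0108768110051050/wh5011Pbar1sup2.hkl
6  extra  application/pdf  10.1107/S0108768110051050/wh5011sup6.pdf
"""


def parse_sup_rows(text):
    """Split whitespace-separated loop rows into dicts keyed by tag name."""
    rows = []
    for line in text.splitlines():
        values = line.split()
        if not values:
            continue  # skip blank lines
        if len(values) != len(SUP_TAGS):
            raise ValueError(
                f"expected {len(SUP_TAGS)} values, got {len(values)}: {line!r}"
            )
        rows.append(dict(zip(SUP_TAGS, values)))
    return rows


def unknown_roles(rows):
    """Return the ids of rows whose role is outside the proposed enumeration."""
    return [
        row["_journal_sup_material_id"]
        for row in rows
        if row["_journal_sup_material_role"] not in SUP_ROLES
    ]


rows = parse_sup_rows(LOOP_ROWS)
print(unknown_roles(rows))
```

Note that nothing in a row points back to _journal_paper_doi; the sketch
relies on the implicit association raised in the question below.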

QUESTION: Is it OK to assume that it is not necessary to make an
explicit connection with _journal_paper_doi - i.e. that all of these
items are implicitly associated with the one publication derived from
this CIF? (I think it is.)



(2) "Contents of a publication"

The _publ_ items concern the content of a publication and are
created/edited by the author. We introduce a category that allows
the listing of links to related materials.

loop_
    _publ_related_id
    _publ_related_citation_id
    _publ_related_publisher
    _publ_related_link_identifier
    _publ_related_link_identifier_type
    _publ_related_role
    _publ_related_details

1  .  pdb        2zse                    refcode  struct      .
2  .  pdb        10.2210/pdb2zse/pdb     doi      struct      .
3  .  uniprotkb  P63810                  refcode  seq         .
4  .  pdb        r2zsesf                 refcode  relsfac     .
5  .  pdb        2zs7                    refcode  relstruct   'citrate complex'
6  .  icsd       161730                  refcode  relstruct   .
7  .  csd        ADENTP01                refcode  relstruct   .
8  1  ?          10.1074/jbc.C500044200  doi      relpub      .
9  1  ?          15795230                pmid     relpub      .
10 2  ?          123456                  casreg   relchem     .


_publ_related_citation_id allows one of these links to be
cross-referenced to a structured entry in the reference list
(CITATION family of CIF categories)

_publ_related_publisher is possibly not well named: all these examples
are databases, and authors might not know the publisher of a journal
(or a journal publisher could change over the lifetime of a journal).

Examples here of _publ_related_link_identifier_type include "refcode"
(meaning any accession code that is local to a particular database,
not just a CSD 'refcode'), PubMed ID and CAS registry number (but are
these inherently different from "refcode"?) and DOI. The notion behind
"doi" is that you can figure out how to use it directly
(e.g. http://dx.doi.org/blah....). Maybe a better scheme would be doi,
url, uri, urn, refcode ?
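
To make the distinction concrete, here is a small Python sketch of the
"directly usable" idea. The function name resolvable_url is hypothetical;
the resolver prefix is the one quoted above, and "url" is taken from the
alternative scheme just suggested. Anything else would need
database-specific knowledge, so the sketch simply gives up on it.

```python
# Resolver prefix as quoted in the text above.
DOI_RESOLVER = "http://dx.doi.org/"


def resolvable_url(identifier, identifier_type):
    """Return a URL for identifiers that can be used directly, else None.

    Hypothetical helper: only "doi" and "url" are treated as directly
    usable. A "refcode" (or pmid, casreg) is local to one database and
    cannot be resolved without knowing that database's conventions.
    """
    if identifier_type == "doi":
        return DOI_RESOLVER + identifier
    if identifier_type == "url":
        return identifier
    return None


print(resolvable_url("10.2210/pdb2zse/pdb", "doi"))
print(resolvable_url("2zse", "refcode"))
```

The sketch does no escaping of unusual characters in the DOI suffix; a
production resolver link would need to consider that.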

Examples of the values permitted for _publ_related_role might be:

  relchem   'related chemical compound'
  relpub    'related publication'
  relseq    'peptide or nucleotide sequence of related structure'
  relsfac   'structure factors of related structure'
  relstruct 'related structural model'
  seq       'peptide or nucleotide sequence for a structure in this publication'
  sfac      'structure factors for a structure in this publication'
  struct    'structural model in this publication'
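
For illustration, the _publ_related_ loop above could be read and checked
against these role codes roughly as follows. This is a Python sketch
under stated assumptions: shlex is used as a stand-in tokenizer for the
quoted 'citrate complex' value (real CIF quoting rules differ), both CIF
null markers "." and "?" are collapsed to None, and the function name
parse_related_rows is hypothetical.

```python
import shlex

# Tag names as listed in the proposed loop.
PUBL_RELATED_TAGS = [
    "_publ_related_id",
    "_publ_related_citation_id",
    "_publ_related_publisher",
    "_publ_related_link_identifier",
    "_publ_related_link_identifier_type",
    "_publ_related_role",
    "_publ_related_details",
]

# Role codes copied from the list of example values.
PUBL_ROLES = {
    "relchem", "relpub", "relseq", "relsfac",
    "relstruct", "seq", "sfac", "struct",
}

# Three rows taken from the example loop.
ROWS = """\
1  .  pdb        2zse                    refcode  struct      .
5  .  pdb        2zs7                    refcode  relstruct   'citrate complex'
8  1  ?          10.1074/jbc.C500044200  doi      relpub      .
"""


def parse_related_rows(text):
    """Tokenize each row, honouring quotes, and map "." and "?" to None."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        values = shlex.split(line)  # assumption: shell-style quoting is close enough
        if len(values) != len(PUBL_RELATED_TAGS):
            raise ValueError(f"wrong number of values in row: {line!r}")
        rows.append({
            tag: (None if value in (".", "?") else value)
            for tag, value in zip(PUBL_RELATED_TAGS, values)
        })
    return rows


related = parse_related_rows(ROWS)
bad = [r["_publ_related_id"] for r in related
       if r["_publ_related_role"] not in PUBL_ROLES]
print(bad)
```

Collapsing "." and "?" loses the distinction between "inapplicable" and
"unknown", which a real implementation would probably want to keep.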

==============================================================================

Comments welcome. So, it's a more complex scheme than Matt suggested
for the supporting materials for the published article, in order to
express the relationships between the document components; and we
started to work on a parallel scheme for derivative or related data
sets. However, the reason we didn't work this up further was the
sense that locating and annotating such information was likely to
be something that authors would not do reliably (or at all). Further,
in many cases an identifier for a related data set might not be
available when the article is submitted or published, and it's not
feasible for the journal to locate such information subsequently.
In the end, we believed that so rich an annotation would be
unworkable; but we're willing to revisit the topic when we're
convinced that we could capture a significant amount of such
information at a sufficiently early stage.

However, we would be interested in your thoughts as to the categorisation of
relationships between related and derivative data sets as suggested above.
In particular, if you're aware of any widespread declarative ontologies
(such as the "CiTO" ontology for expressing relationships between cited
publications) that could be mined for such relationships, I'd be very
interested.

Regards
Brian
