(37) Length of data names in mmCIF; 'include' preprocessor directive

To: [email protected]
Subject: (37) Length of data names in mmCIF; 'include' preprocessor directive
From: bm
Date: Mon, 28 Aug 1995 14:39:55 +0100
Dear Colleagues

I have a couple of points to raise before disappearing to yet another
meeting. I shall endeavour to pull together all the currently active
strands upon my return.

May I just remark on how much I have enjoyed seeing so many of you again,
in one location or other, in the recent past!

The Executive Committee has approved full membership status for Gotzon
Madariaga and Hans Wondratschek, and I welcome them both to their new
positions of weighty responsibility.


The release of the mmCIF dictionary has taken a major step forward with the
establishment of a WWW home page at Rutgers University
(http://ndbserver.rutgers.edu/mmcif/index.html) that contains the latest
version of the dictionary (frequently updated to reflect minor revisions)
and links to a public discussion list. This represents an active phase of
community review, and there has already been some discussion on the list.


D37.1  Lengths of data names
============================
One of the first postings to the mmCIF discussion list posed the following
question:

> In the original CIF standard, data names are restricted to 32 characters,
> but many of the names in mmCIF are longer than this. The longest ones are
> 45 characters, and are in the STRUCT_SHEET_HBOND category. The absolute
> limit on data name lengths is 80 characters (not 75, since the dictionary
> itself is not a CIF, and so its save frame headers are not subject to
> CIF's 80 character limit for a line). 
> 
> Does COMCIFS have a policy to restrict the lengths of data names to less
> than 80 characters? This is important, because of the freedom to use local
> data names. Even if software ignores data names which are not in the
> standard dictionary, it will, in general, still be necessary to read the
> name into a character array during the lexical analysis phase of 
> processing the file.

This comes from Peter Keller at Bath, who is writing i/o routines for the
Fortran-based CCP4 protein structure program suite. The question suggests
three points:

(1) Should the dataname length limit be relaxed?

In practice, it already has been for the current mmCIF draft. I suspect there
would be strenuous opposition to suggestions that some data names need to be
shortened by up to 18 characters, and in any case the expanded naming
conventions which always reflect the category and individual names do not
lend themselves to greater terseness.

(2) Should there be different limits between CIFs conforming to DDL1.4
    dictionaries and those using DDL2?

Some of the oldest and most straightforward CIF software, like QUASAR, will
break on datanames > 32 characters. But otherwise there is no reason why
QUASAR should not be used on mmCIF files. So should a new length limit
apply to both old and new types of CIF?

(3) What should the new limit(s) be?

Paula is happy with 45 (the longest existing name!), from which we may
infer that she has no deeper reason for choosing any particular figure.
Peter Keller's suggestion of 80 maintains compatibility with the existing
line length limit, though one could well argue for 75, to allow the
save_ frame declarations for each definition to exist within 80 characters.
And although it may be, as Peter argues, that dictionary files don't have
to conform to 80-character limits, much effort has been put into ensuring
that the current dictionary files do so conform, and thus we might wish to
adopt the 80-character limit as a binding convention even for dictionaries.
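
As a rough illustration of the candidate limits discussed above, here is a
minimal Python sketch. It is not taken from any existing CIF software; the
function names and the sample data names are invented for illustration, and
it assumes that a name's length counts its leading underscore. It reports
which names exceed a given limit, and shows the arithmetic behind the figure
of 75: a save frame header is the data name prefixed by the five characters
"save_", so a 75-character name gives an 80-character header line.

```python
# Candidate limits mentioned in the discussion above.
CANDIDATE_LIMITS = (32, 45, 75, 80)
LINE_LIMIT = 80          # the existing CIF line length limit
SAVE_PREFIX = "save_"    # prefix of a save frame header in a dictionary


def names_over_limit(names, limit):
    """Return the data names longer than `limit` characters."""
    return [n for n in names if len(n) > limit]


def max_name_length_for_save_frame(line_limit=LINE_LIMIT):
    """Longest data name whose save frame header fits on one line."""
    return line_limit - len(SAVE_PREFIX)


if __name__ == "__main__":
    # Hypothetical names, invented only to exercise the checks.
    sample = ["_cell_length_a", "_" + "x" * 44, "_" + "y" * 79]
    for limit in CANDIDATE_LIMITS:
        too_long = names_over_limit(sample, limit)
        print(limit, [len(n) for n in too_long])
    print(max_name_length_for_save_frame())
```

A program with a fixed input buffer, as in the Fortran case, would simply
size that buffer to whichever limit is finally agreed.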

This last point also raises the question, of course, of whether we should
relax (or jettison) the 80-character limit. While many of the modern school
of programmers are happy with dynamic memory allocation for strings, this
approach still causes problems in much Fortran programming (and doubtless
in other languages) where one needs at least some idea of what space to
reserve as an input buffer.

Or is the answer to write a dynamic memory allocator for input strings in
Fortran? (Is such a thing possible?)


D37.2 Return of the "include" directive
=======================================
We had extensive discussions some time ago (see, for instance, mailing 18)
on mechanisms for including other STAR-compliant files within the current
one. There were various reasons why one might wish to do this: the
best-defined was Syd's proposal to acquire more primitive dictionaries
through inclusions in the highest-level dictionary to which a CIF
conformed. For a while, indeed, these ideas were encapsulated in the
evolving DDL1 specifications (see D25.6(c), for instance). However, we then
backed away from the idea to allow further reconsideration and the
implementation of trial applications that might use this feature.

Nick Spadaccini has now put together such an implementation, which he
describes below. The source is not yet available for ftp, but I have asked
him to mail details of how to acquire the material as soon as it is
available.

N> ##################################################
N> #                                                #
N> #     The proposed INCLUDE feature in STAR       #
N> #                                                #
N> ##################################################
N> 
N> Some time ago it was proposed that a desirable feature within STAR would be
N> the ability to "include" files from anywhere as required. I don't recall 
N> who argued for its inclusion but it seemed popular at the time though I 
N> don't believe anyone sat down and thought about its syntax or the semantics. 
N> More importantly its interaction with global_ blocks was not considered in 
N> detail. I have "hacked" an implementation to the "include" feature; the
N> syntax, semantics and my reasoning I describe below. Firstly some general 
N> comments.
N> 
N> I believe this include feature will be little used simply 
N> because of the way people prefer to maintain their data and dictionaries. 
N> Furthermore there will be a strong onus on the user to know what they 
N> are doing and to have a detailed knowledge of the files they are
N> including. The reason is that the other *improvement*, global_, will 
N> cause chaos because of the scope of global definitions, i.e. from the point 
N> of declaration in the file down to the EOF. That is, a global in an 
N> included file may over-ride any implied meaning you have in files/data 
N> lower down.
N> 
N> The syntax
N> ----------
N> There are 3 ways to go in general.
N> 
N> (1) is to have a "special" dataname _include, which is used in the 
N> following manner
N> 
N> _include "/a/b/c/whatever.cif"
N> 
N> The problem here is that it ties a meaning to a data name which is not 
N> and should not be done in general. How data names are interpreted is 
N> application/discipline dependent.
N> 
N> (2) the more obvious way is to make include_ a new keyword in STAR. OK, 
N> then to be consistent with data block and save block names the syntax 
N> should be
N> 
N> include_"/a/b/c/whatever.cif"       or     include_/a/b/c/whatever.cif 
N> 
N> A legitimate but inelegant solution in my opinion.
N> 
N> (3) The third view is that "include" has got nothing to do with STAR, but 
N> its result should be STAR conformant. This is the pre-processor view 
N> point one sees and uses in C compilers. This is the implementation I have 
N> followed. The syntax is
N> 
N> #%include "URL or filename"
N> 
N> The semantics
N> -------------
N> The semantics are identical to "cpp". I expand each include as detected, 
N> and pass control to the new included file, before returning back to the 
N> original file. It is a simple recursive implementation in other words.
N> 
N> The results
N> -----------
N> The resultant *complete* file will be STAR conformant provided the 
N> individual files were in the first place. The files one includes can be
N> local or anywhere accessible on the web provided you have the URL. In my
N> test I created a CIF file from the core dictionary, located at Chester,
N> the ddl dictionary located in Perth, as well as sundry junk files off the 
N> web. The resultant file was the complete composite, the parts of which 
N> were obtained by either the ftp or http protocols. The resultant was 
N> STAR compliant but meaningless as data.
N> 
N> How it is done
N> --------------
N> It is a simple hack of the cpp pre-processor freely available, plus a 
N> system call to a line mode web browser (lynx in particular, but it will 
N> work with the www application). This will require you to have lynx/www in 
N> place on your system. I am NOT prepared to build the application from 
N> scratch given my reservations about its usefulness. However I will be 
N> happy to do so once convinced it is a widely used feature.
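
The recursive expansion described under "The semantics" might be sketched in
Python roughly as follows. This is not Nick's implementation (which is a
modified cpp calling lynx) but a minimal illustration under stated
assumptions: it handles local files only, not URLs; it resolves relative
names against the directory of the including file, which the message does
not specify; and it adds a guard against circular includes. The function
name expand_includes is hypothetical.

```python
import re
from pathlib import Path

# Matches the proposed directive:  #%include "filename"
INCLUDE_RE = re.compile(r'^#%include\s+"([^"]+)"\s*$')


def expand_includes(path, _active=()):
    """Return the lines of `path`, with each #%include line replaced by
    the (recursively expanded) lines of the file it names."""
    path = Path(path).resolve()
    if path in _active:
        # Guard added for this sketch; not part of the proposal.
        raise ValueError(f"circular include: {path}")
    out = []
    for line in path.read_text().splitlines():
        match = INCLUDE_RE.match(line)
        if match:
            # Assumption: relative names are taken relative to the
            # directory of the including file.
            target = path.parent / match.group(1)
            out.extend(expand_includes(target, _active + (path,)))
        else:
            out.append(line)
    return out
```

Because the directive begins with "#", a parser that knows nothing of the
feature sees it as an ordinary comment, which is what allows the expansion
to be treated as a separate pre-processing step.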

Note especially the following point:

N> The caveat
N> ----------
N> Anybody who builds up a file from widely distributed parts had better 
N> know what they are doing, and had better have a VERY good idea what is 
N> contained in those files. Any global_ or missing data_ block heading will 
N> completely change the meaning of the composite file.
N> 
N> What you can do
N> ---------------
N> Think about how you might want to use it, use it or tell others to use it.
N> Does COMCIFS consider this an "important" feature of CIF files? To the 
N> best of my knowledge it is not formally specified in STAR yet, and should 
N> it be? If so, is the form I have cast it in acceptable?
N> 
N> Is it available
N> ---------------
N> Yes, I will set up an ftp site shortly
N> 
N> cheers to all
N> 
N> Nick
N> ------------------------------------------------
N> Dr Nick Spadaccini
N> 
N> Email:  Either of
N>         [email protected] (my current address in England)
N>         [email protected] (my alternative address in OZ)
N> 
N> Glaxo Wellcome Medicines Research Centre
N> Gunnels Wood Road
N> Stevenage HERTS. SG1 2NY 
N> United Kingdom
N> 
N> Voice Work:  (01438) 76 3338
N> Facsimile:   (01438) 76 4918
N> Voice Home:  (01438) 36 4879
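
The caveat in Nick's message suggests a check one might run on a file
before including it. The Python sketch below is only an illustration: the
function name include_hazards and the wording of the warnings are invented,
and the scan is deliberately naive (it looks only at the first token of each
line and does not recognise multi-line text fields, so it can report false
alarms).

```python
def include_hazards(lines):
    """Return a list of warnings for a file given as a list of lines."""
    warnings = []
    first_token = None
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        token = stripped.split()[0].lower()
        if first_token is None:
            first_token = token
        if token == "global_":
            warnings.append("contains a global_ block")
            break
    # A file whose first item is neither data_ nor global_ would attach
    # its contents to whatever block precedes the include.
    if first_token is not None and not first_token.startswith(
        ("data_", "global_")
    ):
        warnings.append("does not begin with a data_ block heading")
    return warnings
```

For example, include_hazards(["data_x", "global_", "_a 1"]) returns
["contains a global_ block"].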

Feedback on these ideas will be very welcome.

Regards
Brian