(38) Review of status; length of data names

  • To: COMCIFS@iucr.ac.uk
  • Subject: (38) Review of status; length of data names
  • From: bm
  • Date: Tue, 12 Dec 1995 14:58:10 GMT
Dear Colleagues

I'm sorry there has been so long a gap since the last COMCIFS circular. I
hope to start work on the final phase of the extended Core dictionary in
the very near future, and you can expect to see a flurry of short messages
while I do that. In the meantime, this message will review what has
happened since the end of August. As ever, there has been plenty of
movement beneath the surface.

Peter Keller, Nick Spadaccini and I have been refining the STAR description
and BNF, and intend to post this to the mmddl and perhaps other relevant
lists in the near future. This doesn't modify the basic STAR specifications,
but tries to clarify possible ambiguities. I'll let you know when this is
available.

Brian Toby and I have brought the powder dictionary to near-completion; I
would guess that Brian is now awaiting the final version of the extended
core. There is some pressure from ICDD to have this ratified, so I would
expect to see it presented for formal approval to COMCIFS very soon.

There has been extensive debate on the mmcif discussion list at Rutgers,
and the draft available from there is regularly updated. Paula has posted
her intention of moving to Phase 2 of her review procedure, which is to
announce the availability of the dictionary on selected discussion lists
and thereby disseminate it to the wider community. It is important, of
course, that she filter any good ideas arising from this process so that no
major structural changes are made to the dictionary in this revision (bug
fixes and truly brilliant ideas excepted, of course).

There has been some discussion of the best way to 'publish' the
dictionaries in future. Although the electronic version will always be the
most useful, a case can be made for the production of 'authoritative'
print versions as historical references (and as a fallback in case of
accidental or malicious corruption of the master files). Suggestions have
ranged from special issues of JAC, through monographs, to a volume of
International Tables with included CD-ROM. 

Graphics protocols in the CIF environment are now exciting significant
interest. The Computing Commission has undertaken to explore this area, and
Phil Bourne's involvement in a metagraphics dictionary will be a fruitful
starting point for some of these considerations.

A related topic has been the creation of a discussion list on a topic
originally called imgcif, which has now, I think, migrated to ImageNCIF. The idea
here is to have a CIF-like data standard for binary data, particularly
image data (I think it originates with the ESRF group; Andy Hammersley drew
our attention to it in his email to COMCIFS of 29 September). The working
group are fairly strong in their desire to have image data stored in binary
format, which rules out its implementation as a STAR file proper; but they
appear to want to use the CIF paradigm for header information. So, as I
understand it, a so-called imgcif file would look like

   data_image_0029
       _image_title         'Synchrotron Laue image esrf-098098XR'
       _image_width_pixels   12960
       _image_depth_pixels   15648
       _image_compression_scheme   'blah blah blah'
       _image_field_length   20279864
       _image_data
w+fh;auweh8y2! dchtgKth xcdluyqgldjy ZGXnhcvLJZHGClUAHs;diuydq	ipwye...

where the datanames are pure figments of my imagination, but the intention
is to show that the file might have several data blocks (representing
different images); simple ASCII header fields identifying and describing
the included image; and then a field of binary data (which could well
include spaces, semicolons and other special STAR characters). One would
suppose that the end of this data field is deduced from matching a byte count
to a field-length declaration, rather than by token delimiter characters.
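
Purely to make that concrete, here is a rough sketch (in C) of how a reader
might consume such a block: scan the ASCII header for the field-length
declaration, then read exactly that many raw bytes. Like the data names
above, everything here is invented for illustration and implies nothing
about what the working group will actually adopt.

   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>

   int main(void)
   {
       FILE *fp = fopen("example.imgcif", "rb"); /* hypothetical file name */
       char  line[1024];
       char  tag[128];
       long  nbytes = 0;
       long  value;

       if (fp == NULL)
           return 1;

       /* Read the ASCII header one line at a time. */
       while (fgets(line, sizeof line, fp) != NULL) {
           if (sscanf(line, " %127s %ld", tag, &value) == 2 &&
               strcmp(tag, "_image_field_length") == 0)
               nbytes = value;          /* declared length of the image */
           if (sscanf(line, " %127s", tag) == 1 &&
               strcmp(tag, "_image_data") == 0)
               break;                   /* the binary field follows     */
       }

       /* Read exactly the declared number of bytes; no delimiters. */
       if (nbytes > 0) {
           unsigned char *buf = malloc((size_t)nbytes);
           if (buf != NULL &&
               fread(buf, 1, (size_t)nbytes, fp) == (size_t)nbytes) {
               /* ... hand the raw image data to the application ... */
           }
           free(buf);
       }
       fclose(fp);
       return 0;
   }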

So the file format is *not* STAR-compliant (and hence not CIF); but the CIF
style of (tag, value) tuples appeals to the working group, who feel that
this can be treated as an extension to STAR. What are your views on this?

I should say that David Brown, Peter Keller and I have put to the working
group the idea of using an ASCII encoding of the image data and then having
a fully STAR-compliant implementation (sketched at the end of this
paragraph); but this has not had any obvious
success. The main competing standard in this area is called HDF, and it is
something I propose to find out some more about. It has been put to me that
there is pressure to resolve a standard format in short order. If you're
interested in following these discussions further, send mail to
listproc@bnl.gov, with the word "Subscribe" in the Subject: line, and
containing the message body
     subscribe imgCIF-l <your name>
Also, if you're interested in seeing previous messages to this list, I've
placed the discussion so far in the file discuss.img in the comcifs ftp area.
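
For what it's worth, the ASCII-encoding idea amounts to something like the
following: convert the image bytes to, say, hexadecimal text and place them
in an ordinary semicolon-delimited text field, so that the file remains a
legitimate STAR file. This is only a sketch of the principle (the data name
and the choice of hexadecimal are mine, and compression is not addressed):

   #include <stdio.h>

   /* Write image bytes as hexadecimal text inside a standard
      semicolon-delimited CIF text field.                      */
   static void write_hex_field(FILE *out, const unsigned char *buf, long n)
   {
       long i;
       fprintf(out, "    _image_data\n;\n");   /* invented data name */
       for (i = 0; i < n; i++) {
           fprintf(out, "%02X", buf[i]);
           if ((i + 1) % 38 == 0)              /* keep lines short   */
               fputc('\n', out);
       }
       fprintf(out, "\n;\n");
   }

   int main(void)
   {
       unsigned char demo[5] = {0x00, 0x7F, 0xFF, 0x10, 0x20};
       write_hex_field(stdout, demo, 5);
       return 0;
   }

The obvious cost is that the file roughly doubles in size, which is
presumably one reason the working group prefers raw binary.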


Ongoing discussions
===================

D37.1  Lengths of data names
============================

Some comments from Nick Spadaccini (sorry they've been gathering dust for
so long, Nick):

N>> (1) Should the dataname length limit be relaxed?
N> 
N> YES
N> 
N>> (2) Should there be different limits between CIFs conforming to DDL1.4
N>>     dictionaries and those using DDL2?
N> 
N> NO
N> 
N>> (3) What should the new limit(s) be?
N> 
N> Why impose a limit?
N> 
N>> This last point also begs the question, of course, of whether we should
N>> relax (or jettison) the 80-character limit...
N> 
N> I recall having a strenuous argument with nameless people about not 
N> imposing these (rather silly and unnecessary) restrictions on name and 
N> record lengths on CIF files. I lost that argument but now I see it has 
N> come full circle to haunt CIF users again. If Paula would be happy with 
N> 45 characters, what happens to the first person who wants 46? These 
N> decisions were made not on the basis of available "technology" but more 
N> on people's "habits". There is no need to take such a restrictive view 
N> of what to expect in a file. Sure, programming in C makes this 
N> "open-ended" view of strings easier, but the same "philosophy" can be 
N> adopted in Fortran - it was in Xtal's QX array and even in the precursor
N> XRAY system.
N> 
N>> ... programmers are happy with dynamic memory allocation for strings, this
N>> approach still causes problems in much Fortran programming (and doubtless
N>> in other languages) where one needs at least some idea of what space to
N>> reserve as an input buffer.
N> 
N> There are two problems here. An input buffer should be set to a 
N> largish number (normally to BUFSIZ as defined in stdio.h, typically 
N> 1024 bytes). You read in your line records until a newline or EOF or 
N> until you have filled the buffer (this can be done in Fortran), deal with that 
N> line, then read the next input. I can never see why any character limit on 
N> record length was introduced.
N> 
N> The second is more difficult, given that you want to store certain name strings
N> in an environment where dynamic memory allocation is not allowed. It 
N> seems most people's solution to this one is to allocate (statically) a 
N> fixed length array to hold the string. Then you can have a number of 
N> names so you need a number of these arrays, or more simply a 2D array.
N> I think the solution is one of using the C philosophy in the Fortran 
N> environment, as I explain below.
N> 
N>> Or is the answer to write a dynamic memory allocator for input strings in
N>> Fortran? (Is such a thing possible?)
N> 
N> Fortran compilers are based on static (compile-time) declarations though 
N> the new Fortran 9X standard has some C-isms in it which may support 
N> dynamic memory allocation.
N> 
N> The simplest solution in Fortran is to mimic this dynamism (and this is 
N> not new; it has been used by crystallographers for some time). You simply 
N> allocate statically, at compile time, your own "heap" into which 
N> you will put strings of variable length, e.g.
N> 
N> character strings_array(20000)
N> 
N> This would allow approx 250 datanames of the full 80 chars each (or many 
N> more of typical length). As each string is stored, you pack it in at the 
N> current array index and add a null to terminate, and the next free index 
N> becomes the pointer to the beginning of the next string. You obviously 
N> have to provide support functions for inserting and extracting these 
N> strings from the array, given a pointer (index). But once these are 
N> written they are completely reusable. There can't be many systems left 
N> where memory allocations as large as the above are impossible (are there?).
N> 
N> Now you have to store these pointers (indices) somewhere. In C you can 
N> dynamically grow arrays using realloc, which you can't do in Fortran. So 
N> you would have to statically allocate a large array of integers (one 
N> element for each dataname) to hold these indices into 
N> strings_array(). At least now you have only put an upper limit on how 
N> many data names you can handle, rather than a limit on the number of 
N> characters you can use in each dataname.
N> 
N> BUT I think there is a more important issue at hand .....
N> 
N> The beauty of STAR/CIF/SGML etc *has* been that the dictionaries 
N> themselves are STAR/CIF/SGML. Now we have mmCIF data files (which 
N> violate the CIF standard by the number of characters allowed in a 
N> dataname) defined in a mmCIF dictionary (which is NOT a CIF file itself, 
N> because of the introduction of save-frames). In other words a CIF parser 
N> would fall over on the mmCIF dictionary! I find this a very strange 
N> path to take, seemingly in the opposite direction to most other 
N> standards, which pride themselves on their consistency in being 
N> self-defining. The question is, is it even a STAR file? It has save-frame 
N> definitions which are not dereferenceable because there are no save-frame 
N> pointers in the file! I believe it still constitutes a legitimate STAR 
N> file, but it is certainly not a CIF File. Why?????
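
Before turning to the save-frame question, let me make Nick's earlier
suggestions concrete: a "largish" fixed input buffer (BUFSIZ characters)
for reading records, and one large statically allocated string "heap" with
an integer array of start positions in place of true dynamic allocation.
The sketch below is in C for brevity; a Fortran version would carry the
same bookkeeping across essentially unchanged. All of the names and sizes
are my own choices, purely for illustration.

   #include <stdio.h>
   #include <string.h>

   #define HEAPSIZE 20000         /* statically allocated string pool  */
   #define MAXNAMES 1000          /* upper limit on stored data names  */

   static char heap[HEAPSIZE];    /* Nick's strings_array(20000)       */
   static int  next = 0;          /* first free position in the heap   */
   static int  start[MAXNAMES];   /* start index of each stored name   */
   static int  nnames = 0;

   /* Pack a name into the heap (null-terminated); return its index. */
   static int store_name(const char *s)
   {
       int len = (int)strlen(s) + 1;
       if (next + len > HEAPSIZE || nnames >= MAXNAMES)
           return -1;             /* heap or index array exhausted     */
       memcpy(heap + next, s, (size_t)len);
       start[nnames] = next;
       next += len;
       return nnames++;
   }

   /* Recover a stored name, given its index. */
   static const char *get_name(int i)
   {
       return (i >= 0 && i < nnames) ? heap + start[i] : "";
   }

   int main(void)
   {
       char line[BUFSIZ];         /* the "largish" input buffer        */
       char name[BUFSIZ];
       int  i;

       /* Read records a line at a time; keep any token starting '_'. */
       while (fgets(line, sizeof line, stdin) != NULL)
           if (sscanf(line, " %s", name) == 1 && name[0] == '_')
               store_name(name);

       for (i = 0; i < nnames; i++)
           printf("%4d  %s\n", i, get_name(i));
       return 0;
   }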

I remind you of John Westbrook's explanation of the use of the save_ syntax
as posted in circular 30:

JW> The save_ syntax has been used in order to have a more consistent use of
JW> scope between data files and dictionaries.   Since we are representing
JW> links between data items we are using save frames so that the referenced
JW> data items are all within the scope of the current dictionary.  This is
JW> not the case now where data_ sections are used.  Links between data
JW> blocks really violate the STAR scope rule that requires each data block
JW> to have a separate name space.

John's view on this is that the save_ requirement is forced upon us by the
requirements of a software environment that applies consistent scoping
rules across dictionaries and data files. The inability to reference other
data blocks is seen as a real problem. The save_frame usage is admitted to
be a fudge (or kluge, in technospeak), forced because of the absence of any
other means of referencing a block of information. John suggests that if
the STAR specification is again modified (though this isn't considered a
desirable aim in itself), consideration should be given to introducing a
"definition_" keyword that is available to CIF (or at least to CIF
dictionaries) with the express purpose of differentiating dictionary
definitions from any other data block.
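
Nothing here is proposed syntax, but purely to indicate the flavour of the
idea, a dictionary fragment using such a keyword (by analogy with the
present save_ frames) might read something like

   definition__atom_site.label
       _item.name             '_atom_site.label'
       _item.category_id       atom_site
       # ... remainder of the definition as at present ...
   definition_

with a bare definition_ closing the block, just as a bare save_ does now.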

N> In my mail concerning "include" the following should read ...
N> 
N>> N> I believe this include feature will be little feature simply 
N>                                                    ^^^^^^^
N>                                                      used


Regards
Brian