
Re: [ddlm-group] New syntax: 'marker' characters

James Hester wrote:
> See comments below:
> 
> On Sat, Oct 31, 2009 at 10:21 AM, Joe Krahn <krahn@niehs.nih.gov> wrote:
>> IMHO, additional dictionary meta-data makes much more sense than
>> complicating the CIF format. Even if you have markers within a file, you
>> would still want to specify the proper order defined in a dictionary. If
>> you want order to be significant in the absence of a dictionary, you can
>> always use the input order as the preferred order for writing data back out.
> 
> Absolutely agree with all these statements.
> 
>> Of course, that does not cover the idea of faster parsing by being able
>> to skip over blocks of data. However, parsing really should not take
>> that long. I wrote a simple Tcl script that can delimit data blocks in
>> the entire 91M PDB components.cif in a few seconds using regexp. If your
>> files are really huge (i.e. imgCIF) and speed is important, then it
>> makes more sense to create a separate data-block index file. With
>> markers, you still have to actually read through data from the disk to
>> search for markers. An index file allows you to seek directly to the
>> desired data.
> 
> Did your Tcl script assume that the character sequence 'data_' did not
> appear in comments or inside strings?  If it did assume this (which I
> suspect), you were using the character sequence 'data_' exactly as I
> envision using a marker character.  The difference being that
> syntactically the character sequence 'data_' is allowed to appear in
> comments or datavalues, so a general-purpose application can't do a
> simple search for 'data_'.  Regarding your comment about still having
> to actually read through the file for markers, your example of
> delimiting data blocks illustrates my point: if you are only reading,
> rather than parsing, you can go pretty fast.
> 
> Of course, if your Tcl script actually parsed on the fly, then there
> is no real problem to be solved in the first place and no point in
> pursuing this marker idea.
> 
I use a regexp that properly handles all comments and quoting types in 
CIF1, so it does not just search for the 'data_' sub-string.

I am actually surprised that it could parse so quickly. However, this 
requires parsing characters without tokenizing them; a data block is a 
single regexp. A normal CIF parser may not be designed to parse without 
tokenizing and storing values, so it may take some redesign to get the 
same performance even in a compiled program.

Of course, there is no reason why a given CIF implementation could not 
use comments as hints for faster parsing. Even granting the above argument 
that fast parsing is possible, reading a large network-mounted file could 
be slow simply because the intervening data has to be read. However, putting 
the hints in comments means that they do not need to be part of the CIF spec.
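
The separate index file suggested earlier in the thread addresses the same
network-read cost without touching the CIF syntax. A minimal sketch,
assuming the index is a JSON file mapping each block name to a
[start, end) byte range; the JSON layout and the function names
write_index and read_block are hypothetical, and the byte ranges are
assumed to come from a prior scan of the file.

```python
import json


def write_index(index_path, spans):
    """Save a mapping of block name -> [start_byte, end_byte)."""
    with open(index_path, "w", encoding="utf-8") as fh:
        json.dump(spans, fh)


def read_block(cif_path, index_path, name):
    """Seek directly to one data block instead of reading what precedes it."""
    with open(index_path, encoding="utf-8") as fh:
        start, end = json.load(fh)[name]
    with open(cif_path, "rb") as fh:
        fh.seek(start)
        return fh.read(end - start)
```

The index has to be regenerated whenever the CIF file changes, which is the
main cost of keeping it outside the file.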

Ordering hints are a bit different, because they affect more than just 
comments. Currently, order is supposed to be irrelevant, so you could 
claim that it is also just a performance hint. I have always thought 
that a canonical order is useful. Most CIF software writes out "pretty" 
formatted text because organization is useful when being viewed by a 
human. Herbert's suggestion is to make preferred ordering an integral 
part of the DDL, but avoid incorporating ordering rules into the CIF 
syntax. Then, people who want ordering rules can use them, but it avoids 
complicating the CIF spec.
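
As a rough sketch of how dictionary-defined ordering could work on output:
items listed in the dictionary's preferred order are written first, and
anything the dictionary does not mention keeps its input order. The
function name order_items and the example data names are hypothetical;
how a DDL would actually express the preferred order is not specified here.

```python
def order_items(names, preferred):
    """Sort data names by a preferred order; unknown names keep input order."""
    rank = {name: i for i, name in enumerate(preferred)}
    # sorted() is stable, so names sharing the fallback rank stay in the
    # order they were read.
    return sorted(names, key=lambda n: rank.get(n, len(rank)))


ordered = order_items(
    ["_cell.b", "_extra", "_cell.a"],
    preferred=["_cell.a", "_cell.b"],
)
```

With an empty preferred list the input order is returned unchanged, which
matches the fallback of writing data back out in the order it was read.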

Joe Krahn
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
