Re: [ddlm-group] New syntax: 'marker' characters
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] New syntax: 'marker' characters
- From: Joe Krahn <krahn@niehs.nih.gov>
- Date: Wed, 04 Nov 2009 13:53:56 -0500
- In-Reply-To: <279aad2a0911032016l7628a9a7paa0d6d0324b38c27@mail.gmail.com>
- References: <279aad2a0910281823tafd2e31o46e93a68e03a4c89@mail.gmail.com> <4AE9BA8F.8090405@mcmaster.ca> <20091029130659.X67614@epsilon.pair.com> <4AEB74F4.3070405@niehs.nih.gov> <279aad2a0911032016l7628a9a7paa0d6d0324b38c27@mail.gmail.com>
James Hester wrote:
> See comments below:
>
> On Sat, Oct 31, 2009 at 10:21 AM, Joe Krahn <krahn@niehs.nih.gov> wrote:
>> IMHO, additional dictionary meta-data makes much more sense that
>> complicating the CIF format. Even if you have markers within a file, you
>> would still want to specify the proper order defined in a dictionary. If
>> you want order to be significant in the absence of a dictionary, you can
>> always use the input order as the preferred order for writing data back out.
>
> Absolutely agree with all these statements.
>
>> Of course, that does not cover the idea of faster parsing by being able
>> to skip over blocks of data. However, parsing really should not take
>> that long. I wrote a simple Tcl script that can delimit data blocks in
>> the entire 91M PDB components.cif in a few seconds using regexp. If your
>> files are really huge (i.e. imgCIF) and speed is important, then it
>> makes more sense to create a separate data-block index file. With
>> markers, you still have to actually read through data from the disk to
>> search for markers. An index file allows you to seek directly to the
>> desired data.
>
> Did your Tcl script assume that the character sequence 'data_' did not
> appear in comments or inside strings? If it did assume this (which I
> suspect), you were using the character sequence 'data_' exactly as I
> envision using a marker character. The difference being that
> syntactically the character sequence 'data_' is allowed to appear in
> comments or datavalues, so a general-purpose application can't do a
> simple search for 'data_'. Regarding your comment about still having
> to actually read through the file for markers, your example of
> delimiting data blocks illustrates my point: if you are only reading,
> rather than parsing, you can go pretty fast.
>
> Of course, if your Tcl script actually parsed on the fly, then there
> is no real problem to be solved in the first place and no point in
> pursuing this marker idea.
>

I use a regexp that properly handles all comments and quoting types in
CIF1, so it does not just search for the 'data_' sub-string. I am
actually surprised that it could parse so quickly. However, this
requires parsing characters without tokenizing them; a data block is
matched by a single regexp. A normal CIF parser may not be designed to
parse without tokenizing and storing values, so it may take some
redesign to get the same performance, even in a compiled program.

Of course, there is no reason why a given CIF implementation could not
use comments as hints for faster parsing. Even granting the above
argument that fast parsing is possible, a large network-mounted file
could still be slow simply because the intervening file data has to be
read. However, putting the hints in comments means that they do not
need to be part of the CIF spec.

Ordering hints are a bit different, because they affect more than just
comments. Currently, order is supposed to be irrelevant, so you could
claim that it is also just a performance hint. I have always thought
that a canonical order is useful. Most CIF software writes out "pretty"
formatted text because organization is useful when the file is viewed
by a human.

Herbert's suggestion is to make preferred ordering an integral part of
the DDL, but avoid incorporating ordering rules into the CIF syntax.
Then, people who want ordering rules can use them, but this avoids
complicating the CIF spec.

Joe Krahn
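
To make the two ideas discussed above concrete (finding 'data_' headers
while skipping comments and quoted values, and keeping a data-block
index so a reader can seek straight to a block), here is a minimal
Python sketch. It is not the Tcl script mentioned in the message, which
is not shown in this thread. It uses a token-level regexp rather than
one regexp per data block, and the lexical rules it encodes are a
simplified reading of CIF1 that is assumed here: '#' comments,
single-line quoted strings closed by a quote followed by whitespace,
and semicolon text fields opened and closed by ';' in the first column
of a line. The names TOKEN, index_data_blocks and read_block are
hypothetical, and files with bare-CR line endings are not handled.

```python
import re

# Token-level scanner under ASSUMED, simplified CIF1 lexical rules.
# finditer() only starts a match at the first non-blank character after
# the previous token, so every alternative is tried at a token boundary.
TOKEN = re.compile(
    rb"""
      ^;(?s:.*?)^;                 # semicolon text field (may span lines)
    | \#[^\r\n]*                   # comment, runs to end of line
    | '[^\r\n]*?'(?=\s|\Z)         # single-quoted string
    | "[^\r\n]*?"(?=\s|\Z)         # double-quoted string
    | (?P<block>data_\S+)          # data block header
    | \S+                          # any other token (tags, values, loop_)
    """,
    re.VERBOSE | re.MULTILINE | re.IGNORECASE,
)


def index_data_blocks(data: bytes):
    """Return [(name, start, end)] byte ranges, one per data block.

    A block is taken to run from its 'data_' header up to the next
    header, or to the end of the input for the last block.
    """
    starts = [
        (m.group("block")[5:].decode("ascii", "replace"), m.start())
        for m in TOKEN.finditer(data)
        if m.group("block")
    ]
    ends = [start for _, start in starts[1:]] + [len(data)]
    return [(name, start, end) for (name, start), end in zip(starts, ends)]


def read_block(path, start, end):
    """Read one block by seeking to it, without scanning earlier data."""
    with open(path, "rb") as fh:
        fh.seek(start)
        return fh.read(end - start)
```

The list returned by index_data_blocks could be written to a side file
once and reused, so that later readers call read_block with the stored
offsets instead of rescanning the CIF. The layout of such an index file
is left open here, since the thread does not specify one.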