Re: [ddlm-group] New syntax: 'marker' characters

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] New syntax: 'marker' characters
From: David Brown <[email protected]>
Date: Thu, 29 Oct 2009 11:53:51 -0400
In-Reply-To: <[email protected]>
References: <[email protected]>

I assume that these markers would be used only on a few classes of files such as imgCIF and that the majority of CIFs would continue as before with no section markers present. For this reason someone may write software that knows nothing about markers. Would the presence of a marker cause problems for such a program since it would appear as an orphan string?

Preceding each marker with # would put them in a comment thus ensuring that they were ignored by a parser that was not looking for them, but would still be found by a parser that was looking for them since the marker is a single character that may appear at any suitable point in the file regardless of context. There would be the danger that the markers could be lost, but this would not invalidate the CIF, just slow down its processing.

David

James Hester wrote:

Dear DDLm-ers:

At the risk of completely overwhelming the group, a relatively radical proposal follows, perhaps inspired by the fact that we now have 107,000 'text' characters to play with...

James.

---------------------------------------------------------------------------

PROPOSAL: to define 'markers' in CIF2.0 syntax

OUTLINE

I propose that we nominate two different Unicode characters or sequences as 'markers'. The first marker, a 'datablock' marker, may only appear prior to a datablock data_ tag. The second, a 'section' marker, may appear anywhere that an unlooped dataname could appear, and may never appear in a data value, data name, or comment.

It is envisioned that such markers would perform two functions:

(i) to allow division of a datablock into sections (semantic function).
(ii) to allow rapid traverse of a CIF data file (convenience function).

In tandem, a new optional DDLm attribute would be created describing where in a datablock a given tag could appear: 'beginning', 'middle', 'end' (or unrestricted). The 'beginning', 'middle' and 'end' sections would be separated by section markers.
If no markers are present, these three sections coincide and the current behaviour is retained. If one marker is present, 'middle' coincides with both 'beginning' and 'end'. Although I would be against it, a more fine-grained sectioning could also be implemented.

(i) Semantic function (applies to section marker only)

The current abstract datastructure that describes a CIF datablock (the "infoset") is an unordered set containing key-value pairs and loops (*). While this rigorous lack of canonical order has served us well, there are certain practical problems presented by large CIF files, particularly imgCIF. imgCIF files contain information about the image geometry in CIF tags. As these image tags could be placed anywhere in a datablock, it is conceivable that finding them would require parsing several gigabytes of image data first. Of course, in practice, imgCIF files are written with the tags in a header and a single image at the end, but programs cannot strictly rely on this.

By placing a single 'marker' at the end of the important header material, the datablock is divided into 'beginning' and 'end', and the imgCIF dictionary can then specify which tags must be present before the image is encountered, or alternatively simply specify that the image must be at the 'end'. This provides a guarantee that the important information will be found in the early part of the file.

A further application for all CIFs would be to require the dictionary conformance information to be at the 'beginning'. This may streamline applications which are dictionary-driven, by requiring only one pass over the datablock, as well as enabling dictionary-specific datastructures to be prepared for the data before the data are encountered during parsing.
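
As an illustration only, the sectioning rules above could be sketched in Python as follows. Nothing here comes from the proposal beyond the rules themselves: the marker character, the flat list standing in for a parsed datablock, the function names and the data names are all hypothetical, and the zero-marker case assumes that all three sections coincide.

```python
# Hypothetical marker character; the proposal leaves the actual choice open.
SECTION_MARKER = "\u25c9"

# Sections that each marker-delimited segment is deemed to cover,
# keyed by the number of section markers found in the datablock.
SEGMENT_SECTIONS = {
    0: [{"beginning", "middle", "end"}],             # all three coincide
    1: [{"beginning", "middle"}, {"middle", "end"}],  # 'middle' coincides with both
    2: [{"beginning"}, {"middle"}, {"end"}],
}


def assign_sections(items):
    """Map each data name in a flat item list to the set of sections it occupies."""
    n_markers = sum(1 for item in items if item == SECTION_MARKER)
    if n_markers not in SEGMENT_SECTIONS:
        raise ValueError("this sketch allows at most two section markers")
    layout = SEGMENT_SECTIONS[n_markers]
    sections = {}
    segment = 0
    for item in items:
        if item == SECTION_MARKER:
            segment += 1  # everything after a marker belongs to the next segment
        else:
            sections[item] = layout[segment]
    return sections


def placement_errors(items, required):
    """Return the data names found outside the section the dictionary requires."""
    sections = assign_sections(items)
    return sorted(
        name
        for name, where in required.items()
        if name in sections and where not in sections[name]
    )
```

With a hypothetical dictionary requiring `_geometry_tag` at the 'beginning' and `_image_data` at the 'end', the list `["_geometry_tag", SECTION_MARKER, "_image_data"]` yields no errors, while the reversed order reports both names.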
(ii) Convenience function (datablock, section markers)

As the 'marker' cannot appear in a data value or name, it is possible for CIF input applications to skip to the next marker and be in a known parsing state, without going through a complete parse. Datablocks and sections within datablocks could be skipped rapidly. This is also useful for recovery from parsing errors, although I'm not sure I'd trust a file to get markers right if it hasn't got the rest of the syntax correct.

Note that without datablock markers, it would not be possible to rely on section markers in this way, as, instead of skipping to the next 'section', the application may simply skip to a marker in another, unknown datablock conforming to a different dictionary. With datablock markers, the application can keep track of when a datablock boundary is traversed.

NOTES:

Effect on infoset: All tags in a datablock still form a single set; however, the objects in the set are now composed of three parts: name, value and section. In particular, data names cannot be repeated in separate sections.

Choice of marker: a number of Unicode code points represent identical characters (e.g. Greek letters are repeated as mathematical symbols) so using one of these would not affect our ability to include arbitrary text. Or, character U+24A8 is an m in brackets (m) which looks rather promising, as we can always represent this as '(','m',')' in ordinary text. There are also lots of funky geometric shapes in the 25A0-25FF range (e.g. a solid circle inside another circle). Or, for maximum likelihood of proper representation in editors and browsers, something that is found in Latin-1 might be preferable.

Escaping: I propose that CIFs within a CIF cannot contain marker characters. CIF in CIF is esoteric enough that such applications should be responsible for inserting markers before datablocks and within sections, if necessary.
If this proves to be a barrier, we can define an aliasing mechanism via the DDL dictionary that is specific to CIF in CIF and does not form part of the CIF syntax.

Section marker at end of datablock: creates a null 'end' section.

Availability of markers: markers must be used everywhere in a file, or not at all. If an application finds a datablock marker before the first datablock, it will find both datablock and section markers throughout the file.

Optional for writers: we would need to decide on what to do if no marker is present in the datablock and the domain dictionary specifies sections:

(1) state that beginning, middle and end are all the same section and dictionary requirements are automatically satisfied. This approach gives readers a guarantee that they can get the information from the pre-image part of an arbitrary imgCIF if and only if a marker is present before the first datablock.

(2) state that only 'middle' is present, therefore any beginning/end sectioned datanames will be invalid in terms of the dictionary. This gives readers a strict guarantee that, providing the imgCIF dictionary imposes the appropriate conditions, the datablock will be sectioned. Frankly, this is a worthless guarantee as the dictionary conformance of a given imgCIF can only be determined during parsing.

(*) Even more strictly speaking, the datablock is a set of loops, where each loop is a set of packets, and each packet is a set of (dataname, datavalue) pairs. The set of unlooped datanames in this model is a packet belonging to a one-packet loop.
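
For illustration, the convenience function described in the proposal (skipping to the next marker without a full parse, while noticing datablock boundaries) might look roughly like the Python sketch below. Both marker characters and both function names are hypothetical placeholders, and the sketch relies on the proposal's rule that a marker never occurs inside a data value, data name or comment, so a plain character scan is assumed to be sufficient.

```python
# Hypothetical placeholder characters; the proposal does not fix either choice.
DATABLOCK_MARKER = "\u25a0"
SECTION_MARKER = "\u25c9"


def next_marker(text, pos=0):
    """Return (index, kind) of the first marker at or after pos, or (-1, None)."""
    for i in range(pos, len(text)):
        ch = text[i]
        if ch == DATABLOCK_MARKER:
            return i, "datablock"
        if ch == SECTION_MARKER:
            return i, "section"
    return -1, None


def skip_section(text, pos):
    """Skip from pos to just past the next marker.

    Returns (new_pos, crossed_datablock). crossed_datablock is True when the
    marker reached was a datablock marker, so the caller knows it has left the
    current datablock rather than moved to its next section.
    """
    i, kind = next_marker(text, pos)
    if i < 0:
        return len(text), False  # no further markers: nothing left to skip to
    return i + 1, kind == "datablock"
```

A reader that only wants header tags could call `skip_section` once per unwanted section and stop as soon as the second return value is true, which is the datablock-boundary tracking the proposal asks datablock markers to make possible.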

begin:vcard
fn:I.David Brown
n:Brown;I.David
org:McMaster University;Brockhouse Institute for Materials Research
adr:;;King St. W;Hamilton;Ontario;L8S 4M1;Canada
email;internet:[email protected]
title:Professor Emeritus
tel;work:+905 525 9140 x 24710
tel;fax:+905 521 2773
version:2.1
end:vcard

_______________________________________________
ddlm-group mailing list
[email protected]
http://scripts.iucr.org/mailman/listinfo/ddlm-group

Follow-Ups:

Re: [ddlm-group] New syntax: 'marker' characters (Herbert J. Bernstein)

References:

[ddlm-group] New syntax: 'marker' characters (James Hester)
