[ddlm-group] Ordering in CIFs
- To: Nick.Spadaccini@uwa.edu.au, Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: [ddlm-group] Ordering in CIFs
- From: James Hester <jamesrhester@gmail.com>
- Date: Mon, 9 Nov 2009 12:08:20 +1100
I've put this in a separate thread, which I think the discussion of ordering deserves.

Ordering in CIF files
=====================

Some overlong ruminations follow regarding ordering in CIF files. Executive summary: if we want to create DDL attributes relating to ordering, then ordering in the whole file only becomes semantically meaningless at the stage where a dictionary is applied.

A number of proposals are starting to appear which in one way or another deal with ordering in CIF files. For clarity in discussing these proposals, we should distinguish between semantically meaningless ordering, which would include anything done purely for presentational purposes, and semantically meaningful ordering. An example of the latter would be where the ordering relates to a sequence of commands to be executed, or a sequence of residues in a protein chain.

I wish to discuss here semantically meaningful order, that is, a situation where a change in the order will potentially change the meaning. It follows that if a DDL attribute controls this order, a conforming data file must follow this DDL requirement. Contrast this with meaningless order, where DDL attributes need not be conformed to, such as Herb's PRESENTATION category.

Nick asserts that order is not a lexical issue, which I take to mean that we should confine our ordering shenanigans to the DDL. However, as soon as ordering becomes significant at the DDL level, it feeds back into the parser: the parser can no longer blindly discard or change the ordering it finds in a CIF file, and this has subtle implications.

CIF file input up until now could be separated into two logical stages. The result of parsing is the "infoset", which in the CIF case is roughly a set of sets of sets of key-value pairs, where both key and value are the strings as parsed from the file. The infoset represents precisely the information that can be derived from the syntax alone.
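To make the "set of sets of sets" picture concrete, here is a minimal Python sketch. It is an illustration only, not how any existing CIF reader is implemented: the helper functions and the datanames `_step_id` and `_step_cmd` are hypothetical, and data block names are left out for brevity. It shows that two files differing only in the order of their loop packets yield the same infoset under this model.

```python
# Hypothetical model of the CIF1 "infoset"; none of this is a real CIF API.

def packet(**items):
    """A loop packet: an unordered set of (dataname, string value) pairs."""
    return frozenset(items.items())

def block(*packets):
    """A data block, modelled (roughly) as an unordered set of packets."""
    return frozenset(packets)

def infoset(*blocks):
    """A whole file: an unordered set of data blocks."""
    return frozenset(blocks)

# The same two packets, written in opposite order in two files.
file_a = infoset(block(packet(_step_id="1", _step_cmd="scale"),
                       packet(_step_id="2", _step_cmd="merge")))
file_b = infoset(block(packet(_step_id="2", _step_cmd="merge"),
                       packet(_step_id="1", _step_cmd="scale")))

# With sets, packet order is simply not part of the information kept.
same_infoset = (file_a == file_b)
```

If the packets were a sequence of commands, the two files would mean different things, yet a set-based infoset cannot tell them apart.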
At the next stage, the dictionary DDL information is applied to the "infoset" to produce dictionary-specific datastructures. These two parsing stages correspond to two levels of semantics: implicit semantics arising purely from syntax, and explicit semantics arising from DDL attributes. Note that, even if your application applies dictionary information as soon as the dataname is read in, the process can still be modelled in this way, as the behaviour of the parser is not affected by the dictionary information.

In a formal sense this model now has to change: if there are ordering requirements for loop packets, the parser must preserve the ordering information when constructing that part of the infoset. Therefore parts of the infoset change based on what is in the DDL dictionary, and so it is no longer exclusively representative of the syntax-derived semantics. If we want to preserve the division between syntax-derived and DDL-derived semantics, we can either

(i) redefine the "infoset" as now being a "list of lists of lists of key-value pairs", and then at the DDL level we say that "order is not important, except for the following cases", or

(ii) define new syntax, such as 'loop_ordered' instead of 'loop_'.

To tease out how important or otherwise this really is, we need to consider the implications for a CIF reader. I'm only going to discuss generic CIF readers here, such as ciftbx or PyCIFRW; readers that are tailored for a particular dictionary at software construction time are not as relevant, although keeping clean divisions between the two semantic levels is probably helpful for them as well. A generic CIF reader does not a priori know which DDL dictionaries a given CIF data file conforms to, and therefore must potentially parse the entire file in order to find the _audit data items.
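A rough Python sketch of option (i) as a generic reader might see it: the parse always keeps order, and only the dictionary stage decides where that order means anything. Everything here is an assumption made for illustration: the hand-written `ordered_infoset` stands in for real parser output, the `ORDER_SENSITIVE` table stands in for DDL attributes, the dictionary name `example_dict` and the `_step_*` datanames are invented, and `_audit_conform_dict_name` is used as the assumed _audit item that names the dictionary.

```python
# Stage 1 output (hypothetical): list of blocks -> list of loops -> list of
# packets, each packet a list of (dataname, value) pairs. Order is kept.
ordered_infoset = [
    [  # one data block
        [[("_audit_conform_dict_name", "example_dict")]],
        [
            [("_step_id", "1"), ("_step_cmd", "scale")],
            [("_step_id", "2"), ("_step_cmd", "merge")],
        ],
    ],
]

# Stand-in for DDL information: per dictionary, the datanames whose
# loop packet order is semantically meaningful.
ORDER_SENSITIVE = {"example_dict": {"_step_id"}}

def find_dictionary(blk):
    """Scan the whole block for the item naming the dictionary."""
    for loop in blk:
        for pkt in loop:
            for name, value in pkt:
                if name == "_audit_conform_dict_name":
                    return value
    return None

def apply_dictionary(blk):
    """Stage 2: keep packet order only where the dictionary requires it."""
    sensitive = ORDER_SENSITIVE.get(find_dictionary(blk), set())
    result = []
    for loop in blk:
        names = {name for pkt in loop for name, _ in pkt}
        if names & sensitive:
            # Order matters: keep the packets as a list.
            result.append([dict(pkt) for pkt in loop])
        else:
            # Order is meaningless: collapse the packets to a set.
            result.append({frozenset(pkt) for pkt in loop})
    return result

structures = [apply_dictionary(blk) for blk in ordered_infoset]
```

The point of the sketch is that `apply_dictionary` can only make its choice because stage 1 did not throw the order away; had the parser produced sets, the order-sensitive loop could not be recovered.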
Under CIF1, a CIF reader can parse a CIF into a datastructure isomorphous to the infoset, confident that all syntax-derived semantic information of relevance to any DDL dictionary has been preserved. I say "can parse", not "will", because, while a set structure is a more fundamental mathematical construct than a list, it is the other way around in computing: a set is a list with duplicate removal and order-based operations suppressed. So it is pretty likely that all CIF readers retain ordering information at the end of a parse, even though this is not strictly necessary, simply because extra effort is required to remove the ordering (in most programming languages - although Python recently acquired a 'set' type). That is, choosing option (i) might merely bring us into line with what a programmer is likely to do in any case.

Nevertheless, I think we should all be aware that as soon as we start using DDL attributes to control order in a CIF datablock, we are requiring generic readers to preserve the relevant order during the parse.

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148