Discussion List Archives


Re: [ddlm-group] New syntax: 'marker' characters

I assume that these markers would be used only in a few classes of files, such as imgCIF, and that the majority of CIFs would continue as before with no section markers present.  For this reason someone may write software that knows nothing about markers.  Would the presence of a marker cause problems for such a program, since it would appear as an orphan string?  Preceding each marker with # would place it in a comment, ensuring that it is ignored by a parser that is not looking for it.  A parser that is looking for it would still find it, since the marker is a single character that may appear at any suitable point in the file regardless of context.  There would be a danger that the markers could be lost, but this would not invalidate the CIF; it would just slow down its processing.
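
The point about comments can be illustrated with a minimal Python sketch. The marker character (U+25C9) is an arbitrary placeholder, since no marker has been chosen, and naive_tokens is a hypothetical, greatly simplified stand-in for a marker-unaware parser: it only recognises comments that begin a line and ignores all quoting rules.

```python
# Placeholder marker; the proposal has not fixed a character.
SECTION_MARKER = "\u25c9"


def naive_tokens(text):
    """Whitespace-split tokens, skipping lines that begin with '#'.

    A hypothetical marker-unaware reader. Real CIF comment and quoting
    rules are more involved than this.
    """
    tokens = []
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # the whole line is a comment, marker included
        tokens.extend(line.split())
    return tokens


def marker_offsets(text):
    """Offsets of every marker character, found by a raw character scan."""
    return [i for i, ch in enumerate(text) if ch == SECTION_MARKER]


sample = "_a 1\n#" + SECTION_MARKER + "\n_b 2\n"
print(naive_tokens(sample))    # the marker never reaches the old parser
print(marker_offsets(sample))  # a marker-aware reader still locates it
```

The marker-unaware reader sees only the four ordinary tokens, while the raw scan still finds the marker inside the comment.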

David




James Hester wrote:
Dear DDLm-ers:

At the risk of completely overwhelming the group, a relatively radical
proposal follows, perhaps inspired by the fact that we now have
107,000 'text' characters to play with...

James.
---------------------------------------------------------------------------
PROPOSAL: to define 'markers' in CIF2.0 syntax

OUTLINE

I propose that we nominate two different Unicode characters or
sequences as 'markers'.  The first marker, a 'datablock' marker, may
only appear prior to a datablock data_ tag.  The second, a 'section'
marker, may appear anywhere that an unlooped dataname could appear,
and may never appear in a data value, data name, or comment. It is
envisioned that such markers would perform two functions:

(i) to allow division of a datablock into sections (semantic function).
(ii) to allow rapid traversal of a CIF data file (convenience function).

In tandem, a new optional DDLm attribute would be created describing
where in a datablock a given tag could appear: 'beginning', 'middle',
'end' (or unrestricted).  The 'beginning', 'middle' and 'end' sections
would be separated by section markers.  If no markers are present,
these three sections coincide and the current behaviour is retained.
If one marker is present, 'middle' coincides with both 'beginning' and
'end'.
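
The section rules above might be sketched in Python as follows.  This is
an illustration only: the function name and the numbering of physical
segments (0, 1, 2, counted between section markers) are assumptions, and
more than two markers are rejected because finer-grained sectioning is
not part of the proposal.

```python
def allowed_segments(placement, n_markers):
    """Return the physical segments in which a tag may appear.

    placement: 'beginning', 'middle', 'end' or 'unrestricted'
               (the values of the proposed DDLm attribute).
    n_markers: number of section markers in the datablock (0, 1 or 2).
    Segments are numbered from 0 in file order (an assumed convention).
    """
    if n_markers not in (0, 1, 2):
        raise ValueError("this sketch handles at most two section markers")
    last = n_markers
    everywhere = set(range(last + 1))
    if placement == "unrestricted":
        return everywhere
    if placement == "beginning":
        return {0}
    if placement == "end":
        return {last}
    if placement == "middle":
        # With fewer than two markers there is no separate middle segment,
        # so 'middle' coincides with whatever segments exist.
        return {1} if n_markers == 2 else everywhere
    raise ValueError("unknown placement: %r" % (placement,))


for n in (0, 1, 2):
    print(n, {p: sorted(allowed_segments(p, n))
              for p in ("beginning", "middle", "end")})
```

With no markers all three names map to the single segment 0, which is
the current behaviour.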

Although I would be against it, a more fine-grained sectioning could
also be implemented.

(i) Semantic function (applies to section marker only)

The current abstract datastructure that describes a CIF datablock (the
"infoset") is an unordered set containing key-value pairs and loops (*).
While this rigorous lack of canonical order has served us well, there
are certain practical problems presented by large CIF files,
particularly imgCIF.  imgCIF files contain information about the image
geometry in CIF tags. As these image tags could be placed anywhere in
a datablock, it is conceivable that finding them would require parsing
several gigabytes of image data first.  Of course, in practice, imgCIF
files are written with the tags in a header and a single image at the
end, but programs cannot strictly rely on this.

By placing a single 'marker' at the end of the important header
material, the datablock is divided into 'beginning' and 'end', and the
imgCIF dictionary can then specify which tags must be present before
the image is encountered, or alternatively simply specify that the
image must be at the 'end'.  This provides a guarantee that the
important information will be found in the early part of the file.

A further application for all CIFs would be to require the dictionary
conformance information to be at the 'beginning'.  This may streamline
applications which are dictionary-driven, by requiring only one pass
over the datablock, as well as enabling dictionary-specific
datastructures to be prepared for the data before the data are
encountered during parsing.

(ii) Convenience function (datablock, section markers)

As the 'marker' cannot appear in a data value or name, it is possible
for CIF input applications to skip to the next marker and be in a
known parsing state, without going through a complete parse.  Datablocks
and sections within datablocks could be skipped rapidly. This is
also useful for recovery from parsing errors, although I'm not sure
I'd trust a file to get markers right if it hasn't got the rest of the
syntax correct.

Note that without datablock markers, it would not be possible to rely
on section markers in this way, as, instead of skipping to the next
'section', the application may simply skip to a marker in another,
unknown datablock conforming to a different dictionary.  With
datablock markers, the application can keep track of when a datablock
boundary is traversed.
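
A rough Python sketch of this kind of skipping is given below.  The two
marker characters (U+25A0 and U+25C9) are arbitrary placeholders, and
index_markers is a hypothetical helper; the raw character scan is only
safe under the proposal's rule that markers never occur in data values,
data names or comments.

```python
import re

# Placeholder characters; the proposal has not chosen the markers.
DATABLOCK_MARKER = "\u25a0"
SECTION_MARKER = "\u25c9"

_MARKER_RE = re.compile(
    "[" + re.escape(DATABLOCK_MARKER) + re.escape(SECTION_MARKER) + "]"
)


def index_markers(text):
    """Return (block_no, section_no, offset) for each marker in text.

    offset is the position just after the marker, i.e. a point where a
    parser can resume in a known state.  block_no increases at each
    datablock marker, which also resets section_no to 0, so a section
    marker is never attributed to the wrong datablock.
    """
    index = []
    block_no = -1
    section_no = 0
    for match in _MARKER_RE.finditer(text):
        if match.group() == DATABLOCK_MARKER:
            block_no += 1
            section_no = 0
        else:
            section_no += 1
        index.append((block_no, section_no, match.end()))
    return index


sample = (
    DATABLOCK_MARKER + "data_a\n_x 1\n" + SECTION_MARKER + "\n_y 2\n"
    + DATABLOCK_MARKER + "data_b\n_z 3\n"
)
for block_no, section_no, offset in index_markers(sample):
    print(block_no, section_no, repr(sample[offset:offset + 8]))
```

Resetting the section counter at each datablock marker is what lets the
application notice that a datablock boundary has been crossed.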

NOTES:

Effect on infoset: All tags in a datablock still form a single set,
however the objects in the set are now composed of three parts: name,
value and section.  In particular, data names cannot be repeated in
separate sections.
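
As a sketch of this infoset (the function name and the data names are
purely illustrative), each item can be held as a (name, value, section)
triple, with uniqueness enforced on the name alone:

```python
def build_infoset(items):
    """Build a name -> (value, section) mapping from triples.

    items: iterable of (name, value, section) triples.
    A data name may occur only once in the datablock, whichever section
    it is in, so a repeat raises ValueError even across sections.
    """
    infoset = {}
    for name, value, section in items:
        if name in infoset:
            raise ValueError(
                "data name %r repeated (sections %r and %r)"
                % (name, infoset[name][1], section)
            )
        infoset[name] = (value, section)
    return infoset


ok = build_infoset([
    ("_example.header_item", "1", "beginning"),
    ("_example.image_item", "...", "end"),
])
print(ok["_example.header_item"])

try:
    build_infoset([
        ("_example.header_item", "1", "beginning"),
        ("_example.header_item", "2", "end"),
    ])
except ValueError as err:
    print("rejected:", err)
```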

Choice of marker: a number of Unicode code points represent identical
characters (e.g. Greek letters are repeated as mathematical symbols)
so using one of these would not affect our ability to include
arbitrary text.  Or, character U+24A8 is an m in brackets (m) which
looks rather promising, as we can always represent this as '(','m',')'
in ordinary text.  There are also lots of funky geometric shapes in
the 25A0-25FF range (e.g. a solid circle inside another circle).  Or,
for maximum likelihood of proper representation in editors and
browsers, something that is found in Latin-1 might be preferable.
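
Candidate characters can be checked with Python's standard unicodedata
module.  The sketch below looks up two candidates by their Unicode
character names and shows the ordinary text that compatibility
normalisation (NFKD) maps them to; which character to adopt remains an
open choice, and the two used here are only examples.

```python
import unicodedata

# Candidates looked up by official character name, to avoid
# mistyping a code point.
bracketed_m = unicodedata.lookup("PARENTHESIZED LATIN SMALL LETTER M")
math_alpha = unicodedata.lookup("MATHEMATICAL ITALIC SMALL ALPHA")

for ch in (bracketed_m, math_alpha):
    plain = unicodedata.normalize("NFKD", ch)
    print(
        "U+%04X" % ord(ch),
        unicodedata.name(ch),
        "->",
        [unicodedata.name(c) for c in plain],
    )
```

A character whose NFKD form is plain text can be reserved as a marker
without losing the ability to write that text in a data value.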

Escaping: I propose that CIFs within a CIF cannot contain marker
characters.  CIF in CIF is esoteric enough that such applications
should be responsible for inserting markers before datablocks and
within sections, if necessary.  If this proves to be a barrier, we can
define an aliasing mechanism via the DDL dictionary that is specific
to CIF in CIF and does not form part of the CIF syntax.

Section marker at end of datablock: this creates a null 'end' section.

Availability of markers: markers must be used everywhere in a file, or
not at all.  If an application finds a datablock marker before the
first datablock, it will find both datablock and section
markers throughout the file.

Optional for writers: we would need to decide on what to do if no
marker is present in the datablock and the domain dictionary specifies
sections:

(1) state that beginning, middle and end are all the same section and
dictionary requirements are automatically satisfied.  This approach
gives readers a guarantee that they can get the information from the
pre-image part of an arbitrary imgCIF if and only if a marker is
present before the first datablock.

(2) state that only 'middle' is present, therefore any beginning/end
sectioned datanames will be invalid in terms of the dictionary.  This
gives readers a strict guarantee that, providing the imgCIF dictionary
imposes the appropriate conditions, the datablock will be sectioned.
Frankly, this is a worthless guarantee as the dictionary conformance
of a given imgCIF can only be determined during parsing.
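
The difference between the two options can be written down as a small
Python sketch.  The function name and the integer policy argument are
hypothetical, and treating 'unrestricted' tags as valid under option (2)
is an inference rather than something stated above.

```python
def conforms_without_markers(placement, policy):
    """Is a tag with this placement valid in a datablock with no markers?

    placement: 'beginning', 'middle', 'end' or 'unrestricted'.
    policy: 1 or 2, matching options (1) and (2) in the text.
    """
    if policy == 1:
        # All sections coincide, so every placement is satisfied.
        return True
    if policy == 2:
        # Only 'middle' exists; 'unrestricted' is assumed to remain valid.
        return placement in ("middle", "unrestricted")
    raise ValueError("policy must be 1 or 2")


for policy in (1, 2):
    print(policy, {p: conforms_without_markers(p, policy)
                   for p in ("beginning", "middle", "end")})
```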

(*) Even more strictly speaking, the datablock is a set of loops,
where each loop is a set of packets, and each packet is a set of
(dataname, datavalue) pairs.  The set of unlooped datanames in this
model is a packet belonging to a one-packet loop.
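
This stricter model maps directly onto nested unordered sets.  A minimal
Python sketch using frozenset follows; the helper names and the data
names and values are illustrative only.

```python
def packet(pairs):
    """A packet: an unordered set of (dataname, datavalue) pairs."""
    return frozenset(pairs.items())


def loop(*packets):
    """A loop: an unordered set of packets."""
    return frozenset(packets)


def datablock(*loops):
    """A datablock: an unordered set of loops."""
    return frozenset(loops)


# The unlooped data names form the single packet of a one-packet loop.
unlooped = loop(packet({"_cell.length_a": "5.4", "_cell.length_b": "6.1"}))

atoms = loop(
    packet({"_atom_site.label": "C1", "_atom_site.x": "0.1"}),
    packet({"_atom_site.label": "O1", "_atom_site.x": "0.3"}),
)

block = datablock(unlooped, atoms)
print(len(unlooped), len(atoms), len(block))
```

Because every level is a set, two datablocks that differ only in the
order of tags, packets or loops compare equal.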




begin:vcard
fn:I.David Brown
n:Brown;I.David
org:McMaster University;Brockhouse Institute for Materials Research
adr:;;King St. W;Hamilton;Ontario;L8S 4M1;Canada
email;internet:idbrown@mcmaster.ca
title:Professor Emeritus
tel;work:+905 525 9140 x 24710
tel;fax:+905 521 2773
version:2.1
end:vcard
