Discussion List Archives

Re: [ddlm-group] New syntax: 'marker' characters

IMHO, additional dictionary meta-data makes much more sense than 
complicating the CIF format. Even if you have markers within a file, you 
would still want to specify the proper order defined in a dictionary. If 
you want order to be significant in the absence of a dictionary, you can 
always use the input order as the preferred order for writing data back out.

Of course, that does not cover the idea of faster parsing by being able 
to skip over blocks of data. However, parsing really should not take 
that long. I wrote a simple Tcl script that can delimit data blocks in 
the entire 91M PDB components.cif in a few seconds using regexp. If your 
files are really huge (e.g. imgCIF) and speed is important, then it 
makes more sense to create a separate data-block index file. With 
markers, you still have to actually read through data from the disk to 
search for markers. An index file allows you to seek directly to the 
desired data.
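
A minimal sketch of the index idea, in Python rather than Tcl. The names
build_index and read_header_line are hypothetical, and the header test is
deliberately naive: it treats any line starting with data_ as a block header,
so a line like that inside a multi-line text field would fool it. The index is
kept in memory here; writing it out to a separate file is left as an
assumption about how it would be used in practice.

```python
import io
import re

# Naive header test (an assumption of this sketch): "data_<name>" at line start.
HEADER = re.compile(rb"data_(\S+)", re.IGNORECASE)

def build_index(stream):
    """Scan a binary stream once; map block name -> byte offset of its header."""
    index = {}
    offset = 0
    for line in stream:
        match = HEADER.match(line)
        if match:
            index[match.group(1).decode("ascii", "replace")] = offset
        offset += len(line)  # lines keep their newline, so offsets stay exact
    return index

def read_header_line(stream, index, name):
    """Seek straight to a block using the index, without rescanning the data."""
    stream.seek(index[name])
    return stream.readline()

sample = b"data_one\n_a 1\ndata_two\n_b 2\n"
index = build_index(io.BytesIO(sample))
header = read_header_line(io.BytesIO(sample), index, "two")
```

The scan still reads the whole file once, but only once: later readers can
seek to a block directly instead of searching through the data for it.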

Joe Krahn

Herbert J. Bernstein wrote:
> The idea of markers creates interesting possibilities and problems with 
> respect to ordering.  Isn't the real issue one of relative ordering of 
> presentation of categories, rather than beginning, middle and end?
> We could just as easily have a need to organize the middle in more detail.
> The same issues may also arise within a category.
> 
> How about adding an arbitrary string as an attribute for any item or 
> category giving its suggested sort order, where the sorting would be
> done lexicographically among the strings, with no specified ordering
> among items with the same value.  Note that a blank string comes before
> all other strings.
> 
> If nothing were specified the intention would be to assume the ordering 
> string ".", so that strings beginning with a blank would sort ahead of
> all items with no specified ordering and strings beginning with any
> letter of the alphabet would sort after the unspecified orderings.
> This would give the effect of beginning, middle and end, but allow
> arbitrary insertions into the order.
> 
> Thus  " beginning" would come before the unspecified orderings and
> "zzz_end" would come after all of the unspecified orderings.
> 
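
The ordering rule quoted above can be sketched in a few lines of Python. This
is only an illustration under stated assumptions: the function name
presentation_order and the item names are hypothetical, and plain code-point
comparison stands in for "lexicographic" order. Items with equal ordering
strings keep their input order here, which is one permitted choice since the
proposal leaves that order unspecified.

```python
DEFAULT_ORDERING = "."  # assumed for any item with no ordering string

def presentation_order(items):
    """items: iterable of (name, ordering string or None) pairs."""
    def key(item):
        ordering = item[1]
        return DEFAULT_ORDERING if ordering is None else ordering
    # sorted() is stable, so ties stay in input order.
    return [name for name, _ in sorted(items, key=key)]

# Hypothetical item names, for illustration only.
items = [
    ("_body.first", None),
    ("_image.data", "zzz_end"),
    ("_header.info", " beginning"),
    ("_body.second", None),
]
ordered = presentation_order(items)
```

A leading blank (U+0020) compares below "." (U+002E), and every letter compares
above it, which gives the beginning, middle and end effect described.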
> There may be other presentation issues, so I would suggest starting
> a PRESENTATION category with the tag
> 
>    _presentation.suggested_ordering
> 
> the value of which would be a Text (if we want to allow the maximal
> flexibility) or Code (for simplicity).
> 
> Note that this would be a suggested ordering, not mandatory.
> 
> Regards,
>    Herbert
> =====================================================
>   Herbert J. Bernstein, Professor of Computer Science
>     Dowling College, Kramer Science Center, KSC 121
>          Idle Hour Blvd, Oakdale, NY, 11769
> 
>                   +1-631-244-3035
>                   yaya@dowling.edu
> =====================================================
> 
> On Thu, 29 Oct 2009, David Brown wrote:
> 
>> I assume that these markers would be used only on a few classes of files such 
>> as imgCIF and that the majority of CIFs would continue as before with no 
>> section markers present.  For this reason someone may write software that 
>> knows nothing about markers.  Would the presence of a marker cause problems 
>> for such a program since it would appear as an orphan string?  Preceding each 
>> marker with # would put them in a comment thus ensuring that they were 
>> ignored by a parser that was not looking for them, but would still be found 
>> by a parser that was looking for them since the marker is a single character 
>> that may appear at any suitable point in the file regardless of context. 
>> There would be the danger that the markers could be lost, but this would not 
>> invalidate the CIF, just slow down its processing.
>>
>> David
>>
>>
>>
>>
>> James Hester wrote:
>>
>>> Dear DDLm-ers:
>>>
>>> At the risk of completely overwhelming the group, a relatively radical
>>> proposal follows, perhaps inspired by the fact that we now have
>>> 107,000 'text' characters to play with...
>>>
>>> James.
>>> ---------------------------------------------------------------------------
>>> PROPOSAL: to define 'markers' in CIF2.0 syntax
>>>
>>> OUTLINE
>>>
>>> I propose that we nominate two different Unicode characters or
>>> sequences as 'markers'.  The first marker, a 'datablock' marker, may
>>> only appear prior to a datablock data_ tag.  The second, a 'section'
>>> marker, may appear anywhere that an unlooped dataname could appear,
>>> and may never appear in a data value, data name, or comment. It is
>>> envisioned that such markers would perform two functions:
>>>
>>> (i) to allow division of a datablock into sections (semantic function).
>>> (ii) to allow rapid traversal of a CIF data file (convenience function).
>>>
>>> In tandem, a new optional DDLm attribute would be created describing
>>> where in a datablock a given tag could appear: 'beginning', 'middle',
>>> 'end' (or unrestricted).  The 'beginning', 'middle' and 'end' sections
>>> would be separated by section markers.  If no markers are present,
>>> these three sections coincide and the current behaviour is retained.
>>> If one marker is present, 'middle' coincides with both 'beginning' and
>>> 'end'.
>>>
>>> Although I would be against it, a more fine-grained sectioning could
>>> also be implemented.
>>>
>>> (i) Semantic function (applies to section marker only)
>>>
>>> The current abstract datastructure that describes a CIF datablock (the
>>> "infoset") is an unordered set containing key-value pairs and loops (*).
>>> While this rigorous lack of canonical order has served us well, there
>>> are certain practical problems presented by large CIF files,
>>> particularly imgCIF.  imgCIF files contain information about the image
>>> geometry in CIF tags. As these image tags could be placed anywhere in
>>> a datablock, it is conceivable that finding them would require parsing
>>> several gigabytes of image data first.  Of course, in practice, imgCIF
>>> files are written with the tags in a header and a single image at the
>>> end, but programs cannot strictly rely on this.
>>>
>>> By placing a single 'marker' at the end of the important header
>>> material, the datablock is divided into 'beginning' and 'end', and the
>>> imgCIF dictionary can then specify which tags must be present before
>>> the image is encountered, or alternatively simply specify that the
>>> image must be at the 'end'.  This provides a guarantee that the
>>> important information will be found in the early part of the file.
>>>
>>> A further application for all CIFs would be to require the dictionary
>>> conformance information to be at the 'beginning'.  This may streamline
>>> applications which are dictionary-driven, by requiring only one pass
>>> over the datablock, as well as enabling dictionary-specific
>>> datastructures to be prepared for the data before the data are
>>> encountered during parsing.
>>>
>>> (ii) Convenience function (datablock, section markers)
>>>
>>> As the 'marker' cannot appear in a data value or name, it is possible
>>> for CIF input applications to skip to the next marker and be in a
>>> known parsing state, without going through a complete parse.  Datablocks
>>> and sections within datablocks could be skipped rapidly. This is
>>> also useful for recovery from parsing errors, although I'm not sure
>>> I'd trust a file to get markers right if it hasn't got the rest of the
>>> syntax correct.
>>>
>>> Note that without datablock markers, it would not be possible to rely
>>> on section markers in this way, as, instead of skipping to the next
>>> 'section', the application may simply skip to a marker in another,
>>> unknown datablock conforming to a different dictionary.  With
>>> datablock markers, the application can keep track of when a datablock
>>> boundary is traversed.
>>>
>>> NOTES:
>>>
>>> Effect on infoset: All tags in a datablock still form a single set,
>>> however the objects in the set are now composed of three parts: name,
>>> value and section.  In particular, data names cannot be repeated in
>>> separate sections.
>>>
>>> Choice of marker: a number of Unicode code points represent identical
>>> characters (e.g. Greek letters are repeated as mathematical symbols)
>>> so using one of these would not affect our ability to include
>>> arbitrary text.  Or, character U+24A8 is an m in brackets (m) which
>>> looks rather promising, as we can always represent this as '(','m',')'
>>> in ordinary text.  There are also lots of funky geometric shapes in
>>> the 25A0-25FF range (e.g. a solid circle inside another circle).  Or,
>>> for maximum likelihood of proper representation in editors and
>>> browsers, something that is found in Latin-1 might be preferable.
>>>
>>> Escaping: I propose that CIFs within a CIF cannot contain marker
>>> characters.  CIF in CIF is esoteric enough that such applications
>>> should be responsible for inserting markers before datablocks and
>>> within sections, if necessary.  If this proves to be a barrier, we can
>>> define an aliasing mechanism via the DDL dictionary that is specific
>>> to CIF in CIF and does not form part of the CIF syntax.
>>>
>>> Section marker at end of datablock: creates a null 'end' section
>>>
>>> Availability of markers: markers must be used everywhere in a file, or
>>> not at all.  If an application finds a datablock marker before the
>>> first datablock, it will find both datablock and section
>>> markers throughout the file.
>>>
>>> Optional for writers: we would need to decide on what to do if no
>>> marker is present in the datablock and the domain dictionary specifies
>>> sections:
>>>
>>> (1) state that beginning, middle and end are all the same section and
>>> dictionary requirements are automatically satisfied.  This approach
>>> gives readers a guarantee that they can get the information from the
>>> pre-image part of an arbitrary imgCIF if and only if a marker is
>>> present before the first datablock.
>>>
>>> (2) state that only 'middle' is present, therefore any beginning/end
>>> sectioned datanames will be invalid in terms of the dictionary.  This
>>> gives readers a strict guarantee that, providing the imgCIF dictionary
>>> imposes the appropriate conditions, the datablock will be sectioned.
>>> Frankly, this is a worthless guarantee as the dictionary conformance
>>> of a given imgCIF can only be determined during parsing.
>>>
>>> (*) Even more strictly speaking, the datablock is a set of loops,
>>> where each loop is a set of packets, and each packet is a set of
>>> (dataname, datavalue) pairs.  The set of unlooped datanames in this
>>> model is a packet belonging to a one-packet loop.
>>>
>>>
>>>
>>
>>
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
