Re: [ddlm-group] Data-name character restrictions - one last time

I'll just chime in and say that this sounds like a brilliant way to move forward.  To describe it from another angle:  Nick is proposing that we introduce new semantics in our CIF2 syntax, such that

    _somecharacters[i]

refers to the i'th element of _somecharacters. The dictionary does not define any dataname of the form '_somecharacters[i]', nor does it need to define this in an alias.  The dictionary simply defines '_somecharacters'.

I believe a correct description of the syntax would be

<dataname> = <datanamehead><slice>*
<datanamehead> = '_' [A-Za-z0-9._]+
<slice> = '[' [0-9]+ ']'
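
As a rough illustration of the grammar above, here is a minimal Python sketch that splits a data name into its head and its list of indices. The function name split_dataname and the regular expressions are my own assumptions, not part of any CIF2 draft, and the character class is read as letters, digits, '.' and '_' only.

```python
import re

# Assumed reading of the grammar: a head made of '_' plus letters, digits,
# '.' and '_', followed by zero or more '[digits]' slices.
DATANAME = re.compile(r"(_[A-Za-z0-9._]+)((?:\[[0-9]+\])*)")
SLICE = re.compile(r"\[([0-9]+)\]")


def split_dataname(name):
    """Return (head, indices), or None if the name does not fit the grammar."""
    m = DATANAME.fullmatch(name)
    if m is None:
        return None
    head, slices = m.groups()
    return head, [int(i) for i in SLICE.findall(slices)]
```

With this reading, a name such as _refine_ls_shift/esd_max is rejected because '/' is outside the assumed character set.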

On Fri, Dec 11, 2009 at 2:19 PM, Nick Spadaccini wrote:
I can agree with that, if you are saying only the matrix object is available
to the user.

OR alternatively are you saying there will ONLY be one object defined in the
dictionary, let's say the 3x3 matrix

    _atom_site.U

But NEVER have definitions in the dictionary for the individual
_atom_site.U[i][j] elements.

As we parse a CIF data file, if we detect _atom_site.U[i][j], it isn't in
the defined dictionary so this would normally raise an error. BUT because of
the specific trailing syntax [i][j] this informs the parser there must be an
object of matching rank with the name _atom_site.U (ie the
_atom_site.U[i][j] with the [i][j] truncated) in the dictionary - and
therefore populate the appropriate element of _atom_site.U with that value.

This would circumvent the problem of two different identifiers called
_atom_site.U[i][j] in the dictionary BUT would necessarily mean that [i][j]
syntax in a data name was reserved for objects that are defined in the
dictionary as, in this case, a 2D matrix. They can't (shouldn't?) be used
for general data names.
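
The lookup rule described above might be sketched in Python roughly as follows. This is only an illustration: the DICTIONARY mapping and the resolve function are hypothetical stand-ins for a real DDLm dictionary and parser, and the sketch checks rank only, not index bounds, since the starting index is still under discussion.

```python
import re

# Hypothetical stand-in for a DDLm dictionary: each defined data name maps
# to the shape of its value (an empty tuple means a scalar).
DICTIONARY = {"_atom_site.U": (3, 3), "_cell.length_a": ()}

# A base name followed by one or more trailing '[digits]' groups.
TRAILING = re.compile(r"^(.*?)((?:\[[0-9]+\])+)$")


def resolve(name, dictionary=DICTIONARY):
    """Map a data name to (defined_name, indices), or raise KeyError."""
    if name in dictionary:
        return name, ()
    m = TRAILING.match(name)
    if m:
        base = m.group(1)
        indices = tuple(int(i) for i in re.findall(r"[0-9]+", m.group(2)))
        shape = dictionary.get(base)
        # Accept only if the truncated name is defined with matching rank.
        if shape is not None and len(shape) == len(indices):
            return base, indices
    raise KeyError(name)
```

Here _atom_site.U[1][2] resolves to the defined matrix plus the index pair (1, 2), while _atom_site.U[1] is rejected because its rank does not match.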

Does this cover what John wanted also?

On 11/12/09 10:12 AM, "Herbert J. Bernstein" wrote:

> Actually, the suggestion comes from reading the dREL documentation and the
> DDLm documentation and noticing how clumsy the access to array elements in
> DDLm is compared to the access in dREL.  What I am suggesting is to
> promote the dREL access making it fully available at the DDLm level,
> replacing the clumsy element-by-element definitions with one automatic
> definition that looks and works just the way one might expect.

On Fri, 11 Dec 2009, Nick Spadaccini wrote:
>> Many of you need to read the dREL part of the dictionary much more closely.
>> dREL extensively exploits access to  matrix and vector types by index
>> addressing at a programmatic level. That is how it does the things it
>> has to. So within the dREL programming language you will see littered
>> everywhere a matrix which is accessed via standard indexing (as you would
>> with any language supporting array structures).
>> So let's have a matrix _atom_site.U - within dREL I have access to
>> _atom_site.U[0][0] etc as part of the language (I'll stick with 0 initial
>> indexing but this really is a trivial problem, solved many times over).
>> But now you ALSO want a scalar data item called _atom_site.U[0][0] within
>> CIF. The dictionary says _atom_site.U[0][0] is a single scalar value.
>> The dREL constructor method for _atom_site.U has
>> _atom_site.U = Matrix([[_atom_site.U[0][0] ...]...])
>> This obviously won't work. This is why the dictionary in DDLm uses the
>> equivalent of _atom_site.U_0_0 for the scalar value so that the above
>> constructor will make sense and still allows me to access _atom_site.U[0][0]
>> from within dREL. It is why I am keen to restrict the syntax of the data
>> names.

On 11/12/09 2:46 AM, "Herbert J. Bernstein" wrote:
>>> Dear Colleagues,
>>>    One very neat resolution to this problem would be to allow a
>>> list or array-typed CIF2 tag to be referenced in a data file either
>>> as a whole or element by element.
>>>    Thus
>>>    _a.vec
>>> being defined as an array or list in CIF2 would automatically make
>>> the tags
>>>    _a.vec[1]
>>>    _a.vec[2]
>>> ...
>>> defined CIF2 tags.  If the array or list were nested, then
>>>    _a.vec[1][1]
>>>    _a.vec[1][2]
>>> etc. would be valid tags.
>>>    I would propose that this be general and automatic, applying to
>>> all tags defined as lists or arrays.  In view of past practice in
>>> CIF1, there is a slight conflict between the default starting
>>> index in dREL and the common CIF1 practice of indexing arrays
>>> from 1, but that can (and should) be solved with explicit specification
>>> of a starting index, so we can carry over the tag name usage from
>>> CIF1 without confusing people with an index shift.  So, if _a.vec
>>> were an array of dimension 5, starting from index 0, _a.vec[0]
>>> through _a.vec[4] would be valid, but if the starting index were
>>> specified as 1, _a.vec[1] through _a.vec[5] would be valid, matching
>>> CIF1 conventions.
>>>    The aliasing mechanism might have to be extended or clarified to
>>> handle the mapping against CIF1 tags in bulk for _a.vec as a whole,
>>> but, to me, this has a very intuitive feel.
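
The starting-index rule in the message above could be sketched like this in Python. The function element_tags and its shape/start parameters are hypothetical names chosen for illustration; nothing here is taken from an actual CIF2 or DDLm implementation.

```python
from itertools import product


def element_tags(tag, shape, start=0):
    """List the element-wise tags implied by an array-typed tag.

    shape gives the length of each dimension; start is the explicitly
    specified starting index, applied here to every dimension.
    """
    ranges = [range(start, start + n) for n in shape]
    return [tag + "".join(f"[{i}]" for i in idx) for idx in product(*ranges)]
```

For an array of dimension 5, element_tags("_a.vec", (5,)) runs from _a.vec[0] to _a.vec[4], and element_tags("_a.vec", (5,), start=1) runs from _a.vec[1] to _a.vec[5].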

At 3:29 PM -0500 12/9/09, John Westbrook wrote:
>>>> Hi all -
>>>> On the issue of reserved characters in mmCIF/PDBx data items, these
>>>> generally have been inherited from the style of items from the core.  The
>>>> majority of items in this class are data items related to short
>>>> matrices/tensors and vectors (e.g. items including []).  Virtually all
>>>> have a syntax which could reasonably be interpreted as a programmatic
>>>> reference.  For instance,
>>>> _atom_sites.fract_transf_matrix[1][1]   0.007738
>>>> _atom_sites.fract_transf_matrix[1][2]   0.000000
>>>> _atom_sites.fract_transf_matrix[1][3]   0.004298
>>>> _atom_sites.fract_transf_matrix[2][1]   0.000000
>>>> _atom_sites.fract_transf_matrix[2][2]   0.016545
>>>> _atom_sites.fract_transf_matrix[2][3]   0.000000
>>>> _atom_sites.fract_transf_matrix[3][1]   0.000000
>>>> _atom_sites.fract_transf_matrix[3][2]   0.000000
>>>> _atom_sites.fract_transf_matrix[3][3]   0.020200
>>>> _atom_sites.fract_transf_vector[1]      0.00000
>>>> _atom_sites.fract_transf_vector[2]      0.00000
>>>> _atom_sites.fract_transf_vector[3]      0.00000
>>>> Are we close to being able to treat these as legal in the context of
>>>> CIF2/DDL+?
>>>> I suppose I am asking what will constitute a legal assignment for an
>>>> element of a matrix/array -
>>>> Only this -
>>>> _a.vec [1,2,3]
>>>> or also expanded assignment by element such as -
>>>> _a.vec[1]  1
>>>> _a.vec[2]  2
>>>> _a.vec[3]  3
>>>> If the latter is to be considered, then this will solve most of the data
>>>> name issues for our data.
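
To make the equivalence between the two forms of assignment concrete, here is a small Python sketch that gathers element-by-element assignments into the single list value. The function assemble_vector and its start parameter are assumptions made for this example; it handles one-dimensional tags only.

```python
import re


def assemble_vector(items, tag, start=1):
    """Collect (name, value) pairs of the form tag[i] into one ordered list."""
    pattern = re.compile(re.escape(tag) + r"\[([0-9]+)\]")
    found = {}
    for name, value in items:
        m = pattern.fullmatch(name)
        if m:
            found[int(m.group(1))] = value
    # Require a contiguous run of indices beginning at `start`.
    expected = list(range(start, start + len(found)))
    if sorted(found) != expected:
        raise ValueError(f"missing or unexpected indices for {tag}")
    return [found[i] for i in expected]
```

Given the three assignments _a.vec[1] 1, _a.vec[2] 2 and _a.vec[3] 3, this returns the same value as the whole-list form _a.vec [1,2,3].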

Joe Krahn wrote:
>>>>>  In practice, CIF2 parsers should allow CIF1 data names within a CIF2
>>>>>  formatted file. The question is whether these files should be allowed as
>>>>>  valid CIF2, or just for convenience as a non-standard CIF2.
>>>>>  When CIF files are used as working data files, the restrictions should
>>>>>  be relaxed. For long-term archival files, it makes sense to be more
>>>>>  restrictive. I would just make the CIF1 names inaccessible to dREL.
>>>>>  Alternatively, an implementation could allow CIF1 names only on reading,
>>>>>  and require dictionary alias mappings to CIF2 names.
>>>>>  One argument in favor of allowing them would be that someone wants to
>>>>>  convert all data files to CIF2 format, but they want to preserve the
>>>>>  original data as-is, without alias mapping.
>>>>>  I think that the current CIF2 syntax makes it possible to use CIF1 names
>>>>>  without any ambiguities. The question is whether they should be
>>>>>  considered valid CIF2, or just a non-standard version that will be
>>>>>  useful for the transitional period.

Herbert J. Bernstein wrote:
>>>>>>  Personally, I would greatly prefer to allow all data names that do not
>>>>>>  create a major lexer/parser conflict to appear in a data CIF and
>>>>>>  only apply the strong restrictions to data names that appear in CIF2
>>>>>>  dictionaries as defined data names (not as aliases).  -- Herbert

At 2:40 PM +0000 12/9/09, Brian McMahon wrote:
>>>>>>>  I have one remaining niggle that I'd like to revisit before we put
>>>>>>>  this finally to bed. As has been mentioned a couple of times
>>>>>>>  recently, restricting the data-name character set does invalidate
>>>>>>>  syntactically many existing CIF 1 files (e.g.
>>>>>>>  _refine_ls_shift/esd_max).
>>>>>>>  We have discussed strategies for handling this, and I think these
>>>>>>>  are workable strategies, but will involve investment and hence expense
>>>>>>>  in workflow management in CIF archives.
>>>>>>>  I understand the rationale behind this restriction is to simplify
>>>>>>>  future processing of data names in areas such as dREL
>>>>>>>  applications. The question really is whether we're choosing the right
>>>>>>>  trade-off in making things cleaner at that end of the processing
>>>>>>>  chain. I would suppose that a dREL or other application could ingest a
>>>>>>>  data name with dangerous characters, convert it internally into a
>>>>>>>  "safe" identifier that's used for all processing, and then restore the
>>>>>>>  original form upon output; but writing that intermediate layer of
>>>>>>>  processing is of course expensive (especially if there aren't readily
>>>>>>>  available libraries that will do this transparently).
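
One way the intermediate layer described above might look is a two-way symbol table, sketched here in Python. The class NameTable and the generated item_N identifiers are purely hypothetical; this is not how any existing dREL implementation is known to work.

```python
class NameTable:
    """Two-way map between original data names and safe identifiers."""

    def __init__(self):
        self._to_safe = {}
        self._to_original = {}

    def safe(self, name):
        """Return a stable, identifier-safe stand-in for a data name."""
        if name not in self._to_safe:
            ident = f"item_{len(self._to_safe)}"
            self._to_safe[name] = ident
            self._to_original[ident] = name
        return self._to_safe[name]

    def original(self, ident):
        """Restore the original data name for output."""
        return self._to_original[ident]
```

All processing would then use the value returned by safe(), and original() would restore a name such as _refine_ls_shift/esd_max unchanged on output.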
>>>>>>>  I suspect that some of the original proposed syntactic changes also
>>>>>>>  had the effect (whether by design or collaterally) of simplifying i/o,
>>>>>>>  data structure management, symbol table processing etc., but those may
>>>>>>>  have suffered in the subsequent revision exercise we've just been
>>>>>>>  practising. Given the consensus we are now approaching, would the code
>>>>>>>  builders now be prepared to incur the additional expense of handling
>>>>>>>  "dangerous" data names?
>>>>>>>  I really don't want to spark off a long discussion on this - if a
>>>>>>>  quick round of response shows that there's no appetite to allow
>>>>>>>  the additional punctuation characters in data names, I'll accept that
>>>>>>>  gracefully.
>>>>>>>  One last comment while I have the floor, though it is related in part
>>>>>>>  to the above question. A concern raised in the editorial office was
>>>>>>>  that there would be circumstances where users didn't know if they were
>>>>>>>  dealing with a CIF 1 or 2 ("users" meaning authors, perhaps resorting
>>>>>>>  to the vi editor - and we're imagining most of them are dealing with
>>>>>>>  small-molecule/inorganic CIFs). My supposition is that the IUCr
>>>>>>>  editorial offices would only want to use CIF2 seriously in association
>>>>>>>  with DDLm dictionaries, and that we would expect the revised core
>>>>>>>  dictionaries to use the dot component in data names to signal this
>>>>>>>  further evolution. So even a superficial glimpse of the middle of a
>>>>>>>  CIF would make it clear whether it was CIF1 or CIF2.
>>>>>>>  Does that fit in with how others see this progressing?
