Re: [ddlm-group] Data-name character restrictions - one last time

The range notation "1:4" is familiar to Fortran90 programmers. Many 
Fortran77 compilers also supported range notation. But, if ':' is 
disallowed in unquoted strings, it would have to be written as:

     _type.dimension      ['1:4','1:4']

Also, why not allow negative indices? It may make sense to allow only 0 
or 1, but why make it a mandatory restriction?
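
The two readings of a range such as "1:4" that come up in this thread can be compared with a small sketch. It is only an illustration: the function name `parse_range` and the convention labels "fortran" and "python" are assumptions made here, not part of any DDLm or dREL specification. Negative lower bounds are accepted, since nothing in the arithmetic requires them to be excluded.

```python
def parse_range(spec: str, convention: str) -> tuple[int, int]:
    """Return (starting_index, extent) for a 'lo:hi' range string.

    'fortran' treats both bounds as inclusive; 'python' treats the
    upper bound as excluded, as in a Python slice.
    """
    lo_text, hi_text = spec.split(":")
    lo, hi = int(lo_text), int(hi_text)
    if convention == "fortran":
        extent = hi - lo + 1   # both bounds inclusive
    elif convention == "python":
        extent = hi - lo       # upper bound excluded
    else:
        raise ValueError(f"unknown convention: {convention!r}")
    if extent < 0:
        raise ValueError(f"upper bound below lower bound in {spec!r}")
    return lo, extent


print(parse_range("1:4", "fortran"))   # (1, 4)
print(parse_range("1:4", "python"))    # (1, 3)
print(parse_range("-2:2", "fortran"))  # (-2, 5)
```

Under the Python reading, "1:4" corresponds to a starting index of 1 and a dimension of 3; under the Fortran reading the same text gives a dimension of 4.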


Herbert J. Bernstein wrote:
> No, the implicit zero comes from the dREL documentation in both the
> 2007 and 2008 versions.  This is a very serious issue for people with
> a Fortran background, and causes many mistakes.  Simply being able
> to specify the starting index would solve the problem.
> I agree that we need to keep in touch, but I am working from the
> dREL/DDLm documentation, and hope you are, too.  What we need to do
> is to stop focusing on stylistic issues and work on getting the
> documentation to be clear and unambiguous with more examples, so we
> do not go another 3+ years without people being aware of such
> critical issues as the default starting index for arrays.
> You will find the statement about the default index for arrays in section 3.4
> of dREL_spec_aug08.pdf.  All we need to fix it is to adopt a new tag to
> identify the starting index, such as
>   _type.starting_index
> or allow the dimensions of an array to be ranges.  The only problem
> with that is that there is a strange python convention which would
> suggest that
>    _type.dimension [1:5]
> would be declaring an array of dimension 4, starting at index 1.  To avoid
> the confusion that would cause for Fortran programmers, I would suggest
> that we write dictionaries with
>    _type.starting_index [1,1]
>    _type.dimension      [3,3]
> instead of
>    _type.dimension      [1:4,1:4]
> which would be natural in a python world, but not for Fortran programmers.
> To make implementation easy, I would not allow negative starting indices.
>    -- Herbert
> At 2:14 PM -0500 12/10/09, David Brown wrote:
>> I was not aware that there was a default indexing of arrays.  The
>> only place where this arises in DDL1 is in the list of symmetry
>> operations where we originally failed to define a key for the symop
>> loop.  But there, as far as I am aware, the assumed indexing always
>> starts at 1 for the first item.  This is strictly a fix since CIF1
>> specifically states that the order within a loop has no significance.
>> Later additions to the dictionary have corrected this oversight by
>> adding an explicit key, but it is not yet often used.  Otherwise, in
>> DDL1 (and DDL2?) the elements of an array have explicit data names
>> that start at 1, not 0.  The assumption that arrays are numbered
>> from zero must be an imgCIF convention.  It would always be better
>> to include explicit indexing to avoid these problems.
>> The DDLm dictionaries have methods for constructing arrays from
>> their elements, and methods for the reverse process could be added.
>> In this case it would not be necessary to decompose (or assemble) an
>> array on first reading as the necessary action would be taken as
>> soon as the array or its elements are invoked by a method or by a
>> list of items to output.
>> This raises another concern.  Herbert, if you are writing DDLm
>> dictionaries for imgCIF and I am writing them for coreCIF, we need
>> to keep in contact to make sure we are not introducing conflicting
>> conventions.
>> David
>> Herbert J. Bernstein wrote:
>>> Dear Colleagues,
>>>    One very neat resolution to this problem would be to allow a
>>> list or array-typed CIF2 tag to be referenced in a data file either
>>> as a whole or element by element.
>>>    Thus
>>>    _a.vec
>>> being defined as an array or list in CIF2 would automatically make
>>> the tags
>>>    _a.vec[1]
>>>    _a.vec[2]
>>> ...
>>> defined CIF2 tags.  If the array or list were nested, the
>>>    _a.vec[1][1]
>>>    _a.vec[1][2]
>>> etc. would be valid tags
>>>    I would propose that this be general and automatic, applying to
>>> all tags defined as list or arrays.  In view of past practice in
>>> CIF1, there is a slight conflict with respect to the default starting
>>> index in dREL versus the common CIF1 practice in indexing arrays
>>> from 0, but that can (and should) be solved with explicit specification
>>> of a starting index, so we can carry over the tag name usage from
>>> CIF1 without confusing people with an index shift.  So, if _a.vec
>>> were an array of dimension 5, starting from index 0, _a.vec[0]
>>> through _a.vec[4] would be valid, but if the starting index were
>>> specified as 1, _a.vec[1] through _a.vec[5] would be valid, matching
>>> CIF1 conventions.
>>>    The aliasing mechanism might have to be extended or clarified to
>>> handle the mapping against CIF1 tags in bulk for _a.vec as a whole,
>>> but, to me, this has a very intuitive feel.
>>>    Regards,
>>>      Herbert
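
The proposal above, that an array-typed tag automatically defines one tag per element, can be sketched as follows. This is a hedged illustration only: `element_tags` is a hypothetical helper, and the starting index is passed explicitly rather than assuming any default.

```python
from itertools import product


def element_tags(name, dimensions, starting_index):
    """List the per-element tags implied by an array-typed tag.

    dimensions and starting_index are sequences with one entry per
    axis, e.g. dimensions=[3, 3], starting_index=[1, 1].
    """
    if len(dimensions) != len(starting_index):
        raise ValueError("one starting index is needed per dimension")
    # One range of valid subscripts per axis.
    axes = [range(start, start + extent)
            for start, extent in zip(starting_index, dimensions)]
    # Cartesian product of the axes, last subscript varying fastest.
    return [name + "".join(f"[{i}]" for i in subscripts)
            for subscripts in product(*axes)]


print(element_tags("_a.vec", [5], [0]))  # _a.vec[0] through _a.vec[4]
print(element_tags("_a.vec", [5], [1]))  # _a.vec[1] through _a.vec[5]
```

With `dimensions=[3, 3]` and `starting_index=[1, 1]` the same helper yields the nine subscripted names in the pattern of `_atom_sites.fract_transf_matrix[1][1]` through `[3][3]`.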
>>> At 3:29 PM -0500 12/9/09, John Westbrook wrote:
>>>> Hi all -
>>>> On the issue of reserved characters in mmCIF/PDBx data items, these
>>>> generally have been inherited from the style of items from the core.  The
>>>> majority of items in this class are data items related to short
>>>> matrices/tensors and vectors (e.g. items including []).  Virtually all
>>>> have a syntax which could reasonably be interpreted as a programmatic
>>>> reference.  For instance,
>>>> _atom_sites.fract_transf_matrix[1][1]   0.007738
>>>> _atom_sites.fract_transf_matrix[1][2]   0.000000
>>>> _atom_sites.fract_transf_matrix[1][3]   0.004298
>>>> _atom_sites.fract_transf_matrix[2][1]   0.000000
>>>> _atom_sites.fract_transf_matrix[2][2]   0.016545
>>>> _atom_sites.fract_transf_matrix[2][3]   0.000000
>>>> _atom_sites.fract_transf_matrix[3][1]   0.000000
>>>> _atom_sites.fract_transf_matrix[3][2]   0.000000
>>>> _atom_sites.fract_transf_matrix[3][3]   0.020200
>>>> _atom_sites.fract_transf_vector[1]      0.00000
>>>> _atom_sites.fract_transf_vector[2]      0.00000
>>>> _atom_sites.fract_transf_vector[3]      0.00000
>>>> Are we close to being able to treat these as legal in the context of
>>>> CIF2/DDL+?
>>>> I suppose I am asking what will constitute a legal assignment for an element
>>>> of a matrix/array -
>>>> Only this -
>>>> _a.vec [1,2,3]
>>>> or also expanded assignment by element such as -
>>>> _a.vec[1]  1
>>>> _a.vec[2]  2
>>>> _a.vec[3]  3
>>>> If the latter is to be considered, then this will solve most of
>>>> the data name issues for our data.
>>>> Regards,
>>>> John
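
The question of whether element-by-element assignment could stand in for a whole-list value can be made concrete with a sketch that reassembles a vector from its element tags. The helper names `split_element_name` and `assemble_vector`, and the choice of a plain dictionary for the parsed items, are assumptions for illustration and do not describe any existing CIF library.

```python
import re

# A base name followed by one or more integer subscripts, e.g. _a.vec[1][2].
_ELEMENT = re.compile(r"^(?P<base>[^\[\]]+)(?P<subs>(?:\[-?\d+\])+)$")


def split_element_name(name):
    """Split '_a.vec[1][2]' into ('_a.vec', (1, 2)); no subscripts gives ()."""
    match = _ELEMENT.match(name)
    if match is None:
        return name, ()
    subscripts = tuple(int(s) for s in re.findall(r"-?\d+", match.group("subs")))
    return match.group("base"), subscripts


def assemble_vector(items, name, starting_index=1):
    """Collect name[i] entries from items (data name -> value) into a list."""
    elements = {}
    for data_name, value in items.items():
        base, subscripts = split_element_name(data_name)
        if base == name and len(subscripts) == 1:
            elements[subscripts[0]] = value
    if not elements:
        raise KeyError(f"no elements found for {name}")
    expected = range(starting_index, starting_index + len(elements))
    missing = [i for i in expected if i not in elements]
    if missing:
        raise ValueError(f"{name} is missing elements {missing}")
    return [elements[i] for i in expected]


items = {"_a.vec[1]": 1, "_a.vec[2]": 2, "_a.vec[3]": 3}
print(assemble_vector(items, "_a.vec"))  # [1, 2, 3]
```

The starting index has to be supplied (here defaulting to 1, as in the example data names), which is the same ambiguity discussed elsewhere in this thread.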
>>>> Joe Krahn wrote:
>>>>>  In practice, CIF2 parsers should allow CIF1 data names within a CIF2
>>>>>  formatted file. The question is whether these files should be allowed as
>>>>>  valid CIF2, or just for convenience as a non-standard CIF2.
>>>>>  When CIF files are used as working data files, the restrictions should
>>>>>  be relaxed. For long-term archival files, it makes sense to be more
>>>>>  restrictive. I would just make the CIF1 names inaccessible to dREL.
>>>>>  Alternatively, an implementation could allow CIF1 names only on reading,
>>>>>  and require dictionary alias mappings to CIF2 names.
>>>>>  One argument in favor of allowing them would be that someone wants to
>>>>>  convert all data files to CIF2 format, but they want to preserve the
>>>>>  original data as-is, without alias mapping.
>>>>>  I think that the current CIF2 syntax makes it possible to use CIF1 names
>>>>>  without any ambiguities. The question is whether they should be
>>>>>  considered valid CIF2, or just a non-standard version that will be
>>>>>  useful for the transitional period.
>>>>>
>>>>>  Joe
>>>>>  Herbert J. Bernstein wrote:
>>>>>>  Personally, I would greatly prefer to allow all data names that do not
>>>>>>  create a major lexer/parser conflict to appear in a data CIF and
>>>>>>  only apply the strong restrictions to data names that appear in CIF2
>>>>>>  dictionaries as defined data names (not as aliases).  -- Herbert
>>>>>>  At 2:40 PM +0000 12/9/09, Brian McMahon wrote:
>>>>>>>  I have one remaining niggle that I'd like to revisit before we put
>>>>>>>  this finally to bed. As has been mentioned a couple of times
>>>>>>>  recently, restricting the data-name character set does invalidate
>>>>>>>  syntactically many existing CIF 1 files (e.g.
>>>>>>>  _refine_ls_shift/esd_max).
>>>>>>>  We have discussed strategies for handling this, and I think these
>>>>>>>  are workable strategies, but will involve investment and hence expense
>>>>>>>  in workflow management in CIF archives.
>>>>>>>  I understand the rationale behind this restriction is to simplify
>>>>>>>  future processing of data names in areas such as dREL
>>>>>>>  applications. The question really is whether we're choosing the right
>>>>>>>  trade-off in making things cleaner at that end of the processing
>>>>>>>  chain. I would suppose that a dREL or other application could ingest a
>>>>>>>  data name with dangerous characters, convert it internally into a
>>>>>>>  "safe" identifier that's used for all processing, and then restore the
>>>>>>>  original form upon output; but writing that intermediate layer of
>>>>>>>  processing is of course expensive (especially if there aren't readily
>>>>>>>  available libraries that will do this transparently).
>>>>>>>  I suspect that some of the original proposed syntactic changes also
>>>>>>>  had the effect (whether by design or collaterally) of simplifying i/o,
>>>>>>>  data structure management, symbol table processing etc., but those may
>>>>>>>  have suffered in the subsequent revision exercise we've just been
>>>>>>>  practising. Given the consensus we are now approaching, would the code
>>>>>>>  builders now be prepared to incur the additional expense of handling
>>>>>>>  "dangerous" data names?
>>>>>>>  I really don't want to spark off a long discussion on this - if a
>>>>>>>  quick round of response shows that there's no appetite to allow
>>>>>>>  the additional punctuation characters in data names, I'll accept that
>>>>>>>  gracefully.
>>>>>>>  ***
>>>>>>>  One last comment while I have the floor, though it is related in part
>>>>>>>  to the above question. A concern raised in the editorial office was
>>>>>>>  that there would be circumstances where users didn't know if they were
>>>>>>>  dealing with a CIF 1 or 2 ("users" meaning authors, perhaps resorting
>>>>>>>  to the vi editor - and we're imagining most of them are dealing with
>>>>>>>  small-molecule/inorganic CIFs). My supposition is that the IUCr
>>>>>>>  editorial offices would only want to use CIF2 seriously in association
>>>>>>>  with DDLm dictionaries, and that we would expect the revised core
>>>>>>>  dictionaries to use the dot component in data names to signal this
>>>>>>>  further evolution. So even a superficial glimpse of the middle of a
>>>>>>>  CIF would make it clear whether it was CIF1 or CIF2.
>>>>>>>  Does that fit in with how others see this progressing?
>>>>>>>  Cheers
>>>>>>>  Brian
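
The intermediate layer described in the message above (ingest a data name with "dangerous" characters, use a safe identifier internally, restore the original on output) might look roughly like the following. This is a sketch under stated assumptions: the class `NameTable`, its method names, and the rule that anything other than ASCII letters, digits and underscore is "dangerous" are all invented here for illustration.

```python
import re


class NameTable:
    """Two-way mapping between original data names and safe identifiers."""

    def __init__(self):
        self._safe_by_original = {}
        self._original_by_safe = {}

    def to_safe(self, original):
        """Return a safe identifier for original, creating one if needed."""
        if original in self._safe_by_original:
            return self._safe_by_original[original]
        # Assumption: only letters, digits and underscore count as safe.
        base = re.sub(r"[^A-Za-z0-9_]", "_", original)
        candidate, n = base, 1
        # Add a numeric suffix if another name already maps to this form.
        while candidate in self._original_by_safe:
            n += 1
            candidate = f"{base}_{n}"
        self._safe_by_original[original] = candidate
        self._original_by_safe[candidate] = original
        return candidate

    def to_original(self, safe):
        """Restore the data name exactly as it was read."""
        return self._original_by_safe[safe]


table = NameTable()
safe = table.to_safe("_refine_ls_shift/esd_max")
print(safe)                     # _refine_ls_shift_esd_max
print(table.to_original(safe))  # _refine_ls_shift/esd_max
```

Because the original spelling is stored rather than recomputed, two names that collapse to the same safe form still round-trip correctly; the second one simply receives a numeric suffix.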