
Re: [ddlm-group] Data-name character restrictions - one last time

The dREL/DDLm documents explicitly use ":" for ranges.

The main reason to forbid negative starting indices is to keep a
Python implementation simple.
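
To make the two readings of a range concrete, here is a minimal Python
sketch. It is only an illustration: the function names parse_range and
valid_indices are hypothetical and do not come from the dREL/DDLm
documents, and the rule that rejects negative starting indices is just
the restriction proposed in this thread.

```python
def parse_range(spec, inclusive=True):
    """Return (starting_index, dimension) for a range written as 'lo:hi'.

    inclusive=True  reads 'lo:hi' the Fortran way (both ends included).
    inclusive=False reads it the Python slice way (upper end excluded).
    """
    lo_text, hi_text = spec.split(":")
    lo, hi = int(lo_text), int(hi_text)
    if lo < 0:
        # Assumption taken from the proposal: no negative starting indices.
        raise ValueError("negative starting index in %r" % spec)
    dimension = hi - lo + 1 if inclusive else hi - lo
    if dimension < 1:
        raise ValueError("empty range %r" % spec)
    return lo, dimension


def valid_indices(starting_index, dimension):
    """List the element indices that would be legal for one array axis."""
    return list(range(starting_index, starting_index + dimension))


print(parse_range("1:4"))                   # (1, 4)  Fortran reading
print(parse_range("1:4", inclusive=False))  # (1, 3)  Python slice reading
print(valid_indices(0, 5))                  # [0, 1, 2, 3, 4]
print(valid_indices(1, 5))                  # [1, 2, 3, 4, 5]
```

The same text "1:4" gives a dimension of 4 under one reading and 3 under
the other, which is the ambiguity that a separate starting-index tag
would avoid.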

At 5:41 PM -0500 12/10/09, Joe Krahn wrote:
>The range notation "1:4" is familiar to Fortran90 programmers. Many
>Fortran77 compilers also supported range notation. But, if ':' is
>disallowed in unquoted strings, it would have to be written as:
>      _type.dimension      ['1:4','1:4']
>Also, why not allow negative indices? It may make sense to allow only 0
>or 1, but why make it a mandatory restriction?
>Herbert J. Bernstein wrote:
>>  No, the implicit zero comes from the dREL documentation in both the
>>  2007 and 2008 versions.  This is a very serious issue for people with
>>  a Fortran background, and causes many mistakes.  Simply being able
>>  to specify the starting index would solve the problem.
>>  I agree that we need to keep in touch, but I am working from the
>>  dREL/DDLm documentation, and hope you are, too.  What we need to do
>>  is to stop focusing on stylistic issues and work on getting the
>>  documentation to be clear and unambiguous with more examples, so we
>>  do not go another 3+ years without people being aware of such
>>  critical issues as the default starting index for arrays.
>>  You will find the statement about the default index for arrays in
>>  section 3.4 of dREL_spec_aug08.pdf.  All we need to fix it is to adopt
>>  a new tag to identify the starting index, such as
>>    _type.starting_index
>>  or allow the dimensions of an array to be ranges.  The only problem
>>  with that is that there is a strange python convention which would
>>  suggest that
>>     _type.dimension [1:5]
>>  would be declaring an array of dimension 4, starting at index 1.  To avoid
>>  the confusion that would cause for Fortran programmers, I would suggest
>>  that we write dictionaries with
>>     _type.starting_index [1,1]
>>     _type.dimension      [3,3]
>>  instead of
>>     _type.dimension      [1:4,1:4]
>>  which would be natural in a python world, but not for Fortran programmers.
>>  To make implementation easy, I would not allow negative starting indices.
>>     -- Herbert
>>  At 2:14 PM -0500 12/10/09, David Brown wrote:
>>>  I was not aware that there was a default indexing of arrays.  The
>>>  only place where this arises in DDL1 is in the list of symmetry
>>>  operations where we originally failed to define a key for the symop
>>>  loop.  But there, as far as I am aware, the assumed indexing always
>>>  starts at 1 for the first item.  This is strictly a fix since CIF1
>>>  specifically states that the order within a loop has no significance.
>>>  Later additions to the dictionary have corrected this oversight by
>>>  adding an explicit key, but it is not yet often used.  Otherwise, in
>>>  DDL1 (and DDL2?) the elements of an array have explicit data names
>>>  that start at 1, not 0.  The assumption that arrays are numbered
>>>  from zero must be an imgCIF convention.  It would always be better
>>>  to include explicit indexing to avoid these problems.
>>>  The DDLm dictionaries have methods for constructing arrays from
>>>  their elements, and methods for the reverse process could be added.
>>>  In this case it would not be necessary to decompose (or assemble) an
>>>  array on first reading as the necessary action would be taken as
>>>  soon as the array or its elements are invoked by a method or by a
>>>  list of items to output.
>>>  This raises another concern.  Herbert, if you are writing DDLm
>>>  dictionaries for imgCIF and I am writing them for coreCIF, we need
>>>  to keep in contact to make sure we are not introducing conflicting
>>>  conventions.
>>>  David
>>>  Herbert J. Bernstein wrote:
>>>>  Dear Colleagues,
>>>>     One very neat resolution to this problem would be to allow a
>>>>  list or array-typed CIF2 tag to be referenced in a data file either
>>>>  as a whole or element by element.
>>>>     Thus
>>>>     _a.vec
>>>>  being defined as an array or list in CIF2 would automatically make
>>>>  the tags
>>>>     _a.vec[1]
>>>>     _a.vec[2]
>>>>  ...
>>>>  defined CIF2 tags.  If the array or list were nested, then
>>>>     _a.vec[1][1]
>>>>     _a.vec[1][2]
>>>>  etc. would be valid tags
>>>>     I would propose that this be general and automatic, applying to
>>>>  all tags defined as lists or arrays.  In view of past practice in
>>>>  CIF1, there is a slight conflict between the default starting index
>>>>  of 0 in dREL and the common CIF1 practice of indexing arrays from 1,
>>>>  but that can (and should) be solved with explicit specification
>>>>  of a starting index, so we can carry over the tag name usage from
>>>>  CIF1 without confusing people with an index shift.  So, if _a.vec
>>>>  were an array of dimension 5, starting from index 0, _a.vec[0]
>>>>  through _a.vec[4] would be valid, but if the starting index were
>>>>  specified as 1, _a.vec[1] through _a.vec[5] would be valid, matching
>>>>  CIF1 conventions.
>>>>     The aliasing mechanism might have to be extended or clarified to
>>>>  handle the mapping against CIF1 tags in bulk for _a.vec as a whole,
>>>>  but, to me, this has a very intuitive feel.
>>>>     Regards,
>>>>       Herbert
>>>>  At 3:29 PM -0500 12/9/09, John Westbrook wrote:
>>>>>  Hi all -
>>>>>  On the issue of reserved characters in mmCIF/PDBx data items, these
>>>>>  generally have been inherited from the style of items from the core.  The
>>>>>  majority of items in this class are data items related to short
>>>>>  matrices/tensors and vectors (e.g. items including []).  Virtually
>>>>>  all have a syntax which could reasonably be interpreted as a
>>>>>  programmatic reference.  For instance,
>>>>>  _atom_sites.fract_transf_matrix[1][1]   0.007738
>>>>>  _atom_sites.fract_transf_matrix[1][2]   0.000000
>>>>>  _atom_sites.fract_transf_matrix[1][3]   0.004298
>>>>>  _atom_sites.fract_transf_matrix[2][1]   0.000000
>>>>>  _atom_sites.fract_transf_matrix[2][2]   0.016545
>>>>>  _atom_sites.fract_transf_matrix[2][3]   0.000000
>>>>>  _atom_sites.fract_transf_matrix[3][1]   0.000000
>>>>>  _atom_sites.fract_transf_matrix[3][2]   0.000000
>>>>>  _atom_sites.fract_transf_matrix[3][3]   0.020200
>>>>>  _atom_sites.fract_transf_vector[1]      0.00000
>>>>>  _atom_sites.fract_transf_vector[2]      0.00000
>>>>>  _atom_sites.fract_transf_vector[3]      0.00000
>>>>>  Are we close to being able to treat these as legal in the context of
>>>>>  CIF2/DDL+?
>>>>>  I suppose I am asking what will constitute a legal assignment for an
>>>>>  element of a matrix/array -
>>>>>  Only this -
>>>>>  _a.vec [1,2,3]
>>>>>  or also expanded assignment by element such as -
>>>>>  _a.vec[1]  1
>>>>>  _a.vec[2]  2
>>>>>  _a.vec[3]  3
>>>>>  If the latter is to be considered, then this will solve most of the
>>>>>  data name issues for our data.
>>>>>  Regards,
>>>>>  John
>>>>>  Joe Krahn wrote:
>>>>>>   In practice, CIF2 parsers should allow CIF1 data names within a CIF2
>>>>>>   formatted file. The question is whether these files should be
>>>>>>   allowed as valid CIF2, or just for convenience as a non-standard CIF2.
>>>>>>   When CIF files are used as working data files, the restrictions should
>>>>>>   be relaxed. For long-term archival files, it makes sense to be more
>>>>>>   restrictive. I would just make the CIF1 names inaccessible to dREL.
>>>>>>   Alternatively, an implementation could allow CIF1 names only on
>>>>>>   reading, and require dictionary alias mappings to CIF2 names.
>>>>>>   One argument in favor of allowing them would be that someone wants to
>>>>>>   convert all data files to CIF2 format, but they want to preserve the
>>>>>>   original data as-is, without alias mapping.
>>>>>>   I think that the current CIF2 syntax makes it possible to use CIF1
>>>>>>   names without any ambiguities. The question is whether they should be
>>>>>>   considered valid CIF2, or just a non-standard version that will be
>>>>>>   useful for the transitional period.
>>>>>>   Joe
>>>>>>   Herbert J. Bernstein wrote:
>>>>>>>   Personally, I would greatly prefer to allow all data names that do not
>>>>>>>   create a major lexer/parser conflict to appear in a data CIF and
>>>>>>>   only apply the strong restrictions to data names that appear in CIF2
>>>>>>>   dictionaries as defined data names (not as aliases).  -- Herbert
>>>>>>>   At 2:40 PM +0000 12/9/09, Brian McMahon wrote:
>>>>>>>>   I have one remaining niggle that I'd like to revisit before we put
>>>>>>>>   this finally to bed. As has been mentioned a couple of times
>>>>>>>>   recently, restricting the data-name character set does invalidate
>>>>>>>>   syntactically many existing CIF 1 files (e.g.
>>>>>>>>  _refine_ls_shift/esd_max ).
>>>>>>>>   We have discussed strategies for handling this, and I think these
>>>>>>>>   are workable strategies, but will involve investment and hence
>>>>>>>>   expense in workflow management in CIF archives.
>>>>>>>>   I understand the rationale behind this restriction is to simplify
>>>>>>>>   future processing of data names in areas such as dREL
>>>>>>>>   applications. The question really is whether we're choosing the right
>>>>>>>>   trade-off in making things cleaner at that end of the processing
>>>>>>>>   chain. I would suppose that a dREL or other application could
>>>>>>>>   ingest a data name with dangerous characters, convert it internally
>>>>>>>>   into a "safe" identifier that's used for all processing, and then
>>>>>>>>   restore the original form upon output; but writing that
>>>>>>>>   intermediate layer of processing is of course expensive (especially
>>>>>>>>   if there aren't readily available libraries that will do this
>>>>>>>>   transparently).
>>>>>>>>   I suspect that some of the original proposed syntactic changes also
>>>>>>>>   had the effect (whether by design or collaterally) of simplifying
>>>>>>>>   i/o, data structure management, symbol table processing etc., but
>>>>>>>>   those may have suffered in the subsequent revision exercise we've
>>>>>>>>   just been practising. Given the consensus we are now approaching,
>>>>>>>>   would the code builders now be prepared to incur the additional
>>>>>>>>   expense of handling "dangerous" data names?
>>>>>>>>   I really don't want to spark off a long discussion on this - if a
>>>>>>>>   quick round of response shows that there's no appetite to allow
>>>>>>>>   the additional punctuation characters in data names, I'll accept that
>>>>>>>>   gracefully.
>>>>>>>>   ***
>>>>>>>>   One last comment while I have the floor, though it is related in part
>>>>>>>>   to the above question. A concern raised in the editorial office was
>>>>>>>>   that there would be circumstances where users didn't know if they
>>>>>>>>   were dealing with a CIF 1 or 2 ("users" meaning authors, perhaps
>>>>>>>>   resorting to the vi editor - and we're imagining most of them are
>>>>>>>>   dealing with small-molecule/inorganic CIFs). My supposition is
>>>>>>>>   that the IUCr
>>>>>>>>   editorial offices would only want to use CIF2 seriously in conjunction
>>>>>>>>   with DDLm dictionaries, and that we would expect the revised core
>>>>>>>>   dictionaries to use the dot component in data names to signal this
>>>>>>>>   further evolution. So even a superficial glimpse of the middle of a
>>>>>>>>   CIF would make it clear whether it was CIF1 or CIF2.
>>>>>>>>   Does that fit in with how others see this progressing?
>>>>>>>>   Cheers
>>>>>>>>   Brian
>ddlm-group mailing list
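
The intermediate layer described in Brian's message quoted above (ingest
a data name with dangerous characters, work with a "safe" identifier
internally, restore the original on output) could look roughly like the
following Python sketch. The class name NameTable, its method names, and
the choice of replacing each unsafe character with an underscore are all
assumptions made for illustration, not anything defined by CIF2 or dREL.

```python
import re


class NameTable:
    """Two-way map between original data names and safe identifiers."""

    def __init__(self):
        self._to_safe = {}      # original data name -> safe identifier
        self._to_original = {}  # safe identifier -> original data name

    def to_safe(self, data_name):
        """Return a safe identifier for data_name, creating one if needed."""
        if data_name in self._to_safe:
            return self._to_safe[data_name]
        # Assumed rule: anything outside letters, digits and '_' is unsafe.
        base = re.sub(r"[^A-Za-z0-9_]", "_", data_name)
        candidate = base
        suffix = 1
        # Two different names may sanitize to the same text, so keep
        # adding a numeric suffix until the identifier is unused.
        while candidate in self._to_original:
            suffix += 1
            candidate = "%s_%d" % (base, suffix)
        self._to_safe[data_name] = candidate
        self._to_original[candidate] = data_name
        return candidate

    def to_original(self, safe_name):
        """Restore the data name exactly as it was first seen."""
        return self._to_original[safe_name]


table = NameTable()
safe = table.to_safe("_refine_ls_shift/esd_max")
print(safe)                     # _refine_ls_shift_esd_max
print(table.to_original(safe))  # _refine_ls_shift/esd_max
```

Keeping the table per data block would let an application write names
back out unchanged, at the cost of carrying the mapping through all
processing.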

  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769
