Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Dear Colleagues,

   First let me say that I very much agree with Nick's latest version.  I 
think it is a reasonable, pragmatic compromise trying to introduce changes
that are needed while providing a workable conversion path for existing
CIF 1 data files.

   Now to David's comments:

> For clarification are two ' (i.e. '') the same as one "?  Some of the 
> illustrations seem to indicate that this is so, but this may be the result of 
> the fonts used which do not distinguish between '' and ".  (This is why I am 
> using Courier.) If they are not the same then '' is not a delimiter and 
> ''red'' would be interpreted as three items '' red '' whereas "red" would be 
> interpreted as the single item red without any quotes.  Are both ''' and """ 
> legal delimiters?


Both Nick and I believe that two sequential apostrophes are different
from a double-quote character, and that both are different from the
Microsoft smart quotes.  That is why the ASCII characters were cited.
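
To make the distinction concrete, here is a minimal Python sketch that
reports the code point of each quote-like character in a string.  The
function name describe_quotes and the particular set of smart quotes
checked are assumptions made for illustration; they are not taken from
any CIF specification.

```python
APOSTROPHE = "\u0027"     # ASCII apostrophe '
DOUBLE_QUOTE = "\u0022"   # ASCII double quote "
# Assumed set of "smart" quotes for this sketch (curly single and double).
SMART_QUOTES = {"\u2018", "\u2019", "\u201c", "\u201d"}


def describe_quotes(text):
    """Return (char, code point, kind) for each quote-like character in text."""
    found = []
    for ch in text:
        if ch == APOSTROPHE:
            kind = "ascii apostrophe"
        elif ch == DOUBLE_QUOTE:
            kind = "ascii double quote"
        elif ch in SMART_QUOTES:
            kind = "smart quote"
        else:
            continue  # not a quote-like character
        found.append((ch, "U+%04X" % ord(ch), kind))
    return found


if __name__ == "__main__":
    # Two apostrophes and one double quote look alike in some fonts,
    # but they are different character sequences.
    for sample in ("''red''", '"red"', "\u201cred\u201d"):
        print(repr(sample), describe_quotes(sample))
```

A check of this kind could be run on a file before parsing, so that
look-alike characters are reported rather than silently misread.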

> As Simon points out, the tags are all undelimited strings and therefore are 
> restricted in their allowed character set in CIF2.0. I don't think that any 
> existing tags violate Nick's latest set of excluded characters.  Characters 
> such as % and / do appear in DDL1, though they will probably be removed in 
> DDLm.  If any illegal characters do appear they present no problem as long as 
> a CIF2 application recognizes that it is reading a CIF1.  Where the name 
> appears as an alias in a DDLm dictionary it can be made legal by being 
> quoted.  Providing a CIF1 application can be taught to recognize _ and . as 
> interchangeable it should have no problem in reading CIF2 names, but it may 
> not recognize the names which will be different or absent in the DDL1 
> dictionary.  This could result in a loss of information, which may or may not 
> be important.  It would clearly have serious problems with arrays and other 
> cases where the new delimiters were used.

Yes, the list of excluded characters for tags will be the same as the list 
of excluded characters for non-delimited strings.  I am sure that 
somebody, somewhere has a CIF with a tag that may need to be changed 
because it uses one of the small list of excluded characters, but I think 
this will be a sufficiently rare problem to make hand correction an
acceptable cure.  I will add checks to my software to flag such cases
and provide a warning.  Certainly, for conversion utilities, we will need
to have access from the applications to all the relevant dictionaries, 
except for the simplest, cleanest cases.
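
The kind of check I have in mind might look like the following Python
sketch.  The contents of EXCLUDED_CHARS are a hypothetical placeholder,
since the actual list is whatever the group finally agrees on, and the
function names are likewise invented for illustration.

```python
import warnings

# Hypothetical placeholder: substitute the agreed list of excluded characters.
EXCLUDED_CHARS = frozenset("[]{}")


def find_excluded(tag, excluded=EXCLUDED_CHARS):
    """Return the sorted list of excluded characters that occur in tag."""
    return sorted(set(tag) & set(excluded))


def check_tags(tags, excluded=EXCLUDED_CHARS):
    """Warn about each tag that uses an excluded character.

    Returns a dict mapping each offending tag to its excluded characters,
    so that the cases can be listed for hand correction.
    """
    flagged = {}
    for tag in tags:
        bad = find_excluded(tag, excluded)
        if bad:
            flagged[tag] = bad
            warnings.warn(
                "tag %r uses excluded character(s): %s" % (tag, " ".join(bad))
            )
    return flagged


if __name__ == "__main__":
    print(check_tags(["_cell_length_a", "_my_tag[1]"]))
```

The sketch only flags the tags and does not rewrite them, in keeping
with the idea that hand correction is an acceptable cure for rare cases.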

> I assume that CIF2.0 applies to both the dictionaries and the CIFs 
> themselves.  Are there conditions (like global_) that only apply to 
> dictionaries?  CIFs prepared using the CIF2.0 standards are likely in the 
> first instance to code matrices and vectors as separate elements.  Existing 
> methods can combine these into arrays.  Eventually I foresee that such values 
> will be coded directly as arrays as this is more efficient.  Methods will 
> then be needed to decompose these arrays into their elements in case 
> individual elements need to be retrieved.  I see no problem except that a 
> CIF2.0 coded in this way clearly could not be read by a CIF1 application.

Certainly it would be nice to move over to bracketed formats for matrices 
and vectors in data files.  Once you do that, it would make sense to just 
allow the full recursive use of bracket values in data files.  In order to 
go back to a CIF-1, we will need to, at the very least, embed the 
resulting complex data item into a text field.  In many cases, with the 
dictionaries available, it should be possible to redistribute the complex 
data value items among the appropriate CIF-1 data items, but I am not sure 
this will be needed.
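
As a rough illustration of the two routes back to CIF-1, the Python
sketch below wraps a bracketed value in a semicolon-delimited text field
and, alternatively, spreads a matrix over separate per-element items.
The tag _example_matrix and the _i_j naming scheme are hypothetical; a
real converter would take the element names from the dictionaries.

```python
def to_cif1_text_field(tag, value_text):
    """Embed value_text in a CIF-1 semicolon-delimited text field."""
    # A line starting with ';' inside the value would end the field early.
    if any(line.startswith(";") for line in value_text.splitlines()):
        raise ValueError("value contains a line starting with ';'")
    return "%s\n;\n%s\n;\n" % (tag, value_text)


def split_matrix(base_tag, matrix):
    """Return (tag, element) pairs, one per matrix element.

    Tags are built as base_tag_i_j with 1-based indices; this naming is
    an assumption made for the sketch only.
    """
    return [
        ("%s_%d_%d" % (base_tag, i, j), value)
        for i, row in enumerate(matrix, start=1)
        for j, value in enumerate(row, start=1)
    ]


if __name__ == "__main__":
    print(to_cif1_text_field("_example_matrix", "[[1,0],[0,1]]"))
    for tag, value in split_matrix("_example_matrix", [[1, 0], [0, 1]]):
        print(tag, value)
```

The first route needs no dictionary at all, which is why it is the
minimum that will always be available.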

> The use of an expression such as #CIF2,0 as a magic number as the first 
> string in a CIF could cause problems since the CIF standard states that 
> anything after # is not part of the CIF and can be stripped out without 
> destroying the integrity of the CIF, i.e., anything following # has no 
> bearing on either the syntax or the semantics of the CIF.  Have I missed 
> something here?  Software designed, e.g., to strip out the comments in a 
> template could easily strip out the magic number.  No problem if this is a 
> CIF1 file, but it would create an illegal file if it did this to a CIF2 file. 
> Some legacy software might not be sophisticated enough to recognize the 
> problem.  In general I would strongly advocate using a different initial 
> character for this string.

At some point, software that expects to process a CIF2 will have to get 
the information that it needs to follow CIF2 rules from somewhere.  If all 
it expects to deal with are CIF2 documents, then it is fine.  But if it 
expects to handle both CIF1 and CIF2, then it should pay attention to the 
first couple of comments.  However, to help, I would suggest that for 
CIF2, we adopt the convention of using .cif2 or .cf2, rather than .cif as 
the file extension for a data file, and .dic3 or .cd3 rather than .dic as
the file extension for a DDLm dictionary.
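
A minimal Python sketch of how software handling both versions might
decide which rules to apply: it looks at the first couple of comments
and falls back on the file extension.  The magic prefix "#CIF2" is a
hypothetical stand-in, since the exact string is still under
discussion, and guess_cif_version is a name invented for this example.

```python
import os

CIF2_EXTENSIONS = {".cif2", ".cf2"}  # the extensions suggested above
CIF2_MAGIC = "#CIF2"                 # hypothetical; exact string undecided


def guess_cif_version(filename, leading_lines, magic=CIF2_MAGIC, max_comments=2):
    """Return 2 if the file looks like CIF2, otherwise 1.

    Only the first max_comments comment lines are examined; blank lines
    are skipped and the scan stops at the first non-comment line.
    """
    seen = 0
    for line in leading_lines:
        stripped = line.strip()
        if not stripped:
            continue
        if not stripped.startswith("#"):
            break  # data has started, so no more leading comments
        if stripped.startswith(magic):
            return 2
        seen += 1
        if seen >= max_comments:
            break
    # Fall back on the file extension if no magic comment was found.
    ext = os.path.splitext(filename)[1].lower()
    return 2 if ext in CIF2_EXTENSIONS else 1


if __name__ == "__main__":
    print(guess_cif_version("a.cif", ["#CIF2 example", "data_x"]))
    print(guess_cif_version("b.cf2", ["data_x"]))
    print(guess_cif_version("c.cif", ["# a comment", "data_x"]))
```

With the extension as a second signal, a file whose leading comments
have been stripped by other software can still be recognized.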

Regards,
   Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Thu, 15 Oct 2009, David Brown wrote:

>
> I have just returned from a cruise along the coast of Labrador which started 
> before the current discussions began.  I have spent the last couple of days 
> reading through all 74 contributions that have subsequently arrived in my 
> computer.  Most of the discussion is a little outside my concern with 
> dictionaries, but I have been trying to see if there are any obvious problems 
> with what is proposed.  Perhaps someone can point out if I have 
> misinterpreted anything.
>
> I am adopting James' suggestion that the new standard is sufficiently 
> different to require a major version change to CIF2.0 rather than CIF1.2.
>
> For clarification are two ' (i.e. '') the same as one "?  Some of the 
> illustrations seem to indicate that this is so, but this may be the result of 
> the fonts used which do not distinguish between '' and ".  (This is why I am 
> using Courier.) If they are not the same then '' is not a delimiter and 
> ''red'' would be interpreted as three items '' red '' whereas "red" would be 
> interpreted as the single item red without any quotes.  Are both ''' and """ 
> legal delimiters?
>
> I have been considering the problems of CIF 1.0 and 1.1 files being read by 
> CIF2.0 applications and vice versa.
>
> As Simon points out, the tags are all undelimited strings and therefore are 
> restricted in their allowed character set in CIF2.0. I don't think that any 
> existing tags violate Nick's latest set of excluded characters.  Characters 
> such as % and / do appear in DDL1, though they will probably be removed in 
> DDLm.  If any illegal characters do appear they present no problem as long as 
> a CIF2 application recognizes that it is reading a CIF1.  Where the name 
> appears as an alias in a DDLm dictionary it can be made legal by being 
> quoted.  Providing a CIF1 application can be taught to recognize _ and . as 
> interchangeable it should have no problem in reading CIF2 names, but it may 
> not recognize the names which will be different or absent in the DDL1 
> dictionary.  This could result in a loss of information, which may or may not 
> be important.  It would clearly have serious problems with arrays and other 
> cases where the new delimiters were used.
>
> I assume that CIF2.0 applies to both the dictionaries and the CIFs 
> themselves.  Are there conditions (like global_) that only apply to 
> dictionaries?  CIFs prepared using the CIF2.0 standards are likely in the 
> first instance to code matrices and vectors as separate elements.  Existing 
> methods can combine these into arrays.  Eventually I foresee that such values 
> will be coded directly as arrays as this is more efficient.  Methods will 
> then be needed to decompose these arrays into their elements in case 
> individual elements need to be retrieved.  I see no problem except that a 
> CIF2.0 coded in this way clearly could not be read by a CIF1 application.
>
> The use of an expression such as #CIF2,0 as a magic number as the first 
> string in a CIF could cause problems since the CIF standard states that 
> anything after # is not part of the CIF and can be stripped out without 
> destroying the integrity of the CIF, i.e., anything following # has no 
> bearing on either the syntax or the semantics of the CIF.  Have I missed 
> something here?  Software designed, e.g., to strip out the comments in a 
> template could easily strip out the magic number.  No problem if this is a 
> CIF1 file, but it would create an illegal file if it did this to a CIF2 file. 
> Some legacy software might not be sophisticated enough to recognize the 
> problem.  In general I would strongly advocate using a different initial 
> character for this string.
>
> David
>
>
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group