Discussion List Archives

Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Dear Colleagues,

   Let us look at a recent CIF from Acta C. and the tag 
_reflns_threshold_expression which has the value

       I>3\s(I)

This value would be required to be quoted because it contains '>' and '\'.

   Then, let us look at a recent mmCIF from the PDB, 3jze, which contains the
loop

loop_
_pdbx_struct_assembly_gen.assembly_id
_pdbx_struct_assembly_gen.oper_expression
_pdbx_struct_assembly_gen.asym_id_list
1 1 A,E,F,G,H,I,J,K,GA,B,L,M,N,O,P,Q,R,HA,C,S,T,U,V,W,X,Y,IA,D,Z,AA,BA,CA,DA,EA,FA,JA
2 1 C,S,T,U,V,W,X,Y,IA,A,E,F,G,H,I,J,K,GA,B,L,M,N,O,P,Q,R,HA,D,Z,AA,BA,CA,DA,EA,FA,JA

which would violate the proposed rule because of the embedded commas.
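
As a rough illustration of the kind of check the proposed rule implies, here
is a minimal Python sketch. The full restricted alphabet under discussion is
not reproduced in this message, so the sketch assumes only that '>', '\' and
',' are excluded, as the two examples above indicate. The names
ASSUMED_FORBIDDEN and offending_chars are hypothetical and exist only for
this illustration.

```python
# Assumption: the forbidden set holds only the characters named in this
# message ('>', '\' and ','); the actual proposed alphabet may exclude more.
ASSUMED_FORBIDDEN = frozenset(">\\,")


def offending_chars(value, forbidden=ASSUMED_FORBIDDEN):
    """Return the distinct forbidden characters found in an unquoted value."""
    return sorted(set(value) & forbidden)


# Values taken from the two examples above, plus one unaffected value.
examples = [
    r"I>3\s(I)",
    "A,E,F,G,H,I,J,K,GA",
    "1",
]

for value in examples:
    bad = offending_chars(value)
    status = "would need quoting" if bad else "acceptable unquoted"
    print(f"{value!r}: {status} {bad}")
```

Under that assumption, both the Acta C value and the asym_id_list values are
flagged, while a plain value such as 1 is not.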

Frankly, I cannot imagine writing my software to make the easily 
recognized cases into errors.  The intent is clear.  For code that has the 
luxury of lots of time for warnings, it might be worth providing the 
suggestion that such strings would be better off quoted, but when 
processing imgCIF files, in which we are pushing the limits of system 
speeds for image reading, I suspect we will have to turn those checks and 
warnings off.

I am not saying we should not try to encourage application writers to 
quote as many confusing cases as possible, but I really do not think it 
wise to make the restricted character set the standard for non-delimited 
strings.

I'll stop here, but please note that all I did to get these examples was 
to take the first cif in the current issue of Acta C and the last entry 
released by the PDB.  The proposed non-delimited string change would 
invalidate a lot of existing CIFs.

I think it is very unwise.

Regards,
     Herbert
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Fri, 2 Oct 2009, Herbert J. Bernstein wrote:

> Dear Colleagues,
>
> When we went from CIF 1.0 to CIF 1.1, we all tried very hard to make as many 
> CIF 1.0 files as possible remain valid CIF 1.1 files without the need for any 
> changes.  When DDLm was introduced a promise was made to the community that 
> is still on the IUCr web site in bold face:
>
> "No changes are required in existing archival data files in order to apply 
> domain dictionaries written in DDLm."
>
> If we are now breaking that promise, which it appears we are about to do if 
> we are not very, very careful, then I believe we have an ethical obligation 
> to make that clear to the community and invite them into the discussion.
>
> I have to run to get ready to submit a proposal now, but I will respond more 
> directly to James Hester's questions about the details of how this change 
> impacts existing CIFs later today, but please do take a look at what we said 
> on
>
>  http://www.iucr.org/resources/cif/ddl/ddlm
>
>  Regards,
>    Herbert
>
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
>   Dowling College, Kramer Science Center, KSC 121
>        Idle Hour Blvd, Oakdale, NY, 11769
>
>                 +1-631-244-3035
>                 yaya@dowling.edu
> =====================================================
>
> On Fri, 2 Oct 2009, James Hester wrote:
>
>> Herbert writes:
>>
>> 	" Bottom line -- what is proposed is a very different language
>> 	that will use a significantly different lexer and parser from
>> 	the one used for DDL1 and DDL2 CIFS, guaranteeing to leave us
>> 	with multiple dialects for a very long time.  I think that is
>> 	a shame -- rather than DDLm consolidating DDL1 and DDL2 and
>> 	adding useful new features, we are simply going to end up with
>> 	DDL1, DDL2 and DDL3 as three distinct dialects.
>>
>> 	  I think this is unwise."
>> 
>> In order not to confuse matters, let us restrict the use of the terms
>> DDL1, DDL2 and DDL3 to dictionary definition languages, not the syntax
>> variations we are currently discussing.  I believe Herbert has in mind
>> CIF 1.0, 1.1 and 1.2. I would like to explore his concern about the
>> difference in the proposed CIF 1.2 parser.  Some difference is
>> inevitable in that we have added two new constructs, the triple quote
>> delimited string and the bracketed list.  Because of this, a CIF 1.1
>> parser will break on a CIF 1.2 file regardless of any changes to
>> string content rules, so that is presumably not the main
>> concern. Perhaps the concern is that a CIF1.2 parser will not be able
>> to parse all files built according to previous CIF syntax versions?
>> But this is always going to be the case due to the (theoretical)
>> possibility of a triple quote appearing as a value in a CIF 1.1 file,
>> which would mean a single quote under CIF1.1 rules, but the beginning
>> of a string under CIF1.2. Perhaps Herbert could expand on why this
>> inability of a CIF 1.2 parser to parse a CIF 1.1 file is a problem.
>> 
>> To take the DDL1/DDL2/DDL3 comment at face value, these are by design
>> three distinct dictionary languages, with DDL3 taking the best of DDL1
>> and 2.  I don't see why this is a shame.
>> 
>> Herbert goes on to say:
>>
>> 	  Just to be clear, I do think the restriction on character
>> 	  set of non-delimited strings is unwise -- of all the changes
>> 	  proposed, I believe that it is the one that invalidates the
>> 	  largest number of existing CIFS, and serves no useful
>> 	  purpose that could not be achieved by the simple exclusion
>> 	  of specific cases, as we have already done.
>> 
>> In what sense are existing CIFs 'invalidated'?  They are all still
>> valid CIF1.1 files, which is a published standard.  Perhaps Herbert or
>> somebody could expand on what the real world issues might be because
>> of the proposed change?
>> 
>> Finally, Herbert writes:
>>
>>  "I would also consider all the printable UTF-8 characters as valid."
>> 
>> Herbert, could you please explain in more detail this proposal.  Do
>> you mean that only the one-byte printable UTF-8 characters (= ASCII)
>> are included?  Or do you mean that all of UTF-8 is included,
>> i.e. characters may need up to 4 bytes to be represented?  If the
>> latter, then are we proposing to accept all legal UTF-8 byte values,
>> without using an intermediate representation?  Is this use of UTF8
>> restricted to delimited strings?
>> 
>> 
>> -- 
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>> 
>