
Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Dear Nick,

   If you have persuaded the others to your view, then you will win on a 
straw vote.  I hope you have not persuaded a majority, because I agree 
neither with your premises, nor your conclusions, but the only way to find 
out is to hear from the others.

   I still think the right way to resolve this is to put the items I have 
listed to a vote and then move on.

   Regards,
     Herbert

P.S.  From your comments about binary, it sounds as if you intend to 
"excommunicate" imgCIF from DDLm.  I think that would be a mistake. imgCIF 
will benefit greatly from the use of methods, but at worst, I can always 
go back to the original name:  imgNCIF, where the N stands for "not", and 
use methods without the blessing of it being officially a "CIF" 
dictionary.


=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Fri, 9 Oct 2009, Nick Spadaccini wrote:

>
>
>
> On 9/10/09 5:37 AM, "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
> wrote:
>
>> Dear Colleagues,
>>
>>    I sense a certain strong emotion in this.  I don't think that is the
>> way to resolve this.  Nick has his views.  I have mine.  Neither of us
>> has the final say.  I suggest that these matters be put to a straw
>> vote, tell the community the outcome, and then move on to more
>> substantive issues.
>
> There isn't emotion in this Herb, but when I say something is not negotiable
> it is a statement of fact.
>
> At least we agree on the item you have probably viewed as emotional, my
> statement of non-negotiability on Issue 2. I put it much more strongly: 2.1
> simply is not an option. However, it is not a strictly limited set of
> characters. The only restriction I am suggesting is on those 6 characters that
> are token delimiters.
>
> The problem with your suggestions, Herb, is that they refer to deprecating and
> not enforcing when we are trying to specify the standard. Standards tend to
> be strict, though individual parsers can be liberal in what they do when error
> states arise. That is fair enough, BUT the standard can't be liberal.
>
> As a standard I much prefer 2.3 with the added restrictions for "" and ''
> strings. With that in place, 1.1 doesn't make sense so clearly I prefer 1.2.
>
> UTF-8 introduces strictly binary data into the file. I don't think this is
> the direction to take. Notwithstanding that, most of us wouldn't know how to
> encode into UTF-8. So what are we going to do? We will probably identify
> the characters we want to encode in some ASCII representation, likely Unicode,
> and then use a library function/method to encode them.
>
> To write utf-8 (binary) into the cif file you will have to execute something
> like
>
> outputToCIF("\u1234".encode('utf-8'))
>
> To me it makes more sense to
>
> outputToCIF("\u1234")
>
> And then do the encoding once you read the string in from the CIF. That way
> the CIF remains ascii readable.
>
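Editor's sketch, not part of the original message: the two alternatives contrasted above, written in Python using only built-in codecs. The helper names `to_ascii_escaped` and `from_ascii_escaped` are hypothetical, and Python's own escape syntax stands in for whatever notation the group might adopt.

```python
def to_ascii_escaped(text: str) -> str:
    """Replace each non-ASCII character with a Python-style backslash escape."""
    return text.encode("ascii", "backslashreplace").decode("ascii")

def from_ascii_escaped(text: str) -> str:
    """Turn backslash escapes read from an ASCII file back into characters."""
    return text.encode("ascii").decode("unicode_escape")

value = "\u1234"  # a single character, U+1234

# Alternative 1: encode to UTF-8 before writing; the file holds non-ASCII bytes.
utf8_bytes = value.encode("utf-8")

# Alternative 2: write an ASCII escape, and decode only after reading it back.
escaped = to_ascii_escaped(value)
restored = from_ascii_escaped(escaped)

print(utf8_bytes)         # b'\xe1\x88\xb4'
print(escaped)            # \u1234
print(restored == value)  # True
```

One caveat with this particular stand-in: `unicode_escape` also interprets other backslash sequences (such as `\n`), so a real CIF convention would need its own rule for literal backslashes.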
> I think 1.2, 2.3 with the added restrictions on "" and '', and ascii-fied
> unicode in strings.
>
>>    Issue1:  Removing the requirement for a trailing whitespace after
>> quoted strings outside of bracketed constructs.
>>    Options:  1.1. Preserve the current convention as is
>>              1.2. Terminate all quoted strings on the occurrence of the
>> trailing quoted delimiter without consideration of the next character
>>
>>    Issue2:  Restriction of the character set for non-delimited strings
>> outside of bracketed constructs
>>    Options  2.1.  Preserve the current convention as is
>>             2.2.  Modify the current convention to deprecate use of
>>                   any characters other than a strictly limited set
>>                   of characters, adding a warning on reads and
>>                   defaulting to add quote marks on write
>>             2.3.  Modify the current convention to forbid the use of
>>                   any characters other than a strictly limited set
>>                   of characters, making it an error to read a non-delimited
>>                   string that does not comply even if the intention
>>                   can be inferred from context
>>
>>     Issue 3:  Use of UTF-8
>>     Options:  3.1.  Do not use UTF-8
>>               3.2.  Use UTF-8
>>
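Editor's sketch, not part of the original message: one way the difference between options 2.2 and 2.3 might look in a reader and writer. The function names are hypothetical, and the forbidden set is an assumption taken from the six characters named elsewhere in this thread (" ' , : { }) plus whitespace; the thread does not settle the final set.

```python
import warnings

# Assumption: the six token delimiters discussed in the thread.
FORBIDDEN = set("\"',:{}")

def offending_chars(value: str) -> set:
    """Characters that would not be allowed in a non-delimited string."""
    return {ch for ch in value if ch in FORBIDDEN or ch.isspace()}

def read_bare_value(value: str, strict: bool) -> str:
    """strict=True behaves like option 2.3 (error); strict=False like 2.2 (warning)."""
    bad = offending_chars(value)
    if bad:
        message = f"non-delimited string {value!r} contains {sorted(bad)}"
        if strict:
            raise ValueError(message)
        warnings.warn(message, DeprecationWarning)
    return value

def write_value(value: str) -> str:
    """Option 2.2 on write: add quote marks when the bare form is not allowed."""
    if value and not offending_chars(value):
        return value
    if "\n" in value:
        raise ValueError("multi-line value needs a semicolon-delimited text field")
    if '"' not in value:
        return f'"{value}"'
    if "'" not in value:
        return f"'{value}'"
    raise ValueError("value contains both quote characters")

print(write_value("1.234"))               # 1.234
print(write_value("http://example.org"))  # "http://example.org"
```

Under this sketch the same file is accepted with a warning in the 2.2 mode and rejected in the 2.3 mode, which is the practical difference being voted on.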
>> My votes would be 1.1, 2.2, 3.2
>>
>> Whatever the outcome of the vote, I will code at least one variant of a
>> parser to comply, but it will take longer if the vote goes for 1.2 and
>> 2.3.
>>
>>    Regards,
>>      Herbert
>>
>>
>> On Fri, 9 Oct 2009, Nick Spadaccini wrote:
>>
>>> Ok. Back on board. I am proposing some old and some new stuff here. From the
>>> beginning,
>>>
>>> (1) restricting the character set of non-delimited strings is
>>> NON-NEGOTIABLE. If we don't restrict it, then we can't build recursive data
>>> structures and exploit DDLm. If we aren't going to exploit DDLm, IUCr should
>>> drop it now and stick with its current DDL.
>>>
>>> IUCr needs to make that decision now.
>>>
>>> I have built a new lexer for the current syntax specification and checked
>>> for cases where
>>>
>>> (1) a double-quote-delimited string contains a double quote.
>>> (2) a single-quote-delimited string contains a single quote.
>>> (3) a non-delimited string contains any of " ' , : { }
>>> (4) a data name contains any of (3)
>>>
>>> The contents of (3) are (I think) a sufficient restriction on non-delimited
>>> strings to enable us to move forward.
>>>
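Editor's sketch, not part of the original message: the four checks listed above, expressed as a small Python survey over already-tokenized values. This is not Nick's lexer; the `(kind, text)` input format and the names `fails_check` and `survey` are assumptions made for illustration.

```python
from collections import Counter

TOKEN_DELIMITERS = set("\"',:{}")  # the six characters listed in check (3)

def fails_check(kind: str, text: str) -> bool:
    """Return True when `text` fails the check for its kind.

    kind is "dq" (double-quoted), "sq" (single-quoted), "bare"
    (non-delimited string) or "name" (data name); `text` is the
    content without its surrounding delimiters.
    """
    if kind == "dq":
        return '"' in text                       # check (1)
    if kind == "sq":
        return "'" in text                       # check (2)
    if kind in ("bare", "name"):
        return any(ch in TOKEN_DELIMITERS for ch in text)  # checks (3), (4)
    raise ValueError(f"unknown kind: {kind}")

def survey(items):
    """Count totals and failures per kind for (kind, text) pairs."""
    totals, failures = Counter(), Counter()
    for kind, text in items:
        totals[kind] += 1
        failures[kind] += fails_check(kind, text)
    return totals, failures

sample = [
    ("dq", "O'Brien"),
    ("sq", "it's"),
    ("bare", "http://example.org"),
    ("bare", "1.234"),
    ("name", "_atom_site.label"),
]
totals, failures = survey(sample)
print(dict(totals))
print(dict(failures))
```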
>>> I have scanned 10345 of the 60173 (17%) mmCIF files in the archive. The
>>> results are
>>>
>>> (1) 0 of the 3.4M (M = million) data values failed the test.
>>>
>>> (2) 4 of the 1.3M data values failed the test.
>>> When I pointed these out to John he said these SHOULD have been in
>>> semi-colon delimited text because at the PDB they have been systematically
>>> dealing with quotes within quotes to avoid parsing problems.
>>>
>>> HENCE not allowing a string delimiter character within the string delimited
>>> by the same character poses very little or no problem in mmCIF.
>>>
>>> (3) 138,733 of the 2,009M data values failed the test (.007%)
>>>
>>> Again the magnitude of the problem has been exaggerated. The restrictions
>>> will not affect many of the archived data items. All the failures were
>>> limited to 3-5 data names. These were those with embedded : which includes
>>> the specification of a URL, and those with embedded , to which Herb has
>>> already alluded. John has stipulated that the restrictions we are
>>> suggesting can be quickly and efficiently implemented (I am here and have
>>> looked at their systems: the change is a single edit to a dictionary entry,
>>> and all software handles it immediately). I believe the PDB has a
>>> remediation process that will resolve all legacy issues (at least for them).
>>>
>>> Conclusion: This restriction has minimal (.007%) impact on how things have
>>> been done, and can be easily implemented for files from here on.
>>>
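Editor's sketch, not part of the original message: the quoted percentages can be reproduced from the counts given above. The only assumption is that "2,009M" means 2,009 million non-delimited data values.

```python
files_scanned, files_total = 10_345, 60_173
failed, checked = 138_733, 2_009_000_000  # "2,009M" read as 2,009 million

scan_share = f"{files_scanned / files_total:.1%}"
fail_share = f"{failed / checked:.4%}"

print(scan_share)  # 17.2%
print(fail_share)  # 0.0069%
```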
>>> (4) 0 data names contain these characters.
>>>
>>> I will not comment further on this point until I have done the same analysis
>>> for the IUCr archive. I suspect the problem will be bigger for those files
>>> because they represent a more lackadaisical period in CIF's evolution where
>>> we suggested you could do whatever you want etc, and also there are IUCr
>>> mark ups that likely cause problems. Once I get my hands on that archive I
>>> will let people know.
>>>
>>> Now guess what? If we don't allow a ' within a '..' and a " within a ".."
>>> and any "',:{} within a non-delimited string or a data name WE DON'T NEED A
>>> SPACE BEFORE OR AFTER THE TOKEN DELIMITER. This simplifies AND more
>>> importantly NORMALIZES the grammar.
>>>
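Editor's sketch, not part of the original message: a toy Python tokenizer showing why the restrictions above make whitespace around delimiters unnecessary. It is not a CIF grammar. It assumes quoted strings never contain their own delimiter and that non-delimited strings never contain " ' , : { } or whitespace; the names `TOKEN` and `tokenize` are hypothetical.

```python
import re

TOKEN = re.compile(r"""
    \s*(?:
        "(?P<dq>[^"]*)"          # double-quoted string
      | '(?P<sq>[^']*)'          # single-quoted string
      | (?P<punct>[{},:])        # structural character
      | (?P<bare>[^\s"',:{}]+)   # non-delimited string
    )
""", re.VERBOSE)

def tokenize(text: str):
    """Split text into (kind, value) tokens without needing any whitespace."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"cannot tokenize at offset {pos}")
        kind = m.lastgroup            # name of the alternative that matched
        tokens.append((kind, m.group(kind)))
        pos = m.end()
    return tokens

for token in tokenize("""{"a":1,'b':x}"""):
    print(token)
```

Because every token ends at the first delimiter character, no lookahead is needed to decide where one token stops and the next begins.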
>>> I don't accept the argument that the new parser is so much more difficult
>>> than existing parsers. Currently you have (if you are inside a double quote
>>> delimited string)
>>>
>>> if (c == '"') {
>>>     tmpchar = lookahead(1);
>>>     if (tmpchar == ' ') return END_OF_STRING;  /* quote followed by a space ends the string */
>>>     else continue;                             /* otherwise the quote is part of the string */
>>> }
>>>
>>> In the new parser you will have
>>>
>>> if (c == '"') return END_OF_STRING;
>>>
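Editor's sketch, not part of the original message: the two termination rules side by side in Python. The function names are hypothetical, and the "current" rule is simplified to "closing quote followed by whitespace or end of input"; line-ending rules are ignored.

```python
def end_of_dq_string_current(text: str, start: int) -> int:
    """Current rule: a closing quote counts only if whitespace (or end of input) follows.

    `start` is the index just after the opening quote; the result is the
    index of the closing quote.
    """
    i = start
    while i < len(text):
        if text[i] == '"' and (i + 1 == len(text) or text[i + 1].isspace()):
            return i
        i += 1
    raise ValueError("unterminated string")

def end_of_dq_string_proposed(text: str, start: int) -> int:
    """Proposed rule: the first double quote after the opening one ends the string."""
    i = text.find('"', start)
    if i == -1:
        raise ValueError("unterminated string")
    return i

line = '"a"b" c'
print(line[1:end_of_dq_string_current(line, 1)])   # a"b
print(line[1:end_of_dq_string_proposed(line, 1)])  # a
```

The same input yields different string values under the two rules, which is why the proposal pairs the simpler rule with a ban on embedding the delimiter in the string.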
>>> YOU WILL NOTE:
>>>
>>> I have not included the [] characters in the restriction. There is too much
>>> legacy associated with their existence in data names in both small and mm
>>> CIFs.
>>>
>>> I am going to suggest a single token to represent lists, lists of lists and
>>> associative arrays, namely  {...}. These are new, and don't present a
>>> problem.
>>>
>>> UTF-8 encoding. This is a 1-4 byte variable encoding schema (actually
>>> originally up to 6 bytes providing 31 bits of representation). It is a
>>> binary representation. The encoding algorithm is not brain busting, but
>>> neither is it trivial. Having a CIF file not editable by a bog standard
>>> editor will upset some people. I propose the introduction of a new string
>>> type within the DDLm semantics that allows one to define it to be Unicode.
>>> Within the string I propose we adopt a \uABCD[EF] (ie 1-6 HEX characters) to
>>> represent the character. Equally we could go with the HTML approach of
>>> &#xABCDEF; (ie 1-6 HEX characters).
>>>
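Editor's sketch, not part of the original message: the 1-4 byte UTF-8 encoding written out by hand in Python, plus a decoder for the HTML-style `&#x...;` notation mentioned above. The names `utf8_encode` and `decode_html_style` are hypothetical; in practice the built-in `str.encode("utf-8")` does the same job as the first function.

```python
import re

def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as 1 to 4 UTF-8 bytes."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {code_point:#x}")
    if code_point < 0x80:                     # up to 7 bits:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                    # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                  # up to 16 bits: 1110xxxx + 2 continuation bytes
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),  # up to 21 bits: 11110xxx + 3 continuation bytes
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

HTML_STYLE = re.compile(r"&#x([0-9A-Fa-f]{1,6});")

def decode_html_style(text: str) -> str:
    """Replace each &#x...; escape (1-6 hex digits) with its character."""
    return HTML_STYLE.sub(lambda m: chr(int(m.group(1), 16)), text)

print(utf8_encode(0x1234))                        # b'\xe1\x88\xb4'
print(decode_html_style("&#x1234;") == "\u1234")  # True
```

The HTML-style form carries its own terminator (`;`), so a variable number of hex digits is unambiguous; a `\u` form with 1-6 digits would need an extra rule for a hex digit that immediately follows the escape.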
>>> I also strongly propose support for the UNICODE string within """ strings
>>> ONLY. Let's start from a restrictive stance from the outset.
>>>
>>> I will be arriving at Dowling at about noon on Wednesday Herb. I'll bring my
>>> boxing gloves, Frances can referee :)
>>>
>>> Nick
>>>
>>> On 6/10/09 11:01 PM, "James Hester" <jamesrhester@gmail.com> wrote:
>>>
>>>> Dear All:
>>>>
>>>> As a result of the discussion with Herbert I can see two differing
>>>> approaches to these CIF syntax changes:
>>>>
>>>> 1. Any changes to CIF syntax should be such that earlier syntax
>>>> versions form a subset of the new syntax, i.e. files in the older
>>>> syntax will also conform to the new syntax
>>>>
>>>> or
>>>>
>>>> 2. When making changes to the standard, the opportunity should be
>>>> taken to simplify and streamline syntax as much as possible.
>>>>
>>>> Advantages of (1): a single CIF parser can be maintained for all
>>>> syntax versions; a CIF writer is always conformant to the latest
>>>> version and only needs changing if new syntax features are to be used;
>>>> the existing CIF software ecosystem is minimally affected
>>>>
>>>> Advantages of (2): implementation of CIF readers/writers from scratch
>>>> is easier; the standard is easier to define formally and more
>>>> aesthetically pleasing; mistakes in previous versions can be fixed,
>>>> warts do not accumulate
>>>>
>>>> I would like to suggest we act as follows: in essence, we deprecate
>>>> rather than exclude.  In detail:
>>>>
>>>> 1. For this edition of the standard (1.2) we follow Herbert's line,
>>>> leaving everything currently defined untouched.  We simply add triple
>>>> quote delimited strings and bracket expressions.  The content of
>>>> non-delimited strings in bracket expressions will be as proposed by
>>>> Nick.
>>>>
>>>> 2. In the documents associated with the new standard we strongly
>>>> suggest that all non-delimited strings use the same character set as
>>>> for non-delimited strings in bracket expressions (i.e. Nick's original
>>>> proposal).  We might point out that this simplifies code for writing
>>>> CIFs, and perhaps (if all agree) we add that using the CIF1.1
>>>> non-delimited string character set is deprecated, darkly foreshadowing
>>>> that a future version of the syntax standard will adopt this character
>>>> set for all non-delimited strings.
>>>>
>>>> 3. We also deprecate including string delimiters inside strings,
>>>> regardless of whitespace issues.
>>>>
>>>> 4. In all dictionaries we adopt the restricted character set for
>>>> non-delimited strings and exclusion of string delimiters in strings.
>>>>
>>>> 5. We ask that CheckCIF emit a warning about use of deprecated
>>>> characters in non-delimited strings
>>>>
>>>> 6. When (say in 10 years' time) a sufficiently large proportion of
>>>> incoming CIFs conform to the new non-delimited string character set,
>>>> we promulgate the 1.3 version of the standard.
>>>>
>>>
>>> cheers
>>>
>>> Nick
>>>
>>> --------------------------------
>>> Associate Professor N. Spadaccini, PhD
>>> School of Computer Science & Software Engineering
>>>
>>> The University of Western Australia    t: +61 (0)8 6488 3452
>>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>>> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
>>> MBDP  M002
>>>
>>> CRICOS Provider Code: 00126G
>>>
>>> e: Nick.Spadaccini@uwa.edu.au
>>>
>>>
>>>
>>>
>>>
>>> _______________________________________________
>>> ddlm-group mailing list
>>> ddlm-group@iucr.org
>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>
>
> cheers
>
> Nick
>
