Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

To: [email protected], Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
From: "Herbert J. Bernstein" <[email protected]>
Date: Fri, 9 Oct 2009 18:05:55 -0400 (EDT)
In-Reply-To: <C6F5BF24.1200E%[email protected]>
References: <C6F5BF24.1200E%[email protected]>
Dear Nick,

   imgCIF and CBFlib have supported UTF-8 and will continue to do so.  Any
application that supports ASCII can trivially support UTF-8.  In addition,
one of the encodings in imgCIF and CBFlib uses UTF-16/UCS-2.  If these
are not valid for a CIF, we can always go back to using the name imgNCIF.

   Most larger CIFs are no more readable than Postscript files, but there
are editors that do a nice job of displaying UTF-8 properly.  I use them
for the multi-lingual strings for the message catalog for RasMol.  The
world has many languages, and it makes sense for a data representation
language to be able to handle them.  Even for the western European 
languages, UTF-8 makes much more sense than using national code pages.
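
The claim that ASCII support carries over to UTF-8 rests on ASCII being a byte-for-byte subset of UTF-8. A minimal sketch, assuming Python 3; the sample CIF line and the author name are illustrative only and do not come from any real file:

```python
# Pure-ASCII text produces identical bytes under both encodings.
ascii_text = "_cell_length_a 10.5"          # illustrative CIF-style line
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# A non-ASCII character becomes a multi-byte sequence in which every
# byte has its high bit set, so it can never be mistaken for an ASCII
# delimiter such as a quote or a space.
u_umlaut_bytes = "ü".encode("utf-8")         # two bytes: 0xC3 0xBC
all_high_bit = all(b >= 0x80 for b in u_umlaut_bytes)

print(same_bytes, u_umlaut_bytes, all_high_bit)
```

An ASCII-only tokenizer that passes bytes at or above 0x80 through untouched would therefore still find the same delimiters.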

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  [email protected]
=====================================================

On Sat, 10 Oct 2009, Nick Spadaccini wrote:

> I am willing to be convinced on 3.2 vs an ascii based representation. But I
> need answers to several things.
>
> UTF-8 removes text readability from CIF, which is something many still
> hold dear, but that may be a cost.
>
> However here is a practical example. A user wishes to add the author names
> to an existing CIF. They fire up vim or emacs. The possible non-readability
> of the file already presents a problem. But more importantly how do they
> inject the utf-8 coding equivalent of what they need?
>
> On 10/10/09 3:34 AM, "SIMON WESTRIP" <[email protected]> wrote:
>
>> Dear all
>>
>> Without having discussed this with the IUCr, my vote would be:
>>
>> 1.2 - delimiters lose trailing whitespace condition
>>
>> 2.3 - restricted char set of non-delimited strings
>>
>> Although I'm sure these two will 'invalidate' many archived CIFs (IUCr
>> archives), just as our software is able to recognize the current specs,
>> it could equally use the same ability to 'remediate' any offending items.
>> Granted this is not an ideal situation, but I don't think the current
>> use of delimiters is ideal either (based on experience of handling CIFs that
>> were edited manually - though fortunately this is not such common practice
>> these days). So if these changes are necessary to realize the potential of
>> DDLm, I have no major objections.
>>
>> 3.2 - allow UTF-8
>>
>> Though this would probably require far more effort from CIF developers than
>> handling the first two changes, in the longer term I'm not sure this should be
>> ruled out. After all, support for such encoding is growing (dare I mention
>> XML?), and the rendering issues are far less of a problem than a few years
>> back (widespread font support).
>>
>> That said, I have to confess that support for 3.2 is partly driven by the fact
>> that a large part of the development of software to support the IUCr's CIF
>> publishing activities involves translation from UTF-8 to ASCII CIF codes;
>> furthermore, we are actively looking at ways to include 'richer' content in
>> CIFs.
>> So for my part, I would at least like to see support for an ASCII-based
>> representation of a wider character set.
>>
>> I have to stress that these are my views (as someone who writes CIF
>> applications for the IUCr) - I've yet to speak with Brian et al. regarding an
>> 'official' view on these matters.
>>
>> Anyway, hope this helps in your deliberations.
>>
>> Cheers
>>
>> Simon
>>
>> Simon P. Westrip
>>
>>
>> From: Herbert J. Bernstein <[email protected]>
>> To: [email protected]; Group finalising DDLm and associated
>> dictionaries <[email protected]>
>> Sent: Friday, 9 October, 2009 1:45:01
>> Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
>>
>> Dear Nick,
>>
>>    If you have persuaded the others to your view, then you will win on a
>> straw vote.  I hope you have not persuaded a majority, because I agree
>> neither with your premises, nor your conclusions, but the only way to find
>> out is to hear from the others.
>>
>>    I still think the right way to resolve this is to put the items I have
>> listed to a vote and then move on.
>>
>>    Regards,
>>      Herbert
>>
>> P.S.  From your comments about binary, it sounds as if you intend to
>> "excommunicate" imgCIF from DDLm.  I think that would be a mistake. imgCIF
>> will benefit greatly from the use of methods, but at worst, I can always
>> go back to the original name:  imgNCIF, where the N stands for "not", and
>> use methods without the blessing of it being officially a "CIF"
>> dictionary.
>>
>>
>>
>> On Fri, 9 Oct 2009, Nick Spadaccini wrote:
>>
>>>>
>>>>
>>>>
>>>> On 9/10/09 5:37 AM, "Herbert J. Bernstein" <[email protected]>
>>>> wrote:
>>>>
>>>>>> Dear Colleagues,
>>>>>>
>>>>>>    I sense a certain strong emotion in this.  I don't think that is the
>>>>>> way to resolve this.  Nick has his views.  I have mine.  Neither of us
>>>>>> has the final say.  I suggest that these matters be put to a straw
>>>>>> vote, tell the community the outcome, and then move on to more
>>>>>> substantive issues.
>>>>
>>>> There isn't emotion in this, Herb, but when I say something is not
>>>> negotiable it is a statement of fact.
>>>>
>>>> At least we agree on the item you have probably viewed as emotional, my
>>>> statement of non-negotiability on Issue 2. I put it much more strongly:
>>>> 2.1 simply is not an option. However it is not a strictly limited set of
>>>> characters. The only restriction I am suggesting is on those 6 characters
>>>> that are token delimiters.
>>>>
>>>> The problem with your suggestions, Herb, is that they refer to deprecating
>>>> and not enforcing when we are trying to specify the standard. Standards
>>>> tend to be strict, though individual parsers can be liberal in what they
>>>> do when error states arise. That is fair enough, BUT the standard can't
>>>> be liberal.
>>>>
>>>> As a standard I much prefer 2.3 with the added restrictions for "" and ''
>>>> strings. With that in place, 1.1 doesn't make sense so clearly I prefer
>>>> 1.2.
>>>>
>>>> UTF-8 introduces strictly binary data into the file. I don't think this is
>>>> the direction to take. Notwithstanding that, most of us wouldn't know how
>>>> to encode into UTF-8. So what are we going to do? We will probably identify
>>>> the characters we want to encode in some ASCII representation, likely
>>>> Unicode, and then use a library function/method to encode it.
>>>>
>>>> To write UTF-8 (binary) into the CIF file you will have to execute
>>>> something like
>>>>
>>>> outputToCIF("\u1234".encode('utf-8'))
>>>>
>>>> To me it makes more sense to
>>>>
>>>> outputToCIF("\u1234")
>>>>
>>>> And then do the encoding once you read the string in from the CIF. That way
>>>> the CIF remains ascii readable.
>>>>
>>>> I think 1.2, 2.3 with the added restrictions on "" and '', and ascii-fied
>>>> unicode in strings.
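
The "ascii-fied unicode" alternative can be sketched as a pair of helpers that keep the file pure ASCII and convert at the boundary. This is only an illustration: the names `asciify` and `deasciify` are hypothetical, a fixed four-hex-digit `\uXXXX` form is assumed (the thread does not settle the escape width), and a literal backslash-u already present in the input is not protected.

```python
import re

def asciify(text: str) -> str:
    """Replace each non-ASCII character with an assumed \\uXXXX escape."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                 # plain ASCII passes through
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)     # four hex digits, upper case
        else:
            raise ValueError("code point above U+FFFF not handled in this sketch")
    return "".join(out)

_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")

def deasciify(text: str) -> str:
    """Turn \\uXXXX escapes back into the characters they name."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

escaped = asciify("M\u00fcller")           # "Müller" written with an escape
print(escaped)
print(deasciify(escaped) == "M\u00fcller")
```

The trade-off under discussion is visible here: the stored text stays editable in any ASCII editor, at the price of every reader and writer agreeing on the escape convention.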
>>>>
>>>>>>    Issue1:  Removing the requirement for a trailing whitespace after
>>>>>> quoted strings outside of bracketed constructs.
>>>>>>    Options:  1.1. Preserve the current convention as is
>>>>>>              1.2. Terminate all quoted strings on the occurrence of the
>>>>>> trailing quoted delimiter without consideration of the next character
>>>>>>
>>>>>>    Issue2:  Restriction of the character set for non-delimited strings
>>>>>> outside of bracketed constructs
>>>>>>    Options  2.1.  Preserve the current convention as is
>>>>>>             2.2.  Modify the current convention to deprecate use of
>>>>>>                   any characters other than a strictly limited set
>>>>>>                   of characters, adding a warning on reads and
>>>>>>                   defaulting to add quote marks on write
>>>>>>             2.3.  Modify the current convention to forbid the use of
>>>>>>                   any characters other than a strictly limited set
>>>>>>                   of characters, making it an error to read a
>>>>>>                   non-delimited string that does not comply even if
>>>>>>                   the intention can be inferred from context
>>>>>>
>>>>>>     Issue 3:  Use of UTF-8
>>>>>>     Options:  3.1.  Do not use UTF-8
>>>>>>               3.2.  Use UTF-8
>>>>>>
>>>>>> My votes would be 1.1, 2.2, 3.2
>>>>>>
>>>>>> Whatever the outcome of the vote, I will code at least one variant of a
>>>>>> parser to comply, but it will take longer if the vote goes for 1.2 and
>>>>>> 2.3.
>>>>>>
>>>>>>    Regards,
>>>>>>      Herbert
>>>>>>
>>>>>>
>>>>>> On Fri, 9 Oct 2009, Nick Spadaccini wrote:
>>>>>>
>>>>>>>> Ok. Back on board. I am proposing some old and some new stuff here.
>>>>>>>> From the beginning,
>>>>>>>>
>>>>>>>> (1) restricting the character set of non-delimited strings is
>>>>>>>> NON-NEGOTIABLE. If we don't restrict it, then we can't build recursive
>>>>>>>> data structures and exploit DDLm. If we aren't going to exploit DDLm,
>>>>>>>> IUCr should drop it now and stick with its current DDL.
>>>>>>>>
>>>>>>>> IUCr needs to make that decision now.
>>>>>>>>
>>>>>>>> I have built a new lexer for the current syntax specification and
>>>>>>>> checked for cases where
>>>>>>>>
>>>>>>>> (1) a double-quote-delimited string contains a double quote.
>>>>>>>> (2) a single-quote-delimited string contains a single quote.
>>>>>>>> (3) a non-delimited string contains any of " ' , : { }
>>>>>>>> (4) a data name contains any of (3)
>>>>>>>>
>>>>>>>> The contents of (3) are (I think) a sufficient restriction on
>>>>>>>> non-delimited strings to enable us to move forward.
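
The four checks listed above can be sketched as simple predicates over already-tokenized values. This is not the lexer described in the message; the function names are hypothetical and tokenization is assumed to have happened elsewhere.

```python
# The six characters named in check (3): " ' , : { }
FORBIDDEN = set("\"',:{}")

def violates_double_quoted(content: str) -> bool:
    """Check (1): a double-quote-delimited string contains a double quote."""
    return '"' in content

def violates_single_quoted(content: str) -> bool:
    """Check (2): a single-quote-delimited string contains a single quote."""
    return "'" in content

def violates_non_delimited(token: str) -> bool:
    """Check (3): a non-delimited string contains any forbidden character."""
    return any(ch in FORBIDDEN for ch in token)

def violates_data_name(name: str) -> bool:
    """Check (4): a data name contains any of the characters from (3)."""
    return any(ch in FORBIDDEN for ch in name)

# An embedded colon (as in a URL) fails; square brackets are not restricted.
print(violates_non_delimited("http://example.org"))
print(violates_non_delimited("a[1]"))
```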
>>>>>>>>
>>>>>>>> I have scanned 10345 of the 60173 (17%) mmCIF files in the archive. The
>>>>>>>> results are
>>>>>>>>
>>>>>>>> (1) 0 of the 3.4M (M = million) data values failed the test.
>>>>>>>>
>>>>>>>> (2) 4 of the 1.3M data values failed the test.
>>>>>>>> When I pointed these out to John he said these SHOULD have been in
>>>>>>>> semi-colon delimited text because at the PDB they have been
>>>>>>>> systematically dealing with quotes within quotes to avoid parsing
>>>>>>>> problems.
>>>>>>>>
>>>>>>>> HENCE not allowing a string delimiter character within the string
>>>>>>>> delimited by the same character poses very little or no problem in
>>>>>>>> mmCIF.
>>>>>>>>
>>>>>>>> (3) 138,733 of the 2,009M data values failed the test (.007%)
>>>>>>>>
>>>>>>>> Again the magnitude of the problem has been exaggerated. The
>>>>>>>> restrictions will not affect many of the archived data items. All the
>>>>>>>> failures were limited to 3-5 data names. These were those with embedded
>>>>>>>> : which includes the specification of a URL, and those with embedded ,
>>>>>>>> to which Herb has already alluded. John has stipulated that those
>>>>>>>> restrictions we are suggesting can be quickly and efficiently
>>>>>>>> implemented (I am here and looked at their systems and the changes are
>>>>>>>> a single change to a dictionary entry and all software handles the
>>>>>>>> change immediately). I believe the PDB has a remediation process that
>>>>>>>> will resolve all legacy issues (at least for them).
>>>>>>>>
>>>>>>>> Conclusion: This restriction has minimal (.007%) impact on how things
>>>>>>>> have been done, and can be easily implemented for files from here on.
>>>>>>>>
>>>>>>>> (4) 0 data names contain these characters.
>>>>>>>>
>>>>>>>> I will not comment further on this point until I have done the same
>>>>>>>> analysis for the IUCr archive. I suspect the problem will be bigger for
>>>>>>>> those files because they represent a more lackadaisical period in CIF's
>>>>>>>> evolution where we suggested you could do whatever you want etc, and
>>>>>>>> also there are IUCr mark ups that likely cause problems. Once I get my
>>>>>>>> hands on that archive I will let people know.
>>>>>>>>
>>>>>>>> Now guess what? If we don't allow a ' within a '..' and a " within a
>>>>>>>> ".." and any "',:{} within a non-delimited string or a data name WE
>>>>>>>> DON'T NEED A SPACE BEFORE OR AFTER THE TOKEN DELIMITER. This simplifies
>>>>>>>> AND more importantly NORMALIZES the grammar.
>>>>>>>>
>>>>>>>> I don't accept the argument that the new parser is so much more
>>>>>>>> difficult than existing parsers. Currently you have (if you are inside
>>>>>>>> a double quote delimited string)
>>>>>>>>
>>>>>>>> if (ch == '"') {
>>>>>>>>   tmpchar = lookahead(1);   /* peek at the character after the quote */
>>>>>>>>   if (tmpchar == ' ') return END_OF_STRING;
>>>>>>>>   else continue;            /* the quote is part of the string */
>>>>>>>> }
>>>>>>>>
>>>>>>>> In the new parser you will have
>>>>>>>>
>>>>>>>> if (ch == '"') return END_OF_STRING;
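
The two termination rules contrasted above can be compared side by side. A minimal sketch, with hypothetical function names; treating any whitespace (not only a space) and end of input as valid followers of the closing quote is an assumption of this sketch, not something stated in the message.

```python
def end_of_quoted_current(text: str, start: int) -> int:
    """Current rule: a quote closes the string only when whitespace
    (or, by assumption here, end of input) follows it."""
    quote = text[start]
    i = start + 1
    while i < len(text):
        if text[i] == quote and (i + 1 == len(text) or text[i + 1].isspace()):
            return i
        i += 1
    return -1  # unterminated string

def end_of_quoted_proposed(text: str, start: int) -> int:
    """Proposed rule: the first matching quote always closes the string."""
    return text.find(text[start], start + 1)

sample = '"a"b" c'
print(end_of_quoted_current(sample, 0))   # closing quote found at index 4
print(end_of_quoted_proposed(sample, 0))  # closing quote found at index 2
```

The same input therefore tokenizes differently under the two rules, which is the source of the remediation concern for archived files that embed a delimiter inside a quoted string.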
>>>>>>>>
>>>>>>>> YOU WILL NOTE:
>>>>>>>>
>>>>>>>> I have not included the [] characters in the restriction. There is too
>>>>>>>> much legacy associated with their existence in data names in both small
>>>>>>>> and mm CIFs.
>>>>>>>>
>>>>>>>> I am going to suggest a single token to represent lists, lists of lists
>>>>>>>> and associative arrays, namely  {...}. These are new, and don't present
>>>>>>>> a problem.
>>>>>>>>
>>>>>>>> UTF-8 encoding. This is a 1-4 byte variable encoding schema (actually
>>>>>>>> originally up to 6 bytes providing 31 bits of representation). It is a
>>>>>>>> binary representation. The encoding algorithm is not brain busting, but
>>>>>>>> neither is it trivial. Having a CIF file not editable by a bog standard
>>>>>>>> editor will upset some people. I propose the introduction of a new
>>>>>>>> string type within the DDLm semantics that allows one to define it to
>>>>>>>> be Unicode. Within the string I propose we adopt a \uABCD[EF] (ie 1-6
>>>>>>>> HEX characters) to represent the character. Equally we could go with
>>>>>>>> the HTML approach of &#xABCDEF; (ie 1-6 HEX characters).
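
For reference, the 1-4 byte encoding mentioned above can be written out by hand in a few lines. A minimal sketch of the modern (RFC 3629) form only, not the older 6-byte scheme; `utf8_encode` is a hypothetical name, and in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as 1 to 4 UTF-8 bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode scalar value")
    if cp < 0x80:                       # 7 bits: a single ASCII byte
        return bytes([cp])
    if cp < 0x800:                      # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 21 bits: 11110xxx then three 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Cross-check against the built-in encoder for one code point per length.
for cp in (0x41, 0xFC, 0x1234, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
print(utf8_encode(0x1234).hex())
```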
>>>>>>>>
>>>>>>>> I also strongly propose support for the UNICODE string within """
>>>>>>>> strings ONLY. Let's start from a restrictive stance from the outset.
>>>>>>>>
>>>>>>>> I will be arriving at Dowling at about noon on Wednesday Herb. I'll
>>>>>>>> bring my boxing gloves, Frances can referee :)
>>>>>>>>
>>>>>>>> Nick
>>>>>>>>
>>>>>>>> On 6/10/09 11:01 PM, "James Hester" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>>> Dear All:
>>>>>>>>>>
>>>>>>>>>> As a result of the discussion with Herbert I can see two differing
>>>>>>>>>> approaches to these CIF syntax changes:
>>>>>>>>>>
>>>>>>>>>> 1. Any changes to CIF syntax should be such that earlier syntax
>>>>>>>>>> versions form a subset of the new syntax, i.e. files in the older
>>>>>>>>>> syntax will also conform to the new syntax
>>>>>>>>>>
>>>>>>>>>> or
>>>>>>>>>>
>>>>>>>>>> 2. When making changes to the standard, the opportunity should be
>>>>>>>>>> taken to simplify and streamline syntax as much as possible.
>>>>>>>>>>
>>>>>>>>>> Advantages of (1): a single CIF parser can be maintained for all
>>>>>>>>>> syntax versions; a CIF writer is always conformant to the latest
>>>>>>>>>> version and only needs changing if new syntax features are to be
>>>>>>>>>> used; the existing CIF software ecosystem is minimally affected
>>>>>>>>>>
>>>>>>>>>> Advantages of (2): implementation of CIF readers/writers from scratch
>>>>>>>>>> is easier; the standard is easier to define formally and more
>>>>>>>>>> aesthetically pleasing; mistakes in previous versions can be fixed,
>>>>>>>>>> warts do not accumulate
>>>>>>>>>>
>>>>>>>>>> I would like to suggest we act as follows: in essence, we deprecate
>>>>>>>>>> rather than exclude.  In detail:
>>>>>>>>>>
>>>>>>>>>> 1. For this edition of the standard (1.2) we follow Herbert's line,
>>>>>>>>>> leaving everything currently defined untouched.  We simply add triple
>>>>>>>>>> quote delimited strings and bracket expressions.  The content of
>>>>>>>>>> non-delimited strings in bracket expressions will be as proposed by
>>>>>>>>>> Nick.
>>>>>>>>>>
>>>>>>>>>> 2. In the documents associated with the new standard we strongly
>>>>>>>>>> suggest that all non-delimited strings use the same character set as
>>>>>>>>>> for non-delimited strings in bracket expressions (i.e. Nick's
>>>>>>>>>> original proposal).  We might point out that this simplifies code for
>>>>>>>>>> writing CIFs, and perhaps (if all agree) we add that using the CIF1.1
>>>>>>>>>> non-delimited string character set is deprecated, darkly
>>>>>>>>>> foreshadowing that a future version of the syntax standard will adopt
>>>>>>>>>> this character set for all non-delimited strings.
>>>>>>>>>>
>>>>>>>>>> 3. We also deprecate including string delimiters inside strings,
>>>>>>>>>> regardless of whitespace issues.
>>>>>>>>>>
>>>>>>>>>> 4. In all dictionaries we adopt the restricted character set for
>>>>>>>>>> non-delimited strings and exclusion of string delimiters in strings.
>>>>>>>>>>
>>>>>>>>>> 5. We ask that CheckCIF emit a warning about use of deprecated
>>>>>>>>>> characters in non-delimited strings
>>>>>>>>>>
>>>>>>>>>> 6. When (say in 10 years' time) a sufficiently large proportion of
>>>>>>>>>> incoming CIFs conform to the new non-delimited string character set,
>>>>>>>>>> we promulgate the 1.3 version of the standard.
>>>>>>>>>>
>>>>>>>>
>>>>>>>> cheers
>>>>>>>>
>>>>>>>> Nick
>>>>>>>>
>>>>>>>> --------------------------------
>>>>>>>> Associate Professor N. Spadaccini, PhD
>>>>>>>> School of Computer Science & Software Engineering
>>>>>>>>
>>>>>>>> The University of Western Australia    t: +61 (0)8 6488 3452
>>>>>>>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>>>>>>>> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
>>>>>>>> MBDP  M002
>>>>>>>>
>>>>>>>> CRICOS Provider Code: 00126G
>>>>>>>>
>>>>>>>> e: [email protected]
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> ddlm-group mailing list
>>>>>>>> [email protected]
>>>>>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>>>>>
>>>>
>>>> cheers
>>>>
>>>> Nick
>>>>
>
> cheers
>
> Nick
>
_______________________________________________
ddlm-group mailing list
[email protected]
http://scripts.iucr.org/mailman/listinfo/ddlm-group