Discussion List Archives

Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Dear Nick,

   imgCIF and CBFlib have supported UTF-8 and will continue to do so.  Any
application that supports ascii can trivially support UTF-8.  In addition,
one of the encodings in imgCIF and CBFlib uses UTF-16/UCS-2.  If these
are not valid for a CIF, we can always go back to using the name imgNCIF.
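
As a minimal illustration of the compatibility claim above (a Python sketch, not taken from imgCIF or CBFlib): every ASCII byte sequence is already valid UTF-8 and decodes to the same text, while a non-ASCII character is stored as a multi-byte sequence whose bytes all have the high bit set. The sample data line and character are hypothetical.

```python
# Sketch: ASCII is a strict subset of UTF-8.
ascii_line = "_cell_length_a 5.4310"          # hypothetical CIF data line
ascii_bytes = ascii_line.encode("ascii")

# The same bytes decode to the same text under either codec.
same_text = ascii_bytes.decode("utf-8") == ascii_bytes.decode("ascii")

# A non-ASCII character (U+00C5) becomes a multi-byte sequence. None of its
# bytes falls in the 7-bit range, so a byte-oriented ASCII scanner cannot
# mistake any of them for a quote, brace or whitespace character.
angstrom_bytes = "\u00c5".encode("utf-8")
high_bit_only = all(b >= 0x80 for b in angstrom_bytes)
```

A parser that only looks for ASCII delimiters can therefore pass such bytes through untouched; interpreting them as characters is a separate step.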

   Most larger CIFs are no more readable than Postscript files, but there
are editors that do a nice job of displaying UTF-8 properly.  I use them
for the multi-lingual strings for the message catalog for RasMol.  The
world has many languages, and it makes sense for a data representation
language to be able to handle them.  Even for the western European 
languages, UTF-8 makes much more sense than using national code pages.

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Sat, 10 Oct 2009, Nick Spadaccini wrote:

> I am willing to be convinced on 3.2 vs an ascii based representation. But I
> need answers to several things.
>
> UTF-8 removes text readability from CIF, which is something many still hold
> dear, but that may be a cost.
>
> However here is a practical example. A user wishes to add the author names
> to an existing CIF. They fire up vim or emacs. The possible non-readability
> of the file already presents a problem. But more importantly how do they
> inject the utf-8 coding equivalent of what they need?
>
> On 10/10/09 3:34 AM, "SIMON WESTRIP" <simonwestrip@btinternet.com> wrote:
>
>> Dear all
>>
>> Without having discussed this with the IUCr, my vote would be:
>>
>> 1.2 - delimiters lose trailing whitespace condition
>>
>> 2.3 - restricted char set of non-delimited strings
>>
>> Although I'm sure these two will 'invalidate' many archived CIFs (IUCr
>> archives), just as our software is able to recognize the current specs,
>> it could equally use the same ability to 'remediate' any offending items.
>> Granted this is not an ideal situation, but I don't think the current
>> use of delimiters is ideal either (based on experience of handling CIFs that
>> were edited manually - though fortunately this is not such common practice
>> these days). So if these changes are necessary to realize the potential of
>> DDLm, I have no major objections.
>>
>> 3.2 - allow UTF-8
>>
>> Though this would probably require far more effort from CIF developers than
>> handling the first two changes, in the longer term I'm not sure this should be
>> ruled out. After all, support for such encoding is growing (dare I mention
>> xml?), and the rendering issues are far less of a problem than a few years
>> back (widespread font support).
>>
>> That said, I have to confess that support for 3.2 is partly driven by the fact
>> that a large part of the development of software to support the IUCr's CIF
>> publishing activities involves translation from UTF-8 to ASCII CIF codes;
>> furthermore, we are actively looking at ways to include 'richer' content in
>> CIFs.
>> So for my part, I would at least like to see support for an ASCII-based
>> representation of a wider character set.
>>
>> I have to stress that these are my views (as someone who writes CIF
>> applications for the IUCr) - I've yet to speak with Brian et al. regarding an
>> 'official' view on these matters.
>>
>> Anyway, hope this helps in your deliberations.
>>
>> Cheers
>>
>> Simon
>>
>> Simon P. Westrip
>>
>>
>> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
>> To: Nick.Spadaccini@uwa.edu.au; Group finalising DDLm and associated
>> dictionaries <ddlm-group@iucr.org>
>> Sent: Friday, 9 October, 2009 1:45:01
>> Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
>>
>> Dear Nick,
>>
>>    If you have persuaded the others to your view, then you will win on a
>> straw vote.  I hope you have not persuaded a majority, because I agree
>> neither with your premises, nor your conclusions, but the only way to find
>> out is to hear from the others.
>>
>>    I still think the right way to resolve this is to put the items I have
>> listed to a vote and then move on.
>>
>>    Regards,
>>      Herbert
>>
>> P.S.  From your comments about binary, it sounds as if you intend to
>> "excommunicate" imgCIF from DDLm.  I think that would be a mistake. imgCIF
>> will benefit greatly from the use of methods, but at worst, I can always
>> go back to the original name:  imgNCIF, where the N stands for "not", and
>> use methods without the blessing of it being officially a "CIF"
>> dictionary.
>>
>>
>>
>> On Fri, 9 Oct 2009, Nick Spadaccini wrote:
>>
>>>>
>>>>
>>>>
>>>> On 9/10/09 5:37 AM, "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
>>>> wrote:
>>>>
>>>>>> Dear Colleagues,
>>>>>>
>>>>>>    I sense a certain strong emotion in this.  I don't think that is the
>>>>>> way to resolve this.  Nick has his views.  I have mine.  Neither of us
>>>>>> has the final say.  I suggest that these matters be put to a straw
>>>>>> vote, tell the community the outcome, and then move on to more
>>>>>> substantive issues.
>>>>
>>>> There isn't emotion in this, Herb, but when I say something is not
>>>> negotiable it is a statement of fact.
>>>>
>>>> At least we agree on the item you have probably viewed as emotional, my
>>>> statement of non-negotiability on Issue 2. I put it much stronger, 2.1
>>>> simply is not an option. However, it is not a strictly limited set of
>>>> characters. The only restriction I am suggesting is on those 6 characters
>>>> that are token delimiters.
>>>>
>>>> The problem with your suggestions, Herb, is that they refer to deprecating
>>>> and not enforcing when we are trying to specify the standard. Standards
>>>> tend to be strict, though individual parsers can be liberal in what they
>>>> do when error states arise. That is fair enough, BUT the standard can't be
>>>> liberal.
>>>>
>>>> As a standard I much prefer 2.3 with the added restrictions for "" and ''
>>>> strings. With that in place, 1.1 doesn't make sense so clearly I prefer
>>>> 1.2.
>>>>
>>>> UTF-8 introduces strictly binary data into the file. I don't think this is
>>>> the direction to take. Notwithstanding that, most of us wouldn't know how to
>>>> encode into UTF-8. So what are we going to do? We will probably identify
>>>> the characters we want to encode in some ascii presentation, likely
>>>> unicode, and then use a library function/method to encode it.
>>>>
>>>> To write utf-8 (binary) into the cif file you will have to execute
>>>> something like
>>>>
>>>> outputToCIF("\u1234".encode('utf-8'))
>>>>
>>>> To me it makes more sense to
>>>>
>>>> outputToCIF("\u1234")
>>>>
>>>> And then do the encoding once you read the string in from the CIF. That way
>>>> the CIF remains ascii readable.
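
A minimal Python sketch of the round trip described above, assuming a fixed four-hex-digit `\uXXXX` escape so that the stored text stays pure ASCII. The function names `to_ascii_escapes` and `from_ascii_escapes` are hypothetical, and the sketch does not handle code points above U+FFFF or a literal backslash followed by `u` in the original text.

```python
import re

def to_ascii_escapes(text):
    # Replace each non-ASCII character with a \uXXXX escape (assumed syntax).
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append("\\u%04X" % cp)
        else:
            raise ValueError("this sketch only covers U+0000..U+FFFF")
    return "".join(out)

# Matches a backslash, the letter u, and exactly four hex digits.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")

def from_ascii_escapes(text):
    # Turn every \uXXXX escape back into the character it names.
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
```

Under this scheme the writer calls `to_ascii_escapes` before output and the reader calls `from_ascii_escapes` after parsing, so the file itself can still be opened in any ASCII editor.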
>>>>
>>>> I think 1.2, 2.3 with the added restrictions on "" and '', and ascii-fied
>>>> unicode in strings.
>>>>
>>>>>>    Issue1:  Removing the requirement for a trailing whitespace after
>>>>>> quoted strings outside of bracketed constructs.
>>>>>>    Options:  1.1. Preserve the current convention as is
>>>>>>              1.2. Terminate all quoted strings on the occurrence of the
>>>>>> trailing quoted delimiter without consideration of the next character
>>>>>>
>>>>>>    Issue2:  Restriction of the character set for non-delimited strings
>>>>>> outside of bracketed constructs
>>>>>>    Options  2.1.  Preserve the current convention as is
>>>>>>             2.2.  Modify the current convention to deprecate use of
>>>>>>                   any characters other than a strictly limited set
>>>>>>                   of characters, adding a warning on reads and
>>>>>>                   defaulting to add quote marks on write
>>>>>>             2.3.  Modify the current convention to forbid the use of
>>>>>>                   any characters other than a strictly limited set
>>>>>>                   of characters, making it an error to read a
>>>>>>                   non-delimited string that does not comply even if
>>>>>>                   the intention can be inferred from context
>>>>>>
>>>>>>     Issue 3:  Use of UTF-8
>>>>>>     Options:  3.1.  Do not use UTF-8
>>>>>>               3.2.  Use UTF-8
>>>>>>
>>>>>> My votes would be 1.1, 2.2, 3.2
>>>>>>
>>>>>> Whatever the outcome of the vote, I will code at least one variant of a
>>>>>> parser to comply, but it will take longer if the vote goes for 1.2 and
>>>>>> 2.3.
>>>>>>
>>>>>>    Regards,
>>>>>>      Herbert
>>>>>>
>>>>>>
>>>>>> On Fri, 9 Oct 2009, Nick Spadaccini wrote:
>>>>>>
>>>>>>>> Ok. Back on board. I am proposing some old and some new stuff here.
>>>>>>>> From the beginning,
>>>>>>>>
>>>>>>>> (1) restricting the character set of non-delimited strings is
>>>>>>>> NON-NEGOTIABLE. If we don't restrict it, then we can't build recursive
>>>>>>>> data structures and exploit DDLm. If we aren't going to exploit DDLm,
>>>>>>>> IUCr should drop it now and stick with its current DDL.
>>>>>>>>
>>>>>>>> IUCr needs to make that decision now.
>>>>>>>>
>>>>>>>> I have built a new lexer for the current syntax specification and
>>>>>>>> checked for cases where
>>>>>>>>
>>>>>>>> (1) a double-quote-delimited string contains a double quote.
>>>>>>>> (2) a single-quote-delimited string contains a single quote.
>>>>>>>> (3) a non-delimited string contains any of " ' , : { }
>>>>>>>> (4) a data name contains any of (3)
>>>>>>>>
>>>>>>>> The contents of (3) are (I think) a sufficient restriction on
>>>>>>>> non-delimited strings to enable us to move forward.
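
The four checks listed above can be sketched in Python as follows. This is not the lexer used for the scan; it assumes tokens have already been separated and classified, and the token kinds `dq`, `sq`, `bare` and `name` are hypothetical labels introduced only for this example.

```python
# The six characters named in check (3): " ' , : { }
FORBIDDEN = set("\"',:{}")

def passes_check(kind, body):
    """Return True if the token body satisfies the proposed restriction.

    kind is one of: 'dq'   (body of a double-quoted string),
                    'sq'   (body of a single-quoted string),
                    'bare' (non-delimited string),
                    'name' (data name).
    """
    if kind == "dq":
        return '"' not in body              # check (1)
    if kind == "sq":
        return "'" not in body              # check (2)
    if kind in ("bare", "name"):
        return not (FORBIDDEN & set(body))  # checks (3) and (4)
    raise ValueError("unknown token kind: %r" % kind)
```

For example, a non-delimited URL such as `http://example.org` fails check (3) because of the embedded colon, whereas square brackets are not in the restricted set.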
>>>>>>>>
>>>>>>>> I have scanned 10345 of the 60173 (17%) mmCIF files in the archive. The
>>>>>>>> results are
>>>>>>>>
>>>>>>>> (1) 0 of the 3.4M (M = million) data values failed the test.
>>>>>>>>
>>>>>>>> (2) 4 of the 1.3M data values failed the test.
>>>>>>>> When I pointed these out to John he said these SHOULD have been in
>>>>>>>> semi-colon delimited text because at the PDB they have been
>>>>>>>> systematically dealing with quotes within quotes to avoid parsing
>>>>>>>> problems.
>>>>>>>>
>>>>>>>> HENCE not allowing a string delimiter character within the string
>>>>>>>> delimited by the same character poses very little or no problem in mmCIF.
>>>>>>>>
>>>>>>>> (3) 138,733 of the 2,009M data values failed the test (.007%)
>>>>>>>>
>>>>>>>> Again the magnitude of the problem has been exaggerated. The
>>>>>>>> restrictions will not affect many of the archived data items. All the
>>>>>>>> failures were limited to 3-5 data names. These were those with embedded :
>>>>>>>> which includes the specification of a URL, and those with embedded , to
>>>>>>>> which Herb has already alluded. John has stipulated that those
>>>>>>>> restrictions we are suggesting can be quickly and efficiently implemented
>>>>>>>> (I am here and looked at their systems and the changes are a single change
>>>>>>>> to a dictionary entry and all software handles the change immediately). I
>>>>>>>> believe the PDB has a remediation process that will resolve all legacy
>>>>>>>> issues (at least for them).
>>>>>>>>
>>>>>>>> Conclusion: This restriction has minimal (.007%) impact on how things
>>>>>>>> have been done, and can be easily implemented for files from here on.
>>>>>>>>
>>>>>>>> (4) 0 data names contain these characters.
>>>>>>>>
>>>>>>>> I will not comment further on this point until I have done the same
>>>>>>>> analysis for the IUCr archive. I suspect the problem will be bigger for
>>>>>>>> those files because they represent a more lackadaisical period in CIF's
>>>>>>>> evolution where we suggested you could do whatever you want etc, and also
>>>>>>>> there are IUCr mark ups that likely cause problems. Once I get my hands on
>>>>>>>> that archive I will let people know.
>>>>>>>>
>>>>>>>> Now guess what? If we don't allow a ' within a '..' and a " within a
>>>>>>>> ".." and any "',:{} within a non-delimited string or a data name WE DON'T
>>>>>>>> NEED A SPACE BEFORE OR AFTER THE TOKEN DELIMITER. This simplifies AND more
>>>>>>>> importantly NORMALIZES the grammar.
>>>>>>>>
>>>>>>>> I don't accept the argument that the new parser is so much more
>>>>>>>> difficult than existing parsers. Currently you have (if you are inside a
>>>>>>>> double quote delimited string)
>>>>>>>>
>>>>>>>> if (ch == '"') {
>>>>>>>>   tmpchar = lookahead(1);
>>>>>>>>   if (tmpchar == ' ') return END_OF_STRING;
>>>>>>>>   else continue;
>>>>>>>> }
>>>>>>>>
>>>>>>>> In the new parser you will have
>>>>>>>>
>>>>>>>> if (ch == '"') return END_OF_STRING;
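
The difference between the two termination rules can be sketched in Python. The function names are hypothetical, and as an assumption beyond the fragment above, the first function treats any whitespace character or the end of input (not only a space) as a valid follower of the closing quote.

```python
WHITESPACE = " \t\r\n"

def end_of_string_current(text, start):
    # Current rule: a double quote closes the string only when it is
    # followed by whitespace (or, by assumption here, the end of input).
    # `start` is the index of the first character after the opening quote.
    for i in range(start, len(text)):
        if text[i] == '"':
            if i + 1 == len(text) or text[i + 1] in WHITESPACE:
                return i
    return -1  # unterminated string

def end_of_string_proposed(text, start):
    # Proposed rule: the first double quote closes the string, no lookahead.
    return text.find('"', start)

line = '"a"b" next'
current_end = end_of_string_current(line, 1)    # closes at the quote before the space
proposed_end = end_of_string_proposed(line, 1)  # closes at the first quote
```

On this input the current rule reads the value `a"b`, while the proposed rule reads `a` and leaves `b"` to be lexed as something else, which is why the proposal also forbids a delimiter inside a string delimited by the same character.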
>>>>>>>>
>>>>>>>> YOU WILL NOTE:
>>>>>>>>
>>>>>>>> I have not included the [] characters in the restriction. There is too
>>>>>>>> much legacy associated with their existence in data names in both small
>>>>>>>> and mm CIFs.
>>>>>>>>
>>>>>>>> I am going to suggest a single token to represent lists, lists of lists
>>>>>>>> and associative arrays, namely  {...}. These are new, and don't present a
>>>>>>>> problem.
>>>>>>>>
>>>>>>>> UTF-8 encoding. This is a 1-4 byte variable encoding schema (actually
>>>>>>>> originally up to 6 bytes providing 31 bits of representation). It is a
>>>>>>>> binary representation. The encoding algorithm is not brain busting, but
>>>>>>>> neither is it trivial. Having a CIF file not editable by a bog standard
>>>>>>>> editor will upset some people. I propose the introduction of a new
>>>>>>>> string type within the DDLm semantics that allows one to define it to be
>>>>>>>> Unicode. Within the string I propose we adopt a \uABCD[EF] (ie 1-6 HEX
>>>>>>>> characters) to represent the character. Equally we could go with the HTML
>>>>>>>> approach of &#xABCDEF; (ie 1-6 HEX characters).
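
For reference, the 1-4 byte UTF-8 encoding of a single code point can be written out by hand in a few lines. The Python sketch below follows the modern definition, which is limited to U+10FFFF and excludes surrogates, rather than the original scheme of up to 6 bytes; the function name `utf8_encode` is hypothetical, and in practice Python's built-in `str.encode("utf-8")` does the same job.

```python
def utf8_encode(cp):
    # Encode one Unicode code point as 1 to 4 UTF-8 bytes.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % cp)
    if cp < 0x80:                      # 7 bits  -> 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 11 bits -> 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 16 bits -> 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 21 bits -> 11110xxx + three 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

A reader that accepts the `\uABCD[EF]` or `&#xABCDEF;` notation would parse the hex digits into an integer and could then hand that integer to a routine like this one, or simply to `chr()`.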
>>>>>>>>
>>>>>>>> I also strongly propose support for the UNICODE string within """
>>>>>>>> strings ONLY. Let's start from a restrictive stance from the outset.
>>>>>>>>
>>>>>>>> I will be arriving at Dowling at about noon on Wednesday, Herb. I'll
>>>>>>>> bring my boxing gloves, Frances can referee :)
>>>>>>>>
>>>>>>>> Nick
>>>>>>>>
>>>>>>>> On 6/10/09 11:01 PM, "James Hester" <jamesrhester@gmail.com> wrote:
>>>>>>>>
>>>>>>>>>> Dear All:
>>>>>>>>>>
>>>>>>>>>> As a result of the discussion with Herbert I can see two differing
>>>>>>>>>> approaches to these CIF syntax changes:
>>>>>>>>>>
>>>>>>>>>> 1. Any changes to CIF syntax should be such that earlier syntax
>>>>>>>>>> versions form a subset of the new syntax, i.e. files in the older
>>>>>>>>>> syntax will also conform to the new syntax
>>>>>>>>>>
>>>>>>>>>> or
>>>>>>>>>>
>>>>>>>>>> 2. When making changes to the standard, the opportunity should be
>>>>>>>>>> taken to simplify and streamline syntax as much as possible.
>>>>>>>>>>
>>>>>>>>>> Advantages of (1): a single CIF parser can be maintained for all
>>>>>>>>>> syntax versions; a CIF writer is always conformant to the latest
>>>>>>>>>> version and only needs changing if new syntax features are to be
>>>>>>>>>> used; the existing CIF software ecosystem is minimally affected
>>>>>>>>>>
>>>>>>>>>> Advantages of (2): implementation of CIF readers/writers from scratch
>>>>>>>>>> is easier; the standard is easier to define formally and more
>>>>>>>>>> aesthetically pleasing; mistakes in previous versions can be fixed,
>>>>>>>>>> warts do not accumulate
>>>>>>>>>>
>>>>>>>>>> I would like to suggest we act as follows: in essence, we deprecate
>>>>>>>>>> rather than exclude.  In detail:
>>>>>>>>>>
>>>>>>>>>> 1. For this edition of the standard (1.2) we follow Herbert's line,
>>>>>>>>>> leaving everything currently defined untouched.  We simply add triple
>>>>>>>>>> quote delimited strings and bracket expressions.  The content of
>>>>>>>>>> non-delimited strings in bracket expressions will be as proposed by
>>>>>>>>>> Nick.
>>>>>>>>>>
>>>>>>>>>> 2. In the documents associated with the new standard we strongly
>>>>>>>>>> suggest that all non-delimited strings use the same character set as
>>>>>>>>>> for non-delimited strings in bracket expressions (i.e. Nick's
>>>>>>>>>> original proposal).  We might point out that this simplifies code for
>>>>>>>>>> writing CIFs, and perhaps (if all agree) we add that using the CIF1.1
>>>>>>>>>> non-delimited string character set is deprecated, darkly
>>>>>>>>>> foreshadowing that a future version of the syntax standard will adopt
>>>>>>>>>> this character set for all non-delimited strings.
>>>>>>>>>>
>>>>>>>>>> 3. We also deprecate including string delimiters inside strings,
>>>>>>>>>> regardless of whitespace issues.
>>>>>>>>>>
>>>>>>>>>> 4. In all dictionaries we adopt the restricted character set for
>>>>>>>>>> non-delimited strings and exclusion of string delimiters in strings.
>>>>>>>>>>
>>>>>>>>>> 5. We ask that CheckCIF emit a warning about use of deprecated
>>>>>>>>>> characters in non-delimited strings
>>>>>>>>>>
>>>>>>>>>> 6. When (say in 10 years' time) a sufficiently large proportion of
>>>>>>>>>> incoming CIFs conform to the new non-delimited string character set,
>>>>>>>>>> we promulgate the 1.3 version of the standard.
>>>>>>>>>>
>>>>>>>>
>>>>>>>> cheers
>>>>>>>>
>>>>>>>> Nick
>>>>>>>>
>>>>>>>> --------------------------------
>>>>>>>> Associate Professor N. Spadaccini, PhD
>>>>>>>> School of Computer Science & Software Engineering
>>>>>>>>
>>>>>>>> The University of Western Australia    t: +61 (0)8 6488 3452
>>>>>>>> 35 Stirling Highway                    f: +61 (0)8 6488 1089
>>>>>>>> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
>>>>>>>> MBDP  M002
>>>>>>>>
>>>>>>>> CRICOS Provider Code: 00126G
>>>>>>>>
>>>>>>>> e: Nick.Spadaccini@uwa.edu.au
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> ddlm-group mailing list
>>>>>>>> ddlm-group@iucr.org
>>>>>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>>>>>
>>>>
>>>> cheers
>>>>
>>>> Nick
>>>>
>
> cheers
>
> Nick
>
