Discussion List Archives


Re: [ddlm-group] Eliding in triple-quoted strings: Proposals C and D

Inasmuch as the issue of "CIF remain[ing] a subset of STAR" keeps being
raised, would somebody be kind enough to refer me to a complete, formal,
published specification of STAR, especially one that lays out
precisely what STAR considers a valid quoting mechanism?

Saying that there are a large number of alternative languages we could
leverage is unhelpful.  If somebody has a better, fully specified,
Unicode-compatible base to propose as an alternative to Python, make that
specific proposal.  The proposal on the table from Ralf, supported by
John and myself, is that we adopt the Python treble-quote syntax and
semantics.  Unless somebody can cite a STAR publication, with supporting
software, that already handles the treble quote, STAR compatibility would
not seem relevant to this particular issue.

I believe that Ralf is right.


At 12:20 PM -0600 1/7/11, Bollinger, John C wrote:
>On Friday, January 07, 2011 7:56 AM, John Westbrook wrote:
>>I have been quiet on this issue as my bias for supporting Python semantics
>>has not been popular or productive in prior DDLm/Cif2 discussions.   I would
>>extend Herb's argument to the whole of this enterprise and emphasize
>>my view that meaningful adoption of DDLm/CIF2 will require embracing
>>and leveraging existing technologies as much as possible.
>>On 1/7/11 7:52 AM, Herbert J. Bernstein wrote:
>>>  As noted in my prior message, I disagree. I find it
>>>  counter-intuitive and unproductive to adopt something
>>>  that looks very much like the python treble quoted
>>>  string but which follows confusingly different rules.
>>>  Remember -- for most of the community, the entire
>>>  CIF2 approach to quoting is something new and different.
>>>  It does not agree with the well-established CIF1 quoting
>>>  rules. By giving them the python treble quoted strings
>>>  we are giving them a way to simply and easily carry any
>>>  and all strings and text fields forward from CIF1 to CIF2
>>>  without having to seriously rework them. Sure, we could
>>>  come up with some other set of rules for treble quoted
>>>  strings, but by following the python rules we will
>>>  greatly reduce the chances of misinterpretations in
>>>  the marginal cases, and give ourselves an independent
>>>  check on our new parsers -- all the existing Python
>>>  parsers.
>>>  I believe that Ralf is right.
>There is a large number of elide mechanisms in popular programming 
>languages, many of them similar, but none of them identical.  I 
>don't see why choosing any one language's conventions, Python's for 
>instance, is an overall win over choosing any of the others' (C/C++, 
>Java, Ruby, Perl, ...).  Furthermore, although I recognize the 
>advantages of drawing on existing technologies, I don't think that 
>doing so in this case necessarily requires adopting the entire 
>package of conventions from any particular language, no matter that 
>language's present popularity level.
>For elide processing I greatly favor approaches that are invisible 
>to a STAR lexer.  That will allow CIF to remain a subset of STAR, 
>which matters at least to me, though it may be of little concern to 
>some others here.  Furthermore, I prefer an approach that is as 
>minimal as possible while still adequately addressing the problem. 
>That will reduce the work to implement the scheme, the potential for 
>bugs and incompatibilities, and the details that people need to 
>I have these objections to adopting specifically the Python scheme for CIF2:
>1) I dislike tying CIF strongly to a particular programming language.
>2) The Python system is incompatible with STAR.
>3) It is far more complicated than we need.  In particular, I don't 
>like \N{name} for representing characters by UCD name, but nearly 
>all of the other elides are redundant with \uxxxx forms.
>4) It needlessly coopts some commonly used elides of the IUCr 
>system: \a, \b, \f, \', \"
>5) In Python, Unicode strings have a different data type than plain 
>strings (though there it matters little), and I don't want to carry 
>that impression over to CIF2, where it would be false.
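Objection 4 can be made concrete with a small sketch. It assumes Python 3's `unicode_escape` codec as a stand-in for the Python string-literal rules, and it assumes the usual IUCr readings of the sample elides (for example `\a` for Greek alpha and `\'e` for e-acute). The helper name `python_view` is hypothetical, chosen only for this illustration.

```python
# Sketch: what Python's own escape rules do to text that already
# contains IUCr-style elides. Inputs are ASCII-only raw byte strings,
# so the backslashes reach the codec untouched.

def python_view(raw: bytes) -> str:
    """Decode `raw` the way Python decodes escapes in a string literal.

    Assumption: the 'unicode_escape' codec is close enough to the
    literal rules for the five elides shown here.
    """
    return raw.decode("unicode_escape")


samples = [rb"\a", rb"\b", rb"\f", rb"\'e", rb'\"u']

for raw in samples:
    # \a, \b and \f become control characters (BEL, BS, FF);
    # \' and \" lose their backslash.
    print(raw.decode("ascii"), "->", repr(python_view(raw)))
```

In each case the backslash sequence that an IUCr-aware reader would expect to see is gone after Python-style processing, which is the collision the objection describes.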
>I observe in passing that the Python conventions provide for eliding 
>newlines to indicate that they should be ignored, similar to the CIF 
>line-wrapping protocol.  I have no objection to including that, but 
>it has been a controversial topic in the past, so I do not let it go 
>I don't think any existing programming language's system has the 
>characteristics we (I) want, so I propose instead yet another 
>alternative, "Proposal E", derived from Python's system but much 
>1) All triple-quoted strings (either delimiter) are handled 
>according to these conventions.  No [uUrR] sigils are recognized (or 
>2) (Only) The following elides are recognized and handled in 
>triple-quoted strings as they are in Python Unicode strings:
>   a) \uxxxx             (represents the Unicode character having the 
>specified 4(-hex)-digit code point)
>   b) \Uxxxxxxxx (represents the Unicode character having the 
>specified 8(-hex)-digit code point)
>   c) \(newline) (represents nothing; that is, it is consumed and ignored)
>   d) \\         (represents a single backslash (same as \u005C))
>3) As in Python, unrecognized escape sequences are treated as 
>literals (that is, they are left uninterpreted in the string, 
>including the backslash).  Because \' and \" are not among the 
>recognized elides, trailing backslashes are subject to this rule as 
>well: they are treated as literals unless part of a \\ elide.
>Those are the essentials, but a few more details are necessary to 
>ensure consistent interpretation:
>4) Elides are processed as if after lexical analysis (unlike in 
>Java, where Unicode escapes are processed as if before lexing).
>5) Elides are processed left-to-right, and when an elide is replaced 
>by a character, elide processing continues immediately *after* the 
>replacement character.  (Thus '''\u005Cu005C''' is equivalent to 
>'\u005C', not to '\'.)
>6) As in Python, Unicode characters outside the BMP may be 
>represented as surrogate pairs via the \uxxxx mechanism, with the 
>same meaning as the corresponding \Uxxxxxxxx representation.
>7) Unlike the IUCr elides, these elides are considered part of the 
>CIF _representation_ of values, not part of the values themselves. 
>That is, applications consuming CIF data should not have to process 
>or generate these elides.  Of course, general STAR applications will 
>not and should not recognize them (unless they were adopted there, 
>too), but that is desirable.
>8) Characters not allowed to appear as literals in CIF must not 
>appear as Unicode escapes, either.
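Rules 2 through 6 are compact enough to sketch as a decoder. The following is one possible reading, not a reference implementation. The function name `decode_proposal_e` is hypothetical, the input is assumed to be the body of a triple-quoted string after the lexer has already found its delimiters (rule 4), and rule 8 is not enforced beyond rejecting unpaired surrogates.

```python
import re

# The four elides of rule 2: \uxxxx, \Uxxxxxxxx, backslash-newline, and \\.
_ELIDE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(\n)|(\\))")


def _replace(match: re.Match) -> str:
    if match.group(1) is not None:      # \uxxxx
        return chr(int(match.group(1), 16))
    if match.group(2) is not None:      # \Uxxxxxxxx (ValueError above U+10FFFF)
        return chr(int(match.group(2), 16))
    if match.group(3) is not None:      # backslash-newline: consumed and ignored
        return ""
    return "\\"                         # \\ -> one backslash


def decode_proposal_e(body: str) -> str:
    """Decode the elides in an already-delimited triple-quoted body."""
    # re.sub scans left to right and never rescans replacement text,
    # which is the behaviour rule 5 asks for. Anything the pattern does
    # not match (\a, \', a lone trailing backslash) stays literal (rule 3).
    decoded = _ELIDE.sub(_replace, body)
    # Rule 6: join surrogate pairs written as two \uxxxx elides.
    # The strict UTF-16 decode rejects an unpaired surrogate.
    return decoded.encode("utf-16", "surrogatepass").decode("utf-16")


print(decode_proposal_e(r"\u005Cu005C"))   # rule 5 example
print(decode_proposal_e(r"\'e and \a"))    # IUCr-style elides pass through
```

Using a single regular-expression pass is a design choice made for brevity; a hand-written scanner would work equally well as long as it resumes after each replacement character rather than rescanning it.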
>Comments on Proposal E:
>() I think (4) and (5) are consistent with Python, but I had trouble 
>finding documentation of that (which is another reason to be wary of 
>adopting the Python system whole-hog, by reference).  (5) could as 
>easily go the other way, but (4) is needed as-is to avoid additional 
>() This proposal would allow almost all existing, well-formed CIF 
>character data to be triple-quoted as-is, without need for eliding 
>anything, even when IUCr elides are present.
>() All allowed Unicode characters can be represented in data values 
>via this system, using only printable ASCII characters and CIF 
>() There are few rules to remember or code.
>() Rules (1) - (6) are, I think, a strict subset of Python's Unicode 
>string elide system, but [uUrR] sigils are not needed or used to 
>activate them.
>() This system preserves lexical compatibility with STAR, provides 
>for line-wrapping, is mostly compatible with CIF1-style IUCr elides, 
>and is small and relatively easy to code.
>() The biggest potential gotcha I see for users is the absence of \' 
>and \" elides, but that is necessary for the scheme to satisfy my 
>objective of STAR compatibility, and it is furthermore useful for 
>compatibility with the IUCr elide system.
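The point about the missing \' and \" elides can be illustrated with a string whose body ends in a backslash. The sketch below assumes a deliberately naive "STAR-like" lexer that ends a triple-quoted string at the first closing delimiter without looking at backslashes; both helper names are hypothetical, and Python's own parser is reached through `ast.literal_eval`.

```python
import ast

# The ten characters  '''abc\'''  written as a Python string.
token = "'''abc\\'''"


def star_like_body(tok: str) -> str:
    """Return the body, ending at the first ''' after the opener.

    Assumption: the lexer ignores backslashes entirely.
    """
    if not tok.startswith("'''"):
        raise ValueError("not a triple-quoted token")
    end = tok.index("'''", 3)
    return tok[3:end]


def python_accepts(tok: str) -> bool:
    """Report whether Python parses `tok` as a complete string literal."""
    try:
        ast.literal_eval(tok)
        return True
    except SyntaxError:
        return False


print(repr(star_like_body(token)))   # body ends with a literal backslash
print(python_accepts(token))         # Python reads \' as an escaped quote
```

Under the naive lexer the token is a complete string whose last character is a backslash, whereas Python treats the backslash as escaping the first closing quote and so never finds the end of the literal.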
>John C. Bollinger, Ph.D.
>Department of Structural Biology
>St. Jude Children's Research Hospital
>Email Disclaimer:  www.stjude.org/emaildisclaimer
>ddlm-group mailing list

  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769


International Union of Crystallography
