Discussion List Archives

Re: [ddlm-group] Simon's elide proposal

On Thursday, January 13, 2011 7:10 AM, SIMON WESTRIP wrote:
>Let's assume we were starting with CIF2 that included a minimal scheme like F'.
>What then would be gained by adopting the full Python specification of string literals?
>1) "Cleaner" presentation in the very rare cases that the eliding system would be needed in order
>to accommodate delimiters within the value. This is purely a matter of taste.
>2) Ability to include raw strings using the 'r' prefix. But in CIF2 as it stands, all strings are 'raw'.

Yes, but that will no longer be true if any of the proposals we're discussing is adopted.

>Perhaps others can add to this list?

From the perspective of technical features only:

3) Three distinct forms for expressing Unicode characters via ASCII characters; one is restricted to characters from the BMP, but the others are general

4) Two forms for expressing 8-bit characters (from some undocumented character set, probably the source character set) via ASCII characters

5) Several elides for specific whitespace and non-printing ASCII characters, some of which are not among the allowed CIF characters, and all of which clash with the IUCr application-level elides

6) A mechanism for indicating whether the three forms of Unicode elides of item (3) should in fact be processed, or not.

7) A mechanism for representing a byte-string data object, or possibly a stub for such a feature, depending on which Python version serves as a reference


I think that makes a complete list of the new technical features that full Python string literals would bring to CIF, beyond those of proposal F.  I ignore a few semantic details that are mostly consistent with the current CIF specifications.

Python's is indeed a rich feature set, but that is one of my objections to its use for CIF.  CIF is a data representation language, not a programming language, so once the language can represent everything in its present and future domain, alternative representation mechanisms add little.  People can and do write CIF by hand, but I don't think that use case is of sufficient import to justify convenience features solely for its support, particularly when such features present problems in other respects.

Furthermore, Python admits essentially one implementation (changing slowly over time), so a rich feature set does not present compatibility problems.  CIF, however, anticipates many implementations, so the number and complexity of its features contribute to the likelihood of incompatibility between implementations.

Most importantly, however, I think several of the Python features are inappropriate for CIF, and I specifically want them excluded:

a) The \N{name} syntax for designating Unicode characters by UCD name.  I view this as the single greatest locus for bugs and incompatibility, both among CIF implementations and between CIF and Python.  Large among the questions here is *which version of the UCD is referenced*?  That can evolve over time in Python, but it must be fixed in CIF, at least for each CIF version.  Shall we plan to issue a new version of CIF every time Python moves up to a new Unicode version, and to deal with the multiple resulting versions?  Must every CIF2 implementation lug along a name=>character table just for this?  It is redundant with the other two Unicode elides.
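To illustrate the versioning concern in (a), here is a minimal sketch using Python 3's standard unicodedata module. It shows that name lookup is tied to whichever UCD version the interpreter happens to bundle. The helper name resolve_name is hypothetical, introduced only for this example.

```python
import unicodedata

# Version string of the Unicode Character Database bundled with this
# interpreter; it changes between Python releases.
ucd_version = unicodedata.unidata_version

# Name lookup against that database. This is the same table that the
# "\N{GREEK SMALL LETTER ALPHA}" literal syntax consults.
alpha = unicodedata.lookup("GREEK SMALL LETTER ALPHA")


def resolve_name(name):
    """Return the character for a UCD name, or None if the bundled UCD
    version does not know that name (hypothetical helper)."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None
```

A name added in a later Unicode version resolves on a newer interpreter and fails on an older one, which is the kind of divergence a CIF version would have to pin down.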

b) The [uU] prefix.  In Python, Unicode strings are a different type of object than ordinary strings, which is the main reason for the [uU] syntax.  All CIF2 strings are Unicode strings, however (so there's an unavoidable semantic difference regardless).  In CIF the [uU] prefix could still turn on and off processing of Unicode elides, but to what end?  In rare cases, to yield a slightly simpler representation of strings that would otherwise clash with one of the Unicode elide sequences.  Should we really require all conforming CIF processors to implement rules to support that obscure case, even though it can reasonably be handled by the \\ elide instead?

c) The [bB] prefix.  I'm not clear on what it will mean in Python 3, but it is ignored in Python 2.  The only Python 3 meanings I can imagine are incompatible with CIF, and there is no technical advantage for CIF in including [bB] just to ignore it.
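For reference on (c), a short sketch of what the prefix does in Python 3 as released, where it yields a separate bytes type rather than a text string. This assumes a Python 3 interpreter; the variable names are illustrative only.

```python
text = "abc"    # a text (Unicode) string
data = b"abc"   # a bytes object: a sequence of integers in range(256)

first_char = text[0]    # indexing text yields a one-character string
first_byte = data[0]    # indexing bytes yields an integer

# The two objects never compare equal, even with identical ASCII content.
same = (data == text)
```

A value of this kind is not a character string at all, so it has no obvious counterpart among CIF2 values, all of which are Unicode text.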

d) The [rR] prefix.  In Python, this turns off elide processing for the string, except that if the [uU] prefix is also present then Unicode elides are still handled.  Also, the \\ elide is handled, but differently than for other string literals.  I would be happier with this for CIF, though still not in favor, if it were a universal on/off for all elides.  Furthermore, as Simon pointed out, raw strings are what we have now.  Supposing that we use the Python rule that unrecognized elides are treated as literals, the value of [rR] raw strings for CIF depends on how many and which elides we adopt.  Inasmuch as I favor restriction to only a few elides, I don't see [rR] adding much of value.
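The special handling of backslashes in raw strings mentioned in (d) can be seen in a few lines. This sketch assumes Python 3 behavior; the variable names are illustrative only.

```python
cooked_newline = "\n"      # one character: LINE FEED
raw_newline = r"\n"        # two characters: backslash, 'n'

cooked_backslash = "\\"    # one character: a single backslash
raw_backslashes = r"\\"    # two characters: both backslashes are kept

# In a raw string a backslash still protects a following quote from
# ending the literal, but the backslash itself stays in the value.
raw_quote = r'\''
```

So a raw string is not simply "no processing": the backslash still affects where the literal ends, even though it is never removed from the value.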

e) The \a, \b, \f, \n, \r, \t, and \v elides.  These needlessly clash with the IUCr elides, they are redundant with Unicode elides, and they express characters that either can appear as literals in triple-quoted strings or are not allowed CIF characters (more on that in a separate message).  Including these would complicate CIF implementations for little or no technical advantage.
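The redundancy claimed in (e) is easy to check. The following sketch, assuming Python 3 string literals, pairs each single-character elide with the equivalent \uxxxx form.

```python
# Each key uses a single-character elide; each value spells the same
# character with the four-digit Unicode elide.
single_char_elides = {
    "\a": "\u0007",  # BEL
    "\b": "\u0008",  # BACKSPACE
    "\t": "\u0009",  # CHARACTER TABULATION
    "\n": "\u000a",  # LINE FEED
    "\v": "\u000b",  # LINE TABULATION
    "\f": "\u000c",  # FORM FEED
    "\r": "\u000d",  # CARRIAGE RETURN
}

all_redundant = all(short == long for short, long in single_char_elides.items())
```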

f) The \ooo and \xhh elides.  These are redundant with the Unicode elides.  Moreover, they are byte-oriented in standard strings (so that their actual meaning depends on the source or runtime character set), but character-oriented in Unicode strings (where they are *thoroughly* redundant with the \uxxxx and \Uxxxxxxxx forms).
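A brief sketch of the two behaviors described in (f), written for Python 3, where text strings play the role of Python 2 Unicode strings and bytes literals play the role of Python 2 standard strings. The choice of the latin-1 and cp437 codecs is only an example of two character sets that disagree.

```python
# Character-oriented: in a text string all three forms name U+0041.
octal_char = "\101"
hex_char = "\x41"
unicode_char = "\u0041"

# Byte-oriented: the elide only fixes a byte value. Which character it
# stands for depends on the character set used to interpret it.
raw_byte = b"\xe9"
as_latin1 = raw_byte.decode("latin-1")
as_cp437 = raw_byte.decode("cp437")
charset_dependent = (as_latin1 != as_cp437)
```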

That leaves very few Python string features that I could support being added to CIF (triple-quoted strings only), to wit:

Among those, \' and \" serve only the purpose of delimiter elision; the others have larger scopes.  Given that the need to elide delimiters is likely to be quite rare, and that these two clash with the IUCr elides, I would prefer to omit them.

As for the two Unicode escapes, it turns out that when the \[uU] is not followed by the expected number of hex digits, the Python 2.4 behavior differs from what the documentation led me to believe.  Python throws a UnicodeDecodeError in such cases, rather than applying "all unrecognized escape sequences are left in the string unchanged" to the whole construct.  With respect to those forms, if they are included then I would prefer that constructs such as '''\u065q''' be treated as literals rather than error cases.  (And thus, to be subject to further interpretation at the application level.)
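The lenient treatment preferred here could be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed reference implementation: the function name decode_unicode_elides is hypothetical, only the two Unicode forms are handled, and the interaction with a \\ elide is deliberately left out.

```python
import re

# Match only well-formed elides: \u plus exactly 4 hex digits, or
# \U plus exactly 8 hex digits. Anything else never matches.
_ELIDE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def decode_unicode_elides(text):
    """Replace well-formed Unicode elides and leave every malformed
    one in the string unchanged (hypothetical helper)."""

    def substitute(match):
        digits = match.group(1) or match.group(2)
        code_point = int(digits, 16)
        if code_point > 0x10FFFF:
            # Beyond the Unicode range: keep the text as a literal.
            return match.group(0)
        return chr(code_point)

    return _ELIDE.sub(substitute, text)
```

With this approach, a value containing \u065q passes through untouched and remains available for interpretation at the application level, instead of causing the whole string to be rejected.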

It should also be noted that Python source code, including its string literals, is restricted to being expressed in the characters of the 7-bit ASCII character set (though they need not necessarily be encoded according to US-ASCII).  Unconditional, bidirectional CIF/Python string compatibility would require that we apply the same restriction to CIF2 triple-quoted strings.  I would oppose that.


John C. Bollinger, Ph.D.
Department of Structural Biology
St. Jude Children's Research Hospital
