Re: [ddlm-group] Simon's elide proposal

John has made some good points in reply, to which I'll add a few others:

On Fri, Jan 14, 2011 at 9:25 PM, Herbert J. Bernstein
<yaya@bernstein-plus-sons.com> wrote:
> OK:
>  1.  I do think there is value in having CIF capable
> of functioning as a programming language, and see
> nothing to be gained by crippling its ability to
> function in that role.  As noted there is a move now
> towards the creation of executable papers to allow
> journal articles to be better reviewed.  Opening
> the dREL features to more general use than just in
> dictionaries would allow the IUCr to explore this
> very important direction and be more competitive
> with Elsevier, so I strongly disagree with John's
> value judgement that a rich feature set is somehow
> negative.

This whole paragraph makes no sense without a more concrete
explanation of how you plan to turn a data container into a
programming language, and why the current CIF2+DDLm+dREL framework is
not adequate to the task you envisage.  Whether or not the quest to
turn CIF into a programming language is well advised, I would note
that simply adopting the string literal syntax of a programming
language does not make a data format any more like a programming
language: string literal syntax is simply syntactic sugar for
specifying a sequence of bytes.
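
As a rough illustration of the "syntactic sugar" point, here is a minimal
Python sketch (not from the original thread; the names SPELLINGS and
all_equal are made up for this example). It writes the same
three-character string with four different literal syntaxes and checks
that, once parsed, the values cannot be told apart:

```python
# Four spellings of the same three-character string: a, backslash, b.
plain = 'a\\b'              # ordinary literal, backslash elided
raw = r'a\b'                # raw literal, no elide processing
triple = '''a\\b'''         # triple-quoted literal
escaped = '\x61\x5c\x62'    # every character given as a hex escape


def all_equal(values):
    """Return True when every element equals the first one."""
    return all(v == values[0] for v in values)


SPELLINGS = [plain, raw, triple, escaped]

if __name__ == "__main__":
    print(all_equal(SPELLINGS), len(plain))
```

Whichever spelling is used in the source, the program only ever sees the
resulting value, which is why the choice of literal syntax says nothing
about what the container format can do.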

> 2.  I find the ability to have escapes to handle the
> "illegal" characters useful for imgCIF, which needs
> to be able to handle at least the range of 15 out
> of 16 bits without breaks.

You can have all the escapes you want by either creating a new string
type in DDLm, or describing a syntax right in the item definition.
There is therefore no need to impose a heap of syntactic sugar on the
entire CIF community simply to satisfy a domain-specific application.
I will clarify for John B that I consider that disallowed Unicode code
points should not appear in any CIF datavalue.  CIF datavalues can
obviously be transformed to include those code points in application
specific contexts, so, for example, Herbert can define \b to mean
ASCII BEL in a particular imgCIF item definition if he thinks that
useful, or a LaTeX processor can take LaTeX text inside a CIF
datavalue and turn it into DVI.
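
A minimal sketch of what such an item-specific interpretation could
look like, assuming the escape table comes from the item definition.
The function apply_item_escapes and the table IMGCIF_ESCAPES are
hypothetical names invented here; they are not part of imgCIF, DDLm or
any existing library. The table follows the example above and maps \b
to BEL purely by local convention:

```python
def apply_item_escapes(value, escapes):
    """Expand backslash sequences listed in `escapes`.

    `value` is the datavalue exactly as the CIF parser delivered it.
    Sequences not listed in `escapes` are left untouched.
    """
    out = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value) and value[i + 1] in escapes:
            out.append(escapes[value[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


# Hypothetical table for one item definition (an assumption, not imgCIF).
IMGCIF_ESCAPES = {"b": "\x07", "\\": "\\"}

if __name__ == "__main__":
    print(repr(apply_item_escapes(r"x\by", IMGCIF_ESCAPES)))
```

The CIF parser itself never sees a disallowed code point; only the
application that knows the item definition produces one.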

> 3. I find the \N{}, \a, ... constructs useful for the reasons
> in 1, above.  In point of fact, I think we would be best
> off following Brian's original approach of being "maximally
> disruptive" and requiring a uniform translation of all
> the IUCr glyphs that conflict with current programming
> practice to an escaped form \\a

In addition to John B's entirely reasonable comments, note that
supporting the \N construct would create a dependency on the whole
Unicode database in every CIF2 parser.  Am I really the only one who
finds this ridiculous?   Asking the IUCr to wholesale redefine their
glyphs would require your application to be considerably more
important, which is far from demonstrated.
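
To make the dependency concrete, here is a small Python sketch using
the standard unicodedata module, which is the table Python consults
when resolving names. The helper resolve_name is a made-up name for
this example. A CIF2 parser supporting \N would need an equivalent
name table, pinned to some particular Unicode version:

```python
import unicodedata


def resolve_name(name):
    """Return the character for a UCD name, or None if the table lacks it."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None


# The Unicode database version bundled with this Python build.
UCD_VERSION = unicodedata.unidata_version

if __name__ == "__main__":
    print(UCD_VERSION, resolve_name("GREEK SMALL LETTER ALPHA"))
```

Which names resolve depends on the bundled database version, so two
parsers built against different versions could disagree on whether a
given \N construct is valid.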

> In any case, I think you should get the point -- this
> really is a matter of taste, not a technical issue.
> I find Python compatibility a strong plus to help
> move us into the executable paper realm, indeed
> to help move CIF into being a scripting language.

No, as John says, it is an important design issue.  Unlike a
programming language, we have several layers at which meaning can be
created: syntactic, DDL, and domain dictionary.  To ignore the latter
two is to misunderstand the entire CIF project.

> If anyone wants even more detail, I will be happy
> to send it in an off-list message, but I think it
> really might be wise to address the broader issues
> of what we want CIF to be, first, before we get
> too far into those technical details.  Right now
> I have to get packed to be able to catch a plane,
> so the longer answer will have to wait until tomorrow.
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>   Dowling College, Kramer Science Center, KSC 121
>        Idle Hour Blvd, Oakdale, NY, 11769
>                 +1-631-244-3035
>                 yaya@dowling.edu
> =====================================================
> On Fri, 14 Jan 2011, James Hester wrote:
>> Dear Herbert,
>> Au contraire, I would not be bored, I'd be fascinated by a
>> point-by-point rebuttal.  I find John's assessment spot-on and do not
>> think dismissing his points as a matter of taste shows much respect
>> for the amount of time he has put in to formulate these comments.
>> Please go ahead and rebut his points.
>> James.
>> On Fri, Jan 14, 2011 at 5:42 AM, Herbert J. Bernstein
>> <yaya@bernstein-plus-sons.com> wrote:
>>> Dear Colleagues,
>>>   I will not bore you all with a point-by-point rebuttal
>>> to John B.'s negative assessment of Python treble quote
>>> use in a CIF context.  Most of what he sees as defects,
>>> I see as virtues.  Such are differences in taste, and
>>> more importantly in the uses to which we put CIF.  Especially
>>> with the introduction of dREL and DDLm, I do see CIF as
>>> a programming language, and as one with strong similarities
>>> to Python.  That does not mean everybody has to use it that
>>> way, just that it would be nice if those who use it one way
>>> and those who use it another could find some common ground.
>>> There is now a move towards executable papers, and I suspect
>>> a more powerful and flexible Python-compatible CIF could be
>>> a strong competitor in that area.  Indeed, if current trends
>>> continue, the IUCr is likely to need programming support
>>> in papers if it is to keep up.
>>> One point that does need a rebuttal...
>>>> It should also be noted that Python source code, including its string
>>>> literals, is restricted to being expressed in the characters of the
>>>> 7-bit ASCII character set (though they need not necessarily be encoded
>>>> according to US-ASCII).  Unconditional, bidirectional CIF/Python string
>>>> compatibility would require that we apply the same restriction to CIF2
>>>> triple-quoted strings.  I would oppose that.
>>> That started to change in Python 2.5 which allowed explicit encoding
>>> declarations, and by Python 3 has vanished even without an
>>> encoding declaration.  The Python 3 spec is:
>>> "Python reads program text as Unicode code points; the encoding
>>> ... defaults to UTF8"
>>> For more on how Python dealt with this issue at the same time
>>> we were considering it, see:
>>> http://www.python.org/dev/peps/pep-3120/
>>> =====================================================
>>>  Herbert J. Bernstein, Professor of Computer Science
>>>    Dowling College, Kramer Science Center, KSC 121
>>>         Idle Hour Blvd, Oakdale, NY, 11769
>>>                  +1-631-244-3035
>>>                  yaya@dowling.edu
>>> =====================================================
>>> On Thu, 13 Jan 2011, Bollinger, John C wrote:
>>>> On Thursday, January 13, 2011 7:10 AM, SIMON WESTRIP wrote:
>>>>> Let's assume we were starting with CIF2 that included a minimal scheme
>>>>> like F'.
>>>> What then would be gained by adopting the full python specification of
>>>> string literals?
>>>>> 1) "Cleaner" presentation in the very rare cases that the eliding
>>>>> system would be needed in order to accommodate delimiters within the value.
>>>>> This is purely a matter of taste.
>>>>> 2) Ability to include raw strings using the 'r' prefix. But in CIF2 as
>>>>> it stands, all strings are 'raw'.
>>>> Yes, but that will no longer be true if any of the proposals we're
>>>> discussing is adopted.
>>>>> Perhaps others can add to this list?
>>>>> From the perspective of technical features only:
>>>> 3) Three distinct forms for expressing Unicode characters via ASCII
>>>> characters; one is restricted to characters from the BMP, but the others are
>>>> general
>>>> 4) Two forms for expressing 8-bit characters (from some undocumented
>>>> character set, probably the source character set) via ASCII characters
>>>> 5) Several elides for specific whitespace and non-printing ASCII
>>>> characters, some of which are not among the allowed CIF characters, and all
>>>> of which clash with the IUCr application-level elides
>>>> 6) A mechanism for indicating whether the three forms of Unicode elides
>>>> of item (3) should in fact be processed, or not.
>>>> 7) A mechanism for representing a byte-string data object, or possibly a
>>>> stub for such a feature, depending on which Python version serves as a
>>>> reference
>>>> Commentary:
>>>> I think that makes a complete list of the new technical features that
>>>> full Python string literals would bring to CIF, beyond those of proposal F.
>>>>  I ignore a few semantic details that are mostly consistent with the current
>>>> CIF specifications.
>>>> Python's is indeed a rich feature set, but that is one of my objections
>>>> to its use for CIF.  CIF is a data representation language, not a
>>>> programming language, so once the language can represent everything in its
>>>> present and future domain, alternative representation mechanisms add little.
>>>>  People can and do write CIF by hand, but I don't think that use case is of
>>>> sufficient import to justify convenience features solely for its support,
>>>> particularly when such features present problems in other respects.
>>>> Furthermore, Python admits essentially one implementation (changing
>>>> slowly over time), so a rich feature set does not present compatibility
>>>> problems.  CIF, however, anticipates many implementations, so the number and
>>>> complexity of its features contribute to the likelihood of incompatibility
>>>> between implementations.
>>>> Most importantly, however, I think several of the Python features are
>>>> inappropriate for CIF, and I specifically want them excluded:
>>>> a) The \N{name} syntax for designating Unicode characters by UCD name.
>>>>  I view this as the single greatest locus for bugs and incompatibility, both
>>>> among CIF implementations and between CIF and Python.  Large among the
>>>> questions here is *which version of the UCD is referenced*?  That can evolve
>>>> over time in Python, but it must be fixed in CIF, at least for each CIF
>>>> version.  Shall we plan to issue a new version of CIF every time Python
>>>> moves up to a new Unicode version, and to deal with the multiple resulting
>>>> versions?  Must every CIF2 implementation lug along a name=>character table
>>>> just for this?  It is redundant with the other two Unicode elides.
>>>> b) The [uU] prefix.  In Python, Unicode strings are a different type of
>>>> object than ordinary strings, which is the main reason for the [uU] syntax.
>>>>  All CIF2 strings are Unicode strings, however (so there's an unavoidable
>>>> semantic difference regardless).  In CIF the [uU] prefix could still turn on
>>>> and off processing of Unicode elides, but to what end?  In rare cases, to
>>>> yield a slightly simpler representation of strings that would otherwise
>>>> clash with one of the Unicode elide sequences.  Should we really require all
>>>> conforming CIF processors to implement rules to support that obscure case,
>>>> even though it can reasonably be handled by the \\ elide instead?
>>>> c) The [bB] prefix.  I'm not clear on what it will mean in Python 3, but
>>>> it is ignored in Python 2.  The only Python 3 meanings I can imagine are
>>>> incompatible with CIF, and there is no technical advantage for CIF in
>>>> including [bB] just to ignore it.
>>>> d) The [rR] prefix.  In Python, this turns off elide processing for the
>>>> string, except that if the [uU] prefix is also present then Unicode elides
>>>> are still handled.  Also, the \\ elide is handled, but differently than for
>>>> other string literals.  I would be happier with this for CIF, though  still
>>>> not in favor, if it were a universal on/off for all elides.  Furthermore, as
>>>> Simon pointed out, raw strings are what we have now.  Supposing that we use
>>>> the Python rule that unrecognized elides are treated as literals, the value
>>>> of [rR] raw strings for CIF depends on how many and which elides we adopt.
>>>>  Inasmuch as I favor restriction to only a few elides, I don't see [rR]
>>>> adding much of value.
>>>> e) The \a, \b, \f, \n, \r, \t, and \v elides.  These needlessly clash
>>>> with the IUCr elides, they are redundant with Unicode elides, and they
>>>> express characters that either can appear as literals in triple-quoted
>>>> strings or are not allowed CIF characters (more on that in a separate
>>>> message).  Including these would complicate CIF implementations for little
>>>> or no technical advantage.
>>>> f) The \ooo and \xhh elides.  These are redundant with the Unicode
>>>> elides.  Moreover, they are byte-oriented in standard strings (so that their
>>>> actual meaning depends on the source or runtime character set), but
>>>> character-oriented in Unicode strings (there *thoroughly* redundant with the
>>>> \uxxxx and \Uxxxxxxxx forms).
>>>> That leaves very few Python string features that I could support being
>>>> added to CIF (triple-quoted strings only), to wit:
>>>> \<newline>
>>>> \uxxxx
>>>> \Uxxxxxxxx
>>>> \'
>>>> \"
>>>> \\
>>>> Among those, \' and \" serve only the purpose of delimiter elision; the
>>>> others have larger scopes.  Given that the need to elide delimiters is
>>>> likely to be quite rare, and that these two clash with the IUCr elides, I
>>>> would prefer to omit them.
>>>> As for the two Unicode escapes, it turns out that when the \[uU] is not
>>>> followed by the expected number of hex digits, the Python 2.4 behavior
>>>> differs from what the documentation led me to believe.  Python throws a
>>>> UnicodeDecodeError in such cases, rather than applying "all unrecognized
>>>> escape sequences are left in the string unchanged" to the whole construct.
>>>>  With respect to those forms, if they are included then I would prefer that
>>>> constructs such as '''\u065q''' be treated as literals rather than error
>>>> cases.  (And thus, to be subject to further interpretation at the
>>>> application level.)
>>>> Regards,
>>>> John
>>>> --
>>>> John C. Bollinger, Ph.D.
>>>> Department of Structural Biology
>>>> St. Jude Children's Research Hospital
>>>> Email Disclaimer:  www.stjude.org/emaildisclaimer
>>>> _______________________________________________
>>>> ddlm-group mailing list
>>>> ddlm-group@iucr.org
>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>> --
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148

T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148