
Re: [ddlm-group] Simon's elide proposal

I respectfully disagree, but will refrain from detailed
comment until we settle on our goals, at which point
I suspect much of this discussion will become moot.
   -- Herbert

  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769


On Tue, 18 Jan 2011, James Hester wrote:

> John has made some good points in reply, to which I'll add a few others:
> On Fri, Jan 14, 2011 at 9:25 PM, Herbert J. Bernstein
> <yaya@bernstein-plus-sons.com> wrote:
>> OK:
>>  1.  I do think there is value in having CIF capable
>> of functioning as a programming language, and see
>> nothing to be gained by crippling its ability to
>> function in that role.  As noted there is a move now
>> towards the creation of executable papers to allow
>> journal articles to be better reviewed.  Opening
>> the dREL features to more general use than just in
>> dictionaries would allow the IUCr to explore this
>> very important direction and be more competitive
>> with Elsevier, so I strongly disagree with John's
>> value judgement that a rich feature set is somehow
>> negative.
> This whole paragraph makes no sense without a more concrete
> explanation of how you plan to turn a data container into a
> programming language, and why the current CIF2+DDLm+dREL framework is
> not adequate to the task you envisage.  Regardless of the
> well-advisedness or otherwise of the quest to turn CIF into a
> programming language, I would note that simply adopting the string
> literal syntax of a programming language does not in any way make a
> data format somehow more like a programming language - string literal
> syntax is simply syntactic sugar for specifying a sequence of bytes.
>> 2.  I find the ability to have escapes to handle the
>> "illegal" characters useful for imgCIF which need
>> to be able to handle at least the range of 15 out
>> of 16 bits without breaks.
> You can have all the escapes you want by either creating a new string
> type in DDLm, or describing a syntax right in the item definition.
> There is therefore no need to impose a heap of syntactic sugar on the
> entire CIF community simply to satisfy a domain-specific application.
> I will clarify for John B that I consider that disallowed Unicode code
> points should not appear in any CIF datavalue.  CIF datavalues can
> obviously be transformed to include those code points in application
> specific contexts, so, for example, Herbert can define \b to mean
> ASCII BEL in a particular imgCIF item definition if he thinks that
> useful, or a LaTeX processor can take LaTeX text inside a CIF
> datavalue and turn it into DVI.
>> 3. I find the \N{}, \a, ... constructs useful for the reasons
>> in 1, above.  In point of fact, I think we would be best
>> off following Brian's original approach of being "maximally
>> disruptive" and requiring a uniform translation of all
>> the IUCr glyphs that conflict with current programming
>> practice to an escaped form \\a
> In addition to John B's entirely reasonable comments, note that
> supporting the \N construct would create a dependency on the whole
> Unicode database in every CIF2 parser.  Am I really the only one who
> finds this ridiculous?   Asking the IUCr to wholesale redefine their
> glyphs would require your application to be considerably more
> important, which is far from demonstrated.
>> In any case, I think you should get the point -- this
>> really is a matter of taste, not a technical issue.
>> I find python compatibility a strong plus to help
>> move us into the executable paper realm, indeed
>> to help move CIF into being a scripting language.
> No, as John says, it is an important design issue.  Unlike a
> programming language, we have several layers at which meaning can be
> created: syntactic, DDL, and domain dictionary.  To ignore the latter
> two is to misunderstand the entire CIF project.
>> If anyone wants even more detail, I will be happy
>> to send it in an off-list message, but I think it
>> really might be wise to address the broader issues
>> of what we want CIF to be, first, before we get
>> too far into those technical details.  Right now
>> I have to get packed to be able to catch a plane,
>> so the longer answer will have to wait until tomorrow.
>> =====================================================
>>  Herbert J. Bernstein, Professor of Computer Science
>>   Dowling College, Kramer Science Center, KSC 121
>>        Idle Hour Blvd, Oakdale, NY, 11769
>>                 +1-631-244-3035
>>                 yaya@dowling.edu
>> =====================================================
>> On Fri, 14 Jan 2011, James Hester wrote:
>>> Dear Herbert,
>>> Au contraire, I would not be bored, I'd be fascinated by a
>>> point-by-point rebuttal.  I find John's assessment spot-on and do not
>>> think dismissing his points as a matter of taste shows much respect
>>> for the amount of time he has put in to formulate these comments.
>>> Please go ahead and rebut his points.
>>> James.
>>> On Fri, Jan 14, 2011 at 5:42 AM, Herbert J. Bernstein
>>> <yaya@bernstein-plus-sons.com> wrote:
>>>> Dear Colleagues,
>>>>   I will not bore you all with a point-by-point rebuttal
>>>> to John B.'s negative assessment of Python treble quote
>>>> use in a CIF context.  Most of what he sees as defects,
>>>> I see as virtues.  Such are differences in taste, and
>>>> more importantly in the uses to which we put CIF.  Especially
>>>> with the introduction of dREL and DDLm, I do see CIF as
>>>> a programming language, and as one with strong similarities
>>>> to Python.  That does not mean everybody has to use it that
>>>> way, just that it would be nice if those who use it one way
>>>> and those who use it another could find some common ground.
>>>> There is now a move towards executable papers, and I suspect
>>>> a more powerful and flexible python compatible CIF could be
>>>> a strong competitor in that area.  Indeed, if current trends
>>>> continue, the IUCr is likely to need programming support
>>>> in papers if it is to keep up.
>>>> One point that does need a rebuttal...
>>>>> It should also be noted that Python source code, including its string
>>>>> literals, is restricted to being expressed in the characters of the
>>>>> 7-bit ASCII character set (though they need not necessarily be encoded
>>>>> according to US-ASCII).  Unconditional, bidirectional CIF/Python string
>>>>> compatibility would require that we apply the same restriction to CIF2
>>>>> triple-quoted strings.  I would oppose that.
>>>> That started to change in Python 2.5 which allowed explicit encoding
>>>> declarations, and by Python 3 has vanished even without an
>>>> encoding declaration.  The Python 3 spec is:
>>>> "Python reads program text as Unicode code points; the encoding
>>>> ... defaults to UTF8"
>>>> For more on how Python dealt with this issue at the same time
>>>> we were considering it, see:
>>>> http://www.python.org/dev/peps/pep-3120/
>>>> =====================================================
>>>>  Herbert J. Bernstein, Professor of Computer Science
>>>>    Dowling College, Kramer Science Center, KSC 121
>>>>         Idle Hour Blvd, Oakdale, NY, 11769
>>>>                  +1-631-244-3035
>>>>                  yaya@dowling.edu
>>>> =====================================================
>>>> On Thu, 13 Jan 2011, Bollinger, John C wrote:
>>>>> On Thursday, January 13, 2011 7:10 AM, SIMON WESTRIP wrote:
>>>>>> Let's assume we were starting with CIF2 that included a minimal scheme
>>>>>> like F'.
>>>>> What then would be gained by adopting the full python specification of
>>>>> string literals?
>>>>>> 1) "Cleaner" presentation in the very rare cases that the eliding
>>>>>> system would be needed in order to accommodate delimiters within the value.
>>>>>> This is purely a matter of taste.
>>>>>> 2) Ability to include raw strings using the 'r' prefix. But in CIF2 as
>>>>>> it stands, all strings are 'raw'.
>>>>> Yes, but that will no longer be true if any of the proposals we're
>>>>> discussing is adopted.
>>>>>> Perhaps others can add to this list?
>>>>>> From the perspective of technical features only:
>>>>> 3) Three distinct forms for expressing Unicode characters via ASCII
>>>>> characters; one is restricted to characters from the BMP, but the others are
>>>>> general
>>>>> 4) Two forms for expressing 8-bit characters (from some undocumented
>>>>> character set, probably the source character set) via ASCII characters
>>>>> 5) Several elides for specific whitespace and non-printing ASCII
>>>>> characters, some of which are not among the allowed CIF characters, and all
>>>>> of which clash with the IUCr application-level elides
>>>>> 6) A mechanism for indicating whether the three forms of Unicode elides
>>>>> of item (3) should in fact be processed, or not.
>>>>> 7) A mechanism for representing a byte-string data object, or possibly a
>>>>> stub for such a feature, depending on which Python version serves as a
>>>>> reference
>>>>> Commentary:
>>>>> I think that makes a complete list of the new technical features that
>>>>> full Python string literals would bring to CIF, beyond those of proposal F.
>>>>>  I ignore a few semantic details that are mostly consistent with the current
>>>>> CIF specifications.
>>>>> Python's is indeed a rich feature set, but that is one of my objections
>>>>> to its use for CIF.  CIF is a data representation language, not a
>>>>> programming language, so once the language can represent everything in its
>>>>> present and future domain, alternative representation mechanisms add little.
>>>>>  People can and do write CIF by hand, but I don't think that use case is of
>>>>> sufficient import to justify convenience features solely for its support,
>>>>> particularly when such features present problems in other respects.
>>>>> Furthermore, Python admits essentially one implementation (changing
>>>>> slowly over time), so a rich feature set does not present compatibility
>>>>> problems.  CIF, however, anticipates many implementations, so the number and
>>>>> complexity of its features contribute to the likelihood of incompatibility
>>>>> between implementations.
>>>>> Most importantly, however, I think several of the Python features are
>>>>> inappropriate for CIF, and I specifically want them excluded:
>>>>> a) The \N{name} syntax for designating Unicode characters by UCD name.
>>>>>  I view this as the single greatest locus for bugs and incompatibility, both
>>>>> among CIF implementations and between CIF and Python.  Large among the
>>>>> questions here is *which version of the UCD is referenced*?  That can evolve
>>>>> over time in Python, but it must be fixed in CIF, at least for each CIF
>>>>> version.  Shall we plan to issue a new version of CIF every time Python
>>>>> moves up to a new Unicode version, and to deal with the multiple resulting
>>>>> versions?  Must every CIF2 implementation lug along a name=>character table
>>>>> just for this?  It is redundant with the other two Unicode elides.
>>>>> b) The [uU] prefix.  In Python, Unicode strings are a different type of
>>>>> object than ordinary strings, which is the main reason for the [uU] syntax.
>>>>>  All CIF2 strings are Unicode strings, however (so there's an unavoidable
>>>>> semantic difference regardless).  In CIF the [uU] prefix could still turn on
>>>>> and off processing of Unicode elides, but to what end?  In rare cases, to
>>>>> yield a slightly simpler representation of strings that would otherwise
>>>>> clash with one of the Unicode elide sequences.  Should we really require all
>>>>> conforming CIF processors to implement rules to support that obscure case,
>>>>> even though it can reasonably be handled by the \\ elide instead?
>>>>> c) The [bB] prefix.  I'm not clear on what it will mean in Python 3, but
>>>>> it is ignored in Python 2.  The only Python 3 meanings I can imagine are
>>>>> incompatible with CIF, and there is no technical advantage for CIF in
>>>>> including [bB] just to ignore it.
>>>>> d) The [rR] prefix.  In Python, this turns off elide processing for the
>>>>> string, except that if the [uU] prefix is also present then Unicode elides
>>>>> are still handled.  Also, the \\ elide is handled, but differently than for
>>>>> other string literals.  I would be happier with this for CIF, though  still
>>>>> not in favor, if it were a universal on/off for all elides.  Furthermore, as
>>>>> Simon pointed out, raw strings are what we have now.  Supposing that we use
>>>>> the Python rule that unrecognized elides are treated as literals, the value
>>>>> of [rR] raw strings for CIF depends on how many and which elides we adopt.
>>>>>  Inasmuch as I favor restriction to only a few elides, I don't see [rR]
>>>>> adding much of value.
>>>>> e) The \a, \b, \f, \n, \r, \t, and \v elides.  These needlessly clash
>>>>> with the IUCr elides, they are redundant with Unicode elides, and they
>>>>> express characters that either can appear as literals in triple-quoted
>>>>> strings or are not allowed CIF characters (more on that in a separate
>>>>> message).  Including these would complicate CIF implementations for little
>>>>> or no technical advantage.
>>>>> f) The \ooo and \xhh elides.  These are redundant with the Unicode
>>>>> elides.  Moreover, they are byte-oriented in standard strings (so that their
>>>>> actual meaning depends on the source or runtime character set), but
>>>>> character-oriented in Unicode strings (there *thoroughly* redundant with the
>>>>> \uxxxx and \Uxxxxxxxx forms).
>>>>> That leaves very few Python string features that I could support being
>>>>> added to CIF (triple-quoted strings only), to wit:
>>>>> \<newline>
>>>>> \uxxxx
>>>>> \Uxxxxxxxx
>>>>> \'
>>>>> \"
>>>>> \\
>>>>> Among those, \' and \" serve only the purpose of delimiter elision; the
>>>>> others have larger scopes.  Given that the need to elide delimiters is
>>>>> likely to be quite rare, and that these two clash with the IUCr elides, I
>>>>> would prefer to omit them.
>>>>> As for the two Unicode escapes, it turns out that when the \[uU] is not
>>>>> followed by the expected number of hex digits, the Python 2.4 behavior
>>>>> differs from what the documentation led me to believe.  Python throws a
>>>>> UnicodeDecodeError in such cases, rather than applying "all unrecognized
>>>>> escape sequences are left in the string unchanged" to the whole construct.
>>>>>  With respect to those forms, if they are included then I would prefer that
>>>>> constructs such as '''\u065q''' be treated as literals rather than error
>>>>> cases.  (And thus, to be subject to further interpretation at the
>>>>> application level.)
>>>>> Regards,
>>>>> John
>>>>> --
>>>>> John C. Bollinger, Ph.D.
>>>>> Department of Structural Biology
>>>>> St. Jude Children's Research Hospital
>>>>> Email Disclaimer:  www.stjude.org/emaildisclaimer
>>>>> _______________________________________________
>>>>> ddlm-group mailing list
>>>>> ddlm-group@iucr.org
>>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>> --
>>> T +61 (02) 9717 9907
>>> F +61 (02) 9717 3145
>>> M +61 (04) 0249 4148
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
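
For readers following the elide discussion quoted above, here is a minimal sketch (in Python 3, not the Python 2.4 that John tested) of two behaviours he describes. `python_rejects` uses the standard `unicode_escape` codec to show that a truncated \u escape is treated as an error. `decode_minimal` is a hypothetical helper, not part of any CIF specification or library. It assumes only the \<newline>, \uxxxx, \Uxxxxxxxx and \\ elides are recognised, and it leaves anything malformed or unrecognised in the string as a literal, as John says he would prefer.

```python
import codecs
import re

# Assumption: only the elides John lists as acceptable are recognised:
# \<newline>, \\, \uxxxx and \Uxxxxxxxx.
_ELIDE = re.compile(r"\\(\n|\\|u[0-9A-Fa-f]{4}|U[0-9A-Fa-f]{8})")


def python_rejects(raw: bytes) -> bool:
    """Return True if Python's unicode_escape codec refuses the input."""
    try:
        codecs.decode(raw, "unicode_escape")
    except UnicodeDecodeError:
        return True
    return False


def decode_minimal(text: str) -> str:
    """Hypothetical decoder: process the minimal elide set, keep the rest literal."""

    def replace(match: re.Match) -> str:
        body = match.group(1)
        if body == "\n":
            return ""  # backslash-newline joins two lines
        if body == "\\":
            return "\\"  # doubled backslash yields one backslash
        code = int(body[1:], 16)
        if code > 0x10FFFF:
            return match.group(0)  # outside Unicode: leave as written
        return chr(code)

    return _ELIDE.sub(replace, text)


if __name__ == "__main__":
    print(python_rejects(b"\\u065q"))       # truncated escape
    print(repr(decode_minimal("\\u065q")))  # kept as a literal
    print(repr(decode_minimal("\\u0041")))  # decoded to a character
```

Because the pattern only matches well-formed sequences, a construct such as \u065q or \a falls through unchanged and stays available for interpretation at the application level.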