Discussion List Archives


Re: [ddlm-group] Searching for a compromise on eliding

Dear Colleagues

I have been out of the office all week and largely away from email. I
apologise for not saying so when I last posted, but I had at the
time anticipated being able to keep in touch with this conversation.

On technical grounds, I favour

F'    - requires least handling of special escapes
G     - formally equivalent to F'

You will recognise the first line as a verbatim extract from my
posting of 18 January in response to the first call for a vote in this
matter. I have not seen any new technical considerations to change
my preference for the economy of such a specification. Proposal G has
two possible disadvantages: the need to construct new (but ideally
"natural") delimiters (would paired double quotes "" suffice?), and a
break with the existing use of """ in Nick's initial implementation.
I know that we have ascertained that we are not bound
to retain specific novel syntactic features of that implementation,
but I see no technical advantage in moving away from it.

I disfavour Option Q because it introduces what I consider an
unnecessary domain-specific interpretation of character strings.
The "domains" involved are not mutually exclusive: in IUCr journals
we would anticipate handling both core CIFs and mmCIFs, while
applications such as SHELX work with both small molecules and
macromolecules. However, that's not strictly a technical
consideration.

I found Ralf's intervention of 10 January very persuasive:

> In my observation any language that persisted long term has a feature to
> escape the closing quote token. Therefore I conjecture it is a small but
> vital feature.

This prompted me to revisit the need for a delimiter escape mechanism
that would then allow encapsulation of arbitrarily complex strings,
and thus, for example, remove the need for a new string concatenation
operator (a requested feature over which I was still rather unhappy).
In reviewing the discussions, I did note that Ralf's point had already
been made (by Nick, I think), but its importance was unfortunately
not appreciated at that time.

Proposals F'/G therefore address this technical "vital feature" to
my satisfaction.
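
To illustrate the kind of delimiter escape Ralf describes, here is a
minimal Python sketch. It assumes a hypothetical rule in which a
backslash immediately followed by the closing delimiter is the only
recognised escape; it is not the wording of Proposal F' or G, and the
function names are invented for the example. A backslash that is itself
escaped is deliberately not handled.

```python
def read_delimited(text, start=0, delim='"""'):
    """Scan one delimited string beginning at `start`.

    Returns (raw_content, index_just_after_closing_delimiter).
    Assumed rule (hypothetical): backslash + delimiter does not close
    the string; every other character is taken literally.
    """
    if not text.startswith(delim, start):
        raise ValueError("no opening delimiter at start position")
    i = start + len(delim)
    begin = i
    while i < len(text):
        if text[i] == "\\" and text.startswith(delim, i + 1):
            # Escaped delimiter: skip the backslash and the delimiter.
            i += 1 + len(delim)
        elif text.startswith(delim, i):
            # Unescaped delimiter: this closes the string.
            return text[begin:i], i + len(delim)
        else:
            i += 1
    raise ValueError("unterminated string")


def unescape(raw, delim='"""'):
    """Replace each escaped delimiter by the bare delimiter."""
    return raw.replace("\\" + delim, delim)
```

With such a rule a value may contain its own delimiter, for example
"""say \"""hi\""" now""", so there is no need to split the value and
rejoin the pieces with a concatenation operator.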

===

The rest of the discussions seem to pivot around psychology more than
technical requirements. In my experience, analysis of psychology
provides useful insight into understanding a historical sequence of
events, but is rarely successful at predicting the future.

I prefer, as I stated before, to consider policy based on such
psychological or social imperatives in the COMCIFS forum, but if it
helps to indicate here my opinions on the concerns raised, I would say
the following:

> (i) Behaviour of triple-quoted strings will be too confusing unless
> Python behaviour is followed (Ralf)

There is perhaps some opportunity for confusion; but in other areas
there are similar opportunities: shell file globbing shares some
syntactic features with regular expression processing. People who
really work with such systems manage to overcome the confusion - more
easily when there is a real difference in purpose (filename globbing
is indeed distinct from regexp processing, just as string delimiting
in a data file is different from string processing within an
interpreted program). As John B has pointed out, adoption of a
proposal P or close variants also has scope for confusion if a user is
not completely familiar with the version of Python chosen as the
underlying paradigm.
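
The globbing/regexp overlap can be seen in a short Python sketch. The
pattern and the file names are hypothetical, chosen only to show that
the same characters ("*" and ".") mean different things in the two
notations; fnmatch and re are the standard-library modules.

```python
import fnmatch
import re

PATTERN = "a*.cif"  # hypothetical pattern, read in two different ways


def glob_match(name, pattern=PATTERN):
    # Glob reading: "*" is any run of characters, "." is a literal dot.
    return fnmatch.fnmatchcase(name, pattern)


def regex_match(name, pattern=PATTERN):
    # Regexp reading: "a*" is zero or more "a", "." is any one character.
    return re.fullmatch(pattern, name) is not None


def glob_as_regex_match(name, pattern=PATTERN):
    # fnmatch.translate rewrites the glob as an equivalent regexp.
    return re.fullmatch(fnmatch.translate(pattern), name) is not None
```

Under these definitions "abc.cif" matches the pattern as a glob but not
as a regexp, while "aaXcif" matches as a regexp but not as a glob.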

> (ii) There is considerable criticism of CIF in the macromolecular
> community because of idiosyncratic behaviour, particularly concerning
> quoting.  We should therefore stick to accepted standards as much as
> possible (John W)

John W (and Herbert) are undoubtedly correct in identifying a distaste
within the macromolecular community for the idiosyncratic CIF formalism.
But I believe the second sentence is a non sequitur: I am not convinced
that adoption of a particular syntactic feature from Python is all
that is needed to persuade that community to embrace CIF with open
arms. As has been argued several times on this list, the technical
requirements on a data input parser for CIF are not very great (and by
opting for "economical" schemes such as James's proposal F' we would tend
towards minimising them). If the programmers within the macromolecular
community - many of whom I know to be extraordinarily competent and
intelligent - do not build CIF applications, I am sure it is because
they do not see sufficient scientific value in doing so, rather than
that the complexity or awkwardness of the file format defeats them.
Or at least I shall persist in believing that.

Let us take this element of the discussion onto the COMCIFS list,
preferably on the back of the revised proposal that I encourage James
to present from the ddlm-group.

===

Back to the technical considerations which I believe this group
should focus on. I consider the most desirable outcome to be a
clear and clean specification. Proposals F'/G will achieve that
elegantly. Proposal P has the potential to achieve that (though one
does need to specify the version of Python and perhaps reconsider the
handling of Unicode characters), although I still feel that as
a specification it carries too high a burden for compliance from
applications developers working outside of a Python framework. I
would strongly discourage attempts at a compromise that seeks to
provide a technical solution based on some minimisation of the 
root mean square unhappiness of the members of this group, but that
ends up with an unstructured mish-mash of features from different
proposals.

Regards
Brian

On Fri, Feb 25, 2011 at 01:50:59PM +1100, James Hester wrote:
> Dear DDLm-group,
> 
> I think we have all had a decent chance to argue our case for
> Proposals P, F and F'.  I have also been in small side discussions
> with Ralf and John W.  Their points of view can be summarised as
> follows:
> (i) Behaviour of triple-quoted strings will be too confusing unless
> Python behaviour is followed (Ralf)
> (ii) There is considerable criticism of CIF in the macromolecular
> community because of idiosyncratic behaviour, particularly concerning
> quoting.  We should therefore stick to accepted standards as much as
> possible (John W)
> 
> For John W and Ralf these points outweigh any of the disadvantages of
> Proposal P, and so Proposal P remains their first choice.  Proposal P
> is therefore the first choice of 3 out of 5 COMCIFS voters, and the
> last choice of the other two (I would rank it worse than doing
> nothing, actually).  I note that non-voting members are uniformly
> opposed to Proposal P.
> 
> I therefore want to try to seek some common middle ground in the hope
> that I can find a proposal that could be at least as acceptable as
> Proposal P to Ralf and/or Herbert and/or John W.
> 
> Consider the following four new proposals - P-prime, Q, G and null:
> 
> * Proposal P-prime: triple-quoted strings are treated as for Python
> 2.7.  No Unicode or raw strings are defined (ie no strings starting
> u""" or r""").
> 
> I interpret John W and Ralf's position to be that they would be able
> to support this proposal as the preferred choice, as our syntax would
> still be entirely consistent with Python.  This proposal is a
> considerable improvement on Proposal P, because the dangers of raw
> strings are taken out of the equation, and the Unicode database is no
> longer a dependency.  We are still left with a whole bunch of (frankly
> pointless) elides, leading to Proposal Q:
> 
> * Proposal Q: As for Proposal P-prime, with the following changes:
> (1) Only <backslash><delimiter> and <backslash><backslash> when it
> precedes <backslash><delimiter> are recognised escape sequences at the
> syntactical level
> (2) A DDLm string type, e.g. "CText", is defined in com_val.dic for
> which the remaining escape sequences have the meaning assigned to them
> by the Python 2.7 standard.  mmCIF and related domains can standardise
> their definitions on this string type and derivatives, making the
> above division between syntax and dictionary invisible to users and
> programmers in their domain.
> 
> * Proposal G: Proposal F', but with a different delimiter
> 
> Ralf has indicated that he actually thinks Proposal F' is best, but
> only if the delimiters are not going to be confused with Python
> delimiters.  I interpret John W's position to be that he would not
> support such a change in delimiters as that would simply make CIF even
> more idiosyncratic.  Anyway, any such replacement delimiter would need
> to be multi-character, easy to type and unlikely to occur as the first
> characters in CIF1 datavalues.  We would also need to reduce the
> characterset of non-delimited CIF2 strings to exclude any such
> delimiters.  Ideas?
> 
> * Null proposal: do nothing as we can't agree
> 
> I think I could support Proposal Q as an acceptable fallback from F',
> and if somebody can find sensible delimiters for Proposal G that works
> for me as well.  The preferred treatment for backslash rich text for
> Proposals P,P' and Q will necessarily be semicolon-delimited strings.
> 
> James.
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
