Discussion List Archives


Re: [ddlm-group] Searching for a compromise on eliding


On Thursday, February 24, 2011 8:51 PM, James Hester wrote:

>Consider the following four new proposals - P-prime, Q, G and null:
>
>* Proposal P-prime: triple-quoted strings are treated as for Python
>2.7.  No Unicode or raw strings are defined (ie no strings starting
>u""" or r""").

This is a considerable improvement over proposal P, but it still suffers from some technical issues.  The most important of those is that the \xhh and \ooo escapes can introduce bytes into a string literal that cannot be decoded into characters, or whose decoded value depends on the encoding scheme.  Any use of them therefore makes the body of a CIF dependent on the character encoding scheme with which it is interpreted.  This is a worse problem than character encoding in general (and no one can have forgotten the extended debate we had on that topic) because a CIF using these escapes could be correctly transcoded only by a CIF-aware program.
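A minimal sketch of that encoding dependence, written in Python 3, where a bytes literal stands in for the byte that a Python 2.7 \xhh escape would place in a string. The helper name decode_or_none is hypothetical, and the three encodings are chosen only for illustration:

```python
def decode_or_none(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


# The single byte that the escape \xe9 denotes in a Python 2.7 (byte) string.
raw = b"\xe9"

as_latin1 = decode_or_none(raw, "latin-1")  # LATIN SMALL LETTER E WITH ACUTE
as_cp1251 = decode_or_none(raw, "cp1251")   # CYRILLIC SMALL LETTER SHORT I
as_utf8 = decode_or_none(raw, "utf-8")      # not decodable on its own
```

The same escape therefore yields a different character, or no character at all, depending on how the file is read.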

Python 3 behavior with respect to hex and octal escapes is much more suitable, but switching all the way over to Python 3 as the basis for triple-quote syntax loses the single greatest advantage of P' over P, which is the absence of the \N{name} escape.
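For comparison, a short sketch of the Python 3 treatment: in a Python 3 string literal, \xhh and \ooo name a Unicode code point rather than a byte, so the character is fixed before any encoding is applied. The variable names are illustrative only.

```python
# In Python 3 text strings, these escapes all denote the code point U+00E9.
from_hex = "\xe9"
from_octal = "\351"  # octal 351 == decimal 233 == hex E9
from_name = "\N{LATIN SMALL LETTER E WITH ACUTE}"  # the escape P' avoids

same_character = from_hex == from_octal == from_name

# The encoding now only affects the bytes written out, not the character meant.
utf8_bytes = from_hex.encode("utf-8")
latin1_bytes = from_hex.encode("latin-1")
```

The last line of the comparison also shows the cost: adopting Python 3 literal syntax wholesale brings \N{name} along with it.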

Alternatively, we could omit the \ooo and \xhh elides, or treat them as in Python 3 instead of Python 2, but I suspect those who value compliance with an existing standard would disfavor those alternatives.

>* Proposal Q: As for Proposal P-prime, with the following changes:
>(1) Only <backslash><delimiter> and <backslash><backslash> when it
>precedes <backslash><delimiter> are recognised escape sequences at the
>syntactical level
>(2) A DDLm string type, e.g. "CText", is defined in com_val.dic for
>which the remaining escape sequences have the meaning assigned to them
>by the Python 2.7 standard.  mmCIF and related domains can standardise
>their definitions on this string type and derivatives, making the
>above division between syntax and dictionary invisible to users and
>programmers in their domain.

I prefer Q to P', in part because it avoids the technical issue above, at least at the language level, and also because it provides more latitude for dealing with the IUCr elide system.  Under Q, most of the conflicting elides are handled at the same level, so which are used can be a matter of policy.

>* Proposal G: Proposal F', but with a different delimiter
>
>Ralf has indicated that he actually thinks Proposal F' is best, but
>only if the delimiters are not going to be confused with Python
>delimiters.  I interpret John W's position to be that he would not
>support such a change in delimiters as that would simply make CIF even
>more idiosyncratic.  Anyway, any such replacement delimiter would need
>to be multi-character, easy to type and unlikely to occur as the first
>characters in CIF1 datavalues.  We would also need to reduce the
>characterset of non-delimited CIF2 strings to exclude any such
>delimiters.  Ideas?

Matching '"' and / or "'".

loop_
  _quoting.example
  '"'The cell is rhombohedral with a = 12.011 \%A and \a = 85\%'"'
  "'"There is no \"u in "Karlsruhe""'"
  "'"This value contains "'\
""'"

I quite like this, actually.
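A rough sketch of how a reader might recognise these delimiters, assuming that a value ends at the first later occurrence of the same three-character delimiter and that nothing between the delimiters is interpreted. The function name read_delimited is hypothetical, and line folding (as in the third value above) is deliberately not handled:

```python
# The two proposed three-character delimiters: '"' and "'".
DELIMITERS = ("'\"'", "\"'\"")


def read_delimited(text):
    """Return the content of a value wrapped in one of the proposed delimiters."""
    for delim in DELIMITERS:
        if text.startswith(delim):
            end = text.find(delim, len(delim))
            if end == -1:
                raise ValueError("unterminated value")
            return text[len(delim):end]
    raise ValueError("value does not start with a recognised delimiter")


# Raw triple-quoted literals keep the backslashes and quotes exactly as typed.
first = read_delimited(r"""'"'The cell is rhombohedral with a = 12.011 \%A and \a = 85\%'"'""")
second = read_delimited(r'''"'"There is no \"u in "Karlsruhe""'"''')
```

Under these assumptions the IUCr elides (\%A, \a, \"u) pass through untouched, and the embedded double quotes around Karlsruhe do not end the value.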


>* Null proposal: do nothing as we can't agree
>
>I think I could support Proposal Q as an acceptable fallback from F',
>and if somebody can find sensible delimiters for Proposal G that works
>for me as well.  The preferred treatment for backslash rich text for
>Proposals P,P' and Q will necessarily be semicolon-delimited strings.

I could grudgingly accept a P'' that resolves the issue with hex and octal escapes either by removing them or by treating them as in Python 3.  I would accept such a proposal less grudgingly if it also removed at least the \a, \b, \f, and \v escapes, which represent characters outside the CIF 1.1 and CIF 2.0 character sets, and which I think are rarely used (and not widely known) anyway.  Supposing that \' and \" are inviolate, my favorite proposal along these lines would also omit \n, \r, and \t, which can be expressed as literals.
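As a quick check of which single-character escapes fall outside the character set, here is a small sketch. It assumes that tab, line feed, and carriage return are the only control characters CIF permits; the names ALLOWED_CONTROLS and outside_cif are illustrative.

```python
# Assumption: HT (U+0009), LF (U+000A) and CR (U+000D) are the only
# control characters allowed in a CIF.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}

# Escape spelling -> the character Python assigns to it.
escapes = {
    r"\a": "\a", r"\b": "\b", r"\f": "\f", r"\v": "\v",
    r"\n": "\n", r"\r": "\r", r"\t": "\t",
}

# Escapes whose character is a control character that CIF does not allow.
outside_cif = sorted(
    name for name, ch in escapes.items()
    if ord(ch) < 0x20 and ord(ch) not in ALLOWED_CONTROLS
)
```

The remaining three (\n, \r, \t) map to permitted characters, which is why they can simply be typed as literals.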


Regards,

John

