Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Dear All,

This was formulated before the latest round of messages, but I think
what I have said still holds:

Nick writes:

     (1) restricting the character set of non-delimited strings is
     NON-NEGOTIABLE. If we don't restrict it, then we can't build recursive
     data structures and exploit DDLm. If we aren't going to exploit DDLm,
     IUCr should drop it now and stick with its current DDL.

Nick, surely it is negotiable insofar as non-delimited strings inside
bracket expressions could be treated differently to those outside
bracket expressions?  I realise that having different productions
would be inelegant, but we need to have some regard for the installed
base of CIF writing software (obviously reading software is not
affected).  It is good that the PDB are happy to update their
software, but what about all the single-crystal software that is
actively used at the moment in all those labs?  Don't those software
authors require some warning in the form of deprecation first, and
some time to distribute updated versions?

Nick again:

   Now guess what? If we don't allow a ' within a '..' and a " within a ".."
   and any "',:{} within a non-delimited string or a data name WE DON'T NEED A
   importantly NORMALIZES the grammar.

I agree with removing string delimiters from inside delimited
strings. I have already stated that I prefer to keep enforced
whitespace separation for human-readability purposes. I'll admit that
I don't understand the use of 'normalizing' in the above sentence, and
as my ability to find out anything on the internet is very limited at
the moment, perhaps Nick could elaborate on what is meant by this and
why it is desirable in the CIF context.
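
To make the restriction on non-delimited strings concrete, here is a minimal sketch of a check a CIF writer might run before emitting a value without delimiters. The forbidden set is taken from the characters listed in Nick's message; excluding whitespace and empty values as well is an assumption, and the function name is hypothetical, not part of any existing CIF library.

```python
# Characters Nick's message lists as excluded from non-delimited strings.
FORBIDDEN = set("\"',:{}")


def is_plain_nondelimited(value: str) -> bool:
    """Return True if `value` could be written without delimiters.

    Assumption: whitespace and the empty string are also rejected,
    since a non-delimited value is a single whitespace-separated token.
    """
    if not value:
        return False
    return not any(ch in FORBIDDEN or ch.isspace() for ch in value)


if __name__ == "__main__":
    for candidate in ["P21/c", "a,b", "O'Neill", "two words"]:
        print(repr(candidate), is_plain_nondelimited(candidate))
```

A writer that fails this check would fall back to a delimited string, which is the kind of change existing CIF writing software would have to make.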

Apropos UTF8: it is clear (to me at least) that, with time, UTF8 will
become the new ASCII (i.e. a universally understood encoding), so the
question for CIF is not whether to do this, but when and how.  A possible
downside for CIF is that, if we jump to UTF8 too early, we sacrifice
the ability to view any CIF perfectly in any text editor on any OS.
But this is ultimately not a big deal, as the 'mystery' characters
will not be interpreted as dangerous control characters (all will be
>127) and the rest of the file will remain understandable in any case.
I don't think the example given by Nick of someone wanting to insert a
unicode character in vi or emacs is such a big deal; presumably
someone who needs to use such characters from day to day will know how
to deal with unicode.
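
The claim that the 'mystery' characters are all >127 can be checked directly. The sketch below is illustrative only: the sample string and function names are made up for the example. It shows that every byte UTF-8 uses for a non-ASCII character has a value above 127, and that an ASCII-only viewer still sees the rest of the text intact.

```python
def utf8_high_bytes(text: str) -> list[int]:
    """Return the UTF-8 byte values that encode the non-ASCII characters."""
    out: list[int] = []
    for ch in text:
        if ord(ch) > 127:
            out.extend(ch.encode("utf-8"))
    return out


def ascii_view(text: str) -> str:
    """Roughly what an ASCII-only viewer shows: each non-ASCII byte
    becomes a replacement character, ASCII text is left untouched."""
    return text.encode("utf-8").decode("ascii", errors="replace")


if __name__ == "__main__":
    sample = "α = 90°"  # hypothetical value for illustration
    print([hex(b) for b in utf8_high_bytes(sample)])
    print(ascii_view(sample))
```

None of the bytes produced for the non-ASCII characters falls in the ASCII control range, so the worst case in an old editor is a few unreadable glyphs.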

More importantly, introducing non-ASCII characters into strings
immediately breaks most (or all) current CIF readers, so the right
time to do this is when we are breaking CIF readers anyway.  That
would be now, as we are introducing bracket structures which break
CIF1.1 readers.

I don't think there is a need to restrict UTF8 to special string types
(e.g. triple quote delimited strings), and I don't see the long-term
advantage of using an intermediate representation (i.e. \uA054 etc.).
While an intermediate representation may keep us from breaking CIF
readers now, it renders the string incomprehensible even in a
unicode-aware editor, and when UTF8 has taken over the world, will
appear backward.

Nick is concerned about the task of encoding/decoding unicode - but if
we use pure UTF8 with no intermediate encoding, this becomes a job for
the producer/consumer of the string, and the CIF layer simply passes
these strings back and forth.
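
The difference between the two approaches can be sketched as follows. The exact escape syntax is an assumption (a backslash, 'u', and four hex digits, modelled on the "\uA054" form mentioned above), and both function names are hypothetical. With an intermediate representation the CIF layer has to translate; with pure UTF8 it hands the string over unchanged.

```python
import re

# Assumed escape form: backslash, 'u', exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_u_escapes(text: str) -> str:
    """Intermediate representation: the CIF layer must expand each escape."""
    return _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


def passthrough(text: str) -> str:
    """Pure UTF8: the CIF layer returns the string unchanged."""
    return text


if __name__ == "__main__":
    print(decode_u_escapes(r"\u03b1-quartz"))
    print(passthrough("α-quartz"))
```

Both calls yield the same text for the consumer, but only the escaped form is unreadable in an editor and needs extra code on both the reading and the writing side.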

I believe the journals should comment on the usefulness of UTF8 to
them, as most of the non-ASCII characters will be the various Greek
and mathematical symbols, as well as various diacritics in authors'
names.  Note that the current ad hoc CIF control characters can be
interleaved with UTF8 encoded text without confusion, as the former
are a pure ASCII representation.  There is also a certain symmetry in
introducing UTF8 at the same time as we disallow string delimiters
internally, as most of the single quotes that appear internally in
single quote delimited strings are in order to represent accented
characters.  We give with one hand (these characters can be included
in UTF8 encoding instead) and take with the other (no more single
quotes in single quote delimited strings).
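
A small sketch of this trade-off, under stated assumptions: the accent markup shown (a backslash and a single quote before the letter, for an acute accent) is used as an illustrative example of the ad hoc CIF codes, and the author name and function name are invented for the example.

```python
def has_internal_delimiter(content: str, delimiter: str = "'") -> bool:
    """Return True if the content of a delimited string contains its own delimiter."""
    return delimiter in content


# Assumed ad hoc markup for an acute accent: backslash + single quote + letter.
markup_name = r"M\'endez"   # pure ASCII, but contains a single quote
utf8_name = "Méndez"        # same name written directly in UTF8

if __name__ == "__main__":
    for name in (markup_name, utf8_name):
        print(name, name.isascii(), has_internal_delimiter(name))
```

The markup form stays pure ASCII but cannot sit inside a single quote delimited string once internal delimiters are disallowed; the UTF8 form can.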

I will give my vote in Herb's straw poll in a separate email.
