Discussion List Archives

Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Some general comments.

In agreeing (in principle) to adopt DDLm, COMCIFS has accepted the
need for changes to STAR and to DDL applications that are incompatible
with the original formulations. Not necessarily to CIF data files
(as Nick points out, COMCIFS can still mandate invariant data files);
but I think the mood of COMCIFS is to accept this as an opportunity
to improve CIF.

I also think that introducing new constructs such as bracket delimiters
to STAR/DDL will inevitably lead to pressure to include them in CIF.
COMCIFS may do this quickly or slowly, depending on the pressure from
the community, but we should suppose that at some point CIFs will
exist that have whatever syntactic changes we introduce here into
STAR and DDL.

The trick then is ensuring that the community can handle a universe
containing "old" and "new" CIFs. "Remediation" is not the answer,
because one can always use legacy software to create an "old" CIF
that is perfectly valid against the original specifications.

It is also unlikely that all CIF software will be upgraded to handle new
CIFs. We might want it to be, but suppose Ton Spek is unwilling or unable
to modify PLATON to read UTF-8 (is this easy to do with Fortran?). This
would have a severe impact on Acta's validation procedures. And for
the purposes of that particular program, the proposed CIF enhancements
have little relevance.

So there will need to be procedures allowing old software to handle
"new" CIFs to the extent that that is useful - and as in my PLATON
example, it may still be very useful. Hence I would like to be sure
that the new features we introduce will at least allow lossless
"old"->"new"->"old" AND "new"->"old"->"new" cycles of conversion.

Such conversions might actually be performed by standalone applications
or by library subroutines allowing on-the-fly management of CIFs of
both the new and the old type.

A lossless conversion need not leave the initial and final files
identical, so
  _name  O'Neill  ->  _name  "O\u27Neill"  ->  _name  "O'Neill"
is acceptable (where I use \u27 in this email to stand for
whatever Unicode encoding we decide to support; though if I
understand things correctly, UTF-8 encoding of that character
is the same as an ASCII apostrophe, so would not be permitted
under the current proposal!).
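
To make the idea of "lossless but not identical" concrete, here is a
minimal Python sketch. It is only an illustration under stated
assumptions: the function names (parse_value, to_new, to_old,
is_lossless) are hypothetical, and the rule that a "new" CIF must
delimit any value containing an apostrophe is assumed for the sake of
the example, not taken from any agreed specification. The point is that
the round trip is judged on the parsed value, not on the raw text.

```python
def parse_value(token: str) -> str:
    """Return the value a token carries, dropping one pair of matching quote delimiters."""
    if len(token) >= 2 and token[0] == token[-1] and token[0] in ("'", '"'):
        return token[1:-1]
    return token


def to_new(token: str) -> str:
    """Hypothetical old -> new step: double-quote any value containing an apostrophe."""
    value = parse_value(token)
    if "'" in value and '"' not in value:
        return '"' + value + '"'
    return token


def to_old(token: str) -> str:
    """Hypothetical new -> old step: a double-quoted string is already valid old syntax."""
    return token


def is_lossless(token: str) -> bool:
    """Compare parsed values, not raw text, across one old -> new -> old cycle."""
    return parse_value(to_old(to_new(token))) == parse_value(token)


# The text changes (quotes are added) but the value survives the cycle.
assert to_new("O'Neill") != "O'Neill"
assert is_lossless("O'Neill")

# U+0027 encodes in UTF-8 as the single byte 0x27, i.e. the ASCII apostrophe.
assert "'".encode("utf-8") == b"\x27"
```

This sketch ignores values that contain both kinds of quote character,
which would need a text field or an escape mechanism.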

This is still somewhat problematic, as one could not guarantee
that PLATON, let us say, will actually treat the atom label
identically in these two cases:
    _atom_site_label       O1'
    _atom_site_label      "O1'"
and one may therefore need additional normalization or translation
tools for specific legacy applications; but I think you need at
least to ensure that the information content can go through several
such cycles without loss.

Taking "new" CIFs with bracketed delimiters through the inverse cycle
should not be problematic, to the extent that one assumes "old" software
can't do anything useful with the contents of a bracketed data value,
so you just surround it with semicolon delimiters and some "magic
number" to indicate that's what you have done.
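
A rough Python sketch of that wrapping step follows. Everything in it
is an assumption made for illustration: the marker string MAGIC and the
function names are hypothetical, and no CIF or STAR document defines
such a marker. It also assumes the wrapped field is written starting at
the beginning of a line, as semicolon-delimited text fields require.

```python
from typing import Optional

# Hypothetical marker; not defined by any CIF or STAR specification.
MAGIC = "<<bracketed-value>>"


def wrap_bracketed(value: str) -> str:
    """Wrap a bracketed value in a semicolon-delimited text field tagged with MAGIC."""
    if value.startswith(";") or "\n;" in value:
        # A line starting with ';' would end the text field early.
        raise ValueError("value contains a line beginning with ';'")
    return ";" + MAGIC + "\n" + value + "\n;"


def unwrap_bracketed(field: str) -> Optional[str]:
    """Recover the bracketed value, or return None if the field was not wrapped by us."""
    prefix = ";" + MAGIC + "\n"
    suffix = "\n;"
    if (
        len(field) >= len(prefix) + len(suffix)
        and field.startswith(prefix)
        and field.endswith(suffix)
    ):
        return field[len(prefix):-len(suffix)]
    return None


original = "[1, 2, [3, 4]]"
wrapped = wrap_bracketed(original)
assert unwrap_bracketed(wrapped) == original        # "new" -> "old" -> "new" is lossless
assert unwrap_bracketed(";ordinary text\n;") is None  # untagged text fields are left alone
```

Old software would see only an ordinary text field; new software could
look for the marker and restore the bracketed value.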

As I finish writing this, I see James' vote and comments have just come
in, and in some of what he says I see resonances with at least some of
these ideas.

I'll send more comments later.
Brian
