Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .... .

Thanks to Herbert, John and Simon for responding. I'm sorry if this
seems like yet another trip round an endless loop, but your replies
have helped me to settle on the way I would like to see things move
forward. For what it's worth:


I favour the specification *recommending* a magic string to begin a
file: an optional BOM followed by the 11 characters


I favour the specification *recommending* that this initial comment
should be extended with an indication of the character
encoding where this is not ASCII. I suggest the specification's
discussion of the form this will take, as well as any other comments
on character-set encoding, be presented in a distinct section of the
specification (Part 3, or an Annexe or Appendix).

These are recommendations rather than requirements, for two reasons:

1. to include existing CIF1.0 and CIF1.1 instances as valid CIF input
streams (whether "decorated" or not);

2. because you can only ever take this meta-information as
well-intentioned hints.


I like the idea of a checksum, but I think it's premature to require
any particular formulation at this revision of the specification.


I favour this new "Part 3" of the specification providing some general
commentary on the nature of text files and transcoding issues.  It
should present UTF-8 as a "concrete" instantiation, and stipulate a
suitable tag for incorporation in the "magic number" comment, let us
say something like <UTF-8>. It should explain the importance of
developers following the "recommendations", and should caution against
(but not prohibit) gratuitous proliferation of encodings. It should
identify an additional resource hosted on the COMCIFS web site that
provides guidance to developers.
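
As a rough illustration of how a reader might treat such a header, here
is a minimal Python sketch. It checks for an optional UTF-8 BOM, a magic
string, and an encoding tag in angle brackets, and returns the tag only
as a hint. The function name sniff_header, the MAGIC value and the tag
syntax are assumptions made for this example; nothing here is fixed by
the specification, and only a UTF-8 BOM is recognised.

```python
import codecs
import re

# Assumption: placeholder magic string for illustration only; the exact
# string under discussion in this thread is not reproduced here.
MAGIC = "#\\#CIF_2.0"

# Assumption: an encoding tag is any non-empty text in angle brackets.
TAG_RE = re.compile(r"<([^<>]+)>")


def sniff_header(data: bytes, magic: str = MAGIC):
    """Return (has_bom, has_magic, encoding_hint) for a byte stream.

    The hint is only what the file claims about itself; callers should
    still be prepared for the claim to be wrong or absent.
    """
    has_bom = data.startswith(codecs.BOM_UTF8)
    if has_bom:
        data = data[len(codecs.BOM_UTF8):]
    # The header line is expected to be ASCII; anything else is replaced
    # so that decoding never fails at this stage.
    first_line = data.split(b"\n", 1)[0].decode("ascii", errors="replace")
    first_line = first_line.rstrip("\r")
    has_magic = first_line.startswith(magic)
    hint = None
    if has_magic:
        match = TAG_RE.search(first_line[len(magic):])
        if match:
            hint = match.group(1)
    return has_bom, has_magic, hint
```

An undecorated CIF1-style stream simply yields no magic string and no
hint, so it can still be passed on to the parser as valid input.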

Use of the term "concrete" here harks back to the SGML specification.
SGML is actually a metastandard for document markup languages, and in
principle permits many different ways of tagging markup. But in
describing just one "concrete" example, based on angle brackets, it
encouraged the universal adoption of such tags right through HTML and


John said:

> "Were I setting policy for Acta Crystallographica with respect to CIF2,
> I would require CIF2 submissions to be encoded in UTF-8 ... If
> IUCr wishes to be relaxed about _enforcement_ of such a policy in
> order to better serve authors, then fine, but that's a tricky
> proposition."

I have some concerns about "enforceability" - an end-user (author) may
simply not know how to comply with a requirement to supply a document
in a specified encoding. However, the IUCr Managing Editor would
accept a policy that required authors whose CIFs we had "difficulty
in reading" to use a particular tool, namely publCIF.


The "additional resource" I referred to could contain among other

a list of organisations (IUCr journals, PDB, CCDC, individual synchrotron
facilities) and their policies on accepting or outputting specific
character-set encodings;

a list of preferred encoding tags (initially just <UTF-8> and perhaps
<UTF-16>, but extended in response to requests from specific

best-practice recommendations.

I would prefer these to evolve from community discussions and
practical requirements, rather than appear to be imposed by fiat of
COMCIFS or IUCr - so maybe this should be a "cif-developers" rather
than "COMCIFS" website.


This approach tries to close off the formal specification while
allowing controlled extensions. Essentially my "additional resource"
becomes the framework for establishing protocols for conversion
between different character-set encodings and serializations.

For instance, Herbert replied to my comments on needing a pure ASCII
representation in-house:

> There is no way to make a "pure ascii version" of a general UTF-8
> file without adopting some reserved characters strings at the lexical
> level -- \U... or &#...; or somesuch as used in many other systems,
> but with such an extension, it is easy. 

That's perfectly understood, and I would expect that we (Acta) would
devise an informal scheme to allow us to do so for whatever purposes we
needed. We wouldn't expect that to be an integral part of the CIF-2
standard. On the other hand, if it became clear that other people were
having difficulty in processing UTF-8 CIFs, we could formalise what we
had done with a new encoding tag, post that on our cif-developers

   Encoding scheme       Details                    Reference
   <ASCII UNICODE-CJO>   Crystallography Journals   http://........
                         ASCII-fication of
                         Unicode characters

and serve CIFs on request with the initial header


(I understand that this is different from character-set transcoding
because it involves additional processing at the lexical level, so it
may not be appropriate to bundle the two together in the same way.
That's open to later discussion, but my point is that we're at least
setting up a system allowing the community to exchange information
about practical representation conversions, and so reduce the
likelihood of uncontrolled chaos.)
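
To make the ASCII-fication idea concrete, here is a minimal Python
sketch using the &#...; form Herbert mentions. It is not the scheme Acta
would adopt; the function names asciify and deasciify are invented for
this example, and decimal character references are just one possible
choice of reserved string.

```python
import re

# Matches decimal numeric character references such as &#197;
_REF = re.compile(r"&#(\d+);")


def asciify(text: str) -> str:
    """Replace every non-ASCII character with a decimal &#NNN; reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


def deasciify(text: str) -> str:
    """Turn decimal &#NNN; references back into the characters they name."""
    return _REF.sub(lambda m: chr(int(m.group(1))), text)
```

Note that a file which already contains the literal text &#197; cannot
be told apart from an ASCII-fied one after conversion. That is exactly
why such a scheme needs reserved strings at the lexical level, and why
it would deserve its own encoding tag rather than being passed off as
plain transcoding.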

cif2-encoding mailing list
