
Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. ...

I haven't commented directly on this proposal, so here are my comments.

1. In spirit, this proposal reads as a supplement to Herbert's
proposal to do things as for CIF1.  This is reinforced by the dropping
of the mandatory CIF2 header.

2. I am not overly concerned about not requiring the CIF2 header.
A reader can simply try to parse a CIF1/2 file using both syntaxes,
and I can't imagine any pathological case that would give differing
parse results in the case that both CIF2 and CIF1 syntaxes gave
successful parses.  This outcome is largely because we have retained
whitespace separation between tokens, something we hadn't yet decided
when we decided on the CIF2 header.
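
To make the optional-header case concrete, here is a rough sketch of how
a reader might sniff the start of a file before choosing a parser.  It
uses the magic string from Brian's message quoted below (an optional BOM,
then #\#CIF_2.0 and whitespace).  The function name, the return shape and
the angle-bracket tag pattern are my own assumptions for illustration,
and it only handles ASCII-compatible encodings with a UTF-8 BOM.

```python
import codecs
import re

# Magic string from the quoted proposal: "#\#CIF_2.0" followed by whitespace.
MAGIC = b"#\\#CIF_2.0"
# Assumed tag syntax: an angle-bracketed name such as <UTF-8> on the first line.
TAG_RE = re.compile(rb"<([A-Za-z0-9._ -]+)>")


def sniff_header(data: bytes):
    """Return (has_bom, is_cif2, encoding_tag) for the start of a byte stream."""
    has_bom = data.startswith(codecs.BOM_UTF8)
    body = data[len(codecs.BOM_UTF8):] if has_bom else data
    # Only the first line can carry the magic string and the optional tag.
    first_line = re.split(rb"\r\n|\r|\n", body, maxsplit=1)[0]
    if not first_line.startswith(MAGIC):
        return has_bom, False, None
    rest = first_line[len(MAGIC):]
    # The magic string must be followed by whitespace (or the end of the line).
    if rest and rest[:1] not in (b" ", b"\t"):
        return has_bom, False, None
    match = TAG_RE.search(rest)
    tag = match.group(1).decode("ascii") if match else None
    return has_bom, True, tag
```

A file without the header is not rejected here; the caller would just
fall back to trying the CIF1 syntax, and the tag is at best a hint.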

3. This proposal does not a priori do anything to resolve the concrete
problem faced by programmers as to what encodings to expect in input
CIF files.  With the exception of UTF8 and UTF16, they must rely on
unreliable external information.

4. There is similarly no guarantee that the encoding tag corresponds
to the real encoding, and the only way to confirm the tag is to run it
past the author again.  While this is easy for a publisher accepting
manuscripts to do, it is not a generally applicable approach.  Have I
misunderstood something, and we are in fact simply producing a
standard for publishing use?

5. I believe the Chester office's potential problems can be solved by
applying the simple insight that they have access to the original
authors: therefore, when a file that is neither UTF8 nor UTF16 is
received, they can either try heuristic autodetection with feedback
from the author,
or simply request the author to sort it out, with a link to a helpful
webpage.  I therefore do not believe that initial inability to produce
UTF8/16 on the part of some authors is sufficient cause to reject the
UTF8/16 (+local?) proposal.
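
The first step of such a workflow, recognising UTF8/16 and flagging
everything else for follow-up with the author, could look roughly like
the sketch below.  This is only a heuristic under my own assumptions:
the function name is hypothetical, UTF-16 is recognised only when a BOM
is present, UTF-32 is not considered, and a None result simply means
"ask the author".

```python
import codecs


def guess_encoding(data: bytes):
    """Return 'utf-8', 'utf-16', or None when the author has to be asked."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    # NUL bytes are valid UTF-8 but very unlikely in a text CIF; they
    # usually indicate BOM-less UTF-16 or binary content, so do not guess.
    if b"\x00" in data:
        return None
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
    # Pure ASCII also ends up here, since ASCII is a subset of UTF-8.
    return "utf-8"
```

Note that a successful strict UTF-8 decode is strong evidence but not
proof, which is why the author remains the final arbiter.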

6. Regarding a list of preferred encodings in the annexe: each time
this list is expanded, all current CIF software becomes unable to read
and write files in at least one of the preferred encodings.  With
time, this list will be of little use, as there is no guarantee that
an encoding that is on the list is one that a given software package
will understand.  Therefore, this list has to be essentially fixed
over a reasonably long timescale (say 5 years) and additions would
ideally correspond with a bump of the standard version number. You
might just as well specify a fixed set in the standard itself.
Dividing encoding tags according to journals creates even more
confusion and balkanisation.

7. "Reducing the likelihood of uncontrolled chaos".  In my opinion,
this likelihood is lowest with a fixed set of encodings.  Any support
for open-slather encoding in the standard will increase the likelihood
of chaos.  If we end up accepting Brian and Herbert's proposals, of
course each CIF-handling entity will be forced to attempt encoding
management in ways that Brian suggests, and I'll wager that there will
be more headaches for Chester and everyone else dealing with it than
if they adopted strategies as described in point (5).  Unless the
community spontaneously adopts UTF8.

On Thu, Sep 16, 2010 at 11:17 PM, Brian McMahon <bm@iucr.org> wrote:
> Thanks to Herbert, John and Simon for responding. I'm sorry if it
> seems like once again round an endless loop, but your replies have
> helped me to settle on the way I would like to see things move
> forward. For what it's worth:
> ***
> I favour the specification *recommending* a magic string to begin a
> file: an optional BOM followed by the 11 characters
> #\#CIF_2.0<whitespace>
> I favour the specification *recommending* that this initial comment
> should be extended with an indication of the character
> encoding where this is not ASCII. I suggest the specification's
> discussion of the form this will take, as well as any other comments
> on character-set encoding, be presented in a distinct section of the
> specification (Part 3, or an Annexe or Appendix).
> These are recommendations, not requirements,
> 1. to include existing CIF1.0 and CIF1.1 instances as valid CIF input
> streams (whether "decorated" or not);
> 2. because you can only ever take this meta-information as
> well-intentioned hints.
> ***
> I like the idea of a checksum, but I think it's premature to require
> any particular formulation at this revision of the specification.
> ***
> I favour this new "Part 3" of the specification providing some general
> commentary on the nature of text files and transcoding issues.  It
> should present UTF-8 as a "concrete" instantiation, and stipulate a
> suitable tag for incorporation in the "magic number" comment, let us
> say something like <UTF-8>. It should explain the importance of
> developers following the "recommendations", and should caution against
> (but not prohibit) gratuitous proliferation of encodings. It should
> identify an additional resource hosted on the COMCIFS web site that
> provides guidance to developers.
> Use of the term "concrete" here harks back to the SGML specification.
> SGML is actually a metastandard for document markup languages, and in
> principle permits many different ways of tagging markup. But in
> describing just one "concrete" example, based on angle brackets, it
> encouraged the universal adoption of such tags right through HTML and
> XML.
> ***
> John said:
>> "Were I setting policy for Acta Crystallographica with respect to CIF2,
>> I would require CIF2 submissions to be encoded in UTF-8 ... If
>> IUCr wishes to be relaxed about _enforcement_ of such a policy in
>> order to better serve authors, then fine, but that's a tricky
>> proposition."
> I have some concerns about "enforceability" - an end-user (author) may
> simply not know how to comply with a requirement to supply a document
> in a specified encoding. However, the IUCr Managing Editor would
> accept a policy that required authors whose CIFs we had "difficulty
> in reading" to use a particular tool, namely publCIF.
> ***
> The "additional resource" I referred to could contain among other
> things:
> a list of organisations (IUCr journals, PDB, CCDC, individual synchrotron
> facilities) and their policies on accepting or outputting specific
> character-set encodings;
> a list of preferred encoding tags (initially just <UTF-8> and perhaps
> <UTF-16>, but extended in response to requests from specific
> developers);
> best-practice recommendations.
> I would prefer these to evolve from community discussions and
> practical requirements, rather than appear to be imposed by fiat of
> COMCIFS or IUCr - so maybe this should be a "cif-developers" rather
> than "COMCIFS" website.
> ***
> This approach tries to close off the formal specification while
> allowing controlled extensions. Essentially my "additional resource"
> becomes the framework for establishing protocols for conversion
> between different character-set encodings and serializations.
> For instance, Herbert replied to my comments on needing a pure ASCII
> representation in-house:
>> There is no way to make a "pure ascii version" of a general UTF-8
>> file without adopting some reserved character strings at the lexical
>> level -- \U... or &#...; or somesuch as used in many other systems,
>> but with such an extension, it is easy.
> That's perfectly understood, and I would expect that we (Acta) would
> devise an informal scheme to allow us to do so for whatever purposes we
> needed. We wouldn't expect that to be an integral part of the CIF-2
> standard. On the other hand, if it became clear that other people were
> having difficulty in processing UTF-8 CIFs, we could formalise what we
> had done with a new encoding tag, post that on our cif-developers
> resource:
>   Encoding scheme       Details                    Reference
>   <ASCII UNICODE-CJO>   Crystallography Journals   http://........
>                         ASCII-fication of
>                         Unicode characters
> and serve CIFs on request with the initial header
> (I understand that this is different from character-set transcoding
> because it involves additional processing at the lexical level, so it
> may not be an appropriate thing to bundle these together in the same
> way. That's open to later discussion, but my point is that we're
> at least setting up a system allowing the community to exchange
> information about practical representation conversions, and so reduce
> the likelihood of uncontrolled chaos.)
> Regards
> Brian
> _______________________________________________
> cif2-encoding mailing list
> cif2-encoding@iucr.org
> http://scripts.iucr.org/mailman/listinfo/cif2-encoding

T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148