Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .... .
- To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@xxxxxxxx>
- Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .... .
- From: Brian McMahon <bm@xxxxxxxx>
- Date: Thu, 16 Sep 2010 14:17:53 +0100
- In-Reply-To: <8F77913624F7524AACD2A92EAF3BFA5416659DEDC7@SJMEMXMBS11.stjude.sjcrh.local>
- References: <AANLkTinxkquC5cY0m23yzBVgm7afmYYfh6+2yMz=Hr_w@mail.gmail.com><alpine.BSF.2.00.1009100711070.59446@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA5416659DEDBD@SJMEMXMBS11.stjude.sjcrh.local><AANLkTikuoQEU-rv9GkTqqc0u0qgd1ugf+cGTfqF77j-E@mail.gmail.com><8F77913624F7524AACD2A92EAF3BFA5416659DEDC0@SJMEMXMBS11.stjude.sjcrh.local><930138.36485.qm@web87008.mail.ird.yahoo.com><alpine.BSF.2.00.1009141032080.26597@epsilon.pair.com><alpine.BSF.2.00.1009141050260.26597@epsilon.pair.com><20100915123927.GA26246@emerald.iucr.org><8F77913624F7524AACD2A92EAF3BFA5416659DEDC7@SJMEMXMBS11.stjude.sjcrh.local>
Thanks to Herbert, John and Simon for responding. I'm sorry if it seems like once again round an endless loop, but your replies have helped me to settle on the way I would like to see things move forward. For what it's worth:

*** I favour the specification *recommending* a magic string to begin a file: an optional BOM followed by the 11 characters

    #\#CIF_2.0<whitespace>

I favour the specification *recommending* that this initial comment should be extended with an indication of the character encoding where this is not ASCII. I suggest the specification's discussion of the form this will take, as well as any other comments on character-set encoding, be presented in a distinct section of the specification (Part 3, or an Annexe or Appendix).

These are recommendations, not requirements,
  1. to include existing CIF1.0 and CIF1.1 instances as valid CIF input streams (whether "decorated" or not);
  2. because you can only ever take this meta-information as well-intentioned hints.

*** I like the idea of a checksum, but I think it's premature to require any particular formulation at this revision of the specification.

*** I favour this new "Part 3" of the specification providing some general commentary on the nature of text files and transcoding issues. It should present UTF-8 as a "concrete" instantiation, and stipulate a suitable tag for incorporation in the "magic number" comment, let us say something like <UTF-8>. It should explain the importance of developers following the "recommendations", and should caution against (but not prohibit) gratuitous proliferation of encodings. It should identify an additional resource hosted on the COMCIFS web site that provides guidance to developers.

Use of the term "concrete" here harks back to the SGML specification. SGML is actually a metastandard for document markup languages, and in principle permits many different ways of tagging markup. But in describing just one "concrete" example, based on angle brackets, it encouraged the universal adoption of such tags right through HTML and XML.

*** John said:

> "Were I setting policy for Acta Crystallographica with respect to CIF2,
> I would require CIF2 submissions to be encoded in UTF-8 ... If
> IUCr wishes to be relaxed about _enforcement_ of such a policy in
> order to better serve authors, then fine, but that's a tricky
> proposition.

I have some concerns about "enforceability" - an end-user (author) may simply not know how to comply with a requirement to supply a document in a specified encoding. However, the IUCr Managing Editor would accept a policy that required authors whose CIFs we had "difficulty in reading" to use a particular tool, namely publCIF.

*** The "additional resource" I referred to could contain among other things:
  - a list of organisations (IUCr journals, PDB, CCDC, individual synchrotron facilities) and their policies on accepting or outputting specific character-set encodings;
  - a list of preferred encoding tags (initially just <UTF-8> and perhaps <UTF-16>, but extended in response to requests from specific developers);
  - best-practice recommendations.

I would prefer these to evolve from community discussions and practical requirements, rather than appear to be imposed by fiat of COMCIFS or IUCr - so maybe this should be a "cif-developers" rather than "COMCIFS" website.

*** This approach tries to close off the formal specification while allowing controlled extensions. Essentially my "additional resource" becomes the framework for establishing protocols for conversion between different character-set encodings and serializations. For instance, Herbert replied to my comments on needing a pure ASCII representation in-house:

> There is no way to make a "pure ascii version" of a general UTF-8
> file without adopting some reserved characters strings at the lexical
> level -- \U... or &#...; or somesuch as used in many other systems,
> but with such an extension, it is easy.

That's perfectly understood, and I would expect that we (Acta) would devise an informal scheme to allow us to do so for whatever purposes we needed. We wouldn't expect that to be an integral part of the CIF-2 standard.

On the other hand, if it became clear that other people were having difficulty in processing UTF-8 CIFs, we could formalise what we had done with a new encoding tag, post that on our cif-developers resource:

    Encoding scheme        Details                     Reference
    <ASCII UNICODE-CJO>    Crystallography Journals    http://........
                           ASCII-fication of
                           Unicode characters

and serve CIFs on request with the initial header

    #\#CIF_2.0 <ASCII UNICODE-CJO>

(I understand that this is different from character-set transcoding because it involves additional processing at the lexical level, so it may not be an appropriate thing to bundle these together in the same way. That's open to later discussion, but my point is that we're at least setting up a system allowing the community to exchange information about practical representation conversions, and so reduce the likelihood of uncontrolled chaos.)

Regards
Brian
_______________________________________________
cif2-encoding mailing list
cif2-encoding@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif2-encoding
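
As an illustration of the header recommended in the message above, here is a minimal Python sketch of how a reader might inspect the start of a file. It is only a sketch under stated assumptions: the function name `sniff_cif_header` is hypothetical, the BOM is assumed to be the UTF-8 one, and the `<TAG>` syntax follows the "something like <UTF-8>" suggestion in the message rather than any agreed specification. In keeping with the message, the result is treated as a hint and not as proof of the actual encoding.

```python
import codecs
import re

# The 10 visible characters of the magic string; the 11th must be whitespace.
MAGIC = b"#\\#CIF_2.0"
WHITESPACE = (b" ", b"\t", b"\r", b"\n")

# Assumed tag syntax: anything between angle brackets on the first line.
TAG_RE = re.compile(rb"<([^<>\r\n]+)>")


def sniff_cif_header(data: bytes):
    """Return (has_bom, has_magic, encoding_tag) for the start of `data`.

    All three values are hints only: a CIF1.0/1.1 file without the magic
    string is still a valid input stream under the proposal.
    """
    # Assumption: only the UTF-8 byte-order mark is recognised here.
    has_bom = data.startswith(codecs.BOM_UTF8)
    if has_bom:
        data = data[len(codecs.BOM_UTF8):]

    # Magic string must be followed by one whitespace character.
    next_byte = data[len(MAGIC):len(MAGIC) + 1]
    has_magic = data.startswith(MAGIC) and next_byte in WHITESPACE
    if not has_magic:
        return has_bom, False, None

    # Look for an optional encoding tag in the rest of the first line.
    first_line = data.split(b"\n", 1)[0].rstrip(b"\r")
    match = TAG_RE.search(first_line, len(MAGIC))
    tag = match.group(1).decode("ascii", errors="replace") if match else None
    return has_bom, True, tag
```

A caller would typically fall back to a default (ASCII or UTF-8, depending on local policy) when no tag is found, and could look up an unfamiliar tag such as `ASCII UNICODE-CJO` in whatever registry the community resource ends up providing.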
- Follow-Ups:
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. ... (James Hester)
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics..... .. . (Bollinger, John C)
- References:
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics (James Hester)
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics (Herbert J. Bernstein)
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. . (Bollinger, John C)
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. . (James Hester)
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . (Bollinger, John C)
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . (SIMON WESTRIP)
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . (Herbert J. Bernstein)
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . (Herbert J. Bernstein)
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. . (Brian McMahon)
- Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .... . (Bollinger, John C)