Re: [ddlm-group] [THREAD 4] UTF8
- To: Nick.Spadaccini@uwa.edu.au, Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] [THREAD 4] UTF8
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Thu, 29 Oct 2009 07:59:38 -0400 (EDT)
- In-Reply-To: <C70E10D8.1221C%nick@csse.uwa.edu.au>
- References: <C70E10D8.1221C%nick@csse.uwa.edu.au>
Dear Nick,

Unlike the rest of the group, I am not much for mandatory standards, so I expect that CBF will muddle along with a mixture of UTF-8 and UCS-2 and its pure binary forms as well, with only the UTF-8 versions being officially a CIF. Inasmuch as the UCS-2 versions will not be officially CIFs, there is no harm in them having BOMs (which are an essential part of UCS-2 files) as well as the readable encoding comment as the second line, which is what I suspect people will actually do.

In 5 or 10 years, when there has been some experience with this and the world as a whole has decided what to do with the C world of UTF-8 versus the Java world of UCS-2, we can revisit the issue. For the moment, I would suggest we turn a blind eye to the non-CIF, but CIF-like, variants of imgCIF files, just as we turn a blind eye to the not-exactly-CIF practices at the IUCr and the PDB, so people can get their work done.

Bottom line: A CIF is in UTF-8 and does not have a BOM or encoding comment, and something with a BOM or encoding comment (or both) may be useful and may have a transliteration into a CIF, but is not itself a CIF, and should use a file extension other than ".cif", such as ".cbf".

As in the past, the cbf files will continue to have multiple alternate representations of the same information.

Regards,
Herbert

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya@dowling.edu
=====================================================

On Wed, 28 Oct 2009, Nick Spadaccini wrote:

> OK. Well I think this discussion has died its death. One of the benefits (to
> Herb mostly) is that with UCS-2 encoding, the CBF comes into the imgCIF
> fold. Herb's group has a binary->UCS-2 encoding with a 7% added footprint so
> it is efficient.
> At the moment (or at least shortly) the alternative is to
> go binary->BASE64 which has a 33% added footprint. I guess binary->UTF-8
> would have the same footprint. By being able to switch encodings Herb can
> achieve a single file for his work. The price is that we have to adopt a
> switching encoding option not attractive to most people here apparently.
>
> It seems the consensus is that UTF-8 is the encoding standard and everything
> else is a broken file. Notwithstanding that any application may parse it,
> attempt to determine the encoding and produce a corrected file. That is
> nice, but not a required part of the standard (which is the standard, to
> quote David, quoting me).
>
> It seems to me Herb, the easiest thing you can do with regard to CBF (since
> it can't be a real CIF) is to adopt everything from CIF-2 except that CBF
> must be encoded as UCS-2. That may give you a bigger footprint per ASCII
> character, but since these are a small number of data values in a CBF it may
> well be worth the price. That way at least the binary part is efficiently
> encoded.
>
> If in the DDLm we manage to flag order dependency in the loop_ header (along
> with a flag to indicate row ordering is important in the loop_ - something
> John W wanted) then CBF will essentially be CIF-2 except that the encoding
> is UCS-2.
>
>
> On 26/10/09 10:43 PM, "David Brown" <idbrown@mcmaster.ca> wrote:
>
>> James has asked for the views of those of us who have just been watching this
>> discussion (it seems to go by faster than I can follow).
>>
>> For what it is worth, my stance is strongly the same as James'. He has laid
>> out all the arguments very succinctly - I will just briefly reinforce those I
>> think most important.
>>
>> 1. Permitting one or more other encodings (presumably these must constitute a
>> well defined list as their names must be recognized) immediately invites
>> people to use them. To quote Nick, if we have a standard it should be a
>> standard.
>> If people decide to bend the standard they do so at their own
>> peril.
>>
>> 2. If other encodings are deprecated (yes, that word again) in order to
>> encourage people to use the real standard, they are likely to be unaware that
>> their program has used a deprecated encoding and therefore has failed to
>> identify it. If the writer of the CIF knows that it is using a deprecated
>> standard it can make the conversion. Only if it does not know it is using a
>> deprecated standard and therefore neither converts nor identifies the encoding
>> will the deprecated encoding get through.
>>
>> 3. Providing a space (at this stage) for identifying encodings that may (or
>> may not) later become part of the standard is unnecessary since it can be
>> added if and when such other encodings are allowed. UTF-8 then becomes the
>> default.
>>
>> 4. Having a single standard requires that the readers need only consider one
>> encoding and the writers need only support a conversion from the native to the
>> CIF2 standard. Allowing 5 or 10 other encodings makes life easier for the
>> writer since it does not have to provide any conversion, but there is a price.
>> Every reader must be able to read 5 or 10 different encodings because it is
>> not allowed to reject any of the deprecated standards. With a single
>> standard the IUCr may or may not decide that they will handle different
>> encodings, but that is their choice. Making different encodings legal removes
>> that choice from the reader: it has to handle all possibilities, which is only
>> likely to discourage people writing local programs for occasional use.
>>
>> David
>>
>>
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
> cheers
>
> Nick
>
> --------------------------------
> Associate Professor N. Spadaccini, PhD
> School of Computer Science & Software Engineering
>
> The University of Western Australia    t: +61 (0)8 6488 3452
> 35 Stirling Highway                    f: +61 (0)8 6488 1089
> CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
> MBDP  M002
>
> CRICOS Provider Code: 00126G
>
> e: Nick.Spadaccini@uwa.edu.au
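
The footprint figures and the BOM rule discussed in this exchange can be checked with a few lines of Python. What follows is only a minimal sketch under stated assumptions, not code from any CIF or CBF library: the helper names (added_footprint, has_bom), the stand-in binary payload and the sample text are all made up for illustration. It confirms the 33% added footprint of BASE64 and the doubling of ASCII text under UCS-2, and tests for the byte order marks that the "bottom line" above excludes from a CIF. It does not attempt to reproduce the 7% binary->UCS-2 figure, since that encoding is not described here.

```python
import base64
import codecs


def added_footprint(raw: bytes, encoded: bytes) -> float:
    """Fractional size increase of `encoded` relative to `raw`."""
    return len(encoded) / len(raw) - 1.0


def has_bom(data: bytes) -> bool:
    """True if `data` starts with a UTF-8 or UTF-16/UCS-2 byte order mark."""
    boms = (codecs.BOM_UTF8, codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
    return data.startswith(boms)


# Hypothetical stand-in for a binary section: 3072 arbitrary bytes.
payload = bytes(range(256)) * 12

# BASE64 maps every 3 input bytes to 4 output characters.
base64_overhead = added_footprint(payload, base64.b64encode(payload))

# Hypothetical ASCII-only sample text. For text confined to the Basic
# Multilingual Plane, UTF-16 without a BOM is byte-identical to UCS-2.
ascii_text = "data_example\n_cell.length_a 10.0\n"
utf8 = ascii_text.encode("utf-8")
ucs2 = ascii_text.encode("utf-16-le")
ascii_overhead = added_footprint(utf8, ucs2)

print(f"BASE64 added footprint: {base64_overhead:.1%}")
print(f"UCS-2 added footprint for ASCII text: {ascii_overhead:.0%}")
print("UTF-8 bytes carry a BOM:", has_bom(utf8))
print("UTF-16 bytes with BOM detected:", has_bom(ascii_text.encode("utf-16")))
```

A BOM test of this kind covers only half of the rule stated above; spotting an encoding comment would need a look at the first lines of the decoded text, which is left out of the sketch.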