Re: [ddlm-group] [THREAD 4] UTF8
- To: Nick.Spadaccini@uwa.edu.au, Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] [THREAD 4] UTF8
- From: James Hester <jamesrhester@gmail.com>
- Date: Fri, 23 Oct 2009 10:13:10 +1100
- In-Reply-To: <C6F976F1.1206C%nick@csse.uwa.edu.au>
- References: <279aad2a0910120838t5f400d71wf1f237d05338c08@mail.gmail.com><C6F976F1.1206C%nick@csse.uwa.edu.au>
To continue... In what follows, by UTF8 I mean the UTF8 standard as such, and not some asciified unicode like \u1212.

First, to answer Nick's points:

I wrote:
> 1. Encoding can be automatically determined: If a given CIF1.2 file
> contains any bytes with values >127 then it can/should only be UTF8.

Nick commented:
Is it? Doesn't CBF/imgCIF or whatever have binary that is the "Hammersley" coding algorithm?

The images in imgCIF are encoded in pure ASCII. CBF encodes the image only in binary, the other parts conform to CIF standards (Herbert may want to correct me on that), so a criterion similar to the one I propose above would apply to CBF except that the detection of non-ASCII characters would need to occur during parsing, rather than possibly before.

I wrote:
> Would such dictionary-based regulation give the PDB and IUCr
> sufficient control over UTF8 introduction (John/Brian/Simon?).

Which prompted the following witticism:
OK Only one-byte UTF-8 is allowed. Voila. Problem solved.

At the risk of being terribly earnest, well yes, that was my point. If the end consumer of a dataname can only handle pure ASCII, that can be stated in the dictionary and we need not worry ourselves about taking this problem into account when setting the syntax standard. I would be interested in hearing from John, Brian or Simon regarding how satisfied they would be with a dictionary-based approach to restricting UTF8 use.

Finally, I wrote:
> 3. An additional UTF8 encoding magic number could complicate the
> simple magic number scheme we currently have in place.

Nick:
I don't think it does. It would simplify the case for those parsers not supporting yet UTF-8. It would tell them to terminate the process.
Firstly, note that for a parser to support UTF8 it simply needs to treat the UTF8 bytes in the same way as the non-syntax-special ASCII bytes, with no other effort required, so compared to the effort involved in coding bracket construct parsing, there's very little for the programmer to do in order to support UTF8. In any case, if UTF8 is not supported, the parser throws a syntax error (or applies some coercion rules) upon reaching the offending byte, which is arguably more productive than giving up without even starting - especially if there turns out to be only one-byte UTF8 (ASCII) in the file.

Secondly, I envision UTF8 support as part of the CIF2.0 "package". If you support CIF2.0, you support UTF8 and bracket constructs etc. Why make a special case for the UTF8 part of CIF2.0?

I'm still not sure I understand the nature of the encoding problems that John alludes to if the only allowed encodings are ASCII and UTF8 - especially as a UTF8-compliant parser is also an ASCII-compliant parser.

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
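As an illustration only, the detection rule discussed in this message (any byte with a value >127 means the file can only be UTF8) and the "fail at the offending byte" behaviour could be sketched in Python roughly as follows. The function names `classify_encoding` and `first_non_ascii_offset` are hypothetical and not part of any CIF software, and the sketch assumes the whole file is available as bytes with no CBF-style binary sections, which would need to be handled during parsing instead.

```python
def classify_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8' or 'invalid' for the raw bytes of a file.

    Assumption: ASCII and UTF8 are the only permitted encodings.
    """
    # No byte above 127: the file is plain ASCII (one-byte UTF8).
    if all(b < 128 for b in data):
        return "ascii"
    # At least one byte above 127: the file must be well-formed UTF8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid"
    return "utf-8"


def first_non_ascii_offset(data: bytes) -> int:
    """Offset of the first byte > 127, or -1 if the data is pure ASCII.

    An ASCII-only parser could report a syntax error at this offset
    instead of refusing to start on the file at all.
    """
    for offset, b in enumerate(data):
        if b > 127:
            return offset
    return -1
```

A pure-ASCII file passes both checks unchanged, which is the point made above: a parser that only gives up when it actually meets a byte >127 still handles files that declare UTF8 but happen to contain only ASCII.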
- Follow-Ups:
- Re: [ddlm-group] [THREAD 4] UTF8 (Herbert J. Bernstein)
- References:
- [ddlm-group] [THREAD 4] UTF8 (James Hester)
- Re: [ddlm-group] [THREAD 4] UTF8 (Nick Spadaccini)