Re: [ddlm-group] [THREAD 4] UTF8
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] [THREAD 4] UTF8
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Thu, 22 Oct 2009 21:53:30 -0400 (EDT)
- Cc: Nick.Spadaccini@uwa.edu.au
- In-Reply-To: <279aad2a0910221613m2a2a7891k4ae23476e50f98e4@mail.gmail.com>
- References: <279aad2a0910120838t5f400d71wf1f237d05338c08@mail.gmail.com><C6F976F1.1206C%nick@csse.uwa.edu.au><279aad2a0910221613m2a2a7891k4ae23476e50f98e4@mail.gmail.com>
Dear Colleagues,

How about the following:

Any CIF dataset with no information about encoding is presumed to be
in UTF-8 encoding, but CIF writers using UTF-8 should include one of
the recommended UTF-8 identifiers.

All CIF-2 parsers are required to handle UTF-8, and may reject other
encodings.

Systems that are handling CIFs in encodings other than UTF-8 are
required to include one of the recommended encoding identifiers to
clearly identify the encoding they are using, and must _not_ use a
UTF-8 identifier if the encoding is something other than UTF-8.

Regards,
Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Fri, 23 Oct 2009, James Hester wrote:

> To continue...
>
> In what follows, by UTF8 I mean the UTF8 standard as such, and not
> some asciified unicode like \u1212.
>
> First, to answer Nick's points:
>
> I wrote:
>> 1. Encoding can be automatically determined: If a given CIF1.2 file
>> contains any bytes with values >127 then it can/should only be UTF8.
>
> Nick commented:
> Is it? Doesn't CBF/imgCIF or whatever have binary that is the "Hammersley"
> coding algorithm?
>
> The images in imgCIF are encoded in pure ASCII. CBF encodes the image
> only in binary, the other parts conform to CIF standards (Herbert may
> want to correct me on that), so a criterion similar to the one I
> propose above would apply to CBF except that the detection of
> non-ASCII characters would need to occur during parsing, rather than
> possibly before.
>
> I wrote:
>> Would such dictionary-based regulation give the PDB and IUCr
>> sufficient control over UTF8 introduction (John/Brian/Simon?).
>
> Which prompted the following witticism:
> OK Only one-byte UTF-8 is allowed. Voila. Problem solved.
>
> At the risk of being terribly earnest, well yes, that was my point.
> If the end consumer of a dataname can only handle pure ASCII, that can
> be stated in the dictionary and we need not worry ourselves about
> taking this problem into account when setting the syntax standard. I
> would be interested in hearing from John, Brian or Simon regarding how
> satisfied they would be with a dictionary-based approach to
> restricting UTF8 use.
>
> Finally, I wrote:
>> 3. An additional UTF8 encoding magic number could complicate the
>> simple magic number scheme we currently have in place.
>
> Nick:
> I don't think it does. It would simplify the case for those parsers not
> supporting yet UTF-8. It would tell them to terminate the process.
>
> Firstly, note that for a parser to support UTF8 it simply needs to
> treat the UTF8 bytes in the same way as the non-syntax-special ASCII
> bytes, with no other effort required, so compared to the effort
> involved in coding bracket construct parsing, there's very little for
> the programmer to do in order to support UTF8. In any case, if UTF8
> is not supported, the parser throws a syntax error (or applies some
> coercion rules) upon reaching the offending byte, which is arguably
> more productive than giving up without even starting - especially if
> there turns out to be only one-byte UTF8 (ASCII) in the file.
>
> Secondly, I envision UTF8 support as part of the CIF2.0 "package". If
> you support CIF2.0, you support UTF8 and bracket constructs etc. Why
> make a special case for the UTF8 part of CIF2.0? I'm still not sure I
> understand the nature of the encoding problems that John alludes to if
> the only allowed encodings are ASCII and UTF8 - especially as a
> UTF8-compliant parser is also an ASCII-compliant parser.
>
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
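
The encoding rules proposed at the top of this message, together with
James's byte-value criterion, can be sketched roughly as follows. This
is only an illustration under stated assumptions, not anything the
group has agreed on: the function names classify_bytes and decode_cif
are hypothetical, and because the thread does not say what the
recommended encoding identifiers look like, the declared encoding is
passed in as a plain argument instead of being parsed from the file.

```python
from typing import Optional


def classify_bytes(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' for a raw byte string.

    Hypothetical helper: bytes all below 128 are plain ASCII (which is
    also valid one-byte UTF-8); otherwise the content must decode as
    UTF-8 to count as UTF-8.
    """
    if all(b < 128 for b in data):
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "other"
    return "utf-8"


def decode_cif(data: bytes, declared: Optional[str] = None) -> str:
    """Hypothetical reader policy following the proposal above.

    declared is the encoding named by an identifier in the file, or
    None when the file carries no encoding information. How such an
    identifier is written is not specified here (assumption).
    """
    # No encoding information: presume UTF-8.
    encoding = "utf-8" if declared is None else declared.lower()
    # A parser must handle UTF-8 and may reject anything else.
    if encoding not in ("utf-8", "utf8"):
        raise ValueError(f"unsupported encoding: {declared}")
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # The file is declared or presumed to be UTF-8 but is not.
        raise ValueError("content is not valid UTF-8") from exc
```

Here a pure-ASCII file passes through decode_cif unchanged, which
matches the remark that a UTF8-compliant parser is also an
ASCII-compliant parser, while a file in some other 8-bit encoding
without a matching identifier is rejected instead of being silently
misread.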
- Follow-Ups:
- Re: [ddlm-group] [THREAD 4] UTF8 (James Hester)
- References:
- [ddlm-group] [THREAD 4] UTF8 (James Hester)
- Re: [ddlm-group] [THREAD 4] UTF8 (Nick Spadaccini)
- Re: [ddlm-group] [THREAD 4] UTF8 (James Hester)