Discussion List Archives

Re: [ddlm-group] [THREAD 4] UTF8

To continue...

In what follows, by UTF8 I mean the UTF8 standard as such, and not
some asciified unicode like \u1212.

First, to answer Nick's points:

I wrote:
> 1. Encoding can be automatically determined: If a given CIF1.2 file
> contains any bytes with values >127 then it can/should only be UTF8.

Nick commented:
     Is it? Doesn't CBF/imgCIF or whatever have binary that is the "Hammersley"
     coding algorithm?

The images in imgCIF are encoded in pure ASCII.  CBF encodes only the
image in binary; the other parts conform to CIF standards (Herbert may
want to correct me on that), so a criterion similar to the one I
propose above would apply to CBF, except that the detection of
non-ASCII characters would need to occur during parsing, rather than
possibly before.
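
As an illustration of that criterion, here is a minimal Python sketch,
assuming the whole file is available as bytes before parsing starts.
The function name guess_encoding is made up for this example and is
not part of any CIF library.

```python
def guess_encoding(data: bytes) -> str:
    """Classify raw file bytes as 'ascii', 'utf-8' or 'invalid'.

    Hypothetical helper: any byte value > 127 means the file can only
    be UTF8, and bytes that do not decode as UTF8 are rejected.
    """
    if all(b < 128 for b in data):
        return "ascii"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid"
    return "utf-8"
```

For CBF the same check would have to be applied piecewise during
parsing, skipping the binary sections, rather than to the whole file
up front.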

I wrote:
> Would such dictionary-based regulation give the PDB and IUCr
> sufficient control over UTF8 introduction (John/Brian/Simon?).

Which prompted the following witticism:
     OK Only one-byte UTF-8 is allowed. Voila. Problem solved.

At the risk of being terribly earnest, well yes, that was my point.
If the end consumer of a dataname can only handle pure ASCII, that can
be stated in the dictionary and we need not worry ourselves about
taking this problem into account when setting the syntax standard.  I
would be interested in hearing from John, Brian or Simon regarding how
satisfied they would be with a dictionary-based approach to
restricting UTF8 use.
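
A rough sketch of what such a dictionary-based restriction could look
like in Python.  Everything here is hypothetical: the datanames, the
ASCII_ONLY table and the check_value function are invented for
illustration and do not come from any existing dictionary or DDL.

```python
# Hypothetical dictionary fragment: dataname -> True if the value must
# be pure ASCII.  These datanames are invented for this example.
ASCII_ONLY = {
    "_example.ascii_item": True,
    "_example.text_item": False,
}


def check_value(dataname: str, value: str) -> bool:
    """Return True if value meets the dictionary restriction for dataname.

    Datanames absent from the table are assumed to be unrestricted.
    """
    if ASCII_ONLY.get(dataname, False):
        return value.isascii()
    return True
```

The syntax layer accepts any UTF8 text; the restriction is applied
only at validation time, per dataname.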

Finally, I wrote:
> 3. An additional UTF8 encoding magic number could complicate the
> simple magic number scheme we currently have in place.

The response was:
  I don't think it does. It would simplify the case for those parsers not
  supporting yet UTF-8. It would tell them to terminate the process.

Firstly, note that for a parser to support UTF8 it simply needs to
treat the UTF8 bytes in the same way as the non-syntax-special ASCII
bytes, with no other effort required, so compared to the effort
involved in coding bracket construct parsing, there's very little for
the programmer to do in order to support UTF8.  In any case, if UTF8
is not supported, the parser throws a syntax error (or applies some
coercion rules) upon reaching the offending byte, which is arguably
more productive than giving up without even starting - especially if
there turns out to be only one-byte UTF8 (ASCII) in the file.
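
Both behaviours can be sketched in a few lines of Python.  This is a
deliberately simplified stand-in for a CIF lexer: it splits on ASCII
whitespace only and ignores quoting and bracket constructs.  The name
split_tokens and the allow_utf8 flag are assumptions made for this
example.  It relies on the fact that every byte of a multi-byte UTF8
sequence is >= 0x80, so no syntax-special ASCII byte can occur inside
one.

```python
WHITESPACE = b" \t\r\n"


def split_tokens(data: bytes, allow_utf8: bool = True) -> list:
    """Split data into whitespace-separated tokens.

    Bytes >= 0x80 are treated like any other non-special byte.  With
    allow_utf8=False the function mimics a parser without UTF8 support
    and raises at the first offending byte.
    """
    tokens = []
    current = bytearray()
    for pos, byte in enumerate(data):
        if byte >= 0x80 and not allow_utf8:
            raise SyntaxError(f"non-ASCII byte 0x{byte:02x} at offset {pos}")
        if byte in WHITESPACE:
            # A separator ends the token collected so far, if any.
            if current:
                tokens.append(bytes(current))
                current.clear()
        else:
            current.append(byte)
    if current:
        tokens.append(bytes(current))
    return tokens
```

A file containing only one-byte UTF8 passes through unchanged with
allow_utf8=False, which is the case a magic-number rejection would
refuse without even starting.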

Secondly, I envision UTF8 support as part of the CIF2.0 "package".  If
you support CIF2.0, you support UTF8 and bracket constructs etc.  Why
make a special case for the UTF8 part of CIF2.0?  I'm still not sure I
understand the nature of the encoding problems that John alludes to if
the only allowed encodings are ASCII and UTF8 - especially as a
UTF8-compliant parser is also an ASCII-compliant parser.
