Discussion List Archives


Re: [ddlm-group] Multiple encodings for CIF2?

My thoughts on this:

i) If the encoding isn't specified, it could prove very difficult to determine what encoding was used for a particular CIF.
There are numerous text encodings that are impossible to identify without some sort of content header.

ii) If the encoding is specified as *any one* of a set of *identifiable* encodings (e.g. UTF-n), there is still the possibility that a CIF will be written without the appropriate identifier (e.g. a BOM).

iii) If the encoding is specified as *any one* of a set of *identifiable* encodings (e.g. UTF-n), then it is likely to be easy to switch between these encodings. So why not specify only one (e.g. UTF-8), knowing that any system that wishes to process CIFs in one of the other encodings can readily convert, and then write the file out again in the single specified encoding?

iv) If the encoding is specified as *one or more* of a set of *identifiable* encodings (e.g. UTF-8 and UTF-32 in the same CIF), it is unlikely that any basic text-handling software would perform the appropriate encoding switch. Furthermore, what is to be gained from allowing mixed encodings? If an encoding switch were to be allowed at the start of a data value, this could be useful for storing data that can't be readily stored otherwise, but if the allowed encodings are all identifiable 'text' encodings, then it should be possible to store the data in just one encoding.

v) Following on from (iv), if the encoding is specified as e.g. UTF-8 for CIF keywords, data names, delimiters, constructs... but any other encoding is allowed for data values to enable any machine-readable data to be stored in the CIF, then once again it is unlikely that a traditional text editor will cope, and the chances of data corruption are high without specialized software.
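
As a rough illustration of points (i) to (iii), here is a minimal Python sketch, not a proposal for the standard. It assumes a reader that looks only at a leading BOM; the names `sniff_bom` and `to_utf8` are hypothetical. It shows that BOM-marked UTF-n input can be rewritten as UTF-8 in a few lines, and that without a marker the same byte decodes "successfully" under several legacy encodings.

```python
import codecs

# Longer marks are tested first: the UTF-32-LE BOM begins with the same
# two bytes as the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0


def to_utf8(raw: bytes) -> bytes:
    """Transcode BOM-marked UTF-n bytes to UTF-8 without a BOM (point iii)."""
    encoding, skip = sniff_bom(raw)
    if encoding is None:
        # Point (ii): a file written without its identifier cannot be
        # classified by this approach.
        raise ValueError("no BOM: encoding cannot be identified from a marker")
    return raw[skip:].decode(encoding).encode("utf-8")


# Point (i): one byte, three different characters, no decoding error.
ambiguous = b"\xe5"
readings = {enc: ambiguous.decode(enc) for enc in ("latin-1", "cp1251", "cp437")}
```

Note that the BOM test is itself only a heuristic: a UTF-16-LE file whose first character is U+0000 would be taken for UTF-32-LE by this sketch.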

So I think CIF2 should be 'text' with a single encoding: UTF-8.
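
With a single mandated encoding, a reader's check becomes very simple. The sketch below is an assumption about how a reader might do it, not part of any CIF2 specification, and `is_valid_utf8` is a hypothetical name: it just attempts a strict UTF-8 decode and rejects the file on failure.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8, False otherwise."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# A value written as UTF-8 passes; the same value written as Latin-1 fails,
# because the lone byte 0xC5 is not a complete UTF-8 sequence.
good = "_cell_length_a 5.43 Å\n".encode("utf-8")
bad = "_cell_length_a 5.43 Å\n".encode("latin-1")
```

A successful decode does not prove that the writer intended UTF-8, but text containing non-ASCII characters in a legacy single-byte encoding will usually fail this check rather than pass by accident.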

If mixed encodings are to be allowed, then a single encoding should be specified for CIF keywords, data names, etc., i.e. everything outside the data values.
Data *values* could then default to that encoding but could also be anything else, as long as they're identifiable.



From: James Hester <jamesrhester@gmail.com>
To: ddlm-group <ddlm-group@iucr.org>
Sent: Monday, 21 June, 2010 7:55:48
Subject: [ddlm-group] Multiple encodings for CIF2?

Dear DDLm-ers,

It seems that we are destined to reopen the debate on multiple
encodings.  The common ground as far as I can tell is that all
compliant CIF2 readers/writers must be able to produce UTF-8 encoded
files (if encoding is mentioned at all), and will not be required to
read/write any other encoding.  The issues requiring discussion that I
can see are:

1. Should encoding be specified at all in the CIF2 standard?

2. If yes to (1), then we are all agreed that UTF8 must be supported
by compliant processors.  What other encodings, if any, should be
supported?  Should this support be optional?

To put my own point of view (again), I do not see a use for multiple
encodings that would justify the added complexity and the threat to
reliable file transmission over time and space.  I am therefore in
favour of saying yes to (1), and specifying UTF8.  This makes a whole
bunch of issues go away.

ddlm-group mailing list

International Union of Crystallography
