Re: [ddlm-group] Multiple encodings for CIF2?
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] Multiple encodings for CIF2?
- From: SIMON WESTRIP <simonwestrip@btinternet.com>
- Date: Mon, 21 Jun 2010 09:49:56 +0000 (GMT)
- In-Reply-To: <AANLkTikNaeBm0mVcM0XMvkueESVS6i6mvCUCJSrmK2rR@mail.gmail.com>
- References: <AANLkTikNaeBm0mVcM0XMvkueESVS6i6mvCUCJSrmK2rR@mail.gmail.com>
My thoughts on this:
i) If the encoding isn't specified, it could prove very difficult to determine what encoding was used for a particular CIF.
There are numerous text encodings that are impossible to identify without some sort of content header.
ii) If the encoding is specified as *any one* of a set of *identifiable* encodings (e.g. UTF-n), there is still the possibility that a CIF will be written without the appropriate identifier (e.g. BOM).
iii) If the encoding is specified as *any one* of a set of *identifiable* encodings (e.g. UTF-n), then it is likely that it is easy to switch between these encodings, so why not specify only one (e.g. UTF-8) knowing that any system that wishes to process CIFs in one of the other encodings can readily write it out again in the original encoding?
iv) If the encoding is specified as *one or more* of a set of *identifiable* encodings (e.g. UTF-8 and UTF-32 in the same CIF), it is unlikely that any basic text-handling software would perform the appropriate encoding switch. Furthermore, what is to be gained from allowing mixed encodings? If an encoding switch were to be allowed at the start of a data value, this could be useful for storing data that can't be readily stored otherwise, but if the allowed encodings are all identifiable 'text' encodings, then it should be possible to store the data in just one encoding.
v) Following on from (iv), if the encoding is specified as e.g. UTF-8 for CIF keywords, data names, delimiters, constructs... but any other encoding is allowed for data values to enable any machine-readable data to be stored in the CIF, then once again it is unlikely that a traditional text editor will cope, and the chances of data corruption are high without specialized software.
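As an illustration of points (i) to (iii), here is a minimal Python sketch of BOM-based identification followed by conversion to UTF-8. The function names `sniff_bom` and `to_utf8` are hypothetical, and treating BOM-less input as UTF-8 is an assumption made for the example, not something the CIF2 discussion has settled. It only shows that the UTF-n family can be told apart when a BOM is present and that converting between them is cheap; a file with no BOM in some other encoding is either rejected or, if its bytes happen to be valid UTF-8, silently misread.

```python
import codecs

# BOMs are checked longest-first: the UTF-32-LE BOM starts with the
# same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(raw: bytes):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0


def to_utf8(raw: bytes) -> bytes:
    """Re-encode BOM-marked UTF-n bytes as UTF-8 without a BOM.

    Assumption for this sketch: input without a BOM is treated as UTF-8
    and decoded strictly, so invalid byte sequences raise
    UnicodeDecodeError rather than being guessed at.
    """
    encoding, skip = sniff_bom(raw)
    text = raw[skip:].decode(encoding or "utf-8")
    return text.encode("utf-8")
```

A system that prefers to work in, say, UTF-16 could call `to_utf8` on output and the reverse conversion on input, which is the round trip that point (iii) relies on.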
So I think CIF2 should be 'text' with a single encoding: UTF-8.
If mixed encodings are to be allowed, then a single encoding should be specified for CIF keywords, data names, etc., i.e. everything outside the data values.
Data *values* could then default to that encoding but could also be anything else - as long as they're identifiable?
Cheers
Simon
From: James Hester <jamesrhester@gmail.com>
To: ddlm-group <ddlm-group@iucr.org>
Sent: Monday, 21 June, 2010 7:55:48
Subject: [ddlm-group] Multiple encodings for CIF2?
Dear DDLm-ers,
It seems that we are destined to reopen the debate on multiple
encodings. The common ground as far as I can tell is that all
compliant CIF2 readers/writers must be able to produce UTF-8 encoded
files (if encoding is mentioned at all), and will not be required to
read/write any other encoding. The issues requiring discussion that I
can see are:
1. Should encoding be specified at all in the CIF2 standard?
2. If yes to (1), then we are all agreed that UTF-8 must be supported
by compliant processors. What other encodings, if any, should be
supported? Should this support be optional?
To put my own point of view (again), I do not see a use for multiple
encodings that would justify the added complexity and the threat to
reliable file transmission over time and space. I am therefore in
favour of saying yes to (1), and specifying UTF-8. This makes a whole
bunch of issues go away.
--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group