From: Nick Spadaccini <nick@csse.uwa.edu.au>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Monday, 12 October, 2009 17:14:41
Subject: Re: [ddlm-group] [THREAD 4] UTF8
On 12/10/09 11:38 PM, "James Hester" <jamesrhester@gmail.com> wrote:
> I've started a separate thread for the UTF8 discussion.
>
> John has floated the option of delinking the file encoding from the
> syntax specification, so CIF1.2 files could have either ASCII or UTF8
> encodings. I believe that this is unnecessary for the following reasons:
>
> 1. Encoding can be automatically determined: If a given CIF1.2 file
> contains any bytes with values >127 then it can/should only be UTF8.
Is it? Doesn't CBF/imgCIF or whatever have binary that is the "Hammersley"
coding algorithm?
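
The heuristic in point 1, together with the objection about embedded binary, can be illustrated with a minimal Python sketch. The function name `guess_cif_encoding` and the three return labels are hypothetical and chosen only for illustration; nothing here is part of any CIF specification. Any byte above 127 rules out ASCII, but the data only counts as UTF8 if it also decodes cleanly, which arbitrary binary content generally will not.

```python
def guess_cif_encoding(data: bytes) -> str:
    """Classify raw file bytes as 'ascii', 'utf-8', or 'unknown'.

    Hypothetical helper: the labels are illustrative, not from a CIF spec.
    """
    # Pure ASCII: every byte value is below 128.
    if all(b < 128 for b in data):
        return "ascii"
    # Some byte is > 127, so check whether the whole stream is valid UTF-8.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        # High bytes that are not valid UTF-8, e.g. embedded binary data.
        return "unknown"
    return "utf-8"
```

Under these assumptions, a file containing raw binary sections would typically land in the "unknown" case rather than being mistaken for UTF8, which is the situation the CBF/imgCIF question raises.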
> 2. The fact that CIF1.2 syntax allows UTF8 encoding does not mean that
> any given string-valued data item could be presented in UTF8:
> dictionary writers are free to restrict the character set of data
> values. Would such dictionary-based regulation give the PDB and IUCr
> sufficient control over UTF8 introduction (John/Brian/Simon?).
OK Only one-byte UTF-8 is allowed. Voila. Problem solved.
> 3. An additional UTF8 encoding magic number could complicate the
> simple magic number scheme we currently have in place.
I don't think it does. It would simplify the case for those parsers not
yet supporting UTF-8. It would tell them to terminate the process.
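
As a rough sketch of that argument: a parser that only understands the existing `#\#CIF_1.1` magic code could inspect the first line and stop as soon as it sees anything else. The names `check_magic`, `SUPPORTED_MAGIC`, and `UnsupportedCIFVersion` are hypothetical, and the string `#\#CIF_1.2_UTF8` used in the usage note is an invented placeholder, since no UTF8 magic number had been agreed in this thread.

```python
# Magic codes this (hypothetical) parser knows how to handle.
SUPPORTED_MAGIC = {b"#\\#CIF_1.1"}


class UnsupportedCIFVersion(Exception):
    """Raised when the file declares a magic code the parser does not support."""


def check_magic(first_line: bytes) -> bytes:
    """Return the magic code from the first line, or raise if unusable."""
    magic = first_line.strip()
    if not magic.startswith(b"#\\#CIF_"):
        raise ValueError("no CIF magic code found on the first line")
    if magic not in SUPPORTED_MAGIC:
        # An unknown magic code tells an older parser to stop immediately.
        raise UnsupportedCIFVersion(magic.decode("ascii", errors="replace"))
    return magic
```

With this sketch, `check_magic(b"#\\#CIF_1.1\n")` returns the magic code, while a first line such as the placeholder `#\#CIF_1.2_UTF8` raises `UnsupportedCIFVersion` before any data is read.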
cheers
Nick
--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering
The University of Western Australia t: +61 (0)8 6488 3452
35 Stirling Highway f: +61 (0)8 6488 1089
CRAWLEY, Perth, WA 6009 AUSTRALIA w3: www.csse.uwa.edu.au/~nick
MBDP M002
CRICOS Provider Code: 00126G
e: Nick.Spadaccini@uwa.edu.au
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
Re: [ddlm-group] [THREAD 4] UTF8
- To: Nick.Spadaccini@uwa.edu.au, Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] [THREAD 4] UTF8
- From: SIMON WESTRIP <simonwestrip@btinternet.com>
- Date: Mon, 12 Oct 2009 19:20:56 +0000 (GMT)
- In-Reply-To: <C6F976F1.1206C%nick@csse.uwa.edu.au>
- References: <C6F976F1.1206C%nick@csse.uwa.edu.au>
"OK Only one-byte UTF-8 is allowed. Voila. Problem solved."
Please forgive me, but for the first time in my life I think I might have to type 'lol' :-)
(Sorry if this is inappropriate - I'll try to add something constructive tomorrow.)
Cheers
Simon