Re: [ddlm-group] CIF-2 changes
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] CIF-2 changes
- From: Nick Spadaccini <nick@csse.uwa.edu.au>
- Date: Tue, 17 Nov 2009 14:29:17 +0800
- In-Reply-To: <4B017813.6020302@mcmaster.ca>
So now I return to the STAR syntax. DDLm is part of STAR, and hence restrictions on data names (so that they can be parsed, etc.) are a STAR issue. I have been brought around to Joe’s idea that STAR accepts any 8-bit character sequence, since that is the most complete set – and that this will be restricted to UTF-8 within the CIF specification. Any other adopter of STAR can choose whatever restricted encoding they wish.
I still need to treat data names as programming identifiers within dREL, so I propose we restrict the data names in STAR (and all variants) to the ASCII characters [A-Za-z0-9_.], as we have used in the sample dictionaries, DDLm and dREL.
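As a rough illustration of the proposed restriction, a check of this kind could be written in Python. This is only a sketch: the function name `is_valid_data_name` is hypothetical, and the requirement of a leading underscore followed by at least one character is my assumption about the token shape, not something settled above.

```python
import re

# Assumption: a data name is a leading underscore followed by one or more
# characters drawn from the proposed ASCII set [A-Za-z0-9_.].
DATA_NAME = re.compile(r"_[A-Za-z0-9_.]+")


def is_valid_data_name(token: str) -> bool:
    """Return True if the whole token fits the proposed data-name pattern."""
    return DATA_NAME.fullmatch(token) is not None


if __name__ == "__main__":
    for candidate in ("_atom_site.label", "_name[1]", "_refln/index"):
        print(candidate, is_valid_data_name(candidate))
```

Names containing brackets or a solidus fail the check, which is what lets them serve as CIF1 indicators later on.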
The data values will be represented as discussed in previous threads, and the reverse solidus and the token delimiters discussed there will be ASCII characters. We can now return to [] as the list delimiters, and {} as the associative array delimiters.
Backward compatibility with CIF1 names is handled by exploiting the _alias attributes in the definition. A CIF2 parser with a dictionary can handle everything. Any CIF1 parser can handle CIF1 data files (and also CIF2 data files up to a point, but it won’t know what the data names mean – unless it has them hardcoded).
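The alias mechanism might be sketched as a simple lookup table built from the dictionary. Everything here is illustrative: the `ALIASES` table and the function `canonical_name` are hypothetical, the single entry is only an example of the kind of mapping a dictionary could supply, and folding names to lower case before lookup is an assumption of this sketch.

```python
# Hypothetical alias table, as might be collected from the _alias attributes
# of the definitions in a dictionary: old CIF1 name -> current data name.
ALIASES = {
    "_atom_site_label": "_atom_site.label",  # illustrative entry only
}


def canonical_name(name: str, aliases: dict) -> str:
    """Map a (possibly CIF1) data name to its current name via the alias table.

    Assumption: names are compared case-insensitively, so the lookup key is
    lower-cased. Names with no alias are returned unchanged.
    """
    return aliases.get(name.lower(), name)


if __name__ == "__main__":
    print(canonical_name("_Atom_Site_Label", ALIASES))
    print(canonical_name("_cell.length_a", ALIASES))
```

A parser without the dictionary has no such table, which is why it can read the file but cannot tell what the names mean.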
A CIF2 parser would like a leading comment to tell it what sort of file it is parsing. In the absence of that comment, a pre-scan will need to be done. The telltale indicators that it is a CIF1 data file are multiple occurrences of:
(1) data names that potentially contain [] or /
(2) unquoted strings with illegal characters
(3) quoted strings that result in parse failure (typically because they must have an embedded [but not elided] quote character as allowed in CIF1).
It needs to be a pre-scan because all 3 of the above in an identified CIF2 data file would result in something quite different since there are coercion rules for when the whitespace separator is missing.
For instance IF I KNOW it is a CIF2 file and I read
_name[1]
Then this can only be an error and I coerce into
_name [1]
IF I DON’T KNOW the file type, the occurrence of _name[1] flags it as potentially a CIF1 file. If _name[1] is in an alias list, this reinforces the likelihood of CIF1. Multiple instances of these “errors” (or any others in the above list) indicate it is a CIF1 file (my only other conclusion would be that it is a VERY BADLY written CIF2).
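To make the two cases concrete, here is a crude Python sketch of the coercion and of the pre-scan. It covers only indicator (1), splits on whitespace instead of tokenising properly (so quoted strings, comments and text fields are not handled), and all the names in it are hypothetical. The leading-comment strings and the threshold of two hits standing in for "multiple" are assumptions of this sketch.

```python
import re

# A token that starts like a data name but contains [ ] or / somewhere.
SUSPECT_NAME = re.compile(r"^_[^\s]*[\[\]/]")
# A restricted data name immediately followed by an opening bracket.
NAME_THEN_LIST = re.compile(r"(_[A-Za-z0-9_.]+)(\[.*)")


def coerce_cif2_token(token: str) -> list:
    """Known CIF2 file: treat '_name[1]' as '_name' followed by '[1]'."""
    m = NAME_THEN_LIST.fullmatch(token)
    return [m.group(1), m.group(2)] if m else [token]


def guess_version(text: str, threshold: int = 2) -> str:
    """Unknown file type: trust a leading comment, otherwise pre-scan."""
    first = text.lstrip().splitlines()[0] if text.strip() else ""
    # Assumed identifying comments, modelled on the CIF 1.1 convention.
    if first.startswith("#\\#CIF_2"):
        return "CIF2"
    if first.startswith("#\\#CIF_1"):
        return "CIF1"
    # Count data-name tokens that would be legal in CIF1 but not in CIF2.
    hits = sum(1 for tok in text.split() if SUSPECT_NAME.match(tok))
    return "CIF1" if hits >= threshold else "CIF2"


if __name__ == "__main__":
    print(coerce_cif2_token("_name[1]"))
    print(guess_version("data_x\n_a[1] 1\n_b/c 2\n"))
```

The pre-scan has to run before any coercion is applied, since coercing first would erase the very evidence the scan is counting.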
I think this takes us back to a very simple rule set, and I don’t think the restriction in the character set for data names will cause problems. For all the excitement about UTF-8 etc., I know of programming languages that support reading and writing data in such encodings, but I haven’t seen one that allows/encourages one to write programs declaring identifiers in UTF-8 character sets. (They may well exist; I just haven’t seen them.)
On 17/11/09 12:04 AM, "David Brown" <idbrown@mcmaster.ca> wrote:
James,
There seems to be a lull in the discussions on CIF2 syntax, so this would be a good time for you, or someone appointed by you, to summarize where we are at and propose a set of rules that we can work with as we move forward. I realize that much of the work I have already done on dictionaries will need to be revisited, and Herbert also seems anxious to have some decisions on the various topics that have been discussed.
I believe we have a consensus on a number of points, but these need to be written down clearly and need our formal agreement so we can move ahead.
David
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
cheers
Nick
--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering
The University of Western Australia t: +61 (0)8 6488 3452
35 Stirling Highway f: +61 (0)8 6488 1089
CRAWLEY, Perth, WA 6009 AUSTRALIA w3: www.csse.uwa.edu.au/~nick
MBDP M002
CRICOS Provider Code: 00126G
e: Nick.Spadaccini@uwa.edu.au