Re: [ddlm-group] [THREAD 4] UTF8
- To: Group finalising DDLm and associated dictionaries <[email protected]>
- Subject: Re: [ddlm-group] [THREAD 4] UTF8
- From: "Herbert J. Bernstein" <[email protected]>
- Date: Tue, 13 Oct 2009 10:09:18 -0400 (EDT)
- Cc: [email protected]
- In-Reply-To: <[email protected]>
- References: <C6F976F1.1206C%[email protected]><[email protected]>
Dear Colleagues,
Let us "zero-base" this discussion and consider just the lexical
analysis appropriate to some future CIF-like language. Let us look at
some of the lexical issues that python deals with and consider what
lessons we may learn there in trying to go from a string of characters
to a string of tokens.
First, we need to settle on what characters we will be using.
Originally, python restricted its attention to just 7-bit ascii
characters "for program text." Now (from version 2.3 onwards), python
allows "an encoding declaration [to be] used to indicate that string
literals and comments use an encoding different from ASCII".
I propose that we do something similar, but with a more modern starting
point:
new cif character set and encoding:
C1: that the character set for a "new cif" be unicode, and
C2: that the default encoding be UTF-8; and
C3: that other encodings be permitted as an optional
system-dependent feature when an explicit encoding
has been specified by
C3.1: a unicode BOM (byte-order mark) (see
http://en.wikipedia.org/wiki/Byte-order_mark) having been introduced
into a character stream, or
C3.2: the first or second line being a comment of the form:
# -*- coding: <encoding-name> -*-
as recognized by GNU Emacs, or
C3.3: the first or second line being a comment of the form:
# vim:fileencoding=<encoding-name>
as recognized by Bram Moolenaar's VIM
(see section 2.1.4 of
http://docs.python.org/reference/lexical_analysis.html for
more information).
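As a rough illustration of C2 and C3, the following Python sketch picks an encoding for a raw byte stream. It is only a sketch under stated assumptions: the name detect_encoding and the BOM table are hypothetical, and the comment pattern is modelled on the one described in the python lexical analysis documentation, not on anything agreed for CIF.

```python
import codecs
import re

# Hypothetical BOM table. The UTF-32 marks are tested before the UTF-16
# ones because the UTF-32-LE mark begins with the UTF-16-LE mark.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches both "# -*- coding: <name> -*-" and "# vim:fileencoding=<name>".
_CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")


def detect_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding name implied by a BOM or a coding comment."""
    # C3.1: a byte-order mark at the start of the stream.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # C3.2 / C3.3: only the first or second line may carry the comment.
    for line in data.splitlines()[:2]:
        match = _CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    # C2: no explicit declaration, so fall back to UTF-8.
    return default
```

For example, detect_encoding(b"# -*- coding: latin-1 -*-\ndata_x\n") yields "latin-1", while a stream with neither a BOM nor a coding comment yields the UTF-8 default. Whether a given system actually supports the named encoding is left open, as C3 intends.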
For the rest of this discussion, let us assume unicode conventions.
Next, we need to decide on the rules for handling line breaks. I would
suggest we follow the python convention of first considering "physical
lines" and then introduce rules for joining those physical lines into
"logical lines".
Here, past practice with CIF rears its head -- what do we do with trailing
white space? In CIF until now, in order to deal with old fortran systems,
we have assumed that we cannot tell the difference between lines that end
with one blank or with an arbitrary number of blanks. Many fortran
implementations do not support a clean way to detect end of line, and,
worse, have no way to cope with lines of arbitrary length. We also still
have the system-dependent definitions of line termination. For our
"customer-base" I do not see any practical way around this right now, so,
with regret, I propose
physical line:
PL1: In describing the lexer, the system-dependent end-of-line will be
written as '\n'. In source files, any of the standard platform line
termination sequences can be used - the Unix form using ASCII LF
(linefeed), the Windows form using the ASCII sequence CR LF (return
followed by linefeed), or the old Macintosh form using the ASCII CR
(return) character. All of these forms can be used equally, regardless of
platform. In addition, all space and tab characters, '\x20' and '\x09',
immediately prior to the system-dependent end-of-line will be removed
prior to further lexical analysis; and
PL2: There may be a system-dependent limit on the maximal length
of the resulting line, but in all cases, lines of up to 2048 characters
will be accepted.
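A minimal sketch of PL1 in Python, assuming the input has already been decoded to a unicode string: the name physical_lines is hypothetical, and the PL2 length limit is not checked here.

```python
import re
from typing import List

# CR LF is listed first so that it is consumed as a single terminator.
# str.splitlines() is deliberately avoided because it also splits on
# other characters (form feed, U+2028, ...) that PL1 does not mention.
_EOL_RE = re.compile(r"\r\n|\r|\n")


def physical_lines(text: str) -> List[str]:
    """Split text into physical lines and strip trailing blanks and tabs."""
    lines = _EOL_RE.split(text)
    # A final terminator leaves an empty trailing piece, not an extra line.
    if lines and lines[-1] == "":
        lines.pop()
    # PL1: remove '\x20' and '\x09' immediately before the end-of-line.
    return [line.rstrip(" \t") for line in lines]
```

With this, a file using any mixture of the three terminators produces the same list of physical lines, and lines differing only in trailing blanks compare equal.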
comments:
LC1: A comment starts with a hash character (#) that is not part of a
string literal, and ends at the end of the physical line. A comment
signifies the end of the logical line unless the implicit line joining
rules are invoked. Comments are ignored by the syntax; they are not
tokens.
logical line:
LL1: A logical line is constructed from one or more physical lines by
following explicit or implicit joining rules.
LL2: Explicit line joining: Two or more physical lines may be joined
into logical lines using reverse solidus characters (\), as follows: when
a physical line ends in a reverse solidus that is not part of a string
literal or comment, it is joined with the following line, forming a single
logical line, deleting the backslash and the following end-of-line
character.
LL3: Implicit line joining: Expressions in parentheses, square brackets
or curly braces can be split over more than one physical line without
using backslashes. Implicitly continued lines can carry comments. Blank
continuation lines are allowed. There is no end-of-line token between
implicit continuation lines. Implicitly continued lines can also occur
within triple-quoted strings (see below); in that case they cannot carry
comments.
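The comment and line joining rules above can be sketched roughly as follows. This is an illustration under assumptions, not a definitive lexer: the name logical_lines is hypothetical, the input is taken to be a list of physical lines already stripped of their terminators, implicitly joined lines are separated by a single space, and an unterminated single- or double-quoted string is leniently ended at the end of its line.

```python
from typing import List

OPEN, CLOSE, QUOTES = "([{", ")]}", "'\""


def logical_lines(physical: List[str]) -> List[str]:
    """Join physical lines into logical lines, dropping comments."""
    out: List[str] = []
    buf = ""       # logical line built so far
    depth = 0      # bracket nesting depth
    delim = None   # delimiter of the open string, or None outside strings
    for line in physical:
        i, keep, explicit = 0, len(line), False
        while i < len(line):
            c = line[i]
            if delim is not None:
                if c == "\\":
                    i += 2                    # skip the escaped character
                elif line.startswith(delim, i):
                    i += len(delim)           # closing quote(s)
                    delim = None
                else:
                    i += 1
            elif c == "#":
                keep = i                      # comment runs to end of line
                break
            elif c in QUOTES:
                delim = c * 3 if line.startswith(c * 3, i) else c
                i += len(delim)
            elif c == "\\" and i == len(line) - 1:
                keep, explicit = i, True      # explicit joining
                i += 1
            else:
                depth = max(0, depth + (c in OPEN) - (c in CLOSE))
                i += 1
        buf += line[:keep]
        if delim is not None and len(delim) == 3:
            buf += "\n"    # newline is retained inside a triple-quoted string
            continue
        delim = None       # lenient: a short string cannot span lines here
        if explicit:
            continue       # backslash and end-of-line are deleted
        if depth > 0:
            buf += " "     # implicit joining, no end-of-line token
            continue
        out.append(buf)
        buf = ""
    if buf:
        out.append(buf)
    return out
```

For example, the two physical lines "x = 1 + \" and "2" become the single logical line "x = 1 + 2", and a hash character inside a quoted string does not start a comment.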
Strings
With the character stream and the lines defined, the next thing we need
to define is strings. I propose we adopt a subset of the python
convention, but without the string prefixes:
String literals can be enclosed in matching single quotes (') or double
quotes ("). They can also be enclosed in matching groups of three single
or double quotes (these are generally referred to as triple-quoted
strings). The reverse solidus (\) character is used to escape characters
that otherwise have a special meaning, such as newline, backslash itself,
or the quote character.
In triple-quoted strings, unescaped newlines and quotes are allowed (and
are retained), except that three unescaped quotes in a row terminate the
string. (A quote is the character used to open the string, i.e. either '
or ".)
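A minimal Python sketch of a scanner for such string literals follows. The name scan_string is hypothetical, and two behaviours are assumptions borrowed from python and not settled here: an escaped newline is dropped, and any escape other than backslash, quote or newline is left in the value untouched.

```python
from typing import Tuple


def scan_string(text: str, pos: int = 0) -> Tuple[str, int]:
    """Scan one string literal at text[pos]; return (value, end index)."""
    quote = text[pos]
    if quote not in "'\"":
        raise ValueError("no string literal at position %d" % pos)
    triple = text.startswith(quote * 3, pos)
    delim = quote * 3 if triple else quote
    i = pos + len(delim)
    value = []
    while i < len(text):
        c = text[i]
        if c == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            if nxt == "\n":
                pass                       # escaped newline: dropped (assumption)
            elif nxt in "\\'\"":
                value.append(nxt)          # escaped backslash or quote
            else:
                value.append(c + nxt)      # other escapes kept as-is (assumption)
            i += 2
        elif text.startswith(delim, i):
            return "".join(value), i + len(delim)
        elif c == "\n" and not triple:
            break                          # short strings may not span lines
        else:
            value.append(c)                # includes newlines and lone quotes
            i += 1                         # inside triple-quoted strings
    raise ValueError("unterminated string literal starting at position %d" % pos)
```

For example, scanning the literal 'it\'s' gives the value it's, and a triple-quoted literal may contain unescaped newlines and single occurrences of its own quote character.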
There is more to define, but if we go this far, we should be able to
have fairly clean lexical scanners that are able to handle nested
quotation marks in a way that most programmers will understand.
Regards,
Herbert
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
[email protected]
=====================================================
On Mon, 12 Oct 2009, SIMON WESTRIP wrote:
> "OK Only one-byte UTF-8 is allowed. Voila. Problem solved."
Please forgive me, but for the first time in my life I think I might have to type 'lol' :-)
(Sorry if this is inappropriate - I'll try to add something constructive tomorrow.)
Cheers
Simon
________________________________
From: Nick Spadaccini <[email protected]>
To: Group finalising DDLm and associated dictionaries <[email protected]>
Sent: Monday, 12 October, 2009 17:14:41
Subject: Re: [ddlm-group] [THREAD 4] UTF8
On 12/10/09 11:38 PM, "James Hester" <[email protected]> wrote:
> I've started a separate thread for the UTF8 discussion.
>
> John has floated the option of delinking the file encoding from the
> syntax specification, so CIF1.2 files could have either ASCII or UTF8
> encodings. I believe that this is unnecessary for the following reasons
>
> 1. Encoding can be automatically determined: If a given CIF1.2 file
> contains any bytes with values >127 then it can/should only be UTF8.
Is it? Doesn't CBF/imgCIF or whatever have binary that is the "Hammersley"
coding algorithm?
> 2. The fact that CIF1.2 syntax allows UTF8 encoding does not mean that
> any given string-valued data item could be presented in UTF8:
> dictionary writers are free to restrict the character set of data
> values. Would such dictionary-based regulation give the PDB and IUCr
> sufficient control over UTF8 introduction (John/Brian/Simon?).
OK Only one-byte UTF-8 is allowed. Voila. Problem solved.
> 3. An additional UTF8 encoding magic number could complicate the
> simple magic number scheme we currently have in place.
I don't think it does. It would simplify the case for those parsers not
supporting yet UTF-8. It would tell them to terminate the process.
cheers
Nick
--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering
The University of Western Australia t: +61 (0)8 6488 3452
35 Stirling Highway f: +61 (0)8 6488 1089
CRAWLEY, Perth, WA 6009 AUSTRALIA w3: www.csse.uwa.edu.au/~nick
MBDP M002
CRICOS Provider Code: 00126G
e: [email protected]
_______________________________________________
ddlm-group mailing list
[email protected]
http://scripts.iucr.org/mailman/listinfo/ddlm-group

