Re: [ddlm-group] [THREAD 4] UTF8

Dear Colleagues,

   Let us "zero-base" this discussion and consider just the lexical 
analysis appropriate to some future CIF-like language.  Let us look at 
some of the lexical issues that python deals with and consider what
lessons we may learn there in trying to go from a string of characters
to a string of tokens.

   First, we need to settle on what characters we will be using. 
Originally, python restricted its attention to just 7-bit ascii 
characters "for program text."  Now (from version 2.3 onwards), python 
allows "an encoding declaration [to be] used to indicate that string 
literals and comments use an encoding different from ASCII".

   I propose that we do something similar, but with a more modern starting
point:

   new cif character set and encoding:

   C1:  that the character set for a "new cif" be unicode, and
   C2:  that the default encoding be UTF-8; and
   C3:  that other encodings be permitted as an optional 
system-dependent feature when an explicit encoding
has been specified by
     C3.1:  a unicode BOM (byte-order mark) (see
http://en.wikipedia.org/wiki/Byte-order_mark) introduced at the start
of the character stream, or
     C3.2:  the first or second line being a comment of the form:
       # -*- coding: <encoding-name> -*-
     as recognized by GNU Emacs, or
     C3.3:  the first or second line being a comment of the form:
       # vim:fileencoding=<encoding-name>
     as recognized by Bram Moolenaar's VIM
(see section 2.1.4 of 
http://docs.python.org/reference/lexical_analysis.html for more
detail).

For the rest of this discussion, let us assume unicode conventions.
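
To make C1-C3 concrete, here is a minimal python sketch of how a reader
might choose the encoding of a byte stream.  The function name
detect_encoding and the regular expression are my own illustration and
are not part of any CIF specification; the pattern is modelled on the
one python uses for its own encoding declarations.

```python
import codecs
import re

# Matches both "# -*- coding: <name> -*-" (C3.2) and
# "# vim:fileencoding=<name>" (C3.3), since "fileencoding=" ends in "coding=".
# Hypothetical pattern, modelled on python's own declaration syntax.
_DECLARATION = re.compile(rb"^[ \t]*#.*?coding[:=][ \t]*([-\w.]+)")

# UTF-32 is tested before UTF-16 because the UTF-32-LE BOM begins
# with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_encoding(data: bytes) -> str:
    """Return a codec name for data following rules C2 and C3."""
    # C3.1: a BOM at the start of the stream.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # C3.2 / C3.3: a declaration comment on the first or second line.
    for line in data.splitlines()[:2]:
        match = _DECLARATION.match(line)
        if match:
            return match.group(1).decode("ascii")
    # C2: the default encoding.
    return "utf-8"
```

A caller would then decode with data.decode(detect_encoding(data)).
Whether a declared encoding is actually available is the
system-dependent part of C3 and is not checked here.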

Next, we need to decide on the rules for handling line breaks.  I would
suggest we follow the python convention of first considering "physical
lines" and then introducing rules for joining those physical lines into
"logical lines".

Here, past practice with CIF rears its head -- what do we do with trailing 
white space?  In CIF until now, in order to deal with old fortran systems, 
we have assumed that we cannot tell the difference between lines that end 
with one blank or with an arbitrary number of blanks. Many fortran 
implementations do not support a clean way to detect end of line, and, 
worse, have no way to cope with lines of arbitrary length. We also still 
have the system-dependent definitions of line termination.  For our 
"customer-base" I do not see any practical way around this right now, so, 
with regret, I propose

   physical line:

   PL1: In describing the lexer, the system-dependent end-of-line will be
written as '\n'.  In source files, any of the standard platform line 
termination sequences can be used - the Unix form using ASCII LF 
(linefeed), the Windows form using the ASCII sequence CR LF (return 
followed by linefeed), or the old Macintosh form using the ASCII CR 
(return) character. All of these forms can be used equally, regardless of 
platform.  In addition, all space and tab characters, '\x20' and '\x09', 
immediately prior to the system-dependent end-of-line will be removed
prior to further lexical analysis; and
   PL2: There may be a system-dependent limit on the maximal length
of the resulting line, but in all cases, lines of up to 2048 characters
will be accepted.
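
As an illustration of PL1 and PL2, the following python sketch splits an
already-decoded text into physical lines.  The names physical_lines,
within_guaranteed_length and GUARANTEED_LENGTH are hypothetical, and the
sketch assumes the whole text is held in memory.

```python
import re

# PL2: every implementation must accept lines of at least this length.
GUARANTEED_LENGTH = 2048

# PL1: CR LF is listed first so that it is consumed as one terminator
# rather than as a CR followed by a separate LF.
_LINE_END = re.compile(r"\r\n|\r|\n")


def physical_lines(text: str) -> list[str]:
    """Split text into physical lines, stripping trailing blanks and tabs."""
    lines = _LINE_END.split(text)
    # A terminator at the very end of the text does not start a new line.
    if lines and lines[-1] == "":
        lines.pop()
    # PL1: remove spaces and tabs immediately before the end-of-line.
    return [line.rstrip(" \t") for line in lines]


def within_guaranteed_length(line: str) -> bool:
    """True if every conforming reader must accept this physical line (PL2)."""
    return len(line) <= GUARANTEED_LENGTH
```

Note that the length is measured after the trailing white space has been
removed, which is how I read "the resulting line" in PL2.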

   comments:

   LC1:  A comment starts with a hash character (#) that is not part of a 
string literal, and ends at the end of the physical line. A comment 
signifies the end of the logical line unless the implicit line joining 
rules are invoked. Comments are ignored by the syntax; they are not 
tokens.

   logical line:

   LL1:  A logical line is constructed from one or more physical lines by
following explicit or implicit joining rules.
   LL2:  Explicit line joining:  Two or more physical lines may be joined 
into logical lines using reverse solidus characters (\), as follows: when a 
physical line ends in a reverse solidus that is not part of a string 
literal or comment, it is joined with the following forming a single 
logical line, deleting the reverse solidus and the following end-of-line 
character.
   LL3:  Implicit line joining: Expressions in parentheses, square brackets 
or curly braces can be split over more than one physical line without 
using reverse solidus characters.  Implicitly continued lines can carry 
comments. Blank continuation lines are allowed. There is no end-of-line 
token between implicit continuation lines. Implicitly continued lines can 
also occur within triple-quoted strings (see below); in that case they 
cannot carry comments.
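
A deliberately simplified python sketch of the two joining rules follows.
The function name logical_lines is hypothetical.  The sketch assumes the
physical lines contain no string literals or comments, so every reverse
solidus and every bracket is taken at face value; a real lexer would have
to track strings and comments as well.

```python
def logical_lines(physical):
    """Join physical lines into logical lines (simplified sketch).

    Assumption: no string literals or comments are present, so brackets
    and trailing reverse solidus characters are always significant.
    """
    result = []
    current = ""
    depth = 0  # number of currently open ( [ { brackets
    for line in physical:
        # Explicit joining: a trailing reverse solidus is deleted together
        # with the end-of-line that follows it.
        explicit = line.endswith("\\")
        if explicit:
            line = line[:-1]
        current += line
        depth += sum(line.count(c) for c in "([{")
        depth -= sum(line.count(c) for c in ")]}")
        depth = max(depth, 0)
        if explicit:
            continue
        if depth > 0:
            # Implicit joining: the line break acts as white space and
            # produces no end-of-line token.
            current += " "
            continue
        result.append(current)
        current = ""
    if current:
        # Input ended inside a continuation; keep what we have.
        result.append(current)
    return result
```

For example, the physical lines "b = [1," and "2]" come out as the single
logical line "b = [1, 2]".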


   With the character stream and the lines defined, the next thing we need
to define is strings.  I propose we adopt a subset of the python 
convention, but without the string prefixes:

String literals can be enclosed in matching single quotes (') or double 
quotes ("). They can also be enclosed in matching groups of three single 
or double quotes (these are generally referred to as triple-quoted 
strings). The reverse solidus (\) character is used to escape characters 
that otherwise have a special meaning, such as newline, backslash itself, 
or the quote character.

In triple-quoted strings, unescaped newlines and quotes are allowed (and 
are retained), except that three unescaped quotes in a row terminate the 
string. (A quote is the character used to open the string, i.e. either ' 
or ".)
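
To show how little machinery these string rules need, here is a minimal
python sketch of a scanner for one string literal.  The name scan_string
is hypothetical.  I have assumed that a reverse solidus followed by a
newline drops both characters (as python does) and that a reverse solidus
followed by any other character yields that character literally; escapes
such as \t are not interpreted here.

```python
def scan_string(text, start=0):
    """Scan the string literal beginning at text[start].

    Returns (value, end), where end is the index just past the closing
    quote(s).  Raises ValueError if there is no well-formed literal.
    """
    quote = text[start:start + 1]
    if quote not in ("'", '"'):
        raise ValueError("not at a string literal")
    # Three opening quotes in a row select the triple-quoted form.
    triple = text.startswith(quote * 3, start)
    delimiter = quote * 3 if triple else quote
    pos = start + len(delimiter)
    chars = []
    while pos < len(text):
        if text.startswith(delimiter, pos):
            return "".join(chars), pos + len(delimiter)
        ch = text[pos]
        if ch == "\\" and pos + 1 < len(text):
            escaped = text[pos + 1]
            # Assumption: an escaped newline is dropped; any other
            # escaped character is kept as-is.
            if escaped != "\n":
                chars.append(escaped)
            pos += 2
            continue
        if ch == "\n" and not triple:
            raise ValueError("unescaped newline in single-line string")
        chars.append(ch)
        pos += 1
    raise ValueError("unterminated string literal")
```

Because the other kind of quote is an ordinary character inside a
string, nested quotation marks need no special treatment.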

There is more to define, but if we go this far, we should be able to
have fairly clean lexical scanners that are able to handle nested
quotation marks in a way that most programmers will understand.


  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769


On Mon, 12 Oct 2009, SIMON WESTRIP wrote:

> "OK Only one-byte UTF-8 is allowed. Voila. Problem solved."

Please forgive me, but for the first time in my life I think I might have to type 'lol'  :-)

(Sorry if this is inappropriate - I'll try to add something constructive tomorrow.)



From: Nick Spadaccini <nick@csse.uwa.edu.au>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Monday, 12 October, 2009 17:14:41
Subject: Re: [ddlm-group] [THREAD 4] UTF8

On 12/10/09 11:38 PM, "James Hester" <jamesrhester@gmail.com> wrote:

> I've started a separate thread for the UTF8 discussion.
> John has floated the option of delinking the file encoding from the
> syntax specification, so CIF1.2 files could have either ASCII or UTF8
> encodings.  I believe that this is unnecessary for the following reasons
> 1. Encoding can be automatically determined: If a given CIF1.2 file
> contains any bytes with values >127 then it can/should only be UTF8.

Is it? Doesn't CBF/imgCIF or whatever have binary that is the "Hammersley"
coding algorithm?

> 2. The fact that CIF1.2 syntax allows UTF8 encoding does not mean that
> any given string-valued data item could be presented in UTF8:
> dictionary writers are free to restrict the character set of data
> values. Would such dictionary-based regulation give the PDB and IUCr
> sufficient control over UTF8 introduction (John/Brian/Simon?).

OK Only one-byte UTF-8 is allowed. Voila. Problem solved.

> 3. An additional UTF8 encoding magic number could complicate the
> simple magic number scheme we currently have in place.

I don't think it does. It would simplify the case for those parsers not
yet supporting UTF-8. It would tell them to terminate the process.



Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au

