
Re: [ddlm-group] [THREAD 4] UTF8

To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] [THREAD 4] UTF8
From: Brian McMahon <bm@iucr.org>
Date: Tue, 13 Oct 2009 15:22:08 +0100
In-Reply-To: <20091013055314.F86319@epsilon.pair.com>
References: <C6F976F1.1206C%nick@csse.uwa.edu.au><504270.84370.qm@web87013.mail.ird.yahoo.com><20091013055314.F86319@epsilon.pair.com>

Without having had time to analyse it in detail, I like the
pragmatic feel of much of what Herbert says. 

But I wonder about the reference to "old fortran systems":
 
> Here, past practice with CIF rears its head -- what do we do with trailing 
> white space?  In CIF until now, in order to deal with old fortran systems, 
> we have assumed that we cannot tell the difference between lines that end 
> with one blank or with an arbitrary number of blanks...

We've ascertained that "modern" Fortran systems can accommodate UTF-8
byte streams - can the "old" ones? In other words, if the principle of
maximal disruption applies and we accept UTF-8, are we justified at
the same time in sacrificing compatibility with such "old"
Fortran-based systems? And if so, does that allow a different
handling of "physical lines"?

Regards
Brian



On Tue, Oct 13, 2009 at 10:09:18AM -0400, Herbert J. Bernstein wrote:
> Dear Colleagues,
> 
>    Let us "zero-base" this discussion and consider just the lexical 
> analysis appropriate to some future CIF-like language.  Let us look at 
> some of the lexical issues that python deals with and consider what
> lessons we may learn there in trying to go from a string of characters
> to a string of tokens.
> 
>    First, we need to settle on what characters we will be using. 
> Originally, python restricted its attention to just 7-bit ascii 
> characters "for program text."  Now (from version 2.3 onwards), python 
> allows "an encoding declaration [to be] used to indicate that string 
> literals and comments use an encoding different from ASCII".
> 
>    I propose that we do something similar, but with a more modern starting
> point:
> 
> new cif character set and encoding:
> 
>    C1:  that the character set for a "new cif" be unicode, and
>    C2:  that the default encoding be UTF-8; and
>    C3:  that other encodings be permitted as an optional 
> system-dependent feature when an explicit encoding
> has been specified by
>      C3.1:  a unicode BOM (byte-order-mark) (see
> http://en.wikipedia.org/wiki/Byte-order_mark) introduced
> into the character stream, or
>      C3.2:  the first or second line being a comment of the form:
>        # -*- coding: <encoding-name> -*-
>      as recognized by GNU Emacs, or
>      C3.3:  the first or second line being a comment of the form:
>        # vim:fileencoding=<encoding-name>
>      as recognized by Bram Moolenaar's VIM
> (see section 2.1.4 of 
> http://docs.python.org/reference/lexical_analysis.html for
> more information).
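
A minimal Python sketch of how C1-C3 might be applied to a raw byte
stream. The function name detect_encoding and the BOM table are
illustrative assumptions, not part of the proposal. The regular
expression is modelled on the one given in the Python reference cited
above, and it covers both the Emacs and the VIM forms.

```python
import codecs
import re

# Matches "# -*- coding: name -*-" and "# vim:fileencoding=name".
# Modelled on the pattern in the Python reference; an assumption here.
_DECL = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")

# UTF-32 must be tested before UTF-16: the UTF-32-LE BOM begins with
# the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_encoding(data: bytes) -> str:
    """Return the encoding implied by a BOM (C3.1) or by a coding
    comment on the first or second line (C3.2, C3.3); otherwise fall
    back to the proposed default, UTF-8 (C2)."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    for line in data.splitlines()[:2]:
        match = _DECL.match(line)
        if match:
            return match.group(1).decode("ascii")
    return "utf-8"
```

Note that this sketch only finds a coding comment if the bytes of the
first two lines are ASCII-compatible, which is the same restriction
python itself lives with.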
> 
> For the rest of this discussion, let us assume unicode conventions.
> 
> 
> Next, we need to decide on the rules for handling line breaks.  I would
> suggest we follow the python convention of first considering "physical
> lines" and then introduce rules for joining those physical lines into
> "logical lines".
> 
> Here, past practice with CIF rears its head -- what do we do with trailing 
> white space?  In CIF until now, in order to deal with old fortran systems, 
> we have assumed that we cannot tell the difference between lines that end 
> with one blank or with an arbitrary number of blanks. Many fortran 
> implementations do not support a clean way to detect end of line, and, 
> worse, have no way to cope with lines of arbitrary length. We also still 
> have the system-dependent definitions of line termination.  For our 
> "customer-base" I do not see any practical way around this right now, so, 
> with regret, I propose
> 
>    physical line:
> 
>    PL1: In describing the lexer, the system-dependent end-of-line will be
> given as '\n'.  In source files, any of the standard platform line 
> termination sequences can be used - the Unix form using ASCII LF 
> (linefeed), the Windows form using the ASCII sequence CR LF (return 
> followed by linefeed), or the old Macintosh form using the ASCII CR 
> (return) character. All of these forms can be used equally, regardless of 
> platform.  In addition, all space and tab characters, '\x20' '\x09', 
> immediately prior to the system-dependent end-of-line will be removed
> prior to further lexical analysis; and
>    PL2: There may be a system-dependent limit on the maximal length
> of the resulting line, but in all cases, lines of up to 2048 characters
> will be accepted.
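
One way PL1 and PL2 could be read, as a small Python sketch. The names
physical_lines, within_guaranteed_limit and MIN_LINE_LIMIT are
hypothetical, and the text is assumed to have been decoded to unicode
already.

```python
import re

# Any of the three platform terminators; CR LF must be tried first so
# that it is not split into two separate line ends.
_EOL = re.compile(r"\r\n|\r|\n")

# PL2: every implementation accepts lines at least this long.
MIN_LINE_LIMIT = 2048


def physical_lines(text: str) -> list[str]:
    """Split decoded text into physical lines (PL1): accept LF, CR LF
    or CR as the line end, and strip trailing spaces and tabs."""
    lines = _EOL.split(text)
    if lines and lines[-1] == "":
        lines.pop()  # the text ended with a terminator (or was empty)
    return [line.rstrip(" \t") for line in lines]


def within_guaranteed_limit(line: str) -> bool:
    """True if the line is within the length every system must accept."""
    return len(line) <= MIN_LINE_LIMIT
```

Because the trailing blanks are stripped before anything else sees the
line, a line ending in one blank and a line ending in many become
indistinguishable, which is the past CIF practice described above.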
> 
>    comments:
> 
>    LC1:  A comment starts with a hash character (#) that is not part of a 
> string literal, and ends at the end of the physical line. A comment 
> signifies the end of the logical line unless the implicit line joining 
> rules are invoked. Comments are ignored by the syntax; they are not 
> tokens.
> 
>    logical line:
> 
>    LL1:  A logical line is constructed from one or more physical lines by
> following explicit or implicit joining rules.
>    LL2:  Explicit line joining:  Two or more physical lines may be joined 
> into logical lines using reverse solidus characters (\), as follows: when 
> a physical line ends in a reverse solidus that is not part of a string 
> literal or comment, it is joined with the following line, forming a single 
> logical line, deleting the backslash and the following end-of-line 
> character.
>    LL3:  Implicit line joining: Expressions in parentheses, square brackets 
> or curly braces can be split over more than one physical line without 
> using backslashes.  Implicitly continued lines can carry comments. Blank 
> continuation lines are allowed. There is no end-of-line token between 
> implicit continuation lines. Implicitly continued lines can also occur 
> within triple-quoted strings (see below); in that case they cannot carry 
> comments.
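
A deliberately simplified Python sketch of the explicit and implicit
joining rules. The function name logical_lines is hypothetical. It
assumes the physical lines have already had trailing blanks removed,
and it does not recognise string literals or comments, so a bracket or
reverse solidus inside either would be misread. Replacing the line end
with a single space when joining implicitly is also an assumption of
the sketch.

```python
def logical_lines(physical: list[str]) -> list[str]:
    """Join physical lines into logical lines.

    Explicit joining: a line ending in a reverse solidus is joined to
    the next one, dropping the backslash and the line end.
    Implicit joining: lines are joined while a bracket is still open.
    """
    out: list[str] = []
    buf = ""
    depth = 0  # number of currently open (, [ or {
    for line in physical:
        for ch in line:
            if ch in "([{":
                depth += 1
            elif ch in ")]}" and depth > 0:
                depth -= 1
        if line.endswith("\\"):
            buf += line[:-1]      # explicit join: drop the backslash
        elif depth > 0:
            buf += line + " "     # implicit join: bracket still open
        else:
            out.append(buf + line)
            buf = ""
    if buf:
        out.append(buf)           # input ended inside a continuation
    return out
```

A full lexer would track string and comment state in the same loop, so
that only brackets and backslashes outside them count.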
> 
> Strings
> 
>    With the character stream and the lines defined, the next thing we need
> to define is strings.  I propose we adopt a subset of the python 
> convention, but without the string prefixes:
> 
> String literals can be enclosed in matching single quotes (') or double 
> quotes ("). They can also be enclosed in matching groups of three single 
> or double quotes (these are generally referred to as triple-quoted 
> strings). The reverse solidus (\) character is used to escape characters 
> that otherwise have a special meaning, such as newline, backslash itself, 
> or the quote character.
> 
> In triple-quoted strings, unescaped newlines and quotes are allowed (and 
> are retained), except that three unescaped quotes in a row terminate the 
> string. (A quote is the character used to open the string, i.e. either ' 
> or ".)
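
A minimal Python sketch of a scanner for the string rules quoted
above. The name scan_string and its return convention are
hypothetical. Escape sequences are kept verbatim rather than
interpreted, and rejecting an unescaped newline in a single-quoted
string follows python's behaviour, which is an assumption about where
the proposal would end up.

```python
def scan_string(text: str, start: int = 0) -> tuple[str, int]:
    """Scan one string literal beginning at text[start].

    Returns (raw body, index just past the closing delimiter).
    """
    quote = text[start]
    if quote not in "'\"":
        raise ValueError("not at a string literal")
    triple = text.startswith(quote * 3, start)
    delim = quote * 3 if triple else quote
    i = start + len(delim)
    body = []
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            body.append(text[i:i + 2])  # keep the escape pair as written
            i += 2
        elif text.startswith(delim, i):
            return "".join(body), i + len(delim)
        elif ch == "\n" and not triple:
            raise ValueError("unescaped newline in single-quoted string")
        else:
            body.append(ch)
            i += 1
    raise ValueError("unterminated string literal")
```

For example, scanning '''it's ok''' returns the body it's ok, because a
lone quote inside a triple-quoted string does not match the three-quote
delimiter.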
> 
> There is more to define, but if we go this far, we should be able to
> have fairly clean lexical scanners that are able to handle nested
> quotation marks in a way that most programmers will understand.
> 
> Regards,
>     Herbert
> 
> =====================================================
>   Herbert J. Bernstein, Professor of Computer Science
>     Dowling College, Kramer Science Center, KSC 121
>          Idle Hour Blvd, Oakdale, NY, 11769
> 
>                   +1-631-244-3035
>                   yaya@dowling.edu
> =====================================================
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
