Discussion List Archives


Re: [ddlm-group] [THREAD 4] UTF8

Without having had time to analyse it in detail, I like the
pragmatic feel of much of what Herbert says. 

But I wonder about the reference to "old fortran systems":
 
> Here, past practice with CIF rears its head -- what do we do with trailing 
> white space?  In CIF until now, in order to deal with old fortran systems, 
> we have assumed that we cannot tell the difference between lines that end 
> with one blank or with an arbitrary number of blanks...

We've ascertained that "modern" Fortran systems can accommodate UTF-8
byte streams - can the "old" ones? In other words, if the principle of
maximal disruption applies and we accept UTF-8, are we justified at
the same time in sacrificing compatibility with such "old"
Fortran-based systems? And if so, does that allow a different
handling of "physical lines"?

Regards
Brian



On Tue, Oct 13, 2009 at 10:09:18AM -0400, Herbert J. Bernstein wrote:
> Dear Colleagues,
> 
>    Let us "zero-base" this discussion and consider just the lexical 
> analysis appropriate to some future CIF-like language.  Let us look at 
> some of the lexical issues that Python deals with and consider what
> lessons we may learn there in trying to go from a string of characters
> to a string of tokens.
> 
>    First, we need to settle on what characters we will be using. 
> Originally, Python restricted its attention to just 7-bit ASCII 
> characters "for program text."  Now (from version 2.3 onwards), Python 
> allows "an encoding declaration [to be] used to indicate that string 
> literals and comments use an encoding different from ASCII".
> 
>    I propose that we do something similar, but with a more modern starting
> point:
> 
> new cif character set and encoding:
> 
>    C1:  that the character set for a "new cif" be Unicode, and
>    C2:  that the default encoding be UTF-8; and
>    C3:  that other encodings be permitted as an optional 
> system-dependent feature when an explicit encoding
> has been specified by
>      C3.1:  a Unicode BOM (byte-order mark) (see
> http://en.wikipedia.org/wiki/Byte-order_mark) introduced into the
> character stream, or
>      C3.2:  the first or second line being a comment of the form:
>        # -*- coding: <encoding-name> -*-
>      as recognized by GNU Emacs, or
>      C3.3:  the first or second line being a comment of the form:
>        # vim:fileencoding=<encoding-name>
>      as recognized by Bram Moolenaar's VIM
> (see section 2.1.4 of
> http://docs.python.org/reference/lexical_analysis.html for more
> information).
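> 
>    As a rough illustration of C2/C3 in Python (the function name
> detect_encoding and the exact comment pattern below are assumptions for
> illustration, not part of the proposal), a reader could resolve the
> encoding of a raw byte stream along these lines:
> 
>    import codecs
>    import re
> 
>    # Emacs-style "-*- coding: <name> -*-" and Vim-style
>    # "vim:fileencoding=<name>" declarations in a leading comment line.
>    _CODING_RE = re.compile(
>        rb'^#.*?(?:-\*-\s*coding:\s*|vim:\s*fileencoding=)([-\w.]+)')
> 
>    def detect_encoding(raw):
>        """Return the encoding name of a raw byte stream per C2/C3."""
>        if raw.startswith(codecs.BOM_UTF8):
>            return 'utf-8'                     # C3.1: UTF-8 BOM
>        if raw[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE):
>            return 'utf-16'                    # C3.1: UTF-16 BOM
>        for line in raw.splitlines()[:2]:      # C3.2 / C3.3: coding comment
>            m = _CODING_RE.match(line)
>            if m:
>                return m.group(1).decode('ascii')
>        return 'utf-8'                         # C2: the default encoding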
> 
> For the rest of this discussion, let us assume Unicode conventions.
> 
> 
> Next, we need to decide on the rules for handling line breaks.  I would
> suggest we follow the Python convention of first considering "physical
> lines" and then introduce rules for joining those physical lines into
> "logical lines".
> 
> Here, past practice with CIF rears its head -- what do we do with trailing 
> white space?  In CIF until now, in order to deal with old fortran systems, 
> we have assumed that we cannot tell the difference between lines that end 
> with one blank or with an arbitrary number of blanks. Many Fortran 
> implementations do not support a clean way to detect end of line, and, 
> worse, have no way to cope with lines of arbitrary length. We also still 
> have the system-dependent definitions of line termination.  For our 
> "customer-base" I do not see any practical way around this right now, so, 
> with regret, I propose
> 
>    physical line:
> 
>    PL1: In describing the lexer, the system-dependent end-of-line will be
> written as '\n'.  In source files, any of the standard platform line 
> termination sequences can be used - the Unix form using ASCII LF 
> (linefeed), the Windows form using the ASCII sequence CR LF (return 
> followed by linefeed), or the old Macintosh form using the ASCII CR 
> (return) character. All of these forms can be used equally, regardless of 
> platform.  In addition, all space and tab characters, '\x20' and '\x09', 
> immediately prior to the system-dependent end-of-line will be removed
> prior to further lexical analysis; and
>    PL2: There may be a system-dependent limit on the maximal length
> of the resulting line, but in all cases, lines of up to 2048 characters
> will be accepted.
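> 
>    A minimal sketch of PL1/PL2 in Python (the names physical_lines and
> MAX_LINE are illustrative assumptions):
> 
>    MAX_LINE = 2048    # PL2: lines up to this length must always be accepted
> 
>    def physical_lines(text):
>        """Yield physical lines from decoded text, normalized per PL1."""
>        # PL1: accept Unix (LF), Windows (CR LF) and old Macintosh (CR)
>        # line terminations equally, regardless of platform.
>        text = text.replace('\r\n', '\n').replace('\r', '\n')
>        for n, line in enumerate(text.split('\n'), start=1):
>            line = line.rstrip(' \t')    # PL1: strip trailing blanks and tabs
>            if len(line) > MAX_LINE:     # PL2: this reader's own limit
>                raise ValueError('line %d exceeds %d characters'
>                                 % (n, MAX_LINE))
>            yield line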
> 
>    comments:
> 
>    LC1:  A comment starts with a hash character (#) that is not part of a 
> string literal, and ends at the end of the physical line. A comment 
> signifies the end of the logical line unless the implicit line joining 
> rules are invoked. Comments are ignored by the syntax; they are not 
> tokens.
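> 
>    A sketch of LC1, assuming the string forms defined further below (the
> helper strip_comment is illustrative and only handles quotes that open
> and close on the same physical line):
> 
>    def strip_comment(line):
>        """Remove an unquoted '#' comment from one physical line (LC1)."""
>        quote = None                   # quote character currently open, if any
>        i = 0
>        while i < len(line):
>            c = line[i]
>            if quote is not None:
>                if c == '\\':          # skip over an escaped character
>                    i += 1
>                elif c == quote:
>                    quote = None       # closing quote found
>            elif c in ('"', "'"):
>                quote = c              # opening quote
>            elif c == '#':
>                return line[:i]        # LC1: comment runs to end of line
>            i += 1
>        return line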
> 
>    logical line:
> 
>    LL1:  A logical line is constructed from one or more physical lines by
> following explicit or implicit joining rules.
>    LL2:  Explicit line joining:  Two or more physical lines may be joined
> into logical lines using reverse solidus characters (\), as follows: when
> a physical line ends in a reverse solidus that is not part of a string
> literal or comment, it is joined with the following line, forming a single
> logical line, deleting the backslash and the following end-of-line
> character.
>    LL3:  Implicit line joining: Expressions in parentheses, square brackets
> or curly braces can be split over more than one physical line without
> using backslashes.  Implicitly continued lines can carry comments. Blank
> continuation lines are allowed. There is no end-of-line token between
> implicit continuation lines. Implicitly continued lines can also occur
> within triple-quoted strings (see below); in that case they cannot carry
> comments.
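> 
>    A deliberately simplified sketch of LL1-LL3 (bracket and backslash
> detection here ignores string literals and comments, which a real scanner
> would have to track; the name logical_lines is an assumption):
> 
>    def logical_lines(phys_lines):
>        """Join physical lines into logical lines (LL2 explicit, LL3 implicit)."""
>        logical, depth = '', 0
>        for line in phys_lines:
>            # LL3: track open parentheses, square brackets and curly braces.
>            depth += sum(line.count(c) for c in '([{')
>            depth -= sum(line.count(c) for c in ')]}')
>            if depth == 0 and line.endswith('\\'):
>                logical += line[:-1]      # LL2: drop backslash and newline
>            elif depth > 0:
>                logical += line + ' '     # LL3: no end-of-line token emitted
>            else:
>                yield logical + line      # the logical line is complete
>                logical = ''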
> 
> Strings
> 
>    With the character stream and the lines defined, the next thing we need
> to define is strings.  I propose we adopt a subset of the Python 
> convention, but without the string prefixes:
> 
> String literals can be enclosed in matching single quotes (') or double 
> quotes ("). They can also be enclosed in matching groups of three single 
> or double quotes (these are generally referred to as triple-quoted 
> strings). The reverse solidus (\) character is used to escape characters 
> that otherwise have a special meaning, such as newline, backslash itself, 
> or the quote character.
> 
> In triple-quoted strings, unescaped newlines and quotes are allowed (and 
> are retained), except that three unescaped quotes in a row terminate the 
> string. (A quote is the character used to open the string, i.e. either ' 
> or ".)
> 
> There is more to define, but if we go this far, we should be able to
> have fairly clean lexical scanners that are able to handle nested
> quotation marks in a way that most programmers will understand.
> 
> Regards,
>     Herbert
> 
> =====================================================
>   Herbert J. Bernstein, Professor of Computer Science
>     Dowling College, Kramer Science Center, KSC 121
>          Idle Hour Blvd, Oakdale, NY, 11769
> 
>                   +1-631-244-3035
>                   yaya@dowling.edu
> =====================================================
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
