Re: [ddlm-group] [THREAD 4] UTF8
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] [THREAD 4] UTF8
- From: Brian McMahon <bm@iucr.org>
- Date: Tue, 13 Oct 2009 15:22:08 +0100
- In-Reply-To: <20091013055314.F86319@epsilon.pair.com>
- References: <C6F976F1.1206C%nick@csse.uwa.edu.au><504270.84370.qm@web87013.mail.ird.yahoo.com><20091013055314.F86319@epsilon.pair.com>
Without having had time to analyse it in detail, I like the pragmatic
feel of much of what Herbert says. But I wonder about the reference
to "old fortran systems":

> Here, past practice with CIF rears its head -- what do we do with trailing
> white space? In CIF until now, in order to deal with old fortran systems,
> we have assumed that we cannot tell the difference between lines that end
> with one blank or with an arbitrary number of blanks...

We've ascertained that "modern" Fortran systems can accommodate UTF-8
byte streams - can the "old" ones? In other words, if the principle of
maximal disruption applies and we accept UTF-8, are we justified at the
same time in sacrificing compatibility with such "old" Fortran-based
systems? And if so, does that allow a different handling of
"physical lines"?

Regards
Brian

On Tue, Oct 13, 2009 at 10:09:18AM -0400, Herbert J. Bernstein wrote:
> Dear Colleagues,
>
>   Let us "zero-base" this discussion and consider just the lexical
> analysis appropriate to some future CIF-like language. Let us look at
> some of the lexical issues that python deals with and consider what
> lessons we may learn there in trying to go from a string of characters
> to a string of tokens.
>
>   First, we need to settle on what characters we will be using.
> Originally, python restricted its attention to just 7-bit ascii
> characters "for program text." Now (from version 2.3 onwards), python
> allows "an encoding declaration [to be] used to indicate that string
> literals and comments use an encoding different from ASCII".
>
>   I propose that we do something similar, but with a more modern starting
> point:
>
> new cif character set and encoding:
>
>   C1: that the character set for a "new cif" be unicode, and
>   C2: that the default encoding be UTF-8; and
>   C3: that other encodings be permitted as an optional
>       system-dependent feature when an explicit encoding
>       has been specified by
>     C3.1: a unicode BOM (byte-order-mark) (see
> http://en.wikipedia.org/wiki/Byte-order_mark) having been introduced
> into a character stream, or
>     C3.2: the first or second line being a comment of the form:
>         # -*- coding: <encoding-name> -*-
>       as recognized by GNU Emacs, or
>     C3.3: the first or second line being a comment of the form:
>         # vim:fileencoding=<encoding-name>
>       as recognized by Bram Moolenaar's VIM
>       (see section 2.1.4 of
> http://docs.python.org/reference/lexical_analysis.html for more
> information).
>
>   For the rest of this discussion, let us assume unicode conventions.
>
>
>   Next, we need to decide on the rules for handling line breaks. I would
> suggest we follow the python convention of first considering "physical
> lines" and then introduce rules for joining those physical lines into
> "logical lines".
>
> Here, past practice with CIF rears its head -- what do we do with trailing
> white space? In CIF until now, in order to deal with old fortran systems,
> we have assumed that we cannot tell the difference between lines that end
> with one blank or with an arbitrary number of blanks. Many fortran
> implementations do not support a clean way to detect end of line, and,
> worse, have no way to cope with lines of arbitrary length. We also still
> have the system-dependent definitions of line termination. For our
> "customer-base" I do not see any practical way around this right now, so,
> with regret, I propose
>
> physical line:
>
>   PL1: In describing the lexer, the system-dependent end-of-line will be
> given as '\n'.
> In source files, any of the standard platform line
> termination sequences can be used - the Unix form using ASCII LF
> (linefeed), the Windows form using the ASCII sequence CR LF (return
> followed by linefeed), or the old Macintosh form using the ASCII CR
> (return) character. All of these forms can be used equally, regardless of
> platform. In addition, all space and tab characters, '\x20' and '\x09',
> immediately prior to the system-dependent end-of-line will be removed
> prior to further lexical analysis; and
>   PL2: There may be a system-dependent limit on the maximal length
> of the resulting line, but in all cases, lines of up to 2048 characters
> will be accepted.
>
> comments:
>
>   LC1: A comment starts with a hash character (#) that is not part of a
> string literal, and ends at the end of the physical line. A comment
> signifies the end of the logical line unless the implicit line joining
> rules are invoked. Comments are ignored by the syntax; they are not
> tokens.
>
> logical line:
>
>   LL1: A logical line is constructed from one or more physical lines by
> following explicit or implicit joining rules.
>   LL2: Explicit line joining: Two or more physical lines may be joined
> into logical lines using reverse solidus characters (\), as follows:
> when a physical line ends in a reverse solidus that is not part of a
> string literal or comment, it is joined with the following line, forming
> a single logical line, deleting the backslash and the following
> end-of-line character.
>   LL3: Implicit line joining: Expressions in parentheses, square brackets
> or curly braces can be split over more than one physical line without
> using backslashes. Implicitly continued lines can carry comments. Blank
> continuation lines are allowed. There is no end-of-line token between
> implicit continuation lines. Implicitly continued lines can also occur
> within triple-quoted strings (see below); in that case they cannot carry
> comments.
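
As a rough illustration of PL1 and the explicit joining rule above, here is a minimal Python sketch. It is not taken from the proposal: the function names physical_lines and join_explicit are hypothetical, the 2048-character guarantee of PL2 is not enforced, and the joining step deliberately ignores the requirement that the reverse solidus lie outside string literals and comments.

```python
import re

# Any of the three platform terminators: CR LF, a lone CR, or LF.
# CR LF is listed first so it is consumed as a single terminator.
_EOL = re.compile(r"\r\n|\r|\n")


def physical_lines(text):
    """Split text into physical lines in the spirit of PL1.

    Every terminator form is treated alike, and spaces and tabs
    immediately before the end of line are removed.
    """
    lines = _EOL.split(text)
    # A terminator at the very end leaves an empty final piece; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return [line.rstrip(" \t") for line in lines]


def join_explicit(lines):
    """Join physical lines that end in a reverse solidus.

    Simplification (an assumption of this sketch): no check is made
    that the backslash is outside a string literal or a comment.
    """
    logical = []
    pending = None  # text carried over from continued lines, if any
    for line in lines:
        if line.endswith("\\"):
            # Delete the backslash and the end-of-line that followed it.
            pending = (pending or "") + line[:-1]
        else:
            logical.append((pending or "") + line)
            pending = None
    if pending is not None:
        # The input ended on a continuation; keep what was collected.
        logical.append(pending)
    return logical
```

Because trailing blanks are stripped before the joining step, a backslash followed only by spaces or tabs still counts as a line ending in a reverse solidus, which is one consequence of PL1 worth keeping in mind.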
>
> Strings
>
>   With the character stream and the lines defined, the next thing we need
> to define is strings. I propose we adopt a subset of the python
> convention, but without the string prefixes:
>
> String literals can be enclosed in matching single quotes (') or double
> quotes ("). They can also be enclosed in matching groups of three single
> or double quotes (these are generally referred to as triple-quoted
> strings). The reverse solidus (\) character is used to escape characters
> that otherwise have a special meaning, such as newline, backslash itself,
> or the quote character.
>
> In triple-quoted strings, unescaped newlines and quotes are allowed (and
> are retained), except that three unescaped quotes in a row terminate the
> string. (A quote is the character used to open the string, i.e. either '
> or ".)
>
>   There is more to define, but if we go this far, we should be able to
> have fairly clean lexical scanners that are able to handle nested
> quotation marks in a way that most programmers will understand.
>
>   Regards,
>     Herbert
>
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>    Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
>
>                  +1-631-244-3035
>                  yaya@dowling.edu
> =====================================================
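
The string rules in the quoted proposal can be sketched as a small scanner. This is only an illustration under stated assumptions: the name scan_string is hypothetical, escape sequences are kept verbatim rather than decoded, and an unescaped newline inside a single-quoted string is treated as an error, which the proposal implies but does not state outright.

```python
def scan_string(text, start=0):
    """Scan the quoted string that opens at text[start].

    Returns (body, end), where body is the raw text between the
    delimiters and end is the index just past the closing delimiter.
    """
    quote = text[start]
    if quote not in ("'", '"'):
        raise ValueError("no opening quote at position %d" % start)
    # Three quotes in a row open a triple-quoted string.
    delim = quote * 3 if text.startswith(quote * 3, start) else quote
    i = start + len(delim)
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            # Reverse solidus: skip the escaped character, whatever it is.
            i += 2
        elif text.startswith(delim, i):
            return text[start + len(delim):i], i + len(delim)
        elif ch == "\n" and len(delim) == 1:
            # Assumption: only triple-quoted strings may span lines.
            raise ValueError("unescaped newline in single-quoted string")
        else:
            i += 1
    raise ValueError("unterminated string")
```

Checking for the three-character delimiter before the one-character one is what lets a triple-quoted string contain one or two unescaped quotes without terminating early.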