Re: [ddlm-group] [THREAD 4] UTF8

To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] [THREAD 4] UTF8
From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
Date: Tue, 13 Oct 2009 10:46:25 -0400 (EDT)
In-Reply-To: <20091013142207.GA17974@emerald.iucr.org>
References: <C6F976F1.1206C%nick@csse.uwa.edu.au><504270.84370.qm@web87013.mail.ird.yahoo.com><20091013055314.F86319@epsilon.pair.com><20091013142207.GA17974@emerald.iucr.org>

Dear Colleagues,

   Brian has asked about UTF-8 and old Fortran systems.  Almost all
Fortran systems still in use can deal with UTF-8 because most of them
are 8-bit systems, so the extra bytes in the multi-byte sequences
just look like any other printable characters.  There are still some
systems around that use 7-bit characters or odd byte sizes.  There
is a UTF-7 that could work, but it is messy and not recommended.
For those systems an external encoder/decoder would be the best
choice.  I think they would rarely be needed.
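
As a rough illustration of the point about 8-bit versus 7-bit systems, the following Python sketch checks whether the UTF-8 encoding of a string stays within 7 bits. The function names are made up for this example and are not part of any CIF software.

```python
from typing import List


def fits_in_seven_bits(text: str) -> bool:
    """Return True if every byte of the UTF-8 encoding is below 0x80."""
    return all(byte < 0x80 for byte in text.encode("utf-8"))


def high_bit_bytes(text: str) -> List[int]:
    """Return the UTF-8 bytes that have the top bit set.

    Every byte of a multi-byte UTF-8 sequence is in this category, so an
    8-bit system passes such bytes through like any other character,
    while a strict 7-bit system cannot represent them.
    """
    return [byte for byte in text.encode("utf-8") if byte >= 0x80]
```

Pure ASCII text passes the 7-bit check unchanged, which is why only files that actually use non-ASCII characters would need an external encoder/decoder on such systems.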

Most of the world is now on 8-bit character systems, even for Fortran.
Certainly Fortran 95/2003 makes that assumption.

In any case, at the moment, Fortran (old and new) still does a bad job of
handling trailing blanks on physical lines.  The fix (the Q edit
descriptor) was simply not widely accepted.  Maybe when we revisit this in 
a few years the world of Fortran will be different, but not yet.

The simple cure (just use C-I/O) is not acceptable to certain major
crystallographic application developers.  On that, too, there are fixes
in the wings, but not solid yet.

For now, I think we need to include trailing blank stripping in the spec.

   Regards,
     Herbert
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Tue, 13 Oct 2009, Brian McMahon wrote:

> Without having had time to analyse it in detail, I like the
> pragmatic feel of much of what Herbert says.
>
> But I wonder about the reference to "old fortran systems":
>
>> Here, past practice with CIF rears its head -- what do we do with trailing
>> white space?  In CIF until now, in order to deal with old fortran systems,
>> we have assumed that we cannot tell the difference between lines that end
>> with one blank or with an arbitrary number of blanks...
>
> We've ascertained that "modern" Fortran systems can accommodate UTF-8
> byte streams - can the "old" ones? In other words, if the principle of
> maximal disruption applies and we accept UTF-8, are we justified at
> the same time in sacrificing compatibility with such "old"
> Fortran-based systems? And if so, does that allow a different
> handling of "physical lines" ?
>
> Regards
> Brian
>
>
>
> On Tue, Oct 13, 2009 at 10:09:18AM -0400, Herbert J. Bernstein wrote:
>> Dear Colleagues,
>>
>>    Let us "zero-base" this discussion and consider just the lexical
>> analysis appropriate to some future CIF-like language.  Let us look at
>> some of the lexical issues that python deals with and consider what
>> lessons we may learn there in trying to go from a string of characters
>> to a string of tokens.
>>
>>    First, we need to settle on what characters we will be using.
>> Originally, python restricted its attention to just 7-bit ascii
>> characters "for program text."  Now (from version 2.3 onwards), python
>> allows "an encoding declaration [to be] used to indicate that string
>> literals and comments use an encoding different from ASCII".
>>
>>    I propose that we do something similar, but with a more modern starting
>> point:
>>
>> new cif character set and encoding:
>>
>>    C1:  that the character set for a "new cif" be unicode, and
>>    C2:  that the default encoding be UTF-8; and
>>    C3:  that other encodings be permitted as an optional
>>         system-dependent feature when an explicit encoding
>>         has been specified by
>>      C3.1:  a unicode BOM (byte-order-mark) (see
>>             http://en.wikipedia.org/wiki/Byte-order_mark) having been
>>             introduced into a character stream, or
>>      C3.2:  the first or second line being a comment of the form:
>>               # -*- coding: <encoding-name> -*-
>>             as recognized by GNU Emacs, or
>>      C3.3:  the first or second line being a comment of the form:
>>               # vim:fileencoding=<encoding-name>
>>             as recognized by Bram Moolenaar's VIM
>>    (see section 2.1.4 of
>>    http://docs.python.org/reference/lexical_analysis.html for more
>>    information).
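
One way the encoding-selection rules C1 to C3 could be read is sketched below in Python. This is only an illustration under stated assumptions: `detect_encoding` is a hypothetical helper, the regular expressions are a simplified reading of the Emacs and VIM comment forms, and the comment forms can only be found this way when the file starts in an ASCII-compatible encoding.

```python
import codecs
import re

# BOMs are tested longest first so that UTF-32 LE (FF FE 00 00) is not
# mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Simplified patterns for the two comment forms (an assumption, not a spec).
_EMACS = re.compile(rb"^#\s*-\*-\s*coding:\s*([-\w.]+)\s*-\*-")
_VIM = re.compile(rb"^#\s*vim:\s*fileencoding=([-\w.]+)")


def detect_encoding(data: bytes) -> str:
    """Guess the encoding of a raw byte stream following C1 to C3."""
    # C3.1: an explicit byte-order mark wins.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # C3.2 and C3.3: a coding comment on the first or second line.
    for line in data.splitlines()[:2]:
        for pattern in (_EMACS, _VIM):
            match = pattern.match(line.strip())
            if match:
                return match.group(1).decode("ascii")
    # C2: the default encoding.
    return "utf-8"
```

A caller would then decode the stream with the returned name, for example `data.decode(detect_encoding(data))`, and would need to handle the case where the named encoding is not supported on that system.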
>>
>> For the rest of this discussion, let us assume unicode conventions.
>>
>>
>> Next, we need to decide on the rules for handling line breaks.  I would
>> suggest we follow the python convention of first considering "physical
>> lines" and then introducing rules for joining those physical lines into
>> "logical lines".
>>
>> Here, past practice with CIF rears its head -- what do we do with trailing
>> white space?  In CIF until now, in order to deal with old fortran systems,
>> we have assumed that we cannot tell the difference between lines that end
>> with one blank or with an arbitrary number of blanks. Many fortran
>> implementations do not support a clean way to detect end of line, and,
>> worse, have no way to cope with lines of arbitrary length. We also still
>> have the system-dependent definitions of line termination.  For our
>> "customer-base" I do not see any practical way around this right now, so,
>> with regret, I propose
>>
>>    physical line:
>>
>>    PL1: In describing the lexer, the system-dependent end-of-line will be
>> given as '\n'.  In source files, any of the standard platform line
>> termination sequences can be used - the Unix form using ASCII LF
>> (linefeed), the Windows form using the ASCII sequence CR LF (return
>> followed by linefeed), or the old Macintosh form using the ASCII CR
>> (return) character. All of these forms can be used equally, regardless of
>> platform.  In addition, all space and tab characters, '\x20' and '\x09',
>> immediately prior to the system-dependent end-of-line will be removed
>> prior to further lexical analysis; and
>>    PL2: There may be a system-dependent limit on the maximal length
>> of the resulting line, but in all cases, lines of up to 2048 characters
>> will be accepted.
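
The physical-line rules PL1 and PL2 might be sketched in Python as follows. The name `physical_lines` and the choice to raise an error on over-long lines are assumptions made for this example; PL2 only guarantees that lines of up to 2048 characters are accepted, so a real system could use a larger limit.

```python
import re
from typing import List

MIN_LINE_LIMIT = 2048  # PL2: every implementation accepts at least this much

# PL1: CR LF, lone CR and lone LF are all treated as end-of-line.
# CR LF is listed first so it is consumed as one terminator, not two.
_EOL = re.compile(r"\r\n|\r|\n")


def physical_lines(text: str, limit: int = MIN_LINE_LIMIT) -> List[str]:
    """Split text into physical lines with trailing blanks removed."""
    lines = _EOL.split(text)
    # A terminator at the very end closes the last line; it does not
    # start a new empty one.
    if lines and lines[-1] == "":
        lines.pop()
    # PL1: strip spaces and tabs immediately before the end-of-line.
    stripped = [line.rstrip(" \t") for line in lines]
    for number, line in enumerate(stripped, start=1):
        if len(line) > limit:
            raise ValueError(f"line {number} exceeds {limit} characters")
    return stripped
```

Because the blanks are stripped before the length check, a line padded out with trailing spaces by a fixed-record system is measured by its visible content only.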
>>
>>    comments:
>>
>>    LC1:  A comment starts with a hash character (#) that is not part of a
>> string literal, and ends at the end of the physical line. A comment
>> signifies the end of the logical line unless the implicit line joining
>> rules are invoked. Comments are ignored by the syntax; they are not
>> tokens.
>>
>>    logical line:
>>
>>    LL1:  A logical line is constructed from one or more physical lines by
>> following explicit or implicit joining rules.
>>    LL2:  Explicit line joining:  Two or more physical lines may be joined
>> into logical lines using reverse solidus characters (\), as follows: when
>> a physical line ends in a reverse solidus that is not part of a string
>> literal or comment, it is joined with the following line, forming a single
>> logical line, deleting the backslash and the following end-of-line
>> character.
>>    LL3:  Implicit line joining: Expressions in parentheses, square brackets
>> or curly braces can be split over more than one physical line without
>> using backslashes.  Implicitly continued lines can carry comments. Blank
>> continuation lines are allowed. There is no end-of-line token between
>> implicit continuation lines. Implicitly continued lines can also occur
>> within triple-quoted strings (see below); in that case they cannot carry
>> comments.
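
A simplified Python sketch of the comment rule and of explicit and implicit line joining is given below. It is deliberately incomplete: it assumes quoted strings do not span physical lines (so triple-quoted strings are not handled), it joins implicitly continued lines with a single space, and the names `scan_line` and `logical_lines` are invented for this illustration.

```python
from typing import Iterable, List, Tuple


def scan_line(line: str, depth: int) -> Tuple[str, int, bool]:
    """Scan one physical line.

    Returns the line with any comment removed, the bracket depth after
    the line, and whether the line ends in a joining backslash.
    """
    quote = None
    i = 0
    while i < len(line):
        ch = line[i]
        if quote is not None:
            if ch == "\\":
                i += 1              # skip the escaped character
            elif ch == quote:
                quote = None        # closing quote
        elif ch in "'\"":
            quote = ch              # opening quote
        elif ch == "#":
            # Comment: ignore the rest; a backslash in it joins nothing.
            return line[:i].rstrip(" \t"), depth, False
        elif ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth = max(0, depth - 1)
        i += 1
    explicit = quote is None and line.endswith("\\")
    return line, depth, explicit


def logical_lines(physical: Iterable[str]) -> List[str]:
    """Join physical lines into logical lines."""
    result: List[str] = []
    pending = ""
    depth = 0
    for line in physical:
        code, depth, explicit = scan_line(line, depth)
        if explicit:
            pending += code[:-1]    # drop the backslash and the end-of-line
        elif depth > 0:
            if code:                # blank continuation lines add nothing
                pending += code + " "
        else:
            result.append(pending + code)
            pending = ""
    if pending:
        result.append(pending)
    return result
```

The input is expected to be physical lines that have already had their trailing blanks stripped, since a backslash followed by blanks would otherwise not be seen as ending the line.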
>>
>> Strings
>>
>>    With the character stream and the lines defined, the next thing we need
>> to define is strings.  I propose we adopt a subset of the python
>> convention, but without the string prefixes:
>>
>> String literals can be enclosed in matching single quotes (') or double
>> quotes ("). They can also be enclosed in matching groups of three single
>> or double quotes (these are generally referred to as triple-quoted
>> strings). The reverse solidus (\) character is used to escape characters
>> that otherwise have a special meaning, such as newline, backslash itself,
>> or the quote character.
>>
>> In triple-quoted strings, unescaped newlines and quotes are allowed (and
>> are retained), except that three unescaped quotes in a row terminate the
>> string. (A quote is the character used to open the string, i.e. either '
>> or ".)
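
The string rules could be prototyped along the following lines in Python. This is a sketch under assumptions: `read_string` is a hypothetical helper, only the escapes named in the text (newline, backslash, and the quote characters) are interpreted, an escaped newline is dropped as in python, and any other backslash sequence is kept verbatim.

```python
from typing import Tuple

# Escapes named in the proposal; an escaped newline is removed (assumption).
ESCAPES = {"\\": "\\", "'": "'", '"': '"', "\n": ""}


def read_string(text: str, start: int = 0) -> Tuple[str, int]:
    """Read a quoted string beginning at text[start].

    Returns the string value and the index just past the closing delimiter.
    """
    quote = text[start]
    if quote not in "'\"":
        raise ValueError("not at a string literal")
    # Three quotes in a row open a triple-quoted string.
    delim = quote * 3 if text.startswith(quote * 3, start) else quote
    triple = len(delim) == 3
    i = start + len(delim)
    out = []
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            out.append(ESCAPES.get(nxt, "\\" + nxt))
            i += 2
        elif text.startswith(delim, i):
            return "".join(out), i + len(delim)
        elif ch == "\n" and not triple:
            raise ValueError("unescaped newline in single-quoted string")
        else:
            out.append(ch)          # includes lone quotes inside triple quotes
            i += 1
    raise ValueError("unterminated string literal")
```

Returning the end index lets a lexer resume scanning right after the literal, which is what makes nested quotation marks inside triple-quoted strings straightforward to handle.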
>>
>> There is more to define, but if we go this far, we should be able to
>> have fairly clean lexical scanners that are able to handle nested
>> quotation marks in a way that most programmers will understand.
>>
>> Regards,
>>     Herbert
>>
>> =====================================================
>>   Herbert J. Bernstein, Professor of Computer Science
>>     Dowling College, Kramer Science Center, KSC 121
>>          Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                   +1-631-244-3035
>>                   yaya@dowling.edu
>> =====================================================
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>