Discussion List Archives

Re: [ddlm-group] [THREAD 4] UTF8

Old Fortran? Modern Fortran? You mean there was another one after 1966?

Oh well, back to the IBM 704, and where did I put those punch cards?

These problems are very real for legacy systems and programs. I must admit
my life has been dominated by the "everything is a file or stream"
philosophy of *nix, so these record length issues don't arise.

But again, let's keep the specification and its implementation
separate. Old Fortran-ers may (or may not) have to do a bit more work, but
that is the consequence of legacy software. As far as I can tell, modern
Fortran has libraries to deal with UTF-8, but you can only use the
extended character set in the source code through \u notation etc.,
presumably in string definitions.

On 13/10/09 10:22 PM, "Brian McMahon" <bm@iucr.org> wrote:

> Without having had time to analyse it in detail, I like the
> pragmatic feel of much of what Herbert says.
> 
> But I wonder about the reference to "old fortran systems":
>  
>> Here, past practice with CIF rears its head -- what do we do with trailing
>> white space?  In CIF until now, in order to deal with old fortran systems,
>> we have assumed that we cannot tell the difference between lines that end
>> with one blank or with an arbitrary number of blanks...
> 
> We've ascertained that "modern" Fortran systems can accommodate UTF-8
> byte streams - can the "old" ones? In other words, if the principle of
> maximal disruption applies and we accept UTF-8, are we justified at
> the same time in sacrificing compatibility with such "old"
> Fortran-based systems? And if so, does that allow a different
> handling of "physical lines" ?
> 
> Regards
> Brian
> 
> 
> 
> On Tue, Oct 13, 2009 at 10:09:18AM -0400, Herbert J. Bernstein wrote:
>> Dear Colleagues,
>> 
>>    Let us "zero-base" this discussion and consider just the lexical
>> analysis appropriate to some future CIF-like language.  Let us look at
>> some of the lexical issues that python deals with and consider what
>> lessons we may learn there in trying to go from a string of characters
>> to a string of tokens.
>> 
>>    First, we need to settle on what characters we will be using.
>> Originally, python restricted its attention to just 7-bit ascii
>> characters "for program text."  Now (from version 2.3 onwards), python
>> allows "an encoding declaration [to be] used to indicate that string
>> literals and comments use an encoding different from ASCII".
>> 
>>    I propose that we do something similar, but with a more modern starting
>> point:
>> 
>> new cif character set and encoding:
>> 
>>    C1:  that the character set for a "new cif" be unicode, and
>>    C2:  that the default encoding be UTF-8; and
>>    C3:  that other encodings be permitted as an optional
>> system-dependent feature when an explicit encoding
>> has been specified by
>>      C3.1:  a unicode BOM (byte-order mark; see
>> http://en.wikipedia.org/wiki/Byte-order_mark) that has been introduced
>> into the character stream, or
>>      C3.2:  the first or second line being a comment of the form:
>>        # -*- coding: <encoding-name> -*-
>>      as recognized by GNU Emacs, or
>>      C3.3:  the first or second line being a comment of the form:
>>        # vim:fileencoding=<encoding-name>
>>      as recognized by Bram Moolenaar's VIM
>> (see section 2.1.4 of
>> http://docs.python.org/reference/lexical_analysis.html for more
>> information).
>> 
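A minimal Python sketch of how a reader might apply C1 to C3 above: look for a BOM first, then for an Emacs- or VIM-style coding comment on the first or second line, and otherwise fall back to UTF-8. The function name detect_encoding, the BOM table and the regular expression are illustrative assumptions, not part of the proposal.

```python
import codecs
import re

# BOMs are listed so that UTF-32 LE is tested before UTF-16 LE,
# because the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Assumed pattern: matches both "# -*- coding: name -*-" (C3.2)
# and "# vim:fileencoding=name" (C3.3).
_CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)")


def detect_encoding(data: bytes, default: str = "utf-8") -> str:
    """Return the encoding name implied by the start of a byte stream."""
    # C3.1: a byte-order mark takes priority.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # C3.2 / C3.3: a coding comment on the first or second line.
    for line in data.splitlines()[:2]:
        match = _CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    # C2: otherwise the default encoding applies.
    return default
```

A caller would then decode the stream with the returned name, for example data.decode(detect_encoding(data)). The sketch does not check that the named encoding is one the system actually supports, which C3 leaves system-dependent.
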
>> For the rest of this discussion, let us assume unicode conventions.
>> 
>> 
>> Next, we need to decide on the rules for handling line breaks.  I would
>> suggest we follow the python convention of first considering "physical
>> lines" and then introduce rules for joining those physical lines into
>> "logical lines".
>> 
>> Here, past practice with CIF rears its head -- what do we do with trailing
>> white space?  In CIF until now, in order to deal with old fortran systems,
>> we have assumed that we cannot tell the difference between lines that end
>> with one blank or with an arbitrary number of blanks. Many fortran
>> implementations do not support a clean way to detect end of line, and,
>> worse, have no way to cope with lines of arbitrary length. We also still
>> have the system-dependent definitions of line termination.  For our
>> "customer-base" I do not see any practical way around this right now, so,
>> with regret, I propose
>> 
>>    physical line:
>> 
>>    PL1: In describing the lexer, the system-dependent end-of-line will be
>> written as '\n'.  In source files, any of the standard platform line
>> termination sequences can be used - the Unix form using ASCII LF
>> (linefeed), the Windows form using the ASCII sequence CR LF (return
>> followed by linefeed), or the old Macintosh form using the ASCII CR
>> (return) character. All of these forms can be used equally, regardless of
>> platform.  In addition, all space and tab characters, '\x20' and '\x09',
>> immediately prior to the system-dependent end-of-line will be removed
>> prior to further lexical analysis; and
>>    PL2: There may be a system-dependent limit on the maximal length
>> of the resulting line, but in all cases, lines of up to 2048 characters
>> will be accepted.
>> 
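A short Python sketch of PL1 and PL2, assuming the text has already been decoded to a str. The name physical_lines and the choice to raise ValueError on an over-long line are assumptions made for illustration; the proposal only says that a limit may exist and must be at least 2048 characters.

```python
import re

MIN_LINE_LIMIT = 2048  # PL2: lines up to this length must always be accepted


def physical_lines(text: str, limit: int = MIN_LINE_LIMIT) -> list[str]:
    """Split text into physical lines following PL1 and PL2."""
    if limit < MIN_LINE_LIMIT:
        raise ValueError("the limit may not be lower than 2048 characters")
    # PL1: CR LF and a lone CR are both treated as '\n'.
    normalised = re.sub(r"\r\n|\r", "\n", text)
    pieces = normalised.split("\n")
    # A final terminator (or empty input) leaves one empty trailing piece.
    if pieces[-1] == "":
        pieces.pop()
    lines = []
    for number, piece in enumerate(pieces, start=1):
        # PL1: drop spaces and tabs immediately before the end-of-line.
        line = piece.rstrip(" \t")
        # PL2: the limit applies to the resulting (trimmed) line.
        if len(line) > limit:
            raise ValueError(f"line {number} exceeds {limit} characters")
        lines.append(line)
    return lines
```

Because trailing blanks are removed before the length check, a line padded out to a fixed record length by an old Fortran system is measured by its content only.
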
>>    comments:
>> 
>>    LC1:  A comment starts with a hash character (#) that is not part of a
>> string literal, and ends at the end of the physical line. A comment
>> signifies the end of the logical line unless the implicit line joining
>> rules are invoked. Comments are ignored by the syntax; they are not
>> tokens.
>> 
>>    logical line:
>> 
>>    LL1:  A logical line is constructed from one or more physical lines by
>> following explicit or implicit joining rules
>>    LL2:  Explicit line joining:  Two or more physical lines may be joined
>> into logical lines using reverse solidus characters (\), as follows: when a
>> physical line ends in a reverse solidus that is not part of a string literal
>> or comment, it is joined with the following line, forming a single logical
>> line, deleting the backslash and the following end-of-line character.
>>    LL3:  Implicit line joining: Expressions in parentheses, square brackets
>> or curly braces can be split over more than one physical line without
>> using backslashes.  Implicitly continued lines can carry comments. Blank
>> continuation lines are allowed. There is no end-of-line token between
>> implicit continuation lines. Implicitly continued lines can also occur
>> within triple-quoted strings (see below); in that case they cannot carry
>> comments.
>> 
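The comment and line-joining rules above can be sketched in Python roughly as follows. This is a simplified illustration under stated assumptions: input is a list of physical lines already trimmed of trailing blanks, string literals open and close on the same physical line (triple-quoted strings are not handled), an implicit join replaces the line break with one space, and blank logical lines are dropped. The name logical_lines is hypothetical.

```python
OPENERS, CLOSERS = "([{", ")]}"


def logical_lines(physical_lines):
    """Join physical lines into logical lines, discarding comments."""
    logical = []
    current = ""  # text of the logical line being assembled
    depth = 0     # nesting depth of (), [] and {}
    for line in physical_lines:
        quote = None     # quote character of the string we are inside
        end = len(line)  # index where a comment starts, if any
        i = 0
        while i < len(line):
            ch = line[i]
            if quote is not None:
                if ch == "\\":
                    i += 1  # skip the escaped character
                elif ch == quote:
                    quote = None
            elif ch in "'\"":
                quote = ch
            elif ch == "#":
                end = i  # comment runs to the end of the physical line
                break
            elif ch in OPENERS:
                depth += 1
            elif ch in CLOSERS and depth > 0:
                depth -= 1
            i += 1
        code = line[:end]
        has_comment = end < len(line)
        if code.endswith("\\") and not has_comment and quote is None:
            current += code[:-1]  # explicit join: drop backslash and newline
            continue
        current += code
        if depth > 0:
            current += " "  # implicit join: line break acts as white space
            continue
        if current.strip():
            logical.append(current.rstrip())
        current = ""
    if current.strip():
        logical.append(current.rstrip())  # continuation left open at end of input
    return logical
```

For example, the two physical lines "total = 1 + \" and "2  # sum" yield the single logical line "total = 1 + 2", while a "#" inside a quoted string is left alone.
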
>> Strings
>> 
>>    With the character stream and the lines defined, the next thing we need
>> to define is strings.  I propose we adopt a subset of the python
>> convention, but without the string prefixes:
>> 
>> String literals can be enclosed in matching single quotes (') or double
>> quotes ("). They can also be enclosed in matching groups of three single
>> or double quotes (these are generally referred to as triple-quoted
>> strings). The reverse solidus (\) character is used to escape characters
>> that otherwise have a special meaning, such as newline, backslash itself,
>> or the quote character.
>> 
>> In triple-quoted strings, unescaped newlines and quotes are allowed (and
>> are retained), except that three unescaped quotes in a row terminate the
>> string. (A quote is the character used to open the string, i.e. either '
>> or ".)
>> 
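A small Python sketch of a scanner for the string rules just described: single- or double-quoted strings, triple-quoted strings that may contain unescaped newlines and quotes, and the reverse solidus as escape character. The name scan_string is hypothetical, the body is returned with its escape sequences left undecoded, and rejecting an unescaped newline in a single-quoted string is an assumption carried over from python rather than something stated above.

```python
def scan_string(text: str, start: int = 0) -> tuple[str, int]:
    """Scan the string literal beginning at text[start].

    Returns (raw_body, index_after_closing_delimiter).
    """
    quote = text[start]
    if quote not in "'\"":
        raise ValueError("no string literal at this position")
    # Three quotes in a row open a triple-quoted string.
    triple = text.startswith(quote * 3, start)
    delim = quote * 3 if triple else quote
    body_start = start + len(delim)
    i = body_start
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            i += 2  # an escaped character never terminates the string
            continue
        if text.startswith(delim, i):
            return text[body_start:i], i + len(delim)
        if ch == "\n" and not triple:
            raise ValueError("unescaped newline in single-quoted string")
        i += 1
    raise ValueError("unterminated string literal")
```

Inside a triple-quoted string a lone quote or a pair of quotes is ordinary content; only three unescaped quotes in a row end the string, which is what lets nested quotation marks pass through unchanged.
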
>> There is more to define, but if we go this far, we should be able to
>> have fairly clean lexical scanners that are able to handle nested
>> quotation marks in a way that most programmers will understand.
>> 
>> Regards,
>>     Herbert
>> 
>> =====================================================
>>   Herbert J. Bernstein, Professor of Computer Science
>>     Dowling College, Kramer Science Center, KSC 121
>>          Idle Hour Blvd, Oakdale, NY, 11769
>> 
>>                   +1-631-244-3035
>>                   yaya@dowling.edu
>> =====================================================
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group

cheers

Nick

--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au




