Re: [ddlm-group] [THREAD 4] UTF8
- To: [email protected], Group finalising DDLm and associated dictionaries <[email protected]>
- Subject: Re: [ddlm-group] [THREAD 4] UTF8
- From: "Herbert J. Bernstein" <[email protected]>
- Date: Tue, 13 Oct 2009 11:14:58 -0400 (EDT)
- In-Reply-To: <C6FAB49A.1209A%[email protected]>
- References: <C6FAB49A.1209A%[email protected]>
Sorry, but there are major, modern crystallographic software packages
written in Fortran. The legacy part is not the software, but being
able to use it on older hardware/OS systems, e.g. from 2003, running,
say, Linux, and using, say, g77.
=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769
+1-631-244-3035
[email protected]
=====================================================
On Tue, 13 Oct 2009, Nick Spadaccini wrote:
> Old Fortran? Modern Fortran? You mean there was another one after 1966?
>
> Oh well, back to the IBM 704, and where did I put those punch cards?
>
> These problems are very real for legacy systems and programs. I must admit
> my life has been dominated by the "everything is a file or stream"
> philosophy of *nix, so these record length issues don't arise.
>
> But again, let's keep the specification, and the implementation of it
> separate. Old Fortran-ers may (or may not) have to do a bit more work, but
> that is the consequence of legacy software. As far as I can tell modern
> Fortran has libraries to deal with utf-8, but you can only enlist the
> extended character set in the source code by using \u notation etc
> presumably in string definitions.
>
> On 13/10/09 10:22 PM, "Brian McMahon" <[email protected]> wrote:
>
>> Without having had time to analyse it in detail, I like the
>> pragmatic feel of much of what Herbert says.
>>
>> But I wonder about the reference to "old fortran systems":
>>
>>> Here, past practice with CIF rears its head -- what do we do with trailing
>>> white space? In CIF until now, in order to deal with old fortran systems,
>>> we have assumed that we cannot tell the difference between lines that end
>>> with one blank or with an arbitrary number of blanks...
>>
>> We've ascertained that "modern" Fortran systems can accommodate UTF-8
>> byte streams - can the "old" ones? In other words, if the principle of
>> maximal disruption applies and we accept UTF-8, are we justified at
>> the same time in sacrificing compatibility with such "old"
>> Fortran-based systems? And if so, does that allow a different
>> handling of "physical lines" ?
>>
>> Regards
>> Brian
>>
>>
>>
>> On Tue, Oct 13, 2009 at 10:09:18AM -0400, Herbert J. Bernstein wrote:
>>> Dear Colleagues,
>>>
>>> Let us "zero-base" this discussion and consider just the lexical
>>> analysis appropriate to some future CIF-like language. Let us look at
>>> some of the lexical issues that Python deals with and consider what
>>> lessons we may learn there in trying to go from a string of characters
>>> to a string of tokens.
>>>
>>> First, we need to settle on what characters we will be using.
>>> Originally, Python restricted its attention to just 7-bit ASCII
>>> characters "for program text." Now (from version 2.3 onwards), Python
>>> allows "an encoding declaration [to be] used to indicate that string
>>> literals and comments use an encoding different from ASCII".
>>>
>>> I propose that we do something similar, but with a more modern starting
>>> point:
>>>
>>> new cif character set and encoding:
>>>
>>> C1: that the character set for a "new cif" be Unicode; and
>>> C2: that the default encoding be UTF-8; and
>>> C3: that other encodings be permitted as an optional
>>>     system-dependent feature when an explicit encoding
>>>     has been specified by
>>>     C3.1: a Unicode BOM (byte-order mark) (see
>>>           http://en.wikipedia.org/wiki/Byte-order_mark) introduced
>>>           into the character stream, or
>>>     C3.2: the first or second line being a comment of the form
>>>             # -*- coding: <encoding-name> -*-
>>>           as recognized by GNU Emacs, or
>>>     C3.3: the first or second line being a comment of the form
>>>             # vim:fileencoding=<encoding-name>
>>>           as recognized by Bram Moolenaar's VIM
>>>     (see section 2.1.4 of
>>>     http://docs.python.org/reference/lexical_analysis.html for more
>>>     information).
>>>
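
The C1-C3 rules above could be sketched roughly as follows. This is only
an illustration under stated assumptions, not part of the proposal: the
function name detect_encoding and the two regular expressions are
hypothetical, and the comment search only works for encodings that are
ASCII-compatible in their first two lines.

```python
import codecs
import re

# BOMs are checked longest-first so UTF-32 LE is not mistaken for UTF-16 LE.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Hypothetical patterns for the Emacs-style and Vim-style comments (C3.2, C3.3).
_EMACS = re.compile(rb"^[ \t]*#.*?-\*-[ \t]*coding:[ \t]*([-\w.]+)[ \t]*-\*-")
_VIM = re.compile(rb"^[ \t]*#[ \t]*vim:[ \t]*fileencoding=([-\w.]+)")


def detect_encoding(data: bytes) -> str:
    """Guess the encoding name of a raw byte stream, defaulting to UTF-8 (C2)."""
    # C3.1: a byte-order mark at the start of the stream.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # C3.2 / C3.3: a coding comment on the first or second line.
    for line in data.splitlines()[:2]:
        match = _EMACS.match(line) or _VIM.match(line)
        if match:
            return match.group(1).decode("ascii")
    return "utf-8"
```

Whether a declared encoding is actually supported would remain
system-dependent, as C3 says; the sketch only reports the declared name.
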
>>> For the rest of this discussion, let us assume Unicode conventions.
>>>
>>>
>>> Next, we need to decide on the rules for handling line breaks. I would
>>> suggest we follow the Python convention of first considering "physical
>>> lines" and then introducing rules for joining those physical lines into
>>> "logical lines".
>>>
>>> Here, past practice with CIF rears its head -- what do we do with trailing
>>> white space? In CIF until now, in order to deal with old fortran systems,
>>> we have assumed that we cannot tell the difference between lines that end
>>> with one blank or with an arbitrary number of blanks. Many fortran
>>> implementations do not support a clean way to detect end of line, and,
>>> worse, have no way to cope with lines of arbitrary length. We also still
>>> have the system-dependent definitions of line termination. For our
>>> "customer-base" I do not see any practical way around this right now, so,
>>> with regret, I propose
>>>
>>> physical line:
>>>
>>> PL1: In describing the lexer, the system-dependent end-of-line will be
>>> written as '\n'. In source files, any of the standard platform line
>>> termination sequences can be used - the Unix form using ASCII LF
>>> (linefeed), the Windows form using the ASCII sequence CR LF (return
>>> followed by linefeed), or the old Macintosh form using the ASCII CR
>>> (return) character. All of these forms can be used equally, regardless of
>>> platform. In addition, all space and tab characters, '\x20' and '\x09',
>>> immediately prior to the system-dependent end-of-line will be removed
>>> prior to further lexical analysis; and
>>> PL2: There may be a system-dependent limit on the maximal length
>>> of the resulting line, but in all cases, lines of up to 2048 characters
>>> will be accepted.
>>>
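
A minimal sketch of PL1 and PL2, assuming the input has already been
decoded to a Python string. The name physical_lines and the choice to
raise ValueError on an over-long line are assumptions made for the
example; the proposal itself does not say how a violation is reported.

```python
import re

MIN_LINE_LIMIT = 2048  # PL2: lines of up to this many characters are always accepted

# PL1: CR LF, bare CR, and bare LF are all treated as one end-of-line.
_EOL = re.compile(r"\r\n|\r|\n")


def physical_lines(text, limit=MIN_LINE_LIMIT):
    """Split text into physical lines with trailing spaces and tabs removed."""
    lines = _EOL.split(text)
    if lines and lines[-1] == "":
        lines.pop()  # a final terminator does not start another line
    result = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip(" \t")  # PL1: drop '\x20' and '\x09' before end-of-line
        if len(line) > limit:      # PL2: the limit applies to the resulting line
            raise ValueError(f"line {number} exceeds {limit} characters")
        result.append(line)
    return result
```

With this, "a  \r\nb\t\rc\n" yields the three lines "a", "b" and "c", so
a reader cannot tell how many blanks originally ended each line.
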
>>> comments:
>>>
>>> LC1: A comment starts with a hash character (#) that is not part of a
>>> string literal, and ends at the end of the physical line. A comment
>>> signifies the end of the logical line unless the implicit line joining
>>> rules are invoked. Comments are ignored by the syntax; they are not
>>> tokens.
>>>
>>> logical line:
>>>
>>> LL1: A logical line is constructed from one or more physical lines by
>>> following explicit or implicit joining rules.
>>> LL2: Explicit line joining: Two or more physical lines may be joined
>>> into logical lines using reverse solidus characters (\), as follows:
>>> when a physical line ends in a reverse solidus that is not part of a
>>> string literal or comment, it is joined with the following line,
>>> forming a single logical line, deleting the reverse solidus and the
>>> following end-of-line character.
>>> LL3: Implicit line joining: Expressions in parentheses, square brackets
>>> or curly braces can be split over more than one physical line without
>>> using backslashes. Implicitly continued lines can carry comments. Blank
>>> continuation lines are allowed. There is no end-of-line token between
>>> implicit continuation lines. Implicitly continued lines can also occur
>>> within triple-quoted strings (see below); in that case they cannot carry
>>> comments.
>>>
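
The comment rule and the explicit and implicit joining rules above might
be sketched as below. This is a simplified illustration, not a complete
lexer: logical_lines is a hypothetical name, only single-line string
literals are recognised (triple-quoted strings spanning lines are left
out), implicitly joined lines are separated by one space, and blank or
comment-only logical lines are dropped.

```python
import re

# One alternative per lexical item that matters for line joining.
_ITEM = re.compile(r"""
    (?P<string>'(?:\\.|[^'\\])*'|"(?:\\.|[^"\\])*")   # single-line string literal
  | (?P<comment>\#.*)                                  # comment runs to end of line
  | (?P<open>[(\[{])
  | (?P<close>[)\]}])
  | (?P<other>.)
""", re.VERBOSE)


def logical_lines(physical):
    """Join physical lines (terminators already removed) into logical lines."""
    logical, parts, depth = [], [], 0
    for line in physical:
        kept = []
        for item in _ITEM.finditer(line):
            if item.lastgroup == "comment":
                break                      # LC1: comments are not tokens
            if item.lastgroup == "open":
                depth += 1
            elif item.lastgroup == "close":
                depth = max(depth - 1, 0)
            kept.append(item.group())
        text = "".join(kept)
        if text.endswith("\\"):
            parts.append(text[:-1])        # explicit join: drop the reverse solidus
            continue
        if depth > 0:
            parts.append(text + " ")       # implicit join inside brackets
            continue
        joined = ("".join(parts) + text).rstrip()
        parts = []
        if joined:
            logical.append(joined)
    if parts:                              # input ended while still continuing
        logical.append("".join(parts).rstrip())
    return logical
```

Because a string literal is matched as a single item, a '#' or a bracket
inside quotes does not start a comment or change the nesting depth.
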
>>> Strings
>>>
>>> With the character stream and the lines defined, the next thing we need
>>> to define is strings. I propose we adopt a subset of the Python
>>> convention, but without the string prefixes:
>>>
>>> String literals can be enclosed in matching single quotes (') or double
>>> quotes ("). They can also be enclosed in matching groups of three single
>>> or double quotes (these are generally referred to as triple-quoted
>>> strings). The reverse solidus (\) character is used to escape characters
>>> that otherwise have a special meaning, such as newline, backslash itself,
>>> or the quote character.
>>>
>>> In triple-quoted strings, unescaped newlines and quotes are allowed (and
>>> are retained), except that three unescaped quotes in a row terminate the
>>> string. (A quote is the character used to open the string, i.e. either '
>>> or ".)
>>>
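
A small scanner for the string rules above, written as a sketch under
assumptions: scan_string is a hypothetical helper, it returns the raw
body without interpreting escape sequences, and it treats an unescaped
newline inside a single- or double-quoted string as an error, which the
text above implies but does not state outright.

```python
def scan_string(text, pos=0):
    """Return (raw_body, end_index) for the string literal starting at text[pos]."""
    quote = text[pos]
    if quote not in "'\"":
        raise ValueError("no string literal at this position")
    # Three identical quotes in a row open a triple-quoted string.
    triple = text.startswith(quote * 3, pos)
    delim = quote * 3 if triple else quote
    i = pos + len(delim)
    body = []
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            body.append(text[i:i + 2])   # keep the escaped pair intact
            i += 2
        elif text.startswith(delim, i):
            return "".join(body), i + len(delim)
        elif ch == "\n" and not triple:
            break                        # unescaped newline outside triple quotes
        else:
            body.append(ch)
            i += 1
    raise ValueError("unterminated string literal")
```

Inside a triple-quoted string the scanner passes single quote characters
and newlines through unchanged, which is what allows nested quotation
marks without escaping.
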
>>> There is more to define, but if we go this far, we should be able to
>>> have fairly clean lexical scanners that are able to handle nested
>>> quotation marks in a way that most programmers will understand.
>>>
>>> Regards,
>>> Herbert
>>>
>>> =====================================================
>>> Herbert J. Bernstein, Professor of Computer Science
>>> Dowling College, Kramer Science Center, KSC 121
>>> Idle Hour Blvd, Oakdale, NY, 11769
>>>
>>> +1-631-244-3035
>>> [email protected]
>>> =====================================================
>> _______________________________________________
>> ddlm-group mailing list
>> [email protected]
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
> cheers
>
> Nick
>
> --------------------------------
> Associate Professor N. Spadaccini, PhD
> School of Computer Science & Software Engineering
>
> The University of Western Australia t: +61 (0)8 6488 3452
> 35 Stirling Highway f: +61 (0)8 6488 1089
> CRAWLEY, Perth, WA 6009 AUSTRALIA w3: www.csse.uwa.edu.au/~nick
> MBDP M002
>
> CRICOS Provider Code: 00126G
>
> e: [email protected]
>
>
>
>
>
> _______________________________________________
> ddlm-group mailing list
> [email protected]
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>