Re: [ddlm-group] [THREAD 4] UTF8
- To: Nick.Spadaccini@uwa.edu.au, Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] [THREAD 4] UTF8
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Tue, 13 Oct 2009 11:14:58 -0400 (EDT)
- In-Reply-To: <C6FAB49A.1209A%nick@csse.uwa.edu.au>
- References: <C6FAB49A.1209A%nick@csse.uwa.edu.au>
Sorry, but there are major, modern crystallographic software packages
written in Fortran. The legacy part is not the software, but being able
to use it on older hardware/OS systems, e.g. from 2003, running, say,
Linux, and using, say, g77.

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Tue, 13 Oct 2009, Nick Spadaccini wrote:

> Old Fortran? Modern Fortran? You mean there was another one after 1966?
>
> Oh well, back to the IBM 704, and where did I put those punch cards?
>
> These problems are very real for legacy systems and programs. I must admit
> my life has been dominated by the "everything is a file or stream"
> philosophy of *nix, so these record length issues don't arise.
>
> But again, let's keep the specification and the implementation of it
> separate. Old Fortran-ers may (or may not) have to do a bit more work, but
> that is the consequence of legacy software. As far as I can tell, modern
> Fortran has libraries to deal with utf-8, but you can only enlist the
> extended character set in the source code by using \u notation etc.,
> presumably in string definitions.
>
> On 13/10/09 10:22 PM, "Brian McMahon" <bm@iucr.org> wrote:
>
>> Without having had time to analyse it in detail, I like the
>> pragmatic feel of much of what Herbert says.
>>
>> But I wonder about the reference to "old fortran systems":
>>
>>> Here, past practice with CIF rears its head -- what do we do with trailing
>>> white space? In CIF until now, in order to deal with old fortran systems,
>>> we have assumed that we cannot tell the difference between lines that end
>>> with one blank or with an arbitrary number of blanks...
>>
>> We've ascertained that "modern" Fortran systems can accommodate UTF-8
>> byte streams - can the "old" ones?
>> In other words, if the principle of
>> maximal disruption applies and we accept UTF-8, are we justified at
>> the same time in sacrificing compatibility with such "old"
>> Fortran-based systems? And if so, does that allow a different
>> handling of "physical lines"?
>>
>> Regards
>> Brian
>>
>> On Tue, Oct 13, 2009 at 10:09:18AM -0400, Herbert J. Bernstein wrote:
>>> Dear Colleagues,
>>>
>>> Let us "zero-base" this discussion and consider just the lexical
>>> analysis appropriate to some future CIF-like language. Let us look at
>>> some of the lexical issues that python deals with and consider what
>>> lessons we may learn there in trying to go from a string of characters
>>> to a string of tokens.
>>>
>>> First, we need to settle on what characters we will be using.
>>> Originally, python restricted its attention to just 7-bit ascii
>>> characters "for program text." Now (from version 2.3 onwards), python
>>> allows "an encoding declaration [to be] used to indicate that string
>>> literals and comments use an encoding different from ASCII".
>>>
>>> I propose that we do something similar, but with a more modern starting
>>> point:
>>>
>>> new cif character set and encoding:
>>>
>>> C1: that the character set for a "new cif" be unicode, and
>>> C2: that the default encoding be UTF-8; and
>>> C3: that other encodings be permitted as an optional
>>> system-dependent feature when an explicit encoding
>>> has been specified by
>>> C3.1: a unicode BOM (byte-order mark) (see
>>> http://en.wikipedia.org/wiki/Byte-order_mark) introduced
>>> into the character stream, or
>>> C3.2: the first or second line being a comment of the form:
>>> # -*- coding: <encoding-name> -*-
>>> as recognized by GNU Emacs, or
>>> C3.3: the first or second line being a comment of the form:
>>> # vim:fileencoding=<encoding-name>
>>> as recognized by Bram Moolenaar's VIM
>>> (see section 2.1.4 of
>>> http://docs.python.org/reference/lexical_analysis.html for more
>>> information).
>>>
>>> For the rest of this discussion, let us assume unicode conventions.
>>>
>>> Next, we need to decide on the rules for handling line breaks. I would
>>> suggest we follow the python convention of first considering "physical
>>> lines" and then introducing rules for joining those physical lines into
>>> "logical lines".
>>>
>>> Here, past practice with CIF rears its head -- what do we do with trailing
>>> white space? In CIF until now, in order to deal with old fortran systems,
>>> we have assumed that we cannot tell the difference between lines that end
>>> with one blank or with an arbitrary number of blanks. Many fortran
>>> implementations do not support a clean way to detect end of line, and,
>>> worse, have no way to cope with lines of arbitrary length. We also still
>>> have the system-dependent definitions of line termination. For our
>>> "customer-base" I do not see any practical way around this right now, so,
>>> with regret, I propose
>>>
>>> physical line:
>>>
>>> PL1: In describing the lexer, the system-dependent end-of-line will be
>>> given as '\n'. In source files, any of the standard platform line
>>> termination sequences can be used - the Unix form using ASCII LF
>>> (linefeed), the Windows form using the ASCII sequence CR LF (return
>>> followed by linefeed), or the old Macintosh form using the ASCII CR
>>> (return) character. All of these forms can be used equally, regardless of
>>> platform. In addition, all space and tab characters, '\x20' and '\x09',
>>> immediately prior to the system-dependent end-of-line will be removed
>>> prior to further lexical analysis; and
>>> PL2: There may be a system-dependent limit on the maximal length
>>> of the resulting line, but in all cases, lines of up to 2048 characters
>>> will be accepted.
>>>
>>> comments:
>>>
>>> LC1: A comment starts with a hash character (#) that is not part of a
>>> string literal, and ends at the end of the physical line.
>>> A comment signifies the end of the logical line unless the implicit
>>> line joining rules are invoked. Comments are ignored by the syntax;
>>> they are not tokens.
>>>
>>> logical line:
>>>
>>> LL1: A logical line is constructed from one or more physical lines by
>>> following explicit or implicit joining rules.
>>> LL2: Explicit line joining: Two or more physical lines may be joined
>>> into logical lines using reverse solidus characters (\), as follows:
>>> when a physical line ends in a reverse solidus that is not part of a
>>> string literal or comment, it is joined with the following line,
>>> forming a single logical line, and the reverse solidus and the
>>> following end-of-line character are deleted.
>>> LL3: Implicit line joining: Expressions in parentheses, square brackets
>>> or curly braces can be split over more than one physical line without
>>> using backslashes. Implicitly continued lines can carry comments. Blank
>>> continuation lines are allowed. There is no end-of-line token between
>>> implicit continuation lines. Implicitly continued lines can also occur
>>> within triple-quoted strings (see below); in that case they cannot carry
>>> comments.
>>>
>>> Strings
>>>
>>> With the character stream and the lines defined, the next thing we need
>>> to define is strings. I propose we adopt a subset of the python
>>> convention, but without the string prefixes:
>>>
>>> String literals can be enclosed in matching single quotes (') or double
>>> quotes ("). They can also be enclosed in matching groups of three single
>>> or double quotes (these are generally referred to as triple-quoted
>>> strings). The reverse solidus (\) character is used to escape characters
>>> that otherwise have a special meaning, such as newline, backslash itself,
>>> or the quote character.
>>>
>>> In triple-quoted strings, unescaped newlines and quotes are allowed (and
>>> are retained), except that three unescaped quotes in a row terminate the
>>> string.
>>> (A quote is the character used to open the string, i.e. either '
>>> or ".)
>>>
>>> There is more to define, but if we go this far, we should be able to
>>> have fairly clean lexical scanners that are able to handle nested
>>> quotation marks in a way that most programmers will understand.
>>>
>>> Regards,
>>> Herbert
>>>
>>> =====================================================
>>> Herbert J. Bernstein, Professor of Computer Science
>>> Dowling College, Kramer Science Center, KSC 121
>>> Idle Hour Blvd, Oakdale, NY, 11769
>>>
>>> +1-631-244-3035
>>> yaya@dowling.edu
>>> =====================================================
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
> cheers
>
> Nick
>
> --------------------------------
> Associate Professor N. Spadaccini, PhD
> School of Computer Science & Software Engineering
>
> The University of Western Australia    t: +61 (0)8 6488 3452
> 35 Stirling Highway                    f: +61 (0)8 6488 1089
> CRAWLEY, Perth, WA 6009 AUSTRALIA      w3: www.csse.uwa.edu.au/~nick
> MBDP M002
>
> CRICOS Provider Code: 00126G
>
> e: Nick.Spadaccini@uwa.edu.au
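
The encoding rules (C2, C3), the physical-line rule (PL1) and the explicit line-joining rule proposed in the quoted message could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not something the thread agreed on: the function names detect_encoding, physical_lines and logical_lines are hypothetical, the coding-comment pattern is borrowed from Python's PEP 263 rather than taken from the message, no line-length limit is imposed (PL2 only requires that lines of up to 2048 characters be accepted), and the joining step ignores string literals, comments and implicit joining, all of which a real lexer would have to track.

```python
import codecs
import re

# Assumption: pattern taken from Python's PEP 263. It matches both the
# Emacs form "# -*- coding: NAME -*-" and the Vim form
# "# vim:fileencoding=NAME", because "fileencoding=" contains "coding=".
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")

# UTF-32 is tested first because the UTF-32-LE BOM starts with the
# UTF-16-LE BOM. The codec names chosen here strip the BOM on decoding.
BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def detect_encoding(raw: bytes) -> str:
    """C2/C3: UTF-8 unless a BOM or a coding comment on line 1 or 2 says otherwise."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name
    for line in raw.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return "utf-8"


def physical_lines(text: str) -> list[str]:
    """PL1: accept LF, CR LF or CR as end-of-line; drop trailing spaces and tabs."""
    lines = re.split(r"\r\n|\r|\n", text)
    if lines and lines[-1] == "":
        lines.pop()  # the text ended with a line terminator
    return [line.rstrip(" \t") for line in lines]


def logical_lines(lines: list[str]) -> list[str]:
    """Explicit joining only: a trailing reverse solidus joins a line to the next.

    Simplification: does not check whether the reverse solidus sits inside
    a string literal or a comment, and does no implicit (bracket) joining.
    """
    joined, pending, continuing = [], "", False
    for line in lines:
        if line.endswith("\\"):
            pending += line[:-1]  # delete the reverse solidus and the line end
            continuing = True
        else:
            joined.append(pending + line)
            pending, continuing = "", False
    if continuing:  # input ended on a continuation line
        joined.append(pending)
    return joined
```

A caller would decode the raw bytes with the codec returned by detect_encoding(raw), then pass the text through physical_lines and logical_lines before tokenizing. Because trailing blanks are stripped before joining, a reverse solidus followed only by spaces or tabs still counts as a continuation in this sketch.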