Re: [ddlm-group] [THREAD 4] UTF8
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] [THREAD 4] UTF8
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Tue, 13 Oct 2009 10:09:18 -0400 (EDT)
- Cc: Nick.Spadaccini@uwa.edu.au
- In-Reply-To: <504270.84370.qm@web87013.mail.ird.yahoo.com>
- References: <C6F976F1.1206C%nick@csse.uwa.edu.au><504270.84370.qm@web87013.mail.ird.yahoo.com>
Dear Colleagues,

Let us "zero-base" this discussion and consider just the lexical analysis appropriate to some future CIF-like language. Let us look at some of the lexical issues that Python deals with and consider what lessons we may learn there in trying to go from a string of characters to a string of tokens.

First, we need to settle on what characters we will be using. Originally, Python restricted its attention to just 7-bit ASCII characters "for program text." Now (from version 2.3 onwards), Python allows "an encoding declaration [to be] used to indicate that string literals and comments use an encoding different from ASCII". I propose that we do something similar, but with a more modern starting point:

new cif character set and encoding:

C1: that the character set for a "new cif" be Unicode, and

C2: that the default encoding be UTF-8; and

C3: that other encodings be permitted as an optional system-dependent feature when an explicit encoding has been specified by

C3.1: a Unicode BOM (byte-order-mark) (see http://en.wikipedia.org/wiki/Byte-order_mark) having been introduced into the character stream, or

C3.2: the first or second line being a comment of the form:

# -*- coding: <encoding-name> -*-

as recognized by GNU Emacs, or

C3.3: the first or second line being a comment of the form:

# vim:fileencoding=<encoding-name>

as recognized by Bram Moolenaar's VIM

(see section 2.1.4 of http://docs.python.org/reference/lexical_analysis.html for more information).

For the rest of this discussion, let us assume Unicode conventions.

Next, we need to decide on the rules for handling line breaks. I would suggest we follow the Python convention of first considering "physical lines" and then introducing rules for joining those physical lines into "logical lines". Here, past practice with CIF rears its head -- what do we do with trailing white space?
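Returning briefly to the encoding rules C2 and C3.1 to C3.3 above, here is a minimal sketch of how a reader might choose an encoding before any lexing begins. The function name detect_encoding, the BOM table, and the regular expression (modelled on the one Python uses for its own coding declarations) are illustrative assumptions, not part of the proposal.

```python
import codecs
import re

# Assumed BOM table; UTF-32 is tested before UTF-16 because the UTF-32-LE
# BOM begins with the same two bytes as the UTF-16-LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# Matches both "# -*- coding: NAME -*-" (C3.2) and
# "# vim:fileencoding=NAME" (C3.3).
_CODING = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def detect_encoding(raw: bytes) -> str:
    """Return an encoding name for a raw byte stream (hypothetical helper)."""
    # C3.1: a byte-order mark takes precedence.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    # C3.2 / C3.3: look for a coding comment on the first or second line.
    for line in raw.splitlines()[:2]:
        match = _CODING.match(line)
        if match:
            return match.group(1).decode("ascii")
    # C2: otherwise fall back to the default.
    return "utf-8"
```

A caller would then decode with raw.decode(detect_encoding(raw)); whether a given encoding name is actually supported remains system-dependent, as C3 says.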
In CIF until now, in order to deal with old Fortran systems, we have assumed that we cannot tell the difference between lines that end with one blank or with an arbitrary number of blanks. Many Fortran implementations do not support a clean way to detect end of line, and, worse, have no way to cope with lines of arbitrary length. We also still have the system-dependent definitions of line termination. For our "customer-base" I do not see any practical way around this right now, so, with regret, I propose

physical line:

PL1: In describing the lexer, the system-dependent end-of-line will be given as '\n'. In source files, any of the standard platform line termination sequences can be used - the Unix form using ASCII LF (linefeed), the Windows form using the ASCII sequence CR LF (return followed by linefeed), or the old Macintosh form using the ASCII CR (return) character. All of these forms can be used equally, regardless of platform. In addition, all space and tab characters, '\x20' and '\x09', immediately prior to the system-dependent end-of-line will be removed prior to further lexical analysis; and

PL2: There may be a system-dependent limit on the maximal length of the resulting line, but in all cases, lines of up to 2048 characters will be accepted.

comments:

LC1: A comment starts with a hash character (#) that is not part of a string literal, and ends at the end of the physical line. A comment signifies the end of the logical line unless the implicit line joining rules are invoked. Comments are ignored by the syntax; they are not tokens.
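To make PL1 above concrete, here is a minimal sketch of the physical-line step, assuming the input has already been decoded to a text string. The name physical_lines is hypothetical, and the PL2 length limit is deliberately not enforced here.

```python
import re
from typing import List

# Any of the three platform line terminators named in PL1.
# "\r\n" is listed first so that CR LF counts as one terminator.
_EOL = re.compile(r"\r\n|\r|\n")


def physical_lines(text: str) -> List[str]:
    """Split decoded text into physical lines as described in PL1."""
    lines = _EOL.split(text)
    # A terminator at the very end leaves an empty final piece; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    # Remove spaces ('\x20') and tabs ('\x09') immediately before end-of-line.
    return [line.rstrip(" \t") for line in lines]
```

Each returned string stands for one physical line with its '\n' implied, so a later stage never sees the platform-specific terminator or the trailing blanks.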
logical line:

LL1: A logical line is constructed from one or more physical lines by following explicit or implicit joining rules.

LL2: Explicit line joining: Two or more physical lines may be joined into logical lines using reverse solidus characters (\), as follows: when a physical line ends in a reverse solidus that is not part of a string literal or comment, it is joined with the following line, forming a single logical line, deleting the backslash and the following end-of-line character.

LL3: Implicit line joining: Expressions in parentheses, square brackets or curly braces can be split over more than one physical line without using backslashes. Implicitly continued lines can carry comments. Blank continuation lines are allowed. There is no end-of-line token between implicit continuation lines. Implicitly continued lines can also occur within triple-quoted strings (see below); in that case they cannot carry comments.

Strings

With the character stream and the lines defined, the next thing we need to define is strings. I propose we adopt a subset of the Python convention, but without the string prefixes:

String literals can be enclosed in matching single quotes (') or double quotes ("). They can also be enclosed in matching groups of three single or double quotes (these are generally referred to as triple-quoted strings). The reverse solidus (\) character is used to escape characters that otherwise have a special meaning, such as newline, backslash itself, or the quote character.

In triple-quoted strings, unescaped newlines and quotes are allowed (and are retained), except that three unescaped quotes in a row terminate the string. (A quote is the character used to open the string, i.e. either ' or ".)

There is more to define, but if we go this far, we should be able to have fairly clean lexical scanners that are able to handle nested quotation marks in a way that most programmers will understand.
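As a rough illustration of how the comment, line-joining and string rules above might fit together in one scanner, here is a minimal sketch. It assumes its input is a list of physical lines with terminators already removed. The name logical_lines is hypothetical, and three choices are my own assumptions rather than part of the proposal: an implicit join leaves a single space in place of the line break, blank logical lines are dropped, and errors are reported with ValueError.

```python
from typing import List

OPEN, CLOSE = "([{", ")]}"


def logical_lines(physical: List[str]) -> List[str]:
    """Join physical lines into logical lines (sketch only)."""
    text = "\n".join(physical) + "\n"
    out: List[str] = []
    buf: List[str] = []
    depth = 0                      # current bracket nesting level
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch in "'\"":
            # String literal: copy it whole so that '#', '\' and brackets
            # inside it are not interpreted.
            delim = ch * 3 if text.startswith(ch * 3, i) else ch
            j = i + len(delim)
            while j < n and not text.startswith(delim, j):
                if text[j] == "\\":
                    j += 1         # skip the escaped character
                elif text[j] == "\n" and len(delim) == 1:
                    raise ValueError("unescaped newline in a quoted string")
                j += 1
            if j >= n:
                raise ValueError("unterminated string literal")
            j += len(delim)
            buf.append(text[i:j])
            i = j
        elif ch == "#":
            # Comment: discard up to, but not including, the end of line.
            while text[i] != "\n":
                i += 1
        elif ch == "\\" and text[i + 1] == "\n":
            i += 2                 # explicit joining: drop '\' and newline
        elif ch == "\n":
            if depth > 0:
                buf.append(" ")    # implicit joining (assumed: one space)
            else:
                line = "".join(buf).rstrip()
                if line:           # assumed: blank logical lines are dropped
                    out.append(line)
                buf = []
            i += 1
        else:
            if ch in OPEN:
                depth += 1
            elif ch in CLOSE:
                depth -= 1
            buf.append(ch)
            i += 1
    if depth > 0:
        raise ValueError("unclosed bracket at end of input")
    return out
```

A real lexer would go on to split each logical line into tokens; the point here is only that a single pass with a bracket counter and a string-skipping branch is enough to apply the joining rules.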
Regards,
Herbert

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya@dowling.edu
=====================================================

On Mon, 12 Oct 2009, SIMON WESTRIP wrote:

> "OK Only one-byte UTF-8 is allowed. Voila. Problem solved."

Please forgive me, but for the first time in my life I think I might have to type 'lol' :-)

(Sorry if this is inappropriate - I'll try to add something constructive tomorrow.)

Cheers

Simon

________________________________
From: Nick Spadaccini <nick@csse.uwa.edu.au>
To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Sent: Monday, 12 October, 2009 17:14:41
Subject: Re: [ddlm-group] [THREAD 4] UTF8

On 12/10/09 11:38 PM, "James Hester" <jamesrhester@gmail.com> wrote:

> I've started a separate thread for the UTF8 discussion.
>
> John has floated the option of delinking the file encoding from the
> syntax specification, so CIF1.2 files could have either ASCII or UTF8
> encodings. I believe that this is unnecessary for the following reasons
>
> 1. Encoding can be automatically determined: If a given CIF1.2 file
> contains any bytes with values >127 then it can/should only be UTF8.

Is it? Doesn't CBF/imgCIF or whatever have binary that is the "Hammersley" coding algorithm?

> 2. The fact that CIF1.2 syntax allows UTF8 encoding does not mean that
> any given string-valued data item could be presented in UTF8:
> dictionary writers are free to restrict the character set of data
> values. Would such dictionary-based regulation give the PDB and IUCr
> sufficient control over UTF8 introduction (John/Brian/Simon?).

OK Only one-byte UTF-8 is allowed. Voila. Problem solved.

> 3. An additional UTF8 encoding magic number could complicate the
> simple magic number scheme we currently have in place.

I don't think it does.
It would simplify the case for those parsers not supporting yet UTF-8. It would tell them to terminate the process.

cheers

Nick

--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth, WA 6009 AUSTRALIA     w3: www.csse.uwa.edu.au/~nick
MBDP M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au

_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
- Follow-Ups:
- Re: [ddlm-group] [THREAD 4] UTF8 (James Hester)
- Re: [ddlm-group] [THREAD 4] UTF8 (Brian McMahon)
- References:
- Re: [ddlm-group] [THREAD 4] UTF8 (Nick Spadaccini)
- Re: [ddlm-group] [THREAD 4] UTF8 (SIMON WESTRIP)