[ddlm-group] options/text vs binary/end-of-line
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: [ddlm-group] options/text vs binary/end-of-line
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Fri, 18 Jun 2010 09:09:49 -0400 (EDT)
- In-Reply-To: <alpine.BSF.2.00.1006180703230.91255@epsilon.pair.com>
- References: <alpine.BSF.2.00.1005111250250.60002@epsilon.pair.com><alpine.BSF.2.00.1005181330210.38662@epsilon.pair.com><AANLkTimOLbOkIqCwqgsKJ36eVctlZccsAN4XAjYDr4Qd@mail.gmail.com><20100614142541.GA356@emerald.iucr.org><8F77913624F7524AACD2A92EAF3BFA54165DF3381E@SJMEMXMBS11.stjude.sjcrh.local><AANLkTikeIbft9SKfvpgTpGZVpo47Vg_acYBbXi-eUvU-@mail.gmail.com><alpine.BSF.2.00.1006152223480.59900@epsilon.pair.com><AANLkTimmOPFkQhY1KY24Dg5kz3MUB4mO2sjoM848bqjV@mail.gmail.com><alpine.BSF.2.00.1006160719520.58405@epsilon.pair.com><881462.27872.qm@web87009.mail.ird.yahoo.com><AANLkTin51hXra-cIPzH3VMcUxJHMaUPWL71Kf1zM8SNt@mail.gmail.com><alpine.BSF.2.00.1006172025070.91418@epsilon.pair.com><AANLkTimEn-5bOcLNsa1DSOjDS7XqFmqVKA-W-6Z4NxFO@mail.gmail.com><alpine.BSF.2.00.1006172107430.91418@epsilon.pair.com><AANLkTilJUtXpw5UFQv0Y04Knrv9wCPLr5eertWPCcTzz@mail.gmail.com><alpine.BSF.2.00.1006180703230.91255@epsilon.pair.com>
Now to deal with the real issues -- should CIF2 allow multiple optional representations? Is CIF2 a binary file or a text file? And how do we treat end-of-line?

The code point for the end of line in a "normal" unix-style UTF-8 file is U+000A (LF or NL), but all of the following are also used as line terminators (see http://en.wikipedia.org/wiki/Newline):

  U+000C (FF)
  U+000D (CR)
  U+000D U+000A (CR LF)
  U+0085 (NEL)
  U+2028 (LS)
  U+2029 (PS)

There are system-dependent problems and conflicts with some of these characters: NEL is sometimes used for an ellipsis character.

The proponents of a rigid binary CIF2 format for the actual files, as opposed to going back to CIF being a text file with multiple system-dependent encodings, need to consider whether they are going to restrict "valid" CIF2 to the world of unix. Or shall we perhaps allow people working with text editors on MS windows machines and Macs to produce "valid" CIF2 files directly, bend a little and, instead of mandating the external representation of a CIF2 so rigidly, allow some reasonable range of text files that map cleanly to and from the sequences of unicode code points currently specified in the proposal?

To be specific, I propose that the paragraph that now reads:

"CIF2 files are standard variable length binary files, but for historical reasons will have a maximum record length of 2048 bytes. In a general sense the contents of the file are characters that are encoded in UTF8, however there are some restrictions on the character set for token delimiters, separators and for data names."

be changed to read

"CIF2 is a specification for the interchange of text files. Text files have many possible system-dependent representations and encodings. To ensure clarity in the specification of CIF2, this document is written in terms of a sequence of unicode code points, and all fully compliant CIF2 processing systems should, at a minimum, be able to process text files as unicode code points represented in UTF-8, subject to the XML-based restrictions below. This approach is not meant to prevent people from preparing valid CIF2 files with non-UTF-8-based text editors, but, if a non-UTF-8 file format is produced, it is important to clearly specify the intended mapping to UTF-8.

This is particularly important in dealing with end-of-line indicators (see http://en.wikipedia.org/wiki/Newline). When handling CIF2 files produced under MS windows, CR-LF sequences should be accepted as an alternative to LF, and when handling CIF2 files produced under Mac OS, CR should be accepted as an alternative to LF. This document will only refer to LF as a line terminator and will assume that some appropriate system-dependent text processing system will handle the necessary conversion.

To ensure compatibility with older Fortran text processing software, lines in CIF2 files should be restricted to no more than 2048 code points in length, not including the line terminator itself. Note that the UTF-8 encoding of such a line may well be much longer."

=====================================================
 Herbert J. Bernstein, Professor of Computer Science
   Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                 +1-631-244-3035
                 yaya@dowling.edu
=====================================================
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
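
As an illustration of the proposed wording, here is a minimal Python sketch of how a reader might accept CR LF and lone CR as alternatives to LF, and check the 2048 code point line limit. The function names (normalize_eol, overlong_lines, decode_cif2) are hypothetical and not part of any CIF2 proposal or library. The sketch assumes the input has already been mapped to UTF-8, and it deliberately leaves FF, NEL, LS and PS alone, since the proposed text only names CR LF and CR.

```python
import re

MAX_LINE_CODE_POINTS = 2048  # limit taken from the proposed wording

# CR LF is listed first so that it maps to a single LF, not two.
_EOL = re.compile(r"\r\n|\r")


def normalize_eol(text: str) -> str:
    """Map CR LF and lone CR line terminators to LF."""
    return _EOL.sub("\n", text)


def decode_cif2(raw: bytes) -> str:
    """Decode UTF-8 bytes, then normalize line terminators (assumed helper)."""
    return normalize_eol(raw.decode("utf-8"))


def overlong_lines(text: str, limit: int = MAX_LINE_CODE_POINTS):
    """Return (line_number, code_points, utf8_bytes) for each line over the limit.

    The length is counted in code points, excluding the line terminator;
    the UTF-8 byte count is reported only for comparison.
    """
    report = []
    for number, line in enumerate(normalize_eol(text).split("\n"), start=1):
        if len(line) > limit:  # len() of a Python str counts code points
            report.append((number, len(line), len(line.encode("utf-8"))))
    return report
```

A line of 2048 two-byte characters passes this check even though its UTF-8 encoding is 4096 bytes long, which is the point of the last sentence of the proposed paragraph.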