Discussion List Archives


Re: [ddlm-group] options/text vs binary/end-of-line. .

I prefer the XML treatment of newline (i.e. translated to 0x000A for
processing purposes).  I would be in favour of restricting newline to
<0x000A>, <0x000D> or <0x000D 0x000A>, so that only these sequences
have the syntactic significance of a newline.  From memory, this
significance is restricted to:

1. end of comment
2. whitespace
3. use in <eol><semicolon> digraph

I would also restrict the appearance of the remaining Unicode newline
characters to delimited datavalues, to maintain consistent display of
data files.
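
As a rough illustration of the XML-style treatment (a minimal sketch
only, not taken from the draft spec or from any existing CIF software),
a reader could normalize the three proposed sequences in a single
pre-processing pass.  The function names are hypothetical, and the set
of "remaining" Unicode newline characters listed here (U+0085, U+2028,
U+2029) is an assumption; the exact set would be for the group to decide.

```python
import re

# The three sequences proposed above as the only syntactic newlines.
# <CR LF> is listed first so that it is consumed as a single unit.
_NEWLINE = re.compile(r"\r\n|\r|\n")

# Assumed set of other Unicode line-breaking characters: NEL, LS, PS.
OTHER_UNICODE_NEWLINES = "\u0085\u2028\u2029"


def normalize_newlines(text: str) -> str:
    """Translate <CR LF>, <CR> and <LF> to a single U+000A."""
    return _NEWLINE.sub("\n", text)


def find_other_newlines(text: str) -> list:
    """Return (offset, code point) pairs for the other newline characters.

    A parser could use this to reject such characters when they occur
    outside delimited datavalues; that check is not implemented here.
    """
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in OTHER_UNICODE_NEWLINES
    ]
```

When reading from a file in Python, opening it with
open(path, encoding="utf-8", newline="") keeps the original line
terminators intact, so that the normalization step above is the only
place where translation happens.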

On Sat, Jun 19, 2010 at 3:21 AM, Bollinger, John C
<John.Bollinger@stjude.org> wrote:
> On Friday, June 18, 2010 8:10 AM, Herbert J. Bernstein wrote:
>>Now to deal with the real issues -- should CIF2 allow multiple
>>optional representations? is CIF2 a binary file or a text file? and
>>how do we treat end-of-line?
>
> There seem to be several voices in support of CIF being a text standard as opposed to a binary one.  That is my preference, though I would couple it with a requirement on fully-compliant CIF processors to support UTF-8 as the default for both reading and writing, and an explicit disclaimer of any requirement to support other encodings.
>
>>The code point for the end of line in a "normal" unix-style UTF-8 file is
>>U+000A (LF or NL), but all of the following are also used as line
>>terminators (see http://en.wikipedia.org/wiki/Newline):
>
> [elided]
>
>>The proponents of a rigid binary CIF2 format for the actual files,
>>as opposed to going back to CIF being a text file with multiple
>>system-dependent encodings need to consider whether they are going
>>to restrict "valid" CIF2 to the world of unix, or shall we perhaps
>>allow people working with text editors on MS windows machines and
>>Macs to produce "valid" CIF2 files directly, bend a little and,
>>instead of mandating the external representation of a CIF2 so
>>rigidly, allow some reasonable range to text files that map
>>cleanly to and from the sequences of unicode code points currently
>>specified in the proposal?
>
> The terminology section of the draft spec defines "newline" and "\n" to mean whatever the local end-of-line convention is.  In response to an earlier query of mine, the group, with limited comment, acknowledged that this would include cases where the convention is to end lines with U+2028 or U+2029, which the spec takes care to note do not otherwise need to be supported in that role.  In this sense, the current CIF2 draft incorporates a requirement for environment-dependent variant encodings!
>
> Windows and Mac OS <= 9 environments are thus afforded equal status with Unix environments (including OS X).   Nevertheless, I much prefer XML's approach to this issue: a defined set of EOL sequences is supported, all of them normalized to U+000A upon read, as if by an initial pre-processing pass.  This eases the life of XML consumers, as they can rely on conformant XML processors to read XML prepared in any environment, and always to represent EOL to them in the same way.  For their part, processors are free to use any of the supported EOL conventions when writing XML, and even to mix them.
>
> CIF 1.1 allowed CIF readers to support multiple EOL conventions, so there should be no objection to allowing CIF 2.0 processors to do the same.  I would be completely satisfied, however, to follow XML's lead by *requiring* CIF processors to provide such support.  The specific sequences that should be supported are negotiable, but should include at least <U+000A>, <U+000D>, and <U+000D U+000A>.  I had thought that one of the obstacles here was the perception that this would not be supportable in Fortran (though I have Fortran code that does it, I think in a standard-conformant way).  I am pleased if that is not a concern.
>
>>To be specific, I propose that
>
> [elided]
>
>>["]To ensure compatibility with older Fortran text processing software,
>>lines in CIF2 files should be restricted to no more than 2048
>>code points in length, not including the line terminator itself.
>>Note that the UTF-8 encoding of such a line may well be much longer."
>
> I agree that in a text context, it makes more sense to restrict line length by number of code points / Unicode characters than by number of bytes.  Restricting by bytes still mostly works, though, because the number of encoded code points cannot exceed the number of bytes under any of the standard UTFs (or any other standard encoding that comes to mind).  I am perfectly content to recast the length restriction in terms of Unicode characters.
>
>
> John
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
>
>
> Email Disclaimer:  www.stjude.org/emaildisclaimer
>
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>



-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148

