Discussion List Archives


Re: [ddlm-group] options/text vs binary/end-of-line. .. .

Now that we are in agreement about allowing users to work with text as
text using system-dependent editors and APIs, please review the current
state of support for UTF-8 versus UCS-2 and UTF-16, e.g. at

http://en.wikipedia.org/wiki/UTF-16/UCS-2#Use_in_major_operating_systems_and_environments

You will see that we are a few years premature in trying to be UTF-8
purists instead of being reasonably friendly to the Unicode 16-bit
encodings as well.  Indeed, we are a bit premature in insisting on
Unicode.  EUC-CN and Shift-JIS are still very heavily used, as are some
non-Unicode Cyrillic systems.  Things are far enough along in terms of
Unicode support that we can get away with specifying the file in terms of
Unicode code points, but the reality is that CIF users are going to use
multiple encodings, including non-Unicode encodings, for at least the next
several years.  That does not mean the IUCr journals will have to accept
non-UTF-8 encodings (conversion can now be handled by external filters on
almost all systems), but it is unwise to tell people they are doing
something illegitimate by using their favorite text editor or application
to actually produce the file, when it really is a perfectly valid CIF,
just in a different encoding.
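
As an illustration of the kind of external filter meant here, the
following is a minimal sketch in Python, not part of any CIF software.
The function name to_utf8 and the sample strings are hypothetical, and
the sketch assumes the source encoding is known in advance (it does not
try to detect it).  Python's "gb2312" codec is used for EUC-CN.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding as UTF-8.

    The code points are unchanged; only their byte representation differs.
    Raises UnicodeDecodeError if raw is not valid in source_encoding.
    """
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


# Hypothetical sample content, one string per encoding mentioned above.
SAMPLES = {
    "utf-16": "data_結晶",        # 16-bit Unicode encoding (with BOM)
    "shift_jis": "data_結晶",     # Shift-JIS
    "gb2312": "data_晶体",        # EUC-CN
    "koi8_r": "data_кристалл",    # a non-Unicode Cyrillic encoding
}

if __name__ == "__main__":
    for encoding, text in SAMPLES.items():
        legacy_bytes = text.encode(encoding)
        utf8_bytes = to_utf8(legacy_bytes, encoding)
        print(encoding, len(legacy_bytes), "bytes ->", len(utf8_bytes), "bytes")
```

The point of the sketch is only that the same sequence of code points
survives the conversion, whatever encoding the file was written in.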


If we are to be a text-based system, then you really need to put the
multiple-encoding wording back into my paragraph, or we will be alienating
a significant fraction of CIF users for no good reason.

If we are flexible now and encourage UTF-8 use, rather than trying to
enforce UTF-8 use, I expect we will avoid a current political and
practical problem and be well-positioned over the next decade as UTF-8 use
becomes more widely accepted.

Please put the multiple encoding wording back in.  We need it.

Regards,
   Herbert
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Tue, 22 Jun 2010, James Hester wrote:

> I agree with your paragraph.  I'm ready for your next step...
>
> On Tue, Jun 22, 2010 at 10:23 AM, Herbert J. Bernstein
> <yaya@bernstein-plus-sons.com> wrote:
>> OK, so we are at least in agreement with the concept of a text file.
>> Now let's deal with what that means to users:
>>
>> It means that they can edit a file on some reasonable range of
>> machines with a text editor, read it with the text-reading
>> libraries for some reasonable range of programming languages
>> on some reasonable range of machines, and write it with
>> text editors and the text-writing libraries of programming
>> languages on some reasonable range of machines, and they
>> have some reasonable way to print the file on a piece of paper
>> and read it, seeing the essential content of the file.
>>
>> Do we all agree to those implications of saying we are dealing
>> with a text file?
>>
>> (Yes, this is a trick question -- to find out if we have a
>> text interchange format or if we are just dealing with
>> a binary file under false colors).
>>
>> Regards,
>>  Herbert
>>
>> =====================================================
>>  Herbert J. Bernstein, Professor of Computer Science
>>   Dowling College, Kramer Science Center, KSC 121
>>        Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                 +1-631-244-3035
>>                 yaya@dowling.edu
>> =====================================================
>>
>> On Tue, 22 Jun 2010, James Hester wrote:
>>
>>> As Simon says, to agree to this wording requires agreeing to multiple
>>> encodings.  We have not agreed to that yet.  I would however agree to
>>> the following wording, which has removed any reference to encoding,
>>> and inserted John's suggestion for EOL treatment.
>>>
>>> "CIF2 is a specification for the interchange of text files.  This
>>> document is therefore written in terms of a sequence of Unicode code
>>> points.  Particular care must be taken with treatment of newline in
>>> text files.  This document will
>>> only refer to <0x000A> as a line terminator, as CIF2 processors are
>>> required to map <0x000D>, <0x000A> and <0x000D><0x000A> to this
>>> character.
>>>
>>> To ensure compatibility with older Fortran text processing software,
>>> lines in CIF2 files should be restricted to no more than 2048
>>> code points in length, not including the line terminator itself."
>>>
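
The newline mapping and line-length limit in the wording quoted above
can be sketched as follows.  This is an illustrative Python sketch
under stated assumptions, not the behaviour of any existing CIF2
processor: the names normalize_newlines, overlong_lines and
MAX_LINE_CODE_POINTS are hypothetical, and the input is assumed to be
already decoded into a string of code points.

```python
import re

MAX_LINE_CODE_POINTS = 2048  # limit from the proposed wording

# CRLF is listed first so that it maps to a single line terminator.
_NEWLINE = re.compile("\r\n|\r|\n")


def normalize_newlines(text: str) -> str:
    """Map <0x000D><0x000A>, <0x000D> and <0x000A> to <0x000A>."""
    return _NEWLINE.sub("\n", text)


def overlong_lines(text: str, limit: int = MAX_LINE_CODE_POINTS) -> list[int]:
    """Return 1-based numbers of lines longer than limit code points.

    The line terminator itself is not counted.  len() on a Python str
    counts code points, not encoded bytes.
    """
    lines = normalize_newlines(text).split("\n")
    return [n for n, line in enumerate(lines, start=1) if len(line) > limit]
```

For example, normalize_newlines("a\r\nb\rc\nd") gives "a\nb\nc\nd".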
>>> On Tue, Jun 22, 2010 at 3:44 AM, Herbert J. Bernstein
>>> <yaya@bernstein-plus-sons.com> wrote:
>>>>
>>>> Dear Colleagues,
>>>>
>>>>   The IUCr is an international organization.  Is it really politically
>>>> wise to insist that CIF2 tags be restricted to unaccented roman letters?
>>>>
>>>>   Before we go much further, may we please have a vote on explicitly
>>>> changing CIF2 from the current draft wording that it is a binary
>>>> format to the wording I suggested making it a text format.  Most of the
>>>> rest of the issues we are dealing with hinge on that basic decision.
>>>>
>>>>   The wording I proposed was:
>>>>
>>>> "CIF2 is a specification for the interchange of text files.  Text files
>>>> have many possible system dependent representations and encodings.  To
>>>> ensure clarity in the specification of CIF2, this document is written
>>>> in terms of a sequence of Unicode code points, and all fully compliant
>>>> CIF2 processing systems should, at a minimum, be able to process
>>>> text files as Unicode code points represented in UTF-8, subject to the
>>>> XML-based restrictions below.  This approach is not meant to prevent
>>>> people from preparing valid CIF2 files with non-UTF-8-based text
>>>> editors, but, if a non-UTF-8 file format is produced, it is important
>>>> to clearly specify the intended mapping to UTF-8.  This is particularly
>>>> important in dealing with end-of-line indicators (see
>>>> http://en.wikipedia.org/wiki/Newline).  When handling CIF2 files
>>>> produced under MS Windows, CR-LF sequences should be accepted as
>>>> an alternative to LF, and when handling CIF2 files produced under
>>>> Mac OS, CR should be accepted as an alternative to LF.  This document
>>>> will only refer to LF as a line terminator and will assume that some
>>>> appropriate system-dependent text processing system will handle
>>>> the necessary conversion.
>>>>
>>>> To ensure compatibility with older Fortran text processing software,
>>>> lines in CIF2 files should be restricted to no more than 2048
>>>> code points in length, not including the line terminator itself.
>>>> Note that the UTF-8 encoding of such a line may well be much longer."
>>>>
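
The remark that the UTF-8 encoding of a 2048-code-point line may be
much longer can be checked with a short Python sketch.  The function
name line_lengths is hypothetical.  In UTF-8 a code point takes from
one to four bytes, so a line at the limit can occupy up to 8192 bytes.

```python
def line_lengths(line: str) -> tuple[int, int]:
    """Return (length in code points, length in UTF-8 bytes) of a line."""
    return len(line), len(line.encode("utf-8"))


if __name__ == "__main__":
    ascii_line = "a" * 2048        # 1 byte per code point
    accented = "\u00c5" * 2048     # U+00C5 (A with ring), 2 bytes each
    print(line_lengths(ascii_line))
    print(line_lengths(accented))
```

A buffer sized in bytes for a 2048-character ASCII line is therefore
not large enough for every valid line measured in code points.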
>>>> If anybody objects to some specific wording in this text, let us
>>>> settle on revised wording.  We need to get this basic issue
>>>> clarified in writing or we will be going in circles forever.
>>>>
>>>>
>>>>   Regards,
>>>>     Herbert
>>>>
>>>>
>>>>
>>>> At 11:30 AM -0500 6/21/10, Bollinger, John C wrote:
>>>>>
>>>>> On Monday, June 21, 2010 1:13 AM, James Hester wrote:
>>>>>
>>>>>> I prefer the XML treatment of newline (ie translated to 0x000A for
>>>>>> processing purposes).  I would be in favour of restricting newline to
>>>>>> <0x000A>, <0x000D> or <0x000D 0x000A>, which means that only these
>>>>>> combinations have the syntactic significance of a newline.
>>>>>
>>>>> I would be satisfied with that approach.
>>>>>
>>>>>>  From
>>>>>> memory, this significance is restricted to:
>>>>>>
>>>>>> 1. end of comment
>>>>>> 2. whitespace
>>>>>> 3. use in <eol><semicolon> digraph
>>>>>
>>>>> The significance also extends to 'single'- and "double"-quote
>>>>> delimited data values, in that these cannot contain end-of-line.
>>>>>
>>>>>> I would also restrict the appearance of the remaining Unicode newline
>>>>>> characters to delimited datavalues, to maintain consistent display of
>>>>>> data files.
>>>>>
>>>>> I'm seeing more and more upside to restricting *all* non-ASCII
>>>>> characters to delimited data values.  I don't have any objection to
>>>>> restricting U+0085, U+2028, and U+2029 (did I miss any?) to such
>>>>> contexts.
>>>>>
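
A check for the three additional Unicode newline characters named
above might look like the following Python sketch.  The name
find_other_newlines is hypothetical, and the sketch only reports where
the characters occur; deciding whether an occurrence lies inside a
delimited data value would need a real CIF2 parser.

```python
# The characters named in the message above, keyed by code point.
OTHER_NEWLINES = {
    "\u0085": "NEXT LINE",
    "\u2028": "LINE SEPARATOR",
    "\u2029": "PARAGRAPH SEPARATOR",
}


def find_other_newlines(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for each U+0085, U+2028 or U+2029 found."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in OTHER_NEWLINES
    ]
```

Note that Python's own str.splitlines() treats all three of these
characters as line boundaries, which shows how general-purpose text
tools can disagree with a syntax that recognises only CR and LF.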
>>>>>
>>>>> John
>>>>> --
>>>>> John C. Bollinger, Ph.D.
>>>>> Department of Structural Biology
>>>>> St. Jude Children's Research Hospital
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Email Disclaimer:  www.stjude.org/emaildisclaimer
>>>>>
>>>>> _______________________________________________
>>>>> ddlm-group mailing list
>>>>> ddlm-group@iucr.org
>>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>
>>>>
>>>> --
>>>> =====================================================
>>>>  Herbert J. Bernstein, Professor of Computer Science
>>>>    Dowling College, Kramer Science Center, KSC 121
>>>>         Idle Hour Blvd, Oakdale, NY, 11769
>>>>
>>>>                  +1-631-244-3035
>>>>                  yaya@dowling.edu
>>>> =====================================================
>>>>
>>>
>>>
>>>
>>> --
>>> T +61 (02) 9717 9907
>>> F +61 (02) 9717 3145
>>> M +61 (04) 0249 4148
>>
>>
>>
>
>
>
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
>
