Discussion List Archives


Re: [ddlm-group] options/text vs binary/end-of-line. .. .

Herbert writes in response to David:

On Wed, Jun 23, 2010 at 12:30 AM, Herbert J. Bernstein
<yaya@bernstein-plus-sons.com> wrote:
> Dear Colleagues,
>
> David wrote:
>
> "The great virtue of CIF1 is that I can read it and edit it in wordpad and
> what it writes out is a valid ASDII [sic] cif that I can read into any
> program that allows for a cif input.  How does this work when the world (and
> CIF2) supports multiple encodings?"
>
> This is a misunderstanding of both the current and future handling of
> CIF.
>
> Right now CIF _is_ handled with multiple encodings.  That is why the
> process of working with it is essentially painless for a wide range
> of computers with very, very different encodings used around the
> world.  Most systems, when properly informed that they are working
> with text and not binary do a very nice job when talking to other
> systems of matching up encodings for a text file transfer.

Herbert, please give some concrete examples of what this work is that
currently happens with CIF1 in multiple encodings, and what these
multiple encodings are that are so painlessly interconverted, so that
we can assess how valid this point is.  And why are CIF1 systems so
clever, when so many others have had to suffer multiple encoding pain?

> James' proposal would remove that painless feature and, by insisting on
> a single encoding, suddenly make users who could provide a workable
> CIF2 file doing very much what they have always done -- which is to
> edit with whatever tools they currently use -- make them do major
> disruptive upgrades they may not be ready for and sometimes cannot
> afford, all for no practical reason.

There is absolutely nothing disruptive about moving from ASCII to
UTF8, given that UTF8 is a proper superset of ASCII.  While you
maintain that not all CIF1 is ASCII, and indeed the CIF1 standard does
not (unfortunately, in my opinion) mandate ASCII, I have strong doubts
that significant numbers of CIF users rely on a non-ASCII encoding for
working with CIFs, hence my request for some examples.  Perhaps the
IUCr could provide some numbers on how many non-ASCII files they
receive?
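
The superset claim is easy to check: any ASCII byte sequence is already valid UTF-8 and decodes to the same text. A minimal Python sketch, in which the sample fragment and the helper name `is_pure_ascii` are illustrative assumptions and not part of any CIF tool:

```python
# Illustrative sample only; not taken from a real CIF.
sample = "data_example\n_cell_length_a 5.431\n"

ascii_bytes = sample.encode("ascii")
utf8_bytes = sample.encode("utf-8")

# ASCII text produces byte-for-byte identical output under UTF-8 ...
same_bytes = ascii_bytes == utf8_bytes
# ... and an ASCII file read as UTF-8 decodes to the same text.
round_trip = ascii_bytes.decode("utf-8") == sample


def is_pure_ascii(data: bytes) -> bool:
    """Return True if every byte is in the 7-bit ASCII range (hypothetical helper)."""
    return all(b < 0x80 for b in data)
```

A check like `is_pure_ascii` could also be run over an archive to count how many existing files contain any non-ASCII bytes at all.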

>
> It is only by allowing some reasonable range of multiple encoding for CIF2
> as we do for CIF1 that we can avoid such confusion and pointless expense.

My viewpoint is precisely opposite: it is only by mandating a single
encoding that we can avoid confusion and pointless expense.
Fortunately, for CIF1 ASCII was the de facto if not the de jure standard
encoding, pending examples from Herbert that show otherwise.

> Regards,
>  Herbert
> =====================================================
>  Herbert J. Bernstein, Professor of Computer Science
>   Dowling College, Kramer Science Center, KSC 121
>        Idle Hour Blvd, Oakdale, NY, 11769
>
>                 +1-631-244-3035
>                 yaya@dowling.edu
> =====================================================
>
> On Tue, 22 Jun 2010, David Brown wrote:
>
>> I find both Herbert's and James' arguments equally appealing and
>> convincing, and therefore the choice has to be based on philosophy.  If
>> many encodings are allowed, how do I know when I download a CIF whether I
>> can read it in my favorite Word Perfect editor?  Would it require a
>> different .cif extension on the filename for each different encoding
>> (e.g., .cifdoc or .cifwpd), or is there something at the head of the file
>> (a BOM?) that my editor would instantly recognize and take care of?  Do I
>> have to try reading with all my editors to find one that can accept the
>> CIF, or does the CIF come with an ASCII readme file to tell me how to read
>> it?  The great virtue of CIF1 is that I can read it and edit it in wordpad
>> and what it writes out is a valid ASDII cif that I can read into any
>> program that allows for a cif input.  How does this work when the world
>> (and CIF2) supports multiple encodings?  Because I have not received any
>> reassurance that I would be able to read any CIF2 into any of my graphics
>> etc. programs, I favour James' more cautious and conservative approach.
>> Later when the world has settled on a single encoding, we could extend
>> CIF2 to include that one as well.  Or should we just stay with ASCII until
>> the world has decided which way it wants to go?  It has served us well, if
>> not always elegantly.
>>
>> David
>>
>>
>>
>>
>> James Hester wrote:
>>
>> Herbert, it looks like neither of us is really addressing the other's
>> concerns, so I'd like to solicit contributions from other
>> participants, preferably on the thread I've created for that purpose.
>> Meanwhile, regarding your email:
>>
>> Choosing one encoding does not imply disrespect, any more than
>> accepting only English manuscripts shows disrespect, or being unable
>> to read/write publications in Japanese, Chinese, Korean or Russian
>> shows disrespect.  Rather, information is most effectively
>> disseminated when a single common language is chosen.  Likewise,
>> choosing a single encoding is the most effective way to ensure that
>> the text file contents are not lost.  CIF is an information transfer
>> and archiving protocol, so this is a worthy goal.
>>
>> On Tue, Jun 22, 2010 at 12:56 PM, Herbert J. Bernstein
>> <yaya@bernstein-plus-sons.com> wrote:
>>
>> I am most seriously advocating that we respect our Japanese, Chinese,
>> Korean and Russian colleagues and accord appropriate respect to
>> their way of doing science.  Those users and all crystallographers
>> who happen to still be using Windows 98, Windows 2000 and Windows
>> XP and older unixes on which it is easier to work with UTF-16 (and EUC-CN
>> and Shift-JIS) than with UTF-8 are the people I have in mind.
>>
>> In what sense is it 'easier' to work with UTF-16?  Does the editing
>> software not have a 'UTF8' output option, but does have a 'UTF16'
>> output option?  What will these users do when they receive a UTF8
>> encoded CIF?
>>
>> In practice, we are both proposing that CIF2 be defined in terms of
>> a stream of Unicode code points as text, but once you do that,
>> thereby giving our Russian, Chinese, Japanese and Korean colleagues
>> the same support in CIF that they have in XML, so that they are
>> highly likely to start working with CIF in their native languages,
>> I just want to encourage them to use whatever text editing tools
>> they are comfortable with, and show respect for the practices
>> they are used to.
>>
>> If we all respect everyone else's typical practice, and as a
>> consequence try to include that practice in our standard, then our
>> standards risk becoming complete spaghetti and useless. Where would we
>> be if, instead of producing Unicode, standards bodies had 'respected'
>> all the alternative encodings that used to be around?  Back in
>> encoding hell, still.  What's more, there is nothing sacred about
>> typical practice, and users themselves may not be satisfied with their
>> typical practice.
>>
>> I would suggest you watch a native Chinese speaker working with
>> English and Chinese text in a Chinese text editor for a while,
>> and you may realize that trying to force somebody into the
>> transition from EUC-CN to UTF-8 before they are ready, and,
>> more importantly before more UTF-8-aware Chinese text editors
>> are ready is not a very good idea.
>>
>> Chinese text editors that are UTF8 aware are widely available.  I
>> believe NJStar is a popular one.  Perhaps you could give a few
>> examples of popular Chinese text editors that do not allow UTF8 input
>> or output?
>>
>> It is a matter of respect.
>>
>> Respect has very little to do with it.  It is a matter of reliable
>> information transfer and archiving.
>>
>> all the best,
>> James.
>>
>> Regards,
>>  Herbert
>>
>>
>> On Tue, 22 Jun 2010, James Hester wrote:
>>
>> Dear Herbert: my question is simple, and not answered by that link:
>> what group of users or programmers *will not* be able to work with a
>> UTF8 file?  That is what matters.
>>
>> And yes, I am very much in favour of saying to people that their UTF16
>> encoded 'CIF2' file is most definitely not a CIF2 file.  That way,
>> nobody is confused when this UTF16 'CIF2' file pops up in some archive
>> somewhere.  And are you seriously advocating allowing KOI and JIS
>> encoded files to be considered legitimate CIF2 files??
>>
>> In any case, talk about JIS and Cyrillic encodings is a red herring.
>> Those wishing to insert Japanese or Russian text have had no option at
>> all in the CIF world, yet you make it sound as if they have been able
>> to use their favourite editor to produce CIFs, until we restricted
>> them to UTF8.  Quite the opposite: we are now cautiously expanding
>> their range of options, and perhaps they can now use their favourite
>> editor.
>>
>> I am proposing one clearly-defined method of communicating text in
>> Japanese or Russian (or anything else).  We have the opportunity to
>> avoid the indescribably annoying pitfalls of multiple possible
>> encodings that users of JIS and Cyrillic have had to endure (and I
>> have had plenty of occasions to feel the pain first-hand), and I
>> frankly don't understand why anybody would want to recreate that
>> situation.
>>
>> On Tue, Jun 22, 2010 at 11:38 AM, Herbert J. Bernstein
>> <yaya@bernstein-plus-sons.com> wrote:
>>
>> Now that we are in agreement about allowing users to work with text as
>> text using system-dependent editors and APIs, please review the current
>> state of support for UTF-8 versus UCS-2 and UTF-16, e.g. at
>>
>> http://en.wikipedia.org/wiki/UTF-16/UCS-2#Use_in_major_operating_systems_and_environments
>>
>> You will see that we are a few years premature in trying to be UTF-8
>> purists instead of being reasonably friendly to the Unicode 16-bit
>> encodings as well.  Indeed, we are a bit premature in insisting on
>> Unicode.  EUC-CN and SHIFT-JIS are still very heavily used, as are some
>> non-Unicode Cyrillic systems.  Things are far enough along in terms of
>> Unicode support that we can get away with specifying the file in terms of
>> Unicode code points, but the reality is that CIF users are going to use
>> multiple encodings, including non-Unicode encodings, for at least the next
>> several years.  That does not mean the IUCr journals will have to accept
>> non-UTF-8 encodings -- that can now be handled by external filters on
>> almost all systems, but it is unwise to tell people they are doing
>> something illegitimate by using their favorite text editor or application
>> to actually produce the file, when it really is a perfectly valid CIF,
>> just in a different encoding.
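
An external filter of the kind mentioned above could be sketched in Python as follows. The function name `transcode_to_utf8` and the sample strings are illustrative assumptions; the codec names (`shift_jis`, `koi8_r`, `utf-16`) are standard Python codecs. The sketch assumes the source encoding is already known, since the bytes alone do not reliably identify it.

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes in a declared encoding and re-encode them as UTF-8.

    Hypothetical helper: the caller must supply the correct source encoding.
    """
    text = data.decode(source_encoding)  # raises UnicodeDecodeError on invalid input
    return text.encode("utf-8")


# Illustrative round trip: Japanese text saved as Shift-JIS, then filtered.
original = "結晶"
sjis_bytes = original.encode("shift_jis")
utf8_bytes = transcode_to_utf8(sjis_bytes, "shift_jis")
```

The same call works for the other encodings discussed in this thread, for example `transcode_to_utf8(data, "utf-16")` or `transcode_to_utf8(data, "koi8_r")`.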
>>
>>
>> If we are to be a text-based system, then you really need to put the
>> multiple-encoding wording back into my paragraph, or we will be
>> alienating a significant fraction of CIF users for no good reason.
>>
>> If we are flexible now and encourage UTF-8 use, rather than trying to
>> enforce UTF-8 use, I expect we will avoid a current political and
>> practical problem and be well-positioned over the next decade as UTF-8
>> use becomes more widely accepted.
>>
>> Please put the multiple encoding wording back in.  We need it.
>>
>> Regards,
>>  Herbert
>>
>> On Tue, 22 Jun 2010, James Hester wrote:
>>
>> I agree with your paragraph.  I'm ready for your next step...
>>
>> On Tue, Jun 22, 2010 at 10:23 AM, Herbert J. Bernstein
>> <yaya@bernstein-plus-sons.com> wrote:
>>
>> OK, so we are at least in agreement with the concept of a text file.
>> Now let's deal with what that means to users:
>>
>> It means that they can edit a file on some reasonable range of
>> machines with a text editor, read it with the text-reading
>> libraries for some reasonable range of programming languages
>> on some reasonable range of machines, and write it with
>> text editors and the text-writing libraries of programming
>> languages on some reasonable range of machines, and they
>> have some reasonable way to print the file on a piece of paper
>> and read it, seeing the essential content of the file.
>>
>> Do we all agree to those implications of saying we are dealing
>> with a text file?
>>
>> (Yes, this is a trick question -- to find out if we have a
>> text interchange format or if we are just dealing with
>> a binary file under false colors).
>>
>> Regards,
>>  Herbert
>>
>>
>> On Tue, 22 Jun 2010, James Hester wrote:
>>
>> As Simon says, to agree to this wording requires agreeing to multiple
>> encodings.  We have not agreed to that yet.  I would however agree to
>> the following wording, which has removed any reference to encoding,
>> and inserted John's suggestion for EOL treatment.
>>
>> "CIF2 is a specification for the interchange of text files.  This
>> document is therefore written in terms of a sequence of Unicode code
>> points.  Particular care must be taken with treatment of newline in text
>> files.  This document will only refer to <0x000A> as a line terminator, as
>> CIF2 processors are required to map <0x000D>, <0x000A> and
>> <0x000D><0x000A> to this character.
>>
>> To ensure compatibility with older Fortran text processing software,
>> lines in CIF2 files should be restricted to no more than 2048
>> code points in length, not including the line terminator itself."
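
The newline mapping in the wording above can be sketched in a few lines of Python. This is only an illustration of the rule as quoted; the function name `normalize_newlines` and the sample string are assumptions, not part of any existing CIF2 processor.

```python
import re

# CRLF must be tried before a lone CR so that it collapses to a single LF.
_EOL = re.compile(r"\r\n|\r|\n")


def normalize_newlines(text: str) -> str:
    """Map <0x000D><0x000A>, <0x000D> and <0x000A> to <0x000A> (hypothetical helper)."""
    return _EOL.sub("\n", text)


# Illustrative input mixing all three conventions.
mixed = "data_a\r\n_tag 1\r_other 2\n"
normalized = normalize_newlines(mixed)
```

Python's built-in `open()` in text mode already performs this translation on read by default, so an explicit step like this is only needed when the data arrives as raw bytes or the file was opened with `newline=""`.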
>>
>> On Tue, Jun 22, 2010 at 3:44 AM, Herbert J. Bernstein
>> <yaya@bernstein-plus-sons.com> wrote:
>>
>> Dear Colleagues,
>>
>>   The IUCr is an international organization.  Is it really politically
>> wise to insist that CIF2 tags be restricted to unaccented roman letters?
>>
>>   Before we go much further, may we please have a vote on explicitly
>> changing CIF2 from the current draft wording that it is a binary
>> format to the wording I suggested making it a text format.  Most of the
>> rest of the issues we are dealing with hinge on that basic decision.
>>
>>   The wording I proposed was:
>>
>> "CIF2 is a specification for the interchange of text files.  Text files
>> have many possible system-dependent representations and encodings.  To
>> ensure clarity in the specification of CIF2, this document is written
>> in terms of a sequence of Unicode code points, and all fully compliant
>> CIF2 processing systems should, at a minimum, be able to process
>> text files as Unicode code points represented in UTF-8, subject to the
>> XML-based restrictions below.  This approach is not meant to prevent
>> people from preparing valid CIF2 files with non-UTF-8-based text
>> editors, but, if a non-UTF-8 file format is produced, it is important
>> to clearly specify the intended mapping to UTF-8.  This is particularly
>> important in dealing with end-of-line indicators (see
>> http://en.wikipedia.org/wiki/Newline).  When handling CIF2 files
>> produced under MS Windows, CR-LF sequences should be accepted as
>> an alternative to LF, and when handling CIF2 files produced under
>> Mac OS, CR should be accepted as an alternative to LF.  This document
>> will only refer to LF as a line terminator and will assume that some
>> appropriate system-dependent text processing system will handle
>> the necessary conversion.
>>
>> To ensure compatibility with older Fortran text processing software,
>> lines in CIF2 files should be restricted to no more than 2048
>> code points in length, not including the line terminator itself.
>> Note that the UTF-8 encoding of such a line may well be much longer."
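
The difference between the code-point limit and the encoded length can be illustrated with a short Python sketch. The 2048 figure comes from the wording above; the helper names `line_lengths` and `within_limit` and the sample lines are assumptions made for illustration.

```python
from typing import Tuple

MAX_CODE_POINTS = 2048  # limit taken from the proposed wording


def line_lengths(line: str) -> Tuple[int, int]:
    """Return (code points, UTF-8 bytes) for a line without its terminator."""
    return len(line), len(line.encode("utf-8"))


def within_limit(line: str) -> bool:
    """Check the line against the limit, counted in code points rather than bytes."""
    return len(line) <= MAX_CODE_POINTS


ascii_line = "a" * 2048   # one byte per code point in UTF-8
cjk_line = "結" * 2048    # three bytes per code point in UTF-8
```

Both sample lines satisfy the limit, but the second one occupies three times as many bytes once encoded, which matters for software that sizes its line buffers in bytes.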
>>
>> If anybody objects to some specific wording in this text, let us
>> settle on revised wording.  We need to get this basic issue
>> clarified in writing or we will be going in circles forever.
>>
>>
>>   Regards,
>>     Herbert
>>
>>
>>
>> At 11:30 AM -0500 6/21/10, Bollinger, John C wrote:
>>
>> On Monday, June 21, 2010 1:13 AM, James Hester wrote:
>>
>> I prefer the XML treatment of newline (ie translated to 0x000A for
>> processing purposes).  I would be in favour of restricting newline to
>> <0x000A>, <0x000D> or <0x000D 0x000A>, which means that only these
>> combinations have the syntactic significance of a newline.
>>
>> I would be satisfied with that approach.
>>
>> From memory, this significance is restricted to:
>>
>> 1. end of comment
>> 2. whitespace
>> 3. use in <eol><semicolon> digraph
>>
>> The significance also extends to 'single'- and "double"-quote
>> delimited data values, in that these cannot contain end-of-line.
>>
>> I would also restrict the appearance of the remaining Unicode newline
>> characters to delimited datavalues, to maintain consistent display of
>> data files.
>>
>> I'm seeing more and more upside to restricting *all* non-ASCII
>> characters to delimited data values.  I don't have any objection to
>> restricting U+0085, U+2028, and U+2029 (did I miss any?) to such
>> contexts.
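
A scan for the three characters named above might look like the following Python sketch. The set contains only the characters listed in this message, and the function name `find_other_newlines` is an assumption. The sketch reports positions only; deciding whether a hit lies inside a delimited data value would need a real CIF2 parser.

```python
# The three characters named in the message:
# NEXT LINE, LINE SEPARATOR and PARAGRAPH SEPARATOR.
OTHER_NEWLINES = {"\u0085", "\u2028", "\u2029"}


def find_other_newlines(text: str) -> list:
    """Return (index, code point label) pairs for each listed character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in OTHER_NEWLINES
    ]


# Illustrative input containing two of the three characters.
hits = find_other_newlines("abc\u2028def\u0085")
```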
>>
>>
>> John
>> --
>> John C. Bollinger, Ph.D.
>> Department of Structural Biology
>> St. Jude Children's Research Hospital
>>
>>
>>
>>
>> Email Disclaimer:  www.stjude.org/emaildisclaimer
>>
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>
>> --
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>



-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148


International Union of Crystallography
