

Re: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line

  • To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@xxxxxxxx>
  • Subject: Re: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line
  • From: James Hester <jamesrhester@xxxxxxxxx>
  • Date: Thu, 26 Aug 2010 18:22:09 +1000
  • In-Reply-To: <639601.73559.qm@web87008.mail.ird.yahoo.com>
Hi Simon and others,

What Simon describes accords closely with my perception of the
situation, except that your final point regarding CIF2 requiring users
to abandon text editors will depend on how we resolve the encoding
issue.  For me the logical conclusion from the points you make is to
stick to UTF8-only encoding, which will keep the large majority of
users and developers happy.  Unfortunately others have the perception
that UTF8-only will be overly restrictive, and lacking hard data we
are having trouble deciding which of these two perceptions is
correct.  Clearly UTF8-only is not overly restrictive *now* because it
is *less* restrictive than the (de-facto) CIF1 situation of ASCII-only,
which has served us well.  UTF8 may prove restrictive in the future if
users of non Latin-1 code points find that they don't know how to use,
or cannot use, their favourite text editors to put those code points
into a CIF, but I'm not sure even the users themselves could say now
how likely that is going to be.
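
To make the ASCII/UTF8 relationship concrete, here is a small Python
sketch. The sample text is a made-up fragment, not taken from any real
CIF, and the helper name is mine. It shows that ASCII-only text has the
same bytes in UTF8, while the same text in UTF16BE (no BOM) is not what
a byte-oriented editor expects.

```python
# Illustrative only: the sample text is a made-up fragment, not a real CIF.
sample = "data_example\n_cell_length_a 5.43\n"

ascii_bytes = sample.encode("ascii")
utf8_bytes = sample.encode("utf-8")
utf16be_bytes = sample.encode("utf-16-be")  # this codec writes no BOM

# ASCII-only text has the identical byte sequence in UTF-8.
same_bytes = ascii_bytes == utf8_bytes

# UTF-16BE puts a NUL byte in front of every ASCII character, so an
# ASCII-oriented editor sees twice as many bytes, half of them NUL.
nul_count = utf16be_bytes.count(b"\x00")


def decodes_as_utf8_text(raw: bytes) -> bool:
    """Return True if raw is valid UTF-8 and contains no NUL characters."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return "\x00" not in text


print(same_bytes, nul_count == len(sample))
print(decodes_as_utf8_text(utf8_bytes), decodes_as_utf8_text(utf16be_bytes))
```

Note that the UTF16BE bytes of ASCII-only text are still technically
valid UTF8 (NUL is a legal code point), which is why the check above
also looks for NULs rather than relying on a decoding error.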

What I would suggest as a cautious compromise is to leave the door
open for adding non UTF8 encodings in the future, without describing
any scheme for doing this at present.  One way to leave the door open
like this would be to declare that the first line of a CIF2 file is
'special', and is reserved for future expansion.  Our discussions on
Scheme B are sufficiently far advanced to indicate that conventions
relating to encoding schemes could be managed in the first line. The
question of how strictly something like Scheme B should be applied
remains open, and could be addressed once more in-field experience has
been gained.
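
As a rough sketch of how a reader might treat a reserved first line,
consider the Python below. Everything in it is an assumption made for
illustration: the identifier string `#\#CIF_2.0`, the function names,
and the rule that an empty remainder means UTF8. No format for the
reserved part is proposed here; the point is only that the first line
can be read as ASCII before the encoding of the rest is known, and that
unknown content is rejected rather than guessed at.

```python
MAGIC = "#\\#CIF_2.0"  # assumed version identifier, for illustration only


def read_reserved_first_line(raw: bytes) -> str:
    """Return whatever follows the identifier on the first line.

    The first line is decoded as ASCII so that it can be interpreted
    before the encoding of the rest of the file is known.
    """
    first_line = raw.split(b"\n", 1)[0].rstrip(b"\r")
    text = first_line.decode("ascii")  # raises UnicodeDecodeError if not ASCII
    if not text.startswith(MAGIC):
        raise ValueError("first line does not start with the expected identifier")
    return text[len(MAGIC):].strip()


def choose_encoding(reserved: str) -> str:
    """Pick an encoding from the reserved part of the first line."""
    # Nothing is defined for the reserved part yet, so UTF-8 is the only option.
    if reserved == "":
        return "utf-8"
    # A future scheme would be interpreted here; until then, refuse to guess.
    raise ValueError(f"unrecognised first-line content: {reserved!r}")


raw_file = b"#\\#CIF_2.0\ndata_example\n"
print(choose_encoding(read_reserved_first_line(raw_file)))
```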

On Thu, Aug 26, 2010 at 9:08 AM, SIMON WESTRIP
<simonwestrip@btinternet.com> wrote:
> Dear all
> Recent contributions have stimulated me to revisit some of the fundamental
> issues of the possible changes in CIF2 with respect to CIF1,
> in particular, the impact on current practice (as I perceive it, based on my
> experience). The following is a summary of my thoughts, trying to
> look at this from two perspectives (forgive me if I repeat earlier
> opinions):
> 1) User perspective
> To date, in the 'core' CIF world (i.e. single-crystal and its extensions),
> users treat CIFs as text files, and expect to be able to read them as such
> using
> plain-text editors, and indeed edit them if necessary for e.g. publication
> purposes. Furthermore, they expect them to be readable by applications that
> claim that
> ability (e.g. graphics software).
> The situation is slightly different with mmCIF (and the pdb variants), where
> users tend to treat these CIFs as data sources that can be read by
> applications without
> any need to examine the raw CIF themselves, let alone edit them.
> Although the above statements only encompass two user groups and are based
> on my personal experience, I believe these groups are the largest when
> talking about CIF users?
> So what is the impact on such users of introducing the use of non-ASCII text
> and thus raising the text encoding issue?
> In the latter case, probably minimal, inasmuch as the users don't interact
> directly with the raw CIF and rely on CIF processing software to manage the
> data.
> In the former case, it is quite possible that a user will no longer be able
> to edit the raw CIF using the same plain-text editor they have always used
> for such purposes.
> For example, if a user receives a CIF that has been encoded in UTF16 by some
> remote CIF processing system, and opens it in a non-UTF16-aware plain-text
> editor,
> they will not be presented with what they would expect, even if the
> character set in that particular CIF doesn't extend beyond ASCII;
> furthermore, even 'advanced' text editors would struggle if the encoding
> were e.g. UTF16BE (i.e. has no BOM). Granted, this example is equally
> applicable to CIF1, but by 'opening up' multiple encodings, the probability
> of their usage increases?
> So as soon as we move beyond ASCII, we have to accept that a large group of
> CIF users will, at the very least, have to be aware that CIF is no longer
> the 'text' format
> that they once understood it to be?
> 2) Developer perspective
> I believe that developers presented with a documented standard will follow
> that standard and prefer to work with no uncertainties, especially if they
> are
> unfamiliar with the format (perhaps just need to be able to read a CIF to
> extract data relevant to their application/database...?)
> Taking the example of XML, in my experience developers seem to follow the
> standard quite strictly. Most everyday applications that process XML are
> intolerant of
> violations of the standard. Fortunately, it is largely only developers that
> work with raw XML, so the standard works well.
> In contrast to XML, with HTML/javascript the approach to the 'standard' is
> far more tolerant. Though these languages are standardized, in order to
> compete, the leading application
> developers have had to adopt flexibility (e.g. browsers accept 'dirty' HTML,
> are remarkably forgiving of syntax violations in javascript, and alter the
> standard to
> achieve their own ends or facilitate user requirements). I suspect this
> results largely from the evolution of the languages: just as in the early
> days of CIF, encouragement of
> use and the end results were more important than adherence to the documented
> standard?
> Note that these same applications that are so tolerant of HTML/javascript
> violations are far less forgiving of malformed XML. So is the lesson here
> that developers expect
> new standards to be unambiguous and will code accordingly (especially if the
> new standard was partly designed to address the shortcomings of its
> ancestors)?
> Again, forgive me if this all sounds familiar - however, before arguing one
> way or the other with regard to specifics, perhaps the wider group would
> like to confirm or otherwise the main points I'm trying to assert, in
> particular, with respect to *user* practice:
> 1) CIF2 will require users to change the way they view CIF - i.e. they may
> be forced to use CIF2-compliant text editors/application software, and
> abandon their current practice.
> With respect to developers, recent coverage has been very insightful, but
> just out of interest, would I be wrong in stating that:
> 2) Developers, especially those that don't specialize in CIF, are likely to
> want a clear-cut universal standard that does not require any heuristic
> interpretation.
> Cheers
> Simon