Re: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line

"to present the various ideas to the community in the form of
a completed standard with supporting software and see if they accept
it"

I tend to agree - the stumbling block is the "completed standard"
(at least w.r.t. encoding?)

:-)



From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@iucr.org>
Sent: Thursday, 26 August, 2010 0:57:44
Subject: Re: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line

While I disagree with these estimates of how various communities will
react, the best way to find out is not for us to debate among ourselves,
but to present the various ideas to the community in the form of
a completed standard with supporting software and see if they accept
it.  In the case of core CIF, that community has accepted what they
were offered.  In the case of mmCIF, that community has essentially
rejected what they were offered.  So, after all these years of
effort on CIF2, isn't it past time to finish something, put it out
there and see if it flies?

As for my own views:
I remind you that XML is the end result of the essentially failed
SGML effort followed by the highly successful HTML effort.  XML saved
the SGML effort by adopting a large part of the simplicity and
flexibility of HTML.  Please bear that in mind.

=====================================================
Herbert J. Bernstein, Professor of Computer Science
  Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                +1-631-244-3035
                yaya@dowling.edu
=====================================================

On Wed, 25 Aug 2010, SIMON WESTRIP wrote:

> Dear all
>
> Recent contributions have stimulated me to revisit some of the fundamental
> issues of the possible changes in CIF2 with respect to CIF1,
> in particular, the impact on current practice (as I perceive it, based on my
> experience). The following is a summary of my thoughts, trying to
> look at this from two perspectives (forgive me if I repeat earlier
> opinions):
>
> 1) User perspective
>
> To date, in the 'core' CIF world (i.e. single-crystal and its extensions),
> users treat CIFs as text files, and expect to be able to read them as such
> using
> plain-text editors, and indeed edit them if necessary for e.g. publication
> purposes. Furthermore, they expect them to be readable by applications that
> claim that
> ability (e.g. graphics software).
>
> The situation is slightly different with mmCIF (and the pdb variants), where
> users tend to treat these CIFs as data sources that can be read by
> applications without
> any need to examine the raw CIF themselves, let alone edit them.
>
> Although the above statements only encompass two user groups and are based
> on my personal experience, I believe these groups are the largest when
> talking about CIF users?
>
> So what is the impact on such users of introducing the use of non-ASCII text
> and thus raising the text encoding issue?
>
> In the latter case, probably minimal, inasmuch as the users don't interact
> directly with the raw CIF and rely on CIF processing software to manage the
> data.
>
> In the former case, it is quite possible that a user will no longer be able
> to edit the raw CIF using the same plain-text editor they have always used
> for such purposes.
> For example, if a user receives a CIF that has been encoded in UTF16 by some
> remote CIF processing system, and opens it in a non-UTF16-aware plain-text
> editor,
> they will not be presented with what they would expect, even if the
> character set in that particular CIF doesn't extend beyond ASCII;
> furthermore, even 'advanced' text editors would struggle if the encoding
> were e.g. UTF16BE (i.e. has no BOM). Granted, this example is equally
> applicable to CIF1, but by 'opening up' multiple encodings, the probability
> of their usage increases?
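>
> As a rough illustration of the UTF16BE case, the following Python
> sketch (the CIF content is invented) shows what an ASCII-oriented tool
> is actually faced with:
>
>     raw = "data_example\n_cell_length_a 5.4321\n".encode("utf-16-be")  # no BOM
>     print(raw[:16])
>     # b'\x00d\x00a\x00t\x00a\x00_\x00e\x00x\x00a'
>     # Every other byte is NUL, so an ASCII-only editor displays apparent
>     # gibberish, and with no BOM nothing in the file itself announces
>     # that it is UTF16BE rather than some other encoding.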
>
> So as soon as we move beyond ASCII, we have to accept that a large group of
> CIF users will, at the very least, have to be aware that CIF is no longer
> the 'text' format
> that they once understood it to be?
>
> 2) Developer perspective
>
> I believe that developers presented with a documented standard will follow
> that standard and prefer to work with no uncertainties, especially if they
> are
> unfamiliar with the format (perhaps just need to be able to read a CIF to
> extract data relevant to their application/database...?)
>
> Taking the example of XML, in my experience developers seem to follow the
> standard quite strictly. Most everyday applications that process XML are
> intolerant of
> violations of the standard. Fortunately, it is largely only developers that
> work with raw XML, so the standard works well.
>
> In contrast to XML, with HTML/javascript the approach to the 'standard' is
> far more tolerant. Though these languages are standardized, in order to
> compete, the leading application
> developers have had to adopt flexibility (e.g. browsers accept 'dirty' HTML,
> are remarkably forgiving of syntax violations in javascript, and alter the
> standard to
> achieve their own ends or facilitate user requirements). I suspect this
> results largely from the evolution of the languages: just as in the early
> days of CIF, encouragement of
> use and the end results were more important than adherence to the documented
> standard?
>
> Note that these same applications that are so tolerant of HTML/javascript
> violations are far less forgiving of malformed XML. So is the lesson here
> that developers expect
> new standards to be unambiguous and will code accordingly (especially if the
> new standard was partly designed to address the shortcomings of its
> ancestors)?
>
>
> Again, forgive me if this all sounds familiar - however, before arguing one
> way or the other with regard to specifics, perhaps the wider group would
> like to confirm or otherwise the main points I'm trying to assert, in
> particular, with respect to *user* practice:
>
> 1) CIF2 will require users to change the way they view CIF - i.e. they may
> be forced to use CIF2-compliant text editors/application software, and
> abandon their current practice.
>
> With respect to developers, recent coverage has been very insightful, but
> just out of interest, would I be wrong in stating that:
>
> 2) Developers, especially those that don't specialize in CIF, are likely to
> want a clear-cut universal standard that does not require any heuristic
> interpretation.
>
> Cheers
>
> Simon
>
>
>
> ____________________________________________________________________________
> From: James Hester <jamesrhester@gmail.com>
> To: Group for discussing encoding and content validation schemes for CIF2
> <cif2-encoding@iucr.org>
> Sent: Tuesday, 24 August, 2010 4:38:27
> Subject: Re: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line
>
> Thanks John for a detailed response.
>
> At the top of this email I will address this whole issue of optional
> behaviour.  I was clearly too telegraphic in previous posts, as
> Herbert thinks that optional whitespace counts as an optional feature,
> so I go into some detail below.
>
> By "optional features" I mean those aspects of the standard that are
> not mandatory for both readers and writers, and in addition I am not
> concerned with features that do not relate directly to the information
> transferred, e.g. optional warnings.  For example, unless "optional
> whitespace" means that the *reader* may throw a syntax error when
> whitespace is encountered at some particular point where whitespace is
> optional, I do not view optional whitespace as an optional feature -
> it is only optional for the writer.  With this definition of "optional
> feature" it follows logically that, if a standard has such "optional
> features", not all standard-conformant files will be readable by all
> standard-conformant readers.  This is as true of HTML, XML and CIF1 as
> it is of CIF2.  Whatever the relevance of HTML and XML to CIF, the
> existence of successful standards with optional features proves only
> that a standard can achieve widespread acceptance while having
> optional features - whether these optional features are a help or a
> hindrance would require some detailed analysis.
>
> So: any standard containing optional features requires the addition of
> external information in order to resolve the choice of optional
> features before successful information interchange can take place.
>
> Into this situation we place software developers.  These are the
> people who play a big role in deciding which optional parts of the
> standard are used, as they are the ones that write the software that
> attempts to read and write the files.  Developers will typically
> choose to support optional features based on how likely they are to be
> used, which depends in part on how likely they are perceived to be
> implemented in other software.  This is a recursive, potentially
> unstable situation, which will eventually resolve itself in one of
> three ways:
>
> (1) A "standard" subset of optional features develops and is
> almost always implemented in readers.  Special cases:
>   (a) No optional features are implemented
>   (b) All optional features are implemented
> (2) A variety of "standard" subsets develop, dividing users into
> different communities. These communities can't always read each
> other's files without additional conversion software, but there is
> little impetus to write this software, because if there were, the
> developers would have included support for the missing options in the
> first place.  The most obvious example of such communities would be
> those based on options relating to natural languages, if those
> communities do not care about accessibility of their files to
> non-users of their language and encoding.
> (3) A truly chaotic situation develops, with no discernible resolution
> and a plethora of incompatible files and software.
>
> Outcome 1 is the most desirable, as all files are now readable by all
> readers, meaning no additional negotiation is necessary, just as if we
> had mandated that set of optional features.  Outcome 2 is less
> desirable, as more software needs to be written and the standard by
> itself is not necessarily enough information to read a given file.
> Outcome 3 is obviously pretty unwelcome, but unlikely as it would
> require a lot of competing influences, which would eventually change
> and allow resolution into (1) or (2).  Think HTML and Microsoft.
>
> Now let us apply the above analysis to CIF: some are advocating not
> exhaustively listing or mandating the possible CIF2 encodings (CIF1
> did not list or mandate encoding either), leading to a range of
> "optional features" as I have defined it above (where support for any
> given encoding is a single "optional feature").  For CIF1, we had a
> type 1 outcome (only ASCII encoding was supported and produced).
>
> So: my understanding of the previous discussion is that, while we
> agree that it would be ideal if everyone used only UTF8, some perceive
> that the desire to use a different encoding will be sufficiently
> strong that mandating UTF8 will be ineffective and/or inconvenient.
> So, while I personally would advocate mandating UTF8, the other point
> of view would have us allowing non UTF8 encoding but hoping that
> everyone will eventually move to UTF8.
>
> In which case I would like to suggest that we use network effects to
> influence the recursive feedback loop experienced by programmers
> described above, so that the community settles on UTF8 in the same way
> as it has settled on ASCII for CIF1.  That is, we "load the dice" so
> that other encodings are disfavoured.  Here are some ways to "load the
> dice":
>
> (1) Mandate UTF8 only.
> (2) Make support for UTF8 mandatory in CIF processors
> (3) Force non UTF8 files to jump through extra hoops (which I think is
> necessary anyway)
> (4) Educate programmers on the drawbacks of non UTF8 encodings and
> strongly urge them not to support reading non UTF8 CIF files
> (5) Strongly recommend that the IUCr, wwPDB, and other centralised
> repositories reject non-UTF8-encoded CIF files
> (6) Make available hyperlinked information on system tools for dealing
> with UTF8 files on popular platforms, which could be used in error
> messages produced by programs (see (4))
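>
> To make (4) and (6) a little more concrete, here is a minimal sketch,
> in Python, of the kind of dice-loading reader I have in mind (the
> function name is invented, and the iconv hint is just an example of
> the pointers mentioned in (6)):
>
>     import sys
>
>     def read_cif2_utf8(path):
>         """Read a CIF2 file, insisting on UTF8 as per (1) and (2)."""
>         try:
>             with open(path, encoding="utf-8", errors="strict") as handle:
>                 return handle.read()
>         except UnicodeDecodeError as err:
>             sys.exit(f"{path} is not valid UTF-8 (around byte {err.start}); "
>                      "please convert it first, e.g.\n"
>                      "    iconv -f <current encoding> -t UTF-8 old.cif > new.cif")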
>
> I would be interested in hearing comments on the acceptability of
> these options from the rest of the group (I think we know how we all
> feel about (1)!).
>
> Now, returning to John's email: I will answer each of the points
> inline, at the same time attempting to get all the attributions
> correct.
>
> (James) I had not fully appreciated that Scheme B is intended to be
> applied only at the moment of transfer or archiving, and envisions
> users normally saving files in their preferred encoding with no hash
> codes or encoding hints required (I will refer to the inclusion of such
> hints and hashes as 'decoration').
>
> (John) "Envisions users normally [...]" is a bit stronger than my
> position or the intended orientation of Scheme B.  "Accommodates"
> would be my choice of wording.
>
> (James now) No problem with that wording, my point is that such
> undecorated files will be called CIF2 files and so are a target for
> CIF2 software developers, thus "unloading" the dice away from UTF8 and
> closer to encoding chaos.
>
> (James)  A direct result of allowing undecorated files to reside on
> disk is that CIF software producers will need to write software that
> will function with arbitrary encodings with no decoration to help
> them, as that is the form that users' files will most often be in.
>
> (John) The standard can do no more to prevent users from storing
> undecorated CIFs than it can to prevent users from storing CIF text
> encoded in ISO-8859-15, Shift-JIS or any other non-UTF-8 encoding.
> More generally, all the standard can do is define the characteristics
> of a conformant CIF -- it can never prevent CIF-like but
> non-conformant files from being created, used, exchanged, or archived
> as if they were conformant CIFs.  Regardless of the standard's
> ultimate position on this issue, software authors will have to be
> guided by practical considerations and by the real-world requirements
> placed on their programs.  In particular, they will have to decide
> whether to accept "CIF" input that in fact violates the standard in
> various ways, and / or they will have to decide which optional CIF
> behaviors they will support.  As such, I don't see a significant
> distinction between the alternatives before us as regards the
> difficulty, complexity, or requirements of CIF2 software.
>
> (James now) I have described the way the standard works to restrict
> encodings in the discussion at the top of this email.  Briefly, CIF
> software developers develop programs that conform with the CIF2
> standard.  If that standard says 'UTF8', they program for UTF8.  If
> you want to work in ISO-8859-15 etc, you have to do extra work.
>
> Working in favour of such extra work would be a compelling use case,
> which I have yet to see (I note that the 'UTF8 only' standard posted
> to ccp4-bb and pdb-l produced no comments).  My strong perception is
> that any need for other encodings is overwhelmed by the utility of
> settling on a single encoding, but that perception would need
> confirmation from a proper survey of non-ASCII users.
>
> So, no we can't stop people saving CIF-like files in other encodings,
> but we can discourage it by creating significant barriers in terms of
> software availability.  Just like we can't stop CIF1 users saving
> files in JIS X 0208, but that doesn't happen at any level that causes
> problems (if it happens at all, which I doubt).
>
> (John) Furthermore, no formulation of CIF is inherently reliable or
> unreliable, because reliability (in this sense) is a characteristic of
> data transfer, not of data themselves.  Scheme B targets the
> activities that require reliability assurance, and disregards those
> that don't.  In a practical sense, this isn't any different from
> scheme A, because it is only when the encoding is potentially
> uncertain -- to wit, in the context of data transfer -- that either
> scheme need be applied (see also below).  I suppose I would be willing
> to make scheme B a general requirement of the CIF format, but I don't
> see any advantage there over the current formulation.  The actual
> behavior of people and the practical requirements on CIF software
> would not appreciably change.
>
> (James now) I would suggest that Scheme B does not target all
> activities requiring reliability assurance, as it does not address the
> situation where people use a mix of CIF-aware software and text tools
> in a single encoding environment.
>
> The real, significant change that occurs when you accept Scheme B is
> that CIF files can now be in any encoding and undecorated.
> Programmers are then likely to provide programs that might or might
> not work with various encodings, and users feel justifiably that their
> undecorated files should be supported.  The software barrier that was
> encouraging UTF8-only has been removed, and the problem of mismatched
> encodings that we have been trying to avoid becomes that much more
> likely to occur.  Scheme B has very few teeth to enforce decoration at
> the point of transfer, as the software at either end is now probably
> happy with an undecorated file.  Requiring decoration as a condition
> of being a CIF2 file means that software will tend to reject
> undecorated files, thereby limiting the damage that would be caused by
> open slather encoding.
>
> (James)  Furthermore, given the ease with which files can be
> transferred between users (email attachment, saved in shared,
> network-mounted directory, drag and drop onto USB stick etc.) it is
> unlikely that Scheme B or anything involving extra effort would be
> applied unless the recipient demanded it.
>
> (John) For hand-created or hand-edited CIFs, I agree.  CIFs
> manipulated via a CIF2-compliant editor could be relied upon to
> conform to scheme B, however, provided that is standardized.  But the
> same applies to scheme A, given that few operating environments
> default to UTF-8 for text.
>
> (James now) That is my goal: that any CIF that passes through a
> CIF-compliant program must be decorated before input and output (if
> not UTF8).  What hand-edited, hand-created CIFs actually have in the
> way of decoration doesn't bother me much, as these are very rare and
> of no use unless they can be read into a CIF program, at which point
> they should be rejected until properly decorated.  And I reiterate,
> the process of applying decoration can be done interactively to
> minimise the chances of incorrect assignment of encoding.
>
> (James)  And given how many times that file might have changed hands
> across borders and operating systems within a single group
> collaboration, there would only be a qualified guarantee that the
> character to binary mapping has not been mangled en route, making any
> scheme applied subsequently rather pointless.
>
> (John) That also does not distinguish among the alternatives before
> us.  I appreciate the desire for an absolute guarantee of reliability,
> but none is available.  Qualified guarantees are the best we can
> achieve (and that's a technical assessment, not an aphorism).
>
> (James now) Oh, but I believe it does distinguish, because if CIF
> software reads only UTF8 (because that is what the standard says),
> then the file will tend to be in UTF8 at all points in time, with
> reduced possibilities for encoding errors.  I think it highly likely
> that each group that handles a CIF will at some stage run it through
> CIF-aware software, which means encoding mistakes are likely to be
> caught much earlier.
>
> (James) We would thus go from a situation where we had a single,
> reliable and sometimes slightly inconvenient encoding (UTF8), to one
> where a CIF processor should be prepared for any given CIF file to be
> one of a wide range of encodings which need to be guessed.
>
> (John) Under scheme A or the present draft text, we have "a single,
> reliable [...] encoding" only in the sense that the standard
> *specifies* that that encoding be used.  So far, however, I see little
> will to produce or use processors that are restricted to UTF-8, and I
> have every expectation that authors will continue to produce CIFs in
> various encodings regardless of the standard's ultimate stance.  Yes,
> it might be nice if everyone and every system converged on UTF-8 for
> text encoding, but CIF2 cannot force that to happen, not even among
> crystallographers.
>
> (James now) You see little will to do this: but as far as I can tell,
> there is even less will not to do it.  Authors will not "continue" to
> produce CIFs in various encodings, as they haven't started doing so
> yet.  As I've said above, CIF2 can certainly, if not force, encourage
> UTF8 adoption.  What's more, non-ASCII characters are only gradually
> going to find their way into CIF2 files, as the dictionaries and large
> scale adopters of CIF2 (the IUCr) start to allow non-ASCII characters
> in names, and the users gradually adapt to this new way of doing
> things.  I have no sense that CIF users will feel a strong desire to
> use non UTF8 schemes, when they have been happy in an ASCII-only
> regime up until now.  But I'm curious: on what basis are you saying
> that there is little will to use processors that are restricted to
> UTF8?
>
> (John) In practice, then, we really have a situation where the
> practical / useful CIF2 processor must be prepared to handle a variety
> of encodings (details dependent on system requirements), which may
> need to be guessed, with no standard mechanism for helping the
> processor make that determination or for allowing it to check its
> guess.  Scheme B improves that situation by standardizing a general
> reliability assurance mechanism, which otherwise would be missing.  In
> view of the practical situation, I see no down side at all.  A CIF
> processor working with scheme B is *more* able, not less.
>
> (James) I would much prefer a scheme which did not compromise
> reliability in such a significant way.
>
> (John) There is no such compromise, because in practice, we're not
> starting from a reliable position.
>
> (James now) I think your statement that our current position is not
> reliable arises out of a perception that users are likely to use a
> variety of encodings regardless of what the standard says.  I think
> this danger is way overstated, but I'd like to see you expand on why
> you think there is such a likelihood of multiple encodings being used
>
> (James) My previous (somewhat clunky) attempts to adjust Scheme B were
> directed at trying to force any file with the CIF2.0 magic number to
> be either decorated or UTF-8, meaning that software has a reasonably
> high confidence in file integrity.
>
> An alternative way of thinking about this is that CIF files also act
> as the mechanism of information transfer between software programs.
> [... W]hen a separate program is asked to input that CIF, the
> information has been transferred, even if that software is running on
> the same computer.
>
> (John) So in that sense, one could argue that Scheme B already applies
> to all CIFs, its assertion to the contrary notwithstanding.  Honestly,
> though, I don't think debating semantic details of terms such as "data
> transfer" is useful because in practice, and independent of scheme A,
> B, or Z, it is incumbent on the CIF receiver (/ reader / retriever) to
> choose what form of reliability assurance to accept or demand, if any.
>
> (James now) I was only debating semantic details in order to expose
> the fact that data transfer occurs between programs, not just between
> systems, and that therefore Scheme B should apply within a single
> system, so therefore, all CIF2 files should be decorated.  As for who
> should be demanding reliability assurance, the receiver may not be in
> a position to demand some level of reliability if the file creator is
> not in direct contact.  Again, we can build this reliability into the
> standard and save the extra negotiation or loss of information that is
> otherwise involved.
>
> (James) Now, moving on to the detailed contours of Scheme B and
> addressing the particular points that John and I have been discussing.
>  My original criticisms are the ones preceded by numerals.
>
> [(James now) I've deleted those points where we have reached
> agreement.  Those points are:
> (1) Restrict encodings to those for which the first line of a CIF file
> provides unambiguous encoding for ASCII codepoints
> (2) Put the hash value on the first line]
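>
> To make (1) and (2) concrete, here is a rough Python sketch of how a
> receiver might verify such a decorated file.  The first-line layout
> used here (#encoding=... #md5=...) is invented purely for illustration
> - Scheme B does not fix an exact syntax - and, per (1), it assumes an
> encoding in which the first line is plain ASCII bytes:
>
>     import hashlib
>
>     # Assumed (invented) decorated first line:
>     #   #\#CIF_2.0 #encoding=iso-8859-15 #md5=<md5 of the remaining
>     #   text, decoded and then re-encoded as UTF-8>
>     def verify_decorated(raw: bytes) -> str:
>         first, _, rest = raw.partition(b"\n")
>         fields = dict(item.split("=", 1)
>                       for item in first.decode("ascii").split()
>                       if "=" in item)
>         text = rest.decode(fields["#encoding"])   # fails if mis-declared
>         if hashlib.md5(text.encode("utf-8")).hexdigest() != fields["#md5"]:
>             raise ValueError("hash mismatch: transcoded or corrupted en route")
>         return text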
>
> (James a long time ago) (4) Assumption that all recipients will be
> able to handle all encodings
>
> (John) There is no such assumption.  Rather, there is an
> acknowledgement that some systems may be unable to handle some CIFs.
> That is already the case with CIF1, and it is not completely resolved
> by standardizing on UTF-8 (i.e. scheme A).
>
> (James) There is no such thing as 'optional' for an information
> interchange standard.  A file that conforms to the standard must be
> readable by parsers written according to the standard. If reading a
> standard-conformant file might fail or (worse) the file might be
> misinterpreted, information cannot always reliably be exchanged using
> this standard, so that optional behaviour needs to be either
> discarded, or made mandatory. There is thus no point in including
> optional behaviour in the standard. So: if the standard allows files
> to be written in encoding XYZ, then all readers should be able to read
> files written in encoding XYZ.  I view the CIF1 stance of allowing any
> encoding as a mistake, but a benign one, as in the case of CIF1, ASCII
> was so entrenched that it was the de facto standard for the characters
> appearing in CIF1 files.  In short, we have to specify a limited set
> of acceptable encodings.
>
> (John) As Herb astutely observed, those assertions reflect a
> fundamental source of our disagreement.  I think we can all agree that
> a standard that permits conforming software to misinterpret conforming
> data is undesirable.
>
> Surely we can also agree that an information interchange standard does
> not serve its purpose if it does not support information being
> successfully interchanged.  It does not follow, however, that the
> artifacts by which any two parties realize an information interchange
> must be interpretable by all other conceivable parties, nor does it
> follow that that would be a supremely advantageous characteristic if
> it were achievable.  It also does not follow that recognizable failure
> of any particular attempt at interchange must at all costs be avoided,
> or that a data interchange standard must take no account of its usage
> context.
>
> (James now) This is where we must make a policy decision: is a CIF2
> file to be a universally understandable file?  I agree that excluding
> optional behaviour is not an absolute requirement, but I also consider
> that optional behaviour should not be introduced without solid
> justification, given the real cost in interoperability and portability
> of the standard.  You refer to two parties who wish to exchange
> information: those parties are always free to agree on private
> enhancements to the CIF2 standard (or to create their very own
> protocol), if they are in contact.  I do not see why this use case
> need concern us here.  Herbert can say to John 'I'm emailing you a
> CIF2 file but encoded in UTF16'.  John has his extremely excellent
> software which handles UTF16 and these two parties are happy.
>
> John mentions a 'usage context'.  If the standard is to include some
> account of usage context, then that context has to be specified
> sufficiently for a CIF2 programmer to understand what aspects of that
> context to consider, and not left open to misinterpretation.  Perhaps
> you could enlarge on what particular context should be included?
>
> (John) Optional and alternative behaviors are not fundamentally
> incompatible with a data interchange standard, as XML and HTML
> demonstrate.  Or consider the extreme variability of CIF text content:
> whether a particular CIF is suitable for a particular purpose depends
> intimately on exactly which data are present in it, and even to some
> extent on which data names are used to present them, even though ALL
> are optional as far as the format is concerned.  If I say 'This CIF is
> unsuitable for my present purpose because it does not contain
> _symmetry_space_group_name_H-M', that does not mean the CIF standard
> is broken.  Yet, it is not qualitatively different for me to say 'This
> CIF is unsuitable because it is encoded in CCSID 500' despite CIF2
> (hypothetically) permitting arbitrary encodings.
>
> (James now)  The difference is quantitative and qualitative.
> Quantitative, because the number of CIF2 files that are unsuitable
> because of missing tags will always be less than or equal to the
> number of CIF2 files that are unsuitable because of a missing tag and
> unknown encoding.  Thus, by reducing ambiguity at the lower levels of
> the standard, we improve the utility at the higher levels.  The
> difference is also qualitative, in that if we have tags with
> non-ASCII characters, they could conceivably be confused with other
> tags if the encoding is not correct and so you will have a situation
> where a file that is not suitable actually appears suitable, because
> the desired tag appears. Likewise, the value taken by a tag may be
> wrong.
>
> (James a long time ago) (iii) restrict possible encodings to
> internationally recognised ones with well-specified Unicode mappings.
> This addresses point (4)
>
> (John) I don't see the need for this, and to some extent I think it
> could be harmful.  For example, if Herb sees a use for a scheme of
> this sort in conjunction with imgCIF (unknown at this point whether he
> does), then he might want to be able to specify an encoding specific
> to imgCIF, such as one that provides for multiple text segments, each
> with its own character encoding.  To the extent that imgCIF is an
> international standard, perhaps that could still satisfy the
> restriction, but I don't think that was the intended meaning of
> "internationally recognised".
>
> (James now)  Indeed.  My intent with this specification was to ensure
> that third parties would be able to recover the encoding. If imgCIF is
> going to cause us to make such an open-ended specification, it is
> probably a sign that imgCIF needs to be addressed separately.  For
> example, should we think about redefining it as a container format,
> with a CIF header and UTF16 body (but still part of the
> "Crystallographic Information Framework")?
>
> (John) As for "well-specified Unicode mappings", I think maybe I'm
> missing something.  CIF text is already limited to Unicode characters,
> and any encoding that can serve for a particular piece of CIF text
> must map at least the characters actually present in the text.  What
> encodings or scenarios would be excluded, then, by that aspect of this
> suggestion?
>
> (James) My intention was to make sure that not only the particular
> user who created the file knew this mapping, but that the mapping was
> publically available.  Certainly only Unicode encodable code points
> will appear, but the recipient needs to be able to recover the mapping
> from the file bytes to Unicode without relying on e.g. files that will
> be supplied on request by someone whose email address no longer works.
>
> (John) This issue is relevant only to the parties among whom a
> particular CIF is exchanged.  The standard would not particularly
> assist those parties by restricting the permitted encodings, because
> they can safely ignore such restrictions if they mutually agree to do
> so (whether statically or dynamically), and they (specifically, the
> CIF originator) must anyway comply with them if no such agreement is
> implicit or can be reached.
>
> (James) Again, any two parties in current contact can send each other
> files in whatever format and encoding they wish.  My concern is that
> CIF software writers are not drawn into supporting obscure or ad hoc
> encodings.
>
> (John) B) Scheme B does not use quite the same language as scheme A
> with respect to detectable encodings.  As a result, it supports
> (without tagging or hashing) not just UTF-8, but also all UTF-16 and
> UTF-32 variants.  This is intentional.
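>
> A rough illustration of why these particular encodings are detectable
> without decoration: every CIF2 file begins with the ASCII magic code
> #\#CIF_2.0, so either a BOM or the byte pattern of that magic code is
> enough to identify the UTF-8/16/32 family.  A minimal Python sketch
> (detect_from_magic is an invented name):
>
>     import codecs
>
>     MAGIC = "#\\#CIF_2.0"
>
>     def detect_from_magic(raw: bytes):
>         # A BOM, if present, settles it; check UTF-32 before UTF-16,
>         # since the UTF-32LE BOM begins with the UTF-16LE BOM.
>         for bom, name in ((codecs.BOM_UTF32_LE, "utf-32"),
>                           (codecs.BOM_UTF32_BE, "utf-32"),
>                           (codecs.BOM_UTF8, "utf-8-sig"),
>                           (codecs.BOM_UTF16_LE, "utf-16"),
>                           (codecs.BOM_UTF16_BE, "utf-16")):
>             if raw.startswith(bom):
>                 return name
>         # Otherwise, match the known magic code encoded each way.
>         for name in ("utf-8", "utf-16-le", "utf-16-be",
>                      "utf-32-le", "utf-32-be"):
>             if raw.startswith(MAGIC.encode(name)):
>                 return name
>         return None   # not one of the self-identifying encodings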
>
> (James) I am concerned that the vast majority of users based in
> English speaking countries (and many non English speaking countries)
> will be quite annoyed if they have to deal with UTF-16/32 CIF2 files
> that are no longer accessible to the simple ASCII-based tools and
> software that they are used to.  Because of this, allowing undecorated
> UTF16/32 would be far more disruptive than forcing people to use UTF8
> only. Thus my stipulation on maintaining compatibility with ASCII for
> undecorated files.
>
> (John) Supporting UTF-16/32 without tagging or hashing is not a key
> provision of scheme B, and I could live without it, but I don't think
> that would significantly change the likelihood of a user unexpectedly
> encountering undecorated UTF-16/32 CIFs.  It would change only whether
> such files were technically CIF-conformant, which doesn't much matter
> to the user on the spot.  In any case, it is not the lack of
> decoration that is the basic problem here.
>
> (James now)  Yes, that is true.  A decorated UTF16 file is just as
> unreadable as an undecorated one in ASCII tools.  However, per my
> comments at the start of this email, I think an extra bit of hoop
> jumping for non UTF8 encoded files has the desirable property of
> encouraging UTF8 use.
>
> (John) C) Scheme B is not aimed at ensuring that every conceivable
> receiver be able to interpret every scheme-B-compliant CIF.  Instead,
> it provides receivers the ability to *judge* whether they can
> interpret particular CIFs, and afterwards to *verify* that they have
> done so correctly.  Ensuring that receivers can interpret CIFs is thus
> a responsibility of the sender / archive maintainer, possibly in
> cooperation with the receiver / retriever.
>
> (James) As I've said before, I don't see the paradigm of live
> negotiation between senders and receivers as very useful, as it fails
> to account for CIFs being passed between different software (via
> reading/writing to a file system), or CIFs where the creator is no
> longer around, or technically unsophisticated senders where, for
> example, the software has produced an undecorated CIF in some native
> encoding and the sender has absolutely no idea why the receiver (if
> they even have contact with the receiver!) can't read the file
> properly.   I prefer to see the standard that we set as a substitute
> for live negotiation, so leaving things up to the users is in that
> sense an abrogation of our responsibility.
>
> (John) That scenario will undoubtedly occur occasionally regardless of
> the outcome of this discussion.  If it is our responsibility to avoid
> it at all costs then we are doomed to fail in that regard.  Software
> *will* under some circumstances produce undecorated, non-UTF-8 "CIFs"
> because that is sometimes convenient, efficient, and appropriate for
> the program's purpose.
>
> I think, though, those comments reflect a bit of a misconception.  The
> overall purpose of CIF supporting multiple encodings would be to allow
> specific CIFs to be better adapted for specific purposes.  Such
> purposes include, but are not limited to
>
> () exchanging data with general-purpose program(s) on the same system
> () exchanging data with crystallography program(s) on the same system
> () supporting performance or storage objectives of specific programs or
> systems
> () efficiently supporting problem or data domains in which Latin text
> is a minority of the content (e.g. imgCIF)
> () storing data in a personal archive
> () exchanging data with known third parties
> () publishing data to a general audience
>
> *Few, if any, of those uses would be likely to involve live
> negotiation.*  That's why I assigned primary responsibility for
> selecting encodings to the entity providing the CIF.  I probably
> should not even have mentioned cooperation of the receiver; I did so
> more because it is conceivable than because it is likely.
>
> (James now) OK, fair enough. My issue, then, with the paradigm of
> provider-based encoding selection is that it only works where the
> provider is capable of making this choice, and it puts that
> responsibility on all providers, large and small.  Of course, I am
> keen to construct a CIF ecology where providers always automatically
> choose UTF8 as the "safe" choice.
>
> (John) Under any scheme I can imagine, some CIFs will not be well
> suited to some purposes.  I want to avoid the situation that *no*
> conformant CIF can be well suited to some reasonable purposes.  I am
> willing to forgo the result that *every* conformant CIF is suited to
> certain other, also reasonable purposes.
>
> (James now) Fair enough.  However, so far the only reasonable purpose
> that I can see for which a UTF8 file would not be suitable is
> exchanging data with general-purpose programs that do not cope with
> UTF8, and it may well be that with a bit of research the list of such
> programs would turn out to be rather short.
>
>
>
> --
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> cif2-encoding mailing list
> cif2-encoding@iucr.org
> http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>
>
_______________________________________________
cif2-encoding mailing list
cif2-encoding@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif2-encoding
