Re: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .. .. .. .. .. .. .. .. .. .
- To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@xxxxxxxx>
- Subject: Re: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .. .. .. .. .. .. .. .. .. .
- From: "Herbert J. Bernstein" <yaya@xxxxxxxxxxxxxxxxxxxxxxx>
- Date: Wed, 25 Aug 2010 21:02:26 -0400 (EDT)
- In-Reply-To: <902931.65953.qm@web87003.mail.ird.yahoo.com>
- References: <AANLkTilyJE2mCxprlBYaSkysu1OBjY7otWrXDWm3oOT9@mail.gmail.com><alpine.BSF.2.00.1006251827270.70846@epsilon.pair.com><8F77913624F7524AACD2A92EAF3BFA54166122952D@SJMEMXMBS11.stjude.sjcrh.local><33483.93964.qm@web87012.mail.ird.yahoo.com> <8F77913624F7524AACD2A92EAF3BFA541661229533@SJMEMXMBS11.stjude.sjcrh.local><AANLkTilqKa_vZJEmfjEtd_MzKhH1CijEIglJzWpFQrrC@mail.gmail.com><8F77913624F7524AACD2A92EAF3BFA541661229542@SJMEMXMBS11.stjude.sjcrh.local><AANLkTikTee4PicHKjnnbAdipegyELQ6UWLXz9Zm08aVL@mail.gmail.com><8F77913624F7524AACD2A92EAF3BFA541661229552@SJMEMXMBS11.stjude.sjcrh.local><AANLkTinZ4KNsnREOOU6sVFdGYR_aQHcjdWr_ko648NGm@mail.gmail.com><8F77913624F7524AACD2A92EAF3BFA5416659DED8C@SJMEMXMBS11.stjude.sjcrh.local><AANLkTintziXhwVCEFD0yUtTDo9KG8ut=oL4OgmkjmEBe@mail.gmail.com><639601.73559.qm@web87008.mail.ird.yahoo.com><alpine.BSF.2.00.1008251951480.38129@epsilon.pair.co m><902931.65953.qm@web87003.mail.ird.yahoo.com>
With software, we do "release candidates". I would suggest that the
proponents of the UTF8-only approach prepare their CIF2 release
candidate and that those of us who favor a more general encoding
approach prepare our release candidate, that we put both forward to
the communities involved and see what reaction we get.

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Wed, 25 Aug 2010, SIMON WESTRIP wrote:

> "to present the various ideas to the community in the form of
> a completed standard with supporting software and see if they accept
> it"
>
> I tend to agree - the stumbling block is the "completed standard"
> (at least w.r.t. encoding?)
>
> :-)
>
> ____________________________________________________________________________
> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
> To: Group for discussing encoding and content validation schemes for CIF2
> <cif2-encoding@iucr.org>
> Sent: Thursday, 26 August, 2010 0:57:44
> Subject: Re: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line
> . .. .. .. .. .. .. .. .. .. .. .. .. .. .
>
> While I disagree with these estimates of how various communities will
> react, the best way to find out is not for us to debate among ourselves,
> but to present the various ideas to the community in the form of
> a completed standard with supporting software and see if they accept
> it. In the case of core CIF, that community has accepted what they
> were offered. In the case of mmCIF, that community has essentially
> rejected what they were offered. So, after all these years of
> effort on CIF2, isn't it past time to finish something, put it out
> there and see if it flies.
>
> As for my own views:
> I remind you that XML is the end result of the essentially failed
> SGML effort followed by the highly successful HTML effort. XML saved
> the SGML effort by adopting a large part of the simplicity and
> flexibility of HTML. Please bear that in mind.
>
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
> Dowling College, Kramer Science Center, KSC 121
> Idle Hour Blvd, Oakdale, NY, 11769
>
> +1-631-244-3035
> yaya@dowling.edu
> =====================================================
>
> On Wed, 25 Aug 2010, SIMON WESTRIP wrote:
>
> > Dear all
> >
> > Recent contributions have stimulated me to revisit some of the
> > fundamental issues of the possible changes in CIF2 with respect to
> > CIF1, in particular, the impact on current practice (as I perceive
> > it, based on my experience). The following is a summary of my
> > thoughts, trying to look at this from two perspectives (forgive me
> > if I repeat earlier opinions):
> >
> > 1) User perspective
> >
> > To date, in the 'core' CIF world (i.e. single-crystal and its
> > extensions), users treat CIFs as text files, and expect to be able
> > to read them as such using plain-text editors, and indeed edit them
> > if necessary for e.g. publication purposes. Furthermore, they expect
> > them to be readable by applications that claim that ability (e.g.
> > graphics software).
> >
> > The situation is slightly different with mmCIF (and the pdb
> > variants), where users tend to treat these CIFs as data sources that
> > can be read by applications without any need to examine the raw CIF
> > themselves, let alone edit them.
> >
> > Although the above statements only encompass two user groups and are
> > based on my personal experience, I believe these groups are the
> > largest when talking about CIF users?
> >
> > So what is the impact on such users of introducing the use of
> > non-ASCII text and thus raising the text encoding issue?
> >
> > In the latter case, probably minimal, inasmuch as the users don't
> > interact directly with the raw CIF and rely on CIF processing
> > software to manage the data.
> >
> > In the former case, it is quite possible that a user will no longer
> > be able to edit the raw CIF using the same plain-text editor they
> > have always used for such purposes.
> > For example, if a user receives a CIF that has been encoded in UTF16
> > by some remote CIF processing system, and opens it in a
> > non-UTF16-aware plain-text editor, they will not be presented with
> > what they would expect, even if the character set in that particular
> > CIF doesn't extend beyond ASCII; furthermore, even 'advanced' text
> > editors would struggle if the encoding were e.g. UTF16BE (i.e. has
> > no BOM). Granted, this example is equally applicable to CIF1, but by
> > 'opening up' multiple encodings, the probability of their usage
> > increases?
> >
> > So as soon as we move beyond ASCII, we have to accept that a large
> > group of CIF users will, at the very least, have to be aware that
> > CIF is no longer the 'text' format that they once understood it to
> > be?
> >
> > 2) Developer perspective
> >
> > I believe that developers presented with a documented standard will
> > follow that standard and prefer to work with no uncertainties,
> > especially if they are unfamiliar with the format (perhaps just need
> > to be able to read a CIF to extract data relevant to their
> > application/database...?)
> >
> > Taking the example of XML, in my experience developers seem to
> > follow the standard quite strictly. Most everyday applications that
> > process XML are intolerant of violations of the standard.
> > Fortunately, it is largely only developers that work with raw XML,
> > so the standard works well.
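
The UTF16 example in the user-perspective section above can be checked with a short sketch. It assumes Python's standard codecs and a made-up two-line CIF fragment (the data name and value are illustrative only, not taken from any real file). It suggests why ASCII-only text written as UTF-16BE is awkward for plain-text tools: no BOM is present to sniff, and every character is paired with a NUL byte.

```python
# Hypothetical ASCII-only CIF fragment; the content is illustrative only.
cif_text = "data_example\n_cell_length_a 5.43\n"

ascii_bytes = cif_text.encode("ascii")
utf16be_bytes = cif_text.encode("utf-16-be")  # explicit byte order: no BOM is written
utf16_bytes = cif_text.encode("utf-16")       # generic codec: a BOM is prepended


def has_utf16_bom(data: bytes) -> bool:
    """Return True if the data starts with a UTF-16 byte order mark."""
    return data[:2] in (b"\xff\xfe", b"\xfe\xff")


def shown_by_ascii_tool(data: bytes) -> str:
    """Mimic a naive ASCII viewer: non-ASCII bytes are replaced, NULs are kept."""
    return data.decode("ascii", errors="replace")


if __name__ == "__main__":
    print(has_utf16_bom(utf16be_bytes), has_utf16_bom(utf16_bytes))
    print(repr(shown_by_ascii_tool(utf16be_bytes)[:12]))
```

The same text encoded as ASCII (or UTF-8) passes through `shown_by_ascii_tool` unchanged, which is the property existing users rely on.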
> >
> > In contrast to XML, with HTML/javascript the approach to the
> > 'standard' is far more tolerant. Though these languages are
> > standardized, in order to compete, the leading application
> > developers have had to adopt flexibility (e.g. browsers accept
> > 'dirty' HTML, are remarkably forgiving of syntax violations in
> > javascript, and alter the standard to achieve their own ends or
> > facilitate user requirements). I suspect this results largely from
> > the evolution of the languages: just as in the early days of CIF,
> > encouragement of use and the end results were more important than
> > adherence to the documented standard?
> >
> > Note that these same applications that are so tolerant of
> > HTML/javascript violations are far less forgiving of malformed XML.
> > So is the lesson here that developers expect new standards to be
> > unambiguous and will code accordingly (especially if the new
> > standard was partly designed to address the shortcomings of its
> > ancestors)?
> >
> > Again, forgive me if this all sounds familiar - however, before
> > arguing one way or the other with regard to specifics, perhaps the
> > wider group would like to confirm or otherwise the main points I'm
> > trying to assert, in particular, with respect to *user* practice:
> >
> > 1) CIF2 will require users to change the way they view CIF - i.e.
> > they may be forced to use CIF2-compliant text editors/application
> > software, and abandon their current practice.
> >
> > With respect to developers, recent coverage has been very
> > insightful, but just out of interest, would I be wrong in stating
> > that:
> >
> > 2) Developers, especially those that don't specialize in CIF, are
> > likely to want a clear-cut universal standard that does not require
> > any heuristic interpretation.
> >
> > Cheers
> >
> > Simon
> >
> > ____________________________________________________________________________
> > From: James Hester <jamesrhester@gmail.com>
> > To: Group for discussing encoding and content validation schemes for CIF2
> > <cif2-encoding@iucr.org>
> > Sent: Tuesday, 24 August, 2010 4:38:27
> > Subject: Re: [Cif2-encoding] [ddlm-group] options/text vs binary/end-of-line
> > . .. .. .. .. .. .. .. .. .. .. .. .. .. .
> >
> > Thanks John for a detailed response.
> >
> > At the top of this email I will address this whole issue of optional
> > behaviour. I was clearly too telegraphic in previous posts, as
> > Herbert thinks that optional whitespace counts as an optional feature,
> > so I go into some detail below.
> >
> > By "optional features" I mean those aspects of the standard that are
> > not mandatory for both readers and writers, and in addition I am not
> > concerned with features that do not relate directly to the information
> > transferred, e.g. optional warnings. For example, unless "optional
> > whitespace" means that the *reader* may throw a syntax error when
> > whitespace is encountered at some particular point where whitespace is
> > optional, I do not view optional whitespace as an optional feature -
> > it is only optional for the writer. With this definition of "optional
> > feature" it follows logically that, if a standard has such "optional
> > features", not all standard-conformant files will be readable by all
> > standard-conformant readers. This is as true of HTML, XML and CIF1 as
> > it is of CIF2. Whatever the relevance of HTML and XML to CIF, the
> > existence of successful standards with optional features proves only
> > that a standard can achieve widespread acceptance while having
> > optional features - whether these optional features are a help or a
> > hindrance would require some detailed analysis.
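
The logical consequence drawn from that definition of "optional feature" can be sketched in a few lines, treating support for each encoding as one optional feature. The reader names and their encoding sets below are invented purely for illustration; nothing here describes real CIF software.

```python
# Each hypothetical standard-conformant reader supports a subset of the
# optional encodings. Names and sets are assumptions for illustration.
readers = {
    "reader_a": {"utf-8"},
    "reader_b": {"utf-8", "utf-16"},
    "reader_c": {"utf-8", "iso-8859-15"},
}


def readable_by_all(file_encoding: str) -> bool:
    """True if every reader in the table supports the file's encoding."""
    return all(file_encoding in supported for supported in readers.values())


def universally_readable(encodings):
    """Return, sorted, the encodings that no reader would reject."""
    return sorted(e for e in encodings if readable_by_all(e))


if __name__ == "__main__":
    print(universally_readable(["utf-16", "utf-8", "iso-8859-15"]))
```

In this toy table only the encoding common to every reader survives, which is the "type 1 outcome" situation; a file in any other encoding is conformant yet unreadable by at least one conformant reader.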
> >
> > So: any standard containing optional features requires the addition of
> > external information in order to resolve the choice of optional
> > features before successful information interchange can take place.
> >
> > Into this situation we place software developers. These are the
> > people who play a big role in deciding which optional parts of the
> > standard are used, as they are the ones that write the software that
> > attempts to read and write the files. Developers will typically
> > choose to support optional features based on how likely they are to be
> > used, which depends in part on how likely they are perceived to be
> > implemented in other software. This is a recursive, potentially
> > unstable situation, which will eventually resolve itself in one of
> > three ways:
> >
> > (1) A "standard" subset of optional features develops and is
> > approximately always implemented in readers. Special cases:
> >    (a) No optional features are implemented
> >    (b) All optional features are implemented
> > (2) A variety of "standard" subsets develop, dividing users into
> > different communities. These communities can't always read each
> > other's files without additional conversion software, but there is
> > little impetus to write this software, because if there were, the
> > developers would have included support for the missing options in the
> > first place. The most obvious example of such communities would be
> > those based on options relating to natural languages, if those
> > communities do not care about accessibility of their files to
> > non-users of their language and encoding.
> > (3) A truly chaotic situation develops, with no discernable resolution
> > and a plethora of incompatible files and software.
> >
> > Outcome 1 is the most desirable, as all files are now readable by all
> > readers, meaning no additional negotiation is necessary, just as if we
> > had mandated that set of optional features.
> > Outcome 2 is less
> > desirable, as more software needs to be written and the standard by
> > itself is not necessarily enough information to read a given file.
> > Outcome 3 is obviously pretty unwelcome, but unlikely as it would
> > require a lot of competing influences, which would eventually change
> > and allow resolution into (1) or (2). Think HTML and Microsoft.
> >
> > Now let us apply the above analysis to CIF: some are advocating not
> > exhaustively listing or mandating the possible CIF2 encodings (CIF1
> > did not list or mandate encoding either), leading to a range of
> > "optional features" as I have defined it above (where support for any
> > given encoding is a single "optional feature"). For CIF1, we had a
> > type 1 outcome (only ASCII encoding was supported and produced).
> >
> > So: my understanding of the previous discussion is that, while we
> > agree that it would be ideal if everyone used only UTF8, some perceive
> > that the desire to use a different encoding will be sufficiently
> > strong that mandating UTF8 will be ineffective and/or inconvenient.
> > So, while I personally would advocate mandating UTF8, the other point
> > of view would have us allowing non UTF8 encoding but hoping that
> > everyone will eventually move to UTF8.
> >
> > In which case I would like to suggest that we use network effects to
> > influence the recursive feedback loop experienced by programmers
> > described above, so that the community settles on UTF8 in the same way
> > as it has settled on ASCII for CIF1. That is, we "load the dice" so
> > that other encodings are disfavoured. Here are some ways to "load the
> > dice":
> >
> > (1) Mandate UTF8 only.
> > (2) Make support for UTF8 mandatory in CIF processors
> > (3) Force non UTF8 files to jump through extra hoops (which I think is
> > necessary anyway)
> > (4) Educate programmers on the drawbacks of non UTF8 encodings and
> > strongly urge them not to support reading non UTF8 CIF files
> > (5) Strongly recommend that the IUCr, wwPDB, and other centralised
> > repositories reject non-UTF8-encoded CIF files
> > (6) Make available hyperlinked information on system tools for dealing
> > with UTF8 files on popular platforms, which could be used in error
> > messages produced by programs (see (4))
> >
> > I would be interested in hearing comments on the acceptability of
> > these options from the rest of the group (I think we know how we all
> > feel about (1)!).
> >
> > Now, returning to John's email: I will answer each of the points
> > inline, at the same time attempting to get all the attributions
> > correct.
> >
> > (James) I had not fully appreciated that Scheme B is intended to be
> > applied only at the moment of transfer or archiving, and envisions
> > users normally saving files in their preferred encoding with no hash
> > codes or encoding hints required (I will call the inclusion of such
> > hints and hashes as 'decoration').
> >
> > (John) "Envisions users normally [...]" is a bit stronger than my
> > position or the intended orientation of Scheme B. "Accommodates"
> > would be my choice of wording.
> >
> > (James now) No problem with that wording, my point is that such
> > undecorated files will be called CIF2 files and so are a target for
> > CIF2 software developers, thus "unloading" the dice away from UTF8 and
> > closer to encoding chaos.
> >
> > (James) A direct result of allowing undecorated files to reside on
> > disk is that CIF software producers will need to write software that
> > will function with arbitrary encodings with no decoration to help
> > them, as that is the form that users' files will most often be in.
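
Options (2), (4) and (6) in the "load the dice" list above could look roughly like the following in a CIF reader. This is only a sketch under stated assumptions: the function name `read_cif2_strict` and the help URL are hypothetical, and nothing here is taken from an existing CIF library. It relies only on Python's strict UTF-8 decoding.

```python
# Hypothetical location of conversion advice; not a real CIF resource.
HELP_URL = "https://example.org/cif2-utf8-help"


def read_cif2_strict(data: bytes) -> str:
    """Decode CIF2 bytes as UTF-8 and refuse anything that is not valid UTF-8."""
    try:
        return data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError as err:
        # Point the user at tools for converting the file, as in option (6).
        raise ValueError(
            f"not valid UTF-8 at byte {err.start}; "
            f"convert the file to UTF-8 first (see {HELP_URL})"
        ) from err


if __name__ == "__main__":
    print(read_cif2_strict("data_\u00e9\n".encode("utf-8")))
```

A reader written this way never guesses an encoding, so a file saved in, say, ISO-8859-1 fails loudly at the first non-ASCII byte instead of being silently misread.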
> >
> > (John) The standard can do no more to prevent users from storing
> > undecorated CIFs than it can to prevent users from storing CIF text
> > encoded in ISO-8859-15, Shift-JIS or any other non-UTF-8 encoding.
> > More generally, all the standard can do is define the characteristics
> > of a conformant CIF -- it can never prevent CIF-like but
> > non-conformant files from being created, used, exchanged, or archived
> > as if they were conformant CIFs. Regardless of the standard's
> > ultimate position on this issue, software authors will have to be
> > guided by practical considerations and by the real-world requirements
> > placed on their programs. In particular, they will have to decide
> > whether to accept "CIF" input that in fact violates the standard in
> > various ways, and / or they will have to decide which optional CIF
> > behaviors they will support. As such, I don't see a significant
> > distinction between the alternatives before us as regards the
> > difficulty, complexity, or requirements of CIF2 software.
> >
> > (James now) I have described the way the standard works to restrict
> > encodings in the discussion at the top of this email. Briefly, CIF
> > software developers develop programs that conform with the CIF2
> > standard. If that standard says 'UTF8', they program for UTF8. If
> > you want to work in ISO-8859-15 etc, you have to do extra work.
> >
> > Working in favour of such extra work would be a compelling use case,
> > which I have yet to see (I note that the 'UTF8 only' standard posted
> > to ccp4-bb and pdb-l produced no comments). My strong perception is
> > that any need for other encodings is overwhelmed by the utility of
> > settling on a single encoding, but that perception would need
> > confirmation from a proper survey of non-ASCII users.
> >
> > So, no we can't stop people saving CIF-like files in other encodings,
> > but we can discourage it by creating significant barriers in terms of
> > software availability.
> > Just like we can't stop CIF1 users saving
> > files in JIS X 0208, but that doesn't happen at any level that causes
> > problems (if it happens at all, which I doubt).
> >
> > (John) Furthermore, no formulation of CIF is inherently reliable or
> > unreliable, because reliability (in this sense) is a characteristic of
> > data transfer, not of data themselves. Scheme B targets the
> > activities that require reliability assurance, and disregards those
> > that don't. In a practical sense, this isn't any different from
> > scheme A, because it is only when the encoding is potentially
> > uncertain -- to wit, in the context of data transfer -- that either
> > scheme need be applied (see also below). I suppose I would be willing
> > to make scheme B a general requirement of the CIF format, but I don't
> > see any advantage there over the current formulation. The actual
> > behavior of people and the practical requirements on CIF software
> > would not appreciably change.
> >
> > (James now) I would suggest that Scheme B does not target all
> > activities requiring reliability assurance, as it does not address the
> > situation where people use a mix of CIF-aware software and text tools
> > in a single encoding environment.
> >
> > The real, significant change that occurs when you accept Scheme B is
> > that CIF files can now be in any encoding and undecorated.
> > Programmers are then likely to provide programs that might or might
> > not work with various encodings, and users feel justifiably that their
> > undecorated files should be supported. The software barrier that was
> > encouraging UTF8-only has been removed, and the problem of mismatched
> > encodings that we have been trying to avoid becomes that much more
> > likely to occur. Scheme B has very few teeth to enforce decoration at
> > the point of transfer, as the software at either end is now probably
> > happy with an undecorated file.
> > Requiring decoration as a condition
> > of being a CIF2 file means that software will tend to reject
> > undecorated files, thereby limiting the damage that would be caused by
> > open slather encoding.
> >
> > (James) Furthermore, given the ease with which files can be
> > transferred between users (email attachment, saved in shared,
> > network-mounted directory, drag and drop onto USB stick etc.) it is
> > unlikely that Scheme B or anything involving extra effort would be
> > applied unless the recipient demanded it.
> >
> > (John) For hand-created or hand-edited CIFs, I agree. CIFs
> > manipulated via a CIF2-compliant editor could be relied upon to
> > conform to scheme B, however, provided that is standardized. But the
> > same applies to scheme A, given that few operating environments
> > default to UTF-8 for text.
> >
> > (James now) That is my goal: that any CIF that passes through a
> > CIF-compliant program must be decorated before input and output (if
> > not UTF8). What hand-edited, hand-created CIFs actually have in the
> > way of decoration doesn't bother me much, as these are very rare and
> > of no use unless they can be read into a CIF program, at which point
> > they should be rejected until properly decorated. And I reiterate,
> > the process of applying decoration can be done interactively to
> > minimise the chances of incorrect assignment of encoding.
> >
> > (James) And given how many times that file might have changed hands
> > across borders and operating systems within a single group
> > collaboration, there would only be a qualified guarantee that the
> > character to binary mapping has not been mangled en route, making any
> > scheme applied subsequently rather pointless.
> >
> > (John) That also does not distinguish among the alternatives before
> > us. I appreciate the desire for an absolute guarantee of reliability,
> > but none is available.
> > Qualified guarantees are the best we can
> > achieve (and that's a technical assessment, not an aphorism).
> >
> > (James now) Oh, but I believe it does distinguish, because if CIF
> > software reads only UTF8 (because that is what the standard says),
> > then the file will tend to be in UTF8 at all points in time, with
> > reduced possibilities for encoding errors. I think it highly likely
> > that each group that handles a CIF will at some stage run it through
> > CIF-aware software, which means encoding mistakes are likely to be
> > caught much earlier.
> >
> > (James) We would thus go from a situation where we had a single,
> > reliable and sometimes slightly inconvenient encoding (UTF8), to one
> > where a CIF processor should be prepared for any given CIF file to be
> > one of a wide range of encodings which need to be guessed.
> >
> > (John) Under scheme A or the present draft text, we have "a single,
> > reliable [...] encoding" only in the sense that the standard
> > *specifies* that that encoding be used. So far, however, I see little
> > will to produce or use processors that are restricted to UTF-8, and I
> > have every expectation that authors will continue to produce CIFs in
> > various encodings regardless of the standard's ultimate stance. Yes,
> > it might be nice if everyone and every system converged on UTF-8 for
> > text encoding, but CIF2 cannot force that to happen, not even among
> > crystallographers.
> >
> > (James now) You see little will to do this: but as far as I can tell,
> > there is even less will not to do it. Authors will not "continue" to
> > produce CIFs in various encodings, as they haven't started doing so
> > yet. As I've said above, CIF2 can certainly, if not force, encourage
> > UTF8 adoption.
> > What's more, non-ASCII characters are only gradually
> > going to find their way into CIF2 files, as the dictionaries and large
> > scale adopters of CIF2 (the IUCr) start to allow non-ASCII characters
> > in names, and the users gradually adapt to this new way of doing
> > things. I have no sense that CIF users will feel a strong desire to
> > use non UTF8 schemes, when they have been happy in an ASCII-only
> > regime up until now. But I'm curious: on what basis are you saying
> > that there is little will to use processors that are restricted to
> > UTF8?
> >
> > (John) In practice, then, we really have a situation where the
> > practical / useful CIF2 processor must be prepared to handle a variety
> > of encodings (details dependent on system requirements), which may
> > need to be guessed, with no standard mechanism for helping the
> > processor make that determination or for allowing it to check its
> > guess. Scheme B improves that situation by standardizing a general
> > reliability assurance mechanism, which otherwise would be missing. In
> > view of the practical situation, I see no down side at all. A CIF
> > processor working with scheme B is *more* able, not less.
> >
> > (James) I would much prefer a scheme which did not compromise
> > reliability in such a significant way.
> >
> > (John) There is no such compromise, because in practice, we're not
> > starting from a reliable position.
> >
> > (James now) I think your statement that our current position is not
> > reliable arises out of a perception that users are likely to use a
> > variety of encodings regardless of what the standard says.
> > I think
> > this danger is way overstated, but I'd like to see you expand on why
> > you think there is such a likelihood of multiple encodings being used.
> >
> > (James) My previous (somewhat clunky) attempts to adjust Scheme B were
> > directed at trying to force any file with the CIF2.0 magic number to
> > be either decorated or UTF-8, meaning that software has a reasonably
> > high confidence in file integrity.
> >
> > An alternative way of thinking about this is that CIF files also act
> > as the mechanism of information transfer between software programs.
> > [... W]hen a separate program is asked to input that CIF, the
> > information has been transferred, even if that software is running on
> > the same computer.
> >
> > (John) So in that sense, one could argue that Scheme B already applies
> > to all CIFs, its assertion to the contrary notwithstanding. Honestly,
> > though, I don't think debating semantic details of terms such as "data
> > transfer" is useful because in practice, and independent of scheme A,
> > B, or Z, it is incumbent on the CIF receiver (/ reader / retriever) to
> > choose what form of reliability assurance to accept or demand, if any.
> >
> > (James now) I was only debating semantic details in order to expose
> > the fact that data transfer occurs between programs, not just between
> > systems, and that therefore Scheme B should apply within a single
> > system, so therefore, all CIF2 files should be decorated. As for who
> > should be demanding reliability assurance, the receiver may not be in
> > a position to demand some level of reliability if the file creator is
> > not in direct contact. Again, we can build this reliability into the
> > standard and save the extra negotiation or loss of information that is
> > otherwise involved.
> >
> > (James) Now, moving on to the detailed contours of Scheme B and
> > addressing the particular points that John and I have been discussing.
> > My original criticisms are the ones preceded by numerals.
> >
> > [(James now) I've deleted those points where we have reached
> > agreement. Those points are:
> > (1) Restrict encodings to those for which the first line of a CIF file
> > provides unambiguous encoding for ASCII codepoints
> > (2) Put the hash value on the first line]
> >
> > (James a long time ago) (4) Assumption that all recipients will be
> > able to handle all encodings
> >
> > (John) There is no such assumption. Rather, there is an
> > acknowledgement that some systems may be unable to handle some CIFs.
> > That is already the case with CIF1, and it is not completely resolved
> > by standardizing on UTF-8 (i.e. scheme A).
> >
> > (James) There is no such thing as 'optional' for an information
> > interchange standard. A file that conforms to the standard must be
> > readable by parsers written according to the standard. If reading a
> > standard-conformant file might fail or (worse) the file might be
> > misinterpreted, information cannot always reliably be exchanged using
> > this standard, so that optional behaviour needs to be either
> > discarded, or made mandatory. There is thus no point in including
> > optional behaviour in the standard. So: if the standard allows files
> > to be written in encoding XYZ, then all readers should be able to read
> > files written in encoding XYZ. I view the CIF1 stance of allowing any
> > encoding as a mistake, but a benign one, as in the case of CIF1 ASCII
> > was so entrenched that it was the de facto standard for the characters
> > appearing in CIF1 files. In short, we have to specify a limited set
> > of acceptable encodings.
> >
> > (John) As Herb astutely observed, those assertions reflect a
> > fundamental source of our disagreement. I think we can all agree that
> > a standard that permits conforming software to misinterpret conforming
> > data is undesirable.
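
The kind of misinterpretation both sides agree is undesirable can be illustrated with a small sketch. It assumes Python's standard codecs; the data name `_note` and the byte string are made up for the example. The same bytes decode without any error under two single-byte encodings but yield different characters, whereas a strict UTF-8 decode refuses them outright.

```python
# Hypothetical bytes as they might sit on disk; 0xA4 is the interesting byte.
raw = b"_note 10\xa4\n"

as_latin1 = raw.decode("iso-8859-1")   # 0xA4 is the currency sign U+00A4
as_latin9 = raw.decode("iso-8859-15")  # 0xA4 is the euro sign U+20AC


def utf8_rejects(data: bytes) -> bool:
    """Return True if strict UTF-8 decoding fails for these bytes."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


if __name__ == "__main__":
    print(repr(as_latin1), repr(as_latin9), utf8_rejects(raw))
```

Neither single-byte decode signals a problem, so a reader guessing the wrong one of the two would report a wrong value with no warning; the failure under UTF-8 is at least recognizable.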
> >
> > Surely we can also agree that an information interchange standard does
> > not serve its purpose if it does not support information being
> > successfully interchanged. It does not follow, however, that the
> > artifacts by which any two parties realize an information interchange
> > must be interpretable by all other conceivable parties, nor does it
> > follow that that would be a supremely advantageous characteristic if
> > it were achievable. It also does not follow that recognizable failure
> > of any particular attempt at interchange must at all costs be avoided,
> > or that a data interchange standard must take no account of its usage
> > context.
> >
> > (James now) This is where we must make a policy decision: is a CIF2
> > file to be a universally understandable file? I agree that excluding
> > optional behaviour is not an absolute requirement, but I also consider
> > that optional behaviour should not be introduced without solid
> > justification, given the real cost in interoperability and portability
> > of the standard. You refer to two parties who wish to exchange
> > information: those parties are always free to agree on private
> > enhancements to the CIF2 standard (or to create their very own
> > protocol), if they are in contact. I do not see why this use case
> > need concern us here. Herbert can say to John 'I'm emailing you a
> > CIF2 file but encoded in UTF16'. John has his extremely excellent
> > software which handles UTF16 and these two parties are happy.
> >
> > John mentions a 'usage context'. If the standard is to include some
> > account of usage context, then that context has to be specified
> > sufficiently for a CIF2 programmer to understand what aspects of that
> > context to consider, and not left open to misinterpretation. Perhaps
> > you could enlarge on what particular context should be included?
> >
> > (John) Optional and alternative behaviors are not fundamentally
> > incompatible with a data interchange standard, as XML and HTML
> > demonstrate. Or consider the extreme variability of CIF text content:
> > whether a particular CIF is suitable for a particular purpose depends
> > intimately on exactly which data are present in it, and even to some
> > extent on which data names are used to present them, even though ALL
> > are optional as far as the format is concerned. If I say 'This CIF is
> > unsuitable for my present purpose because it does not contain
> > _symmetry_space_group_name_H-M', that does not mean the CIF standard
> > is broken. Yet, it is not qualitatively different for me to say 'This
> > CIF is unsuitable because it is encoded in CCSID 500' despite CIF2
> > (hypothetically) permitting arbitrary encodings.
> >
> > (James now) The difference is quantitative and qualitative.
> > Quantitative, because the number of CIF2 files that are unsuitable
> > because of missing tags will always be less than or equal to the
> > number of CIF2 files that are unsuitable because of a missing tag and
> > unknown encoding. Thus, by reducing ambiguity at the lower levels of
> > the standard, we improve the utility at the higher levels. The
> > difference is also qualitative, in that (a) if we have tags with
> > non-ASCII characters, they could conceivably be confused with other
> > tags if the encoding is not correct and so you will have a situation
> > where a file that is not suitable actually appears suitable, because
> > the desired tag appears. Likewise, the value taken by a tag may be
> > wrong.
> >
> > (James a long time ago) (iii) restrict possible encodings to
> > internationally recognised ones with well-specified Unicode mappings.
> > This addresses point (4)
> >
> > (John) I don't see the need for this, and to some extent I think it
> > could be harmful.
> > For example, if Herb sees a use for a scheme of
> > this sort in conjunction with imgCIF (unknown at this point whether he
> > does), then he might want to be able to specify an encoding specific
> > to imgCIF, such as one that provides for multiple text segments, each
> > with its own character encoding. To the extent that imgCIF is an
> > international standard, perhaps that could still satisfy the
> > restriction, but I don't think that was the intended meaning of
> > "internationally recognised".
> >
> > (James now) Indeed. My intent with this specification was to ensure
> > that third parties would be able to recover the encoding. If imgCIF is
> > going to cause us to make such an open-ended specification, it is
> > probably a sign that imgCIF needs to be addressed separately. For
> > example, should we think about redefining it as a container format,
> > with a CIF header and UTF16 body (but still part of the
> > "Crystallographic Information Framework")?
> >
> > (John) As for "well-specified Unicode mappings", I think maybe I'm
> > missing something. CIF text is already limited to Unicode characters,
> > and any encoding that can serve for a particular piece of CIF text
> > must map at least the characters actually present in the text. What
> > encodings or scenarios would be excluded, then, by that aspect of this
> > suggestion?
> >
> > (James) My intention was to make sure that not only the particular
> > user who created the file knew this mapping, but that the mapping was
> > publicly available. Certainly only Unicode encodable code points
> > will appear, but the recipient needs to be able to recover the mapping
> > from the file bytes to Unicode without relying on e.g. files that will
> > be supplied on request by someone whose email address no longer works.
> >
> > (John) This issue is relevant only to the parties among whom a
> > particular CIF is exchanged.
> > The standard would not particularly
> > assist those parties by restricting the permitted encodings, because
> > they can safely ignore such restrictions if they mutually agree to do
> > so (whether statically or dynamically), and they (specifically, the
> > CIF originator) must anyway comply with them if no such agreement is
> > implicit or can be reached.
> >
> > (James) Again, any two parties in current contact can send each other
> > files in whatever format and encoding they wish. My concern is that
> > CIF software writers are not drawn into supporting obscure or ad hoc
> > encodings.
> >
> > (John) B) Scheme B does not use quite the same language as scheme A
> > with respect to detectable encodings. As a result, it supports
> > (without tagging or hashing) not just UTF-8, but also all UTF-16 and
> > UTF-32 variants. This is intentional.
> >
> > (James) I am concerned that the vast majority of users based in
> > English-speaking countries (and many non-English-speaking countries)
> > will be quite annoyed if they have to deal with UTF-16/32 CIF2 files
> > that are no longer accessible to the simple ASCII-based tools and
> > software that they are used to. Because of this, allowing undecorated
> > UTF16/32 would be far more disruptive than forcing people to use UTF8
> > only. Thus my stipulation on maintaining compatibility with ASCII for
> > undecorated files.
> >
> > (John) Supporting UTF-16/32 without tagging or hashing is not a key
> > provision of scheme B, and I could live without it, but I don't think
> > that would significantly change the likelihood of a user unexpectedly
> > encountering undecorated UTF-16/32 CIFs. It would change only whether
> > such files were technically CIF-conformant, which doesn't much matter
> > to the user on the spot. In any case, it is not the lack of
> > decoration that is the basic problem here.
> >
> > (James now) Yes, that is true. A decorated UTF16 file is just as
> > unreadable as an undecorated one in ASCII tools.
> > However, per my
> > comments at the start of this email, I think an extra bit of hoop
> > jumping for non-UTF8-encoded files has the desirable property of
> > encouraging UTF8 use.
> >
> > (John) C) Scheme B is not aimed at ensuring that every conceivable
> > receiver be able to interpret every scheme-B-compliant CIF. Instead,
> > it provides receivers the ability to *judge* whether they can
> > interpret particular CIFs, and afterwards to *verify* that they have
> > done so correctly. Ensuring that receivers can interpret CIFs is thus
> > a responsibility of the sender / archive maintainer, possibly in
> > cooperation with the receiver / retriever.
> >
> > (James) As I've said before, I don't see the paradigm of live
> > negotiation between senders and receivers as very useful, as it fails
> > to account for CIFs being passed between different software (via
> > reading/writing to a file system), or CIFs where the creator is no
> > longer around, or technically unsophisticated senders where, for
> > example, the software has produced an undecorated CIF in some native
> > encoding and the sender has absolutely no idea why the receiver (if
> > they even have contact with the receiver!) can't read the file
> > properly. I prefer to see the standard that we set as a substitute
> > for live negotiation, so leaving things up to the users is in that
> > sense an abrogation of our responsibility.
> >
> > (John) That scenario will undoubtedly occur occasionally regardless of
> > the outcome of this discussion. If it is our responsibility to avoid
> > it at all costs then we are doomed to fail in that regard. Software
> > *will* under some circumstances produce undecorated, non-UTF-8 "CIFs"
> > because that is sometimes convenient, efficient, and appropriate for
> > the program's purpose.
> >
> > I think, though, those comments reflect a bit of a misconception.
> > The
> > overall purpose of CIF supporting multiple encodings would be to allow
> > specific CIFs to be better adapted for specific purposes. Such
> > purposes include, but are not limited to
> >
> > () exchanging data with general-purpose program(s) on the same system
> > () exchanging data with crystallography program(s) on the same system
> > () supporting performance or storage objectives of specific programs or
> > systems
> > () efficiently supporting problem or data domains in which Latin text
> > is a minority of the content (e.g. imgCIF)
> > () storing data in a personal archive
> > () exchanging data with known third parties
> > () publishing data to a general audience
> >
> > *Few, if any, of those uses would be likely to involve live
> > negotiation.* That's why I assigned primary responsibility for
> > selecting encodings to the entity providing the CIF. I probably
> > should not even have mentioned cooperation of the receiver; I did so
> > more because it is conceivable than because it is likely.
> >
> > (James now) OK, fair enough. My issue then with the paradigm of
> > provider-based encoding selection is that it only works where the
> > provider is capable of making this choice, and it puts that
> > responsibility on all providers, large and small. Of course, I am
> > keen to construct a CIF ecology where providers always automatically
> > choose UTF8 as the "safe" choice.
> >
> > (John) Under any scheme I can imagine, some CIFs will not be well
> > suited to some purposes. I want to avoid the situation that *no*
> > conformant CIF can be well suited to some reasonable purposes. I am
> > willing to forgo the result that *every* conformant CIF is suited to
> > certain other, also reasonable purposes.
> >
> > (James now) Fair enough.
> > However, so far the only reasonable purpose
> > that I can see for which a UTF8 file would not be suitable is
> > exchanging data with general-purpose programs that do not cope with
> > UTF8, and it may well be that with a bit of research the list of such
> > programs would turn out to be rather short.
> >
> > --
> > T +61 (02) 9717 9907
> > F +61 (02) 9717 3145
> > M +61 (04) 0249 4148
> > _______________________________________________
> > cif2-encoding mailing list
> > cif2-encoding@iucr.org
> > http://scripts.iucr.org/mailman/listinfo/cif2-encoding