Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .... .. .. .
- To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
- Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .... .. .. .
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Sun, 27 Jun 2010 09:23:29 -0400 (EDT)
- In-Reply-To: <200973.94532.qm@web87008.mail.ird.yahoo.com>
- References: <AANLkTilyJE2mCxprlBYaSkysu1OBjY7otWrXDWm3oOT9@mail.gmail.com><alpine.BSF.2.00.1006231033360.56372@epsilon.pair.com><alpine.BSF.2.00.1006231406010.30894@epsilon.pair.com><a06240802c848414681ef@192.168.2.104><381469.52475.qm@web87004.mail.ird.yahoo.com><a06240801c84949b70cb7@192.168.27.100><AANLkTilZj2UEffRwmvCrgnVbxrGwmsoqb9S7tw31MWSo@mail.gmail.com><984921.99613.qm@web87011.mail.ird.yahoo.com><AANLkTimLmnpS-HHP9en-zwUDeVKtbHSUJa36tUCOlQtL@mail.gmail.com><826180.50656.qm@web87010.mail.ird.yahoo.com><a0624 0803c84a8e4d89fc@[192.168.2.104]><563298.52532.qm@web87005.mail.ird.yahoo.com><8F77913624F7524AACD2A92EAF3BFA54166122952C@SJMEMXMBS11.stjude.sjcrh.local> <520427.68014.qm@web87001.mail.ird.yahoo.com><a06240800c84ac1b696bf@[192.168.2.104]><614241.93385.qm@web87016.mail.ird.yahoo.com><alpine.BSF.2.00.1006251827270.70846@epsilon.pair.com> <663654.63888.qm@web87001.mail.ird.yahoo.com><200973.94532.qm@web87008.mail.ird.yahoo.com>
Dear Simon,

The trick is to put a reasonable selection of accented characters into the check string. Most of the unaccented Roman characters are common to a very wide range of encodings. It happens that accented lower-case o's work fairly well for detecting a lot of the most common encodings. If you also explicitly state the intended encoding, the chances of a misidentification are probably as low as you are going to get. Nothing is perfect, but the combination of an encoding field and a transmission check is, I think, well worth considering.

Regards,
Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Sun, 27 Jun 2010, SIMON WESTRIP wrote:

> Dear Herbert
>
> I've been looking at heuristic approaches to detecting text encodings, as it's something I need for other work. However, my tests are not too encouraging. So I revisited your transmission check idea.
>
> If the file signature contains a string of text as you describe, encoded in whatever the underlying encoding is, then software that does not know the encoding could decode this string using a variety of encodings until it found a match? Equally, a human reading the CIF would immediately know whether the text editor has recognized the encoding correctly? I suppose the key to this is finding a string that is as encoding-dependent as possible, i.e. minimizing the number of matches?
>
> Have I interpreted this correctly? It seems quite neat to me.
>
> Cheers
>
> Simon
>
> ____________________________________________________________________________
> From: SIMON WESTRIP <simonwestrip@btinternet.com>
> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
> Sent: Saturday, 26 June, 2010 0:16:39
> Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .. .. .. .. .
>
> We don't need everybody to be doing the same thing. We need everybody to be able to send everybody else their information in a form in which other people can correctly understand what they have been sent.
>
> I totally agree with this - which is why I have advocated that the standard should be totally unambiguous and at the same time be as accessible as possible. I believe that I have expressed before an acceptance that we may have to adopt a certain degree of heuristic encoding determination in order to accommodate user practice; I do not shy away from this. I am, however, seeking a way to avoid, if possible, the ambiguity that code-page based encodings present.
>
> Cheers
>
> Simon
>
> PS. My comments about my 'garden shed' were meant to be 'light hearted' - it's been a long day! Please forgive me if this was inappropriate. I ought also to stress that in all this I do not speak for the IUCr, officially.
>
> ____________________________________________________________________________
> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
> Sent: Friday, 25 June, 2010 23:38:10
> Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .. .. .. .. .
>
> Dear Simon,
>
> The IUCr never has and probably never will make use of every feature in core CIF and mmCIF, much less what is allowed in all of CIF as it now exists. That's reasonable. You are publishing journals.
> Other people are maintaining archives, or producing experimental data sets, or processing data to refine structures.
>
> To now limit CIF to just the features that are needed for the publication process and can be managed by one bloke is neither in the interests of the IUCr nor of the broader user community. What we should be pursuing is a reasonable degree of commonality and interchange capability, not an inadequate lowest common denominator. That would result in a standard that is simply ignored, and a "standard" with many non-interoperable dialects -- which seems to be where we are headed. I would find that to be regrettable.
>
> We don't need everybody to be doing the same thing. We need everybody to be able to send everybody else their information in a form in which other people can correctly understand what they have been sent.
>
> Regards,
> Herbert
>
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
> Dowling College, Kramer Science Center, KSC 121
> Idle Hour Blvd, Oakdale, NY, 11769
>
> +1-631-244-3035
> yaya@dowling.edu
> =====================================================
>
> On Fri, 25 Jun 2010, SIMON WESTRIP wrote:
>
> > I really do not think the IUCr will adopt a policy of rejecting something that conforms to the CIF2 standard, whatever it may turn out to be. Indeed, I suspect they will be amongst the first to support the standard with as many tools as they can provide. I know very little about the imgCIF issue, but I suspect this is more a case of not having the resources than of any other motive. Brian has often asked me whether I could make use of imgCIF in publCIF, but I'm afraid this has not been a priority.
> > In an ideal world I would like to produce tools that were all things to all people, and a CIF that encapsulates everything needed for publication as well as everything needed to review or validate the structures described therein, but I'm one bloke working at home in his garden shed :-)
> >
> > Cheers
> >
> > Simon
> >
> > ____________________________________________________________________________
> > From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
> > To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
> > Sent: Friday, 25 June, 2010 21:30:10
> > Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .. .. .. .. .
> >
> > Then the solution is obvious -- have a CIF standard with some optional features that others of us will use, and have the IUCr instruct authors depositing manuscripts that it does not care about and will not use or check those features, just as the IUCr does not accept illustrations as imgCIF binaries.
> >
> > At 1:23 PM -0700 6/25/10, SIMON WESTRIP wrote:
> > > > I don't understand. How is it worse to provide authors an opportunity to specify the encoding they have used, even though they may specify wrongly, than it is to deny them an opportunity to specify the encoding at all?
> > >
> > > I don't think it is worse to provide them with an opportunity to specify their encoding - I just don't think they should need to.
> > >
> > > > How is it a worse or more impactful mistake for an author to include an incorrect encoding tag than it is for them to use an encoding different from some small set that you are prepared to accept?
> > >
> > > I am not saying that it is a worse or more impactful mistake - rather, if these signatures are to be part of the standard, then I can foresee errors being raised by an incorrect flag even when the rest of the CIF is encoded according to the specification.
> > > In my experience, authors already find CIF slightly annoying in that they have to adhere to seemingly pedantic rules (e.g. 'Monoclinic' should be 'monoclinic' because the dictionary enumeration is case sensitive, or <0.001 is not a number type). Requiring manually edited encoding signatures which will have to be checked is of no real help to anyone (no more than a 'hint')? Again, I feel that we have to respect that in the world of CIF, users have been required to edit raw CIF - this is rarely the case with XML, where end users are rightly unaware of the encoding they are using as they invariably work with tools that shield them from the raw XML. In the short/medium term at least, I do not see this situation changing.
> > >
> > > The reason I am prepared to accept 'some small set' is that I would like that set to be unambiguously identifiable, so that authors do not have to worry about such things, and in the hope that non-CIF-aware software might still do a good job of decoding the text, without employing heuristics, thereby minimizing the impact on current practice of specifying an encoding at all in the new spec.
> > >
> > > You might note that I often refer to CIF users as authors - this is my experience I'm afraid. It would be nice if the IUCr could exert as much first-hand control over CIF content as say the PDB, whose online data collection tools are used to populate mmCIFs, and whose users seem quite happy for them to do that. So I stress, my views on this are only based on experience with CIFs submitted to IUCr journals by authors.
> > >
> > > > > We're also further restricting the number of non-CIF-aware programs that can be used to read the text.
> > >
> > > > Can you expand on that? I don't follow you.
> > >
> > > I was referring to the practice of editing CIFs with any available text editor - however I concede that having an encoding flag makes no difference to non-CIF-aware programs - they will simply save the CIF in whatever is their default encoding if that is how they work.
> > >
> > > Cheers
> > >
> > > Simon
> > >
> > > From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
> > > To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
> > > Sent: Friday, 25 June, 2010 19:59:56
> > > Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .. .. .. .. .
> > >
> > > On Friday, June 25, 2010 12:41 PM, SIMON WESTRIP wrote:
> > > > It's using a field for specifying the encoding that worries me. Who is to make such a declaration in the CIF - an author who may be blissfully unaware of the encoding they're using? Or an author who is preparing a new CIF by editing an old one, again unaware that the text editor they are using is about to save the CIF in some other encoding? At least with UTF BOMs we have a fighting chance - I'd rather only accept these.
> > >
> > > I don't understand. How is it worse to provide authors an opportunity to specify the encoding they have used, even though they may specify wrongly, than it is to deny them an opportunity to specify the encoding at all?
> > >
> > > How is it a worse or more impactful mistake for an author to include an incorrect encoding tag than it is for them to use an encoding different from some small set that you are prepared to accept?
> > >
> > > > We're also further restricting the number of non-CIF-aware programs that can be used to read the text.
> > >
> > > Can you expand on that? I don't follow you.
> > >
> > > > You've also mentioned that we should learn from HTML - just because HTML has an encoding declaration does not mean it is correct, which is why browsers seem to apply their own heuristics to determine the encoding.
> > >
> > > I see no way to write the specification that can eliminate all possibility of encoding-related errors. None. All we can do is choose which errors are possible. In so doing, there are a lot of competing factors to consider, such as likelihood of various errors to be committed, coverage and robustness of the resulting spec, implied responsibilities of various parties, user convenience, and cultural sensitivity. I think when James's summary is ready it will help us sort through all that.
> > >
> > > Regards,
> > >
> > > John
> > > --
> > > John C. Bollinger, Ph.D.
> > > Department of Structural Biology
> > > St. Jude Children's Research Hospital
> > >
> > > Email Disclaimer: www.stjude.org/emaildisclaimer
> > > _______________________________________________
> > > ddlm-group mailing list
> > > ddlm-group@iucr.org
> > > http://scripts.iucr.org/mailman/listinfo/ddlm-group
> >
> > --
> > =====================================================
> > Herbert J. Bernstein, Professor of Computer Science
> > Dowling College, Kramer Science Center, KSC 121
> > Idle Hour Blvd, Oakdale, NY, 11769
> >
> > +1-631-244-3035
> > yaya@dowling.edu
> > =====================================================
> > _______________________________________________
> > ddlm-group mailing list
> > ddlm-group@iucr.org
> > http://scripts.iucr.org/mailman/listinfo/ddlm-group
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group