
Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. .... .. .. .

Dear Simon,

   The trick is to put some reasonable selection of accented characters
into the check string.  Most of the non-accented roman characters are
common to a very wide range of encodings.  It happens that accented
lower case o's work fairly well for detecting a lot of the most
common encodings.  If you also explicitly state the intended
encoding, the chances of a misidentification are probably
as low as you are going to get.  Nothing is perfect, but the
combination of an encoding field and a transmission check
is, I think, well worth considering.
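
A minimal sketch of such a transmission check, in Python, may make the
idea concrete. The check string, the list of candidate encodings, and
the names matching_encodings and check_declared are assumptions made for
illustration only; none of them comes from the CIF discussion itself.

```python
# Hypothetical check string: plain ASCII plus accented lower case o's
# (o-grave, o-acute, o-circumflex, o-tilde, o-diaeresis), written with
# escapes so this source stays ASCII-only.
EXPECTED = "check: \u00f2\u00f3\u00f4\u00f5\u00f6"

# Hypothetical candidate list; a real reader would choose its own.
CANDIDATES = ["utf-8", "utf-16", "latin-1", "cp1252",
              "mac_roman", "cp437", "cp850"]


def matching_encodings(raw, expected=EXPECTED, candidates=CANDIDATES):
    """Return every candidate encoding under which raw decodes to expected."""
    matches = []
    for name in candidates:
        try:
            if raw.decode(name) == expected:
                matches.append(name)
        except UnicodeDecodeError:
            # The bytes are not valid in this encoding, so it is ruled out.
            continue
    return matches


def check_declared(raw, declared, expected=EXPECTED):
    """Return True if raw decodes to expected under the declared encoding."""
    try:
        return raw.decode(declared) == expected
    except (UnicodeDecodeError, LookupError):
        # Invalid bytes, or an encoding name Python does not know.
        return False
```

With these candidates, the UTF-8 bytes of the check string match only
"utf-8", while the Latin-1 bytes match both "latin-1" and "cp1252",
since those two encodings agree on these characters. That residual
ambiguity is the case where an explicitly declared encoding, tested with
check_declared, narrows things further.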

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Sun, 27 Jun 2010, SIMON WESTRIP wrote:

> Dear Herbert
> 
> I've been looking at heuristic approaches to detecting text encodings as it's
> something I need for other work.
> However, my tests are not too encouraging.
> So I revisited your transmission check idea.
> 
> If the file signature contains a string of text as you describe, encoded in
> whatever the underlying encoding is,
> then software that does not know the encoding could decode this string using
> a variety of encodings until it found a match?
> Equally, a human reading the CIF would immediately know whether the text
> editor has recognized the encoding correctly?
> I suppose key to this is finding a string that is encoding-dependent as far
> as possible, i.e. minimizing the number of matches?
> 
> Have I interpreted this correctly? It seems quite neat to me.
> 
> Cheers
> 
> Simon
> 
> 
> 
> 
> ____________________________________________________________________________
> From: SIMON WESTRIP <simonwestrip@btinternet.com>
> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
> Sent: Saturday, 26 June, 2010 0:16:39
> Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. ..
> .. .. .. .
> 
> >  We don't need everybody to be doing the same thing.  We need everybody
> to be able to send everybody else their information in a form in which
> other people can correctly understand what they have been sent.
> 
> I totally agree with this - which is why I have advocated that the standard
> should be totally unambiguous and
> at the same time be as accessible as possible. I believe that I have
> expressed before an acceptance that we
> may have to adopt a certain degree of heuristic encoding determination in
> order to accommodate user practice;
> I do not shy away from this. I am, however, seeking a way to avoid, if
> possible, the ambiguity that code-page based
> encodings present.
> 
> Cheers
> 
> Simon
> 
> PS. My comments about my 'garden shed' were meant to be 'light
> hearted' - it's been a long day!
> Please forgive me if this was inappropriate. I ought also to stress that in
> all this I do not speak for the IUCr, officially.
> 
> 
> 
> ____________________________________________________________________________
> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
> Sent: Friday, 25 June, 2010 23:38:10
> Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. ..
> .. .. .. .
> 
> Dear Simon,
> 
>   The IUCr never has and probably never will make use of every feature in
> core CIF and mmCIF, much less what is allowed in all of CIF as it now
> exists.  That's reasonable.  You are publishing journals.  Other people
> are maintaining archives, or producing experimental data sets, or
> processing data to refine structures.
> 
>   To now limit CIF to just the features that are needed for the publication
> process and can be managed by one bloke is neither in the interests of the
> IUCr nor of the broader user community.  What we should be pursuing is a
> reasonable degree of commonality and interchange capability, not an
> inadequate lowest common denominator.  That would result in a standard that
> is simply ignored, and a "standard" with many non-interoperable dialects --
> which seems to be where we are headed.  I would find that to be regrettable.
> 
>   We don't need everybody to be doing the same thing.  We need everybody
> to be able to send everybody else their information in a form in which
> other people can correctly understand what they have been sent.
> 
>   Regards,
>     Herbert
> 
> =====================================================
> Herbert J. Bernstein, Professor of Computer Science
>   Dowling College, Kramer Science Center, KSC 121
>         Idle Hour Blvd, Oakdale, NY, 11769
> 
>                 +1-631-244-3035
>                 yaya@dowling.edu
> =====================================================
> 
> On Fri, 25 Jun 2010, SIMON WESTRIP wrote:
> 
> > I really do not think the IUCr will adopt a policy of rejecting something
> > that conforms to the CIF2 standard,
> > whatever it may turn out to be. Indeed, I suspect they will be amongst the
> > first to support the standard with
> > as many tools as they can provide. I know very little about the imgCIF
> > issue, but I suspect this is more of a case of
> > not having the resources rather than any other motive. Brian has often asked
> > me whether I could make use of imgCIF
> > in publCIF, but I'm afraid this has not been a priority. In an ideal world I
> > would like to produce tools that were all things to all people,
> > and a CIF that encapsulates everything needed for publication as well as
> > everything needed to review or validate the
> > structures described therein, but I'm one bloke working at home in his
> > garden shed :-)
> >
> > Cheers
> >
> > Simon
> >
> >____________________________________________________________________________
> > From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
> > To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
> > Sent: Friday, 25 June, 2010 21:30:10
> > Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. .. .. .. ..
> > .. .. .. .
> >
> > Then the solution is obvious -- have a CIF standard with some optional
> > feature that others of us will use, and have the IUCr instruct authors
> > depositing manuscripts that it does not care about and will not use or
> > check those features, just as the IUCr does not accept illustrations
> > as imgcif binaries.
> >
> >
> > At 1:23 PM -0700 6/25/10, SIMON WESTRIP wrote:
> > >>I don't understand.  How is it worse to provide authors an
> > >>opportunity to specify the encoding they have used, even though they
> > >>may specify wrongly, than it is to deny them an opportunity to
> > >>specify the encoding at all?
> > >
> > >I don't think it is worse to provide them with an opportunity to
> > >specify their encoding - I just don't think they should need to.
> > >
> > >>How is it a worse or more impactful mistake for an author to
> > >>include an incorrect encoding tag than it is for them to use an
> > >>encoding different from some small set that you are prepared to
> > >>accept?
> > >
> > >I am not saying that it is a worse or more impactful mistake -
> > >rather, if these signatures are to be part of the standard, then I
> > >can foresee errors being raised by an incorrect flag even when the
> > >rest of the CIF is encoded according to the specification. In my
> > >experience, authors already find CIF slightly annoying in that they
> > >have to adhere to seemingly pedantic rules (e.g. 'Monoclinic' should
> > >be 'monoclinic' because the dictionary enumeration is case
> > >sensitive, or <0.001 is not a number type). Requiring manually
> > >edited encoding signatures which will have to be checked is of no
> > >real help to anyone (no more than a 'hint')? Again, I feel that we
> > >have to respect that in the world of CIF, users have been required
> > >to edit raw CIF - this is rarely the case with xml, where end users
> > >are rightly unaware of the encoding they are using as they
> > >invariably work with tools that shield them from the raw xml. In the
> > >short/medium term at least, I do not see this situation changing.
> > >
> > >The reason I am prepared to accept 'some small set' is that I would
> > >like that set to be unambiguously identifiable, so that authors do
> > >not have to worry about such things, and in the hope that
> > >non-CIF-aware software might still do a good job of decoding the
> > >text, without employing heuristics, thereby minimizing the impact on
> > >current practice of specifying an encoding at all in the new spec.
> > >
> > >You might note that I often refer to CIF users as authors - this is
> > >my experience I'm afraid. It would be nice if the IUCr could exert
> > >as much first-hand control over CIF content as say the PDB, whose
> > >online data collection tools are used to populate mmCIFs, and whose
> > >users seem quite happy for them to do that. So I stress, my views on
> > >this are only based on experience with CIFs submitted to IUCr
> > >journals by authors.
> > >
> > >>>We're also further restricting the number of non-CIF-aware
> > >>>programs that can be used to read the text.
> > >
> > >>Can you expand on that?  I don't follow you.
> > >
> > >I was referring to the practice of editing CIFs with any available
> > >text editor - however I concede that having an encoding flag makes
> > >no difference to non-CIF-aware programs - they will simply save the
> > >CIF in whatever is their default encoding if that is how they work.
> > >
> > >Cheers
> > >
> > >Simon
> > >
> > >
> > >
> > >From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
> > >To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
> > >Sent: Friday, 25 June, 2010 19:59:56
> > >Subject: Re: [ddlm-group] options/text vs binary/end-of-line. .. ..
> > >.. .. .. .. .. .. .
> > >
> > >On Friday, June 25, 2010 12:41 PM, SIMON WESTRIP wrote:
> > >>It's using a field for specifying the encoding that worries me.
> > >>Who is to make such a declaration in the CIF - an author who may be
> > >>blissfully unaware of the encoding they're using?
> > >>Or an author who is preparing a new CIF by editing an old one,
> > >>again unaware that the text editor they are using is about to save
> > >>the CIF in some other encoding? At least with UTF BOMs we have a
> > >>fighting chance - I'd rather only accept these.
> > >
> > >I don't understand.  How is it worse to provide authors an
> > >opportunity to specify the encoding they have used, even though they
> > >may specify wrongly, than it is to deny them an opportunity to
> > >specify the encoding at all?
> > >
> > >How is it a worse or more impactful mistake for an author to include
> > >an incorrect encoding tag than it is for them to use an encoding
> > >different from some small set that you are prepared to accept?
> > >
> > >>We're also further restricting the number of non-CIF-aware programs
> > >>that can be used to read the text.
> > >
> > >Can you expand on that?  I don't follow you.
> > >
> > >>You've also mentioned that we should learn from HTML - just because
> > >>HTML has an encoding declaration does not mean it is correct,
> > >>which is why browsers seem to apply their own heuristics to
> > >>determine the encoding.
> > >
> > >I see no way to write the specification that can eliminate all
> > >possibility of encoding-related errors.  None.  All we can do is
> > >choose which errors are possible.  In so doing, there are a lot of
> > >competing factors to consider, such as likelihood of various errors to
> > >be committed, coverage and robustness of the resulting spec, implied
> > >responsibilities of various parties, user convenience, and cultural
> > >sensitivity.  I think when James's summary is ready it will help us
> > >sort through all that.
> > >
> > >
> > >Regards,
> > >
> > >John
> > >--
> > >John C. Bollinger, Ph.D.
> > >Department of Structural Biology
> > >St. Jude Children's Research Hospital
> > >
> > >Email Disclaimer:
> > ><http://www.stjude.org/emaildisclaimer>www.stjude.org/emaildisclaimer
> > >_______________________________________________
> > >ddlm-group mailing list
> > ><mailto:ddlm-group@iucr.org>ddlm-group@iucr.org
> > ><http://scripts.iucr.org/mailman/listinfo/ddlm-group>http://scripts.iucr.org/mailman/listinfo/ddlm-group
> > >
> > >
> >
> >
> > --
> > =====================================================
> >   Herbert J. Bernstein, Professor of Computer Science
> >     Dowling College, Kramer Science Center, KSC 121
> >         Idle Hour Blvd, Oakdale, NY, 11769
> >
> >                   +1-631-244-3035
> >                   yaya@dowling.edu
> > =====================================================
> >
> >
> 
>
_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
