Re: [ddlm-group] Semantics of whitespace-delimited values

To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] Semantics of whitespace-delimited values
From: "Bollinger, John C" <John.Bollinger@STJUDE.ORG>
Date: Fri, 24 Jul 2015 17:51:59 +0000
In-Reply-To: <55B269EB.4010402@rcsb.org>
References: <CY1PR0401MB0937C0A5DC4841C60F1704A3E0810@CY1PR0401MB0937.namprd04.prod.outlook.com><1904091894.1956488.1437749394149.JavaMail.yahoo@mail.yahoo.com><CY1PR0401MB093788F4E5EEDAB8B8DCE306E0810@CY1PR0401MB0937.namprd04.prod.outlook.com><55B269EB.4010402@rcsb.org>
Hi John,

More or less the whole point is that the interpretation of values is a matter of convention, with much, but not all, of that being bound up in dictionaries.  With respect to your current understanding,

(1) This is perfectly fine.  Indeed, if you do not otherwise distinguish between quoted and unquoted values, then it is necessary to do as you describe to recognize the special meanings for dot and question mark that apply when they are presented unquoted.  Doing it this way builds certain aspects of general and local semantic convention into your parser implementation, but as long as it serves your purposes there's nothing wrong with that.  Furthermore, it is harmless to apply the same treatment to both recognized and unrecognized data items, supposing that you're not going to do anything with the unrecognized ones anyway.

(2) You are not obligated to accept any specific format for numbers, and in particular, you are not obligated to accept a parenthesized standard uncertainty with your numbers.  This presents a problem only if it is unclear to other data producers and consumers what numeric format is required for each data item you handle.  In practice, this means the expected format for numbers should be part of each data definition, which I believe is the case for items defined in DDL2 dictionaries.

(3) You are not obligated to assign data types on input, based on quoting or otherwise.  I have argued that you *can* use quoting status as one of the factors determining data typing, based largely on the existence of published (in ITVG) CIF 1.1 convention saying you should do so, but convention does not carry the force of a rule.  If you want to use quoting status as a criterion for data typing, but you do not want to assign types on input, then you have the alternative of carrying the quoting status with the value string, as a separate attribute of the overall value.  All parsers I know *do* use quoting status on input for typing the special null values, however, as indeed you said yours does.  As with numbers specifically, any general distinction based on quoting status must be documented in a data definition, or otherwise generally accepted, for it to be viable.  Your particular software has a sufficiently central role that what it accepts and does not accept can be taken as *defining* what is "generally accepted", at least for the mmCIF and PDBx data items that it handles.
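To make the "separate attribute" alternative in (3) concrete, here is a minimal Python sketch, under stated assumptions, of a parsed value that carries its quoting status alongside its string and leaves typing to the caller. The names CifValue, as_number and allow_quoted are hypothetical, not taken from any existing CIF library, and the number pattern is a simplification rather than the ITVG production.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CifValue:
    """Hypothetical parsed value: the string plus how it was delimited."""
    text: str     # value string with any quote delimiters removed
    quoted: bool  # False when the value was whitespace-delimited

    def is_inapplicable(self) -> bool:
        # Only an unquoted dot carries the special "inapplicable" meaning.
        return not self.quoted and self.text == "."

    def is_unknown(self) -> bool:
        # Only an unquoted question mark carries the "unknown" meaning.
        return not self.quoted and self.text == "?"


# Simplified pattern (an assumption): a decimal number with an optional
# exponent and an optional parenthesized standard uncertainty.
_NUMBER = re.compile(
    r"^([+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)(?:\((\d+)\))?$"
)


def as_number(
    value: CifValue, allow_quoted: bool = True
) -> Optional[Tuple[float, Optional[int]]]:
    """Return (number, su_digits), or None if the value is not a number.

    su_digits is the raw integer found in parentheses, not yet scaled to
    the last decimal place. allow_quoted=False mimics consumers that
    follow the CIF 1.1 convention of not reading quoted values as numbers.
    """
    if value.is_inapplicable() or value.is_unknown():
        return None
    if value.quoted and not allow_quoted:
        return None
    match = _NUMBER.match(value.text)
    if match is None:
        return None
    su = int(match.group(2)) if match.group(2) is not None else None
    return float(match.group(1)), su
```

With this shape, a consumer that ignores quoting simply never looks at the quoted attribute, while one that follows the CIF 1.1 convention can pass allow_quoted=False; either choice is made by the client rather than fixed in the parser.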


Cheers,

John

> -----Original Message-----
> From: ddlm-group [mailto:ddlm-group-bounces@iucr.org] On Behalf Of
> john.westbrook@rcsb.org
> Sent: Friday, July 24, 2015 11:38 AM
> To: ddlm-group@iucr.org
> Subject: Re: [ddlm-group] Semantics of whitespace-delimited values
> 
> 
> This discussion is becoming rather confusing to me.   So with respect to my
> understanding and current
> PDBx/mmCIF usage:  (1) a dot (.)  and a question mark (?) are treated as
> special tokens in the grammar for handling null and missing values, (2)
> parenthetically appended uncertainties are not used in PDBx (e.g.
> xxx.xx(xx)) rather uncertainties are represented in separated data items, (3)
> On input we do not interpret or otherwise apply any data typing based on the
> quoting.  All data type interpretation at input is done via the data dictionary.
> 
> Regards,
> 
> John
> 
> On 7/24/15 12:17 PM, Bollinger, John C wrote:
> > That the BNF in ITVG contains a production for numeric values does not
> itself make CIF 1.1 numeric typing any less of a convention.
> > I half wish it did.  The production just provides the details of the
> > conventional format; it does not say that the same value must be
> > interpreted differently if you put it inside quotation marks.  It
> > doesn’t even really say that values matching that production must be
> > interpreted as numbers.  It only says, by way of its use in other
> productions, that a whitespace-delimited string having that form is a well-
> formed data value.  In that sense it is superfluous, and it makes the grammar
> ambiguous (reflecting the genuine CIF ambiguity that this is all about).
> >
> > It therefore does not follow that a strict CIF 2.0 parser, or even a strict CIF
> 1.1 parser, is required to reject “6.6666(6)” as a
> > number.  That is conventional, but the usefulness of that convention is in
> doubt.   It does not even follow that in _atom_site_label
> > 12, the “12” must be interpreted as a number.
> >
> > In any case, I think I’ve failed to make my point, because you (Simon)
> > “agree” with the opposite of my position.  A distinction between
> > quoted and unquoted data values is a de facto inherent aspect of CIF
> > 1.1 format, else the conventions for ., ?, and numeric format could
> > not work.  The conventions rely on that distinction to ascribe specific
> different significance to certain data values when they are presented
> unquoted than when they are presented quoted, but the underlying, basic
> distinction cannot be a matter of convention.  However broadly that inherent
> distinction applies, I do not want to change it in CIF 2.0.
> >
> > It is plausible -- and to me appealing -- to interpret the distinction
> > between quoted and unquoted data values to apply to all CIF
> > 1.1 values, with the general practice being to ignore it except in
> > certain cases.  That seems to fit more naturally with having the
> > conventions on top, too.  The main alternative is to say that it
> > applies only to the value data forms explicitly called out in the
> > conventions.  The only advantage I see in the narrow interpretation is to
> avoid embarrassment arising from at this point discovering a new, or at least
> forgotten, aspect of CIF.  That’s offset for me by the contrivance
> inherent in interpreting the feature to exactly fit the convention it supports.
> >
> > John
> >
> > *From:*ddlm-group [mailto:ddlm-group-bounces@iucr.org] *On Behalf Of
> > *SIMON WESTRIP
> > *Sent:* Friday, July 24, 2015 9:50 AM
> > *To:* Group finalising DDLm and associated dictionaries
> > *Subject:* Re: [ddlm-group] Semantics of whitespace-delimited values
> >
> > Upon reflection, I think I've been misleading in referring to numeric
> > typing in CIF1.1 as a 'common semantic feature' (in IntTabs G the BNF
> > contains a <numeric> rule). So unless this is to be changed, the
> specification of CIF2 does not necessarily need to change in this respect
> (though for completeness and to avoid confusion I still think it ought to be
> included in the EBNF if possible).
> >
> > So I suppose *strictly* a CIF2 parser, like CIF1.1, should not
> > recognize e.g. _cell_length_a "6.6666(6)" as a number, and should
> > always treat e.g. _atom_site_label 12 as a number? Or perhaps the
> interpretation is that _atom_site_label 12 could be a number or a string, but
> _atom_site_label '12' is definitely a string and cannot be a number?
> >
> > I agree "We could (*should*) say that CIF 2.0 removes such distinctions,
> except for (.) and (?)"
> >
> >
> >
> > Cheers
> >
> >
> >
> > Simon
> >
> > ----------------------------------------------------------------------
> > --------------------------------------------------------------
> >
> > *From:*"Bollinger, John C" <John.Bollinger@STJUDE.ORG
> > <mailto:John.Bollinger@STJUDE.ORG>>
> > *To:* Group finalising DDLm and associated dictionaries
> > <ddlm-group@iucr.org <mailto:ddlm-group@iucr.org>>
> > *Sent:* Friday, 24 July 2015, 15:13
> > *Subject:* Re: [ddlm-group] Semantics of whitespace-delimited values
> >
> > I have no objection to APIs being tolerant of numeric data in this way.
> >
> > I see no particular advantage to adding special productions to the
> > EBNF to match unquoted (.) and (?), however, as the current EBNF
> > already will match them as values just fine.  EBNF is good for
> > describing the language grammar and syntax, but it is not the right
> > mechanism for expressing semantics.  Putting these explicitly in the EBNF is
> in any case a secondary issue.  The primary one is whether the CIF format
> distinguishes quoted values from unquoted ones generally, or whether it
> distinguishes only certain special cases of quoting vs. non-quoting.
> >
> > We seem to agree that there’s no good way around the (.) and (?) cases,
> but I suspect we differ about the more general question.
> > Even though it is desirable for CIF parsers to be flexible about
> > numbers, the published CIF conventions say to distinguish between
> > quoted and unquoted values with respect to numeric interpretation.
> > That’s only a convention, so CIF software is not obligated to follow
> > it, but following it must be **allowed**, at least in CIF 1.1.  That
> > means that there indeed must be an actionable distinction between
> quoted and unquoted numbers.  With that being the case, I am inclined to
> make it a general distinction, even if it is one that is typically ignored, rather
> than a special case.  Moreover, I am inclined to say that there always has
> been such a distinction; it just hasn’t been used outside the numeric- and
> null-value cases.
> >
> > We could say that CIF 2.0 removes such distinctions, except for (.)
> > and (?), but I don’t really see the need for another break from CIF 1.1.
> >
> > John
> >
> > *From:*ddlm-group [mailto:ddlm-group-bounces@iucr.org] *On Behalf Of
> > *SIMON WESTRIP
> > *Sent:* Friday, July 24, 2015 7:27 AM
> > *To:* Group finalising DDLm and associated dictionaries
> > *Subject:* [ddlm-group] Semantics of whitespace-delimited values
> >
> > I agree - indeed the 'less-tolerant' applications in my little survey
> > used third-party APIs to read the CIF so it is probably the case that the API
> is being 'intolerant' rather than the application.
> >
> > I'd also be happy to see period and question mark in the EBNF -
> > after all these tokens when white-space delimited should never be
> > interpreted as the string values "?" or ".", so fundamentally they could be
> regarded as structural tokens, regardless of any other semantics associated
> with them.
> >
> > In general, with respect to the imminent introduction of CIF2, the
> > assumption will likely be that the common semantic features of
> > CIF1 will apply to CIF2, which is fair enough. However, personally I
> > would prefer that such semantics were kept distinctly separate from
> > the specification. For example, the CIF1 line-folding 'semantics' are
> > now part of the specification that a parser is expected to be aware
> > of, while the CIF1 character encoding semantics are purely conventions
> > that may be useful in certain domains (the parser really doesn't need to be
> aware of them). So if the CIF1 convention with respect to period and
> question marks is generally thought to be an inherent part of CIF, then it
> would be better placed in the specification that parsers should be aware of
> (i.e. parsers should be aware of these 'null' tokens and not simply return a "."
> or "?" with no context)?
> >
> > The same applies to numbers - if a parser is expected to unequivocally
> > identify numbers from the syntax, then this is no longer a 'common
> > semantic feature'. I believe that a parser needs minimally to identify a
> 'value', which can be interpreted further down the line.
> >
> > So perhaps the question boils down to: which (if any) of the semantic
> features of CIF1 would we expect a CIF2 parser to be aware of?
> >
> > Cheers
> >
> > Simon
> >
> > ----------------------------------------------------------------------
> > --------------------------------------------------------------
> >
> > *From:*James Hester <jamesrhester@gmail.com
> > <mailto:jamesrhester@gmail.com>>
> > *To:* SIMON WESTRIP <simonwestrip@btinternet.com
> > <mailto:simonwestrip@btinternet.com>>; Group finalising DDLm and
> > associated dictionaries <ddlm-group@iucr.org
> > <mailto:ddlm-group@iucr.org>>
> > *Sent:* Friday, 24 July 2015, 6:37
> > *Subject:* Re: [ddlm-group] Semantics of whitespace-delimited values
> >
> > Let me take up one of Simon's comments:
> >
> > "...we could suggest that CIF applications started to turn to the
> > dictionary rather than syntax to determine the exact nature of a data
> item".
> >
> > What is not perhaps appreciated is that the application programmer
> > accessing the CIF file searching for a numeric value for a particular
> > dataname has already consulted the dictionary when writing the program
> > (with the minor exception of e.g. pretty-printers as noted before).
> > Consulting the very same dictionary at runtime is pointless as these
> > meanings are never supposed to change. So I would suggest to Simon
> > that there is no problem nudging *application* programmers to accept
> > the dictionary definitions as they already have done so in order to write
> correct calculations, the only problem will be nudging CIF APIs to do their
> best to return a number if asked (and your survey would suggest that the
> majority already do).  My position is very strongly pro application
> programmer - if they are asking my API for a number, I am not going to
> second-guess them unless they have asked me to by providing a dictionary
> as well.
> >
> > I'd be happy to see period and question mark added to the EBNF as
> primitive productions, this is a simple change.
> >
> > On 9 July 2015 at 07:46, SIMON WESTRIP <simonwestrip@btinternet.com
> <mailto:simonwestrip@btinternet.com>> wrote:
> >
> > Dear all
> >
> > I extended the mini survey of current applications a little and looked closer
> at some of the less-liberal parsers:
> >
> > one of the applications I've looked at did not complain when I
> > included some non-ASCII text in the CIF, while another complained
> > about a data value constructed as '''z''' (valid CIF1), and one
> > displayed rather quirky behaviour with regard to semicolon-delimited
> > strings, rejecting the contained 'value' if it had a leading newline
> > but not if it had a leading space -
> >
> > all of these particular applications complained about delimited numbers to
> the extent that the application stopped processing.
> >
> > Based on this (albeit limited) survey of some current well-known CIF
> > applications, regarding the introduction of CIF2 it would definitely
> > be prudent to indicate that 'yes indeed' the interpretation of CIF1.1
> > whitespace-delimited values retains significance in CIF2. However, if
> > possible I think it would be in the interests of flexibility and unambiguity if
> somehow we could suggest that CIF applications started to turn to the
> dictionary rather than syntax to determine the exact nature of a data item
> (after all, as I see it, that's one very strong motivation for developing CIF2 in
> the first place - and is the preferred approach in CIF1 too).
> > Thankfully (from my point of view) this isn't even an issue for the
> > majority of applications I have looked at (they simply grab the data
> > however they've found it and make use of it if they can, or they
> > carefully validate the data against the dictionary). So what is
> > challenging me is how we achieve this - i.e. nudging some applications to be
> a little more flexible (which in my experience is what many 'users' would
> most appreciate), while at the same time maintaining the convention that
> numbers especially are still presented in an undelimited (uncluttered)
> fashion. I've no convincing answer to this yet.
> >
> > Regarding the ? and . 'null' values I hesitate to suggest that we
> > could take these out of the issue altogether by making them CIF key
> > tokens - I hesitate because I suspect that some applications simply
> > ignore their significance anyway and incorrect usage rarely presents a
> > real problem (and also I haven't yet attempted to see if it's actually
> > possible to define them in this way in any case :-)
> > - so it's probably unnecessary and may even seem like a new
> > complication to applications that were not particularly aware of, or
> bothered by, the significance of these tokens in the first place.
> >
> > Cheers
> >
> > Simon
> >
> > ----------------------------------------------------------------------
> > --------------------------------------------------------------
> >
> > *From:*"Bollinger, John C" <John.Bollinger@STJUDE.ORG
> > <mailto:John.Bollinger@STJUDE.ORG>>
> > *To:* Group finalising DDLm and associated dictionaries
> > <ddlm-group@iucr.org <mailto:ddlm-group@iucr.org>>; SIMON WESTRIP
> > <simonwestrip@btinternet.com <mailto:simonwestrip@btinternet.com>>
> > *Sent:* Wednesday, 8 July 2015, 16:21
> > *Subject:* RE: [ddlm-group] Semantics of whitespace-delimited values
> >
> > Thanks Simon, James, and John.  I am uncertain how many distinct
> > parsers are represented by the reports so far, but it seems there must be
> at least five.
> >
> > I think we agree that parsers and applications should be permitted, if
> > not required, to distinguish between the values . and '.', and between
> > the values ? and '?'.  We also seem to agree that it is not useful to
> > insist that parsers or applications refuse to interpret quoted values as
> numbers, although some CIF 1.1 parsers in fact do so at their own discretion,
> and some warn instead of refusing (with even that relying on taking the
> position that numbers are not supposed to be quoted).
> >
> > Not being enamored of special cases, and not wanting CIF 2.0 to rule
> > out CIF interpretation practice that is accepted and common in CIF 1.1
> > applications, I find myself favoring CIF 2.0 taking the position that
> > in general, it is permitted but not required to interpret any string
> > value differently when it is presented in whitespace-delimited form
> > than when it is presented in any of the other forms.  The conventions
> > for the special values . and ? could then be taken to apply on a
> > domain-wide basis, whereas the convention for the form of numbers could
> be taken to apply on a more selective basis (per-DDL, per-dictionary, or even
> per-definition). An implication of this position, however, is that whether or
> not a value is presented whitespace-delimited becomes a property of that
> value that a fully general CIF 2.0 parser must make available to its clients.
> Moreover, for better or for worse, future dictionaries could establish
> additional items or data types whose values are required to be presented
> unquoted.
> >
> > We could perhaps characterize that more specifically, maybe by saying
> > that the exact form of values presented in any of the quoted forms is
> > significant, or something along those lines, whereas values presented in
> whitespace-delimited form may afford equivalent alternative expressions.
> That doesn’t exactly fit . and ?, but perhaps some similar statement could do
> so better.
> >
> > John
> >
> > --
> >
> > John C. Bollinger, Ph.D.
> >
> > Computing and X-Ray Scientist
> >
> > Department of Structural Biology
> >
> > St. Jude Children's Research Hospital
> >
> > John.Bollinger@StJude.org <mailto:John.Bollinger@StJude.org>
> >
> > (901) 595-3166[office]
> >
> > www.stjude.org <http://www.stjude.org/>
> >
> >
> >
> > --
> >
> > T +61 (02) 9717 9907
> > F +61 (02) 9717 3145
> > M +61 (04) 0249 4148
> >
> > _______________________________________________
> > ddlm-group mailing list
> > ddlm-group@iucr.org <mailto:ddlm-group@iucr.org>
> > http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group
> >
> >
> 
> --
> John Westbrook, Ph.D.
> RCSB, Protein Data Bank
> Rutgers, The State University of New Jersey Department of Chemistry and
> Chemical Biology
> 174 Frelinghuysen Rd
> Piscataway, NJ 08854-8087
> e-mail: john.westbrook@rcsb.org
> Ph: (848) 445-4290 Fax: (732) 445-4320