
Re: [ddlm-group] Semantics of whitespace-delimited values

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] Semantics of whitespace-delimited values
From: "Bollinger, John C" <[email protected]>
Date: Mon, 3 Aug 2015 22:17:29 +0000
In-Reply-To: <[email protected]>
References: <BY2PR0401MB09361817139D43D10235DBB3E0770@BY2PR0401MB0936.namprd04.prod.outlook.com><[email protected]>

An extensive body of practice (substantially all DDL2 applications) indicates that 2.2.7.1 (10) is not to be taken at face value. I think it's best to interpret it as pertaining to the base 'numb' type, on which DDL2 does not rely. It is in any event about a relationship between syntax and semantics, thus neither wholly one nor wholly the other. It brings out the fact that it isn’t always possible to define a clean separation between the two. CIF’s attempts to do so are not wholly successful.

I stand by my earlier words, but I should clarify. The key point that I ineffectively attempted to make in my answer to question (2) was that you cannot rely on syntax *alone* to establish data typing, not even in a DDL1 application. That’s why the distinction does not occur at the syntactic level, the presence of the <Numeric> production in the grammar notwithstanding. A value matching that production does not necessarily have numeric type, and values not matching that production can be, and are, interpreted as numbers.

Syntax does contribute to data type determination in some cases, however, per 2.2.7.1 (within its scope) and 2.2.7.4.8. If that were not so then we would not be having this conversation. My interpretation of the specs and general practice renders 2.2.7.1 in particular relevant only to DDL1 applications, including checkCIF. As I wrote before, a DDL1 application that accepts quoted values as numeric thereby exercises an error recovery mechanism. That is useful and appropriate for some, but not for others.

I will not quibble too much over "can be sensitive" vs. "is sensitive". Both are true. One calls attention to the fact that some values will be interpreted identically, regardless of quoting, whereas the other emphasizes that some will be interpreted differently when presented quoted than when presented unquoted. I prefer the former because to me it more clearly describes the situation that the interpretations of some values are sensitive to quoting, but those of others aren’t.

Data typing is ultimately a determination to be made at the semantic level, as informed by the syntax and details of the value string. This has always been the case for CIF 1.1, and it does not change for files presented in CIF 2.0 syntax. Any needed clarification should be made for CIF 1.1.

Cheers,

John

From: ddlm-group [mailto:[email protected]] On Behalf Of SIMON WESTRIP
Sent: Monday, August 03, 2015 1:24 PM
To: Group finalising DDLm and associated dictionaries
Subject: Re: [ddlm-group] Semantics of whitespace-delimited values

I am confused - paragraph 10 of section 2.2.7 "Formal specification of the Crystallographic Information File" clearly states that numeric data are not to be quoted:

"10. A simple data value (i.e. one which does not
contain white space or begin with a special character string) may
optionally be delimited by any of the same set of delimiting character
strings, *except* for data values that are to be interpreted as
numbers."

This is part of section 2.2.7.1 Syntax - not 'semantics'.

So I believe the CIF 2.0 specification manuscript is wrong in its description of CIF1.1, i.e. "Interpretation of a CIF data value 'can be' sensitive to whether it is presented in whitespace-delimited form" - should be *is* sensitive to whether it is presented in whitespace-delimited form.

For better or for worse, the requirement that numbers are always delimited by whitespace is being recognized and enforced (checkCIF - perceived as the de facto standard in CIF validation - issues level A alerts when a numeric value is delimited by anything other than whitespace).

If this is purely a semantic feature, then for CIF2 we should make it clear - i.e. remove all confusion and any necessity to issue alerts or worse reject a CIF that is otherwise syntactically correct, or even semantically valid (e.g. with respect to dictionary definitions)?

Cheers

Simon

From: "Bollinger, John C" <[email protected]>
To: Group finalising DDLm and associated dictionaries <[email protected]>
Sent: Monday, 3 August 2015, 17:49
Subject: Re: [ddlm-group] Semantics of whitespace-delimited values

1) The fodder for this debate is for the most part the CIF 1.1 specifications and the behavior of existing CIF 1.1 applications. My argument therefore applies first and foremost to CIF 1.1. On the other hand, we have not approved or before now even really discussed any change in this area for CIF 2, and I do not think there is any interest in making a change, so whatever is decided for CIF 1.1 probably will apply to CIF 2.0 as well. Nevertheless, the last sentence of section 2 of the CIF 2.0 specification manuscript already does address this: "Interpretation of a CIF data value can be sensitive to whether it is presented in whitespace-delimited form." What we are discussing now is essentially whether that should be narrowed (or, perhaps, emphasized). More on that below.

2) We do not agree that the distinction between number and character types is at the syntactic level in CIF 1.1 or above. All text in ITVG 2.2 that discusses data typing falls under one of two "common semantic features" headings, 2.2.5 and 2.2.7.4. The grammar does contain a production for numbers, but that can and should be interpreted as merely supporting the "common semantic features" prose. Section 2.2.7.4.7.1 (17) cannot otherwise be effective, and the consensus interpretation is that it *is* effective. CIF 2.0 isn’t changing anything here, but I do hope that omission of a numeric format from the CIF 2.0 EBNF will help reduce confusion in this regard.

3) None of the existing DDLs has a formal mechanism for saying anything about data value quoting. My understanding is that DDL2 applications are expected to altogether ignore the character/number typing described in ITVG 2.2, however, and indeed to ignore distinctions between quoted and unquoted values except for '.' and '?'. Effectively, that’s implicit in using a DDL2 dictionary, at least for the items defined by such a dictionary, and it is inconsistent with syntactic data typing. Practice is less uniform among DDL1 applications, as the behavior survey showed. I’d be inclined to say that ITVG 2.2 data typing is implicit at least in the core dictionary, and probably in all DDL1 dictionaries. In practice, many -- but not all -- DDL1 applications are more accepting, which I think is best characterized as their choice of error-handling behavior. Overall, then, I’m saying that the applicable DDL implicitly directs this detail of value interpretation.

This is certainly an area that would benefit from clarification. I see only two viable interpretations of ITVG 2.2 as it pertains to the significance of data value quotation:

A) syntactically, CIF 1.1 data values have exactly two properties, which can be characterized as a string of characters and a flag indicating whether the string is quoted; or

B) syntactically, CIF 1.1 data values have several properties (which can be counted various ways), describing at least: a string of characters, whether that string matches one or more numeric formats, whether the string consists of a single decimal point, whether the string consists of a single question mark, and whether that string is presented quoted. The last can be suppressed for values that don’t have any of the given specific forms (the advantage to doing so being unclear to me), but it is nevertheless implicit in some other data value forms, such as those matching data names.

Interpretation (B) or something close to it is required if one wants the result that quoting can affect value interpretation only for numbers and the special null values, yet doesn’t necessarily do so for numbers. It is the narrowest interpretation that can work. Interpretation (A) permits more distinctions to be drawn between quoted and unquoted values, but does not inherently require such a distinction to be made in any particular case. It is the simplest specialization of STAR grammar that can work. I prefer (A).
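A minimal sketch of what interpretation (A) could look like in a parser's data model, assuming a hypothetical CifValue class (the class name and its methods are illustrative only and are not taken from any existing CIF library). The parser records just the characters and the quoting flag; whether the flag matters is left to whatever interprets the value later.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CifValue:
    """A parsed data value under interpretation (A): text plus a quoting flag."""
    text: str     # the characters of the value, without any delimiters
    quoted: bool  # True if the value was presented in a delimited (quoted) form

    def is_unknown(self) -> bool:
        # Only the whitespace-delimited ? is the special "unknown" value.
        return not self.quoted and self.text == "?"

    def is_inapplicable(self) -> bool:
        # Only the whitespace-delimited . is the special "inapplicable" value.
        return not self.quoted and self.text == "."


# The same characters, presented unquoted and quoted, stay distinguishable,
# but nothing here forces a consumer to treat them differently.
bare = CifValue("12", quoted=False)
label = CifValue("12", quoted=True)
```

Under this sketch the null-value conventions fall out of the general flag rather than needing a special case, and an application that does not care about quoting can simply read the text attribute.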

Any way around, all parsers I know about apply at least a bit of semantic interpretation at parse time. In particular, although behavior varies with respect to parse-time numeric interpretation, I don’t know any CIF parser that fails to provide special handling for the null values. This is not a problem. CIF parsers need only serve the purposes for which they are intended; they are not required to be completely general. Even those that are designed to serve general purposes may still take into account the shape of the problem space, as reflected by the current DDLs and practices.

John

John C. Bollinger, Ph.D.

Computing and X-Ray Scientist

Department of Structural Biology

St. Jude Children's Research Hospital

[email protected]

(901) 595-3166 [office]

www.stjude.org

From: ddlm-group [mailto:[email protected]] On Behalf Of SIMON WESTRIP
Sent: Friday, July 31, 2015 1:27 PM
To: Group finalising DDLm and associated dictionaries
Subject: Re: [ddlm-group] Semantics of whitespace-delimited values

John asserts that:

"I nevertheless maintain, however, that these and other parts of the specs absolutely make it *permissible* to ascribe different significance to quoted and unquoted values, especially with regard to data typing. Since dictionaries are the primary vehicle for type specification, that means dictionaries are empowered to distinguish values and their types based on quoting status, whether or not they take any advantage of that power."

Three questions:

1) If this is to hold true for CIF2, should we make this clear in the specification of CIF2 - importantly that a robust parser must return the quoting status (context) of a data value to the calling application?

2) If the distinction between number and character types at the 'syntactical' level (which I think we agree is part of the 'non-semantic' CIF1.1 specification) is not to be enforced in CIF2, should this be made clear in the specification of CIF2?

NB. Section 2.2.7.4.7.1 (17) is part of the "common semantic features", which on the whole are largely inappropriate for CIF2, and indeed are not entirely ideal for CIF1 (though have proved 'workable'). To clarify: I currently happen to be working on a new CIF1.1 application and find myself wishing it could use CIF2 - i.e. some of the text-field semantics of CIF1.1 can be a bit of a nightmare for developers and users alike, e.g.

(i) representation of a caret is not really possible according to the semantics - i.e. \^ is a combining circumflex, while an unescaped caret ^ signifies the start of a superscript;

(ii) <i> indicates italic, so to represent the same sequence according to the semantics you would have to use \\langle i>, which isn't strictly the same thing;

(iii) a tilde similarly has to be interpreted according to context...;

(iv) the description of line-folding semantics uses C:\foldername\filename as an example - is this C:<phi>oldername<phi>ilename? I'm being pedantic I know - especially as these 'semantics' are no longer relevant to CIF2....

Main point is the distinction between 'semantics' and 'specification' - semantics as described for CIF1 are very much optional - but the base character/number type distinction is outside the 'semantic' description so is not optional in CIF1.

3) Lastly, is it possible to specify in a dictionary how a value should be quoted in the syntax?

Cheers

Simon

From: "Bollinger, John C" <[email protected]>
To: Group finalising DDLm and associated dictionaries <[email protected]>
Sent: Friday, 31 July 2015, 17:16
Subject: Re: [ddlm-group] Semantics of whitespace-delimited values

The authoritative specification of CIF 1.1 is section 2.2 of ITVG, and ITVG 2.2.7.1.4 contains pretty much the same text. In practice, however, the quoted provisions yield little practical benefit, and they have been rendered largely moot by other specifications and by widespread practice. In particular, the specifications elsewhere assign primary responsibility for data type determination to dictionaries. Section 2.2.7.4.7.1 (17) is explicit about dictionaries taking precedence over syntactic type determination:

"Where the attributes of a data value are not available in a dictionary listing, it may be assumed that a character string interpretable as a number should be taken to represent an item of type 'numb'. However, an explicit dictionary declaration of type will override such an assumption. "

One could argue that paragraphs 10 and 13 of section 2.2.7.1.4 require that quoted character strings not be considered "interpretable as a number", and indeed some parsers take that approach, but the prevailing interpretation is different. Few parsers used by prevalent programs interpret the specs as forbidding them to interpret quoted strings as numbers, ITVG 2.2.7.1.4 (10-13) notwithstanding.
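The precedence described in 2.2.7.4.7.1 (17) can be sketched roughly as follows. This is only an illustration under stated assumptions: the function resolve_type, its parameters, the plain dict standing in for a dictionary, and the simplified number pattern are all hypothetical, and the pattern does not reproduce the full <Numeric> production. The strict_quoting switch models the stricter reading of paragraphs 10 and 13; leaving it off models the prevailing practice.

```python
import re

# Simplified stand-in for "interpretable as a number": optional sign, digits
# with an optional decimal point, optional exponent, optional (s.u.) suffix.
_NUMBER_LIKE = re.compile(
    r'^[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?(?:\(\d+\))?$'
)


def resolve_type(name, text, quoted=False, dictionary=None, strict_quoting=False):
    """Return 'numb' or 'char' for a data value (hypothetical helper)."""
    declared = (dictionary or {}).get(name)
    if declared is not None:
        # An explicit dictionary declaration overrides any syntactic guess.
        return declared
    if quoted and strict_quoting:
        # Strict reading: a quoted value is never interpretable as a number.
        return "char"
    return "numb" if _NUMBER_LIKE.match(text) else "char"


# With no dictionary entry, a number-like string is assumed to be 'numb'.
print(resolve_type("_cell_length_a", "6.6666(6)"))
# A dictionary declaration takes precedence over the form of the value.
print(resolve_type("_atom_site_label", "12",
                   dictionary={"_atom_site_label": "char"}))
```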

Furthermore, these days I hold a rather broad interpretation of what a "dictionary" is. Certainly there are the DDL[12m]-format external dictionary files, but also, the many programs that hard-code usage of specific data items for specific purposes thereby build in a collection of data item definitions that should be considered a local dictionary. From that perspective, ITVG 2.2.7.1.4 is applicable only to values that are not subjected to any specific interpretation. Such values can be used at all only by programs that perform generic CIF manipulations such as pretty printing, and those must take care to preserve values’ forms exactly because they do not know which aspects of those forms -- including quoting status -- are significant.

I nevertheless maintain, however, that these and other parts of the specs absolutely make it *permissible* to ascribe different significance to quoted and unquoted values, especially with regard to data typing. Since dictionaries are the primary vehicle for type specification, that means dictionaries are empowered to distinguish values and their types based on quoting status, whether or not they take any advantage of that power.

John

From: ddlm-group [mailto:[email protected]] On Behalf Of SIMON WESTRIP
Sent: Saturday, July 25, 2015 5:47 AM
To: Group finalising DDLm and associated dictionaries
Subject: Re: [ddlm-group] Semantics of whitespace-delimited values

I think we need to clarify what CIF1.1 specifies in this respect. From what I can see, the distinction between character types and numeric types is part of the base specification.

From http://www.iucr.org/resources/cif/spec/version1.1/cifsyntax:

"10. A simple data value (i.e. one which does not
contain white space or begin with a special character string) may
optionally be delimited by any of the same set of delimiting character
strings, *except* for data values that are to be interpreted as
numbers."
...
"12. The complete syntactic description of a numeric data value is
included in Appendix A (paragraph 57) under
the production (i.e. rule for constructing a part of the
language) <Numeric>.
13. The base CIF specification distinguishes between character and
numeric values (see paragraph 15 of the document Common semantic
features). Particular CIF applications may make more
finely-grained distinctions within these types. The paragraphs immediately
above have the corollary that a data value such as 12 that
appears within a CIF may be quoted (e.g. '12') *if, and only if*
it is to be interpreted and stored in computer memory as a
character string and not a numeric value. For example '12' might
legitimately appear as a label for an atomic site, where another
alphabetic or alphanumeric string such as 'C12' is also
acceptable; but it may *not* legitimately be used to represent an
integer quantity twelve."

So although reference is made to 'common semantic features', it still seems clear to me that
the base specification forbids the quoting of numbers (paragraph 10).

For CIF2, I would like to drop this as a rule and make it clear that it is *only a convention*.
Given current practice, I do not see that this is an unreasonable request.

Otherwise, I'm afraid I am guilty of occasionally storing e.g. '12' as a number 'in computer memory' when I should only ever treat it as a string, and I regularly handle e.g. 12 as a string ;-)

Cheers

Simon

From: SIMON WESTRIP <[email protected]>
To: SIMON WESTRIP <[email protected]>; Group finalising DDLm and associated dictionaries <[email protected]>
Sent: Friday, 24 July 2015, 20:11
Subject: Re: [ddlm-group] Semantics of whitespace-delimited values

In an attempt to reinforce my argument, does anyone know of a programming language that will read 6.6666(6) as a float with an associated s.u.? As far as I can tell, such 'numbers' will require additional parsing to transform them into a type that is useable by a program? Furthermore, if I want to round-up such a value (as journals do for presentation), I may well need to treat it as a 'string' in order to avoid floating-point errors (depending on the programming languages and associated libraries I'm using).
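To illustrate the extra parsing involved, here is a minimal sketch in Python. The function name parse_value_su and the pattern are hypothetical, and only plain decimal forms are handled (no exponents); it keeps the value as a decimal string-based type so that rounding for presentation avoids binary floating-point artefacts.

```python
import re
from decimal import Decimal, ROUND_HALF_UP

# Plain decimal number with an optional parenthesised s.u., e.g. 6.6666(6).
# Exponent forms are deliberately left out of this sketch.
_VALUE_SU = re.compile(r'^([+-]?(?:\d+\.?\d*|\.\d+))(?:\((\d+)\))?$')


def parse_value_su(text):
    """Split 'value(su)' into (value, su) Decimals; su is None if absent."""
    match = _VALUE_SU.match(text)
    if match is None:
        raise ValueError(f"not a number with optional s.u.: {text!r}")
    mantissa, su_digits = match.groups()
    value = Decimal(mantissa)
    if su_digits is None:
        return value, None
    # The s.u. digits apply to the last decimal places of the value,
    # so scale them by the value's exponent (-4 for 6.6666).
    su = Decimal(su_digits).scaleb(value.as_tuple().exponent)
    return value, su


value, su = parse_value_su("6.6666(6)")
# Decimal rounding to three places, without going through a binary float.
rounded = value.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)
```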

Cheers

Simon

From: SIMON WESTRIP <[email protected]>
To: Group finalising DDLm and associated dictionaries <[email protected]>
Sent: Friday, 24 July 2015, 19:24
Subject: Re: [ddlm-group] Semantics of whitespace-delimited values

I agree with John Westbrook - this is how I would like it to be with CIF2:

"All data type interpretation at input is done via the data dictionary"

CIF is a data container not a programming language. The data container does not have to be aware of data types - it simply needs to unambiguously tag (in the case of CIF) and store the data so that it may be retrieved reliably according to the specified syntax.

There is really no need these days for the data container (in this case it's text-based to all intents and purposes) to concern itself with the type of value that it stores, as long as it is stored and retrievable unambiguously. The power of CIF lies in its dictionaries, and especially for CIF2 in the methods that they describe.

When it comes down to it, CIF2 is not backwards compatible with CIF1 (the use of UTF8 in particular is a potential pitfall), so why not take the opportunity to address some of those issues that still today propagate a negative attitude to CIF (all too often I've been on the receiving end of criticism regarding CIF and its 'idiosyncrasies', whether justified or not; and only recently I've read some concerns regarding the PDB's preference for CIF rather than PDB format in the macromolecular community - to my mind unjustified but nonetheless the perception is not favourable in some quarters).

At the risk of sounding as if I'm 'ranting', I am worried that some of the CIF2 changes will not be received that favourably by developers (for example, the list and table structures are so close to JSON structures that an obvious question is why not just follow JSON - for what it's worth my answer would be that lists would simply require the same parsing as loops, and much the same for tables, except for the colon... :-)

I think we have an opportunity with CIF2 to simplify the basic data storage format, while adding power through the enhanced dictionary format. We have a general precedent in STAR2 for these changes, so why not break a little more from CIF1.

Cheers

Simon

From: "[email protected]" <[email protected]>
To: [email protected]
Sent: Friday, 24 July 2015, 17:38
Subject: Re: [ddlm-group] Semantics of whitespace-delimited values

This discussion is becoming rather confusing to me. So with respect to my understanding and current
PDBx/mmCIF usage: (1) a dot (.) and a question mark (?) are treated as special tokens in the grammar for handling
null and missing values, (2) parenthetically appended uncertainties are not used in PDBx (e.g. xxx.xx(xx))
rather uncertainties are represented in separated data items, (3) On input we do not interpret or otherwise apply
any data typing based on the quoting. All data type interpretation at input is done via the data dictionary.

Regards,

John

On 7/24/15 12:17 PM, Bollinger, John C wrote:
> That the BNF in ITVG contains a production for numeric values does not itself make CIF 1.1 numeric typing any less of a convention.
> I half wish it did. The production just provides the details of the conventional format; it does not say that the same value must
> be interpreted differently if you put it inside quotation marks. It doesn’t even really say that values matching that production
> must be interpreted as numbers. It only says, by way of its use in other productions, that a whitespace-delimited string having
> that form is a well-formed data value. In that sense it is superfluous, and it makes the grammar ambiguous (reflecting the genuine
> CIF ambiguity that this is all about).
>
> It therefore does not follow that a strict CIF 2.0 parser, or even a strict CIF 1.1 parser, is required to reject “6.6666(6)” as a
> number. That is conventional, but the usefulness of that convention is in doubt. It does not even follow that in _atom_site_label
> 12, the “12” must be interpreted as a number.
>
> In any case, I think I’ve failed to make my point, because you (Simon) “agree” with the opposite of my position. A distinction
> between quoted and unquoted data values is a de facto inherent aspect of CIF 1.1 format, else the conventions for ., ?, and numeric
> format could not work. The conventions rely on that distinction to ascribe specific different significance to certain data values
> when they are presented unquoted than when they are presented quoted, but the underlying, basic distinction cannot be a matter of
> convention. However broadly that inherent distinction applies, I do not want to change it in CIF 2.0.
>
> It is plausible -- and to me appealing -- to interpret the distinction between quoted and unquoted data values to apply to all CIF
> 1.1 values, with the general practice being to ignore it except in certain cases. That seems to fit more naturally with having the
> conventions on top, too. The main alternative is to say that it applies only to the value data forms explicitly called out in the
> conventions. The only advantage I see in the narrow interpretation is to avoid embarrassment arising from at this point discovering
> a new, or at least forgotten, aspect of CIF. That’s offset for me by the contrivance inherent in interpreting the feature
> to exactly fit the convention it supports.
>
> John
>
> *From:*ddlm-group [mailto:[email protected]] *On Behalf Of *SIMON WESTRIP
> *Sent:* Friday, July 24, 2015 9:50 AM
> *To:* Group finalising DDLm and associated dictionaries
> *Subject:* Re: [ddlm-group] Semantics of whitespace-delimited values
>
> Upon reflection, I think I've been misleading in referring to numeric typing in CIF1.1 as a 'common semantic feature' (in IntTabs G
> the BNF contains a <numeric> rule). So unless this is to be changed, the specification of CIF2 does not necessarily need to change
> in this respect (though for completeness and to avoid confusion I still think it ought to be included in the EBNF if possible).
>
> So I suppose *strictly* a CIF2 parser, like CIF1.1, should not recognize e.g. _cell_length_a "6.6666(6)" as a number, and should
> always treat e.g. _atom_site_label 12 as a number? Or perhaps the interpretation is that _atom_site_label 12 could be a number or a
> string, but _atom_site_label '12' is definitely a string and cannot be a number?
>
> I agree "We could (*should*) say that CIF 2.0 removes such distinctions, except for (.) and (?)"
>
> Cheers
>
> Simon
>
> ------------------------------------------------------------------------------------------------------------------------------------
>
> *From:*"Bollinger, John C" <[email protected] <mailto:[email protected]>>
> *To:* Group finalising DDLm and associated dictionaries <[email protected] <mailto:[email protected]>>
> *Sent:* Friday, 24 July 2015, 15:13
> *Subject:* Re: [ddlm-group] Semantics of whitespace-delimited values
>
> I have no objection to APIs being tolerant of numeric data in this way.
>
> I see no particular advantage to adding special productions to the EBNF to match unquoted (.) and (?), however, as the current EBNF
> already will match them as values just fine. EBNF is good for describing the language grammar and syntax, but it is not the right
> mechanism for expressing semantics. Putting these explicitly in the EBNF is in any case a secondary issue. The primary one is
> whether the CIF format distinguishes quoted values from unquoted ones generally, or whether it distinguishes only certain special
> cases of quoting vs. non-quoting.
>
> We seem to agree that there’s no good way around the (.) and (?) cases, but I suspect we differ about the more general question.
> Even though it is desirable for CIF parsers to be flexible about numbers, the published CIF conventions say to distinguish between
> quoted and unquoted values with respect to numeric interpretation. That’s only a convention, so CIF software is not obligated to
> follow it, but following it must be **allowed**, at least in CIF 1.1. That means that there indeed must be an actionable
> distinction between quoted and unquoted numbers. With that being the case, I am inclined to make it a general distinction, even if
> it is one that is typically ignored, rather than a special case. Moreover, I am inclined to say that there always has been such a
> distinction; it just hasn’t been used outside the numeric- and null-value cases.
>
> We could say that CIF 2.0 removes such distinctions, except for (.) and (?), but I don’t really see the need for another break from
> CIF 1.1.
>
> John
>
> *From:*ddlm-group [mailto:[email protected]] *On Behalf Of *SIMON WESTRIP
> *Sent:* Friday, July 24, 2015 7:27 AM
> *To:* Group finalising DDLm and associated dictionaries
> *Subject:* [ddlm-group] Semantics of whitespace-delimited values
>
> I agree - indeed the 'less-tolerant' applications in my little survey used third-party APIs to read the CIF so it is probably the
> case that the API is being 'intolerant' rather than the application.
>
> I'd also be happy to see period and question mark in the EBNF - after all these tokens when white-space delimited should never be
> interpreted as the string values "?" or ".", so fundamentally they could be regarded as structural tokens, regardless of any other
> semantics associated with them.
>
> In general, with respect to the imminent introduction of CIF2, the assumption will likely be that the common semantic features of
> CIF1 will apply to CIF2, which is fair enough. However, personally I would prefer that such semantics were kept distinctly separate
> from the specification. For example, the CIF1 line-folding 'semantics' are now part of the specification that a parser is expected
> to be aware of, while the CIF1 character encoding semantics are purely conventions that may be useful in certain domains (the parser
> really doesn't need to be aware of them). So if the CIF1 convention with respect to period and question marks is generally thought
> to be an inherent part of CIF, then it would be better placed in the specification that parsers should be aware of (i.e. parsers
> should be aware of these 'null' tokens and not simply return a "." or "?" with no context)?
>
> The same applies to numbers - if a parser is expected to unequivocally identify numbers from the syntax, then this is no longer a
> 'common semantic feature'. I believe that a parser needs minimally to identify a 'value', which can be interpreted further down the
> line.
>
> So perhaps the question boils down to: which (if any) of the semantic features of CIF1 would we expect a CIF2 parser to be aware of?
>
> Cheers
>
> Simon
>
> ------------------------------------------------------------------------------------------------------------------------------------
>
> *From:*James Hester <[email protected] <mailto:[email protected]>>
> *To:* SIMON WESTRIP <[email protected] <mailto:[email protected]>>; Group finalising DDLm and associated
> dictionaries <[email protected] <mailto:[email protected]>>
> *Sent:* Friday, 24 July 2015, 6:37
> *Subject:* Re: [ddlm-group] Semantics of whitespace-delimited values
>
> Let me take up one of Simon's comments:
>
> "...we could suggest that CIF applications started to turn to the dictionary rather than syntax to determine the exact nature of a
> data item".
>
> What is not perhaps appreciated is that the application programmer accessing the CIF file searching for a numeric value for a
> particular dataname has already consulted the dictionary when writing the program (with the minor exception of e.g. pretty-printers
> as noted before). Consulting the very same dictionary at runtime is pointless as these meanings are never supposed to change. So I
> would suggest to Simon that there is no problem nudging *application* programmers to accept the dictionary definitions as they
> already have done so in order to write correct calculations, the only problem will be nudging CIF APIs to do their best to return a
> number if asked (and your survey would suggest that the majority already do). My position is very strongly pro application
> programmer - if they are asking my API for a number, I am not going to second-guess them unless they have asked me to by providing a
> dictionary as well.
>
> I'd be happy to see period and question mark added to the EBNF as primitive productions, this is a simple change.
>
> On 9 July 2015 at 07:46, SIMON WESTRIP <[email protected] <mailto:[email protected]>> wrote:
>
> Dear all
>
> I extended the mini survey of current applications a little and looked closer at some of the less-liberal parsers:
>
> one of the applications I've looked at did not complain when I included some non-ASCII text in the CIF, while another complained
> about a data value constructed as '''z''' (valid CIF1), and one displayed rather quirky behaviour with regard to semicolon-delimited
> strings, rejecting the contained 'value' if it had a leading newline but not if it had a leading space -
>
> all of these particular applications complained about delimited numbers to the extent that the application stopped processing.
>
> Based on this (albeit limited) survey of some current well-known CIF applications, regarding the introduction of CIF2 it would
> definitely be prudent to indicate that 'yes indeed' the interpretation of CIF1.1 whitespace-delimited values retains significance
> in CIF2. However, if possible I think it would be in the interests of flexibility and unambiguity if somehow we could suggest that
> CIF applications started to turn to the dictionary rather than syntax to determine the exact nature of a data item (after all, as I
> see it, that's one very strong motivation for developing CIF2 in the first place - and is the preferred approach in CIF1 too).
> Thankfully (from my point of view) this isn't even an issue for the majority of applications I have looked at (they simply grab the
> data however they've found it and make use of it if they can, or they carefully validate the data against the dictionary). So what
> is challenging me is how we achieve this - i.e. nudging some applications to be a little more flexible (which in my experience is
> what many 'users' would most appreciate), while at the same time maintaining the convention that numbers especially are still
> presented in an undelimited (uncluttered) fashion. I've no convincing answer to this yet.
>
> Regarding the ? and . 'null' values I hesitate to suggest that we could take these out of the issue altogether by making them CIF
> key tokens - I hesitate because I suspect that some applications simply ignore their significance anyway and incorrect usage rarely
> presents a real problem (and also I haven't yet attempted to see if it's actually possible to define them in this way in any case:-)
> - so it's probably unnecessary and may even seem like a new complication to applications that were not particularly aware of, or
> bothered by, the significance of these tokens in the first place.
>
> Cheers
>
> Simon
>
> ------------------------------------------------------------------------------------------------------------------------------------
>
> *From:* "Bollinger, John C" <[email protected]>
> *To:* Group finalising DDLm and associated dictionaries <[email protected]>; SIMON WESTRIP
> <[email protected]>
> *Sent:* Wednesday, 8 July 2015, 16:21
> *Subject:* RE: [ddlm-group] Semantics of whitespace-delimited values
>
> Thanks Simon, James, and John. I am uncertain how many distinct parsers are represented by the reports so far, but it seems there
> must be at least five.
>
> I think we agree that parsers and applications should be permitted, if not required, to distinguish between the values . and '.',
> and between the values ? and '?'. We also seem to agree that it is not useful to insist that parsers or applications refuse to
> interpret quoted values as numbers, although some CIF 1.1 parsers in fact do so at their own discretion, and some warn instead of
> refusing (with even that relying on taking the position that numbers are not supposed to be quoted).
>
> Not being enamored of special cases, and not wanting CIF 2.0 to rule out CIF interpretation practice that is accepted and common in
> CIF 1.1 applications, I find myself favoring CIF 2.0 taking the position that in general, it is permitted but not required to
> interpret any string value differently when it is presented in whitespace-delimited form than when it is presented in any of the
> other forms. The conventions for the special values . and ? could then be taken to apply on a domain-wide basis, whereas the
> convention for the form of numbers could be taken to apply on a more selective basis (per-DDL, per-dictionary, or even
> per-definition). An implication of this position, however, is that whether or not a value is presented whitespace-delimited becomes
> a property of that value that a fully general CIF 2.0 parser must make available to its clients. Moreover, for better or for worse,
> future dictionaries could establish additional items or data types whose values are required to be presented unquoted.
>
> We could perhaps characterize that more specifically, maybe by saying that the exact form of values presented in any of the quoted
> forms is significant, or something along those lines, whereas values presented in whitespace-delimited form may afford equivalent
> alternative expressions. That doesn’t exactly fit . and ?, but perhaps some similar statement could do so better.
>
> John
>
> --
>
> John C. Bollinger, Ph.D.
> Computing and X-Ray Scientist
> Department of Structural Biology
> St. Jude Children's Research Hospital
> [email protected]
> (901) 595-3166 [office]
> www.stjude.org
>
>
>
> --
>
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
>
> _______________________________________________
> ddlm-group mailing list
> [email protected]
> http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group
>
>
>
> _______________________________________________
> ddlm-group mailing list
> [email protected]
> http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group
>
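
The position taken in the quoted 8 July message is that a fully general CIF 2.0 parser would report, for each value, whether it was presented whitespace-delimited, and leave the interpretation to its clients. A minimal Python sketch of that idea follows. Every name in it (CifValue, bare, is_null, as_number, allow_quoted) is hypothetical and not taken from any existing CIF library, and the numeric pattern is a simplification rather than the <Numeric> production from the specification.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CifValue:
    """A parsed value plus the one presentation detail a client may need."""
    text: str   # content with any delimiters already removed
    bare: bool  # True if the value was whitespace-delimited (unquoted)


# Simplified numeric pattern (an assumption, not the official grammar):
# sign, digits with optional fraction, optional exponent, optional
# standard uncertainty in parentheses.
_NUMERIC = re.compile(
    r"^(?P<num>[+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)(?:\(\d+\))?$"
)


def is_null(value: CifValue) -> bool:
    """Only unquoted . and ? carry the special meaning; '.' and '?' are text."""
    return value.bare and value.text in (".", "?")


def as_number(value: CifValue, allow_quoted: bool = True) -> Optional[float]:
    """Interpret a value for an item that the dictionary types as numeric.

    allow_quoted=True models a lenient client that accepts quoted numbers;
    allow_quoted=False models a strict client that rejects them.
    """
    if is_null(value):
        return None
    if not value.bare and not allow_quoted:
        raise ValueError(f"quoted value {value.text!r} rejected as a number")
    match = _NUMERIC.match(value.text)
    if match is None:
        raise ValueError(f"{value.text!r} is not numeric")
    # The standard uncertainty, if present, is dropped in this sketch.
    return float(match.group("num"))
```

In this sketch the decision to treat an item as numeric comes from the caller, standing in for a dictionary lookup, while the bare flag only controls the special values and how strict the client chooses to be about quoted numbers.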

--
John Westbrook, Ph.D.
RCSB, Protein Data Bank
Rutgers, The State University of New Jersey
Department of Chemistry and Chemical Biology
174 Frelinghuysen Rd
Piscataway, NJ 08854-8087
e-mail: [email protected]
Ph: (848) 445-4290 Fax: (732) 445-4320
_______________________________________________
ddlm-group mailing list
[email protected]
http://mailman.iucr.org/cgi-bin/mailman/listinfo/ddlm-group

