Discussion List Archives

Re: Draft CIF2 standard available

  • Subject: Re: Draft CIF2 standard available
  • From: "Bollinger, John C" <John.Bollinger@xxxxxxxxxx>
  • Date: Wed, 7 Apr 2010 09:35:56 -0500
Hello All,
 
Imagine my surprise at returning to this list after a several-year hiatus (involving two job changes and an interstate relocation) just in time to find a call for public comment on a new CIF standard.  I hope I’m in time, at least.  I see that no public discussion has proceeded from James’s solicitation, but I suppose that follows from many of the usual voices here having been involved in the DDLm working group that hammered out the draft in the first place.
 
I have several comments and questions about the draft.  Many are more editorial than technical, so I hope you will indulge me there.  All comments are based on the February 18, 2010 draft.
 
General
 
1) The convention used in the specification for expressing characters by hexadecimal code point value is readily enough interpreted (at least for an XML initiate such as myself), but it is nowhere defined, and I have not seen this exact convention before.  Perhaps it is common in DDLm or STAR space, but I don’t think it a safe assumption that consumers of the spec will be familiar with those areas.  Since CIF2 uses Unicode, I suggest characters be expressed instead according to Unicode convention; for example, “U+0020” instead of “#x20”.  If the present convention is retained then it should be defined in the text.
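
To illustrate the suggestion in (1), here is a minimal sketch that rewrites a "#x20"-style token in the Unicode "U+0020" form. The function name to_unicode_notation is hypothetical, and the sketch assumes the draft's convention is simply "#x" followed by hexadecimal digits, which is my reading of it rather than something the draft defines.

```python
import re


def to_unicode_notation(token: str) -> str:
    """Rewrite a '#x<hex>' token as 'U+XXXX' (at least four hex digits).

    Assumption: the draft's convention is '#x' followed by hex digits.
    """
    match = re.fullmatch(r"#x([0-9A-Fa-f]+)", token)
    if match is None:
        raise ValueError(f"not a #x-style code point: {token!r}")
    code_point = int(match.group(1), 16)
    # Unicode convention: uppercase hex, zero-padded to four digits minimum.
    return f"U+{code_point:04X}"


print(to_unicode_notation("#x20"))      # U+0020
print(to_unicode_notation("#x10FFFF"))  # U+10FFFF
```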
 
Terminology section
 
2) The reference to “UNICODE Code Page 0” is meaningless as far as I am aware.  I have never before seen the term “code page” applied in this context.  It would be clearer to refer explicitly to the first 127 characters of Unicode, or, in more technically precise terms, the characters having Unicode code points 0 through 7E (hex).  (Or was the omission of U+007F from the definition of ASCII characters accidental?)
 
3) Depending on how standardesque one wants to be, it might be appropriate to use the ISO name for the “Latin-1” character set, “ISO-8859-1”.  (You might also want to specify ISO-10646 (+- some revision) for Unicode, but in that case there are some small, but non-trivial distinctions.)
 
4a) I do not understand the meaning of the sentence “The lexical characters of CIF2 are restricted to the 7 bit ASCII range […].”  That is to say, I think I understand the intent, but it’s different from what the sentence appears on the surface to say, which is that a lexer for CIF2 need only deal with 7-bit ASCII characters, just like one for CIF1.  This is false if the CIF contains non-ASCII characters (anywhere), for the UTF-8 encodings of all other characters use bytes that are not mapped to ASCII characters (i.e. they have the most significant bit set).
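
The point in (4a) about UTF-8 bytes can be checked directly. The following is a small sketch, with a hypothetical helper name and made-up sample lines, that lists the bytes of a UTF-8 encoded string that fall outside the 7-bit range:

```python
from typing import List


def bytes_outside_ascii(text: str) -> List[int]:
    """Return the UTF-8 bytes of `text` that have the most significant bit set."""
    return [b for b in text.encode("utf-8") if b & 0x80]


# Hypothetical sample lines, used only for illustration.
ascii_only = "_cell_length_a 5.43"
with_angstrom = "_cell_length_a 5.43 \u00c5"  # ends with U+00C5 (A with ring)

print(bytes_outside_ascii(ascii_only))                         # []
print([hex(b) for b in bytes_outside_ascii(with_angstrom)])    # ['0xc3', '0x85']
```

Every byte in the encoding of a non-ASCII character has its high bit set, so a lexer that sees such a file is necessarily handed bytes that do not map to ASCII characters.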
 
4b) It is debatable whether “this enables faster parsing since one can defer UTF-8 decoding to later.”  I don’t think it’s useful to have that debate now, but I do think that that rationale is out of place in the “Terminology” section.  It would be better placed with change 2.
 
5) As far as I know, the local convention for record termination is nowhere “architecture dependent,” though it certainly is dependent on other aspects of the local computing environment.
 
6) I presume that it is not intended to refuse to recognize the Unicode line separator and / or paragraph separator characters as newline *in present or future environments where that is an accepted local convention*.
 
7) I find it unappealing and unpersuasive to justify the exclusion of some characters from syntactic significance based on their *rendering*.  This is not to say that I expect or want characters U+201C and U+201D to be syntactically significant as opening and closing double quotation marks, or any similar thing.  I do not.  I would rather see the whole rationale omitted (at least here).  If it is believed important to include reasoning, here, for this decision, then I think minimizing the syntactic complexity of the language is a much stronger justification.
 
8) With respect to the claim that “Applications built on the CIF2 standard will be able to process CIF1 data files,” is this a constraint on CIF2 applications, or an assertion about the CIF2 format?  If the former, then I think it out of order (no matter how likely it is to be true).  If the latter, then it is incorrect, for CIF2, as specified, is not 100% backwards compatible with CIF1.
 
Change 1
 
9) What is the expected CIF version for a CIF that does not specify the CIF2 version comment?  Is the specification saying that such a file does not conform to CIF2?  As a practical matter (probably not appropriate for inclusion in the spec) what is the expectation for application behavior when presented with such a CIF?
 
Change 2
 
10) I am very pleased with CIF2’s extension of character repertoire to the full Unicode set.  The section header and text express an unfortunate commingling of the concepts of Unicode and UTF-8, however.  These are distinct concepts: the former is a coded character set, mapping characters to numeric code points; the latter is one of several schemes commonly used to encode Unicode code points as byte sequences.  The contents of a CIF are thus *Unicode* characters, encoded via UTF-8 (not “UTF-8 characters”).
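
The distinction drawn in (10) can be shown in a few lines. This sketch takes one character, reports its Unicode code point, and then encodes that same code point under two different encoding schemes; the choice of character is arbitrary.

```python
ch = "\u00c5"                      # LATIN CAPITAL LETTER A WITH RING ABOVE

code_point = ord(ch)               # the character's number in the coded character set
utf8 = ch.encode("utf-8")          # one byte sequence for that code point
utf16 = ch.encode("utf-16-be")     # a different byte sequence for the same code point

print(f"U+{code_point:04X}")       # U+00C5
print(utf8)                        # b'\xc3\x85'
print(utf16)                       # b'\x00\xc5'

# Both byte sequences decode to the same Unicode character.
print(utf8.decode("utf-8") == utf16.decode("utf-16-be"))  # True
```

The character and its code point are the same in both cases; only the bytes differ, which is why "Unicode characters, encoded via UTF-8" is the accurate description.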
 
11) Although it probably doesn’t belong in the format specification, it would be useful to know whether the special codes for expressing certain non-ASCII characters in CIF1 values are expected to be supported by CIF2 applications as well (in addition to the ability to include all those characters directly).
 
Change 3
 
12) I applaud lifting the artificial restriction on data name lengths.  It has always seemed backward to me that CIF 1.1 restricts data name lengths to those definable in a DDL1 dictionary.
 
13) Is it really appropriate in specifying the CIF *format* to document restrictions imposed by DDLm (or any other DDL) on *dictionaries*?  Especially in such detail?  These are characteristics of particular dictionaries, not of CIF itself, and they are relevant only to dictionary writers, who need to be intimately familiar with their chosen DDL anyway.
 
Change 4
 
14) Is the text “A whitespace‐delimited string cannot exactly match any STAR keyword, loop_ global_ save_* stop_ data_* […],” especially the phrase “exactly match,” intended to imply case-sensitive matching?  That would be at variance with CIF1.
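
To make the question in (14) concrete, here is a hedged sketch of a reserved-word check that can be run either way. The function name is hypothetical, and treating "save_*" and "data_*" as "the prefix followed by anything" is my assumption about what the draft's wildcards mean.

```python
import re

# Assumption: '*' in the draft's keyword list means "any trailing characters".
_RESERVED_PATTERN = r"loop_|global_|stop_|save_.*|data_.*"


def is_reserved(word: str, case_sensitive: bool = False) -> bool:
    """Report whether a whitespace-delimited word matches a STAR keyword."""
    flags = 0 if case_sensitive else re.IGNORECASE
    return re.fullmatch(_RESERVED_PATTERN, word, flags) is not None


print(is_reserved("LOOP_"))                       # True  (CIF1-style, case-insensitive)
print(is_reserved("LOOP_", case_sensitive=True))  # False (literal "exact match")
print(is_reserved("Data_block1"))                 # True
```

Under the case-sensitive reading, "LOOP_" would be an acceptable whitespace-delimited value, which is the variance from CIF1 that the question is about.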
 
Change 5
 
15) What is the rationale for changing the interpretation of single-quoted and double-quoted strings relative to CIF1?  I’m talking about forbidding these values to contain the delimiter (even when it is not followed by whitespace).  Yes, CIF1’s provision for such values may be surprising to newcomers, and yes, it makes CIF1 a little trickier to parse, but CIF1 parsers already handle it, and experienced CIF users already know it.  This appears to be CIF2’s principal departure from backwards compatibility, and I simply don’t see why it is necessary.  Full (or at least better) backwards compatibility would be VERY VALUABLE.  Changing it now will produce unnecessary confusion and make CIF parsing even trickier (not easier) because both variants will need to be accommodated.
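
The difference at issue in (15) is small in code. The sketch below contrasts the CIF1 rule (a delimiter closes the string only when followed by whitespace or end of line) with a rule under which the first matching delimiter always closes it, which is how I read the draft's restriction. Both function names are hypothetical, and neither is a complete parser.

```python
from typing import Tuple


def read_cif1_quoted(line: str, start: int = 0) -> Tuple[str, int]:
    """CIF1 rule: the closing quote must be followed by whitespace or end of line."""
    quote = line[start]
    if quote not in ("'", '"'):
        raise ValueError("no opening quote at start position")
    i = start + 1
    while i < len(line):
        at_end = i + 1 == len(line)
        if line[i] == quote and (at_end or line[i + 1] in " \t"):
            return line[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated quoted string")


def read_first_delimiter_quoted(line: str, start: int = 0) -> Tuple[str, int]:
    """Assumed draft rule: the first matching quote closes the string."""
    quote = line[start]
    if quote not in ("'", '"'):
        raise ValueError("no opening quote at start position")
    end = line.find(quote, start + 1)
    if end == -1:
        raise ValueError("unterminated quoted string")
    return line[start + 1:end], end + 1


text = "'a dog's life' next"
print(read_cif1_quoted(text))             # ("a dog's life", 14)
print(read_first_delimiter_quoted(text))  # ('a dog', 7)
```

A reader that must accept both CIF1 and CIF2 input needs both behaviours and a way to choose between them, which is the added complexity the comment refers to.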
 
Change 6
 
(no comments)
 
Change 7
 
16) This is admittedly a quibble, but the sentence “A list is an ordered set” is susceptible to misinterpretation.  The examples indicate that it is not intended that a list must have the set property of not containing duplicate elements.  Perhaps “sequence” would be an apt substitute for “ordered set”.
 
17) Is whitespace allowed between the opening (closing) bracket of a list and its first (last) element, or only to separate elements?
 
18) If “there is implicit line joining, and the newline has no meaning with regard to List values” then are “\n;”-delimited text blocks forbidden as list elements?
 
Change 8
 
19) If “a Table is an unordered set” then what are its elements?  If they are name-value pairs (as the text seems to indicate), then wouldn’t that mean the same name could appear multiple times with different values?  I think it would be helpful here to separate the data model for this type from its syntactic representation.  A Table *is* a set of labels, each with an associated value.  It is represented in CIF2 as a collection of label-value pairs, enclosed by matching curly braces.
 
20) Is whitespace allowed between the opening (closing) brace of a Table and its first (last) element, or only to separate elements?
 
21) Is there special value in requiring the labels in a CIF2 Table to be quote-delimited?  And in not allowing them to be triple-quote delimited?  These restrictions are no special burden, I don’t think, but they seem unnecessary.  As far as I am concerned, the labels could be any value interpretable according to the DDL1/2 “char” type.
 
22) If “there is implicit line joining, and the newline has no meaning with regard to the Table values” then are “\n;”-delimited text blocks forbidden as Table values?  There’s a better argument here than for List values (comment 18, above), but given the explicit inclusion of ‘\n’ as part of the delimiter for these values, they could be allowed.
 
Change 9
 
23) Tokens are not a language component, but rather a parsing detail, and parsers are not obligated to tokenize in any particular way.  I have no use for the spec instructing me about how to tokenize a CIF stream.  The main point here seems to be requiring whitespace between language constructs where otherwise the spec might be construed to permit its omission.  It would be more appropriate, easier to understand, and harder to miss if the details relevant to each language construct were included with the description of changes to that component.  Alternatively, see (26) below.
 
24) The Reasoning subsection contains only descriptions and explanations, not any reasoning or even justification / rationale.
 
25) Moreover, I don’t think it’s correct to say that “The only meaningful use of whitespace is to separate data tokens.”  At least, I think one must adopt a very broad view of “data tokens” to make it correct.  Surely whitespace meaningfully separates language keywords, tags, block and save frame headers, AND data values, all one from another.  Furthermore, whitespace inside delimited string values is meaningful to applications, and therefore must be allowed and preserved.
 
26) If you want to make a blanket statement here then perhaps you want to say that language keywords, tags, block and save frame headers, and top-level data values must all be separated from one another by whitespace, where “top-level data values” are defined as values that are not components of a List or Table.  Details of the required syntax for List and Table contents are more appropriately discussed in changes 7 and 8, respectively.
 
 
Best Regards,
 
John
 
--
John C. Bollinger, Ph.D.
Computing and X-Ray Scientist
Department of Structural Biology
St. Jude Children's Research Hospital
John.Bollinger@StJude.org
(901) 595-3166 [office]
 
 
