RE: Proposal to regulate markup in CIF files
- To: "Discussion list of the IUCr Committee for the Maintenance of the CIF Standard (COMCIFS)" <comcifs@iucr.org>
- Subject: RE: Proposal to regulate markup in CIF files
- From: "Brown, David" <idbrown@mcmaster.ca>
- Date: Thu, 21 Sep 2017 15:46:01 +0000
- Accept-Language: en-CA, en-US
- In-Reply-To: <CAM+dB2fR6EzfUWskcYrSxcRejs8YdDS4TE-SC=EwxmQkRKayTQ@mail.gmail.com>
- References: <CAM+dB2cSRx=68QpdH5nQMuwWYU_pCwDmf2YVM7caU4vcw3fDsQ@mail.gmail.com><DM2PR0401MB145300C788489D6130483DA9E06E0@DM2PR0401MB1453.namprd04.prod.outlook.com>,<CAM+dB2fR6EzfUWskcYrSxcRejs8YdDS4TE-SC=EwxmQkRKayTQ@mail.gmail.com>
James,
You ask a question at the end of this part of the discussion, and below I give you the answer as I understand it.
1. To the extent that the proposal envisions data files being enabled to self-specify a particular markup convention from among several choices, it seems to violate the principle that the meaning of an item should not depend on the value of a different item.
In my opinion, this principle needs to be clarified. For example, any non-key data name 'depends' on the value of key data names in its category, or the meaning of a fractional coordinate 'depends' on the values of the cell parameters. We need a precise rephrasing of the principle, e.g. "no new key data names may be added to a pre-existing category" or "where possible, the meaning of data names should depend only on data names that are identifiers". Once this is clarified we might be in a better position to judge when a proposal is suspect. Does anyone know the underlying basis for this principle?
Here is the answer to this question. The original intent was that the item name alone should be sufficient to tell the reading program where to store the information given by the item, and in what form that information was given, without having to refer to the value of some other item. For example, we wanted to avoid defining an item _atom_site.adp that might contain the anisotropic displacement parameters in any one of the forms B, beta or U, with the particular form depending on the value assigned to the item _atom_site.adp_type. Without consulting *.adp_type the information is meaningless. Avoiding this required defining a separate item name for each of the three forms, so that the name itself tells the reading program how to store the value (or, preferably, restricting CIF to a single convention).
The situation with text is different. A text item such as _abstract or _title could be stored in the same place and processed without regard to the markup convention, right up to the time that it was being displayed.
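
The displacement-parameter example above can be sketched in code. This is only an illustration of the rule, not any real CIF reader: the three per-form data names (_atom_site.adp_B, _atom_site.adp_beta, _atom_site.adp_U) and both function names are hypothetical, chosen to mirror the _atom_site.adp / _atom_site.adp_type example.

```python
# Hypothetical data names: one per form, so the name alone fixes the meaning.
FORM_BY_NAME = {
    "_atom_site.adp_B": "B",
    "_atom_site.adp_beta": "beta",
    "_atom_site.adp_U": "U",
}


def store_by_name(items):
    """Name-driven reading: each (name, value) pair is interpretable on its own."""
    stored = []
    for name, value in items:
        form = FORM_BY_NAME.get(name)
        if form is not None:
            stored.append((form, float(value)))
    return stored


def store_by_type_flag(items):
    """The design the rule avoids: the value is meaningless without a second item."""
    values = dict(items)
    form = values.get("_atom_site.adp_type")
    if form is None:
        raise ValueError(
            "_atom_site.adp cannot be interpreted without _atom_site.adp_type"
        )
    return [(form, float(values["_atom_site.adp"]))]
```

In the first function a reader can handle items one at a time and in any order; in the second it must first locate the flag item before it can store anything.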
I hope this clarifies the thinking behind this rule.
David
I. David Brown
Professor Emeritus
Department of Physics and Astronomy
McMaster University
Hamilton, Ontario, Canada
From: comcifs [comcifs-bounces@iucr.org] on behalf of James Hester [jamesrhester@gmail.com]
Sent: September 20, 2017 20:59
To: Discussion list of the IUCr Committee for the Maintenance of the CIF Standard (COMCIFS)
Subject: Re: Proposal to regulate markup in CIF files
Dear All,
I've turned the proposal into a discussion document, separating non-ASCII character markup and other markup as per John's comments. The best way forward for 'other markup' is not clear to me, so I've asked a few questions at the end of the document that I invite you to consider. I've also inserted replies to John's comments inline below.
James.
On 14 September 2017 at 01:08, Bollinger, John C <John.Bollinger@stjude.org> wrote:
In my opinion, this principle needs to be clarified. For example, any non-key data name 'depends' on the value of key data names in its category, or the meaning of a fractional coordinate 'depends' on the values of the cell parameters. We need a precise rephrasing of the principle, e.g. "no new key data names may be added to a pre-existing category" or "where possible, the meaning of data names should depend only on data names that are identifiers". Once this is clarified we might be in a better position to judge when a proposal is suspect. Does anyone know the underlying basis for this principle?
The point being that software written with one set of markup conventions in mind would be caught unawares by data values written according to a new markup convention. However, the idea is that _publ.markup_convention is used to flag the use of a new convention
and allow such software to act appropriately. Adding new enumerated values is a normal way of developing dictionaries and I don't think is controversial. I have removed this from the proposal for now.
Yes, this is true. All of our dictionaries currently depend on the core dictionary in any case and this is unlikely to change. A completely different domain would need to define its own analogue.
This is true.
I have done this in the revised proposal.
Agreed, hopefully the rewritten proposal addresses this.
I have rewritten the proposal to remove the option of specifying particular types of markup, and indeed largely removed discussion of markup pending some feedback from this group.
James.
================================================================
Revised discussion paper for regulating markup of CIF text items
================================================================

Summary
=======

1. Data values containing backslash escapes for indicating non-ASCII characters are to be considered entirely equivalent to values obtained after all such substitutions and escapes have been applied.

2. Other mark-up (superscripts, subscripts, italic etc.) is given a new type, but is otherwise unspecified and needs to be discussed.

Introduction
============

From the very first publication describing CIF, markup conventions have been provided in order to extend the range of characters and font effects representable in ASCII. Which data values these conventions might apply to, and whether or not this is more properly a CIF syntax or dictionary (semantic) issue, has been left implicit.

Marked-up text according to the ad-hoc definitions described in Vol G appears both in CIF data files and in dictionary definitions. While COMCIFS has control over the conventions applying within dictionaries, it has far less control over data values in data files, which are produced both by dedicated software, such as publCIF, and hand-editing or local ad-hoc solutions. Marked-up text in data files plays an important role in the publication workflow.

Vol G (First Edition) notes in section 2.2.5.3: "It is hoped that in future different types of such markup may be permitted so long as the data values affected can be tagged with an indication of their content type that allows the appropriate content handlers to be invoked". It is not, however, clear that multiple alternative markups are desirable.

Moving forward
==============

The markup in use can be divided into two classes: 'character encoding' and 'font effects'. Under this proposal, each class is treated differently.

1. Character encoding

Character encoding markup represents non-ASCII letters using a backslash followed by one or more ASCII characters, for example, '\a' is 'greek letter alpha'. This is a format-specific method of allowing access to the full character set used by the DDL textual types. From the point of view of the (format-agnostic) dictionary data model, how a particular format wishes to encode characters is irrelevant. Therefore, the set of character escapes is most appropriately documented as part of the description of CIF syntax, not within a DDL dictionary. In other words, CIF1/2 data values with backslash character escapes are semantically identical to CIF2 data values where those escapes have had their Unicode equivalents substituted.

2. Font effects

Font effects differ from character encodings in not having a DDLm type that they are a concrete realisation of. By analogy, then, we could create a DDLm type 'Marked-up text', whose contents contain marked-up text. Particular implementations and syntaxes might then specify what particular convention(s) 'Marked up text' should conform to.

Notes
=====

1. An important function of the 'Marked-up text' type is to designate data values that are not intended to be machine-actionable. No DDLm functions or attributes are envisioned for manipulating the markup. The type could alternatively be something like 'Rich text'.

2. Enumerated values and identifiers must not be of type marked-up text.

3. The 'marked-up text' data value is obtained from a CIF syntax file after backslash character codes have been substituted.

Open questions
==============

The above proposal does not specify a particular markup convention. Leaving anything unspecified is dangerous for a standard, as it invites the appearance of multiple, incompatible solutions. We should as a matter of urgency answer the following questions:

(1) Should alternative markup conventions be possible?

(2) If yes, should the markup convention in use be
    (i) per dataname?
    (ii) per datavalue (maybe via an embedded flag)?
    (iii) per data block?
    (iv) per dictionary?
    (v) per syntax? (e.g. CIF/CIF-JSON/HDF5 etc.)

(3) If no, the current convention is the only possible one for reasons of backward compatibility. Should it be:
    (i) a feature of CIF syntax?
    (ii) a feature of CIF syntax when combined with a DDLm dictionary?
    (iii) defined in DDLm?

My answers to these questions would be

(1) No alternatives should be possible, in order to simplify publishing workflows and maintain the publCIF investment

(3) (ii)

Some explanation regarding (ii), which possibly sounds a bit abstruse. A CIF syntax file can be used (in theory) with an alternative dictionary language and associated data model. Likewise, DDLm dictionaries can be used to describe non-CIF files. In each case, the way in which syntactical data values are constructed to match the dictionary types may differ (for example, numbers may be a text string or binary). Each combination of syntax and dictionary must explicitly state how each dictionary type is represented in that syntax. So I am suggesting that for the combination 'CIF + DDLm' that we specify the current markup conventions to represent type 'Marked-up text'.
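
The equivalence stated under 'Character encoding' (and relied on in Note 3) can be sketched as a normalisation step applied before two data values are compared. This is a minimal illustration under stated assumptions, not a CIF parser: the function names are hypothetical, and the escape table is a deliberately tiny subset. Only '\a' for alpha is confirmed by the text above; the entries for beta and gamma are assumed from the usual CIF Greek-letter convention.

```python
# Illustrative subset of backslash escapes (two-character codes only).
# '\a' -> alpha is taken from the discussion paper; '\b' and '\g' are
# assumptions added for the sake of the example.
ESCAPES = {
    "\\a": "\u03b1",  # Greek small letter alpha
    "\\b": "\u03b2",  # Greek small letter beta (assumed)
    "\\g": "\u03b3",  # Greek small letter gamma (assumed)
}


def substitute_escapes(value: str) -> str:
    """Replace known backslash escapes with their Unicode equivalents."""
    out = []
    i = 0
    while i < len(value):
        pair = value[i:i + 2]
        if pair in ESCAPES:
            out.append(ESCAPES[pair])
            i += 2
        else:
            # Unknown sequences are passed through unchanged.
            out.append(value[i])
            i += 1
    return "".join(out)


def semantically_equal(a: str, b: str) -> bool:
    """Treat an escaped value and its substituted form as the same value."""
    return substitute_escapes(a) == substitute_escapes(b)
```

With this normalisation, an escaped value such as `\a-quartz` and a CIF2 value written directly with the Unicode letter compare as equal, which is the sense in which the two are 'semantically identical'.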