Discussion List Archives

Re: CIF-JSON draft 2017-05-15

I think it's important to be precise and consistent with those in referencing CIF concepts.

Bob


On Mon, May 15, 2017 at 10:05 PM, Robert Hanson <hansonr@stolaf.edu> wrote:
Sorry for the fragmented messages -- probably should have waited until I was done with that. Here are my responses to James all in one document. Thanks very much for the clarification, James.

Hi again Bob,

On 16 May 2017 at 03:53, Robert Hanson <hansonr@stolaf.edu> wrote:

    Two questions arising:

    1) CIF2 mentions byte order. This would be the (optional) first two characters of the data stream?


Byte order marks are a JSON syntax issue and thus not relevant to this specification. https://tools.ietf.org/html/rfc7159#section-8.1 states that JSON implementations must not add BOMs to streams, but parsers may ignore BOMs in the interests of interoperability.

OK, I think the real answer is that unlike UTF-16, UTF-8 does not have little- or big-endian byte order. Perfect.
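
A minimal sketch of the tolerant-parser behaviour described above, assuming the CIF-JSON stream arrives as UTF-8 bytes. Python's `utf-8-sig` codec drops a leading BOM if one is present and otherwise behaves like plain UTF-8. The function name `load_cif_json` and the sample payload are hypothetical, not part of the draft.

```python
import json

def load_cif_json(raw: bytes):
    """Parse UTF-8 JSON bytes, tolerating an optional leading BOM."""
    # 'utf-8-sig' strips a leading U+FEFF if present; otherwise it is plain UTF-8.
    text = raw.decode("utf-8-sig")
    return json.loads(text)

# Hypothetical sample content, with and without a UTF-8 BOM.
payload = b'{"blockname": {"_abc": ["xyz"]}}'
with_bom = b"\xef\xbb\xbf" + payload
```

Both byte strings should parse to the same object, so a reader written this way does not need to care whether the producer added a BOM.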


    2) list item names need not be lower-case, right? Nothing I see in CIF says that they conform to the requirements of data names. Thus, CIF2 could have upper- and lower-case names in list items.


Currently lower case is only enforced for JSON datablock names.  The CIF syntaxes require that datanames appearing in a datablock must have canonical caseless forms that do not match, but the datanames themselves do not have to be presented in canonical caseless form.  As case is, in general, significant for datavalues, list item names and any other datavalues may be in upper case, capitalised etc.

OK, I think I am confused by all these terms. I just hadn't thought about all these names yet. I'm using the language of http://journals.iucr.org/j/issues/2016/01/00/aj5269/index.html#SEC3.9 here.

Here's what I suggest:

the JSON equivalents of the CIF data block headers must be lower case (not sure that matters if "data_" precedes them)
the JSON equivalents of the CIF data block item names must be lower case (this is critical)
the JSON equivalents of the CIF save block names must be lower case (not sure that matters if "save_" precedes them)
the JSON equivalents of the CIF save block item names must be lower case (this is critical)
the JSON equivalents of the CIF table data item names may be any case (no restriction here)

I will just point out that the CIF2 spec does not use the word "dataname" and instead uses the more natural-language "data name", and I wish this draft would, too.


On 16 May 2017 at 04:16, Robert Hanson <hansonr@stolaf.edu> wrote:

    Two points:

    1. I do not understand the stripping of "data_" and "save_" from the names we have for these.


These are stripped as the data_ and save_ parts of the names perform a syntactical role in CIF, and the JSON curly brace essentially replaces their syntactical function of encapsulation.  Put another way, requiring that datablock names started with 'data_' would be unnecessary additional baggage.

Yes, OK, I see that. Still, I would suggest it doesn't add a significant amount of baggage to add "data_" or "save_" once or twice in a file, and it significantly improves readability both by humans and machines, at least in my opinion. It provides a visual cue to the original CIF reference. Also, it is common for a JSON data reader to first get a list of keys without values, and, starting with that, know what to do with them.

I just noticed that in several of the files I have the _audit_link_block_code values reference data block names, but I think those are actually supposed to be referencing _audit_block_code entries. So I think these may be broken. Pretty sure they were hand-made:

data_comp1012814988
_exptl_crystal_type_of_structure comp
...
loop_
_audit_link_block_code
_audit_link_block_description
? 'common experimental and publication data'
1012814988_0_MOD 'modulated structure (Global data)'
1012814988_1_MOD 'modulated structure (subsystem 1)'
1012814988_2_MOD 'modulated structure (subsystem 2)'
1012814988_0_REFRNCE 'reference structure (Global data)'
1012814988_1_REFRNCE 'reference structure (subsystem 1)'
1012814988_2_REFRNCE 'reference structure (subsystem 2)'
...
data_1012814988_1_MOD
_cell_length_a  4.905(2)
...

    2. Save frames.  What is the problem with just doing this?


         "data_another_block":{
            "_abc":["xyz"],
            "save_internal":{"_abc":["yzx"],
                            "_r.fruit":["apple","pear"],
                            "_r.colour":["red","green"]}
         }

    That is, why the special "frames" list?


Well, we could do it the way you suggest, perhaps with a capital 'S'.  Is there any reason to prefer one over the other? I would have thought that putting all save frames under a single name would make processing a datablock slightly easier, as you don't have to check every block entry for the 'save_' sequence when running through a datablock, especially as these frames only occur in dictionaries.  To get all save frames under the current spec you would just have to go something like:

defs = myjson['blockname']['frames']

but if you need them under your proposal you'd have to go something vaguely like:

defs = myjson['blockname']
save_names = [key for key in defs if key.startswith("save_")]

The former approach is faster, and seems simpler but perhaps that's just me?

BH: There are no capital/lower case issues for the first character of data block item names. In CIF these all start with "_", so anything that doesn't could be a save name. I don't think speed is an issue here. These are very high-level operations -- only a handful per file. All data block item names have to be checked for "_" anyway. Right?
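
A minimal sketch of that single pass, assuming the layout proposed above, in which save frames sit beside the data items under keys beginning with "save_". The helper `split_block` and the sample block are hypothetical and not part of the draft.

```python
def split_block(block: dict):
    """Separate data items from save frames using only the key prefix."""
    # Data block item names always start with an underscore in CIF.
    items = {k: v for k, v in block.items() if k.startswith("_")}
    # Under the proposed layout, save frames keep their "save_" prefix.
    frames = {k: v for k, v in block.items() if k.startswith("save_")}
    return items, frames

# Hypothetical block mirroring the example earlier in this thread.
block = {
    "_abc": ["xyz"],
    "save_internal": {
        "_abc": ["yzx"],
        "_r.fruit": ["apple", "pear"],
        "_r.colour": ["red", "green"],
    },
}
items, frames = split_block(block)
```

Either layout needs only a handful of such operations per file, so the choice is more about readability than speed.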

1. Triple-quoted CIF strings: these are purely a CIF syntactical device for encapsulating a string value, there is no meaning attached to these that is not captured by a normal JSON string.

Yes, of course. I wasn't thinking. Maybe something in there about standard JSON quoting of \" and \n. I wonder if it should require that "new line" be a UNIX new line, \n, not \r or \r\n.
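
A minimal sketch of what that could look like, assuming the line-ending rule suggested above were adopted (it is only a suggestion in this thread, not part of the draft). The helper `to_json_string` is hypothetical; `json.dumps` does the standard JSON escaping of the double quote and the newline.

```python
import json

def to_json_string(cif_text: str) -> str:
    """Normalise line endings to LF, then serialise as a JSON string."""
    # Replace CRLF first so that it does not turn into two line feeds.
    normalised = cif_text.replace("\r\n", "\n").replace("\r", "\n")
    return json.dumps(normalised)

# Hypothetical multi-line value, as might come from a triple-quoted CIF string.
value = 'line one\r\nsays "hello"\rline three'
encoded = to_json_string(value)
```

Decoding `encoded` with `json.loads` gives back the text with LF line endings only, whatever the original CIF file used.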

2. data names in case-normal form: I can see no problems with this, does anybody else have any thoughts?  Such a restriction would handily enforce the need for datanames to be canonically-caselessly unique as JSON requires all object names to be unique within their parent object.

I'm having problems identifying what "data name" means. Does it mean the CIF data block header names, or the CIF data block item names? I think that's what got me confused.
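
A minimal sketch of the uniqueness check under discussion, assuming "data name" means the data block item names (the ones starting with "_"). The function `caseless` only approximates a canonical caseless form (NFD, case-fold, NFC); the CIF2 specification should be consulted for the exact definition. Both helper names are hypothetical.

```python
import unicodedata

def caseless(name: str) -> str:
    """Approximate a canonical caseless form: NFD, case-fold, then NFC."""
    folded = unicodedata.normalize("NFD", name).casefold()
    return unicodedata.normalize("NFC", folded)

def caseless_collisions(names):
    """Return groups of names that share the same caseless form."""
    groups = {}
    for name in names:
        groups.setdefault(caseless(name), []).append(name)
    return [group for group in groups.values() if len(group) > 1]

# Hypothetical names: the first two differ only by case.
names = ["_cell_length_a", "_Cell_Length_A", "_cell_length_b"]
collisions = caseless_collisions(names)
```

If names were required to be stored already in this form, any collision would show up as a duplicate JSON object name instead of needing a separate check.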

Bob




--
Robert M. Hanson
Larson-Anderson Professor of Chemistry
St. Olaf College
Northfield, MN
http://www.stolaf.edu/people/hansonr


If nature does not answer first what we want,
it is better to take what answer we get.

-- Josiah Willard Gibbs, Lecture XXX, Monday, February 5, 1900

_______________________________________________
cif-developers mailing list
cif-developers@iucr.org
http://mailman.iucr.org/cgi-bin/mailman/listinfo/cif-developers
