
Re: [ddlm-group] What we have resolved so far

A quick response to JRH's email, to put things in context. My more detailed
summary is still coming.

On 19/11/09 8:22 AM, "James Hester" <jamesrhester@gmail.com> wrote:

> Nick's forthcoming email notwithstanding, here is a quick list of what
> I think we have resolved and not resolved so far:
> 1.  The new standard is called CIF2
> 2.  All files conforming to the new standard must have a header
> containing something like the characters "#CIF2"
> 3.  Non-quote-delimited strings may not contain any
> syntactically-significant characters (exact character set has been
> specified by Nick, but before UTF-8 decision)
> 4.  Quote delimited strings may not contain instances of the
> terminating character, regardless of following whitespace.

Unless you do as in (5)

> 5.  In a quote-delimited string, a reverse solidus escapes the
> following character, if that character is otherwise syntactically
> meaningful
> 6.  Files are UTF-8 encoded
> 7.  No tuples
> UNRESOLVED (with notes)
> 1.  Do we maintain the fixed line length restriction?
>     - I will post something to the relevant thread to provoke a resolution

Currently at 2048 bytes. I will propose maintaining this in deference to
legacy and future Fortran programs.
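
With UTF-8 files, a limit stated in bytes is not the same as a limit in
characters. A minimal sketch of a byte-based check follows; the function
name and the choice to count the line without its terminator are
assumptions made for illustration only.

```python
MAX_LINE_BYTES = 2048  # the limit quoted above


def overlong_lines(text, limit=MAX_LINE_BYTES):
    """Return 1-based numbers of lines whose UTF-8 encoding exceeds limit.

    Assumption: the line terminator is not counted towards the limit.
    """
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line.encode("utf-8")) > limit
    ]
```

A line of 1025 two-byte characters would fail this check even though it is
well under 2048 characters long.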

> 2.  Is an escaping reverse solidus part of the datavalue?
>     - This conversation didn't appear to resolve itself

I will propose yes, with its interpretation left to the downstream
application. This is actually consistent with how Python works. My email
timestamped

Mon, 09 Nov 2009 10:35:41 +0800

explained my reasoning.
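
A minimal sketch of what this could look like in a reader, assuming the
only syntactically meaningful character inside a quoted string is the
delimiter itself. The names read_quoted and unescape are hypothetical;
nothing here is agreed CIF2 behaviour.

```python
def read_quoted(text, start):
    """Scan a quote-delimited string that opens at text[start].

    Returns (raw_value, index_after_closing_quote). The reverse solidus
    stays in raw_value; interpreting it is left to the caller.
    """
    quote = text[start]
    i = start + 1
    while i < len(text):
        # Assumption: a reverse solidus only matters before the delimiter.
        if text[i] == "\\" and text[i + 1:i + 2] == quote:
            i += 2  # step over the escaped delimiter, keeping both characters
            continue
        if text[i] == quote:
            return text[start + 1:i], i + 1
        i += 1
    raise ValueError("unterminated quoted string")


def unescape(raw, quote):
    """Hypothetical downstream step: drop the solidus before the delimiter."""
    return raw.replace("\\" + quote, quote)
```

The parser hands back the value with the reverse solidus intact, and only
an application that cares calls something like unescape on it.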

> 3.  Are square brackets permitted in datanames? (getting close to resolution)

I will propose a restricted character set with only _ and . as allowed
punctuation characters. All data names can be identifiers in dREL, and even
those we assume won't appear in dREL can, because someone writing a
completely different dictionary can import our definitions and then add our
data names to their dREL scripts.

To simplify this issue I suggest avoiding the problem. Legacy CIF1 names
will be aliased in CIF dictionaries so that when we read a CIF1 data name in
a CIF1 file we can immediately map it to its CIF2 name (this avoids the need
to remediate all existing CIF1 files).
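
The aliasing idea can be sketched as a lookup applied as each data name is
read. Everything below is illustrative: the alias entry is an invented
name, not taken from any dictionary, and the pattern assumes that letters
and digits are allowed alongside _ and . (the exact set has not been
specified here).

```python
import re

# Assumed character set: leading underscore, then letters, digits, _ and .
RESTRICTED_NAME = re.compile(r"_[A-Za-z0-9_.]+")

# Hypothetical alias table as it might be built from dictionary definitions.
ALIASES = {
    "_example_item[1]": "_example.item_1",  # invented names
}


def is_restricted_name(name):
    """True if name uses only the assumed restricted character set."""
    return RESTRICTED_NAME.fullmatch(name) is not None


def canonical_name(name, aliases=ALIASES):
    """Map a legacy name to its new name, or return it unchanged."""
    return aliases.get(name, name)
```

A reader would call canonical_name on every data name in a CIF1 file, so
the file itself never needs rewriting.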

> 4.  Does STAR also adopt UTF-8 or go with straight binary? (This may
> be up to Nick)

I will propose binary. Any other application domain can then choose UTF-8,
UTF-16, UCS-2 or whatever encoding they wish. This will make Herb's imgCIF a
legitimate STAR application, though not a CIF2 application, because its
binary component is in binUTF? binUCS?.

> 5.  Can we use whitespace instead of comma as a list item delimiter?
>     -not yet tackled seriously but deserves consideration

I will propose that it has to be a comma, with a coercion rule that
space-separated values in a list-type object are coerced into
comma-separated values. That is, read spaces if you want, but don't
encourage them.
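
A minimal sketch of such a coercion rule, assuming list items are bare
tokens with no quoted strings or nesting. The function names are
hypothetical.

```python
import re


def split_list_items(body):
    """Split the inside of a list on commas and/or whitespace.

    Assumption: items are bare tokens (no quoted strings, no nested lists).
    """
    return [item for item in re.split(r"[,\s]+", body.strip()) if item]


def normalise_list_body(body):
    """Rewrite the items in the comma-separated form only."""
    return ", ".join(split_list_items(body))
```

Both "1 2 3" and "1,2 3" are accepted on input, while only "1, 2, 3" is
ever written out.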

> 6.  Are braces only or square brackets + braces used to delimit lists
> and associative arrays?
>     - some consider this decision to be coupled to (3), obvious preference
>       is for square brackets and braces if other issues are solved

If my proposal for 3 is acceptable, I would propose returning to [] for
lists and {} for associative arrays, making it possible to distinguish the
two at the lexical level by reading the first character.
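
The first-character test amounts to a one-step dispatch in the lexer. A
sketch, with a hypothetical function name and return labels:

```python
def compound_kind(text, pos=0):
    """Classify the value starting at text[pos] from its first character."""
    opener = text[pos]
    if opener == "[":
        return "list"
    if opener == "{":
        return "associative array"
    return "other"  # anything else is not a compound value in this sketch
```

This only works if neither bracket can begin a non-delimited string or a
data name, which is why the choice is tied to item 3.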

> 7.  What is the exact form of the header comment (there was some
> discussion of adding a second character such as % or !)?

I think it should be the same as Unix shell headers.

> 8.  Usage of triple-quoted strings: (a) do we need them? (b) do we
> need both of them?

(a) Yes if you want inline multiline strings. (b) Seems superfluous but
makes encoding a """ in a ''' string much easier (and vice versa) without
having to elide.
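
Python's own triple-quoted strings show the convenience in (b). This
illustrates the idea in Python and is not proposed CIF2 syntax.

```python
# With both forms available, each can hold the other without escaping.
in_single = '''He wrote """quoted""" here'''
in_double = """She wrote '''quoted''' here"""

# With only one form, the embedded delimiter has to be escaped.
escaped = """He wrote \"\"\"quoted\"\"\" here"""
```

Here in_single and escaped hold the same text; the first is simply easier
to write and to read.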

> 9.  Are general unicode characters allowed in non-quote-delimited strings?

You know my view on this. I want to discourage non-delimited strings and
encourage delimited strings. But I can't see (for now) any reason that the
character sets have to be different.

There is one thing about Unicode we have to clarify. The XML specification
does not allow ALL Unicode characters because some of them (I think) break
the parsing process. The exclusion set is small, but probably significant. I
don't know the details, but when we say Unicode characters we had better be
explicit as to which. Herb, you seem to have a handle on the XML spec; maybe
you can explain what the exclusion set is and why, and propose to this
group what the Unicode set should be.
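
For reference, the Char production in XML 1.0 allows tab, line feed,
carriage return, U+0020 to U+D7FF, U+E000 to U+FFFD and U+10000 to
U+10FFFF. That excludes the other C0 control characters, the surrogate
range, and U+FFFE and U+FFFF. A sketch of the test follows; the function
name is made up, and whether CIF2 should adopt the same set is the open
question, not something this code decides.

```python
# Code point ranges permitted by the XML 1.0 "Char" production.
XML10_CHAR_RANGES = (
    (0x9, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
)


def is_xml10_char(ch):
    """True if the single character ch is allowed in an XML 1.0 document."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in XML10_CHAR_RANGES)
```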



Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au
