Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

Ok. Back on board. I am proposing some old and some new stuff here. From the
beginning,

(1) restricting the character set of non-delimited strings is
NON-NEGOTIABLE. If we don't restrict it, then we can't build recursive data
structures and exploit DDLm. If we aren't going to exploit DDLm, IUCr should
drop it now and stick with its current DDL.

IUCr needs to make that decision now.

I have built a new lexer for the current syntax specification and checked
for cases where

(1) a double-quote-delimited string contains a double quote.
(2) a single-quote-delimited string contains a single quote.
(3) a non-delimited string contains any of " ' , : { }
(4) a data name contains any of (3)

The contents of (3) are a sufficient (I think) restriction on non-delimited
strings to enable us to move forward.
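
A minimal sketch of the four checks, for illustration only. This is not the
lexer described above: the function name `violates` and the token-kind labels
are hypothetical, and it assumes each token has already been split out with
any enclosing quotes removed.

```python
# Characters proposed as forbidden in non-delimited strings and data names.
RESTRICTED = set("\"',:{}")


def violates(kind, body):
    """Return True if a token body fails the check for its kind.

    kind is one of 'dq', 'sq', 'bare', 'name' (hypothetical labels);
    body is the token text with any enclosing quotes already removed.
    """
    if kind == "dq":      # check (1): double quote inside "..."
        return '"' in body
    if kind == "sq":      # check (2): single quote inside '...'
        return "'" in body
    if kind in ("bare", "name"):  # checks (3) and (4)
        return any(c in RESTRICTED for c in body)
    raise ValueError(f"unknown token kind: {kind}")
```

Note that [ and ] are deliberately absent from the restricted set, so a data
name such as _refln[1] passes, while a bare URL fails on its colon.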

I have scanned 10345 of the 60173 (17%) mmCIF files in the archive. The
results are

(1) 0 of the 3.4M (M = million) data values failed the test.

(2) 4 of the 1.3M data values failed the test.
When I pointed these out to John he said these SHOULD have been in
semi-colon delimited text because at the PDB they have been systematically
dealing with quotes within quotes to avoid parsing problems.

HENCE not allowing a string delimiter character within the string delimited
by the same character poses very little or no problem in mmCIF.

(3) 138,733 of the 2,009M data values failed the test (.007%)

Again the magnitude of the problem has been exaggerated. The restrictions
will not affect many of the archived data items. All the failures were
limited to 3-5 data names. These were those with embedded : which includes
the specification of a URL, and those with embedded , to which Herb has
already alluded. John has stipulated that those restrictions we are
suggesting can be quickly and efficiently implemented (I am here and have
looked at their systems: the change is a single edit to a dictionary entry,
and all software handles it immediately). I believe the PDB has a
remediation process that will resolve all legacy issues (at least for them).

Conclusion: This restriction has minimal (.007%) impact on how things have
been done, and can be easily implemented for files from here on.
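
The percentages quoted above can be rechecked with a few lines of Python. This
is only a sanity check on the arithmetic, and it assumes "2,009M" means 2,009
million data values.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Share of the archive scanned: 10345 of 60173 files.
scanned = percent(10345, 60173)

# Failing non-delimited values: 138,733 of 2,009 million (assumed reading).
failed = percent(138_733, 2_009_000_000)

print(f"{scanned:.0f}% scanned, {failed:.3f}% failed")
```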

(4) 0 data names contain these characters.

I will not comment further on this point until I have done the same analysis
for the IUCr archive. I suspect the problem will be bigger for those files
because they represent a more lackadaisical period in CIF's evolution where
we suggested you could do whatever you want etc, and also there are IUCr
mark ups that likely cause problems. Once I get my hands on that archive I
will let people know.

Now guess what? If we don't allow a ' within a '..' and a " within a ".."
and any "',:{} within a non-delimited string or a data name WE DON'T NEED A
SPACE BEFORE OR AFTER THE TOKEN DELIMITER. This simplifies AND more
importantly NORMALIZES the grammar.
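
To illustrate why the whitespace requirement disappears, here is a rough
Python sketch of a tokenizer for a simplified syntax. The token classes, the
name `tokenize`, and the use of : and , as separators inside {...} are
assumptions made for the example, not part of any agreed grammar.

```python
import re

# One alternative per token class. Because bare words cannot contain
# any of " ' , : { } the classes never overlap, so no surrounding
# whitespace is needed to tell them apart.
TOKEN = re.compile(r"""
    (?P<ws>\s+)
  | (?P<punct>[{},:])
  | "(?P<dq>[^"]*)"
  | '(?P<sq>[^']*)'
  | (?P<bare>[^\s"',:{}]+)
""", re.VERBOSE)


def tokenize(text):
    """Split text into (kind, value) pairs, skipping whitespace."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.lastgroup != "ws":
            tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens


# The same input with and without spaces yields the same tokens.
compact = tokenize("{a:'b c',\"d\":1.5}")
spaced = tokenize("{ a : 'b c' , \"d\" : 1.5 }")
```

With these rules, `compact` and `spaced` hold identical token lists, which is
the normalization being argued for.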

I don't accept the argument that the new parser is so much more difficult
than existing parsers. Currently you have (if you are inside a double quote
delimited string)

if (ch == '"') {
  tmpchar = lookahead(1);
  if (tmpchar == ' ') return END_OF_STRING;
  else continue;
}

In the new parser you will have

if (ch == '"') return END_OF_STRING;
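
The two rules can be compared side by side in a short Python sketch. The
function names `end_old` and `end_new` are hypothetical, and the old rule is
assumed to accept any whitespace or end of input after the closing quote,
which is slightly broader than the single-space test in the fragment above.

```python
def end_old(text, start):
    """Index of the closing quote under the whitespace-lookahead rule.

    start is the index just after the opening double quote. A quote
    closes the string only if followed by whitespace or end of input.
    """
    for i in range(start, len(text)):
        if text[i] == '"' and (i + 1 == len(text) or text[i + 1].isspace()):
            return i
    return -1  # no closing quote found


def end_new(text, start):
    """Index of the closing quote when embedded quotes are forbidden."""
    return text.find('"', start)


sample = '"a"b" next'
old_end = end_old(sample, 1)   # skips the quote followed by 'b'
new_end = end_new(sample, 1)   # stops at the first quote
```

For `sample`, the old rule ends the string at index 4 and the new rule at
index 2; for a string with no embedded quote the two agree.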

YOU WILL NOTE:

I have not included the [] characters in the restriction. There is too much
legacy associated with their existence in data names in both small and mm
CIFs.

I am going to suggest a single token to represent lists, lists of lists and
associative arrays, namely  {...}. These are new, and don't present a
problem.

UTF-8 encoding. This is a 1-4 byte variable encoding schema (actually
originally up to 6 bytes providing 31 bits of representation). It is a
binary representation. The encoding algorithm is not brain busting, but
neither is it trivial. Having a CIF file not editable by a bog standard
editor will upset some people. I propose the introduction of a new string
type within the DDLm semantics that allows one to define it to be Unicode.
Within the string I propose we adopt a \uABCD[EF] (ie 1-6 HEX characters) to
represent the character. Equally we could go with the HTML approach of
&#xABCD[EF]; (ie 1-6 HEX characters).

I also strongly propose support for the UNICODE string within """ strings
ONLY. Let's start from a restrictive stance from the outset.
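
A minimal Python sketch of decoding the proposed \u escape follows. The
proposal does not say how a variable-length escape ends, so this example
assumes a greedy match of up to 6 hex digits; the name `decode_escapes` is
hypothetical.

```python
import re

# Assumption: an escape is \u followed by 1-6 hex digits, matched
# greedily. The proposal does not say how an escape ends, so a hex
# digit directly after a short escape would be absorbed into it.
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{1,6})")


def decode_escapes(text):
    """Replace each \\uXXXX escape with the code point it names."""
    def repl(match):
        code = int(match.group(1), 16)
        if code > 0x10FFFF:
            raise ValueError(f"code point out of range: {match.group(0)}")
        return chr(code)
    return ESCAPE.sub(repl, text)


word = decode_escapes(r"\uC5ngstr\uF6m")       # ASCII-only source text
utf8_len = len(decode_escapes(r"\u1F600").encode("utf-8"))
```

Here `word` decodes to "Ångström" and the emoji code point takes 4 bytes in
UTF-8. The greedy rule has a cost: in \uE9e the trailing "e" is read as a hex
digit, which suggests a fixed width or a terminator (as in the HTML form)
would be safer.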

I will be arriving at Dowling at about noon on Wednesday Herb. I'll bring my
boxing gloves, Frances can referee :)

Nick

On 6/10/09 11:01 PM, "James Hester" <jamesrhester@gmail.com> wrote:

> Dear All:
> 
> As a result of the discussion with Herbert I can see two differing
> approaches to these CIF syntax changes:
> 
> 1. Any changes to CIF syntax should be such that earlier syntax
> versions form a subset of the new syntax, i.e. files in the older
> syntax will also conform to the new syntax
> 
> or
> 
> 2. When making changes to the standard, the opportunity should be
> taken to simplify and streamline syntax as much as possible.
> 
> Advantages of (1): a single CIF parser can be maintained for all
> syntax versions; a CIF writer is always conformant to the latest
> version and only needs changing if new syntax features are to be used;
> the existing CIF software ecosystem is minimally affected
> 
> Advantages of (2): implementation of CIF readers/writers from scratch
> is easier; the standard is easier to define formally and more
> aesthetically pleasing; mistakes in previous versions can be fixed,
> warts do not accumulate
> 
> I would like to suggest we act as follows: in essence, we deprecate
> rather than exclude.  In detail:
> 
> 1. For this edition of the standard (1.2) we follow Herbert's line,
> leaving everything currently defined untouched.  We simply add triple
> quote delimited strings and bracket expressions.  The content of
> non-delimited strings in bracket expressions will be as proposed by
> Nick.
> 
> 2. In the documents associated with the new standard we strongly
> suggest that all non-delimited strings use the same character set as
> for non-delimited strings in bracket expressions (i.e. Nick's original
> proposal).  We might point out that this simplifies code for writing
> CIFs, and perhaps (if all agree) we add that using the CIF1.1
> non-delimited string character set is deprecated, darkly foreshadowing
> that a future version of the syntax standard will adopt this character
> set for all non-delimited strings.
> 
> 3. We also deprecate including string delimiters inside strings,
> regardless of whitespace issues.
> 
> 4. In all dictionaries we adopt the restricted character set for
> non-delimited strings and exclusion of string delimiters in strings.
> 
> 5. We ask that CheckCIF emit a warning about use of deprecated
> characters in non-delimited strings
> 
> 6. When (say in 10 years' time) a sufficiently large proportion of
> incoming CIFs conform to the new non-delimited string character set,
> we promulgate the 1.3 version of the standard.
> 

cheers

Nick

--------------------------------
Associate Professor N. Spadaccini, PhD
School of Computer Science & Software Engineering

The University of Western Australia    t: +61 (0)8 6488 3452
35 Stirling Highway                    f: +61 (0)8 6488 1089
CRAWLEY, Perth,  WA  6009 AUSTRALIA   w3: www.csse.uwa.edu.au/~nick
MBDP  M002

CRICOS Provider Code: 00126G

e: Nick.Spadaccini@uwa.edu.au

