Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

I am currently connected to the world via a slow dialup connection, so I will
tend towards fewer, wordier messages.

There are two issues here which we can treat separately. The
first is the restriction of the character set for non-delimited
strings, to which I have seen no objections so far.  Can we therefore
take the expression given by Nick as agreed?  For reference it was:

non-DS = [A-Za-z0-9./-()+?][A-Za-z0-9_./-()+?]*
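
For illustration only, here is one possible reading of that expression as a
Python regular expression. It assumes the hyphen is meant as a literal
character (as written, "/-(" would be parsed as a reversed range and rejected
by most regex engines), so the hyphen is moved to the end of each class. The
names NON_DS and is_non_delimited are invented for this sketch and are not part
of any proposal.

```python
import re

# Assumption: the "-" in the quoted expression is a literal hyphen, so it is
# placed last in each character class to stop it being read as a range.
FIRST_CHAR = r"[A-Za-z0-9./()+?-]"   # no underscore allowed in first position
OTHER_CHARS = r"[A-Za-z0-9_./()+?-]"  # underscore allowed afterwards
NON_DS = re.compile(FIRST_CHAR + OTHER_CHARS + "*")


def is_non_delimited(token):
    """Return True if the whole token matches the non-DS production."""
    return NON_DS.fullmatch(token) is not None
```

Under this reading a value such as 1.234(5) is a legal non-delimited string,
while anything starting with an underscore or containing a quote character is
not.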

There remains then the treatment of whitespace.  Following Nick's
visit, I have had some time to ponder this topic and have shifted my
position somewhat. I am not overly swayed by the assertion that
computer language parsers never use whitespace as a delimiter, so
neither should we. A CIF file is different from a computer language
source file.  By and large, computer language source files are
created, edited and maintained by humans, who will generally do
whatever they can to improve readability, including using whitespace
to delimit words when appropriate.  There is no reason, beyond
enforcing readability, to use whitespace as a delimiter (though note
that Python treats indentation as semantically meaningful). CIF files, on the
other hand, are almost always computer-generated and computer-read,
and so unless whitespace is required by the standard it will tend to
disappear.  This erodes CIF readability, one of the pleasant features
of CIF when compared with other data formats.  Therefore, while I
sympathise with the urge to simplify the BNF description, I believe
the complexity introduced by whitespace treatment is the price we pay
for enforcing readability. So I would prefer that all items in a CIF
file are separated by whitespace, where I view a bracket expression as
a single item.

That said, we need to disallow delimiter characters inside delimited
strings, even when they are not followed by whitespace. This would simplify parsing,
editing in delimiter-aware editors, and importation of CIF loops into
other software (e.g. spreadsheet software often understands double and
single quote delimited strings, and whitespace as a delimiter). It
also simplifies treatment of delimited strings inside bracket
structures, where one might expect that a comma or close bracket could
follow immediately after a string closing delimiter.
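
To make the difference concrete, here is a rough Python sketch of the two
closing rules for a quote-delimited string. read_quoted follows the suggestion
above (the first repeat of the opening delimiter always closes the string);
read_quoted_cif11 follows the existing behaviour, where a quote closes the
string only if whitespace or end of input comes next. Both function names are
made up for this example, and line-ending and text-field handling are ignored.

```python
def read_quoted(text, start=0):
    """Proposed rule: the next occurrence of the opening quote closes the string.

    Returns (contents, index just past the closing quote).
    """
    quote = text[start]
    if quote not in ("'", '"'):
        raise ValueError("expected a quote character")
    end = text.find(quote, start + 1)
    if end == -1:
        raise ValueError("unterminated string")
    return text[start + 1:end], end + 1


def read_quoted_cif11(text, start=0):
    """Existing rule: a quote closes the string only before whitespace or end of input."""
    quote = text[start]
    if quote not in ("'", '"'):
        raise ValueError("expected a quote character")
    pos = start + 1
    while True:
        end = text.find(quote, pos)
        if end == -1:
            raise ValueError("unterminated string")
        if end + 1 == len(text) or text[end + 1].isspace():
            return text[start + 1:end], end + 1
        pos = end + 1  # embedded quote: keep scanning
```

With the input 'abc',2] the first function returns abc and leaves the parser
positioned at the comma, which is what a bracket-structure parser wants. The
second function does not see a closing quote at all, because the quote is
followed by a comma rather than whitespace.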

A concern for backwards compatibility has been expressed.  There are four
different types of compatibility issues that I can see:

1. Ability of legacy software to read new-style (CIF 1.2) CIF files
2. Ability of legacy software to write new-style CIF files
3. Need for remediation of old-style CIFs.
4. Upgrade burden on software writers

Regarding reading: as soon as a triple quote or bracket construct
appears in a CIF file, legacy software will not parse the CIF
correctly.  I would suggest that it is therefore pointless to worry
about incompatibilities in the details of string-handling also
breaking the parse.  Quite the opposite: if we are going to break
compatibility, we might as well do it all at once so that the
programmer only has to edit their code once.

Regarding writing: I believe that a policy decision has been made not
to redefine existing datanames to use bracket constructs.  Therefore,
current CIF software for outputting CIFs falls into three categories:
(a) software with conservative string handling - all non-numeric data
    delimited by quotes, even if not necessary under CIF 1.1
(b) software which puts the "#CIF1.1" magic comment at the top of its files,
    but outputs strings that might not be correct under CIF 1.2
(c) software with no "#CIF1.1" magic comment and incorrect CIF 1.2 string
    handling.

I would suggest that only type (c) is of concern, and that these files are
easily caught and fixed by adding "#CIF1.1" to the top.

Need for remediation: as Nick has said, this simply means putting a
"#CIF1.1" string sequence at the top of every file that doesn't have
one.
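
As a rough sketch of how small that remediation step is, the Python below
prepends the magic comment to any CIF text that does not already start with
it. The string is spelled as in this message; the exact spelling required by
the standard should be substituted. ensure_magic_comment is a name invented for
this example.

```python
# Spelled as in this message; replace with the exact form the standard requires.
MAGIC = "#CIF1.1"


def ensure_magic_comment(cif_text, magic=MAGIC):
    """Return cif_text with the magic comment as its first line.

    Text that already starts with the magic comment is returned unchanged.
    """
    if cif_text.startswith(magic):
        return cif_text
    return magic + "\n" + cif_text
```

Applying the function twice gives the same result as applying it once, so it
can be run over a whole archive without checking which files were already
remediated.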

Upgrade burden: I think this is where we have to tread carefully, as a
large part of the success of CIF 1.2 will depend on the provision of
programs that support it. For this reason, Nick's proposal to minimise
the number of string productions is welcome as it translates into
reduced work for the programmer.  Disallowing delimiters inside
strings, even when not followed by whitespace, also simplifies things
in a small way for the programmer.


-- 
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148