
Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.

To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
Date: Wed, 30 Sep 2009 09:32:52 -0400 (EDT)
In-Reply-To: <279aad2a0909300514s2608eb59u851ed658352164b4@mail.gmail.com>
References: <C6E123F5.11EB6%nick@csse.uwa.edu.au><20090924063136.D23301@epsilon.pair.com><279aad2a0909300514s2608eb59u851ed658352164b4@mail.gmail.com>

Dear Colleagues,

   Just to be clear, I do think the restriction on the character set of 
non-delimited strings is unwise -- of all the changes proposed, I believe 
that it is the one that invalidates the largest number of existing CIFs, 
and it serves no useful purpose that could not be achieved by the simple 
exclusion of specific cases, as we have already done.

   To be specific, I would recommend continuing to accept any string of 
printable characters that does not contain embedded whitespace and that 
does not match one of the specifically enumerated lexemes of the language: 
global_, loop_, data_..., save_..., stop_..., one of the quoting 
constructs, or one of the bracketed constructs.

   I would also consider all the printable UTF-8 characters as valid.
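
   As a rough illustration only, the rule above might be sketched in 
Python as follows, next to the restricted alphabet quoted below.  The 
function names are hypothetical, and several details are assumptions 
rather than anything agreed on this list: reserved words are matched 
case-insensitively, the opening characters of the quoting and bracketed 
constructs are taken to be ' " [ and {, and the "-" in the quoted 
expression is read as a literal hyphen.

```python
import re

# Reserved lexemes named in the message above.
# Case-insensitive matching is an assumption of this sketch.
_RESERVED = re.compile(r"global_|loop_|data_.*|save_.*|stop_.*", re.IGNORECASE)

# Assumed opening characters of the quoting and bracketed constructs.
_OPENERS = ("'", '"', "[", "{")

# Restricted alphabet quoted below in this thread, with "-" escaped
# (assumed to be a literal hyphen rather than a character range).
_RESTRICTED = re.compile(r"[A-Za-z0-9./\-()+?][A-Za-z0-9_./\-()+?]*")


def is_permissive_non_delimited(token: str) -> bool:
    """Printable, whitespace-free, and not one of the excluded lexemes."""
    if not token:
        return False
    if any(ch.isspace() or not ch.isprintable() for ch in token):
        return False
    if token.startswith(_OPENERS):
        return False
    return _RESERVED.fullmatch(token) is None


def is_restricted_non_delimited(token: str) -> bool:
    """Matches the restricted alphabet as read in this sketch."""
    return _RESTRICTED.fullmatch(token) is not None
```

   Under these assumptions a value such as x,y,z passes the permissive 
check but fails the restricted one, while loop_ and DATA_block1 are 
rejected by the permissive check as reserved lexemes.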

   Regards,
     Herbert

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Wed, 30 Sep 2009, James Hester wrote:

> I am currently connected to the world via a slow dialup connection, so I will
> tend towards fewer, wordier communications.
>
> There are two issues here which we can treat separately. The
> first is the restriction of the character set for non-delimited
> strings, to which I have seen no objections so far.  Can we therefore
> take the expression given by Nick as agreed?  For reference it was:
>
> non-DS = [A-Za-z0-9./-()+?][A-Za-z0-9_./-()+?]*
>
> There remains then the treatment of whitespace.  Following Nick's
> visit, I have had some time to ponder this topic and have shifted my
> position somewhat. I am not overly swayed by the assertion that
> computer language parsers never use whitespace as a delimiter, so
> neither should we. A CIF file is different from a computer language
> source file.  By and large, computer language source files are
> created, edited and maintained by humans, who will generally do
> whatever they can to improve readability, including using whitespace
> to delimit words when appropriate.  There is no reason beyond
> enforcing readability to use whitespace as a delimiter (NB Python's
> use of indentation as semantically meaningful). CIF files, on the
> other hand, are almost always computer-generated and computer-read,
> and so unless whitespace is required by the standard it will tend to
> disappear.  This erodes CIF readability, one of the pleasant features
> of CIF when compared with other data formats.  Therefore, while I
> sympathise with the urge to simplify the BNF description, I believe
> the complexity introduced by whitespace treatment is the price we pay
> for enforcing readability. So I would prefer that all items in a CIF
> file are separated by whitespace, where I view a bracket expression as
> a single item.
>
> That said, we need to disallow delimiters inside delimited strings,
> even if not followed by whitespace. This would simplify parsing,
> editing in delimiter-aware editors, and importation of CIF loops into
> other software (e.g. spreadsheet software often understands double and
> single quote delimited strings, and whitespace as a delimiter). It
> also simplifies treatment of delimited strings inside bracket
> structures, where one might expect that a comma or close bracket could
> follow immediately after a string closing delimiter.
>
> A concern for backwards compatibility has been expressed.  There are four
> different types of compatibility issues that I can see:
>
> 1. Ability of legacy software to read new-style (CIF 1.2) CIF files
> 2. Ability of legacy software to write new-style CIF files
> 3. Need for remediation of old-style CIFs
> 4. Upgrade burden on software writers
>
> Regarding reading: as soon as a triple quote or bracket construct
> appears in a CIF file, legacy software will not parse the CIF
> correctly.  I would suggest that it is therefore pointless to worry
> about incompatibilities in the details of string-handling also
> breaking the parse.  Quite the opposite, if we are going to break
> compatibility, we might as well do it all at once so that the
> programmer only has to edit their code once.
>
> Regarding writing: I believe that a policy decision has been made not
> to redefine existing datanames to use bracket constructs.  Therefore,
> current CIF software for outputting CIFs falls into three categories:
> (a) software with conservative string handling - all non-numeric data
>    delimited by quotes, even if not necessary under CIF 1.1
> (b) software which puts the "#CIF1.1" magic comment at the top of its files,
>    but outputs strings that might not be correct under CIF 1.2
> (c) software with no "#CIF1.1" magic comment and incorrect CIF 1.2 string
>    handling.
>
> I would suggest that only type (c) is of concern, and that these files are
> easily caught and "#CIF1.1" added to the top.
>
> Need for remediation: as Nick has said, this simply means putting a
> "#CIF1.1" string sequence at the top of every file that doesn't have
> one.
>
> Upgrade burden: I think this is where we have to tread carefully, as a
> large part of the success of CIF1.2 will depend on the provision of
> programs that support it. For this reason, Nick's proposal to minimise
> the number of string productions is welcome as it translates into
> reduced work for the programmer.  Disallowing delimiters inside
> delimited strings, even when not followed by whitespace, also
> simplifies things in a small way for the programmer.
>
>
> -- 
> T +61 (02) 9717 9907
> F +61 (02) 9717 3145
> M +61 (04) 0249 4148
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>