
Re: [ddlm-group] Data-name character restrictions - one last time

To: Group finalising DDLm and associated dictionaries <[email protected]>
Subject: Re: [ddlm-group] Data-name character restrictions - one last time
From: James Hester <[email protected]>
Date: Thu, 10 Dec 2009 13:58:17 +1100
In-Reply-To: <[email protected]>
References: <[email protected]><[email protected]> <[email protected]>

Dear All,

I've just had a discussion with Nick around the dataname characterset
issue, and have elicited a strong argument from him against
liberalising the character set. First, let me state the case for
*not* restricting the character set in CIF2 datanames:

1. Characters in a dataname such as forward slash, square brackets,
hyphen and most of UTF8 do not break the other CIF2 syntax that we
have agreed on, because:

(a) datanames cannot appear inside lists

(b) Lexers will not misinterpret square brackets etc. inside
datanames, as a dataname must be separated from any succeeding token
by whitespace. All characters up to that whitespace must therefore
either belong to the dataname, or be syntax violations.

2. If we allow datanames with these 'extra' characters, we can have
all CIF1 names in a CIF2 file, improving backwards compatibility
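
Point 1(b) can be illustrated with a minimal Python sketch. It assumes a
much-simplified tokeniser in which a dataname is an underscore followed
by any run of non-whitespace characters; this is not the agreed CIF2
grammar, quoted strings and text fields are ignored entirely, and the
function name split_datanames and the second dataname in the usage note
are hypothetical.

```python
import re

# Illustrative assumption: a dataname is "_" followed by one or more
# non-whitespace characters. Everything up to the next whitespace is
# treated as part of the name, so "/" or "[" inside it causes no ambiguity.
DATANAME = re.compile(r"_\S+")


def split_datanames(text):
    """Split whitespace-separated text into (datanames, other_tokens)."""
    names, others = [], []
    for token in text.split():
        if DATANAME.fullmatch(token):
            names.append(token)
        else:
            others.append(token)
    return names, others
```

Under these assumptions, split_datanames("_refine_ls_shift/esd_max 0.05
_hypothetical[name] 3") keeps the slash and the square brackets inside
the two names, because the only delimiter consulted is whitespace.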

The argument *against* including these characters involves looking at
the whole picture. This whole CIF2 effort is motivated by the need to
add list structures and to define a simple dictionary method language
which can manipulate data items, including these lists. So, an
argument against including these characters runs as follows:

1. Almost any liberalisation of the dataname characterset will
necessarily force additional complexity in the dictionaries, because
the datanames are used as identifiers in dREL methods. This
complexity manifests either through extra aliases, or a new dREL
'quote' function, or category/object names that do not match the
data name. This complexity makes it both more difficult to write
dictionaries, as one must keep track of the 'true' name, and more
difficult for a human reader to read and check dictionaries, as it is
not necessarily easy to find out which 'real' dataname a dREL method
is referring to in its derivation. The dREL 'quote' function idea
makes writing dictionary methods potentially more complex, when the
intention is that these methods should be maximally accessible to
non-programmers for both reading and, in particular, construction.

2. While it would be possible to simply promulgate a general principle
that DDLm dictionaries cannot define datanames using characters
outside a restricted characterset, a more robust approach is to reduce
the possibility of their appearance by making them syntactically
forbidden.
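
As a rough illustration of what a check against a restricted character
set might look like, here is a small Python sketch. The allowed set used
below (ASCII letters, digits, underscore and dot) is purely an
assumption for the example and is not the set under discussion in this
thread; the function name offending_characters is likewise hypothetical.

```python
import re

# Assumed restricted set for illustration only: ASCII letters, digits,
# underscore and dot. Any other character is reported as offending.
DISALLOWED = re.compile(r"[^A-Za-z0-9_.]")


def offending_characters(dataname):
    """Return the sorted, de-duplicated characters outside the assumed set."""
    return sorted(set(DISALLOWED.findall(dataname)))
```

With this assumed set, the CIF1 name _refine_ls_shift/esd_max is flagged
because of its forward slash, while a name made only of letters,
underscores and a dot passes. Whether such a check lives in the syntax
or only in dictionary policy is exactly the choice described in point 2.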

Which of these lines of argument you favour comes down to how highly
you value simplicity in the overall design compared to how much you
value compatibility at the syntactical level with CIF1, given that
workarounds for compatibility are possible at the dictionary level.
As we are prepared for CIF2 to be a disruptive but clean
change, I would favour simplicity and therefore keeping a restricted
character set.

On Thu, Dec 10, 2009 at 6:29 AM, David Brown <[email protected]> wrote:

I would suggest that we add CIF2 data names as aliases in the DDL1 and DDL2 dictionaries for those few items where the names differ. This would mean that any DDL1 dictionary would recognize all the CIF2 data names that corresponded to items appearing in the DDL1 dictionary. Of course files with arrays could not be read this way, and adding arrays to DDL1 dictionaries would violate the DDL1 rules and would essentially convert the DDL1 dictionaries into non-conforming DDL2 dictionaries, thus defeating the goal of being able to read CIF2 data files with DDL1 software. Adding DDLm aliases to the DDL1 dictionaries would be easy since we would need to add less than a dozen aliases (but we would have to know what the DDLm data name is, or is going to be).

Of course there is also the _ versus . problem with the data names. Adding '.' data names as aliases in the DDL1 dictionaries would get around that problem. It would be straightforward to add these, but it would be a larger job since all the data names would need an added alias. Even this will not help with legacy software using hard coded data names, but this might just encourage people to write a front-end that uses the dictionaries for input.

There is still the problem of the data names in DDL2 dictionaries that include []. Adding an alias name that does not use these characters may allow DDL2 programs to read CIF2 data files, but CIF data files containing these [] data names would require a CIF1 parser (and lexer?), so we just need to recognize this fact and live with it. In any case a CIF1 lexer should always be an optional front-end to a DDLm dictionary if it is to read in legacy data files as required by the specifications.

It would not be straightforward for a DDLm program to output a CIF1 data file, but is this really necessary? Once one has started to use the features of CIF2 one would probably wish to output items that do not even exist in CIF1. If one needed a fully compliant CIF1 data file in order to make use of legacy software, it might be better to write a CIF2 file, but restrict the items to those that exist in the DDL1 (or 2) dictionaries (this information can be found from the aliases in the DDLm dictionaries). The DDLm data names that would appear in this data file are either identical to the DDL1 data names or would appear as aliases to the DDL1 dictionary as described above.

As for recognizing CIF2 data files, isn't that what the magic code is for? If someone chooses not to use the magic code their CIF2 data file is non-conforming and they can expect difficulties. Most of the CIFs in DDL1 are initially prepared by computer, with only the text being added by hand. Once the computers start generating CIF2 data files, they will be programmed to add the magic code.

David

Herbert J. Bernstein wrote:

Personally, I would greatly prefer to allow all data names that do not create a major lexer/parser conflict to appear in a data CIF and only apply the strong restrictions to data names that appear in CIF2 dictionaries as defined data names (not as aliases).

-- Herbert

At 2:40 PM +0000 12/9/09, Brian McMahon wrote:

I have one remaining niggle that I'd like to revisit before we put this finally to bed. As has been mentioned a couple of times recently, restricting the data-name character set does invalidate syntactically many existing CIF 1 files (e.g. _refine_ls_shift/esd_max ). We have discussed strategies for handling this, and I think these are workable strategies, but will involve investment and hence expense in workflow management in CIF archives.

I understand the rationale behind this restriction is to simplify future processing of data names in areas such as dREL applications. The question really is whether we're choosing the right trade-off in making things cleaner at that end of the processing chain. I would suppose that a dREL or other application could ingest a data name with dangerous characters, convert it internally into a "safe" identifier that's used for all processing, and then restore the original form upon output; but writing that intermediate layer of processing is of course expensive (especially if there aren't readily available libraries that will do this transparently).

I suspect that some of the original proposed syntactic changes also had the effect (whether by design or collaterally) of simplifying i/o, data structure management, symbol table processing etc., but those may have suffered in the subsequent revision exercise we've just been practising. Given the consensus we are now approaching, would the code builders now be prepared to incur the additional expense of handling "dangerous" data names?

I really don't want to spark off a long discussion on this - if a quick round of response shows that there's no appetite to allow the additional punctuation characters in data names, I'll accept that gracefully.

***

One last comment while I have the floor, though it is related in part to the above question. A concern raised in the editorial office was that there would be circumstances where users didn't know if they were dealing with a CIF 1 or 2 ("users" meaning authors, perhaps resorting to the vi editor - and we're imagining most of them are dealing with small-molecule/inorganic CIFs). My supposition is that the IUCr editorial offices would only want to use CIF2 seriously in association with DDLm dictionaries, and that we would expect the revised core dictionaries to use the dot component in data names to signal this further evolution. So even a superficial glimpse of the middle of a CIF would make it clear whether it was CIF1 or CIF2. Does that fit in with how others see this progressing?

Cheers
Brian
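
The intermediate layer Brian describes (convert a data name to a "safe" identifier for processing, then restore the original on output) might be sketched in Python along the following lines. This is only an illustration under stated assumptions: the class NameTable and its methods are hypothetical, the "safe" alphabet is taken to be ASCII letters, digits and underscore, and nothing here reflects how any existing dREL implementation actually handles names.

```python
import re


class NameTable:
    """Hypothetical two-way table between data names and safe identifiers."""

    def __init__(self):
        self._to_safe = {}      # original data name -> safe identifier
        self._to_original = {}  # safe identifier -> original data name

    def safe(self, dataname):
        """Return a safe identifier for dataname, creating one if needed."""
        if dataname in self._to_safe:
            return self._to_safe[dataname]
        # Assumed safe alphabet: replace every other character with "_".
        base = re.sub(r"[^A-Za-z0-9_]", "_", dataname)
        candidate, n = base, 1
        # Two different names may collapse to the same base, so add a
        # numeric suffix until the identifier is unused.
        while candidate in self._to_original:
            n += 1
            candidate = f"{base}_{n}"
        self._to_safe[dataname] = candidate
        self._to_original[candidate] = dataname
        return candidate

    def original(self, safe_name):
        """Restore the original data name for output."""
        return self._to_original[safe_name]
```

The collision handling is the part that hints at the cost: once two distinct data names can map to similar identifiers, every tool has to carry a table like this around, which is the kind of extra bookkeeping the argument for a restricted character set is trying to avoid.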
_______________________________________________
ddlm-group mailing list
[email protected]
http://scripts.iucr.org/mailman/listinfo/ddlm-group

--
T +61 (02) 9717 9907
F +61 (02) 9717 3145
M +61 (04) 0249 4148


