Re: [ddlm-group] Data-name character restrictions - one last time

To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
Subject: Re: [ddlm-group] Data-name character restrictions - one last time
From: David Brown <idbrown@mcmaster.ca>
Date: Thu, 10 Dec 2009 13:55:27 -0500
In-Reply-To: <4B21311E.1080900@niehs.nih.gov>
References: <20091209144035.GB29341@emerald.iucr.org> <a06240801c74578ec8b59@192.168.2.104> <4B1FFAAD.4000408@mcmaster.ca> <279aad2a0912091858g44848fb7wc3990553d3582d99@mail.gmail.com> <a06240803c7461b28c06a@192.168.2.104> <279aad2a0912092113s6dabd9ddo176e96fc71a752b0@mail.gmail.com> <a06240802c7463c929515@[192.168.2.104]> <4B21311E.1080900@niehs.nih.gov>

I agree with Herbert. If the CIF2 lexers can handle the characters - / [ and ] in a nondelimited data name string, then there is no reason to exclude them from the CIF2 definition. In CIF2 all data names are converted, using the aliases, to the standard DDLm data names before use. We also note that we will avoid using these characters in any new data names, and dREL will only recognize the standard DDLm data names. In this way CIF2 applications make no distinction between CIF1 and CIF2 data files and can deal with hybrids that could otherwise cause problems, e.g., data names with - / [ or ] occurring in a CIF2 file along with arrays and other CIF2 features. The user will be aware of the extensions, but will not be aware of any distinction between CIF2 and CIF1 data files. It is a beautiful, clean solution that meets the requirement of backward compatibility.

David

Joe Krahn wrote:

FWIW, here is my view. Whatever the formal standard defines, it is almost certain that many CIF2 implementations will allow CIF1 data names for all of the practical reasons given by Herbert. Other people prefer to avoid a "messy" implementation, and make a strict conversion to CIF2.

As Brian points out, there is quite a bit of diversity in CIF and pseudo-CIF implementations. Sometimes (often?) it is due to not reading the CIF specifications; that is how PDB atom name element-alignment rules got mutilated. Other times, it is for practical reasons to get work done.

I think a good analogy is Fortran77. Fortran language development stalled after that due to conflicting views on modernization versus maintaining traditional Fortran. Meanwhile, people needed to get work done, and compiler developers added many non-standard extensions. Most Fortran code used these extensions, and many became almost universal. Everyone still called it Fortran.

So, maybe it really does not matter if "standard" CIF2 allows CIF1 data names.
Everyone who wants to will do it anyhow, and others will run them through dictionary aliases before using the non-standard data files.

Fortran developed slowly due to lack of agreement, leading to many annoyances that were being solved quickly by newer programming languages. That is why Fortran is almost dead. So, the other lesson is that CIF needs to avoid annoying the user base, or they will just switch to XML, and this work will all be for nothing.

If the consensus is to keep name restrictions to promote proper DDLm-compatible names, it might be worth writing a formal CIF2 extension. This would keep the non-standard naming implementations compatible, while still making it clear that they are not proper CIF2.

I always write code to compile with flags for strict standards compliance. It often avoids problems, but can also prevent me from using some useful language extensions. Other people are happy to mandate GNU compilers, and benefit from those features -- GNU libc uses them a lot.

Joe

Herbert J. Bernstein wrote:

Dear James,

With all due respect, I believe you are completely, unconditionally wrong about this point. The aliases in the CIF2 dictionaries will allow us to continue to accept both CIF1 DDL1 and CIF1 DDL2 data files and process them against DDLm dictionaries, using methods and doing a better job at validation. From a user point of view, we are processing all these CIFs with DDLm -- who gains anything by saying that something is illegal in his CIF?

Perhaps you are concerned that a user will mix DDL1 and DDL2 tags in a single core CIF file. Why is that a problem? Whether you do it in one pass or two, the translation of the CIF1 tags into the aliased CIF2 tags and a clean-up of any funny strings should give a valid CIF2 file for code later down the pipeline. I cannot see _any_ user external in the specification of a data file where this distinction matters.
This is very different from the differences between Fortran and C, where the computational model is very different, making, for example, the output of f2c almost unreadably complex. Hopefully we are not talking about a similar difference between CIF1 and CIF2. The better comparison is not Fortran versus C, but C versus Java. C includes an explicit preprocessor. Java does not. This creates a nuisance for Java users, and multiple competing approaches to handling macros -- e.g. using m4 versus using the C preprocessor.

Both we and the users are better off clearly specifying the operation of the aliasing mechanism as part of CIF2. The alias mechanism is not something external to CIF2. It is an essential part of CIF2.

I propose a very simple rule for aliases: that any string beginning with an underscore and not containing any whitespace will be accepted and handled by the aliasing mechanism, and may appear as a tag in a CIF data file presented for processing if it is aliased to a valid CIF2 tag in the associated dictionary. This has zero impact on dREL methods, but it does require a clear agreement to provide an alias-translating front-end.

Regards,
Herbert

At 4:13 PM +1100 12/10/09, James Hester wrote:

Dear Herbert and others:

See comments inserted below.

On Thu, Dec 10, 2009 at 3:01 PM, Herbert J. Bernstein <yaya@bernstein-plus-sons.com> wrote:

Dear James and Nick,

I find this position very difficult to understand. Please look at the problem from the external point of view of a user of CIF, rather than the point of view of a designer of lexers and parsers, and I think you will see that _externally_ we have to support tag names with at least the square brackets. We are committed to being able to process existing CIF1 files using DDLm dictionaries.
It does not matter how we do this -- with one pass in an integrated parser, or two passes through a dictionary driver alias applier, or more passes through some other construct -- from the _user's_ point of view, there is a specification of something called DDLm that is going to accept files that conform to the existing DDL1 and DDL2 dictionaries and that also accepts files that conform to the DDLm specification. We accomplish nothing useful by telling such a user that yes, we will accept CIFs with tags that contain square brackets from the CIF1 core dictionary, but no, such tags are not legal in CIF2. Be the user. He will think we are insane. You go on to say: "We are going to accept the old CIF1 tags, no matter whether a particular parser accepts them, they are a legal part of the overall CIF2 system -- deprecated perhaps -- but legal."

If I translate from Fortran to C, does that make Fortran part of my C environment? I think not. In the same way, just because a route exists to use CIF1 data files in a CIF2 environment, that does not mean that CIF1 'is part of' CIF2.

Now, let's think about the 'users' more specifically. There are the CIF writing contingent (single crystal software authors, for example) who will be told: you can continue to use CIF1 datanames as defined in DDL1/2 dictionaries and from an external point of view these will be processed as before (why is this insane?). Alternatively, you can use these CIF2 datanames as replacements, and the files will still be processed in the same way (this is also not insane). We do not have to tell the 'users' not to use square brackets, we simply tell them about replacement datanames. You seem to be arguing that having data name aliases is in itself crazy?

By the same token, we require of CIF readers nothing more than an acceptance of the aliasing protocol, which has been around for a while.

The user will have no trouble if we say that we are moving away from tags with such characters.
The user would not even have a problem if we said that we would not be able to do as good a job validating files with such tags as we could for ones with the new tags, but if the alias mechanism works properly, even that would be untrue. From the user point of view, nothing but confusion results if we tell him that tags in existing official CIF1 dictionaries with square brackets are illegal -- they won't be.

Well, I would be confused too if somebody told me that something was illegal in CIF1 when it wasn't. CIF2 is not part of CIF1, but it does maintain backward compatibility with CIF1. There is a difference.

No, the only people who need to hear about this character restriction and understand it are CIF2 DDLm dictionary writers, who need to be told: "If you are creating a new tag, you would be well advised not to include square brackets in the name because CIF2 requires somewhat complex alias mechanisms to handle such tags."

The entire argument you are making sounds like an argument against fully supporting the alias mechanism. That simply will not fly. The aliases of the old CIF1 tags have to be fully documented and supported in CIF2. You can say that the valid CIF2 tags with the restricted character set are the only tags that will appear unquoted in a CIF2 dictionary, and that they are the only tags that will appear in a dREL method, but that is _not_ something the users will care about -- only the dictionary writers.

To be blunt, the entire argument about "simplicity" or being "maximally disruptive" is, in my opinion, misguided and uncoupled from the more important objective of providing something useful and comprehensive to our users. Issues such as simplicity and zero-based design (which is as much of maximally disruptive design as I can swallow) are important _only_ if reasonable user externals are achieved. In this case these peripheral issues are getting in the way and must be sacrificed to the real needs of our users to get work done.
Regards,
Herbert

At 1:58 PM +1100 12/10/09, James Hester wrote:

Dear All,

I've just had a discussion with Nick around the dataname characterset issue, and have elicited a strong argument from him against liberalising the character set. First, let me state the case for *not* restricting the character set in CIF2 datanames:

1. Characters in a dataname such as forward slash, square brackets, hyphen and most of UTF8 do not break the other CIF2 syntax that we have agreed on, because:
(a) datanames cannot appear inside lists
(b) Lexers will not misinterpret square brackets etc. inside datanames, as a dataname must be separated from any succeeding token by whitespace. All characters up to that whitespace must therefore either belong to the dataname, or be syntax violations.

2. If we allow datanames with these 'extra' characters, we can have all CIF1 names in a CIF2 file, improving backwards compatibility

The argument *against* including these characters involves looking at the whole picture. This whole CIF2 effort is motivated by the need to add list structures and to define a simple dictionary method language which can manipulate data items, including these lists. So, an argument against including these characters runs as follows:

1. Almost any liberalisation of the dataname characterset will necessarily force additional complexity in the dictionaries, because the datanames are used as identifiers in dREL methods. This complexity manifests either through extra aliases, or a new dREL 'quote' function, or category/object names that do not match the data name. This complexity makes it both more difficult to write dictionaries, as one must keep track of the 'true' name, and more difficult for a human reader to read and check dictionaries, as it is not necessarily easy to find out which 'real' dataname a dREL method is referring to in its derivation.
The dREL 'quote' function idea makes writing dictionary methods potentially more complex, when the intention is that these methods should be maximally accessible to non-programmers for both reading and, in particular, construction.

2. While it would be possible to simply promulgate a general principle that DDLm dictionaries cannot define datanames using characters outside a restricted characterset, a more robust approach is to reduce the possibility of their appearance by making them syntactically forbidden.

Which of these lines of argument you favour comes down to how highly you value simplicity in the overall design compared to how much you value compatibility at the syntactical level with CIF1, given that workarounds for compatibility are possible at the dictionary level. As we are prepared for CIF2 to be a disruptive but clean change, I would favour simplicity and therefore keeping a restricted character set.

On Thu, Dec 10, 2009 at 6:29 AM, David Brown <idbrown@mcmaster.ca> wrote:

I would suggest that we add CIF2 data names as aliases in the DDL1 and DDL2 dictionaries for those few items where the names differ. This would mean that any DDL1 dictionary would recognize all the CIF2 data names that corresponded to items appearing in the DDL1 dictionary. Of course files with arrays could not be read this way, and adding arrays to DDL1 dictionaries would violate the DDL1 rules and would essentially convert the DDL1 dictionaries into non-conforming DDL2 dictionaries, thus defeating the goal of being able to read CIF2 data files with DDL1 software.

Adding DDLm aliases to the DDL1 dictionaries would be easy since we would need to add less than a dozen aliases (but we would have to know what the DDLm data name is, or is going to be).

Of course there is also the _ versus . problem with the data names. Adding '.'
data names as aliases in the DDL1 dictionaries would get around that problem. It would be straightforward to add these, but it would be a larger job since all the data names would need an added alias. Even this will not help with legacy software using hard-coded data names, but this might just encourage people to write a front-end that uses the dictionaries for input.

There is still the problem of the data names in DDL2 dictionaries that include []. Adding an alias name that does not use these characters may allow DDL2 programs to read CIF2 data files, but CIF data files containing these [] data names would require a CIF1 parser (and lexer?), so we just need to recognize this fact and live with it. In any case a CIF1 lexer should always be an optional front-end to a DDLm dictionary if it is to read in legacy data files as required by the specifications.

It would not be straightforward for a DDLm program to output a CIF1 data file, but is this really necessary? Once one has started to use the features of CIF2 one would probably wish to output items that do not even exist in CIF1. If one needed a fully compliant CIF1 data file in order to make use of legacy software, it might be better to write a CIF2 file, but restrict the items to those that exist in the DDL1 (or 2) dictionaries (this information can be found from the aliases in the DDLm dictionaries). The DDLm data names that would appear in this data file are either identical to the DDL1 data names or would appear as aliases to the DDL1 dictionary as described above.

As for recognizing CIF2 data files, isn't that what the magic code is for? If someone chooses not to use the magic code their CIF2 data file is non-conforming and they can expect difficulties. Most of the CIFs in DDL1 are initially prepared by computer, with only the text being added by hand. Once the computers start generating CIF2 data files, they will be programmed to add the magic code.

David

Herbert J.
Bernstein wrote:

Personally, I would greatly prefer to allow all data names that do not create a major lexer/parser conflict to appear in a data CIF and only apply the strong restrictions to data names that appear in CIF2 dictionaries as defined data names (not as aliases).

-- Herbert

At 2:40 PM +0000 12/9/09, Brian McMahon wrote:

I have one remaining niggle that I'd like to revisit before we put this finally to bed. As has been mentioned a couple of times recently, restricting the data-name character set does invalidate syntactically many existing CIF1 files (e.g. _refine_ls_shift/esd_max ). We have discussed strategies for handling this, and I think these are workable strategies, but will involve investment and hence expense in workflow management in CIF archives.

I understand the rationale behind this restriction is to simplify future processing of data names in areas such as dREL applications. The question really is whether we're choosing the right trade-off in making things cleaner at that end of the processing chain. I would suppose that a dREL or other application could ingest a data name with dangerous characters, convert it internally into a "safe" identifier that's used for all processing, and then restore the original form upon output; but writing that intermediate layer of processing is of course expensive (especially if there aren't readily available libraries that will do this transparently).

I suspect that some of the original proposed syntactic changes also had the effect (whether by design or collaterally) of simplifying i/o, data structure management, symbol table processing etc., but those may have suffered in the subsequent revision exercise we've just been practising. Given the consensus we are now approaching, would the code builders now be prepared to incur the additional expense of handling "dangerous" data names?
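The intermediate layer Brian describes (ingest a data name, work with a "safe" identifier internally, restore the original on output) can be sketched roughly as follows. This is only an illustration: the class name NameTable, its methods, and the choice of letters, digits and underscore as the "safe" set are assumptions made for the example, not anything agreed in this thread.

```python
import re


class NameTable:
    """Hypothetical two-way map between data names and safe identifiers."""

    def __init__(self):
        self._to_safe = {}      # original data name -> safe identifier
        self._to_original = {}  # safe identifier -> original data name

    def to_safe(self, data_name):
        """Return a safe identifier for data_name, creating one if needed."""
        if data_name in self._to_safe:
            return self._to_safe[data_name]
        # Assumption: anything outside [A-Za-z0-9_] counts as "dangerous".
        base = re.sub(r"[^A-Za-z0-9_]", "_", data_name)
        candidate = base
        suffix = 1
        # Two different names may collapse to the same identifier,
        # so add a numeric suffix until the identifier is unused.
        while candidate in self._to_original:
            suffix += 1
            candidate = f"{base}_{suffix}"
        self._to_safe[data_name] = candidate
        self._to_original[candidate] = data_name
        return candidate

    def to_original(self, safe_name):
        """Restore the original data name for output."""
        return self._to_original[safe_name]


table = NameTable()
safe = table.to_safe("_refine_ls_shift/esd_max")
original = table.to_original(safe)
```

The table has to be kept for the lifetime of the processing run, since the substitution alone is not reversible; that bookkeeping is part of the expense Brian mentions.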
I really don't want to spark off a long discussion on this - if a quick round of response shows that there's no appetite to allow the additional punctuation characters in data names, I'll accept that gracefully.

***

One last comment while I have the floor, though it is related in part to the above question. A concern raised in the editorial office was that there would be circumstances where users didn't know if they were dealing with a CIF1 or CIF2 ("users" meaning authors, perhaps resorting to the vi editor - and we're imagining most of them are dealing with small-molecule/inorganic CIFs). My supposition is that the IUCr editorial offices would only want to use CIF2 seriously in association with DDLm dictionaries, and that we would expect the revised core dictionaries to use the dot component in data names to signal this further evolution. So even a superficial glimpse of the middle of a CIF would make it clear whether it was CIF1 or CIF2.

Does that fit in with how others see this progressing?

Cheers
Brian

_______________________________________________
ddlm-group mailing list
ddlm-group@iucr.org
http://scripts.iucr.org/mailman/listinfo/ddlm-group
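The alias-translating front-end discussed throughout this thread can be illustrated with a minimal sketch of Herbert's proposed rule: a tag is any string that starts with an underscore and contains no whitespace, and it is accepted if the dictionary aliases it to a valid CIF2 tag. The alias table entries, the function name translate_tag, and the regular expression standing in for the restricted CIF2 character set are all assumptions made for illustration; a real front-end would read the aliases from the DDLm dictionary.

```python
import re

# Herbert's rule: an underscore followed by non-whitespace characters.
LEGACY_TAG = re.compile(r"_\S+")

# Assumed stand-in for the restricted CIF2 data-name character set.
RESTRICTED_TAG = re.compile(r"_[A-Za-z0-9_.]+")

# Hypothetical alias table (legacy tag -> DDLm tag), keyed in lower case
# because CIF data names are case-insensitive. The DDLm names on the
# right are illustrative, not taken from a published dictionary.
ALIASES = {
    "_refine_ls_shift/esd_max": "_refine_ls.shift_over_su_max",
    "_atom_site.aniso_u[1][1]": "_atom_site_aniso.u_11",
}


def translate_tag(tag):
    """Return the DDLm tag for a tag found in a data file.

    Raises ValueError if the string is not a tag at all, and KeyError
    if it is neither aliased nor already within the restricted set.
    """
    if not LEGACY_TAG.fullmatch(tag):
        raise ValueError(f"not a data name: {tag!r}")
    key = tag.lower()
    if key in ALIASES:
        return ALIASES[key]
    if RESTRICTED_TAG.fullmatch(tag):
        return tag  # assumed to be a standard DDLm name already
    raise KeyError(f"no alias defined for {tag!r}")


translated = [translate_tag(t) for t in
              ("_refine_ls_shift/esd_max", "_cell.length_a")]
```

Under this scheme a dREL method only ever sees the names returned by translate_tag, which is the property David's reply relies on when he says dREL will only recognize the standard DDLm data names.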


