Discussion List Archives

Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.


Greetings all,

I appreciate the ongoing discussions on issues related to character set,
character encoding, and quoting conventions.  We at PDB are currently in the process
of working out how we will address these issues in the future, so I can
convey our evolving perspective.

First let me address Herb's comment about the use of mmCIF syntax by PDB users.
While there has not been widespread adoption of mmCIF relative to the older PDB
format, the adoption to date has been hard fought and targets some key pipelines
and developers.   As I am sure Herb will agree, we do not want to do anything
to destabilize progress to date.    Experience to date also should be a lesson
in just how difficult it is to get people to move away from working systems,
no matter what the perceived future benefits.   On the PDB side, science is now
forcing the issue with the old PDB format to the extent that there will certainly
be a change in this area in the near term.  To drive future change in the direction
of CIF it is important not to undercut the progress that has been made by those
that have already adopted the current technology.

On to the issues --


Issues 1/2 -  quoting and backward compatibility issues -

Legacy -
On the specific issues under discussion there is little we can do about the
legacy files that we maintain, which will deviate from any new quoting conventions
that are adopted.    I should qualify this by noting that PDB maintains
timestamped snapshots of its archival files and these will contain the historical
quoting conventions.

Moving forward  -

PDB now regenerates all of its current released entries approximately yearly.  This
would provide an opportunity to address any outstanding issues with quoting moving
forward.  We have already taken steps to avoid the quoting ambiguity in atom names
by requiring all atom names containing a single quote (prime) to be quoted.  We have
further removed the double quote character from any preferred atom name in our current
chemical component dictionary.   We would now quote any string without whitespace
that has an embedded quote using quotes of the other type (e.g. "abcd'e" or 'abcd"e').
Strings containing mixed quotes are rare and we can in future wrap these with semi-colons.

This leaves the cases that Nick has identified containing unquoted strings with embedded
punctuation such as comma, semi-colon and colon.  We can in future quote such strings.
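
The quoting choices described above can be sketched in a few lines of Python.
This is only an illustration of the rules as stated in this message, not the
code PDB uses: the function name quote_value and the exact set of characters
that trigger quoting are assumptions, and other CIF restrictions (reserved
words, leading "_" or "#", text containing a line that starts with a
semi-colon) are deliberately left out.

```python
import re

# Assumption: whitespace, comma, semi-colon and colon force quoting of an
# otherwise bare value, following the cases mentioned in the message.
_NEEDS_QUOTES = re.compile(r"[\s,;:]")


def quote_value(value: str) -> str:
    """Return value delimited for output (hypothetical helper)."""
    has_single = "'" in value
    has_double = '"' in value
    if "\n" in value or (has_single and has_double):
        # Mixed quotes (or multi-line text): wrap in a semi-colon text
        # field, which has to start at the beginning of a line.
        return "\n;" + value + "\n;"
    if has_single:
        # Embedded single quote: delimit with double quotes.
        return '"' + value + '"'
    if has_double:
        # Embedded double quote: delimit with single quotes.
        return "'" + value + "'"
    if value == "" or _NEEDS_QUOTES.search(value):
        # Whitespace or embedded punctuation: quote rather than leave bare.
        return "'" + value + "'"
    return value
```

With this sketch an atom name such as O5' would be written as "O5'", a plain
name such as CA would stay unquoted, and a value such as a,b would be written
as 'a,b'.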

I believe that this addresses all of the issues that we and Nick have identified with the
PDB data set regarding quoting conventions.

Summary -  Something has to be done to provide support for existing entries but in such a
way as not to hamstring future evolution.    I am in favor of tightening the quoting
conventions in existing entries where possible and PDB can/will produce such files in future.

I suggest that a clear identification of syntax (including character encoding) + dictionary
version be built into any new system and that an unqualified default fall back to the
current convention.   In any case, in solving this problem let's not create the same
problem for the next group that has to address this issue.
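
As a rough illustration of an explicit declaration with an unqualified
fallback, the Python sketch below reads the first line of a file. The
"#\#CIF_1.1" magic code is the existing CIF 1.1 form; the trailing
"encoding=..." field, the function name read_declaration and the default
values are assumptions made only for this example. A dictionary version could
be declared in a similar way and is not shown.

```python
import re

# Fallback when a file carries no declaration: the current convention.
DEFAULT_VERSION = "1.1"
DEFAULT_ENCODING = "ascii"

# "#\#CIF_<version>" follows the CIF 1.1 magic code; the optional
# "encoding=<name>" field is a hypothetical extension for illustration.
_HEADER = re.compile(
    r"^#\\#CIF_(\d+\.\d+)(?:\s+encoding=([A-Za-z0-9_-]+))?\s*$"
)


def read_declaration(first_line: str) -> tuple[str, str]:
    """Return (syntax version, encoding) declared on the first line."""
    m = _HEADER.match(first_line)
    if m is None:
        # Unqualified file: fall back to the current convention.
        return DEFAULT_VERSION, DEFAULT_ENCODING
    return m.group(1), (m.group(2) or DEFAULT_ENCODING).lower()
```

A file that starts directly with a data block header would be treated as the
current convention, while a file with a declaration would be read according
to what it declares.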

I do not think this falls precisely into the alternatives from the previous discussion.
We are not in a position to abandon our existing archival entries so we will continue
to support old conventions.   We are prepared to comply with the proposed future conventions
as I currently understand these.

Issue 3:  UTF-8  / supporting extended character sets and alternative encodings -

This is a very difficult issue for PDB at present as we have to support a format that is
treated by most as upper case ascii.    For the most part the issue of an extended character
set impacts a small number of data items in PDB.  The ones that are impacted are quite
important.    We are now in the camp of capturing the extended character information, but
in terms of mapping codes in ascii.    We will identify in our data dictionary those
items where we expect to encounter extended character mapping.
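
To make the idea of mapping codes in ascii concrete, here is a minimal Python
sketch. The \uXXXX notation is borrowed from the example used earlier in this
thread and is an assumption; it is not the mapping PDB has adopted, and the
function names are hypothetical. Note that a value which already contains a
literal backslash followed by "u" and four hex digits would be misread on the
way back, so a real scheme would need an escaping rule for that case.

```python
import re

# Matches the hypothetical escapes: \uXXXX (4 hex digits) or \UXXXXXXXX (8).
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def to_ascii_codes(text: str) -> str:
    """Replace every non-ascii character with an ascii mapping code."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                 # plain ascii passes through
        elif cp <= 0xFFFF:
            out.append(f"\\u{cp:04X}")     # 4-digit code
        else:
            out.append(f"\\U{cp:08X}")     # 8-digit code for higher code points
    return "".join(out)


def from_ascii_codes(text: str) -> str:
    """Restore the characters represented by the mapping codes."""
    return _ESCAPE.sub(
        lambda m: chr(int(m.group(1) or m.group(2), 16)), text
    )
```

The output of to_ascii_codes contains only ascii characters, so it can be
carried by existing applications unchanged and expanded later by software
that knows which items use the mapping.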

This is very much a transitional position, but we are very sensitive to the impact of
changing character encoding on existing applications.   This is further complicated for
us in that we are already effectively supporting UTF-8 for our PDBML files as well as
legacy PDB which is forever bound to ascii.

Many of the drivers that Brian and Simon have mentioned for handling data that is
native UTF-8 are issues for us.    Again, moving forward I would suggest that it be
possible to specify the character encoding explicitly rather than lock in a specific
convention for CIF x.x.


Summary -   I cannot see not adopting UTF-8 moving forward.   Anything that PDB will create
moving forward I believe will be "readable" by a utf-8 reader.   We may not be able to fully
exploit UTF-8 encoding in the near term for export files.   In terms of reading files I
expect that we will live in a world of mixed encodings and it would be good to make the
encoding explicit in next generation files.
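
A reader living with mixed encodings might behave roughly as in the Python
sketch below. It is a sketch under stated assumptions rather than a
recommendation: the function name decode_cif_bytes is hypothetical, and the
choice of latin-1 as the fallback for undeclared, non-UTF-8 input is only a
guess at what a legacy writer intended. Pure ascii input decodes identically
as UTF-8, which is why ascii-only export files remain readable by a UTF-8
reader.

```python
from typing import Optional


def decode_cif_bytes(
    data: bytes, declared: Optional[str] = None
) -> tuple[str, str]:
    """Return (text, encoding used) for the raw bytes of a file."""
    if declared is not None:
        # An explicit declaration wins; decoding errors are not hidden.
        return data.decode(declared), declared.lower()
    try:
        # ascii is a subset of UTF-8, so legacy ascii files pass here.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # latin-1 maps every byte to a code point, so this never fails,
        # but the result is only a guess at the writer's intent.
        return data.decode("latin-1"), "latin-1"
```

The second element of the result lets the caller report or record which path
was taken, which is the information an explicit declaration would otherwise
provide.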


Regards,

John

Herbert J. Bernstein wrote:
> Dear Colleagues,
> 
>    There are multiple communities out there.
> 
>    There are, of course, the internal IUCr and PDB uses of CIF.
> 
>    For small molecule work, there are well-established CIF-based workflows 
> using DDL1-based CIFs.  There is a very large base of existing files, and
> write logic.   There would be great value in the DDLm-based validation
> for this community, but if we make it difficult and confusing, DDLm
> will simply be ignored.  If we make writers for CIFs that cannot be read
> by the base of existing software, those writers will not be used.
> 
>    For macromolecular structural work, there is very little adoption of 
> DDL2-based CIF outside of the software base controlled by the PDB, so
> there it is feasible to try to make significant changes, but for the
> rapidly growing imgCIF DDL2-based software, changes that have not been
> discussed with the detector community and vetted by them will, as with the
> small molecule community, simply be ignored.
> 
>    In addition, there are many idiosyncratic uses of CIF, e.g. in the 
> "harvest" mode for macromolecular experiments that have their own 
> software base and user community, and are simply not going to be told
> what to do.
> 
>    There are sundry and assorted CIF software packages standing alone and
> embedded in applications, for which the developers and maintainers have
> no particular need or desire to change anything, and who are not
> concerned about data validation, and who are going to continue to read
> what they currently read and, more importantly, write what they currently
> write, no matter what COMCIFS says.
> 
>    There is much more to this, but the bottom line is, if we want to make
> changes and improvements, we need to talk to and involve a fairly broad
> sampling of a variety of communities, or we will meet very stiff 
> resistance.
> 
>    Because of this, I think it would be best to work on clean well-defined
> proposals with solid upwards and downwards migration plans, and then to
> have workshops to get community feedback.
> 
>    Regards,
>      Herbert
> 
> =====================================================
>   Herbert J. Bernstein, Professor of Computer Science
>     Dowling College, Kramer Science Center, KSC 121
>          Idle Hour Blvd, Oakdale, NY, 11769
> 
>                   +1-631-244-3035
>                   yaya@dowling.edu
> =====================================================
> 
> On Sat, 10 Oct 2009, SIMON WESTRIP wrote:
> 
>> Dear all
> 
> Before this thread diverges into a deeper discussion of UTF-8 and unicode, 
> can I ask for clarification of a few points.
> 
> As an observer, it seems to me that this thread has been 'stumbling' because
> of some fundamental issues with respect to adopting DDLm. I now find myself
> questioning my understanding of the situation. At the risk of sounding as if
> I'm just repeating some of the recent comments from Brian and James (or indeed
> that I shouldn't have been asked to listen in at all), I've been observing
> these discussions under the assumption that:
> 
> 1) it was already accepted that CIF1.2 is going to have to be treated as a 
> distinct format, requiring new CIF1.2-enabled software. The new software 
> should be backwards-compatible - able to read/write CIF1.1 if required. 
> This is not an uncommon scenario (e.g. in the world of word-processing 
> software - the latest format will not be readable by programs written for 
> the previous formats, but programs supporting the latest format will be 
> able to convert between the old and new). This is an acceptable annoyance 
> if the new format markedly enhances the old format?
> 
> 2) The general aim is to make the transition between the old and new as 
> painless as possible, but not at the expense of realizing the benefits of 
> the new?
> 
> 3) The sooner the specs for the new are made available the better - so 
> that developers can at least keep them in mind when they work on their 
> projects - whether it be a fully fledged CIF reader/writer, or a program 
> that just accepts CIF as a data source.
> 
> Forgive me if I'm off the mark with my assumptions, or if I'm going over 
> ground you've already covered (being a newcomer, I'm afraid I may not be 
> up to speed on all this, though as someone who may well be involved in 
> implementing whatever is decided upon, even my ignorance may be of use to 
> you when it comes to considerations of how the changes may be handled by 
> interested parties).
> 
> Cheers
> 
> Simon
> 
> Simon P. Westrip
> 
> 
> 
> 
> 
> 
> ________________________________
> From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
> To: Group finalising DDLm and associated dictionaries <ddlm-group@iucr.org>
> Sent: Saturday, 10 October, 2009 18:02:35
> Subject: Re: [ddlm-group] THREAD 3: The alphabet of non-delimited strings.
> 
> Yes, most modern Fortrans cannot tell the difference between UTF-8 and 
> ascii.
> 
> =====================================================
>    Herbert J. Bernstein, Professor of Computer Science
>      Dowling College, Kramer Science Center, KSC 121
>           Idle Hour Blvd, Oakdale, NY, 11769
> 
>                    +1-631-244-3035
>                    yaya@dowling.edu
> =====================================================
> 
> On Sat, 10 Oct 2009, Brian McMahon wrote:
> 
>> Dear Herbert
>>
>> Thanks for the clarification. I've now read
>>  http://en.wikipedia.org/wiki/UTF-8
>> :-)
>>
>> It seems to me that the STAR spec still needs to be modified to
>> state explicitly that its allowed character set is Unicode as
>> expressed in UTF-8 encoding.
>>
>> I note also from the above Wikipedia entry that there is some
>> latitude in practices for handling invalid byte sequences (and to some
>> extent invalid code points). I think we should consider whether the
>> full STAR/CIF1.2 specs should formalise exception handling procedures
>> in such cases.
>>
>> Regards
>> Brian
>>
>> PS Just for my own information, does the statement
>>  > From the point of view of any
>>  > C-program intended to work with the 256-character ISO character sets,
>>  > a UTF-8 string handles just the same as an ISO string.
>> hold equally well for modern Fortran applications?
>>
>> On Sat, Oct 10, 2009 at 12:01:05PM -0400, Herbert J. Bernstein wrote:
>>> Dear Colleagues,
>>>
>>>    There is a misunderstanding about UTF-8.  From the point of view of any
>>> C-program intended to work with the 256-character ISO character sets,
>>> a UTF-8 string handles just the same as an ISO string.  The major
>>> differences are that the bottom 128 characters are the US national variant
>>> we call ASCII, and the second 128 characters that in the past would have
>>> had the accented and special characters needed to handle the western
>>> European languages in an ASCII environment have been replaced with the
>>> variable length encodings for a 31 bit character set.  That is what is
>>> nice about UTF-8 -- it is actually using what should be printable
>>> characters to do its encoding, avoiding anything that looks like
>>> binary data.
>>>
>>>    UTF-16/UCS-2 is different.  There you have a lot that looks like binary
>>> when working in an ascii world, and you need special libraries (for wide
>>> characters) to deal with them, unless you are working in java or with a
>>> browser, where that is the native encoding.
>>>
>>>    We are in the midst of a painful, worldwide transition in which we have
>>> a mixture of:
>>>
>>>    1.  The code-page based character encodings based on the multiple
>>> ISO national variants.  ASCII is just the US national variant.
>>>    2.  The UTF-16/UCS-2 version of unicode heavily adopted by many hardware
>>> vendors and used as the native encoding in many operating systems and all
>>> browsers
>>>    3.  The UTF-8 version of unicode, extensively adopted in Linux-based
>>> applications and slowly being accepted in almost all operating systems.
>>>
>>> My guess is that by 10 years from now, UTF-8 will have been fairly
>>> completely adopted except for some legacy java and browser UCS-2
>>> stuff.
>>>
>>>    My suggestion would be to try to support ascii, UCS-2 and UTF-8 for the
>>> moment and work towards joining the march towards UTF-8.
>>>
>>>    Regards,
>>>      Herbert
>>>
>>> =====================================================
>>>   Herbert J. Bernstein, Professor of Computer Science
>>>     Dowling College, Kramer Science Center, KSC 121
>>>          Idle Hour Blvd, Oakdale, NY, 11769
>>>
>>>                   +1-631-244-3035
>>>                  yaya@dowling.edu
>>> =====================================================
>>>
>>> On Sat, 10 Oct 2009, Brian McMahon wrote:
>>>
>>>> Regarding the adoption of the Unicode character set, I agree that
>>>> this would make it easier to accommodate accented and non-Latin
>>>> characters and symbols, and I see no reason to oppose implementing
>>>> it as a UTF-8 encoding, and so I vote 3.2.
>>>>
>>>> (It's not a panacea, especially for maths, where new symbols can
>>>> always be invented, and one must be able to specify a two-dimensional
>>>> layout as well as just the glyphs, so we shall still need other
>>>> approaches for various types of "rich" text.)
>>>>
>>>> However, this is a binary encoding, is it not, and so the underlying
>>>> STAR specification must be modified to accommodate this. (I'm afraid
>>>> I haven't got Nick's draft paper for the revised STAR specification
>>>> to hand, so I apologise if that's already been addressed.)
>>>>
>>>> Does it raise issues of endian-ness? If we are introducing binary
>>>> encodings, are there any reasons to restrict the character set
>>>> encoding to UTF-8 or should one also allow UTF-16 etc. (i) in STAR
>>>> and (ii) in CIF? And, ultimately, is there a prospect of extending
>>>> the STAR spec in a way that properly accommodates at least the CBF
>>>> implementation, and possibly other binary data incorporation?
>>>>
>>>> I am happy in this case that handling by "old" CIF software can
>>>> be done by adopting a protocol that allows UTF-8 Unicode characters
>>>> to be represented by ASCII encodings such as \u27. (I don't think
>>>> that we need specify a protocol at this point, just be sure that
>>>> one can be defined if needed.)
>>>>
>>>> I again draw attention to the amusing fact that with an ASCII
>>>> Unicode encoding, "O\u27Neill" is a valid data value under the
>>>> current proposals, whereas the UTF-8 equivalent would not be,
>>>> because the UTF-8 encoding of ' is just ' !
>>>>
>>>> Brian
>>>> _______________________________________________
>>>> ddlm-group mailing list
>>>> ddlm-group@iucr.org
>>>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>>>

-- 
******************************************************************
   John Westbrook, Ph.D.
   Rutgers, The State University of New Jersey
   Department of Chemistry and Chemical Biology
   610 Taylor Road
   Piscataway, NJ 08854-8087
   e-mail: jwest@rcsb.rutgers.edu
   Ph:  (732) 445-4290  Fax: (732) 445-4320
******************************************************************

International Union of Crystallography
