Discussion List Archives

Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .

A few notes on IUCr workflow and the impact of the encoding issue:

Submission/author services (i.e. involving CIF upload):

I envisage that, where the encoding of a submitted CIF cannot be determined (e.g. from a BOM),
it may be necessary to prompt the user to confirm the encoding (perhaps using an interactive
tool that allows the uploaded CIF to be viewed in a variety of encodings). This is probably not
unreasonable, as most of the IUCr's author services are interactive.
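
To make the submission step above concrete, here is a minimal Python sketch, not a description of any existing IUCr tool. It checks for a BOM first and, failing that, decodes the upload under several candidate encodings so that a user could pick the one that displays correctly. The function names and the candidate list are assumptions made for illustration.

```python
import codecs

# BOMs are checked longest-first so that UTF-32 LE (FF FE 00 00)
# is not mistaken for UTF-16 LE (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def preview_candidates(data: bytes,
                       candidates=("utf-8", "latin-1", "cp1252", "gb2312")):
    """Decode the upload under each candidate encoding (hypothetical list).

    Encodings that cannot decode the bytes are reported as None, so an
    interactive tool could hide them from the user.
    """
    previews = {}
    for name in candidates:
        try:
            previews[name] = data.decode(name)
        except UnicodeDecodeError:
            previews[name] = None
    return previews
```

An interactive service would presumably show only the non-None previews and record the encoding the user confirms.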

Processing:

Changes to subsequent processing would probably not involve much more than converting
the CIF to e.g. UTF-8, or whatever encoding is required by the processing software.
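
As a rough illustration of that conversion step, the following Python sketch re-encodes a CIF to UTF-8 once the source encoding has been confirmed. The function name is hypothetical, and the sketch assumes the whole file fits in memory.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode CIF bytes from a confirmed source encoding to UTF-8.

    Decoding is strict, so bytes that are invalid in the source encoding
    raise UnicodeDecodeError instead of being silently replaced.
    """
    text = data.decode(source_encoding, errors="strict")
    return text.encode("utf-8")
```

Note that strict decoding only catches some wrong guesses: a single-byte encoding such as Latin-1 accepts any byte sequence, which is one reason the user confirmation step still matters.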

Archive:

Though changes to the CIF archive would be negligible, the way in which CIFs are retrieved from the archive may require
some changes (e.g. offering the recipient a choice of encoding if permitted by the spec), though the content that is made publicly available is unlikely to contain non-ASCII characters.

So work would be required, but not nearly as much as involved in working with the new dictionaries and developing a system to
handle both CIF1 and CIF2 in the transition period.

As far as the user is concerned, there may be the slight inconvenience of having to confirm the encoding of their CIF every time it is
uploaded to an IUCr service.

This perceived impact would probably hold regardless of what is decided upon: even if CIF2 were to be UTF8 only, in recognition of
the variety of encodings available and user practice with standard (non-CIF) text editors, I expect the IUCr would still attempt to
accommodate non-UTF8 CIFs.

So you may ask why I've bothered to support (or otherwise) some of the proposals discussed in this thread.
Basically, given that CIF is a 'text' format, the specification should address the issues arising from that format, so
I do not agree that text encoding should play no part in, or be treated separately from, the standard.
Equally, the standard should not mandate anything that markedly affects the treatment of CIF as 'text' (i.e. complete reliance on
CIF-only software).

I was in favour of UTF8-only as in the draft spec, but after Herbert's description of imgCIF in particular,
I now find myself thinking we ought to be more flexible (at the very least by leaving the door open to other encodings).
To this end, I think the specification should allow a 'declaration' of the encoding, however unreliable given the current practice
of using any old text editor. Furthermore, I do not think it unreasonable for the specification to define a default encoding.
After all, CIF is a data-exchange format and surely that requires strict definitions if it is to work as such.

Overall, I suspect there will be problems relating to encoding, but I am of the view that with good software support
and a specification that addresses the issue, they will be minimal.

Cheers

Simon


From: Herbert J. Bernstein <yaya@bernstein-plus-sons.com>
To: Group for discussing encoding and content validation schemes for CIF2 <cif2-encoding@iucr.org>
Sent: Monday, 13 September, 2010 15:47:46
Subject: Re: [Cif2-encoding] Splitting of imgCIF and other sub-topics. .. .

Dear James,

>Somewhat overwrought, don't you think?  Because we can't agree on a
>scheme for additional encodings we should chuck CIF2 syntax, DDLm and
>dREL overboard??  When the IUCr will function perfectly well with UTF8
>only? If you would like to start coding, please structure your code so
>that the decoding step may take other encodings beyond UTF8.  The rest
>is in the draft standard (you will be pleased to see the lack of
>ambiguity in that standard, it will make your task easier).

No, actually, I am being very, very restrained in my public comments.
Right now the CIF2 effort really seems to be headed nowhere.  I
would like it to be used.  If the IUCr will "function perfectly
well with the UTF8 only" version, then let's get the IUCr workflows
converted and get this thing in use.

Please point to the current best URLs, let's see if we have
agreement on those specific documents by putting them to a
formal COMCIFS vote and then let's ask the IUCr journals
operation to give it a try.

>Why?  It is simply a CIF2 syntax parser with checksum of all the
>contents thus parsed.  It is not worth spending the time on something
>so obviously possible until we agree that we want such a system.

The concept may be obvious, the details of implementation most
certainly are not, and without a reference implementation, we
will end up with multiple incompatible interpretations, e.g.
do lines get trailing blanks stripped?  Does embedded whitespace
in bracketed constructs get compressed?  God and the devil are
in the details.
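
The trailing-blank question can be illustrated with a small Python sketch. The normalisation and the choice of SHA-256 here are assumptions for illustration only, not the checksum scheme discussed on this list: two implementations that differ only in whether they strip trailing blanks produce different digests for the same file.

```python
import hashlib


def digest(text: str, strip_trailing_blanks: bool) -> str:
    """Hash the text, optionally stripping trailing spaces/tabs per line.

    Hypothetical normalisation, used only to show that the choice matters.
    """
    lines = text.split("\n")
    if strip_trailing_blanks:
        lines = [line.rstrip(" \t") for line in lines]
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()


# The same content, with trailing blanks an editor might add or remove.
sample = "data_example  \n_cell_length_a 5.43\t\n"

strict = digest(sample, strip_trailing_blanks=False)
lenient = digest(sample, strip_trailing_blanks=True)
print(strict == lenient)  # the two interpretations disagree
```

Unless the specification fixes such choices, a checksum written by one program would fail verification in another.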

>What we do is make standards.  Our syntax standards are aimed
>primarily at programmers.  What these programmers want is assurance
>that their programs will produce files that can read and write files
>written or read by other compliant programs.  Suggestions are very
>polite, but don't provide any certainty as to what other programmers
>will do.  Will they accept the suggestion?  It is, after all, only a
>suggestion.  By *mandating* we do not mean "do this or we'll send the
>black helicopters around".  We simply mean that this is what compliant
>files always look like.

Sorry, but we must deal with very different programmers.  The ones
I deal with rebel at anything stronger than a polite suggestion
grounded in common interest.  You mistake the silence on the DDLm
proposal as agreement.  I would suggest asking some of the major
developers if they have even read it yet.  The burden is on us
to justify _any_ effort we will require of them.  We need to
have something finished, complete and well supported with
necessary software before most of them will even think about
learning what we have been up to.

>By being so vague about how to deal with encodings, you are simply
>building in the potential for ambiguity and misunderstanding, thereby
>creating the inconvenient nuisance you intend to remove.

No, I am simply respecting the difference between a text representation
and a binary representation.  Multiple encodings are a fact of
life when working with text.

>>  this brings us back to being unreasonably rigid and fussy and certain to be
>>  ignored.  I cannot figure out what "reserve the first line of a CIF2 file"
>>  means in practice, and "non UTF8 encodings ... be considered by COMCIFS" has
>>  no practical meaning for what a poor user or software developer in, say, a
>>  code-page or UCS2 environment is supposed to do now.
>
>Please enlarge on this important use case.  Are you suggesting that
>there are systems out there that enforce UCS2 for all text files or
>use a code-page that imposes an encoding on all text files in the
>system?  Your imgCIF use case (embedded in an XML file) was very
>helpful in resolving one issue, so perhaps you could describe these
>code-page systems and how they manage file encodings.


Here is a simple example use case:  Assume we have specified UTF-8 with no
BOM.  Assume a user on a system with an editor that writes a BOM on
output of all UTF-8 files edits a CIF2 file.  What is he supposed to
do now (not in theory, but with the tools he has available on
normal OS's plus what we are providing), especially because he
probably has no indication of the change?
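
For this first use case, one option on the reading side is simply to tolerate the BOM. The Python sketch below is an assumption about how a reader might behave, not something the draft specification says. It uses the standard `utf-8-sig` codec, which decodes UTF-8 with or without a leading BOM; the function names are hypothetical.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Report whether the bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def read_cif_text(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if the editor added one."""
    return data.decode("utf-8-sig")


def write_cif_bytes(text: str) -> bytes:
    """Encode back to UTF-8 without a BOM."""
    return text.encode("utf-8")
```

A reader built this way accepts both forms, while a writer always emits the BOM-free form, so the user's editor choice does not have to change.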

Here is another simple example use case:  We have a user working with
an EUC-CN code page based editor with a UTF-8 based CIF.  What should
he do to edit that CIF and return it to the IUCr?
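
One conceivable answer to this second use case is a transcoding round trip, sketched below in Python. This is an illustration under stated assumptions, not a recommended workflow: the function names are hypothetical, and Python's `gb2312` codec is used as a stand-in for the EUC-CN code page.

```python
def utf8_to_codepage(data: bytes, codepage: str = "gb2312") -> bytes:
    """Convert a UTF-8 CIF to the editor's code page before editing.

    Raises UnicodeEncodeError if the CIF contains a character that the
    code page cannot represent.
    """
    return data.decode("utf-8").encode(codepage)


def codepage_to_utf8(data: bytes, codepage: str = "gb2312") -> bytes:
    """Convert the edited file back to UTF-8 before returning it."""
    return data.decode(codepage).encode("utf-8")
```

The round trip only works if every character in the CIF exists in the code page, and it relies on the user knowing which code page the editor used, which is exactly the information that is usually missing.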

We are dealing with both software developers and end-users.  We need
to consider both.

  Regards,
    Herbert



At 11:47 PM +1000 9/13/10, James Hester wrote:
>See comments below.
>
>On Mon, Sep 13, 2010 at 10:52 PM, Herbert J. Bernstein
><yaya@bernstein-plus-sons.com> wrote:
>>  I would suggest actually writing the utility you have in mind.
>
>Why?  It is simply a CIF2 syntax parser with checksum of all the
>contents thus parsed.  It is not worth spending the time on something
>so obviously possible until we agree that we want such a system.
>
>>  In practice, inasmuch as a CIF file looks like a text file, people
>>  are very likely to just pick one up in any convenient text editor
>>  change what they want to change and write an unidentified pseudo-cif
>>  file back out.  Anything else needs to be provided to them in
>>  a complete, platform portable, well-documented package they can
>>  use easily in place of an editor that they use all the time for
>>  everything else.
>
>Note that we are not suggesting replacing editors, far from it, if we
>could do that we wouldn't have a problem in the first place.
>
>>  Please be practical -- CIF is a working tool, embedded in the IUCr
>>  journal process flows, in many crystallographic applications, in
>>  the PDB workflows, in Dectris detector software, etc., etc.
>>
>>  The more disruptive you make the transition from CIF1 to CIF2, the
>>  more software and documentation you need to create to allow people
>>  to make the transition actually happen.  We are essentially in
>>  the same place we were in Osaka.  How do we break out of this
>>  loop and move forward?
>
>We are a lot further forward than Osaka.  We have a *complete* syntax
>specification on the table, which has received zero objections outside
>of this group.  No further DDLm problems have been identified. The
>only issue left unresolved is that not enough encodings are allowed,
>although the one encoding that is allowed is actually sufficient for
>all of the useful work that the IUCr expect to do.  We could take what
>we have to Madrid, with a single caveat that a system for dealing with
>non UTF8 encodings is under consideration, and (if the response on the
>mailing lists is any indication) everybody would be happy outside of
>this list.  As for demonstrations, Nick and Syd have been
>demonstrating this system for over a decade (with cosmetic differences
>in syntax).
>
>>  We need a realistic plan to get our job done and have a complete
>>  specification with the necessary supporting software for CIF2 in place
>>  and ready to demonstrate for Madrid, or I would suggest we
>>  accept the failure of this effort, and start over.
>
>Somewhat overwrought, don't you think?  Because we can't agree on a
>scheme for additional encodings we should chuck CIF2 syntax, DDLm and
>dREL overboard??  When the IUCr will function perfectly well with UTF8
>only? If you would like to start coding, please structure your code so
>that the decoding step may take other encodings beyond UTF8.  The rest
>is in the draft standard (you will be pleased to see the lack of
>ambiguity in that standard, it will make your task easier).
>
>>    -- Herbert
>>
>>
>>
>>  At 10:32 PM +1000 9/13/10, James Hester wrote:
>>>The original concept was to edit the non UTF8 files in the text editor
>>>of choice, then run a simple checksumming application (that
>>>understands CIF2 syntax) to update the checksum.  This application
>>>would also pick out sections of text that would be displayed
>>>incorrectly in the wrong encoding, and ask the user to confirm that
>>>the text was displayed correctly.  Such an application could be made
>>>freely available by the IUCr.
>>>
>>>On Mon, Sep 13, 2010 at 8:22 PM, SIMON WESTRIP
>>><simonwestrip@btinternet.com> wrote:
>>>>  I questioned:
>>>>
>>>>  "For example, if mandatory, does that mean it becomes
>>>>impossible to create a
>>>>  non-UTF8 CIF without using
>>>>  CIF2-aware software?"
>>>>
>>>>  In some respects this might not be a bad idea - i.e. restricting
>>>>the use of
>>>>  non-UTF8 to CIF2-aware systems...
>>>>
>>>>  Simon (thinking aloud)
>>>>
>>>>  ________________________________
>>>>  From: SIMON WESTRIP <simonwestrip@btinternet.com>
>>>>  To: Group for discussing encoding and content validation schemes for CIF2
>>>>  <cif2-encoding@iucr.org>
>>>>  Sent: Monday, 13 September, 2010 11:05:12
>>>>  Subject: Re: [Cif2-encoding] Splitting of imgCIF and other
>>>>sub-topics. .. .
>>>>
>>>>  Yes - I believe that such a declaration should be mandatory for
>>>>all non-UTF8
>>>>  CIF2 files,
>>>>  and agree that a supporting checksum mechanism would be very useful to
>>>>  CIF2-aware
>>>>  programs. Until I've revisited the checksum scheme, I can not
>>>>say that the
>>>>  checksum should be mandatory too.
>>>>  For example, if mandatory, does that mean it becomes impossible
>>>>to create a
>>>>  non-UTF8 CIF without using
>>>>  CIF2-aware software?
>>>>
>>>>  I need to review the discussions on checksums and indeed the
>>>>various forms
>>>>  that such a declaration might take,
>>>>  but I do believe in the principle that it should be mandatory for all
>>>>  'stand-alone' non-UTF8 CIF2 files.
>>>>  If a CIF is packaged in a container, then it will be the job of non-CIF
>>>>  software to retrieve it from the container
>>>>  and deliver it in its original form. So a non-UTF8 CIF packaged in a
>>>>  non-UTF8 container (or even a UTF8 container)
>>>>  should still carry its non-UTF8 declaration.
>>>>
>>>>  Cheers
>>>>
>>>>  Simon
>>>>
>>>>  ________________________________
>>>>  From: James Hester <jamesrhester@gmail.com>
>>>>  To: Group for discussing encoding and content validation schemes for CIF2
>>>>  <cif2-encoding@iucr.org>
>>>>  Sent: Monday, 13 September, 2010 6:24:42
>>>>  Subject: Re: [Cif2-encoding] Splitting of imgCIF and other
>>>>sub-topics. .. .
>>>>
>>>>  Hi Simon: the issue with such an encoding declaration is that it is
>>>>  not supported by generic text tools, and so would not be automatically
>>>>  inserted, updated or respected when creating, editing (ie open in one
>>>>  encoding, save in another) or transcoding a CIF2 file.  This means it
>>>>  has no status beyond a hint that could cause as many problems as it
>>>>  solves. Such a declaration becomes more robust if accompanied by the
>>>>  checksum that John B suggested.  The checksum gives some guarantee
>>>>  that the encoding has been checked by a CIF-aware program.
>>>>
>>>>  If you are proposing that such a declaration and checksum be mandatory
>>>>  for all non-UTF8 CIF2 files (not only during transfer), I agree with
>>>>  you that this would be acceptable.
>>>>
>>>>
>>>_______________________________________________
>>>cif2-encoding mailing list
>>>cif2-encoding@iucr.org
>>>http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>>
>>
>>  --
>>  =====================================================
>>  Herbert J. Bernstein, Professor of Computer Science
>>    Dowling College, Kramer Science Center, KSC 121
>>          Idle Hour Blvd, Oakdale, NY, 11769
>>
>>                  +1-631-244-3035
>>                  yaya@dowling.edu
>>  =====================================================
>>  _______________________________________________
>>  cif2-encoding mailing list
>>  cif2-encoding@iucr.org
>>  http://scripts.iucr.org/mailman/listinfo/cif2-encoding
>>
>
>
>
>--
>T +61 (02) 9717 9907
>F +61 (02) 9717 3145
>M +61 (04) 0249 4148
>_______________________________________________
>cif2-encoding mailing list
>cif2-encoding@iucr.org
>http://scripts.iucr.org/mailman/listinfo/cif2-encoding


--
=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
        Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================
_______________________________________________
cif2-encoding mailing list
cif2-encoding@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif2-encoding
