
[ddlm-group] A useful web page

Many of you may find the discussion of character encodings used for HTML4
helpful:

http://www.w3.org/TR/REC-html40/charset.html

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Mon, 28 Jun 2010, Herbert J. Bernstein wrote:

> Dear Colleagues,
>
>   I would suggest posting this summary to the wider community and
> soliciting their comments.  While I strongly disagree with James's
> comments in his marked-up version, I have no objection to his
> also posting his views to the wider community _after_ posting
> the unmarked-up version, and I will wait a few days after that before
> posting any rebuttal.  It will be very interesting to see whether this
> community is already ready for a transition to pure UTF-8.
>
>   I would suggest starting with the ccp4-dev, ccp4bb and pdb-l lists.
>
>   Regards,
>     Herbert
>
> On Mon, 28 Jun 2010, James Hester wrote:
>
>> The following is a summary of the encoding discussion so far.  It
>> incorporates material from the previous discussion in Oct/Nov 2009. I
>> have refrained from commenting on the validity of the various
>> arguments, but I will be posting a subsequent message with my
>> thoughts.
>>
>> There are broadly two points of view regarding the encoding to
>> be used for CIF2: allow only one encoding, which would be UTF-8; or
>> allow multiple encodings to be used. The multiple-encoding option
>> comes with several alternative approaches:
>>
>> 1) Mandate that all CIF2 processors must support UTF-8 and are not
>> required to support any other encoding.  Non-UTF-8 encoded input CIFs
>> would first need to be transcoded to UTF-8 by a separate tool.
>>
>> 2) Remain silent on the encoding issue (as for CIF1).
>>
>> 3) Specify separately a 'CIF interchange format', which would strongly
>> encourage use of UTF-8 for transmission and storage but not prohibit
>> use of different encodings among agreeing parties.
>>
>> 4) Specify UTF-8 as a 'canonical' encoding in terms of which trip
>> tests and test suites would be written.
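Approach 1 above implies a separate transcoding step in front of a
UTF-8-only processor.  A minimal Python sketch of such a step follows.
The function name transcode_to_utf8 is hypothetical, and the sketch
assumes the caller already knows the source encoding, which the summary
does not say how to obtain.

```python
def transcode_to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode bytes from a known source encoding as UTF-8.

    Assumption: source_encoding is supplied by the caller; nothing here
    detects it.  A wrong value can silently yield wrong characters.
    """
    # Depending on the codec, invalid input may raise UnicodeDecodeError.
    text = data.decode(source_encoding)
    return text.encode("utf-8")


# Example: U+00C5 (A with ring above) is one byte in Latin-1
# and two bytes in UTF-8.
latin1_input = "\u00c5".encode("latin-1")
utf8_output = transcode_to_utf8(latin1_input, "latin-1")
```

Pure ASCII input passes through unchanged, since ASCII is a subset of
both Latin-1 and UTF-8.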
>>
>> Following is a list of the arguments presented for and against the
>> above two choices.
>>
>> Restrict CIF2 to UTF-8:
>> =======================
>>
>> Arguments in favour:
>>
>> * Implementation of the standard is simpler as CIF processors are not
>>  required to negotiate encoding or support more than one encoding
>>
>> * UTF-8 is distinctive in that a non-UTF-8 encoded file is
>>  detectable with high probability due to the specific bit patterns
>>  required for UTF-8 encoding
>>
>> * A single encoding conforms to the philosophical principle observed
>>  in earlier CIF standards, that it is only necessary to define one
>>  convention for each feature in the standard
>>
>> * A key virtue of CIF is simplicity.  Supporting multiple optional
>>  encodings is not simple.
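The detectability argument in the list above can be illustrated with a
short Python sketch.  It relies only on the strict UTF-8 decoder in the
standard library; the function name looks_like_utf8 is hypothetical.
Passing the check shows only that the bytes are well-formed UTF-8, not
that UTF-8 was the encoding the author intended.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is a well-formed UTF-8 byte sequence."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# "Angstrom" written with U+00C5 and U+00F6, in two encodings.
word = "\u00c5ngstr\u00f6m"
as_utf8 = word.encode("utf-8")
as_latin1 = word.encode("latin-1")

utf8_ok = looks_like_utf8(as_utf8)      # well-formed UTF-8
latin1_ok = looks_like_utf8(as_latin1)  # 0xC5 is not followed by a continuation byte
```

The check is probabilistic rather than certain: some byte sequences
written in another encoding happen to be well-formed UTF-8 as well, and
pure ASCII files always pass.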
>>
>> Arguments against:
>>
>> * Choosing a specific encoding unduly restricts user freedom or shows
>>  a lack of respect for the way others do science
>>
>> * We are premature in settling on Unicode and/or UTF-8; by doing so we
>>  risk alienating important user groups and/or backing the wrong horse
>>
>> Allow multiple CIF2 encodings always including UTF-8:
>> =====================================================
>>
>> Arguments in favour:
>>
>> * CIF has always been a 'text' standard, with no encoding mandated.
>>  This has worked out OK so far
>>
>> * Provided the sending and receiving systems understand that a file
>>  is a 'text' file, encodings are converted automatically to produce
>>  a correct file after transmission
>>
>> * If a user anticipates the need to specify encoding (because none is
>>  mandated and the documents remind them of this need) then they are
>>  more likely to include information about the encoding they are
>>  using.  If no encoding information is thought necessary, then a
>>  non-UTF-8 encoded file mistakenly sent as a UTF-8 file would be
>>  difficult to decode.
>>
>> * Binary formats are bad
>>
>> * Labelling is normal practice, and so there is nothing contentious
>>  about labelling the encoding used in a file
>>
>> * Saving CIF files in the native text format allows system text tools
>>  (e.g. for searching) to be useful
>>
>> * Users are going to produce CIFs in multiple encodings anyway, so we
>>  might as well try to manage this use by specifying standards
>>
>> Arguments against multiple encodings:
>>
>> * There is no way to reliably detect which encoding has been used in a
>>  file, and it is not reasonable to assume that a human editor has
>>  gotten an embedded encoding declaration correct.  All files would
>>  therefore need to be read through by a human after transmission to
>>  check for incorrect letters, accents etc.
>>
>> * Facilitating use of multiple encodings encourages them to be used,
>>  which increases the scale of the multiple encoding problem
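The first argument in the list above, that the encoding in use cannot be
reliably detected, can be illustrated with a small Python sketch.  The
two codecs chosen here (Latin-1 and Windows-1251) are only examples; the
same byte decodes without error under both, but to different characters.

```python
# One byte, taken from a hypothetical file with no encoding label.
data = b"\xe9"

# Both decodes succeed, so neither can be ruled out mechanically.
as_latin1 = data.decode("latin-1")  # U+00E9, Latin small e with acute
as_cp1251 = data.decode("cp1251")   # U+0439, Cyrillic small short i

# The resulting texts differ, so a wrong guess goes unnoticed by software.
ambiguous = as_latin1 != as_cp1251
```

Nothing in the bytes themselves says which reading is correct, which is
why a wrong or missing encoding label can only be caught by a reader who
knows what the text should say.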
>>
>>
>> --
>> T +61 (02) 9717 9907
>> F +61 (02) 9717 3145
>> M +61 (04) 0249 4148
>> _______________________________________________
>> ddlm-group mailing list
>> ddlm-group@iucr.org
>> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>>
