Discussion List Archives


Re: [ddlm-group] options/text vs binary/end-of-line

Brian asked about sub and superscripts.  There is a limited set
of subscript and superscript digits in Unicode in addition to the
usual accents.  This capability is not, however, a replacement for
markup language.
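
As a rough illustration, here is a small Python sketch of what that
limited set covers.  The helper names (to_subscript, to_superscript)
are made up for this example and are not part of any CIF tool.  It
handles non-negative integers only, and it also shows one reason the
characters are no substitute for markup: compatibility normalization
(NFKC) silently flattens them back to ordinary digits.

```python
import unicodedata

# Superscript 1, 2 and 3 sit in the Latin-1 Supplement block; the other
# digits are in the Superscripts and Subscripts block (U+2070..U+209F).
SUPERSCRIPT_DIGITS = "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079"
SUBSCRIPT_DIGITS = "\u2080\u2081\u2082\u2083\u2084\u2085\u2086\u2087\u2088\u2089"


def _map_digits(number: int, table: str) -> str:
    """Replace each decimal digit of `number` with its entry in `table`."""
    if number < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    return "".join(table[int(digit)] for digit in str(number))


def to_subscript(number: int) -> str:
    return _map_digits(number, SUBSCRIPT_DIGITS)


def to_superscript(number: int) -> str:
    return _map_digits(number, SUPERSCRIPT_DIGITS)


# "H" + SUBSCRIPT TWO + "O"
water = "H" + to_subscript(2) + "O"

# NFKC normalization maps the subscript back to a plain "2",
# so the sub/superscript information does not survive it.
flattened = unicodedata.normalize("NFKC", water)
```

Anything beyond digits and a handful of other characters (an arbitrary
subscripted word, say) cannot be built this way at all.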

=====================================================
  Herbert J. Bernstein, Professor of Computer Science
    Dowling College, Kramer Science Center, KSC 121
         Idle Hour Blvd, Oakdale, NY, 11769

                  +1-631-244-3035
                  yaya@dowling.edu
=====================================================

On Wed, 30 Jun 2010, Brian McMahon wrote:

> John B's analysis (here and elsewhere) is very pertinent and provides
> a very helpful conceptual framework.
>
> There are a few other considerations regarding encodings that come to
> my mind.
>
> First, any assertion of an encoding is at best a hint. Properly, a
> trusted archive should validate any such assertion and correct it if
> necessary. Validation may not always be possible through entirely
> automatic procedures. Copying and pasting text from one system into
> another may transfer bytes without transcoding the intended characters
> (see use case suggested in the next paragraph). Transmission checksums
> will help, but will not catch everything. There will arise different
> levels of trust associated with experience of particular programs,
> operating systems or perhaps authors.
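
A minimal Python sketch of the kind of check meant here, assuming the
archive simply tries a strict decode under the asserted encoding.  The
function name decodes_cleanly and the sample author name are
hypothetical.  The check is necessary but not sufficient: a single-byte
encoding such as ISO-8859-1 accepts every byte sequence, which is one
reason validation cannot always be fully automatic.

```python
def decodes_cleanly(data: bytes, declared: str) -> bool:
    """Return True if `data` is a valid byte sequence in encoding `declared`.

    This only rejects impossible assertions; it cannot prove that the
    declared encoding is the one the author actually used.
    """
    try:
        data.decode(declared, errors="strict")
    except (UnicodeDecodeError, LookupError):
        # LookupError covers encoding names Python does not know.
        return False
    return True


# A name containing U+0142, written out as ISO-8859-2 bytes.
sample = "Micha\u0142".encode("iso-8859-2")

claims_utf8 = decodes_cleanly(sample, "utf-8")         # rejected
claims_latin2 = decodes_cleanly(sample, "iso-8859-2")  # accepted, correct
claims_latin1 = decodes_cleanly(sample, "iso-8859-1")  # accepted, but wrong

# The wrong-but-accepted reading turns the final letter into U+00B3.
misread = sample.decode("iso-8859-1")
```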
>
> Settling on a single mandated encoding for CIF doesn't solve all the
> problems of CIF author/editing applications, assuming some
> interaction with other text processing software is allowed or
> expected. An author - let us say Polish, since we do currently publish
> names with Polish accents - uploads a CIF submission to Chester via web
> and subsequently emails to me a Word document saying "Please
> replace my abstract by this revised version." I save the email
> from my Solaris account to a Samba filesystem, open it in Word, copy
> the text to the Windows clipboard, and paste it via my Windows X server
> to the publCIF clipboard which I am running on a Linux machine via
> an ssh session from my Solaris host. Does publCIF correctly write the
> Unicode code point U+0142 ("Polish letter l with stroke") into
> the CIF? If not, what has gone wrong and where, and (how) can it be
> fixed?
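
One way to reason about "what has gone wrong and where" is to model
each transfer as a pair: the encoding the sender wrote and the encoding
the receiver assumed.  The Python sketch below does that.  The hop list
is purely hypothetical and is not a description of the actual
Solaris/Samba/Word/X/publCIF chain; trace_hops is a name invented for
this example.

```python
def trace_hops(text, hops):
    """Pass `text` through a list of (written_as, read_as) encoding pairs.

    Returns (index, received_text) for the first hop that alters the
    text, or (None, text) if every hop preserves it.
    """
    current = text
    for index, (written_as, read_as) in enumerate(hops):
        raw = current.encode(written_as, errors="replace")
        received = raw.decode(read_as, errors="replace")
        if received != current:
            return index, received
        current = received
    return None, current


L_WITH_STROKE = "\u0142"

# U+0142 is two bytes in UTF-8 and one byte in ISO-8859-2.
as_utf8 = L_WITH_STROKE.encode("utf-8")
as_latin2 = L_WITH_STROKE.encode("iso-8859-2")

# Hypothetical chain: the second hop reads UTF-8 bytes as Windows-1252.
hops = [("utf-8", "utf-8"), ("utf-8", "cp1252"), ("utf-8", "utf-8")]
bad_hop, garbled = trace_hops(L_WITH_STROKE, hops)
```

In this made-up chain the damage is done at hop index 1, where the
single letter arrives as two unrelated characters.  Later hops then
carry the garbled text along faithfully, which is why the fault is hard
to locate from the final output alone.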
>
> (I have forgotten.) Are we proposing to allow also the "old" CIF1
> character encoding scheme (\/l <=> Unicode U+0142) within CIF2, and
> have we a procedure for distinguishing when it has been applied?
>
> The elephant in the room: the discussion has so far addressed
> alphabetic character encodings. To retain at least the functionality
> of CIF1, we will need to specify schemes for handling other constructs
> such as sub/superscripts and possibly some other mathematical
> constructs - though most of the CIF1 maths is at the level of
> character representations which may be catered for by Unicode code
> points. Will procedures for identifying which of these schemes apply
> (possibly at the level of individual CIF data items) be orthogonal
> to specification of character encodings in the CIF, will the two
> interfere destructively, or can they somehow be handled in much the
> same way in terms of attaching "metadata about encoding" of some sort
> to the relevant objects?
>
> Regards
> Brian
>
>
>> [...] I maintain that a stream of bytes
>> divorced from any explicit or implicit metadata about its encoding
>> is binary, not text.  This complication of electronic text handling
>> is not new, but it has assumed much more prominence as
>> internationalization issues have gained importance.
>>
>> Implicit encoding metadata commonly takes the form of the text in
>> question being encoded according to the default scheme for the
>> system or tool.  It could, in one sense, also take the form of
>> a requirement in the format specification, but that is meaningful
>> only for tools specific to the format, which rather moots the
>> text vs. binary question.  It could also take the form of local
>> policy, such as "all CIFs in this archive are encoded in CESU-8,"
>> which would be useful to tools configured for the relevant
>> environment (e.g. a web server).
>>
>> Explicit metadata can be carried by the file itself or conveyed
>> out-of-band.  XML's encoding attribute is an example of the former,
>> and HTTP's content-type header is an example of the latter.  These
>> are useful only to certain tools, specific to a particular format,
>> environment, or exchange mechanism.
>>
>> One of the upshots of all this is that transcoding must in general be
>> a routine aspect of text file exchange, as that can make explicit
>> encoding metadata implicit.  As Simon has shown, transcoding is not
>> automatic in many contexts, so it may require extra work on the
>> receiving end.  To the extent that there is a current assumption and
>> practice of CIFs being stored and forwarded byte-for-byte as received
>> (i.e. without transcoding or explicit metadata), CIF is already being
>> treated as a binary format.  In a sense, perhaps, it is being treated
>> simultaneously as several distinct binary formats.
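
The point about bytes stored without encoding metadata can be sketched
in a few lines of Python.  The same received byte yields different text
under different assumptions, and transcoding on receipt needs the
source encoding to be known.  The transcode helper and the choice of
UTF-8 as the local convention are assumptions made for this example.

```python
def transcode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Re-encode `data` from `source` to `target`.

    Raises UnicodeDecodeError if `data` is not valid in `source`, and
    UnicodeEncodeError if `target` cannot represent the text.
    """
    return data.decode(source).encode(target)


# One byte, stored and forwarded as received, with no metadata.
received = b"\xb3"

# Without metadata, each reading is equally defensible.
as_latin2_text = received.decode("iso-8859-2")  # U+0142, l with stroke
as_latin1_text = received.decode("iso-8859-1")  # U+00B3, superscript three

# With the source encoding communicated, the receiver can transcode to
# its own convention, after which the encoding is implicit locally.
stored = transcode(received, "iso-8859-2")
```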
>>
>> ***
>>
>>> By extending the character set beyond ASCII, we have to accept that
>>> not all general-purpose text tools are going to be applicable as CIF
>>> editors/viewers.
>>
>> That's a valid perspective, but I would sharpen it: as part of
>> extending the character set beyond ASCII, we abandon the premise that
>> CIF is a text format, though under some circumstances it may still be
>> possible to manipulate CIFs with tools designed for text.
>>
>> Alternatively, I have been advocating essentially this: by extending
>> the character set beyond ASCII, we magnify the importance of
>> exchanging and storing CIFs according to text conventions, including
>> correctly communicating encodings as necessary and transcoding as
>> appropriate.
>>
>> I hope the latter position adequately encompasses Herb's view as well.
>> Each position carries additional baggage, which I have omitted to
>> focus on the essential ideas.  If wider comment is sought, then I
>> submit that these alternatives provide a suitable basis for soliciting
>> such.
>>
>> Whichever position prevails, I should like to see something
>> substantially similar to the corresponding position statement above
>> be inserted into the spec.
>>
>> Regards,
>> John
> _______________________________________________
> ddlm-group mailing list
> ddlm-group@iucr.org
> http://scripts.iucr.org/mailman/listinfo/ddlm-group
>
