Re: [Cif2-encoding] The discussion so far. .

 

On Tuesday, August 03, 2010 9:58 AM, John Bollinger wrote:

 

>7) In response to question (2), James offered a scheme (A) [...] This scheme will be reproduced in a separate e-mail, as it is not currently available from the DDLm-list web archive.

 

Scheme A:

1.  For the purposes of storage and transfer, CIF files must be treated by all file handling protocols as streams of bytes.

2.  Any encoding specifying the mapping from bytes to Unicode code points may be used, provided that:

    1.  The encoding is specified in an international standard.

    2.  The encoding is distinguishable from all other standard encodings at the binary level.  This requirement may be satisfied by an initial 'signature', provided that this signature is specified in the relevant international standard as being mandatory.

    3.  The encoding is supported across the range of platforms likely to use CIF2.  "Support" includes:

        1.  Availability of text input and output functions using this encoding across a range of programming languages.

        2.  Availability of applications on the platform to manipulate text in this encoding, most importantly text editors but also tools such as search.

    4.  The encoding is coincident with US-ASCII for code points <= 127.  This requirement may be dropped in the future if CIF2 becomes the dominant form of CIF file.
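
As an illustration only (not part of the scheme), the "initial signature" and US-ASCII coincidence conditions above could be checked roughly as follows. This is a minimal Python sketch: the function names signature_encoding and coincides_with_ascii are hypothetical, the signatures tested are just the Unicode byte-order marks, and the code says nothing about whether a given standard makes its signature mandatory.

```python
import codecs

# Byte-order marks, checked so that the UTF-32 LE mark is tried before
# the UTF-16 LE mark (the former begins with the same two bytes).
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def signature_encoding(data: bytes):
    """Return the encoding named by an initial signature, or None."""
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name
    return None


def coincides_with_ascii(codec_name: str) -> bool:
    """True if code points 0..127 encode to the same bytes as US-ASCII."""
    text = "".join(chr(cp) for cp in range(128))
    try:
        return text.encode(codec_name) == bytes(range(128))
    except UnicodeEncodeError:
        return False
```

For example, coincides_with_ascii("utf-8") holds, while coincides_with_ascii("utf-16") and coincides_with_ascii("cp037") do not.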

 

 

>8) In response to question (1), John Bollinger offered a scheme (B) [...] This scheme will be reproduced in a separate e-mail, as it is not currently available from the DDLm-list web archive.

 

Note: This is my amended form of scheme B, not the original.

Note 2: James's alternative amended version of scheme B (dubbed "Scheme C") is not presented because, for reasons explained elsewhere, it does not meet scheme B's objectives.

 

Scheme B':

1.  This scheme provides for reliable archiving and exchange of CIF text.  Although it depends in some cases on metadata embedded in the CIF text, presence of such metadata is not a well-formedness constraint on the text itself.

2.  For the purposes of storage and transfer, CIF files must be treated by all file handling protocols as streams of bytes.

3.  Any text encoding may be used.  If the encoding does not comply with either (5a) or (5b) below, then its name must be given via an encoding tag following the magic code, on the same line.  Otherwise, an encoding tag is optional, but if present then it must correctly name the encoding.

4.  Archiving or exchange of CIF text complies with this scheme if the CIF text contains a correct content hash:

a.  The hash value is computed by applying the MD5 algorithm to the Unicode code point values of the CIF text, in the order they appear, excluding all code points of CIF comments and all other CIF whitespace appearing outside data values or separating List or Table elements.

b.  The code point stream is converted to a byte stream for input to the hash function by interpreting each code point as a 24-bit integer, appearing on the byte stream in order from most-significant to least-significant byte.

c.  The hash value is expressed in the CIF itself as a structured comment of the form:

#\#content_hash_md5:XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX

where the Xs represent the hexadecimal digits of the computed hash value.

d.  The hash comment may appear anywhere in the CIF that a comment may appear, but conventionally it is at the end of the CIF.

5.  Archiving or exchange of CIF text that does not contain a content hash complies with this scheme if

a.  the text encoding is specified in an international standard and is distinguishable from all other encodings at the binary level, or

b.  the text encoding is coincident with US-ASCII for all code points appearing in the CIF.

For the purposes of (5a), distinguishing encodings may rely on the characteristics of CIF, such as the allowed character set and the required CIF version comment, and also on the actual CIF text (such as for recognition of UTF-8 by its encoding of non-ASCII characters).
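
To make steps (4b) and (4c) concrete, here is a minimal Python sketch; it is an illustration under stated assumptions, not a reference implementation. It assumes the caller has already applied the exclusions of (4a), that is, removed comments and insignificant whitespace, which requires a CIF parser and is not attempted here. The function names are hypothetical, and the use of lowercase hexadecimal digits is an assumption, since the scheme does not specify case.

```python
import hashlib


def content_hash_md5(significant_text: str) -> str:
    """MD5 over the code points of already-filtered CIF text (step 4b).

    Each code point is written to the byte stream as a 24-bit integer,
    most-significant byte first.
    """
    md5 = hashlib.md5()
    for ch in significant_text:
        md5.update(ord(ch).to_bytes(3, "big"))
    # Lowercase hex digits: an assumption, the scheme does not fix the case.
    return md5.hexdigest()


def hash_comment(significant_text: str) -> str:
    """Format the structured comment of step 4c."""
    return "#\\#content_hash_md5:" + content_hash_md5(significant_text)
```

Because the hash is taken over code points rather than encoded bytes, the same CIF text yields the same value whether it is stored as UTF-8, UTF-16, or any other encoding that can represent it.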

 

 

Regards,

 

John

 

--

John C. Bollinger, Ph.D.

Department of Structural Biology

St. Jude Children's Research Hospital

 

 


_______________________________________________
cif2-encoding mailing list
cif2-encoding@iucr.org
http://scripts.iucr.org/mailman/listinfo/cif2-encoding
