RE: Advice on COMCIFS policy regarding compatibility of CIF syntax with other domains
- To: "Discussion list of the IUCr Committee for the Maintenance of the CIF Standard (COMCIFS)" <comcifs@iucr.org>
- From: "Herbert J. Bernstein" <yaya@bernstein-plus-sons.com>
- Date: Thu, 10 Mar 2011 10:49:42 -0500 (EST)
Dear Colleagues,

Unfortunately, John Bollinger, in his desire to help clarify the current CIF2 proposal with respect to encoding, has overstated the rules in his summary. What the change document currently says is:

"CIF2 files are standard variable length plain text files, which for compatibility with older processing systems will have a maximum line length of 2048 characters. As discussed above and below, however, there are some restrictions on the character set for token delimiters, separators and data names.

For compatibility with CIF1 behaviour, there is no formal restriction on the encoding of CIF2 files, providing they contain only code points from the ASCII range. If a CIF2 file contains characters equivalent to Unicode code points greater than U+007F (127 decimal), then the particular encoding used must either be UTF8 or algorithmically identifiable from the CIF2 file itself. Acceptable identification algorithms will be published as necessary as annexes to this standard (see description of magic code and encoding disambiguation in Change 1). Annexes notwithstanding, (i) a CIF2 file containing characters outside the ASCII range with no BOM and no disambiguation signature will be a UTF8 file, and (ii) a CIF2 file containing characters outside the ASCII range with a valid UTF8 or UTF16 BOM and no disambiguation signature, will be a Unicode file written in the indicated encoding.

In keeping with XML restrictions we allow the characters

  U+0009
  U+000A
  U+000D
  U+0020 -- U+007E
  U+00A0 -- U+D7FF
  U+E000 -- U+FDCF
  U+FDF0 -- U+FFFD
  U+10000 -- U+10FFFD

In addition, character U+FEFF and characters U+xFFFE or U+xFFFF where x is any hexadecimal digit are disallowed.

Unicode reserves the code points E000 -- F8FF for private use. The IUCr and only the IUCr may specify what characters are assigned to these code points in the context of CIF2.
Reasoning: There is growing demand for the wider character set afforded by Unicode to be made available in applications, especially those where internationalisation is an issue."

=====================================================

In particular, the statement

> The only CIF 2.0 mechanisms currently supported for
> including literal characters that have no ASCII mapping are (1) to
> encode the whole document in UTF-8 with or without a UTF-8 BOM, or (2)
> to encode the whole document in UTF-16 with a UTF-16 BOM.

implies the encoding issue is settled. That is not what the draft change document says, and what is in the change document is certainly not the last word on encodings. I would urge those who have ideas on the subject to feel free to express them, especially because the specification of "disambiguation signatures" is an open, unresolved issue in the change document, and the concept of a Unicode BOM admits a much wider range of encodings than just UTF-8 and UTF-16.

I have found what has been said thus far very helpful and educational, and hope that the discussion will continue.

Regards,
Herbert

=====================================================
Herbert J. Bernstein, Professor of Computer Science
Dowling College, Kramer Science Center, KSC 121
Idle Hour Blvd, Oakdale, NY, 11769

+1-631-244-3035
yaya@dowling.edu
=====================================================

On Thu, 10 Mar 2011, Bollinger, John C wrote:

>
> On Thursday, March 10, 2011 3:22 AM, Matthew Towler wrote:
>
>> I will agree with many of the points made by Peter. I believe the
>> decision on the byte order markings (BOM) should be made having
>> considered what type of format CIF should be. As I see it there are
>> two options.
>>
>> 1) An easily human editable, text based format, as CIF 1.1 is
>> presently. [...]
>>
>> 2) A machine editable or non-text format, such as XML or PDF or a text
>> file with non-standard encoding. [...]
>
> [...]
>
>> In summary, I feel that creating a non-standard-standard will impede
>> the usage of the new files, and therefore the best choice is to use
>> standard Unicode files.
>
> I would like to point out that the DDLm technical subcommittee devoted
> considerable time and energy to character encoding and related topics,
> to the extent that we prevailed upon IUCr to provide a discussion list
> specifically for that contentious debate. You will find the early part
> of the discussion among the archives of the main DDLm list
> (http://www.iucr.org/__data/iucr/lists/ddlm-group/), and you will find
> the later, larger part of the discussion, including the genesis of our
> ultimate compromise, in the archives of the cif2-encoding list
> (http://www.iucr.org/__data/iucr/lists/cif2-encoding/).
>
> A specification documenting the differences between CIF 1.1 and CIF 2.0
> (http://www.iucr.org/__data/assets/pdf_file/0004/47434/cif2_syntax_changes_jrh20101115.pdf)
> was previously approved by COMCIFS. Inasmuch as the CIF 2.0 syntax
> discussion continues, however, the changes already approved could yet be
> modified. I encourage those interested in the topic of character
> encoding to read the "Change 2" section of the changes document to find
> how CIF 2.0, as currently constituted, will address those issues. To
> summarize, however, the approved CIF 2.0 changes attempt to address the
> text-based historical legacy of CIF -- recognizing that "text" is a
> poorly-defined and system-dependent term -- while simultaneously looking
> forward to Unicode. The only CIF 2.0 mechanisms currently supported for
> including literal characters that have no ASCII mapping are (1) to
> encode the whole document in UTF-8 with or without a UTF-8 BOM, or (2)
> to encode the whole document in UTF-16 with a UTF-16 BOM.
>
> I am certain that COMCIFS would be interested in hearing from anyone who
> believes the compromise to be flawed or unreasonable, or that it would
> hinder adoption of CIF 2.0.
> I do hope to avoid repeating the debate
> that the DDLm group already conducted on the topic, however.
>
>
> Regards,
>
> John
>
> --
> John C. Bollinger, Ph.D.
> Department of Structural Biology
> St. Jude Children's Research Hospital
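
As an illustration only, rules (i) and (ii) and the permitted character ranges quoted from the change document can be sketched in Python. The function names are hypothetical and not part of any CIF software. Treating BOM-less input whose bytes are all below 0x80 as ASCII is an assumption of this sketch, since the quoted text places no formal restriction on the encoding in that case. Disambiguation signatures, which remain unspecified, and BOMs other than UTF-8 or UTF-16 are not handled.

```python
import codecs

# Code point ranges permitted by the quoted draft text (inclusive bounds).
ALLOWED_RANGES = [
    (0x0009, 0x0009),
    (0x000A, 0x000A),
    (0x000D, 0x000D),
    (0x0020, 0x007E),
    (0x00A0, 0xD7FF),
    (0xE000, 0xFDCF),
    (0xFDF0, 0xFFFD),
    (0x10000, 0x10FFFD),
]


def is_allowed_code_point(cp: int) -> bool:
    """Return True if cp is permitted by the quoted character-set rule."""
    # U+FEFF and U+xFFFE / U+xFFFF are explicitly disallowed.
    if cp == 0xFEFF or (cp & 0xFFFF) in (0xFFFE, 0xFFFF):
        return False
    return any(lo <= cp <= hi for lo, hi in ALLOWED_RANGES)


def guess_encoding(data: bytes) -> str:
    """Name a Python codec following rules (i) and (ii) of the quoted text.

    Assumption: a UTF-32 BOM is not distinguished from a UTF-16 one,
    because the quoted rules mention only UTF8 and UTF16 BOMs.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes UTF-8 and strips the BOM
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # byte order is taken from the BOM
    # No BOM. Assumption: all bytes below 0x80 means plain ASCII;
    # anything else falls under rule (i) and is treated as UTF-8.
    return "ascii" if all(b < 0x80 for b in data) else "utf-8"


def first_disallowed(data: bytes):
    """Decode data; return (index, char) of the first disallowed character, or None."""
    text = data.decode(guess_encoding(data))
    for i, ch in enumerate(text):
        if not is_allowed_code_point(ord(ch)):
            return i, ch
    return None
```

A caller might run `first_disallowed(raw_bytes)` on a file's contents before parsing. Input that is not valid in the guessed encoding raises `UnicodeDecodeError`, which this sketch deliberately leaves to the caller.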