=begin
On 2009/09/13 9:21, Run Paint Run Run wrote:
Bug #2095: Oniguruma No Longer Understands Unihan Characters
http://redmine.ruby-lang.org/issues/show/2095
Author: Run Paint Run Run
Status: Open, Priority: High
ruby -v: ruby 1.9.2dev (2009-09-11) [i686-linux]
As Oniguruma was undocumented, the recent update was based mainly on guesswork.
We based the update on UnicodeData.txt and Scripts.txt,
UnicodeData.txt has long contained two-line entries such as
3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
or
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FC3;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
or
AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
D800;<Non Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
DB7F;<Non Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
DB80;<Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
DBFF;<Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
DC00;<Low Surrogate, First>;Cs;0;L;;;;;N;;;;;
DFFF;<Low Surrogate, Last>;Cs;0;L;;;;;N;;;;;
E000;<Private Use, First>;Co;0;L;;;;;N;;;;;
F8FF;<Private Use, Last>;Co;0;L;;;;;N;;;;;
These are indications of any of the following:
- All the characters in the respective range have the same property
(e.g. 'Lo' for CJK Ideographs)
- Certain properties essentially don't apply (e.g. Surrogates are 'L',
but for Ruby, they should not exist, and certainly not match in Regexps)
- Properties or other relevant data should be generated algorithmically
(e.g. Character Names for Ideographs and Hangul, normalization
(de)compositions for Hangul,...)
In my experience, it is best to handle each of these specific ranges
explicitly in a script such as yours, and to throw an error (and fix it
with a patch) when a new range is encountered, because a) such ranges
are rarely added (currently, there are only 10), and b) it is
impossible to predict which of the above three cases applies.
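
As a rough illustration of what I mean, here is a minimal sketch. It is not the actual tool/enc-unicode.rb code: the method name parse_unicode_data, the KNOWN_RANGES table, and the :expand/:skip labels are all hypothetical, and treating Private Use as :expand is only an assumption. The table lists just the ranges quoted above, so any other First/Last pair raises an error instead of being silently guessed at.

```ruby
# Hypothetical table: each known <..., First>/<..., Last> range is named
# explicitly, together with how the script should treat it.
KNOWN_RANGES = {
  'CJK Ideograph Extension A'      => :expand, # every code point shares 'Lo'
  'CJK Ideograph'                  => :expand,
  'Hangul Syllable'                => :expand,
  'Non Private Use High Surrogate' => :skip,   # surrogates should never match
  'Private Use High Surrogate'     => :skip,
  'Low Surrogate'                  => :skip,
  'Private Use'                    => :expand  # assumption, not a given
}.freeze

# Returns a Hash mapping a General Category (e.g. 'Lo') to an Array of
# code point Ranges, built from lines in UnicodeData.txt format.
def parse_unicode_data(lines)
  categories = Hash.new { |hash, key| hash[key] = [] }
  open_range = nil # [range name, first code point] while inside a pair

  lines.each do |line|
    fields = line.chomp.split(';')
    next if fields.size < 3
    cp = fields[0].hex
    name = fields[1]
    gc = fields[2]

    if (m = /\A<(.+), First>\z/.match(name))
      # A range we have never seen must be looked at by a human.
      raise ArgumentError, "unknown range: #{m[1]}" unless KNOWN_RANGES.key?(m[1])
      open_range = [m[1], cp]
    elsif (m = /\A<(.+), Last>\z/.match(name))
      unless open_range && open_range[0] == m[1]
        raise ArgumentError, "unmatched Last entry: #{m[1]}"
      end
      categories[gc] << (open_range[1]..cp) if KNOWN_RANGES[m[1]] == :expand
      open_range = nil
    else
      categories[gc] << (cp..cp)
    end
  end
  categories
end
```

Feeding it the Extension A pair above would give a single 0x3400..0x4DB5 range under 'Lo', while a pair not in the table (say, a newly added ideograph extension) stops the script until somebody decides which of the three cases applies.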
Regards, Martin.
but as the former omits Unihan characters, their properties are no longer recognized. To fix this we can have tool/enc-unicode.rb parse Unihan.txt (or, rather, the files into which it is split as of Unicode 5.2). However, I'd prefer instead to update the script to use the new XML dump Unicode has made available. This is comprehensive, and the simpler, standardized file format means parsing bugs are far less likely. In addition, it makes it easier to expand our Unicode support in the future simply by selecting additional attributes. Unfortunately, both approaches preclude storing the data file(s) in SVN (as we currently do with UnicodeData.txt and Scripts.txt) because the Unihan.txt file alone is 28MB uncompressed. (The XML dump is, of course, even bigger.)
In the next 24 hours I will update the script to download the latest XML dump and parse it.
http://redmine.ruby-lang.org
--
#-# Martin J. Dürst, Professor, Aoyama Gakuin University
#-# http://www.sw.it.aoyama.ac.jp mailto:duerst@it.aoyama.ac.jp
=end