Bug #2095

closed

Oniguruma No Longer Understands Unihan Characters

Added by runpaint (Run Paint Run Run) over 14 years ago. Updated about 13 years ago.

Status:
Closed
Target version:
-
ruby -v:
ruby 1.9.2dev (2009-09-11) [i686-linux]
Backport:
[ruby-core:25540]

Description

=begin
As Oniguruma was undocumented, the recent update was based mainly on guesswork. While working on a Unicode library to create an exhaustive test suite, I noticed that the update introduced a serious regression. We based the update on UnicodeData.txt and Scripts.txt, but as the former omits Unihan characters, their properties are no longer recognized. To fix this we can have tool/enc-unicode.rb parse Unihan.txt (or, rather, the files into which it has been split as of Unicode 5.2). However, I'd prefer instead to update the script to use the new XML dump Unicode has made available. This is comprehensive, and the simpler, standardized file format means parsing bugs are far less likely. In addition, it makes it easier to expand our Unicode support in the future simply by selecting additional attributes. Unfortunately, both approaches preclude storing the data file(s) in SVN (as we currently do with UnicodeData.txt and Scripts.txt) because the Unihan.txt file alone is 28MB uncompressed. (The XML dump is, of course, even bigger.)

In the next 24 hours I will update the script to download the latest XML dump and parse it.
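
For illustration only, here is a rough sketch of what selecting attributes from the XML dump could look like. It is not the actual tool/enc-unicode.rb change. It assumes the "flat" form of the dump, where each `char` element carries either a `cp` attribute or a `first-cp`/`last-cp` pair, plus short property attributes such as `gc` (General_Category) and `sc` (Script). The names `PropertyCollector` and `collect_properties` are hypothetical, and downloading the file is left out. A stream parser is used because the dump is too large to load comfortably as a tree.

```ruby
require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'

# Hypothetical helper: gathers code point ranges for each value of the
# selected attributes, e.g. "sc=Hani" => [0x4E00..0x9FCB].
class PropertyCollector
  include REXML::StreamListener

  attr_reader :properties

  # wanted: attribute names to extract (assumed short names: gc, sc, ...)
  def initialize(wanted)
    @wanted = wanted
    @properties = Hash.new { |hash, key| hash[key] = [] }
  end

  # Called by the stream parser for every opening tag.
  def tag_start(name, attrs)
    return unless name == 'char'

    # A single character uses cp; a block of characters uses first-cp/last-cp.
    first = attrs['cp'] || attrs['first-cp']
    last  = attrs['cp'] || attrs['last-cp']
    return unless first && last

    range = first.hex..last.hex
    @wanted.each do |attr|
      value = attrs[attr]
      @properties["#{attr}=#{value}"] << range if value
    end
  end
end

# io: an IO or String holding the XML dump.
# Supporting another property should only require adding its attribute name.
def collect_properties(io, wanted = %w[gc sc])
  listener = PropertyCollector.new(wanted)
  REXML::Parsers::StreamParser.new(io, listener).parse
  listener.properties
end
```

Calling `collect_properties(File.open(path))` on a local copy of the dump (the `path` is up to the caller) would return a hash from which the property tables could then be generated.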
=end
