Feature #19908

open

Update to Unicode 15.1

Added by nobu (Nobuyoshi Nakada) 7 months ago. Updated 4 months ago.

Status:
Assigned
Target version:
-
[ruby-core:114936]

Description

Unicode 15.1 has been released.

The current enc-unicode.rb seems to fail because the Indic_Conjunct_Break properties come with values.

I'm not sure how best to handle these properties.
Should it be /\p{InCB_Linker}/, or /\p{InCB=Linker}/ as in the comments in that file?
https://github.com/nobu/ruby/tree/unicode-15.1 uses the former.
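To make the two naming options concrete, here is a minimal sketch of reading property lines that carry a value. It is not the code in enc-unicode.rb: the `parse_properties` helper and its `separator:` parameter are hypothetical, and the sample lines are abbreviated entries written in the style of DerivedCoreProperties.txt rather than copied from it.

```ruby
# Abbreviated sample lines in the style of DerivedCoreProperties.txt.
# The InCB lines have three fields (range; property; value); the
# Alphabetic line has the older two-field shape.
SAMPLE = <<~DATA
  0915..0939    ; InCB; Consonant # Lo  [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA
  094D          ; InCB; Linker # Mn       DEVANAGARI SIGN VIRAMA
  093C          ; InCB; Extend # Mn       DEVANAGARI SIGN NUKTA
  0041..005A    ; Alphabetic # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
DATA

# Hypothetical helper: maps each property name to its code point ranges.
# A property with a value gets the name "Property<separator>Value".
def parse_properties(text, separator: "=")
  table = Hash.new { |hash, key| hash[key] = [] }
  text.each_line do |line|
    data = line.sub(/#.*/, "").strip          # drop the trailing comment
    next if data.empty?
    range, name, value = data.split(/\s*;\s*/)
    first, last = range.split("..").map { |hex| hex.to_i(16) }
    last ||= first                            # single code point, not a range
    key = value ? "#{name}#{separator}#{value}" : name
    table[key] << (first..last)
  end
  table
end

p parse_properties(SAMPLE).keys
p parse_properties(SAMPLE, separator: "_").keys
```

With `separator: "="` the InCB names come out as InCB=Consonant, InCB=Linker and InCB=Extend; with `separator: "_"` they come out as InCB_Consonant and so on. Two-field properties such as Alphabetic are unaffected either way.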


Related issues 2 (1 open, 1 closed)

Related to Ruby master - Bug #10416: Create mechanism for updating of Unicode data files downstreams when we want (Assigned, nobu (Nobuyoshi Nakada))
Related to Ruby master - Bug #20150: Memory leak in grapheme clusters (Closed)
Actions #1

Updated by nobu (Nobuyoshi Nakada) 7 months ago

  • Related to Bug #10416: Create mechanism for updating of Unicode data files downstreams when we want added
Actions #2

Updated by hsbt (Hiroshi SHIBATA) 4 months ago

  • Target version deleted (3.3)

Updated by duerst (Martin Dürst) 4 months ago

There is a more serious issue than whether to use '_' or '=' in the property name: Unicode 15.1 makes some serious changes to grapheme clusters.

Our implementation (function 'node_extended_grapheme_cluster' in regparse.c) is based on Unicode 11.0, in particular https://www.unicode.org/reports/tr29/tr29-33.html#Grapheme_Cluster_Boundaries. This is quite a bit different from the current version at https://www.unicode.org/reports/tr29/tr29-43.html#Grapheme_Cluster_Boundaries. One major difference is that for Unicode 11.0, there was a regular expression for grapheme clusters, which I just implemented in the above function. Unicode 15.1 just says that it's possible to use a regular expression, but doesn't give this regular expression.

From reading through https://www.unicode.org/versions/Unicode15.1.0/#Migration, that's the main issue affecting Ruby.
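A small script can show what the grapheme cluster change means in practice. This is only a sketch: the sample string is my own choice, and what it prints depends on which Unicode version the running Ruby was built with, so no expected output is given.

```ruby
require "rbconfig"

# U+0915 KA, U+094D VIRAMA, U+0937 SSA: a Devanagari consonant conjunct.
conjunct = "\u0915\u094D\u0937"

# Unicode version the running Ruby was built against.
unicode_version = RbConfig::CONFIG["UNICODE_VERSION"]

# String#grapheme_clusters uses the same segmentation as \X in a regexp.
clusters = conjunct.grapheme_clusters

puts "Unicode version of this Ruby: #{unicode_version}"
puts "grapheme clusters: #{clusters.size} #{clusters.inspect}"
```

Under the older rules the virama is simply Extend, so the text splits after KA + VIRAMA and SSA starts a second cluster. Under the Unicode 15.1 rules, the conjunct rule keeps all three code points in one cluster.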

Updated by duerst (Martin Dürst) 4 months ago

@nobu (Nobuyoshi Nakada):
We have Grapheme_Cluster_Break=..., so I think '=' may be appropriate. But Grapheme_Cluster_Break=... uses a long, explicit name. So shouldn't it be Indic_Conjunct_Break=..., not just InCB=...?

Actions #5

Updated by duerst (Martin Dürst) 4 months ago

  • Related to Bug #20150: Memory leak in grapheme clusters added

Updated by janosch-x (Janosch Müller) 4 months ago

Isn't this the updated regular expression?

 ccs-base :=     [\p{L}\p{N}\p{P}\p{S}\p{Zs}]
 ccs-extend :=  [\p{M}\p{Join_Control}]
 extended_base :=       ccs-base
 | hangul-syllable
-crlf :=        CR LF
+crlf :=        CR LF | CR | LF
 legacy-core := hangul-syllable
 | ri-sequence
 | xpicto-sequence
 legacy-postcore :=    [Extend ZWJ]
 core :=        hangul-syllable
 | ri-sequence
 | xpicto-sequence
+| conjunctCluster
 | [^Control CR LF]
 postcore :=    [Extend ZWJ SpacingMark]
 precore :=     Prepend
 hangul-syllable :=    L* (V+ | LV V* | LVT) T*
 | L+
 | T+
 xpicto-sequence :=     \p{Extended_Pictographic} (Extend* ZWJ \p{Extended_Pictographic})*
+conjunctCluster :=     \p{InCB=Consonant} ([\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Consonant})+
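As an illustration of the added conjunctCluster rule, here is a sketch in Ruby. Since \p{InCB=...} is not available, it substitutes small hand-picked Devanagari character classes for the three properties. These stand-ins are assumptions for the example only and are not the real property data; the constant names are made up as well.

```ruby
# Hand-picked stand-ins for the InCB properties (Devanagari only, incomplete).
CONSONANT = "[\u0915-\u0939]"   # stand-in for \p{InCB=Consonant}: KA..HA
LINKER    = "\u094D"            # stand-in for \p{InCB=Linker}: VIRAMA
EXTEND    = "\u093C\u200D"      # stand-in for \p{InCB=Extend}: NUKTA, ZWJ

EXTEND_OR_LINKER = "[#{EXTEND}#{LINKER}]"

# conjunctCluster := Consonant ([Extend Linker]* Linker [Extend Linker]* Consonant)+
CONJUNCT_CLUSTER = /
  #{CONSONANT}
  (?: #{EXTEND_OR_LINKER}* #{LINKER} #{EXTEND_OR_LINKER}* #{CONSONANT} )+
/x

kssa       = "\u0915\u094D\u0937"         # KA VIRAMA SSA
kssa_zwj   = "\u0915\u094D\u200D\u0937"   # KA VIRAMA ZWJ SSA
no_linker  = "\u0915\u0937"               # KA SSA, no virama in between

p CONJUNCT_CLUSTER.match?(kssa)       # consonant, linker, consonant
p CONJUNCT_CLUSTER.match?(kssa_zwj)   # extend after the linker is allowed
p CONJUNCT_CLUSTER.match?(no_linker)  # at least one linker is required
```

The key point in the rule is that at least one Linker must appear between two consonants, with any number of Extend or further Linker characters around it.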

Updated by duerst (Martin Dürst) 4 months ago

@janosch-x (Janosch Müller) You are correct, thanks! I noticed it a few days ago, but hadn't yet gotten around to writing about it here. You beat me to it!
