Feature #19908
Open
Update to Unicode 15.1
Description
Unicode 15.1 has been released.
The current enc-unicode.rb seems to fail because of the Indic_Conjunct_Break
properties, which come with values.
I'm not sure how these properties should be handled. Should they be written as
/\p{InCB_Linker}/
or as
/\p{InCB=Linker}/
as in the comments in that file?
https://github.com/nobu/ruby/tree/unicode-15.1 uses the former.
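To see which of these spellings a given Ruby build accepts, a small probe along the following lines may help. It is only a sketch: `property_supported?` is a hypothetical helper name, the candidate list is an assumption taken from the names discussed in this issue, and the results will differ between Ruby versions.

```ruby
# Hypothetical helper: true if this Ruby's regexp engine accepts \p{name}.
def property_supported?(name)
  Regexp.new("\\p{#{name}}")
  true
rescue RegexpError
  # Unknown property names raise RegexpError when the pattern is compiled.
  false
end

# Candidate spellings from this discussion (assumed, not an official list).
candidates = %w[
  Grapheme_Cluster_Break=Extend
  InCB_Linker
  InCB=Linker
  Indic_Conjunct_Break=Linker
]

candidates.each do |name|
  status = property_supported?(name) ? "accepted" : "rejected"
  puts format("%-32s %s", name, status)
end
```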
Updated by nobu (Nobuyoshi Nakada) about 1 year ago
- Related to Bug #10416: Create mechanism for updating of Unicode data files downstreams when we want added
Updated by duerst (Martin Dürst) 10 months ago
There is a more serious issue than whether to use '_' or '=' in the property name: Unicode 15.1 makes some serious changes to grapheme clusters.
Our implementation (function 'node_extended_grapheme_cluster' in regparse.c) is based on Unicode 11.0, in particular https://www.unicode.org/reports/tr29/tr29-33.html#Grapheme_Cluster_Boundaries. This is quite a bit different from the current version at https://www.unicode.org/reports/tr29/tr29-43.html#Grapheme_Cluster_Boundaries. One major difference is that for Unicode 11.0, there was a regular expression for grapheme clusters, which I just implemented in the above function. Unicode 15.1 just says that it's possible to use a regular expression, but doesn't give this regular expression.
From reading through https://www.unicode.org/versions/Unicode15.1.0/#Migration, that's the main issue affecting Ruby.
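For a quick look at how a given Ruby segments text, `String#grapheme_clusters` (or `scan(/\X/)`) can be used. The sketch below makes no claim about the cluster count for the Devanagari sample, since that depends on which version of the UAX #29 rules the running Ruby implements. The sample strings are chosen here for illustration only.

```ruby
samples = {
  "CR LF"               => "\r\n",
  "e + combining acute" => "e\u0301",
  # Consonant + virama + consonant: the kind of sequence affected by the
  # Unicode 15.1 grapheme cluster changes.
  "KA + VIRAMA + SSA"   => "\u0915\u094D\u0937",
}

samples.each do |label, text|
  clusters = text.grapheme_clusters
  # Print code points rather than glyphs so the output is terminal-independent.
  shown = clusters.map { |c| c.codepoints.map { |cp| format("U+%04X", cp) } }
  puts "#{label}: #{clusters.length} cluster(s) #{shown.inspect}"
end
```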
Updated by duerst (Martin Dürst) 10 months ago
@nobu (Nobuyoshi Nakada):
We have Grapheme_Cluster_Break=..., so I think '=' may be appropriate. But Grapheme_Cluster_Break=... uses a long, explicit name. So shouldn't it be Indic_Conjunct_Break=..., not just InCB=...?
Updated by duerst (Martin Dürst) 10 months ago
- Related to Bug #20150: Memory leak in grapheme clusters added
Updated by janosch-x (Janosch Müller) 10 months ago
Isn't this the updated regular expression?
ccs-base := [\p{L}\p{N}\p{P}\p{S}\p{Zs}]
ccs-extend := [\p{M}\p{Join_Control}]
extended_base := ccs-base
| hangul-syllable
-crlf := CR LF
+crlf := CR LF | CR | LF
legacy-core := hangul-syllable
| ri-sequence
| xpicto-sequence
legacy-postcore := [Extend ZWJ]
core := hangul-syllable
| ri-sequence
| xpicto-sequence
+| conjunctCluster
| [^Control CR LF]
postcore := [Extend ZWJ SpacingMark]
precore := Prepend
hangul-syllable := L* (V+ | LV V* | LVT) T*
| L+
| T+
xpicto-sequence := \p{Extended_Pictographic} (Extend* ZWJ \p{Extended_Pictographic})*
+conjunctCluster := \p{InCB=Consonant} ([\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Consonant})+
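As a rough illustration of the added conjunctCluster rule, it can be written as a Ruby regexp with hand-picked stand-ins for the three InCB classes. This is a sketch under stated assumptions: the constants below cover only a few Devanagari code points and are not the real InCB data from the Unicode Character Database, and the names `CONSONANT`, `LINKER`, `EXTEND`, and `CONJUNCT_CLUSTER` are invented for this example.

```ruby
# Tiny illustrative subsets, NOT the full InCB property data.
CONSONANT = "\u0915-\u0939"   # DEVANAGARI KA..HA (stand-in for InCB=Consonant)
LINKER    = "\u094D"          # DEVANAGARI VIRAMA (stand-in for InCB=Linker)
EXTEND    = "\u093C\u200D"    # NUKTA, ZERO WIDTH JOINER (stand-in for InCB=Extend)

# Consonant ([Extend Linker]* Linker [Extend Linker]* Consonant)+
CONJUNCT_CLUSTER = /
  [#{CONSONANT}]
  (?: [#{EXTEND}#{LINKER}]* [#{LINKER}] [#{EXTEND}#{LINKER}]* [#{CONSONANT}] )+
/x

ksha = "\u0915\u094D\u0937"   # KA + VIRAMA + SSA
ka   = "\u0915"               # a lone consonant

# The whole KA + VIRAMA + SSA sequence is matched as one conjunct.
puts ksha[CONJUNCT_CLUSTER] == ksha
# A single consonant has no linker, so the rule does not apply.
puts ka.match?(CONJUNCT_CLUSTER)
```

In a real implementation the three classes would come from the generated property tables rather than literal ranges.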
Updated by duerst (Martin Dürst) 10 months ago
@janosch-x (Janosch Müller) You are correct, thanks! I noticed it a few days ago, but hadn't yet gotten around to writing about it here. You beat me to it!
Updated by hsbt (Hiroshi SHIBATA) about 2 months ago
Unicode 16.0 has been released.
https://www.unicode.org/versions/Unicode16.0.0/
Should we move to this instead of 15.1?
Updated by duerst (Martin Dürst) about 2 months ago
- Precedes Feature #20724: Update to Unicode 16.0 added
Updated by duerst (Martin Dürst) about 2 months ago
hsbt (Hiroshi SHIBATA) wrote in #note-8:
Unicode 16.0 has been released.
Should we move to this instead of 15.1?
I think it's more prudent to do 15.1 first, then 16.0. I hope to be able to work on this soon. I created a separate issue for 16.0.
Updated by hsbt (Hiroshi SHIBATA) about 2 months ago
I think it's more prudent to do 15.1 first, then 16.0.
Agreed, thanks!
Updated by hsbt (Hiroshi SHIBATA) about 2 months ago
- Has duplicate Feature #19171: Update Unicode data to Unicode Version 15.1 added