Feature #19908

open

Update to Unicode 15.1

Added by nobu (Nobuyoshi Nakada) over 1 year ago. Updated 21 days ago.

Status:
Assigned
Target version:
-
[ruby-core:114936]

Description

Unicode 15.1 has been released.

The current enc-unicode.rb seems to fail because of the Indic_Conjunct_Break properties, which carry values.

I'm not sure how best to handle these properties.
/\p{InCB_Linker}/ or /\p{InCB=Linker}/, as in the comments in that file?
https://github.com/nobu/ruby/tree/unicode-15.1 uses the former.
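
To see which spelling a given Ruby build accepts, something like the following probe may help. It is only a sketch: `property_supported?` is a hypothetical helper, not part of Ruby or enc-unicode.rb, and the results for the two InCB spellings depend on the Unicode data compiled into the interpreter, so no outcome is claimed for them here.

```ruby
# Hypothetical helper: true if this Ruby's regexp engine accepts \p{<name>}.
def property_supported?(name)
  Regexp.new("\\p{#{name}}")
  true
rescue RegexpError
  # Unknown property names are rejected when the pattern is compiled.
  false
end

# Control case: an existing property-with-value name.
puts "Grapheme_Cluster_Break=Extend: #{property_supported?('Grapheme_Cluster_Break=Extend')}"

# The two candidate spellings discussed above; output varies by Ruby version.
%w[InCB_Linker InCB=Linker].each do |name|
  puts "#{name}: #{property_supported?(name)}"
end
```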


Related issues 4 (2 open, 2 closed)

Related to Ruby master - Bug #10416: Create mechanism for updating of Unicode data files downstreams when we want (Assigned, nobu (Nobuyoshi Nakada))
Related to Ruby master - Bug #20150: Memory leak in grapheme clusters (Closed)
Has duplicate Ruby master - Feature #19171: Update Unicode data to Unicode Version 15.1 (Closed, duerst (Martin Dürst))
Precedes Ruby master - Feature #20724: Update to Unicode 16.0 (Assigned, duerst (Martin Dürst))
Actions #1

Updated by nobu (Nobuyoshi Nakada) over 1 year ago

  • Related to Bug #10416: Create mechanism for updating of Unicode data files downstreams when we want added
Actions #2

Updated by hsbt (Hiroshi SHIBATA) about 1 year ago

  • Target version deleted (3.3)

Updated by duerst (Martin Dürst) about 1 year ago

There is a more serious issue than whether to use '_' or '=' in the property name: Unicode 15.1 makes significant changes to grapheme clusters.

Our implementation (function 'node_extended_grapheme_cluster' in regparse.c) is based on Unicode 11.0, in particular https://www.unicode.org/reports/tr29/tr29-33.html#Grapheme_Cluster_Boundaries. This is quite a bit different from the current version at https://www.unicode.org/reports/tr29/tr29-43.html#Grapheme_Cluster_Boundaries. One major difference is that for Unicode 11.0, there was a regular expression for grapheme clusters, which I just implemented in the above function. Unicode 15.1 just says that it's possible to use a regular expression, but doesn't give this regular expression.

From reading through https://www.unicode.org/versions/Unicode15.1.0/#Migration, that's the main issue affecting Ruby.
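
For readers who want to observe the current behavior, here is a small sketch using `String#grapheme_clusters` and `\X`, both of which exist in Ruby. `cluster_codepoints` is a hypothetical helper introduced only for display. The Devanagari sample (KA, VIRAMA, SSA) is the kind of sequence affected by the 15.1 rules, and how it is split depends on the Unicode version the interpreter was built with, so no result is claimed for it.

```ruby
# Hypothetical helper: split a string into grapheme clusters and show
# each cluster as a list of code points, e.g. ["U+0065", "U+0301"].
def cluster_codepoints(str)
  str.grapheme_clusters.map do |cluster|
    cluster.codepoints.map { |cp| format("U+%04X", cp) }
  end
end

samples = {
  "combining accent"    => "e\u0301",
  "CR LF"               => "\r\n",
  "regional indicators" => "\u{1F1EF}\u{1F1F5}",
  # Affected by the Unicode 15.1 conjunct rule; result varies by Ruby version.
  "Devanagari conjunct" => "\u0915\u094D\u0937",
}

samples.each do |label, str|
  puts "#{label}: #{cluster_codepoints(str).inspect}"
end
```

`"...".scan(/\X/)` should give the same segmentation, since `\X` matches one extended grapheme cluster.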

Updated by duerst (Martin Dürst) about 1 year ago

@nobu (Nobuyoshi Nakada):
We have Grapheme_Cluster_Break=..., so I think '=' may be appropriate. But Grapheme_Cluster_Break=... uses a long, explicit name. So shouldn't it be Indic_Conjunct_Break=..., not just InCB=...?

Actions #5

Updated by duerst (Martin Dürst) about 1 year ago

  • Related to Bug #20150: Memory leak in grapheme clusters added

Updated by janosch-x (Janosch Müller) about 1 year ago

Isn't this the updated regular expression?

 ccs-base :=     [\p{L}\p{N}\p{P}\p{S}\p{Zs}]
 ccs-extend :=  [\p{M}\p{Join_Control}]
 extended_base :=       ccs-base
 | hangul-syllable
-crlf :=        CR LF
+crlf :=        CR LF | CR | LF
 legacy-core := hangul-syllable
 | ri-sequence
 | xpicto-sequence
 legacy-postcore :=    [Extend ZWJ]
 core :=        hangul-syllable
 | ri-sequence
 | xpicto-sequence
+| conjunctCluster
 | [^Control CR LF]
 postcore :=    [Extend ZWJ SpacingMark]
 precore :=     Prepend
 hangul-syllable :=    L* (V+ | LV V* | LVT) T*
 | L+
 | T+
 xpicto-sequence :=     \p{Extended_Pictographic} (Extend* ZWJ \p{Extended_Pictographic})*
+conjunctCluster :=     \p{InCB=Consonant} ([\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Linker} [\p{InCB=Extend} \p{InCB=Linker}]* \p{InCB=Consonant})+
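
The added conjunctCluster rule can be tried out in Ruby even without `\p{InCB=...}` support. The sketch below is an approximation under loud assumptions: `conjunct_cluster_regexp` is a hypothetical helper, and the three character sets are hand-picked Devanagari stand-ins (consonants U+0915..U+0939, the virama U+094D as Linker, nukta U+093C and ZWJ U+200D as Extend), not the full InCB property data.

```ruby
# Hypothetical sketch of the conjunctCluster rule:
#   Consonant ([Extend Linker]* Linker [Extend Linker]* Consonant)+
# The sets below are a small Devanagari-only subset chosen for illustration.
def conjunct_cluster_regexp
  consonant = '\u0915-\u0939'   # stand-in for \p{InCB=Consonant}
  linker    = '\u094D'          # stand-in for \p{InCB=Linker} (virama)
  extend    = '\u093C\u200D'    # stand-in for \p{InCB=Extend} (nukta, ZWJ)

  /
    [#{consonant}]
    (?:
      [#{extend}#{linker}]* [#{linker}] [#{extend}#{linker}]*
      [#{consonant}]
    )+
  /x
end

re = conjunct_cluster_regexp
ksha = "\u0915\u094D\u0937"              # KA, VIRAMA, SSA
puts re.match(ksha)[0] == ksha           # whole sequence is one conjunct
puts re.match("\u0915\u0937").inspect    # no linker between the consonants
```

A real implementation would generate the three sets from the Unicode data files rather than hard-coding ranges.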

Updated by duerst (Martin Dürst) about 1 year ago

@janosch-x You are correct, thanks! I noticed it a few days ago, but hadn't yet gotten around to writing about it here. You beat me to it!

Updated by hsbt (Hiroshi SHIBATA) 4 months ago

Unicode 16.0 has been released.

https://www.unicode.org/versions/Unicode16.0.0/

Should we move to this instead of 15.1?

Actions #9

Updated by duerst (Martin Dürst) 4 months ago

Updated by duerst (Martin Dürst) 4 months ago

hsbt (Hiroshi SHIBATA) wrote in #note-8:

Unicode 16.0 has been released.

Should we move to this instead of 15.1?

I think it's more prudent to do 15.1 first, then 16.0. I hope to be able to work on this soon. I created a separate issue for 16.0.

Updated by hsbt (Hiroshi SHIBATA) 4 months ago

I think it's more prudent to do 15.1 first, then 16.0.

Agreed, thanks!

Actions #12

Updated by hsbt (Hiroshi SHIBATA) 4 months ago

  • Has duplicate Feature #19171: Update Unicode data to Unicode Version 15.1 added

Updated by ima1zumi (Mari Imaizumi) 21 days ago

@duerst (Martin Dürst)

I'm interested in working on this issue. Are you planning to start it? If not, I'd like to try.
