Ruby master - Bug #6258: String#succ has surprising behavior for "\u1036" (MYANMAR SIGN ANUSVARA), producing "\u1000" instead of "\u1037"</h1>
<article> <p>2012-04-05T13:10:09Z</p> <ul><li><strong>Category</strong> changed from <i>core</i> to <i>M17N</i></li><li><strong>Status</strong> changed from <i>Open</i> to <i>Assigned</i></li><li><strong>Assignee</strong> set to <i>akr (Akira Tanaka)</i></li></ul><p>Sounds like a bug to me, but I have no idea what's going on. Tanaka-san, what do you think?</p> </article>
<article> <p>2012-04-05T13:43:20Z</p> <ul></ul><p>"\u1036".succ is "\u1000\u1000", not a single character.</p> <p>% ruby -ve 'puts "\u1036".succ.dump'<br> ruby 2.0.0dev (2012-03-16 trunk 35049) [x86_64-linux]<br> "\u{1000}\u{1000}"</p> <p>This is similar to how "z".succ is "aa".</p> <p>It happens because U+1000 to U+1036 are alphabetic characters, while U+0fff and U+1037 are not.</p> <p>% ruby -e '0xfff.upto(0x1037) {|c| p ["%x" % c, /[[:alpha:]]/ =~ c.chr("UTF-8")] }'<br> ["fff", nil]<br> ["1000", 0]<br> ...<br> ["1036", 0]<br> ["1037", nil]</p> <p>What I'm not sure about is whether U+1036 should count as alphabetic or not.<br> I think nurse-san or martin-sensei is the appropriate person for this matter.</p> </article>
<article> <p>2012-04-06T23:50:00Z</p> <ul><li><strong>Assignee</strong> changed from <i>akr (Akira Tanaka)</i> to <i>duerst (Martin Dürst)</i></li></ul> </article>
<article> <p>2015-09-18T09:18:22Z</p> 
<ul><li><strong>Status</strong> changed from <i>Assigned</i> to <i>Feedback</i></li></ul><p>Some information gathered during today's committers' meeting:<br> This is the relevant information from <a href="http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt" class="external">http://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt</a>:</p> <p>1035;MYANMAR VOWEL SIGN E ABOVE;Mn;0;NSM;;;;;N;;;;;<br> 1036;MYANMAR SIGN ANUSVARA;Mn;0;NSM;;;;;N;;;;;<br> 1037;MYANMAR SIGN DOT BELOW;Mn;7;NSM;;;;;N;;;;;<br> 1038;MYANMAR SIGN VISARGA;Mc;0;L;;;;;N;;;;;<br> 1039;MYANMAR SIGN VIRAMA;Mn;9;NSM;;;;;N;;;;;<br> 103A;MYANMAR SIGN ASAT;Mn;9;NSM;;;;;N;;;;;</p> <p>The only difference between U+1036 and U+1037 is the Canonical Combining Class (fourth item, 0 vs. 7).</p> <p>The code chart for Myanmar is at <a href="http://www.unicode.org/charts/PDF/U1000.pdf" class="external">http://www.unicode.org/charts/PDF/U1000.pdf</a>.<br> Relevant information about the script in the Unicode Standard is at <a href="http://www.unicode.org/versions/Unicode8.0.0/ch12.pdf" class="external">http://www.unicode.org/versions/Unicode8.0.0/ch12.pdf</a> (pp. 11ff, in particular the table on p. 13).</p> <p>The idea behind the behavior of String#succ is to use each character as a digit and cycle through the characters in the same alphabet. The simplest case is a..z or A..Z. The implementation works to some extent for many other scripts, but is dependent on things such as whether the characters appear contiguously in the relevant character encoding,...</p> <p>It is unclear which characters 'ideally' should be looped through for Myanmar. For example, the W3C does not (yet?) have an alphabetic list style for Myanmar (see <a href="http://www.w3.org/TR/predefined-counter-styles/#myanmar-styles" class="external">http://www.w3.org/TR/predefined-counter-styles/#myanmar-styles</a>); the same applies to most related scripts (Indic/South East Asian). 
There are good arguments for looping only through the (base) consonants (U+1000..U+1020). Some variations might include independent vowels, and language-specific variants may include the relevant extension characters.</p> <p>In the current implementation, the behavior observed seems to be a consequence of how the String#succ method uses character data provided by Oniguruma/Onigmo. As the subject of the bug says, the current behavior is indeed surprising. But the current implementation is not really useful for any but a select few scripts, and Myanmar is definitely not among them.</p> <p>Once we have information from a reliable source about which characters are most suitable to loop through in Myanmar, we can think about how to fix this problem. So I'm going to set this to "feedback".</p> </article>
<article> <p>2019-07-15T04:54:10Z</p> <ul><li><strong>Status</strong> changed from <i>Feedback</i> to <i>Closed</i></li></ul><p>This was fixed between 2.0 and 2.1:</p> <pre><code>$ ruby20 -e 'p "\u1036".succ'
"\u1000\u1000"
$ ruby21 -e 'p "\u1036".succ'
"\u1038"
</code></pre> </article> </main></body></html>