Ruby Issue Tracking System: https://redmine.ruby-lang.org/
Ruby master - Bug #7566: Escape (\u{}) forms in Regexp literals https://redmine.ruby-lang.org/issues/7566?journal_id=34757 2012-12-15T10:53:53Z drbrain (Eric Hodel) drbrain@segment7.net
<ul><li><strong>Category</strong> set to <i>core</i></li><li><strong>Target version</strong> set to <i>2.0.0</i></li></ul><p>
Converting any of the regexp special characters could cause a syntax error or warning if the user tries to round-trip the regexp, so I think this is not a bug:</p>
<p>$ ruby20 -ve 'p("\u{5d}", /[\u{5d}]/)'<br>
ruby 2.0.0dev (2012-12-15 trunk 38385) [x86_64-darwin12.2.1]<br>
"]"<br>
/[\u{5d}]/</p>
<p>$ ruby20 -ve 'p(/[]]/)'<br>
ruby 2.0.0dev (2012-12-15 trunk 38385) [x86_64-darwin12.2.1]<br>
-e:1: warning: character class has ']' without escape: /[]]/<br>
/[]]/</p>
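<p>The behaviour described above can be sketched as follows. This is a minimal illustration, not code from the issue: the variable names <code>str</code>, <code>re</code> and <code>round_trip</code> are made up here, and the outputs in the comments assume a UTF-8 source file on a Ruby version that behaves like the transcripts in this thread.</p>

```ruby
# String literal: the \u{5d} escape is converted to the character "]".
str = "\u{5d}"

# Regexp literal: the source keeps the escape exactly as written.
re = /[\u{5d}]/

puts str            # ]
puts re.source      # [\u{5d}]
puts re.inspect     # /[\u{5d}]/

# The preserved escape still denotes "]" when matching.
puts("ab]c" =~ re)  # 2

# Round trip: rebuilding from the preserved source yields a pattern that
# parses without the "']' without escape" warning and matches the same way.
round_trip = Regexp.new(re.source)
puts("ab]c" =~ round_trip)  # 2
```

<p>Had the escape been converted to a bare <code>]</code>, the round-tripped source would be <code>[]]</code>, which is the form that triggers the warning shown above.</p>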
Ruby master - Bug #7566: Escape (\u{}) forms in Regexp literals https://redmine.ruby-lang.org/issues/7566?journal_id=34772 2012-12-16T03:13:56Z brixen (Brian Shirai) brixen@gmail.com
<ul></ul><p>I'd argue that's a malformed Regexp and "round-tripping" shouldn't be expected to work.</p>
<p>sasha:rubinius brian$ irb<br>
1.9.3p327 :001 > re = /[\\u{5d}]/<br>
=> /[\\u{5d}]/<br>
1.9.3p327 :002 > re2 = Regexp.new re<br>
=> /[\\u{5d}]/<br>
1.9.3p327 :003 > re3 = Regexp.new re.source<br>
=> /[\\u{5d}]/<br>
1.9.3p327 :004 > "ab]c" =~ re<br>
=> 2<br>
1.9.3p327 :005 > "ab]c" =~ re2<br>
=> 2<br>
1.9.3p327 :006 > "ab]c" =~ re3<br>
=> 2</p>
<p>Because the source is stored with its escape sequences, and because 7-bit clean source is encoded as US-ASCII even when it uses UTF escapes, the underlying Oniguruma data must be maintained separately and the string potentially unescaped on every match. At least, that is my best understanding of the MRI source code. AFAIK, this is not defined anywhere.</p>
<p>Thanks,<br>
Brian</p>
Ruby master - Bug #7566: Escape (\u{}) forms in Regexp literals https://redmine.ruby-lang.org/issues/7566?journal_id=34784 2012-12-17T11:12:29Z naruse (Yui NARUSE) naruse@airemix.jp
<ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Rejected</i></li></ul><p>Because Regexp Literals are not String Literals, and escapes in them have different meanings.<br>
For example, \b is a word boundary in a Regexp but a backspace (BS) in a String.<br>
People need to distinguish a word boundary from a backspace, so \b must be shown as \b.<br>
\uXXXX follows the same style.</p>
Ruby master - Bug #7566: Escape (\u{}) forms in Regexp literals https://redmine.ruby-lang.org/issues/7566?journal_id=34785 2012-12-17T11:38:04Z brixen (Brian Shirai) brixen@gmail.com
<ul></ul><p>Are you saying you can represent \b as a \u{} escape sequence in a Regexp?</p>
Ruby master - Bug #7566: Escape (\u{}) forms in Regexp literals https://redmine.ruby-lang.org/issues/7566?journal_id=34787 2012-12-17T11:49:57Z naruse (Yui NARUSE) naruse@airemix.jp
<ul></ul><p>brixen (Brian Ford) wrote:</p>
<blockquote>
<p>Are you saying you can represent \b as a \u{} escape sequence in a Regexp?</p>
</blockquote>
<p>No.<br>
(1) \b (word boundary), \s (spaces and tabs) and so on cannot be expressed as bytes<br>
(2) so escapes are not converted to bytes; they are kept as is<br>
(3) \u{} is also an escape, so it is kept as is</p>
Ruby master - Bug #7566: Escape (\u{}) forms in Regexp literals https://redmine.ruby-lang.org/issues/7566?journal_id=35182 2013-01-03T03:37:51Z brixen (Brian Shirai) brixen@gmail.com
<ul></ul><p>But as my example shows, if the bytes were in a literal String used to create the Regexp, they are already converted. And everything works just fine.</p>
<p>What's the rationale for not converting \u{}? Just because it is <em>an</em> escape sequence doesn't mean it is a <em>Regexp</em> escape sequence. Why are they treated the same? It creates inconsistency between two identical Regexps except that one came from a String or Regexp literal with interpolation.</p>
Ruby master - Bug #7566: Escape (\u{}) forms in Regexp literals https://redmine.ruby-lang.org/issues/7566?journal_id=35183 2013-01-03T05:42:39Z phluid61 (Matthew Kerwin) matthew@kerwin.net.au
<ul></ul><p>brixen (Brian Ford) wrote:</p>
<blockquote>
<p>But as my example shows, if the bytes were in a literal String used to create the Regexp, they are already converted. And everything works just fine.</p>
</blockquote>
<p>No it doesn't. There are no literal strings in your example. The closest I can see is you extracting a source string from the Regexp, but I don't think that's doing what you think it is.</p>
<p>irb(main):001:0> re = /[\\u{5d}]/<br>
=> /[\\u{5d}]/<br>
irb(main):002:0> re.source<br>
=> "[\\\\u{5d}]"</p>
<p>If you meant this:</p>
<p>irb(main):003:0> s = "[\u{5d}]"<br>
=> "[]]"<br>
irb(main):004:0> re2 = Regexp.new s<br>
=> /[]]/</p>
<p>You get an entirely different Regexp. They will both match the string "ab]c" because they both include the ']' character in their character class. Incidentally:</p>
<p>irb(main):005:0> re =~ "ab\\c"<br>
=> 2<br>
irb(main):006:0> re2 =~ "ab\\c"<br>
=> nil</p>
<blockquote>
<p>What's the rationale for not converting \u{}? Just because it is <em>an</em> escape sequence doesn't mean it is a <em>Regexp</em> escape sequence. Why are they treated the same?</p>
</blockquote>
<p>They aren't. If it helps, consider that <em>no</em> Regexp escape sequences are treated the same as String escapes.</p>
<p>\\ is a String literal escape sequence that is interpolated to the byte \x5C<br>
\\ is a Regexp literal escape sequence that instructs the engine to match the byte \x5C</p>
<p>\u{} is a String literal escape sequence that is interpolated to a codepoint<br>
\u{} is a Regexp literal escape sequence that instructs the engine to match a codepoint</p>
<p>\b is a String literal that is interpolated to the byte \x08<br>
\b is a Regexp literal that instructs the engine to match a word boundary</p>
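<p>The three pairs above can be checked with a short script. This is only a sketch: the sample strings and the choice of U+00E9 as the codepoint are assumptions made for illustration, and a UTF-8 source file is assumed.</p>

```ruby
# "\\" in a String literal is one byte, 0x5C (a backslash).
backslash = "\\"
p backslash.bytes            # [92]

# /\\/ in a Regexp literal matches that one backslash character.
p("a\\b" =~ /\\/)            # 1

# "\b" in a String literal is the single byte 0x08 (backspace).
p "\b".bytes                 # [8]

# /\b/ in a Regexp literal is a zero-width word-boundary assertion.
p("a cat" =~ /\bcat\b/)      # 2
p("concat" =~ /\bcat\b/)     # nil

# "\u{e9}" in a String literal is the codepoint U+00E9 itself.
p "\u{e9}".codepoints        # [233]

# /\u{e9}/ in a Regexp literal matches that codepoint,
# while the stored source keeps the escape as written.
p("caf\u{e9}" =~ /\u{e9}/)   # 3
p(/\u{e9}/.source)           # "\\u{e9}"
```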
<blockquote>
<p>It creates inconsistency between two identical Regexps except that one came from a String or Regexp literal with interpolation.</p>
</blockquote>
<p>No, if the Regexps were identical they would be identical. As you can see above, re and re2 are not identical, and no one should expect them to be.</p>
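<p>A small sketch of that last point, under some assumptions: it uses \u{41} ("A") rather than \u{5d} so that the String-built pattern does not trigger the unescaped-bracket warning, and the names <code>from_literal</code> and <code>from_string</code> are made up for this example.</p>

```ruby
# Regexp literal: the \u{41} escape reaches the Regexp unconverted.
from_literal = /\u{41}/

# String literal: \u{41} becomes "A" before Regexp.new ever sees it.
from_string = Regexp.new("\u{41}")

p from_literal.source            # "\\u{41}"
p from_string.source             # "A"

# Both match the same text...
p("xAy" =~ from_literal)         # 1
p("xAy" =~ from_string)          # 1

# ...but they are not equal: Regexp#== compares the source
# (along with options and encoding), and the sources differ.
p(from_literal == from_string)   # false
```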