Ruby master - Bug #7566: Escape (\u{}) forms in Regexp literals</h1> <article> <h1>Ruby master - Bug #7566: Escape (\u{}) forms in Regexp literals</h1> <p>2012-12-15T10:53:53Z</p> <ul><li><strong>Category</strong> set to <i>core</i></li><li><strong>Target version</strong> set to <i>2.0.0</i></li></ul><p>=begin<br> Converting any of the regexp special characters could cause a syntax error or warning if the user tries to round-trip the regexp, so I think this is not a bug:</p> <p>$ ruby20 -ve 'p("\u{5d}", /[\u{5d}]/)'<br> ruby 2.0.0dev (2012-12-15 trunk 38385) [x86_64-darwin12.2.1]<br> "]"<br> /[\u{5d}]/</p> <p>$ ruby20 -ve 'p(/[]]/)'<br> ruby 2.0.0dev (2012-12-15 trunk 38385) [x86_64-darwin12.2.1]<br> -e:1: warning: character class has ']' without escape: /[]]/<br> /[]]/</p> <p>=end</p> </article> <article> <h1>Ruby master - Bug #7566: Escape (\u{}) forms in Regexp literals</h1> <p>2012-12-16T03:13:56Z</p> <ul></ul><p>I'd argue that's a malformed Regexp and "round-tripping" shouldn't be expected to work.</p> <p>sasha:rubinius brian$ irb<br> 1.9.3p327 :001 > re = /[\\u{5d}]/<br> => /[\\u{5d}]/<br> 1.9.3p327 :002 > re2 = Regexp.new re<br> => /[\\u{5d}]/<br> 1.9.3p327 :003 > re3 = Regexp.new re.source<br> => /[\\u{5d}]/<br> 1.9.3p327 :004 > "ab]c" =~ re<br> => 2<br> 1.9.3p327 :005 > "ab]c" =~ re2<br> => 2<br> 1.9.3p327 :006 > "ab]c" =~ re3<br> => 2</p> <p>The consequence of storing the source with escape sequences and the fact that 7-bit clean source even using UTF escapes is encoded as US-ASCII is that the underlying Oniguruma data must be maintained separately and the string potentially unescaped every match. At least, that is the best understanding I have of the MRI source code. 
AFAIK, this is not defined anywhere.</p> <p>Thanks,<br> Brian</p> </article> <article> <h1>Ruby master - Bug #7566: Escape (\u{}) forms in Regexp literals</h1> <p>2012-12-17T11:12:29Z</p> <ul><li><strong>Status</strong> changed from <i>Open</i> to <i>Rejected</i></li></ul><p>Because Regexp literals are not String literals, and escapes in them have different meanings.<br> For example, \b is a word boundary in a Regexp but a backspace in a String.<br> People need to distinguish a word boundary from a backspace, so \b must be shown as \b.<br> \uXXXX follows the same style.</p> </article> <article> <h1>Ruby master - Bug #7566: Escape (\u{}) forms in Regexp literals</h1> <p>2012-12-17T11:38:04Z</p> <ul></ul><p>Are you saying you can represent \b as a \u{} escape sequence in a Regexp?</p> </article> <article> <h1>Ruby master - Bug #7566: Escape (\u{}) forms in Regexp literals</h1> <p>2012-12-17T11:49:57Z</p> <ul></ul><p>brixen (Brian Ford) wrote:</p> <blockquote> <p>Are you saying you can represent \b as a \u{} escape sequence in a Regexp?</p> </blockquote> <p>No.<br> (1) \b (word boundary), \s (spaces and tabs) and so on can't be expressed as bytes<br> (2) so escapes are not converted to bytes; they are kept as is<br> (3) \u{} is also an escape, so it is kept as is</p> </article> <article> <h1>Ruby master - Bug #7566: Escape (\u{}) forms in Regexp literals</h1> <p>2013-01-03T03:37:51Z</p> <ul></ul><p>But as my example shows, if the bytes were in a literal String used to create the Regexp, they are already converted. And everything works just fine.</p> <p>What's the rationale for not converting \u{}? Just because it is <em>an</em> escape sequence doesn't mean it is a <em>Regexp</em> escape sequence. Why are they treated the same?
It creates inconsistency between two identical Regexps except that one came from a String or Regexp literal with interpolation.</p> </article> <article> <h1>Ruby master - Bug #7566: Escape (\u{}) forms in Regexp literals</h1> <p>2013-01-03T05:42:39Z</p> <ul></ul><p>brixen (Brian Ford) wrote:</p> <blockquote> <p>But as my example shows, if the bytes were in a literal String used to create the Regexp, they are already converted. And everything works just fine.</p> </blockquote> <p>No it doesn't. There are no literal strings in your example. The closest I can see is you extracting a source string from the Regexp, but I don't think that's doing what you think it is.</p> <p>irb(main):001:0> re = /[\\u{5d}]/<br> => /[\\u{5d}]/<br> irb(main):002:0> re.source<br> => "[\\\u{5d}]"</p> <p>If you meant this:</p> <p>irb(main):003:0> s = "[\\u{5d}]"<br> => "[\]]"<br> irb(main):004:0> re2 = Regexp.new s<br> => /[]]/</p> <p>You get an entirely different Regexp. They will both match the string "ab]c" because they both include the ']' character in their character class. Incidentally:</p> <p>irb(main):005:0> re =~ "ab\c"<br> => 2<br> irb(main):006:0> re2 =~ "ab\c"<br> => nil</p> <blockquote> <p>What's the rationale for not converting \u{}? Just because it is <em>an</em> escape sequence doesn't mean it is a <em>Regexp</em> escape sequence. Why are they treated the same?</p> </blockquote> <p>They aren't. 
If it helps, consider that <em>no</em> Regexp escape sequences are treated the same as String escapes.</p> <p>\\ is a String literal escape sequence that is interpolated to the byte \x5C<br> \\ is a Regexp literal escape sequence that instructs the engine to match the byte \x5C</p> <p>\u{} is a String literal escape sequence that is interpolated to a codepoint<br> \u{} is a Regexp literal escape sequence that instructs the engine to match a codepoint</p> <p>\b is a String literal escape sequence that is interpolated to the byte \x08<br> \b is a Regexp literal escape sequence that instructs the engine to match a word boundary</p> <blockquote> <p>It creates inconsistency between two identical Regexps except that one came from a String or Regexp literal with interpolation.</p> </blockquote> <p>No, if the Regexps were identical they would be identical. As you can see above, re and re2 are not identical, and no one should expect them to be.</p> </article> </main></body></html>