Bug #20504
Interpolated string literal in regexp encoding handling
Status: Open
Description
There is some very odd behavior here, and I'm not sure whether it is intentional, so I'm looking for guidance. In this code:
# encoding: us-ascii
interp = "\x80"
regexp = /#{interp}/
the regexp variable is an ASCII-8BIT regular expression with the byte interpolated into it. However, if you inline that interpolation:
# encoding: us-ascii
regexp = /#{"\x80"}/
you get a syntax error, saying it's an invalid multi-byte character. I'm not sure what the rule is here, as it seems inconsistent. Is this the correct behavior?
I would prefer that it create an ASCII-8BIT regular expression, as in the first example, which would be consistent.
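To check which behavior a given Ruby build has, the two snippets above can be loaded from temporary files that start with the same magic comment. This is only a rough sketch: the helper name probe_us_ascii and the global $probe_regexp are made up for illustration, and the output depends on the Ruby version, so none is shown here.

```ruby
require "tempfile"

# Hypothetical helper (not part of Ruby): writes +body+ to a temporary file
# that begins with a US-ASCII magic comment, loads it, and describes the
# outcome. The snippet is expected to store its regexp in $probe_regexp.
def probe_us_ascii(body)
  $probe_regexp = nil
  Tempfile.create(["regexp_probe", ".rb"]) do |file|
    file.write("# encoding: us-ascii\n#{body}\n")
    file.flush
    load file.path
  end
  "regexp with encoding #{$probe_regexp.encoding}"
rescue SyntaxError => e
  # Keep only the first line of the parser's message.
  "SyntaxError: #{e.message.lines.first.to_s.strip}"
end

# Single-quoted heredocs keep \x80 and #{...} as literal source text.
variable_case = probe_us_ascii(<<~'RUBY')
  interp = "\x80"
  $probe_regexp = /#{interp}/
RUBY

inline_case = probe_us_ascii(<<~'RUBY')
  $probe_regexp = /#{"\x80"}/
RUBY

puts "variable: #{variable_case}"
puts "inline:   #{inline_case}"
```

On a build with the behavior reported here, the second line should show a SyntaxError while the first shows a regexp; if both lines show a regexp, the two forms are being treated consistently.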
Updated by Eregon (Benoit Daloze) 5 months ago
Agreed, the current behavior breaks referential transparency and unexpectedly analyzes string literals inside interpolated parts.
This leads to extra confusion and, I would think, has no value in real-world uses of interpolated regexps (it raises an error where there would otherwise be none).
So I think this is a bug: the implementation should not analyze those parts, and the behavior should therefore be the same as with the extra local variable.
Updated by Eregon (Benoit Daloze) 5 months ago
- Tracker changed from Misc to Bug
- Backport set to 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
Updated by kddnewton (Kevin Newton) 5 months ago
I'm fine with it analyzing the string literals. I would just prefer that it take the same code path as the interpolated variable case, so that it produces an ASCII-8BIT regular expression instead of raising an error.
Updated by mame (Yusuke Endoh) 5 months ago
Discussed at the dev meeting, and @matz (Yukihiro Matsumoto) said /#{"\x80"}/ should not raise a SyntaxError but return a binary encoded regexp object.
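For reference, here is a minimal sketch of what a "binary encoded regexp object" looks like at runtime. It assumes String#b as a stand-in for the binary string that "\x80" produces in the US-ASCII source file from the report, so it runs under any source encoding and does not exercise the parser path discussed above.

```ruby
# Assumption: String#b stands in for the report's "\x80" literal,
# giving an ASCII-8BIT (binary) string regardless of source encoding.
interp = "\x80".b
regexp = /#{interp}/

# The interpolated regexp takes on the binary encoding of its content.
puts regexp.encoding == Encoding::ASCII_8BIT

# It matches the raw byte inside another binary string.
puts regexp.match?("a\x80b".b)
```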