Bug #18294
closederror when parsing regexp comment
Added by thyresias (Thierry Lambert) about 3 years ago. Updated over 2 years ago.
Description
The following code generates the error "too short escaped multibyte character"
_re = /
foo # \M-ca
/x
Removing the \ or doubling it makes the error disappear.
Since this is in comment text, I would expect to be able to type anything there: am I missing something?
Updated by duerst (Martin Dürst) about 3 years ago
thyresias (Thierry Lambert) wrote:
The following code generates the error "too short escaped multibyte character"
_re = / foo # \M-ca /x
Removing the \ or doubling it makes the error disappear.
Since this is in comment text, I would expect to be able to type anything there: am I missing something?
I guess yes. It's somewhat counter-intuitive, but I guess the implementation is handling escapes while it reads the regexp up to the /x, and only then it knows that some parts of it are comments. It would be possible to change the implementation, but I don't know if it's worth it for such an edge case.
Updated by thyresias (Thierry Lambert) about 3 years ago
duerst (Martin Dürst) wrote in #note-1:
I guess yes. It's somewhat counter-intuitive, but I guess the implementation is handling escapes while it reads the regexp up to the /x, and only then it knows that some parts of it are comments. It would be possible to change the implementation, but I don't know if it's worth it for such an edge case.
You have the same issue with this code, where it knows from the start this is an extended regexp, so I guess you explanation does not hold:
_re = /(?x)
foo # \M-ca
/
ruby
Updated by janosch-x (Janosch Müller) almost 3 years ago
this affects:
- all String escapes that can be invalid (
\x
,\u
,\u{...}
,\M
,\C
,\c
) - only invalid escapes (e.g.
\x7F
is fine) - no Regexp-specific escapes such as
\p{...}
,\g<...>
,\k<...>
- Regexp literals (
SyntaxError
) andRegexp::new
(RegexpError
) - end-of-line comments as well as comment groups (these don't require x-mode)
- all Ruby versions
to give an example that is maybe a bit less edge-casy:
/ C:\\[a-z]{5} # e.g. C:\users /x
# => ^
# => invalid Unicode escape (SyntaxError)
the comment handling in regparse.c
could probably be changed fairly easily, it only happens here and here. i could take this on with a few pointers.
edit: i think the RegexpError
when using Regexp::new
is raised by rb_reg_preprocess
returning 0, before the string is even passed to the Onigmo parsing code in regparse.c
, so it's not yet known at this point which part of the data is a comment and which isn't.
i'm also wondering if the flags here mean that escape sequences in Regexp literals are actually pre-processed by Ruby's main parser? this would make a fix much more involved.
Updated by matz (Yukihiro Matsumoto) almost 3 years ago
I admit this is a bug and it should be fixed. But implementation-wise, it's difficult to fix. Considering the (small) impact of the bug, its priority is low.
We will fix it but it could take a long time.
Matz.
Updated by thyresias (Thierry Lambert) almost 3 years ago
I understand it is not easy to fix, and I sure can live with it.
ありがとうMatzさん ^_^
Updated by jeremyevans0 (Jeremy Evans) over 2 years ago
I've submitted a pull request to fix this: https://github.com/ruby/ruby/pull/5721
The basic approach is skip the parse.y checks for regexps, because regexp does the same checks. Modify the regexp code to pass the regexp options to a couple of internal functions, and in the function that handles the unescaping, recognize #
and ignore characters until the end of the line. This becomes complicated, because #
is allowed as a regular, non-comment character, inside a character class []
. So attempt to handle that.
Additionally, I found this issue is not limited to extended regexps, it affects all regexps using (?#
comments. So try to handle those comments as well by ignoring content inside them.
I'm not sure the pull request handles all cases, and I'm also not sure it doesn't introduce regressions. It would great if a more knowledgeable committer could review it.
The patch is kind of a hack. A better fix would be to integrate the unescaping code with the regexp parsing code, instead of trying to unescape in a first pass, before parsing in a later pass. This would allow the unescaping to be aware of the regexp parser state, making it simple to avoid unescaping when inside a regexp comment.
Updated by jeremyevans (Jeremy Evans) over 2 years ago
- Status changed from Open to Closed
Applied in changeset git|ec3542229b29ec93062e9d90e877ea29d3c19472.
Ignore invalid escapes in regexp comments
Invalid escapes are handled at multiple levels. The first level
is in parse.y, so skip invalid unicode escape checks for regexps
in parse.y.
Make rb_reg_preprocess and unescape_nonascii accept the regexp
options. In unescape_nonascii, if the regexp is an extended
regexp, when "#" is encountered, ignore all characters until the
end of line or end of regexp.
Unfortunately, in extended regexps, you can use "#" as a non-comment
character inside a character class, so also parse "[" and "]"
specially for extended regexps, and only skip comments if "#" is
not inside a character class. Handle nested character classes as well.
This issue doesn't just affect extended regexps, it also affects
"(#?" comments inside all regexps. So for those comments, scan
until trailing ")" and ignore content inside.
I'm not sure if there are other corner cases not handled. A
better fix would be to redesign the regexp parser so that it
unescaped during parsing instead of before parsing, so you already
know the current parsing state.
Fixes [Bug #18294]
Co-authored-by: Nobuyoshi Nakada nobu@ruby-lang.org
Updated by mame (Yusuke Endoh) over 1 year ago
- Related to Bug #19750: Ruby hangs when parsing Regexp with unterminated unicode "/\u{" added