Feature #19694
openAdd Regexp#timeout= setter
Description
Abstract¶
In addition to allowing for a Regexp timeout to be set on individual instances by setting a timeout
argument in Regexp.new
, I'm proposing that we also allow setting the timeout on Regexp objects with a #timeout=
setter.
Background¶
To be able to roll out a global Regexp timeout for a large application, there are inevitably some individual regexes for which a different timeout is appropriate. While the timeout
keyword argument was added to Regexp.new
, this isn't always a viable option.
In the case of regex literal syntax (/ab*/
or %r{ab*}
, for instance), it's not possible to set a timeout at all right now without converting to Regexp.new
, which may be awkward depending on the contents of the regex.
It also is desirable from time to time to be able to set a timeout for a regex object after it's been initialized.
Finally, because we offer a Regexp#timeout
getter, for consistency it would be nice to also offer a setter.
The introduction of a Regexp#timeout=
setter was mentioned as a possible way to set individual timeouts in https://bugs.ruby-lang.org/issues/19104#Specification.
Proposal¶
I propose that we add the method Regexp#timeout=
. It works the same way the timeout
argument works in Regexp.new
, taking either a float or nil.
This makes it relatively easy to add timeouts to specific regex literals (regex literals are frozen by default so you do have to dup
them first):
emoji_filter_pattern = %r{
(?<!#{Regexp.quote(ZERO_WIDTH_JOINER)})
#{EmojiFilter.unicodes_pattern}
(?!#{Regexp.union(EmojiFilter::MODIFIER_CHAR_MAP.keys.map { |k| Regexp.quote k })})
}x.dup
emoji_filter_pattern.timeout = 1.0
emoji_filter_pattern.freeze
Implementation¶
This setter has been implemented in https://github.com/ruby/ruby/pull/7847.
Evaluation¶
It's just a setter, so pretty straightforward in terms of implementation and use.
Discussion¶
It's worth considering other options for overriding Regexp.timeout
. I'd love to see something like the following for overriding regexp timeouts as well:
Regexp.timeout = 1.0
Regexp.with_timeout(5.0) do
evaluate_slower_regexes
end
It's possible to implement something like Regexp.with_timeout
but it's not thread-safe by default since it would involve overwriting Regexp.timeout
.
Summary¶
Regexp instances have a getter for timeout, and adding a corresponding setter adds consistency and will make it easier for developers to adopt adding a global Regexp.timeout
by making it simpler to adjust timeouts on a regex by regex basis.
It's a minor change but the added consistency and flexibility help us optimize for developer happiness.
Updated by austin (Austin Ziegler) over 1 year ago
Since you have to .dup
a regexp literal, it might be better to just call Regexp.new
:
emoji_filter_pattern = %r{
(?<!#{Regexp.quote(ZERO_WIDTH_JOINER)})
#{EmojiFilter.unicodes_pattern}
(?!#{Regexp.union(EmojiFilter::MODIFIER_CHAR_MAP.keys.map { |k| Regexp.quote k })})
}x
emoji_filter_pattern = Regexp.new(emoji_filter_pattern, timeout: 1.0).freeze
I am a little surprised that regex literals are frozen, but Regexp.new regexen aren’t.
Updated by byroot (Jean Boussier) over 1 year ago
I am a little surprised that regex literals are frozen, but Regexp.new regexen aren’t.
It's because:
def foo
/foo/.object_id
end
Always return the same instance. So if it's mutable, it allow to leak state where it's not expected.
Another reason if that deduplicating literal regexp could save a decent amount of memory: #16557. They were frozen to allow deduplicating them, but the implementation is not there yet.
Updated by nobu (Nobuyoshi Nakada) over 1 year ago
I agree with @austin, calling new
with timeout:
feels simpler and better than dup
then set.
Updated by Eregon (Benoit Daloze) over 1 year ago
On TruffleRuby all Regexp instances are frozen, notably because they are all globally cached and deduplicated, so this pattern with dup
+ setter cannot work.
Also it seems clearly better that a Regexp timeout is passed when creating the Regexp, not after, semantically.
I do get the point that Regexp.new
is less nice than /.../
.
However if you are on 3.2 and the Regexp is linear, then the timeout is most likely unnecessary, and so in general I think it's more effective to work on making a Regexp linear than to use individual timeouts.
Regexp.with_timeout(5.0) do
could be a way, as you show, although it introduces fiber-local state just to pass probably to a single Regexp, so it feels like too much complexity and footprint for that.
I think Regexp.timeout=
is insufficent anyway to prevent ReDoS, because it needs to be a fairly high value to deal with e.g. variance in timings and CPU usage, scheduling, etc, and if a user manages to hit the timeout on every request they could likely still cause a ReDoS.
I think the real solution is to make as many Regexps as possible linear, and warn when a Regexp is not linear (I still need to file a proposal for that).
Updated by byroot (Jean Boussier) over 1 year ago
I do get the point that Regexp.new is less nice than /.../.
The main issue with Regexp.new
is that it compile the regexp on each invocation, which is slow. So you can't just translate from one syntax to the other, you also have to make sure to keep it in a constant.
On TruffleRuby all Regexp instances are frozen, notably because they are all globally cached and deduplicated, so this pattern with dup + setter cannot work.
I assume you could have an internal regexp object that is deduplicated and immutable, and the actual Regexp
object that is exposed could be a tuple of (internal_regexp, timeout)
.
Updated by Eregon (Benoit Daloze) over 1 year ago
byroot (Jean Boussier) wrote in #note-5:
I assume you could have an internal regexp object that is deduplicated and immutable, and the actual
Regexp
object that is exposed could be a tuple of(internal_regexp, timeout)
.
Not really, it would be very messy and have a performance cost.
The reason is we want to embed Regexp objects in the context-independent AST (for persisting JITed code), and so it must not refer to context-specific state.
We could have two types, one for immutable regexps and one for mutable regexps, we do this for String, but it is a nightmare.
Similarly, Ractor wants to be able to use Regexp objects on any Ractor, the requirement for that is that as many Regexp as possible are immutable.
Probably we should make all Regexps immutable in CRuby too for consistency (both across Rubies and between literal/non-literal) and it would benefit Ractor.
Updated by janosch-x (Janosch Müller) over 1 year ago
I think it is better if no code can mutate the timeout of the Regexps that are passed into it, even if that affected only dupped or non-literal Regexps.
If it is really necessary to set custom timeouts on literals, maybe something like this could work:
regexp = Regexp.with_timeout(2.0) { /foo/ }
regexp.timeout # => 2.0
Updated by Eregon (Benoit Daloze) over 1 year ago
janosch-x (Janosch Müller) wrote in #note-7:
regexp = Regexp.with_timeout(2.0) { /foo/ } regexp.timeout # => 2.0
That, and as proposed in the description, doesn't really work if literal Regexps are created at parse time, before execution.
This is the case on CRuby:
$ ruby --disable-gems -e 'pp ObjectSpace.count_objects; O=Object.new; R=/a/' | grep REGEX
:T_REGEXP=>3,
$ ruby --disable-gems -e 'pp ObjectSpace.count_objects; O=Object.new' | grep REGEX
:T_REGEXP=>2,
and it is the case on TruffleRuby as well.
Also it could be confusing for [2.0, 4.0].each { |t| Regexp.with_timeout(t) { /foo/ } }
(it would either set it to 2.0 or to the global timeout, never to 4.0).
Furthermore, even if that worked, it would then break Regexp interning, where the timeout at one place would affect another literal Regexp with the same pattern.
Basically, I think there is no way besides Regexp.new(pattern, timeout: t)
if you want a custom timeout for a Regexp.
Literal Regexp are created too early to set anything.
And adding state (timeout=) to Regexp feels wrong, since most instances are already immutable, and they might become all immutable.
I would suggest to close this, Regexp.new("a", timeout: 2.0)
already works and I think there is no alternative that works well to set the timeout per Regexp.
Updated by janosch-x (Janosch Müller) over 1 year ago
I guess the only noteworthy argument for a change goes like this:
A custom timeout
only being available on Regexp::new
might lead people to write less performant code.
I'm not sure this is a strong argument. The big Regexps that take a noteworthy time to compile are often those with interpolation, as seen in the OP, and I assume these aren't so easy to pre-compile or deduplicate anyway.
Maybe there could be an instance method, Regexp#with_timeout
, that creates and memoizes a copy with a specific timeout? This also sounds a bit hacky, though...
Updated by Eregon (Benoit Daloze) over 1 year ago
- Related to Feature #19720: Warning for non-linear Regexps added
Updated by nobu (Nobuyoshi Nakada) over 1 year ago
janosch-x (Janosch Müller) wrote in #note-9:
I guess the only noteworthy argument for a change goes like this:
A custom
timeout
only being available onRegexp::new
might lead people to write less performant code.
I made a patch to improve Regexp.new(/RE/)
(and Regexp#dup
).
https://github.com/nobu/ruby/tree/re_copy
Comparison against master 441302be1a.
Iteration per second (i/s):
compare-ruby | built-ruby | |
---|---|---|
dup | 16.684k | 817.776k |
- | 49.02x | |
string | 16.520k | 16.538k |
- | 1.00x | |
regexp | 16.365k | 842.451k |
- | 51.48x |
I'm not sure this is a strong argument. The big Regexps that take a noteworthy time to compile are often those with interpolation, as seen in the OP, and I assume these aren't so easy to pre-compile or deduplicate anyway.
A regexp with interpolation can be replaced with Regexp.new
at almost same performance.
Updated by byroot (Jean Boussier) over 1 year ago
I made a patch to improve Regexp.new(/RE/) (and Regexp#dup).
Interesting. Given that literal regexp are frozen, and even for unfrozen ones most of their state is immutable, have you considered using Copy on Write at the Ruby object level, like Array#dup
/ String#dup
?
Even if copying the bytes is relatively fast, if used in a tight loop it may cause some malloc
churn.
Updated by Eregon (Benoit Daloze) over 1 year ago
janosch-x (Janosch Müller) wrote in #note-9:
A custom
timeout
only being available onRegexp::new
might lead people to write less performant code.
I think it is very well known and easy to know though profiling that one should always cache the result of Regexp.new
(e.g., in a constant).
The docs should point this out though.
I'm not sure this is a strong argument. The big Regexps that take a noteworthy time to compile are often those with interpolation, as seen in the OP, and I assume these aren't so easy to pre-compile or deduplicate anyway.
Actually TruffleRuby inline-caches interpolated Regexp creation, so if it's the same pattern then the same Regexp is used.
In the OP's case it seems easy to store the Regexp in a constant.
Maybe there could be an instance method,
Regexp#with_timeout
, that creates and memoizes a copy with a specific timeout? This also sounds a bit hacky, though...
I think timeouts are inefficient and insufficient to address ReDoS. IMO the real solution is #19720.
Updated by nobu (Nobuyoshi Nakada) over 1 year ago
byroot (Jean Boussier) wrote in #note-12:
I made a patch to improve Regexp.new(/RE/) (and Regexp#dup).
Interesting. Given that literal regexp are frozen, and even for unfrozen ones most of their state is immutable, have you considered using Copy on Write at the Ruby object level, like
Array#dup
/String#dup
?
Do you mean compiled pattern and so on in OnigRegexType
?
They are never changed once initialized until destruction, "Copy-on-Write" won't be a proper word.
Currently timelimit
is embedded at the same level as other fields, so the struct must be reconfigured to share other fields.
Even if copying the bytes is relatively fast, if used in a tight loop it may cause some
malloc
churn.
It is faster than re-parsing the source at least.
Updated by byroot (Jean Boussier) over 1 year ago
They are never changed once initialized until destruction, "Copy-on-Write" won't be a proper word.
Right, I just meant directly sharing that structure, or more accurately having some T_STRUCT
simply keeping a reference to another T_STRUCT
. Making Regexp.dup
and Regexp.new(Regexp)
a simple embeded T_STRUCT
allocation, like for String#dup
.