Project

General

Profile

Actions

Feature #19694

open

Add Regexp#timeout= setter

Added by aharpole (Aaron Harpole) over 1 year ago. Updated over 1 year ago.

Status:
Open
Assignee:
-
Target version:
-
[ruby-core:113660]

Description

Abstract

In addition to allowing for a Regexp timeout to be set on individual instances by setting a timeout argument in Regexp.new, I'm proposing that we also allow setting the timeout on Regexp objects with a #timeout= setter.

Background

To be able to roll out a global Regexp timeout for a large application, there are inevitably some individual regexes for which a different timeout is appropriate. While the timeout keyword argument was added to Regexp.new, this isn't always a viable option.

In the case of regex literal syntax (/ab*/ or %r{ab*}, for instance), it's not possible to set a timeout at all right now without converting to Regexp.new, which may be awkward depending on the contents of the regex.

It also is desirable from time to time to be able to set a timeout for a regex object after it's been initialized.

Finally, because we offer a Regexp#timeout getter, for consistency it would be nice to also offer a setter.

The introduction of a Regexp#timeout= setter was mentioned as a possible way to set individual timeouts in https://bugs.ruby-lang.org/issues/19104#Specification.

Proposal

I propose that we add the method Regexp#timeout=. It works the same way the timeout argument works in Regexp.new, taking either a float or nil.

This makes it relatively easy to add timeouts to specific regex literals (regex literals are frozen by default so you do have to dup them first):

emoji_filter_pattern = %r{
  (?<!#{Regexp.quote(ZERO_WIDTH_JOINER)})
  #{EmojiFilter.unicodes_pattern}
  (?!#{Regexp.union(EmojiFilter::MODIFIER_CHAR_MAP.keys.map { |k| Regexp.quote k })})
}x.dup
emoji_filter_pattern.timeout = 1.0
emoji_filter_pattern.freeze

Implementation

This setter has been implemented in https://github.com/ruby/ruby/pull/7847.

Evaluation

It's just a setter, so pretty straightforward in terms of implementation and use.

Discussion

It's worth considering other options for overriding Regexp.timeout. I'd love to see something like the following for overriding regexp timeouts as well:

Regexp.timeout = 1.0

Regexp.with_timeout(5.0) do
  evaluate_slower_regexes
end

It's possible to implement something like Regexp.with_timeout but it's not thread-safe by default since it would involve overwriting Regexp.timeout.

Summary

Regexp instances have a getter for timeout, and adding a corresponding setter adds consistency and will make it easier for developers to adopt adding a global Regexp.timeout by making it simpler to adjust timeouts on a regex by regex basis.

It's a minor change but the added consistency and flexibility help us optimize for developer happiness.


Related issues 1 (0 open1 closed)

Related to Ruby master - Feature #19720: Warning for non-linear RegexpsRejectedActions

Updated by austin (Austin Ziegler) over 1 year ago

Since you have to .dup a regexp literal, it might be better to just call Regexp.new:

emoji_filter_pattern = %r{
  (?<!#{Regexp.quote(ZERO_WIDTH_JOINER)})
  #{EmojiFilter.unicodes_pattern}
  (?!#{Regexp.union(EmojiFilter::MODIFIER_CHAR_MAP.keys.map { |k| Regexp.quote k })})
}x
emoji_filter_pattern = Regexp.new(emoji_filter_pattern, timeout: 1.0).freeze

I am a little surprised that regex literals are frozen, but Regexp.new regexen aren’t.

Updated by byroot (Jean Boussier) over 1 year ago

I am a little surprised that regex literals are frozen, but Regexp.new regexen aren’t.

It's because:

def foo
  /foo/.object_id
end

Always return the same instance. So if it's mutable, it allow to leak state where it's not expected.

Another reason if that deduplicating literal regexp could save a decent amount of memory: #16557. They were frozen to allow deduplicating them, but the implementation is not there yet.

Updated by nobu (Nobuyoshi Nakada) over 1 year ago

I agree with @austin, calling new with timeout: feels simpler and better than dup then set.

Updated by Eregon (Benoit Daloze) over 1 year ago

On TruffleRuby all Regexp instances are frozen, notably because they are all globally cached and deduplicated, so this pattern with dup + setter cannot work.
Also it seems clearly better that a Regexp timeout is passed when creating the Regexp, not after, semantically.

I do get the point that Regexp.new is less nice than /.../.
However if you are on 3.2 and the Regexp is linear, then the timeout is most likely unnecessary, and so in general I think it's more effective to work on making a Regexp linear than to use individual timeouts.

Regexp.with_timeout(5.0) do could be a way, as you show, although it introduces fiber-local state just to pass probably to a single Regexp, so it feels like too much complexity and footprint for that.


I think Regexp.timeout= is insufficent anyway to prevent ReDoS, because it needs to be a fairly high value to deal with e.g. variance in timings and CPU usage, scheduling, etc, and if a user manages to hit the timeout on every request they could likely still cause a ReDoS.
I think the real solution is to make as many Regexps as possible linear, and warn when a Regexp is not linear (I still need to file a proposal for that).

Updated by byroot (Jean Boussier) over 1 year ago

I do get the point that Regexp.new is less nice than /.../.

The main issue with Regexp.new is that it compile the regexp on each invocation, which is slow. So you can't just translate from one syntax to the other, you also have to make sure to keep it in a constant.

On TruffleRuby all Regexp instances are frozen, notably because they are all globally cached and deduplicated, so this pattern with dup + setter cannot work.

I assume you could have an internal regexp object that is deduplicated and immutable, and the actual Regexp object that is exposed could be a tuple of (internal_regexp, timeout).

Updated by Eregon (Benoit Daloze) over 1 year ago

byroot (Jean Boussier) wrote in #note-5:

I assume you could have an internal regexp object that is deduplicated and immutable, and the actual Regexp object that is exposed could be a tuple of (internal_regexp, timeout).

Not really, it would be very messy and have a performance cost.
The reason is we want to embed Regexp objects in the context-independent AST (for persisting JITed code), and so it must not refer to context-specific state.
We could have two types, one for immutable regexps and one for mutable regexps, we do this for String, but it is a nightmare.

Similarly, Ractor wants to be able to use Regexp objects on any Ractor, the requirement for that is that as many Regexp as possible are immutable.
Probably we should make all Regexps immutable in CRuby too for consistency (both across Rubies and between literal/non-literal) and it would benefit Ractor.

Updated by janosch-x (Janosch Müller) over 1 year ago

I think it is better if no code can mutate the timeout of the Regexps that are passed into it, even if that affected only dupped or non-literal Regexps.

If it is really necessary to set custom timeouts on literals, maybe something like this could work:

regexp = Regexp.with_timeout(2.0) { /foo/ }
regexp.timeout # => 2.0

Updated by Eregon (Benoit Daloze) over 1 year ago

janosch-x (Janosch Müller) wrote in #note-7:

regexp = Regexp.with_timeout(2.0) { /foo/ }
regexp.timeout # => 2.0

That, and as proposed in the description, doesn't really work if literal Regexps are created at parse time, before execution.
This is the case on CRuby:

$ ruby --disable-gems -e 'pp ObjectSpace.count_objects; O=Object.new; R=/a/' | grep REGEX
 :T_REGEXP=>3,
$ ruby --disable-gems -e 'pp ObjectSpace.count_objects; O=Object.new' | grep REGEX     
 :T_REGEXP=>2,

and it is the case on TruffleRuby as well.

Also it could be confusing for [2.0, 4.0].each { |t| Regexp.with_timeout(t) { /foo/ } } (it would either set it to 2.0 or to the global timeout, never to 4.0).

Furthermore, even if that worked, it would then break Regexp interning, where the timeout at one place would affect another literal Regexp with the same pattern.

Basically, I think there is no way besides Regexp.new(pattern, timeout: t) if you want a custom timeout for a Regexp.
Literal Regexp are created too early to set anything.

And adding state (timeout=) to Regexp feels wrong, since most instances are already immutable, and they might become all immutable.

I would suggest to close this, Regexp.new("a", timeout: 2.0) already works and I think there is no alternative that works well to set the timeout per Regexp.

Updated by janosch-x (Janosch Müller) over 1 year ago

I guess the only noteworthy argument for a change goes like this:

A custom timeout only being available on Regexp::new might lead people to write less performant code.

I'm not sure this is a strong argument. The big Regexps that take a noteworthy time to compile are often those with interpolation, as seen in the OP, and I assume these aren't so easy to pre-compile or deduplicate anyway.

Maybe there could be an instance method, Regexp#with_timeout, that creates and memoizes a copy with a specific timeout? This also sounds a bit hacky, though...

Actions #10

Updated by Eregon (Benoit Daloze) over 1 year ago

Updated by nobu (Nobuyoshi Nakada) over 1 year ago

janosch-x (Janosch Müller) wrote in #note-9:

I guess the only noteworthy argument for a change goes like this:

A custom timeout only being available on Regexp::new might lead people to write less performant code.

I made a patch to improve Regexp.new(/RE/) (and Regexp#dup).
https://github.com/nobu/ruby/tree/re_copy

Comparison against master 441302be1a.

Iteration per second (i/s):

compare-ruby built-ruby
dup 16.684k 817.776k
- 49.02x
string 16.520k 16.538k
- 1.00x
regexp 16.365k 842.451k
- 51.48x

I'm not sure this is a strong argument. The big Regexps that take a noteworthy time to compile are often those with interpolation, as seen in the OP, and I assume these aren't so easy to pre-compile or deduplicate anyway.

A regexp with interpolation can be replaced with Regexp.new at almost same performance.

Updated by byroot (Jean Boussier) over 1 year ago

I made a patch to improve Regexp.new(/RE/) (and Regexp#dup).

Interesting. Given that literal regexp are frozen, and even for unfrozen ones most of their state is immutable, have you considered using Copy on Write at the Ruby object level, like Array#dup / String#dup ?

Even if copying the bytes is relatively fast, if used in a tight loop it may cause some malloc churn.

Updated by Eregon (Benoit Daloze) over 1 year ago

janosch-x (Janosch Müller) wrote in #note-9:

A custom timeout only being available on Regexp::new might lead people to write less performant code.

I think it is very well known and easy to know though profiling that one should always cache the result of Regexp.new (e.g., in a constant).
The docs should point this out though.

I'm not sure this is a strong argument. The big Regexps that take a noteworthy time to compile are often those with interpolation, as seen in the OP, and I assume these aren't so easy to pre-compile or deduplicate anyway.

Actually TruffleRuby inline-caches interpolated Regexp creation, so if it's the same pattern then the same Regexp is used.
In the OP's case it seems easy to store the Regexp in a constant.

Maybe there could be an instance method, Regexp#with_timeout, that creates and memoizes a copy with a specific timeout? This also sounds a bit hacky, though...

I think timeouts are inefficient and insufficient to address ReDoS. IMO the real solution is #19720.

Updated by nobu (Nobuyoshi Nakada) over 1 year ago

byroot (Jean Boussier) wrote in #note-12:

I made a patch to improve Regexp.new(/RE/) (and Regexp#dup).

Interesting. Given that literal regexp are frozen, and even for unfrozen ones most of their state is immutable, have you considered using Copy on Write at the Ruby object level, like Array#dup / String#dup ?

Do you mean compiled pattern and so on in OnigRegexType?
They are never changed once initialized until destruction, "Copy-on-Write" won't be a proper word.
Currently timelimit is embedded at the same level as other fields, so the struct must be reconfigured to share other fields.

Even if copying the bytes is relatively fast, if used in a tight loop it may cause some malloc churn.

It is faster than re-parsing the source at least.

Updated by byroot (Jean Boussier) over 1 year ago

They are never changed once initialized until destruction, "Copy-on-Write" won't be a proper word.

Right, I just meant directly sharing that structure, or more accurately having some T_STRUCT simply keeping a reference to another T_STRUCT. Making Regexp.dup and Regexp.new(Regexp) a simple embeded T_STRUCT allocation, like for String#dup.

Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0Like0