Feature #17206
Open
Introduce new Regexp option to avoid global MatchData allocations
Description
Originates from https://bugs.ruby-lang.org/issues/17030
When this option is specified, Ruby will not create global MatchData objects when the method does not explicitly need them.
If the new option is named f, we can write /o/f, and grep(/o/f) is faster than grep(/o/).
This speeds up not only grep, but also all?, any?, case, and so on.
Many people have written code like this:
IO.foreach("foo.txt") do |line|
  case line
  when /^#/
    # do nothing
  when /^(\d+)/
    # using $1
  when /xxx/
    # using $&
  when /yyy/
    # not using $&
  else
    # ...
  end
end
This is slow because of the problem mentioned above: every matching when clause allocates a global MatchData, even where it is never read.
Replacing /^#/ with /^#/f, and /yyy/ with /yyy/f, will make it faster.
Some benchmarks (https://bugs.ruby-lang.org/issues/17030#note-9) show a 2.5x to 5x speedup.
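The f option is only a proposal and cannot be run as written. As a rough way to observe the overhead being discussed, the sketch below compares =~ (which sets $~) with match? (which does not), using the standard Benchmark library. The sample data and variable names are made up for illustration, and the timings will vary by Ruby version and machine.

```ruby
require "benchmark"

# Hypothetical sample data: every line contains the letter "o".
lines = Array.new(100_000) { |i| "line #{i} foo" }

count_with_backref = 0
count_without_backref = 0

Benchmark.bm(8) do |x|
  # =~ allocates a MatchData and stores it in $~ on every successful match.
  x.report("=~")     { count_with_backref = lines.count { |l| l =~ /o/ } }
  # match? only answers true/false and leaves $~ untouched.
  x.report("match?") { count_without_backref = lines.count { |l| l.match?(/o/) } }
end
```

Both counts should be identical; only the time spent differs.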
Updated by znz (Kazuhiro NISHIYAMA) about 4 years ago
What does regexp_without_matchdata.match(string) return when matched?
Updated by fatkodima (Dima Fatko) about 4 years ago
znz (Kazuhiro NISHIYAMA) wrote in #note-1:
What does regexp_without_matchdata.match(string) return when matched?
That's what the "when not explicitly needed by the method" part was about: it returns MatchData in this case, as requested.
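For comparison, existing Ruby already draws this line between methods: Regexp#match returns a MatchData because the caller asked for one, while Regexp#match? returns only a boolean and does not touch $~. A minimal sketch using current methods only (the proposed f option is not involved, and the variable names are illustrative):

```ruby
re = /(\d+)/
str = "abc 123"

# Regexp#match returns a MatchData object (and also sets $~).
m = re.match(str)
match_class = m.class   # MatchData
captured = m[1]         # the first capture group

# A failed =~ resets $~ to nil, giving us a clean starting point.
"reset" =~ /nomatch/

# Regexp#match? returns true/false and leaves $~ as it was.
matched = re.match?(str)
backref_after = $~
```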
Updated by fatkodima (Dima Fatko) about 4 years ago
- Subject changed from Introduce new Regexp option to avoid MatchData allocation to Introduce new Regexp option to avoid global MatchData allocations
Updated by Eregon (Benoit Daloze) about 4 years ago
IMHO hardcoding such knowledge in the pattern feels wrong (as opposed to in the matching method, like Regexp#match?, which is fine).
It seems to me that it could cause confusing bugs, e.g. when using /f in the case above, if a when clause later starts to use one of the $~-derived variables.
Then it would unexpectedly always be nil, causing a potentially very subtle bug.
I have a hard time believing that allocating the MatchData is so expensive.
If that's the case, then there must be a lot of optimization potential for faster allocation of MatchData in CRuby.
What I think instead is that this is due to having to set $~ in the caller, and maybe to computing group offsets.
I think it would be worth investigating in more detail where the performance overhead from $~ & friends comes from in CRuby.
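The failure mode described above can be imitated in current Ruby. In this sketch, NoBackrefPattern is a hypothetical stand-in for a /f-style pattern (it is not part of Ruby or of the proposal's implementation): its === delegates to match?, so nothing is written to $~ in the caller.

```ruby
# Hypothetical stand-in for a pattern that does not set $~ when matched.
class NoBackrefPattern
  def initialize(regexp)
    @regexp = regexp
  end

  # case/when calls ===; match? never sets $~.
  def ===(string)
    @regexp.match?(string)
  end
end

def first_number(line, pattern)
  case line
  when pattern
    $1 # relies on $~ having been set by the match above
  end
end

with_backref    = first_number("42 apples", /^(\d+)/)
without_backref = first_number("42 apples", NoBackrefPattern.new(/^(\d+)/))
```

With a plain Regexp the when branch sees the capture; with the stand-in the same branch is taken but $1 is nil, and nothing signals the difference.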
Updated by scivola20 (sciv ola) about 4 years ago
I believe that people who can use the match? and match methods properly can use this new Regexp option properly.
By the way, the total size of $`, $&, and $' equals the size of the target string. Therefore a huge amount of String garbage will be generated if the text is very large.
Updated by Eregon (Benoit Daloze) about 4 years ago
scivola20 (sciv ola) wrote in #note-5:
I believe that people who can use the match? and match methods properly can use this new Regexp option properly.
I disagree: match? is clear, but I think =~ suddenly not setting $~ would be a frequent source of bugs.
By the way, the total size of $`, $&, and $' equals the size of the target string. Therefore a huge amount of String garbage will be generated if the text is very large.
They are all based on $~, aren't they?
I think they only need a copy-on-write copy of the source string (to avoid later mutations affecting them) plus the matched offsets.
At least that's what happens in TruffleRuby.
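The relationship between these variables and $~ is visible from Ruby itself. A small sketch (the sample string and variable names are illustrative):

```ruby
"hello world" =~ /o w/

# The special variables are views onto the last MatchData ($~).
pre, hit, post = $`, $&, $'
same_as_matchdata = [pre, hit, post] == [$~.pre_match, $~[0], $~.post_match]

# The MatchData records the start and end offsets of the match.
offsets = $~.offset(0)
rebuilt = pre + hit + post
```

Here rebuilt equals the original string, and the three pieces are recoverable from the source string and the offsets alone.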
Updated by Eregon (Benoit Daloze) about 4 years ago
I took a quick look; the logic to set $~ is here:
https://github.com/ruby/ruby/blob/148961adcd0704d964fce920330a6301b9704c25/re.c#L1608-L1623
It does not seem so expensive, but the region is allocated with xmalloc(), which is probably not so cheap (there is also a rb_gc() call in there; hopefully it's not hit in practice).
rb_backref_set() goes through a few indirections (it typically needs to reach the caller frame), but it does not seem too expensive either.
I think it would be valuable to investigate further what's actually expensive about setting $~ and how that can be optimized.
A hacky Regexp flag to manually optimize match/=~/=== calls doesn't seem like a good approach to me.
The caller code knows whether it needs $~ etc., not the Regexp literal.
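As an illustration of letting the caller decide, the case example from the description could be written with existing methods roughly as follows. This is only a sketch: classify and its return values are hypothetical names chosen for the example.

```ruby
# Hypothetical helper: each branch picks the matching method it needs.
def classify(line)
  if line.match?(/^#/)
    :comment                      # no match data needed
  elsif (m = /^(\d+)/.match(line))
    [:number, m[1]]               # capture read from the returned MatchData
  elsif (m = /xxx/.match(line))
    [:xxx, m[0]]                  # whole match, instead of $&
  elsif line.match?(/yyy/)
    :yyy                          # no match data needed
  else
    :other
  end
end

results = ["# note", "12 items", "an xxx here", "yyy", "plain"].map { |l| classify(l) }
```

The cost is that the compact case/when form is lost, which is the trade-off the proposal is trying to avoid.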
Updated by scivola20 (sciv ola) about 4 years ago
Sorry. “a huge amount of String garbage” was my misunderstanding.
But I don’t know in what situation this option could cause a bug.