Bug #18572
Performance regression when invoking refined methods (Closed)
Description
Since Ruby 3.0, defining a refinement for a method slows down its execution even if we do not activate the refinement:
require "benchmark_driver"

source = <<~RUBY
  class Hash
    def symbolize_keys
      transform_keys { |key| key.to_sym rescue key }
    end

    def refined_symbolize_keys
      transform_keys { |key| key.to_sym rescue key }
    end
  end

  module HashRefinements
    refine Hash do
      def refined_symbolize_keys
        raise "never called"
      end
    end
  end

  HASH = {foo: 1, bar: 2, baz: 3}

  class Foo
    def original
    end

    def refined
    end
  end

  module FooRefinements
    refine Foo do
      def refined
        raise "never called"
      end
    end
  end

  FOO = Foo.new
RUBY

Benchmark.driver do |x|
  x.prelude %Q{
    #{source}
  }

  x.report "#symbolize_keys original", %{ HASH.symbolize_keys }
  x.report "#symbolize_keys refined", %{ HASH.refined_symbolize_keys }
end

Benchmark.driver do |x|
  x.prelude %Q{
    #{source}
  }

  x.report "no-op original", %{ FOO.original }
  x.report "no-op refined", %{ FOO.refined }
end
The results for Ruby 3.1:
...
Comparison:
#symbolize_keys original: 2372420.1 i/s
#symbolize_keys refined: 1941019.0 i/s - 1.22x slower
...
Comparison:
no-op original: 51790974.2 i/s
no-op refined: 14456518.9 i/s - 3.58x slower
For Ruby 2.6 and 2.7:
Comparison:
#symbolize_keys original: 2278339.7 i/s
#symbolize_keys refined: 2264153.1 i/s - 1.01x slower
...
Comparison:
no-op refined: 64178338.5 i/s
no-op original: 63357980.1 i/s - 1.01x slower
You can find the full code and more results in this gist.
P.S. The problem was originally noticed by @byroot (Jean Boussier), see https://github.com/ruby-i18n/i18n/pull/573
Updated by shugo (Shugo Maeda) over 2 years ago
- Status changed from Open to Assigned
- Assignee set to ko1 (Koichi Sasada)
It seems that the performance regression was introduced by https://github.com/ruby/ruby/commit/b9007b6c548f91e88fd3f2ffa23de740431fa969
$ cat test.rb
class Foo
  def original
  end

  def refined
  end
end

module FooRefinements
  refine Foo do
    def refined
      raise "never called"
    end
  end
end

FOO = Foo.new

t = Time.now
100000.times do
  FOO.refined
end
if Time.now - t > 0.007
  puts "slow"
  exit 1
else
  puts "fast"
  exit 0
end
$ rubyfarm-bisect -g 537a1cd5a97a8c5e93b64851abaab42812506f66 -b 546730b76b41b142240891cd1bbd7df7990d5239 -t
(snip)
b9007b6c548f91e88fd3f2ffa23de740431fa969 is the first bad commit
commit b9007b6c548f91e88fd3f2ffa23de740431fa969
Author: Koichi Sasada <ko1@atdot.net>
Date: Wed Jan 8 16:14:01 2020 +0900
Introduce disposable call-cache.
This patch contains several ideas:
(1) Disposable inline method cache (IMC) for race-free inline method cache
* Making call-cache (CC) as a RVALUE (GC target object) and allocate new
CC on cache miss.
* This technique allows race-free access from parallel processing
elements like RCU.
(2) Introduce per-Class method cache (pCMC)
* Instead of fixed-size global method cache (GMC), pCMC allows flexible
cache size.
* Caching CCs reduces CC allocation and allow sharing CC's fast-path
between same call-info (CI) call-sites.
(3) Invalidate an inline method cache by invalidating corresponding method
entries (MEs)
* Instead of using class serials, we set "invalidated" flag for method
entry itself to represent cache invalidation.
* Compare with using class serials, the impact of method modification
(add/overwrite/delete) is small.
* Updating class serials invalidate all method caches of the class and
sub-classes.
* Proposed approach only invalidate the method cache of only one ME.
See [Feature #16614] for more details.
class.c | 45 +-
common.mk | 1 +
compile.c | 38 +-
debug_counter.h | 92 ++--
eval.c | 2 +-
ext/objspace/objspace.c | 1 +
gc.c | 204 +++++++-
id_table.c | 2 +-
insns.def | 13 +-
internal/class.h | 2 +
internal/imemo.h | 4 +
internal/vm.h | 41 +-
iseq.c | 17 +
method.h | 11 +-
mjit.c | 19 +-
mjit.h | 29 ++
mjit_compile.c | 42 +-
mjit_worker.c | 30 +-
test/-ext-/tracepoint/test_tracepoint.rb | 12 +-
test/ruby/test_gc.rb | 3 +
test/ruby/test_inlinecache.rb | 64 +++
tool/mk_call_iseq_optimized.rb | 2 +-
tool/ruby_vm/views/_mjit_compile_send.erb | 23 +-
tool/ruby_vm/views/mjit_compile.inc.erb | 2 +-
vm.c | 26 +-
vm_callinfo.h | 235 ++++++++-
vm_core.h | 3 +-
vm_dump.c | 4 +-
vm_eval.c | 50 +-
vm_insnhelper.c | 814 ++++++++++++++++--------------
vm_insnhelper.h | 15 +-
vm_method.c | 630 ++++++++++++++---------
32 files changed, 1606 insertions(+), 870 deletions(-)
create mode 100644 test/ruby/test_inlinecache.rb
bisect run success
Updated by palkan (Vladimir Dementyev) over 1 year ago
Checked Ruby 3.2: it's better than Ruby 3.1, but the slowdown is still noticeable:
Comparison:
#symbolize_keys original: 4362026.0 i/s
#symbolize_keys refined: 3999755.6 i/s - 1.09x slower
Comparison:
no-op original: 80631273.5 i/s
no-op refined: 29181553.6 i/s - 2.76x slower
Updated by Eregon (Benoit Daloze) over 1 year ago
Interesting, maybe this is partly the cause for the 13 seconds mentioned in https://shopify.engineering/the-case-against-monkey-patching.
Maybe we should deprecate refinements: my main concern about them is it's a ton of complexity in Ruby implementations (in method lookup and even worse in super method lookup) and it's not really used much or gains much, so it does not seem worth it. It does make every uncached/cache-filling method lookup slower.
It shouldn't make cached method lookup slower though when no refinements are active in that scope, at least it does not currently in TruffleRuby.
For context-independent JITed code there is no clear solution when using refinements, i.e., it might just be uncached lookups (i.e., several hashmap lookups) for calls affected by refinements (calls after using), which is terrible performance-wise. Maybe there is a complex solution to fix it, but I'm not sure it's worth the complexity, and it of course further increases the maintenance cost of refinements.
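To make the scope-dependence concrete, here is a minimal sketch (unrelated to the benchmark above, module and method names invented for illustration): the same method name resolves differently depending on whether using appears in the surrounding lexical scope, which is exactly the per-call-site information a cache keyed only on the receiver's class cannot capture.

module Shout
  refine String do
    def shout = upcase + "!"
  end
end

# The refinement is defined but not activated in this scope,
# so the call falls back to the ordinary lookup and fails.
"hi".shout rescue puts "not refined here"

using Shout

# Call sites appearing after `using` in the same lexical scope
# resolve through the refinement.
puts "hi".shout  # => "HI!"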
Updated by palkan (Vladimir Dementyev) over 1 year ago
Eregon (Benoit Daloze) wrote in #note-3:
Interesting, maybe this is partly the cause for the 13 seconds mentioned in https://shopify.engineering/the-case-against-monkey-patching.
As far as I understand, this line is responsible for "the 13 seconds" boot time overhead: https://github.com/ruby/ruby/blob/master/eval.c#L1342 (I was able to achieve similar results by adding tons of using to the source code). The line was added in the same commit as mentioned above, so the root cause is the same for sure.
Maybe we should deprecate refinements: my main concern about them is it's a ton of complexity in Ruby implementations (in method lookup and even worse in super method lookup) and it's not really used much or gains much, so it does not seem worth it. It does make every uncached/cache-filling method lookup slower.
It shouldn't make cached method lookup slower though when no refinements are active in that scope, at least it does not currently in TruffleRuby.
But that wasn't the case in Ruby < 3.0: refined methods were fast (because they were cached) and, just as today, had no effect on calls to the original methods. I believe something simply got lost with the introduction of disposable call caches.
The problem with "it's not really used much or gains much" is that it took many years for refinements to reach some sort of stability (the 2.6 and 2.7 releases were pretty good), and developers were finally convinced not to be afraid of the feature's "experimental" reputation; with Ruby 3 we took a huge step back. Especially now, when Shopify engineers claim that "refinements are really slow!" (which is only partially true), we should certainly expect the community to turn its back on refinements again.
It's a two-edged sword 🤷♂️
Updated by Eregon (Benoit Daloze) over 1 year ago
palkan (Vladimir Dementyev) wrote in #note-4:
As far as I understand, this line is responsible for "the 13 seconds" boot time overhead: https://github.com/ruby/ruby/blob/master/eval.c#L1342 (I was able to achieve similar results by adding tons of using to the source code). The line was added in the same commit as mentioned above, so the root cause is the same for sure.
Oh yeah, good find, clearing all method lookup caches on every call to using with ObjectSpace.each_object is terrible for performance. It likely throws away all JITed code too.
A bit like extend used to blow up every method lookup inline cache a long time ago.
I guess we need to wait for @ko1 (Koichi Sasada) to fix that.
This is probably fixable but might require some changes to the overall method lookup design, which is then quite involved (from looking at similar things in TruffleRuby).
My concern above is mainly the whole complexity of refinements in Ruby implementations, and some parts of the semantics are/were unsound (e.g. super in refinements can cause infinite loops; this is being addressed by preventing include/prepend in a refinement and replacing them with import_methods, a welcome simplification).
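For reference, the import_methods replacement mentioned above looks roughly like this (a sketch with invented module names; Refinement#import_methods is available from Ruby 3.1 and copies only methods defined in Ruby):

module StringUtils
  def shout = upcase + "!"
end

module StringRefinements
  refine String do
    # Ruby 3.1+: copy the (Ruby-defined) methods into the refinement
    # instead of using include/prepend inside the refine block.
    import_methods StringUtils
  end
end

using StringRefinements
puts "hello".shout  # => "HELLO!"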
Updated by Eregon (Benoit Daloze) over 1 year ago
Potential fix: https://github.com/ruby/ruby/pull/4323
Updated by alanwu (Alan Wu) over 1 year ago
Yes, the perf regression is unfortunate.
The current per-call-site cache (CC) in the interpreter is keyed on [receiver_class, call_target_method_entry_validity] -- the lexical scope that is used to resolve refinement definitions for each call site is not cached. The way the system resolves calls to refinement definitions is through a special type of method entry (VM_METHOD_TYPE_REFINED) that triggers a secondary uncached lookup which takes the lexical scope of the call site into account. Any refinement definition for a module/class M, even one that is never activated through Module#using, makes all calls to that method name on M do this uncached secondary lookup. This is the issue the OP showed in the no-op {original,refined} benchmark pair.
The fact that there is a full heap walk during every call to Module#using is a separate issue. If you believe what I said about the secondary lookup always happening for refined methods, and believe that the secondary lookup is always uncached, then this heap walk to invalidate all CCs should be unnecessary. This was what I noticed when I sent https://github.com/ruby/ruby/pull/4323 in 2021. The change itself is simple and we can merge it if we can get consensus that my assertions are correct.
Reworking the cache system to fix the regression for the benchmark in the OP would be much more involved.
Maybe we should deprecate refinements: my main concern about them is it's a ton of complexity in Ruby implementations ...
Their semantics definitely make the minimum required implementation complexity fairly high, and make cache design tricky. Even the current relatively simple cache design had bugs (e.g. #17806) when it first landed. I don't think they should be deprecated, though, since they've been around long enough that removing them would make many users unhappy. However, I would advise against introducing any new additional complications to their semantics (e.g. #16461, #19277). We've already seen how this kind of complexity can make it more challenging for new developments like Ractors and YJIT.
Updated by Eregon (Benoit Daloze) over 1 year ago
- Related to Bug #14083: Refinement in block calling incorrect method added
Updated by Eregon (Benoit Daloze) over 1 year ago
From https://bugs.ruby-lang.org/issues/14083#note-3 it seems at least part of the problem is that CRuby currently implements what I would call invalid usages of refinements (a different set of activated refinements over time for a given call site), and that basically means extra checks and overhead, e.g., for a method that was refined once, even if there are no refinements active in the current scope.
Updated by shugo (Shugo Maeda) over 1 year ago
Eregon (Benoit Daloze) wrote in #note-9:
From https://bugs.ruby-lang.org/issues/14083#note-3 it seems at least part of the problem is that CRuby currently implements what I would call invalid usages of refinements (a different set of activated refinements over time for a given call site), and that basically means extra checks and overhead, e.g., for a method that was refined once, even if there are no refinements active in the current scope.
If it's needed to fix the performance issue, it may be acceptable to raise exceptions in some cases (but I don't know the exact condition of such cases).
@ko1 (Koichi Sasada), @matz (Yukihiro Matsumoto) What do you think?
It's simple to prohibit using calls in blocks, but it will break backward compatibility, e.g., using in module_eval.
Updated by Eregon (Benoit Daloze) over 1 year ago
shugo (Shugo Maeda) wrote in #note-10:
It's simple to prohibit using calls in blocks, but it will break backward compatibility, e.g., using in module_eval.
It would still be possible to use using outside the module_eval (e.g. in a module body or at top-level), which seems more correct anyway, since refinements should be per constant scope/cref, not per block.
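As an illustration of that alternative (a minimal sketch with invented names), activating a refinement in a module body makes it apply to the method definitions that follow in that body, with no using call inside a block:

module Shout
  refine String do
    def shout = upcase
  end
end

module Greeter
  using Shout  # active for the rest of this module body

  # Methods defined after the `using` call see the refinement.
  def self.greet(name)
    "hello #{name}".shout
  end
end

puts Greeter.greet("world")  # => "HELLO WORLD"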
Updated by shugo (Shugo Maeda) over 1 year ago
shugo (Shugo Maeda) wrote in #note-10:
Eregon (Benoit Daloze) wrote in #note-9:
From https://bugs.ruby-lang.org/issues/14083#note-3 it seems at least part of the problem is that CRuby currently implements what I would call invalid usages of refinements (a different set of activated refinements over time for a given call site), and that basically means extra checks and overhead, e.g., for a method that was refined once, even if there are no refinements active in the current scope.
If it's needed to fix the performance issue, it may be acceptable to raise exceptions in some cases (but I don't know the exact condition of such cases).
@ko1 (Koichi Sasada), @matz (Yukihiro Matsumoto) What do you think?
It's simple to prohibit using calls in blocks, but it will break backward compatibility, e.g., using in module_eval.
@ko1 (Koichi Sasada) and I discussed it at today's developer meeting, and we concluded that there is no need to change the behavior to fix the performance regression.
@Eregon (Benoit Daloze) Could you create another issue if you would like to change the behavior?
Updated by ko1 (Koichi Sasada) over 1 year ago
- Status changed from Assigned to Closed
Applied in changeset git|cfd7729ce7a31c8b6ec5dd0e99c67b2932de4732.
use inline cache for refinements
Since Ruby 3.0, refined method invocations have been slow because resolved methods are not cached by the inline cache, due to a conservative strategy. However, using clears all caches, so it seems safe to cache resolved method entries.
This patch caches resolved method entries in the inline cache and clears all inline method caches when using is called.
fix [Bug #18572]
# without refinements
class C
  def foo = :C
end

N = 1_000_000
obj = C.new

require 'benchmark'
Benchmark.bm{|x|
  x.report{N.times{
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
  }}
}

__END__
               user     system      total        real
master     0.362859   0.002544   0.365403 (  0.365424)
modified   0.357251   0.000000   0.357251 (  0.357258)
# with refinement but without using
class C
  def foo = :C
end

module R
  refine C do
    def foo = :R
  end
end

N = 1_000_000
obj = C.new

require 'benchmark'
Benchmark.bm{|x|
  x.report{N.times{
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
  }}
}

__END__
               user     system      total        real
master     0.957182   0.000000   0.957182 (  0.957212)
modified   0.359228   0.000000   0.359228 (  0.359238)
# with using
class C
  def foo = :C
end

module R
  refine C do
    def foo = :R
  end
end

N = 1_000_000
using R
obj = C.new

require 'benchmark'
Benchmark.bm{|x|
  x.report{N.times{
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
    obj.foo; obj.foo; obj.foo; obj.foo; obj.foo;
  }}
}