Bug #4040
closedSystemStackError with Hash[*a] for Large _a_
Description
=begin
I've been hesitating over whether to file a ticket about this, so please feel free to close if I've made the wrong choice.
I often use Hash[*array.flatten] in IRB to convert arrays of arrays into hashes. Today I noticed that if the array is big enough, this would raise a SystemStackError. Puzzled, I looked deeper. I assumed I was hitting the maximum number of arguments a method's argc can hold, but realised that the minimum size of the array needed to trigger this exception differed depending on whether I used IRB or not. So, presumably this is indeed exhausting the stack...
In IRB, the following is the minimal reproduction of this problem:
Hash[*130648.times.map{ 1 }]; true
I haven't looked for the minimum value needed with ruby -e
, but the following reproduces:
ruby -e 'Hash[*1380888.times.map{ 1 }]'
I suppose this isn't technically a bug, but maybe it offers another argument for either #666 or an extension of #3131.
=end
Updated by duerst (Martin Dürst) almost 14 years ago
=begin
This bug may be related to bug #982.
=end
Updated by naruse (Yui NARUSE) over 13 years ago
- Status changed from Open to Assigned
- Assignee set to ko1 (Koichi Sasada)
Updated by nahi (Hiroshi Nakamura) over 13 years ago
- Target version set to 1.9.3
Updated by ko1 (Koichi Sasada) over 13 years ago
- Target version changed from 1.9.3 to 2.0.0
Let us pending it to next version....
Updated by ko1 (Koichi Sasada) almost 12 years ago
- Target version changed from 2.0.0 to 2.6
We need re-consideration about method invocation to support such cases.
I want to challenge at next version.
Updated by ko1 (Koichi Sasada) almost 8 years ago
Current implementation: now splatting huge parameters (and receive rest arguments) for Ruby methods are fine. However, C methods doesn't support this pattern. It should be fixed.
Updated by ko1 (Koichi Sasada) almost 8 years ago
- Related to Bug #5719: Hash::[] can't handle 100000+ args added
Updated by jeremyevans0 (Jeremy Evans) almost 2 years ago
I submitted a pull request to fix this: https://github.com/ruby/ruby/pull/6816
Updated by ko1 (Koichi Sasada) almost 2 years ago
- Status changed from Assigned to Closed
Applied in changeset git|2e7bceb34ea858649e1f975a934ce1894d1f06a6.
Do not use VM stack for splat arg on cfunc
On the cfunc methods, if a splat argument is given, all array elements
are expanded on the VM stack and it can cause SystemStackError.
The idea to avoid it is making a hidden array to contain all parameters
and use this array as an argv.
This patch is reviesed version of https://github.com/ruby/ruby/pull/6816
The main change is all changes are closed around calling cfunc logic.
Fixes [Bug #4040]
Co-authored-by: Jeremy Evans code@jeremyevans.net
Updated by jeremyevans0 (Jeremy Evans) over 1 year ago
- Status changed from Closed to Open
I'm reopening this issue.
Originally, when 2e7bceb34ea858649e1f975a934ce1894d1f06a6 fixed cfuncs to no longer use the VM stack for large array splats, it was thought to have fully fixed this issue, since the issue was fixed for methods defined in Ruby (iseqs) back in Ruby 2.2.
After additional research, I determined that the same issue affects almost all types of method calls, not just iseq and cfunc calls. There were two main types of remaining issues, important cases (where large array splat should work) and pedantic cases (where large array splat raised SystemStackError instead of ArgumentError).
Important cases:
# bmethod
define_method(:a){|*a|}
a(*1380888.times)
# send
def b(*a); end
send(:b, *1380888.times)
# symproc
:b.to_proc.call(self, *1380888.times)
# method to proc
def d; yield(*1380888.times) end
d(&method(:b))
# method_missing
def self.method_missing(*a); end
not_a_method(*1380888.times)
# Also affects use of C-API with a large number of arguments
Pedantic cases:
# iseq with only required or optional arguments
def a; end
a(*1380888.times)
def b(_); end
b(*1380888.times)
def c(_=nil); end
c(*1380888.times)
# attr reader/writer
c = Class.new do
attr_accessor :a
alias b a=
end.new
c.a(*1380888.times)
c.b(*1380888.times)
# Struct aref/aset
c = Struct.new(:a) do
alias b a=
end.new
c.a(*1380888.times)
c.b(*1380888.times)
I have submitted a pull request to fix the issue: https://github.com/ruby/ruby/pull/7522. It's fairly invasive, since the use of the VM stack for handling arguments occurs in multiple different places in the VM, and all the different method call handlers need to be updated to support it. This has a runtime cost, but I'm not sure exactly how much. I would appreciate if someone could do some benchmarking with it. We'll need to decide if fixing this issue is worth the potential performance hit (or someone could potentially optimize the approach I used).
Updated by Eregon (Benoit Daloze) over 1 year ago
My opinion as a TruffleRuby implementer is I don't think TruffleRuby will or can support that.
So at least it should be clear that's not something really supported by the Ruby language in general.
User code shouldn't pass a million arguments, if it's say > 20 arguments or so one should pass an array, splatting on both sides is silly, it just creates extra copies and is slower.
It's a Ruby anti-pattern.
So IMO there is no point to fix or support this, such code doesn't make sense and changing the user code is the best fix, for performance and simplicity.
Updated by Eregon (Benoit Daloze) over 1 year ago
In other words, for efficiency arguments need to be passed on the stack, e.g. for JITed code.
Sure it could be possible to have special handling for "too many arguments" and do something else, but that means making Ruby slower in general, which seems clearly not worth for degenerate cases such as this one.
Updated by Eregon (Benoit Daloze) over 1 year ago
Regarding the original report, there is a clear solution now with to_h
:
1000.times.each_slice(2).to_a.to_h
# or
1000.times.each_slice(2).to_h
Updated by jeremyevans0 (Jeremy Evans) over 1 year ago
Considering we fixed this issue for iseq and cfunc methods, probably the main reason to reject fixing this issue for the remaining method types is that doing so potentially lowers performance. To try to counteract this, I've updated my pull request to include an optimization for bmethods that improves performance 38-59% for simple method calls, and up to 180% for method calls with keywords. Hopefully this increase in bmethod performance more than cancels out any performance decreases caused by the large argument splat handling.
Updated by jeremyevans0 (Jeremy Evans) over 1 year ago
I've updated my pull request to include additional optimizations for:
- cfunc: 10-15* for f(*a) and 35-40% for f(*a, **kw) if kw is empty
- send: 5-115% depending on type of call
- symproc: 5-100% depending of type of call
- method_missing: 10-115% depending on type of call
The cfunc optimization works by copying the array contents to the stack instead of using CALLER_SETUP_ARG.
The send, symproc, and method_missing optimizations are achieved by avoiding unnecessary use of CALLER_SETUP_ARG.
Hopefully these additional optimizations help offset any performance decrease from the additional checks needed to fix this issue.
Updated by ko1 (Koichi Sasada) over 1 year ago
Quote from devmeeting agenda https://bugs.ruby-lang.org/issues/19525:
The fix results in a minor performance decrease in microbenchmarks.
Could you show more details (results)?
Do you have an analysis which line(s) makes slower?
I don't think this feature should be rejected. It is cool to support this feature (long splat can be accepted by rest argument).
However, personally speaking I feel the proposed patch (https://github.com/ruby/ruby/pull/7522) is too complex for future maintenance comparing with the benefits from the patch.
Now I have no time to review the patch closely and I couldn't confirm this patch has such issue.
So I agree to merge it (and rewrite them if they can be more improved) because it is well tested.
I think it is better to have benchmark measurements on some benchmarks, though.
General comments:
- Some code are duplicated so maybe they can be more shorter.
- (optimization) It is not sure how the optimization target cases are there (optimizations for the minor cases can introduce issues such as i-cache miss, difficulty on future maintenance and so on).
Updated by jeremyevans0 (Jeremy Evans) over 1 year ago
ko1 (Koichi Sasada) wrote in #note-17:
Quote from devmeeting agenda https://bugs.ruby-lang.org/issues/19525:
The fix results in a minor performance decrease in microbenchmarks.
Could you show more details (results)?
In terms of existing benchmarks:
- For app_fib benchmark about 1-3% decrease.
- vm_send and vm_send_var benchmark improves 3-5% due to the send optimization.
I'll try to do some more benchmarking tomorrow and report back. Is there a decent real world benchmark in benchmarks
I can use? With the patch set, some microbenchmarks are slower, but some cases I optimized (bmethod/send/symproc/method_missing) are much faster (over 2x). A real world benchmark would be more useful to determine the actual performance differences.
I do most of my development on OpenBSD, which is a bit suboptimal for benchmarking small differences in performance in my experience (possibly due to the additional randomization).
If someone could run yjit-bench on the pull request branch (in interpreter mode), that would be very helpful.
Do you have an analysis which line(s) makes slower?
Unfortunately, I don't. My guess would be it is due to the additional branches in CALLER_SETUP_ARG
and checking for calling->heap_argv
.
I don't think this feature should be rejected. It is cool to support this feature (long splat can be accepted by rest argument).
However, personally speaking I feel the proposed patch (https://github.com/ruby/ruby/pull/7522) is too complex for future maintenance comparing with the benefits from the patch.
Agreed. I wish the patch could be made simpler, but I think most of the complexity of the patch is necessary if we want to fix the bug.
Now I have no time to review the patch closely and I couldn't confirm this patch has such issue.
So I agree to merge it (and rewrite them if they can be more improved) because it is well tested.
OK. Before it is merged, the yjit team needs to make the necessary changes to yjit to support it. Alternatively, they could temporarily disable parts of yjit this breaks, but from talking to @alanwu (Alan Wu), that could result in temporarily disabling a lot of yjit. I think it would be preferable to fix yjit before this is merged.
I think it is better to have benchmark measurements on some benchmarks, though.
I added some benchmarks related to the optimizations I added, and basic results of those benchmarks in the in related commits. As mentioned above, I'll try to do additional benchmarking and report back.
General comments:
- Some code are duplicated so maybe they can be more shorter.
OK. I will review and see if I can eliminate the redundant code.
- (optimization) It is not sure how the optimization target cases are there (optimizations for the minor cases can introduce issues such as i-cache miss, difficulty on future maintenance and so on).
The bmethod/send/symproc/method_missing optimizations are all very large in certain cases and should definitely be included.
The cfunc optimizations are limited to specific cases (*args
or *args, **kw
with empty kw
) and not as large even in those cases (10-15% for *args
, 35-40% for *args, **kw
with empty kw
). I'm guessing they are still a net performance improvement, though.
Updated by ko1 (Koichi Sasada) over 1 year ago
jeremyevans0 (Jeremy Evans) wrote in #note-18:
The bmethod/send/symproc/method_missing optimizations are all very large in certain cases and should definitely be included.
I mean how much "certain cases" are there in apps. Debug counter feature in debug_counter.[ch]
will help to confirm such statistics.
Updated by jeremyevans0 (Jeremy Evans) over 1 year ago
I rebased my branch against master, and then ran all of the app_*
benchmarks, here are the results:
app_aobench: +1%
app_erb: 0%
app_factorial: 0%
app_fib: +5%
app_lc_fizzbuzz: 0%
app_mandelbrot: 0%
app_pentomino: -1%
app_raise: 0%
app_strconcat: +8%
app_tak: +4%
app_tarai: +3%
app_uri: -2%
For most of the benchmarks, I ran with --repeat-count 10 --repeat-result best
(some take a long time and I only ran with 1 or 3 instead of 10).
So from this benchmarking, only app_pentomino
and app_uri
are slower, by 1-2%. 5 benchmarks are faster, by up to 8%. 5 benchmarks did not show any performance differences.
app_fib
is showing up 5% faster now, when it was previously showing 1-3% slower. To make sure this wasn't an anomaly, I ran with --repeat-count 25
, and still got the same results. Again, this could just be due to my environment (OpenBSD), as I cannot think of a reason why app_fib
would be faster with the changes.
All of these benchmarks are more of the microbenchmark nature. More realistic benchmarks such as yjit-bench on Linux would be better for testing actual differences in performance.
Updated by jeremyevans0 (Jeremy Evans) over 1 year ago
I ran yjit-bench with both the master branch and the PR branch. Here are the results:
Total time spent benchmarking: 5172s
master: ruby 3.3.0dev (2023-04-14T03:43:46Z master 3733ee835b) [x86_64-openbsd7.3]
heap_argv: ruby 3.3.0dev (2023-04-15T06:35:36Z large-array-splat-.. a0eb73211c) [x86_64-openbsd7.3]
-------------- ----------- ---------- -------------- ---------- ---------------- -----------------
bench master (ms) stddev (%) heap_argv (ms) stddev (%) master/heap_argv heap_argv 1st itr
activerecord 150.8 2.4 150.9 2.1 1.00 0.97
erubi_rails 52.6 7.9 52.9 8.1 1.00 1.05
hexapdf 6996.9 1.0 6925.5 0.6 1.01 1.11
liquid-c 177.5 2.0 175.3 1.5 1.01 1.03
liquid-compile 165.6 2.8 165.5 2.0 1.00 1.01
liquid-render 372.2 0.6 374.9 1.6 0.99 1.01
mail 389.6 0.7 394.0 2.0 0.99 1.01
psych-load 6431.2 0.2 6356.4 0.3 1.01 1.01
railsbench 4654.3 0.3 4696.0 0.6 0.99 0.99
ruby-lsp 159.6 6.0 155.6 5.8 1.03 1.05
sequel 215.0 2.6 214.8 0.9 1.00 1.00
binarytrees 840.2 0.3 840.2 0.9 1.00 0.99
chunky_png 2710.0 0.2 2739.4 0.4 0.99 0.98
erubi 732.7 1.6 726.9 1.1 1.01 1.02
etanni 984.5 1.6 974.1 0.5 1.01 1.01
fannkuchredux 4282.9 0.2 4334.9 0.2 0.99 0.99
lee 3625.8 0.4 3594.9 0.3 1.01 1.01
nbody 183.7 0.9 178.7 0.2 1.03 1.03
optcarrot 9673.8 1.0 9626.9 0.9 1.00 1.01
ruby-json 9889.0 0.1 9848.9 0.4 1.00 1.01
rubykon 23063.9 0.5 22953.8 0.3 1.00 1.00
30k_ifelse 3829.2 0.5 3824.4 1.0 1.00 0.99
30k_methods 7761.7 0.2 7665.6 0.2 1.01 1.01
cfunc_itself 327.7 0.3 326.6 0.5 1.00 1.00
fib 466.2 0.3 469.2 0.6 0.99 1.00
getivar 221.7 0.6 222.1 0.3 1.00 1.00
keyword_args 652.6 0.2 653.2 0.3 1.00 1.00
respond_to 893.4 0.2 909.5 0.1 0.98 0.98
setivar 148.3 0.3 143.5 0.2 1.03 1.01
setivar_object 295.0 0.5 291.6 0.5 1.01 1.01
setivar_young 295.1 0.4 291.6 0.6 1.01 1.01
str_concat 231.9 3.2 211.0 2.3 1.10 1.07
throw 40.2 1.3 41.6 9.2 0.97 0.98
-------------- ----------- ---------- -------------- ---------- ---------------- -----------------
Legend:
- master/heap_argv: ratio of master/heap_argv time. Higher is better for heap_argv. Above 1 represents a speedup.
- heap_argv 1st itr: ratio of master/heap_argv time for the first benchmarking iteration.
So it looks like it is slower on 8 benchmarks (6 1% slower, 1 2% slower, 1 3% slower), and faster on 13 benchmarks (9 1% faster, 3 3% faster, 1 10% faster). So on the whole, it looks like a net performance increase.
It would be good to get benchmark results from Linux, so if someone could contribute that, I would appreciate it.
Updated by k0kubun (Takashi Kokubun) over 1 year ago
It would be good to get benchmark results from Linux, so if someone could contribute that, I would appreciate it.
I have a x86_64-linux environment with CPU frequency scaling disabled, so I benchmarked your PR with yjit-bench. I used --category headline
because other benchmarks are less important in practice.
before: ruby 3.3.0dev (2023-04-14T03:43:46Z master 3733ee835b) [x86_64-linux]
after: ruby 3.3.0dev (2023-04-15T06:35:36Z large-array-splat-.. a0eb73211c) [x86_64-linux]
-------------- ----------- ---------- ---------- ---------- ------------ -------------
bench before (ms) stddev (%) after (ms) stddev (%) before/after after 1st itr
activerecord 65.6 0.4 65.8 0.3 1.00 0.99
erubi_rails 18.7 1.3 18.7 12.4 1.00 1.02
hexapdf 2164.2 0.4 2181.5 0.8 0.99 0.98
liquid-c 57.0 1.6 57.0 1.9 1.00 0.99
liquid-compile 52.6 0.6 52.4 0.6 1.00 1.00
liquid-render 139.8 1.3 140.2 1.1 1.00 1.00
mail 118.1 0.1 118.7 0.2 1.00 1.00
psych-load 1707.5 0.2 1750.9 0.1 0.98 0.97
railsbench 1907.4 0.8 1929.4 0.8 0.99 0.99
ruby-lsp 59.5 10.3 59.4 11.9 1.00 0.98
sequel 65.6 0.2 65.6 0.2 1.00 1.00
-------------- ----------- ---------- ---------- ---------- ------------ -------------
I tried running them a few times. In before/after
, psych-load is stably 2% slower. hexapdf and railsbench show a 1% slowdown, which may be insignificant. Other benchmarks seem to have no difference.
Updated by Eregon (Benoit Daloze) over 1 year ago
I rebased my branch against master, and then ran all of the app_* benchmarks, here are the results:
Are the +N% there improvements or regressions? From those numbers it sounds like +
would be regressions (i.e., more time to execute the same thing).
I am thinking a bit more about the implications of this for Ruby implementations and JITs.
Only passing on the stack means not allowed to pass a huge number of arguments (the case on TruffleRuby).
Only passing as a heap array seems inefficient in general (would cause extra allocations, at least in interpreter, for foo(1, 2)
).
I guess one could use 2 different calling conventions, on stack if no rest parameter, on heap if there is a rest parameter. But more calling conventions is a clear cost as it causes extra checks for every call, even more so for polymorphic call site (+ it's messy to do callee-specific logic in the caller).
If supporting to pass both arguments on the stack or in a heap array, then the called method (the callee) will most likely need to branch and find out from where to read arguments.
It seems always an anti-pattern to have the callee need to deal with two calling conventions.
That may actually be easier to deal with in C because a VALUE*
pointer can represent both, then it would be one check on method entry for which pointer and size to use.
In Java, if passing arguments as an Object[] and having hidden arguments at the start of the array, there is no way to share the logic with a Ruby Array from the heap, or it would need some offset for every argument access, which seems very expensive.
I suppose one could technically compile 2 variants of a method, one for on stack and one for heap array, but it seems very expensive from a warmup and memory perspective, and it's again costing more calling conventions.
Also when using array storage strategies, the array might be int[] behind the scenes and then passing it as a single argument vs a splat is so so so much faster.
Basically, I think efficient Ruby implementations and JITs might not want to deal with the complexity of on-heap arguments.
Such usage pattern is intrinsically inefficient.
For example m(:name, *array)
is quite expensive if array is big, m(:name, array)
is strictly better from a performance POV.
m(*array)
can at best be as fast as m(array)
, but can be much worse, e.g. if passed on stack (and < 128 for your PR) or if array
is a int[]
.
Of course CRuby devs will decide what they want here.
The real issue is, if CRuby accepts this:
- There is probably no hope to ever revert that decision and to remove those costs, because some code will likely start to depend on it.
- It might encourage Ruby users to abuse splats more since they seem not much slower than non-splat on CRuby and they don't trigger SystemStackError.
Updated by jeremyevans0 (Jeremy Evans) over 1 year ago
Eregon (Benoit Daloze) wrote in #note-23:
I rebased my branch against master, and then ran all of the app_* benchmarks, here are the results:
Are the +N% there improvements or regressions? From those numbers it sounds like
+
would be regressions (i.e., more time to execute the same thing).
+N% is an improvement in iterations per second, -N% is a decrease in iterations per second.
I am thinking a bit more about the implications of this for Ruby implementations and JITs.
Only passing on the stack means not allowed to pass a huge number of arguments (the case on TruffleRuby).
Only passing as a heap array seems inefficient in general (would cause extra allocations, at least in interpreter, forfoo(1, 2)
).
I guess one could use 2 different calling conventions, on stack if no rest parameter, on heap if there is a rest parameter. But more calling conventions is a clear cost as it causes extra checks for every call, even more so for polymorphic call site (+ it's messy to do callee-specific logic in the caller).
For CRuby, a heap array is used for large array splats (configurable, but currently 129+ elements). Smaller array splats use the VM stack. A heap array is only used for method calls with argument splat, never for other method calls (even if you pass 129+ arguments). On CRuby, there is a minor cost for checking for whether a heap_allocated array was used.
- There is probably no hope to ever revert that decision and to remove those costs, because some code will likely start to depend on it.
@ko1 (Koichi Sasada) has already told me that the heap_argv part of the patch will be reverted if it becomes an significant obstacle to future CRuby optimization work. We'll keep the other optimizations in the pull request in that case.
- It might encourage Ruby users to abuse splats more since they seem not much slower than non-splat on CRuby and they don't trigger SystemStackError.
This is incorrect. Passing arrays via splats is always slower than passing arrays as positional arguments. The pull request makes passing large arrays via splats not trigger SystemStackError. For arrays with 129+ elements, the pull request actually slows down such calls by using a temporary array instead of passing the elements on the VM stack. So if anything, the patch encourages users not to pass large arrays as splats, as doing so is even worse for performance than before.
We may want to consider adding a performance warning for passing large arrays via splats.
Updated by Eregon (Benoit Daloze) over 1 year ago
jeremyevans0 (Jeremy Evans) wrote in #note-24:
So if anything, the patch encourages users not to pass large arrays as splats, as doing so is even worse for performance than before.
Thank you for the precision. Then at least it's not going to encourage large splats due to changed performance, good.
Of course it remains a concern that users/code start to depend on this i.e., on no SystemStackError in calls with a large splat, and then changing that back would be a breaking change.
But maybe it's rare and stays rare enough that it might be possible to change it again (e.g. if valuable enough for performance), unsure.
We may want to consider adding a performance warning for passing large arrays via splats.
That's a good idea.
Updated by jeremyevans (Jeremy Evans) over 1 year ago
- Status changed from Open to Closed
Applied in changeset git|99c6d19e502b5fdadbd367ae4b6bb3fab850fddc.
Generalize cfunc large array splat fix to fix many additional cases raising SystemStackError
Originally, when 2e7bceb34ea858649e1f975a934ce1894d1f06a6 fixed cfuncs to no
longer use the VM stack for large array splats, it was thought to have fully
fixed Bug #4040, since the issue was fixed for methods defined in Ruby (iseqs)
back in Ruby 2.2.
After additional research, I determined that same issue affects almost all
types of method calls, not just iseq and cfunc calls. There were two main
types of remaining issues, important cases (where large array splat should
work) and pedantic cases (where large array splat raised SystemStackError
instead of ArgumentError).
Important cases:
define_method(:a){|*a|}
a(*1380888.times)
def b(*a); end
send(:b, *1380888.times)
:b.to_proc.call(self, *1380888.times)
def d; yield(*1380888.times) end
d(&method(:b))
def self.method_missing(*a); end
not_a_method(*1380888.times)
Pedantic cases:
def a; end
a(*1380888.times)
def b(_); end
b(*1380888.times)
def c(_=nil); end
c(*1380888.times)
c = Class.new do
attr_accessor :a
alias b a=
end.new
c.a(*1380888.times)
c.b(*1380888.times)
c = Struct.new(:a) do
alias b a=
end.new
c.a(*1380888.times)
c.b(*1380888.times)
This patch fixes all usage of CALLER_SETUP_ARG with splatting a large
number of arguments, and required similar fixes to use a temporary
hidden array in three other cases where the VM would use the VM stack
for handling a large number of arguments. However, it is possible
there may be additional cases where splatting a large number
of arguments still causes a SystemStackError.
This has a measurable performance impact, as it requires additional
checks for a large number of arguments in many additional cases.
This change is fairly invasive, as there were many different VM
functions that needed to be modified to support this. To avoid
too much API change, I modified struct rb_calling_info to add a
heap_argv member for storing the array, so I would not have to
thread it through many functions. This struct is always stack
allocated, which helps ensure sure GC doesn't collect it early.
Because of how invasive the changes are, and how rarely large
arrays are actually splatted in Ruby code, the existing test/spec
suites are not great at testing for correct behavior. To try to
find and fix all issues, I tested this in CI with
VM_ARGC_STACK_MAX to -1, ensuring that a temporary array is used
for all array splat method calls. This was very helpful in
finding breaking cases, especially ones involving flagged keyword
hashes.
Fixes [Bug #4040]
Co-authored-by: Jimmy Miller jimmy.miller@shopify.com
Updated by tenderlovemaking (Aaron Patterson) about 1 year ago
Hi,
We (on the YJIT team) have been tracking Ruby performance. We usually look at YJIT performance as compared to the interpreter, but recently we started looking at interpreter performance (it's getting later in the year so we want to make sure things look good for the release). We noticed there was a degradation in interpreter performance in April, and I tracked the issue to this commit.
We were specifically looking at the chunky png benchmark. If you check this graph then click "Time" on the bottom, you can see the increase in time in April.
Running perf stat
on the benchmark shows this commit increases the executed instructions from 232,171,088,239 to 244,008,191,171.
Here is the perf stat
results I got for e7cdce83e8:
Performance counter stats for 'ruby -v benchmark.rb':
18,995.65 msec task-clock # 1.000 CPUs utilized
68 context-switches # 3.580 /sec
1 cpu-migrations # 0.053 /sec
33,003 page-faults # 1.737 K/sec
82,022,881,262 cycles # 4.318 GHz (83.32%)
449,576,108 stalled-cycles-frontend # 0.55% frontend cycles idle (83.32%)
21,629,264,292 stalled-cycles-backend # 26.37% backend cycles idle (83.33%)
232,171,088,239 instructions # 2.83 insn per cycle
# 0.09 stalled cycles per insn (83.35%)
40,092,228,662 branches # 2.111 G/sec (83.37%)
76,492,242 branch-misses # 0.19% of all branches (83.34%)
18.997823376 seconds time elapsed
18.904486000 seconds user
0.091909000 seconds sys
vs perf stat
for 99c6d19e50
Performance counter stats for 'ruby -v benchmark.rb':
19,415.14 msec task-clock # 1.000 CPUs utilized
66 context-switches # 3.399 /sec
2 cpu-migrations # 0.103 /sec
38,513 page-faults # 1.984 K/sec
82,876,137,933 cycles # 4.269 GHz (83.32%)
577,427,117 stalled-cycles-frontend # 0.70% frontend cycles idle (83.33%)
20,414,833,187 stalled-cycles-backend # 24.63% backend cycles idle (83.35%)
244,008,191,171 instructions # 2.94 insn per cycle
# 0.08 stalled cycles per insn (83.37%)
41,385,015,387 branches # 2.132 G/sec (83.35%)
76,702,379 branch-misses # 0.19% of all branches (83.32%)
19.417447024 seconds time elapsed
19.346844000 seconds user
0.068922000 seconds sys
It's cool we could fix a 13 year old bug, but given the "rarity" of this issue I'm not sure it's worth the slowdown? 😅
I will try to find some way to speed this up, but I'm not sure if I'll have time. In case someone has time to look, the way to reproduce this is by checking out yjit-bench then running the chunky-png benchmark like this:
perf stat ruby -v benchmark.rb