Project

General

Profile

Actions

Bug #20710

closed

Reducing Hash allocation introduces large performance degradation (probably related to VWA)

Added by pocke (Masataka Kuwabara) 4 months ago. Updated 4 months ago.

Status:
Closed
Assignee:
-
Target version:
-
ruby -v:
ruby 3.3.4 (2024-07-09 revision be1089c8ec) [arm64-darwin21]
[ruby-core:119000]

Description

I found a surprising performance degradation while developing RBS.
In short, I tried to remove unnecessary Hash allocations for RBS. Then, it made the execution time 2x slower.

VWA for Hash probably causes this degradation. I'd be happy if we could mitigate the impact by updating the memory management strategy.

Reproduce

You can reproduce this problem on a PR in pocke/rbs repository.
https://github.com/pocke/rbs/pull/2
This PR dedups empty Hash objects.

  1. git clone and checkout
  2. bundle install
  3. bundle exec rake compile for C-ext
  4. bundle ruby benchmark/benchmark_new_env.rb

The "before" commit is https://github.com/pocke/rbs/commit/2c356c060286429cfdb034f88a74a6f94420fd21.
The "after" commit is https://github.com/pocke/rbs/commit/bfb2c367c7d3b7f93720392252d3a3980d7bf335.

The benchmark results are the following:

# Before
$ bundle exec ruby benchmark/benchmark_new_env.rb
(snip)
             new_env      6.426 (±15.6%) i/s -     64.000 in  10.125442s
       new_rails_env      0.968 (± 0.0%) i/s -     10.000 in  10.355738s

# After
$ bundle exec ruby benchmark/benchmark_new_env.rb
(snip)
             new_env      4.371 (±22.9%) i/s -     43.000 in  10.150192s
       new_rails_env      0.360 (± 0.0%) i/s -      4.000 in  11.313158s

The IPS decreased 1.47x for new_env case (parsing small RBS env), and 2.69x for new_rails_env (parsing large RBS env).

Investigation

GC.stat

GC.stat indicates the number of minor GCs increases.

# In the RBS repository
require_relative './benchmark/utils'

tmpdir = prepare_collection!
new_rails_env(tmpdir)
pp GC.stat
# before
{:count=>126,
 :time=>541,
 :marking_time=>496,
 :sweeping_time=>45,
 :heap_allocated_pages=>702,
 :heap_sorted_length=>984,
 :heap_allocatable_pages=>282,
 :heap_available_slots=>793270,
 :heap_live_slots=>787407,
 :heap_free_slots=>5863,
 :heap_final_slots=>0,
 :heap_marked_slots=>757744,
 :heap_eden_pages=>702,
 :heap_tomb_pages=>0,
 :total_allocated_pages=>702,
 :total_freed_pages=>0,
 :total_allocated_objects=>2220605,
 :total_freed_objects=>1433198,
 :malloc_increase_bytes=>5872,
 :malloc_increase_bytes_limit=>16777216,
 :minor_gc_count=>112,
 :major_gc_count=>14,
 :compact_count=>0,
 :read_barrier_faults=>0,
 :total_moved_objects=>0,
 :remembered_wb_unprotected_objects=>0,
 :remembered_wb_unprotected_objects_limit=>4779,
 :old_objects=>615704,
 :old_objects_limit=>955872,
 :oldmalloc_increase_bytes=>210912,
 :oldmalloc_increase_bytes_limit=>16777216}

# after
{:count=>255,
 :time=>1551,
 :marking_time=>1496,
 :sweeping_time=>55,
 :heap_allocated_pages=>570,
 :heap_sorted_length=>1038,
 :heap_allocatable_pages=>468,
 :heap_available_slots=>735520,
 :heap_live_slots=>731712,
 :heap_free_slots=>3808,
 :heap_final_slots=>0,
 :heap_marked_slots=>728727,
 :heap_eden_pages=>570,
 :heap_tomb_pages=>0,
 :total_allocated_pages=>570,
 :total_freed_pages=>0,
 :total_allocated_objects=>2183278,
 :total_freed_objects=>1451566,
 :malloc_increase_bytes=>1200,
 :malloc_increase_bytes_limit=>16777216,
 :minor_gc_count=>242,
 :major_gc_count=>13,
 :compact_count=>0,
 :read_barrier_faults=>0,
 :total_moved_objects=>0,
 :remembered_wb_unprotected_objects=>0,
 :remembered_wb_unprotected_objects_limit=>5915,
 :old_objects=>600594,
 :old_objects_limit=>1183070,
 :oldmalloc_increase_bytes=>8128,
 :oldmalloc_increase_bytes_limit=>16777216}

Warming up Hashes

The following patch, which creates unnecessary Hash objects before the benchmark, improves the execution time.

diff --git a/benchmark/benchmark_new_env.rb b/benchmark/benchmark_new_env.rb
index 6dd2b73f..a8da61c6 100644
--- a/benchmark/benchmark_new_env.rb
+++ b/benchmark/benchmark_new_env.rb
@@ -4,6 +4,8 @@ require 'benchmark/ips'
 
 tmpdir = prepare_collection!
 
+(0..30_000_000).map { {} }
+
 Benchmark.ips do |x|
   x.time = 10

The results are the following:

# Before
Calculating -------------------------------------
             new_env     10.354 (± 9.7%) i/s -    103.000 in  10.013834s
       new_rails_env      1.661 (± 0.0%) i/s -     17.000 in  10.282490s

# After
Calculating -------------------------------------
             new_env     10.771 (± 9.3%) i/s -    107.000 in  10.010446s
       new_rails_env      1.584 (± 0.0%) i/s -     16.000 in  10.178984s

RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO

The RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO env var also mitigates the performance impact.
In this example, I set RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO=0.6 (default: 0.20)

# Before
Calculating -------------------------------------
             new_env     10.271 (± 9.7%) i/s -    102.000 in  10.087191s
       new_rails_env      1.529 (± 0.0%) i/s -     16.000 in  10.538043s

# After
$ env RUBY_GC_HEAP_FREE_SLOTS_MIN_RATIO=0.6 bundle exec ruby benchmark/benchmark_new_env.rb
Calculating -------------------------------------
             new_env     11.003 (± 9.1%) i/s -    110.000 in  10.068428s
       new_rails_env      1.347 (± 0.0%) i/s -     14.000 in  11.117665s

Additional Information

  • I applied the same change to Array. But it does not cause this problem.
    • I guess the cause is the difference of the Size Pool. An empty Array uses 40 bytes like the ordinal Ruby object, but an empty Hash uses 160 bytes.
    • The Size Pool for 160 bytes objects has fewer objects than the 40 bytes one. So, reducing allocation affects the performance sensitively.
  • I tried it on Ruby 3.2. This change on Ruby 3.2 does not degrade the execution time.

Acknowledgement

@mame (Yusuke Endoh), @ko1 (Koichi Sasada), and @soutaro (Soutaro Matsumoto) helped the investigation. I would like to thank them.

Updated by peterzhu2118 (Peter Zhu) 4 months ago

Thank you for the detailed report.

I had a discussion about this with @mame (Yusuke Endoh) and @byroot (Jean Boussier) last year in the Ruby core slack: https://ruby.slack.com/archives/C02A3SL0S/p1702910003614609

One of the big differences is that VWA eagerly allocates the AR/ST table for the hash, rather than lazily allocate it. This means that for empty hashes, there is a performance penalty for empty hashes, but the penalty can be reclaimed when we have elements inside of the hash.

Another issue with microbenchmarks is that some of the size pools may be small since there aren't many objects, so it may cause GC to trigger more frequently. We don't see this kind of issues with macrobenchmarks since they keep more objects alive after bootup. I can see that you've observed this issue in your benchmark as well.

In the following microbenchmark, we can see that with hashes <= 8 elements (i.e. AR hashes), the performance in Ruby 3.3 is basically on-par with Ruby 3.2, but when we switch to ST tables (9 elements), we can see that VWA is significantly faster:

ruby 3.2.4 (2024-04-23 revision af471c0e01) [arm64-darwin23]
       user     system      total        real
Hash with 0 elements  2.421699   0.013686   2.435385 (  2.436581)
Hash with 1 elements  2.955857   0.029542   2.985399 (  3.014737)
Hash with 2 elements  2.891668   0.019301   2.910969 (  2.921928)
Hash with 3 elements  2.900170   0.015396   2.915566 (  2.916644)
Hash with 4 elements  2.889895   0.014969   2.904864 (  2.905188)
Hash with 5 elements  2.895059   0.017253   2.912312 (  2.912845)
Hash with 6 elements  2.869016   0.014351   2.883367 (  2.883618)
Hash with 7 elements  2.907134   0.016862   2.923996 (  2.924871)
Hash with 8 elements  2.926749   0.020445   2.947194 (  2.956753)
Hash with 9 elements 19.932546   0.551577  20.484123 ( 20.498173)


ruby 3.3.3 (2024-06-12 revision f1c7b6f435) [arm64-darwin23]
       user     system      total        real
Hash with 0 elements  2.591444   0.023060   2.614504 (  2.616658)
Hash with 1 elements  3.052488   0.030433   3.082921 (  3.102709)
Hash with 2 elements  3.064385   0.027627   3.092012 (  3.106096)
Hash with 3 elements  3.036935   0.023353   3.060288 (  3.063819)
Hash with 4 elements  3.020218   0.022274   3.042492 (  3.043182)
Hash with 5 elements  3.053680   0.025551   3.079231 (  3.083070)
Hash with 6 elements  2.991555   0.023347   3.014902 (  3.017601)
Hash with 7 elements  3.011856   0.026142   3.037998 (  3.041611)
Hash with 8 elements  3.044671   0.033276   3.077947 (  3.109949)
Hash with 9 elements 14.873814   0.400856  15.274670 ( 15.309215)
require "benchmark"

TIMES = 100_000_000

Benchmark.bm do |x|
  x.report("Hash with 0 elements") do
    TIMES.times { {} }
  end

  x.report("Hash with 1 elements") do
    TIMES.times { { a: 0 } }
  end

  x.report("Hash with 2 elements") do
    TIMES.times { { a: 0, b: 0 } }
  end

  x.report("Hash with 3 elements") do
    TIMES.times { { a: 0, b: 0, c: 0 } }
  end

  x.report("Hash with 4 elements") do
    TIMES.times { { a: 0, b: 0, c: 0, d: 0 } }
  end

  x.report("Hash with 5 elements") do
    TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0 } }
  end

  x.report("Hash with 6 elements") do
    TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0 } }
  end

  x.report("Hash with 7 elements") do
    TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0, g: 0 } }
  end

  x.report("Hash with 8 elements") do
    TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0, g: 0, h: 0 } }
  end

  x.report("Hash with 9 elements") do
    TIMES.times { { a: 0, b: 0, c: 0, d: 0, e: 0, f: 0, g: 0, h: 0, i: 0 } }
  end
end

Updated by byroot (Jean Boussier) 4 months ago

I still think free pages should be in a global pool rather than tied to a specific pool size. I believe that would solve this issue.

And yes we don't see it on macro benchmarks, but it might still cause more frequent GC than necessary.

Updated by mame (Yusuke Endoh) 4 months ago

@peterzhu2118 (Peter Zhu) This is totally different from the issue we talked about in Slack, which was entirely a micro-benchmark of the speed of generating empty hashes. This problem is much more complex.

This problem is that “reducing the number of object creation can slow down execution time”. If my investigation is correct, the mechanism of the problem is as follows.

  • If a program creates more objects, GC often occurs and the heap grows. This decreases the frequency of GC occurrence gradually and and throughput increases.
  • If a program creates fewer objects, only minor GCs occur and the heap does not grow. Therefore, minor GCs continue to occur at a high frequency and throughput does not increase.

Therefore, allocating unnecessary objects can be much faster, ironically.

If my understanding is correct, minor GC grows the heap only the first few times, and no matter how many minor GCs occur after that, the heap will not grow. I feel there is room for improvement in this.

I am even unsure if VWA is really involved in this problem. But we have observed this problem with Hash objects and Ruby 3.3. It does not reproduce on Ruby 3.2. We couldn't reproduce this by reducing an object rather than Hash.

Note that this problem is not in a micro-benchmark. It actually occurs in RBS + Steep, a real-world macro benchmark.

Updated by peterzhu2118 (Peter Zhu) 4 months ago

I implemented @byroot's suggestion in this PR: https://github.com/ruby/ruby/pull/11562

It significantly improves the performance in your benchmark and brings it almost as fast as Ruby 3.2.

Updated by pocke (Masataka Kuwabara) 4 months ago

@peterzhu2118 (Peter Zhu) Thanks for your work! I've confirmed this PR improves the performance in my environment too.

Actions #6

Updated by peterzhu2118 (Peter Zhu) 4 months ago

  • Status changed from Open to Closed

Applied in changeset git|079ef92b5e59b616d670efe81a33e500f2bf6451.


Implement global allocatable slots and empty pages

[Bug #20710]

This commit introduces moves allocatable slots and empty pages from per
size pool to global. This allows size pools to grow globally and allows
empty pages to move between size pools.

For the benchmark in [Bug #20710], this signficantly improves performance:

Before:
    new_env      2.563 (± 0.0%) i/s -     26.000 in  10.226703s
    new_rails_env      0.293 (± 0.0%) i/s -      3.000 in  10.318960s

After:
    new_env      3.781 (±26.4%) i/s -     37.000 in  10.302374s
    new_rails_env      0.911 (± 0.0%) i/s -      9.000 in  10.049337s

In the headline benchmarks on yjit-bench, we see the performance is
basically on-par with before, with ruby-lsp being signficantly faster
and activerecord and erubi-rails being slightly slower:

--------------  -----------  ----------  -----------  ----------  --------------  -------------
bench           master (ms)  stddev (%)  branch (ms)  stddev (%)  branch 1st itr  master/branch
activerecord    452.2        0.3         479.4        0.4         0.96            0.94
chunky-png      1157.0       0.4         1172.8       0.1         0.99            0.99
erubi-rails     905.4        0.3         967.2        0.4         0.94            0.94
hexapdf         3566.6       0.6         3553.2       0.3         1.03            1.00
liquid-c        88.9         0.9         89.0         1.3         0.98            1.00
liquid-compile  93.4         0.9         89.9         3.5         1.01            1.04
liquid-render   224.1        0.7         227.1        0.5         1.00            0.99
lobsters        1052.0       3.5         1067.4       2.1         0.99            0.99
mail            197.1        0.4         196.5        0.5         0.98            1.00
psych-load      2960.3       0.1         2988.4       0.8         1.00            0.99
railsbench      2252.6       0.4         2255.9       0.5         0.99            1.00
rubocop         262.7        1.4         270.1        1.8         1.02            0.97
ruby-lsp        275.4        0.5         242.0        0.3         0.97            1.14
sequel          98.4         0.7         98.3         0.6         1.01            1.00
--------------  -----------  ----------  -----------  ----------  --------------  -------------
Actions

Also available in: Atom PDF

Like0
Like0Like0Like0Like0Like0Like0