Bug #21685: Unnecessary context-switching, especially bad on multi-core machines. - Ruby - Ruby Issue Tracking System

Actions

Copy link

Bug #21685

open

Unnecessary context-switching, especially bad on multi-core machines.

Bug #21685: Unnecessary context-switching, especially bad on multi-core machines.

Added by jpl-coconut (Jacob Lacouture) about 2 months ago. Updated 18 days ago.

Status:

Open

Assignee:

Target version:

ruby -v:

ruby 3.4.7 (2025-10-08 revision 7a5688e2a2) +PRISM [aarch64-linux]

Backport:

3.2: UNKNOWN, 3.3: UNKNOWN, 3.4: UNKNOWN

[ruby-core:123795]

Description

While debugging a performance issue in a large rails application, I wrote a minimal microbenchmark that reproduces the issue. [here] I was surprised to see that the benchmark takes ~3.6sec on a single-core machine, and ~36sec (10x slower) on a machine with 2 or more cores . Initially I thought this was a bug in the implementation of Thread::Queue, but soon realized it relates to how the ruby reschedules threads around system calls.

I prepared a fix in [this branch] which is based off ruby 3.4.7. I can apply the fix to a different branch or to master if that's helpful. The fix simply defers suspending the thread until the syscall has been running for some short interval. I chose 100usec initially, but this could easily be made configurable.

I pasted raw benchmark results below from a single run (though I did many runs and the results are stable). My CPU is an Apple M4.

After the fix:

Single-core performance improves by 55%, from 3.6sec to 2sec.
Adding cores causes performance to be flat (at 2sec), rather than getting 10x slower.
Multi-core context-switch count reduces by 99.995%, from 1.4 million to ~80
system_time/user_time ratio drops from (1.2 - 1.6) to 0.65

Here are the benchmark results before my change:

# time taskset --cpu-list 1 ./ruby qtest_simple.rb
voluntary_ctxt_switches:	1140773
nonvoluntary_ctxt_switches:	9487
real	0m3.619s
user	0m1.653s
sys	0m1.950s

# time taskset --cpu-list 1,2 ./ruby qtest_simple.rb
voluntary_ctxt_switches:	1400110
nonvoluntary_ctxt_switches:	3
real	0m36.223s
user	0m9.380s
sys	0m14.927s

And after:

# time taskset --cpu-list 1 ./ruby qtest_simple.rb
voluntary_ctxt_switches:	88
nonvoluntary_ctxt_switches:	899
real	0m2.031s
user	0m1.209s
sys	0m0.743s

# time taskset --cpu-list 1,2 ./ruby qtest_simple.rb
voluntary_ctxt_switches:	75
nonvoluntary_ctxt_switches:	8
real	0m2.062s
user	0m1.279s
sys	0m0.783s

I was concerned these results might still be reflective of a bug in Thread::Queue, so I also came up with a repro that doesn't rely on it. That one is [here].

Results summary:

Single-core performance improves (this time by only 30%)
Multi-core penalty drops from 4x to 0.
No change to context-switching rates.
system_time/user_time ratio drops from (0.5-1) to 0.15

Before fix:

# time taskset --cpu-list 1 ./ruby mbenchmark.rb
voluntary_ctxt_switches:	60
real	0m0.336s
user	0m0.211s
sys	0m0.118s

# time taskset --cpu-list 1,2 ./ruby mbenchmark.rb
voluntary_ctxt_switches:	60
real	0m1.424s
user	0m0.468s
sys	0m0.496s

After fix:

# time taskset --cpu-list 1 ./ruby mbenchmark.rb
voluntary_ctxt_switches:	59
real	0m0.241s
user	0m0.202s
sys	0m0.032s

# time taskset --cpu-list 1,2 ./ruby mbenchmark.rb
voluntary_ctxt_switches:	60
real	0m0.238s
user	0m0.195s
sys	0m0.035s

Related issues 1 (1 open — 0 closed)

Updated by byroot (Jean Boussier) about 2 months ago Actions
Copy link
#1 [ruby-core:123797]

The fix simply defers suspending the thread until the syscall has been running for some short interval.

That's an idea we discussed in the past with @jhawthorn (John Hawthorn) @tenderlovemaking (Aaron Patterson) and @luke-gru (Luke Gruber). IIRC that's something Go does?

Updated by jpl-coconut (Jacob Lacouture) about 1 month ago Actions
Copy link
#2 [ruby-core:123866]

How can I help move the discussion forward? Is it better if I create a PR and we discuss there?

Updated by jhawthorn (John Hawthorn) about 1 month ago Actions
Copy link
#3 [ruby-core:123867]

@jpl-coconut Please do! This seems like a really good demonstration of the issue and a good start on addressing it. The change is a lot smaller than I expected it to be ❤️. We would only make performance improvements like this to the master branch without a backport to 3.4 or older.

We have an existing "timer thread" which serves some similar functions (thread preemption, waking sleeping threads). Do you think this could be integrated with that?

There's some previous discussion in #20816

Updated by luke-gru (Luke Gruber) about 1 month ago · Edited Actions
Copy link
#4 [ruby-core:123874]

Thanks for taking a look at this and coming up with an implementation, this is great.

I haven't really played around with it much but I did read the code and I have a few thoughts:

There's 1 deferred wait thread per ractor, which isn't ideal but I understand why you did it like that. It does look like it would work well for programs that don't use ractors. I was envisioning a thread that would deal with all ractors at once and instead of sleeping for a fixed time, would loop over a registered list and sleep (cond_wait_timed) for a variable number of microseconds depending on the first thread's registration time.
The deferred wait thread gets joined at ractor free time which isn't great. It would be better (more predictable, more safe) to join the thread at ractor termination. Since we only want 1 of these threads though, it could have a lifecycle similar to the timer thread (created on startup and on fork).
deferred_wait_th_dummy is leaking :)
50 microseconds seems reasonable but I haven't played around with it.

Overall great job, and our team might look at this in the new year. I think it might be okay if we added this to ruby for only the main ractor, since most programs aren't using ractors right now and it does solve a real issue until we come up with a multi-ractor solution. However, it might be too late to get something like this into ruby 4.

If you want to work on it further or collaborate don't hesitate to reach out.

Thanks.

Updated by jpl-coconut (Jacob Lacouture) about 1 month ago 2Actions
Copy link
#5 [ruby-core:123882]

I created a PR with my original changes applied to master. It's here.

The proposal to use a single-thread per process rather than one per ractor involves some complexity, but I think it's simpler than reusing the timer-thread. I'll work on that and update the PR when ready.

Updated by ko1 (Koichi Sasada) 22 days ago · Edited Actions
Copy link
#6 [ruby-core:124084]

Actually MN threads does similar thing.

$ time RUBY_MN_THREADS=1 make run
./miniruby -I../../src/trunk/lib -I. -I.ext/common  -r./x86_64-linux-fake  ../../src/trunk/test.rb
voluntary_ctxt_switches:        56
nonvoluntary_ctxt_switches:     0

real    0m15.720s
user    0m6.858s
sys     0m10.661s

if the working threads are not main thread (main thread doesn't use MN thread).

Thread.new{ # need to run on child threads
  qstress(Queue)
}.join
puts File.read('/proc/self/status').scan(/\w*voluntary.*?$/)

while MN=0 is slow as in description:

$ time RUBY_MN_THREADS=0 make run
./miniruby -I../../src/trunk/lib -I. -I.ext/common  -r./x86_64-linux-fake  ../../src/trunk/test.rb
voluntary_ctxt_switches:        1373526
nonvoluntary_ctxt_switches:     3

real    0m49.729s
user    0m9.855s
sys     0m30.775s

$ time RUBY_MN_THREADS=1 taskset 1 make run
./miniruby -I../../src/trunk/lib -I. -I.ext/common  -r./x86_64-linux-fake  ../../src/trunk/test.rb
voluntary_ctxt_switches:        1
nonvoluntary_ctxt_switches:     3

real    0m5.617s
user    0m2.883s
sys     0m2.541s

But using single CPU is faster, and it may be improved...

Updated by ko1 (Koichi Sasada) 21 days ago Actions
Copy link
#7 [ruby-core:124112]

Previous machine is on WSL.

On another vanilla Linux machine
Ubuntu 24.04/Linux 6.8.0-87-generic/i7-13700H nproc:20:

                  all CPU                                1 CPU
MN=0                        real: 0m8.919s                        real:0m7.874s
         voluntary_ctxt_switches:  1399816     voluntary_ctxt_switches: 1657676
      nonvoluntary_ctxt_switches:       35  nonvoluntary_ctxt_switches:   48541

MN=1
                            real:0m10.118s                        real:0m6.915s
         voluntary_ctxt_switches:  1399058     voluntary_ctxt_switches: 1203349
      nonvoluntary_ctxt_switches:       16  nonvoluntary_ctxt_switches:   20080

MN=1/tasks run on child threads (qstress(Queue) is on a child thread)
                            real: 0m6.692s                        real:0m5.672s
         voluntary_ctxt_switches:        1     voluntary_ctxt_switches:       1
      nonvoluntary_ctxt_switches:        1  nonvoluntary_ctxt_switches:       3

There are difference but not significant on the description.
It seems depends on CPU/Linux versions (scheduler).

BTW single thread overhead is:

ko1@ruby-sp3:~/ruby/build/master$ time make run
./miniruby -I../../src/master/lib -I. -I.ext/common  -r./x86_64-linux-fake  ../../src/master/test.rb

real    0m3.436s
user    0m2.239s
sys     0m1.197s

with the following script:

def qstress_single qclass
  q = qclass.new
  messages = []

  # worker
  (0..1000000).each{|i|
    q.push i
    File.write('/dev/null', '0')

    # master
    m = q.pop
    messages << m
    if messages.count >= 5
      messages.clear
      File.write('/dev/null', '0')
    end

    case m
    when :end
      break
    end
  }
  q.push :end
end

qstress_single

Updated by jpl-coconut (Jacob Lacouture) 21 days ago Actions
Copy link
#8 [ruby-core:124113]

Interesting. I get similar results on both linux and macos, on both benchmarks. But I'm using ARM in both cases, not x86.

Updated by jhawthorn (John Hawthorn) 19 days ago Actions
Copy link
#9

Related to Bug #20816: Potential regression in Ruby 3.3.x (compared with 3.1 and 3.2) regarding fast syscalls and multi-threading. added

Updated by jhawthorn (John Hawthorn) 19 days ago Actions
Copy link
#10 [ruby-core:124163]

The benchmark from #20816 gets a little more pure results as it only deals with a fast syscall in a loop without Queue communication. I adjusted it slightly.

THREADS = (ARGV.first || 10).to_i  # choose thread count from CLI
N = 1_000_000 # to make the test longer or shorter
IMPL = ENV["RACTOR"] == "1" ? Ractor : Thread

puts "Ruby #{RUBY_VERSION} / YJIT: #{RubyVM::YJIT.enabled?} / #{THREADS} #{IMPL}s"
puts RUBY_DESCRIPTION

before = Process.clock_gettime(Process::CLOCK_MONOTONIC)
Array.new(THREADS).map do
  IMPL.new do
    (N/THREADS).times { File.mtime(__FILE__) }
  end
end.each(&:join)
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - before

puts "#{N} calls completed in #{elapsed.round(3)}s"

Running this in perf stat shows 1,001,669 context-switches, every time we syscall we make a context switch. With M:N threads 193,182 context-switches. With this branch 5,998 context-switches.

It's not totally clear to me why M:N threading helps.

Comparing this branch against current Ruby master, parameterized by 1 or 8 threads, Ractors vs Threads, and M:N scheduling shows the following

    (fastest)                     RUBY_MN_THREADS=0 RACTOR=1 ~/src/ruby/miniruby --disable-gems mtime.rb 8 ran
    1.00 ± 0.09 times faster than RUBY_MN_THREADS=1 RACTOR=1 ~/src/ruby/miniruby --disable-gems mtime.rb 8
    1.10 ± 0.08 times faster than RUBY_MN_THREADS=0 RACTOR=1 ruby --disable-gems mtime.rb 8
    1.12 ± 0.05 times faster than RUBY_MN_THREADS=0 RACTOR=0 ruby --disable-gems mtime.rb 1
    1.12 ± 0.07 times faster than RUBY_MN_THREADS=1 RACTOR=1 ruby --disable-gems mtime.rb 8
    1.13 ± 0.05 times faster than RUBY_MN_THREADS=1 RACTOR=0 ~/src/ruby/miniruby --disable-gems mtime.rb 1
    1.15 ± 0.05 times faster than RUBY_MN_THREADS=1 RACTOR=0 ~/src/ruby/miniruby --disable-gems mtime.rb 8
    1.15 ± 0.05 times faster than RUBY_MN_THREADS=0 RACTOR=0 ~/src/ruby/miniruby --disable-gems mtime.rb 1
    1.16 ± 0.05 times faster than RUBY_MN_THREADS=1 RACTOR=0 ruby --disable-gems mtime.rb 1
    1.17 ± 0.05 times faster than RUBY_MN_THREADS=0 RACTOR=0 ~/src/ruby/miniruby --disable-gems mtime.rb 8
    1.22 ± 0.06 times faster than RUBY_MN_THREADS=1 RACTOR=1 ~/src/ruby/miniruby --disable-gems mtime.rb 1
    1.22 ± 0.06 times faster than RUBY_MN_THREADS=0 RACTOR=1 ~/src/ruby/miniruby --disable-gems mtime.rb 1
    1.23 ± 0.06 times faster than RUBY_MN_THREADS=1 RACTOR=1 ruby --disable-gems mtime.rb 1
    1.24 ± 0.06 times faster than RUBY_MN_THREADS=0 RACTOR=1 ruby --disable-gems mtime.rb 1
    2.53 ± 0.29 times faster than RUBY_MN_THREADS=1 RACTOR=0 ruby --disable-gems mtime.rb 8
    4.23 ± 0.33 times faster than RUBY_MN_THREADS=0 RACTOR=0 ruby --disable-gems mtime.rb 8

Full results https://gist.github.com/jhawthorn/9945282384f005e0deeb2be889853727

These land in a couple groups. This branch with Ractors is clearly the fastest (as expected RUBY_MN_THREADS makes no difference)

Following that it's about 10-20% slower running the benchmark in any of single threaded, in 8 Ractors, or in 1 Ractor (one Ractor seems a little slower)

Then much below that it's 2x slower again to run 8 threads with RUBY_MN_THREADS=1, and 4x slower with RUBY_MN_THREADS=0

I think this technique looks very promising. There are many syscalls that 99% of the time return immediately but we end up being pessimistic and doing expensive synchronization and context switches. It's definitely too close to Ruby 4.0, but I really think we should explore this for 4.1

Updated by ko1 (Koichi Sasada) 19 days ago Actions
Copy link
#11 [ruby-core:124170]

https://github.com/ruby/ruby/pull/15529

It seems enough small, so I want to introduce it to 4.0.

Updated by luke-gru (Luke Gruber) 19 days ago Actions
Copy link
#12 [ruby-core:124171]

I'm with John that I think this is very promising but should not be introduced in ruby 4.0. It needs more testing to make sure it's working correctly and it needs more benchmarks for more realistic workloads. We can't do that in 2 weeks. Any non-minor changes to the scheduler right before release scares me.

Updated by byroot (Jean Boussier) 18 days ago 1Actions
Copy link
#13 [ruby-core:124174]

From: https://github.com/ruby/ruby/pull/15529

The timer thread checks the running threads to provide timeslice every 10ms and finds the thread (th) performing blocking operation.

Isn't 10ms a massive amount of time in that context? I suspect such threshold would ruin the throughput for threaded applications (typically Puma).

The original prototype had a 100usec interval, which seems more reasonable (but already a bit long?).

Actions

Copy link

Also available in: PDF Atom

Project

General

Profile

Ruby

Custom queries

Bug #21685

Unnecessary context-switching, especially bad on multi-core machines.

Updated by byroot (Jean Boussier) about 2 months ago Actions
Copy link
#1 [ruby-core:123797]

Updated by jpl-coconut (Jacob Lacouture) about 1 month ago Actions
Copy link
#2 [ruby-core:123866]

Updated by jhawthorn (John Hawthorn) about 1 month ago Actions
Copy link
#3 [ruby-core:123867]

Updated by luke-gru (Luke Gruber) about 1 month ago · Edited Actions
Copy link
#4 [ruby-core:123874]

Updated by jpl-coconut (Jacob Lacouture) about 1 month ago 2Actions
Copy link
#5 [ruby-core:123882]

Updated by ko1 (Koichi Sasada) 22 days ago · Edited Actions
Copy link
#6 [ruby-core:124084]

Updated by ko1 (Koichi Sasada) 21 days ago Actions
Copy link
#7 [ruby-core:124112]

Updated by jpl-coconut (Jacob Lacouture) 21 days ago Actions
Copy link
#8 [ruby-core:124113]

Updated by jhawthorn (John Hawthorn) 19 days ago Actions
Copy link
#9

Updated by jhawthorn (John Hawthorn) 19 days ago Actions
Copy link
#10 [ruby-core:124163]

Updated by ko1 (Koichi Sasada) 19 days ago Actions
Copy link
#11 [ruby-core:124170]

Updated by luke-gru (Luke Gruber) 19 days ago Actions
Copy link
#12 [ruby-core:124171]

Updated by byroot (Jean Boussier) 18 days ago 1Actions
Copy link
#13 [ruby-core:124174]

Project

General

Profile

Ruby

Custom queries

Bug #21685

Unnecessary context-switching, especially bad on multi-core machines.

Updated by byroot (Jean Boussier) about 2 months ago ActionsCopy link #1 [ruby-core:123797]

Updated by jpl-coconut (Jacob Lacouture) about 1 month ago ActionsCopy link #2 [ruby-core:123866]

Updated by jhawthorn (John Hawthorn) about 1 month ago ActionsCopy link #3 [ruby-core:123867]

Updated by luke-gru (Luke Gruber) about 1 month ago · Edited ActionsCopy link #4 [ruby-core:123874]

Updated by jpl-coconut (Jacob Lacouture) about 1 month ago 2ActionsCopy link #5 [ruby-core:123882]

Updated by ko1 (Koichi Sasada) 22 days ago · Edited ActionsCopy link #6 [ruby-core:124084]

Updated by ko1 (Koichi Sasada) 21 days ago ActionsCopy link #7 [ruby-core:124112]

Updated by jpl-coconut (Jacob Lacouture) 21 days ago ActionsCopy link #8 [ruby-core:124113]

Updated by jhawthorn (John Hawthorn) 19 days ago ActionsCopy link #9

Updated by jhawthorn (John Hawthorn) 19 days ago ActionsCopy link #10 [ruby-core:124163]

Updated by ko1 (Koichi Sasada) 19 days ago ActionsCopy link #11 [ruby-core:124170]

Updated by luke-gru (Luke Gruber) 19 days ago ActionsCopy link #12 [ruby-core:124171]

Updated by byroot (Jean Boussier) 18 days ago 1ActionsCopy link #13 [ruby-core:124174]

Updated by byroot (Jean Boussier) about 2 months ago Actions
Copy link
#1 [ruby-core:123797]

Updated by jpl-coconut (Jacob Lacouture) about 1 month ago Actions
Copy link
#2 [ruby-core:123866]

Updated by jhawthorn (John Hawthorn) about 1 month ago Actions
Copy link
#3 [ruby-core:123867]

Updated by luke-gru (Luke Gruber) about 1 month ago · Edited Actions
Copy link
#4 [ruby-core:123874]

Updated by jpl-coconut (Jacob Lacouture) about 1 month ago 2Actions
Copy link
#5 [ruby-core:123882]

Updated by ko1 (Koichi Sasada) 22 days ago · Edited Actions
Copy link
#6 [ruby-core:124084]

Updated by ko1 (Koichi Sasada) 21 days ago Actions
Copy link
#7 [ruby-core:124112]

Updated by jpl-coconut (Jacob Lacouture) 21 days ago Actions
Copy link
#8 [ruby-core:124113]

Updated by jhawthorn (John Hawthorn) 19 days ago Actions
Copy link
#9

Updated by jhawthorn (John Hawthorn) 19 days ago Actions
Copy link
#10 [ruby-core:124163]

Updated by ko1 (Koichi Sasada) 19 days ago Actions
Copy link
#11 [ruby-core:124170]

Updated by luke-gru (Luke Gruber) 19 days ago Actions
Copy link
#12 [ruby-core:124171]

Updated by byroot (Jean Boussier) 18 days ago 1Actions
Copy link
#13 [ruby-core:124174]